fast overview of pyspark

凶猛的小白兔 · 2023-05-07


install Spark on macOS and set up JAVA_HOME

# if you have a JDK under /Library/Java/JavaVirtualMachines,
# set JAVA_HOME like below, replacing the jdk folder with your version
# add to your ~/.zshrc file:
export OPENJDK_JAVA_17_HOME="/Library/Java/JavaVirtualMachines/amazon-corretto-17.jdk/Contents/Home"
export JAVA_HOME=$OPENJDK_JAVA_17_HOME
alias openjdk17='export JAVA_HOME=$OPENJDK_JAVA_17_HOME'
export PATH="/usr/local/bin:/usr/local/sbin:$PATH"
############################# divider
# install Spark (the Homebrew formula is apache-spark, not spark)
brew install apache-spark
# test the install by typing pyspark
pyspark # works for me
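
Once the shell is up, a quick sanity check; this is a minimal sketch, relying only on sc, the SparkContext that the pyspark shell predefines:

# inside the pyspark shell, sc (a SparkContext) already exists
print(sc.version)  # the Spark version Homebrew installed
print(sc.master)   # execution mode, e.g. local[*]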

the main concepts of Spark

  • the RDD (Resilient Distributed Dataset), Spark's core data abstraction
  • the operations on an RDD: transformations and actions; a transformation is something like filter or map and returns the same type of data (another RDD), but an action does not
# in the pyspark shell
input = sc.textFile("RM.md") # read RM.md (the file you are reading now) from the current folder
warn = input.filter(lambda x: "jdk" in x)
error = input.filter(lambda x: "spark" in x)
bad = warn.union(error)
# functions like filter and union above are transformations: they do not
# change the datatype, so warn, error, and bad are all still RDDs

bad.take(1)
bad.count()
# these two are actions: they leave RDD-land and return plain Python
# values (a list of str, an int)

# note that transformations are lazy: nothing is computed until an action
# asks for the result
# most RDD operations take a function as a parameter, so you usually
# pass in a lambda
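
To see the laziness for yourself, here is a minimal sketch (again assuming the pyspark shell, where sc already exists):

nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x) # transformation: nothing runs yet
print(squares.collect())            # action: triggers the work, prints [1, 4, 9, 16]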


main RDD operation functions (a combined example follows the list)

  • map() takes a function and applies it to every element of an RDD, returning an RDD of the results, just like Python's map(); usually paired with collect()

  • filter() takes a boolean function, as you already know

  • union() takes another RDD of the same datatype and returns an RDD of that type

  • intersection() likewise takes an RDD of the same type and returns one

  • subtract() does what the name suggests: removes the other RDD's elements

  • reduce() takes a two-argument function and folds the RDD down to a single value, e.g. summing the elements up

  • take(n) returns the first n elements of the RDD, placed into a list

  • top(n) behaves like take(n) but returns the n largest elements; n is an int

  • count() returns the number of elements in the RDD
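
Here is the combined example promised above, a minimal sketch run in the pyspark shell (sc already defined; the numbers are made up for illustration, and element order of set-like results may vary across partitions):

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

a.map(lambda x: x * 10).collect()        # [10, 20, 30, 40]
a.filter(lambda x: x % 2 == 0).collect() # [2, 4]
a.union(b).collect()        # [1, 2, 3, 4, 3, 4, 5] (keeps duplicates)
a.intersection(b).collect() # [3, 4]
a.subtract(b).collect()     # [1, 2]
a.reduce(lambda x, y: x + y) # 10, i.e. the sum
a.take(2)   # [1, 2], the first two elements
a.top(2)    # [4, 3], the two largest
a.count()   # 4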

persistence: keeping an RDD around between actions, in memory or on disk, or writing it out to a filesystem, so it is not recomputed every time (done, now you can work with pyspark)
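
A minimal persistence sketch, reusing the bad RDD from the earlier snippet; persist(), StorageLevel, saveAsTextFile(), and unpersist() are standard pyspark, while the output path is a hypothetical name:

from pyspark import StorageLevel

bad.persist(StorageLevel.MEMORY_AND_DISK) # cache, spilling to disk if it does not fit in memory
bad.count()  # first action computes the RDD and caches it
bad.take(1)  # later actions reuse the cached partitions
bad.saveAsTextFile("bad-lines")  # hypothetical output dir: writes one file per partition
bad.unpersist()  # free the cache when done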
