fast overview of pyspark
install Spark on macOS and set up JAVA_HOME
# if you have a JDK under /Library/Java/JavaVirtualMachines,
# set JAVA_HOME like I do below, replacing the JDK folder name with your version
# add the following to your ~/.zshrc file
export OPENJDK_JAVA_17_HOME="/Library/Java/JavaVirtualMachines/amazon-corretto-17.jdk/Contents/Home"
export JAVA_HOME=$OPENJDK_JAVA_17_HOME
alias openjdk17='export JAVA_HOME=$OPENJDK_JAVA_17_HOME'
export PATH="/usr/local/bin:/usr/local/sbin:$PATH"
############################# divider
# install spark (the Homebrew formula is apache-spark)
brew install apache-spark
# test the install by typing pyspark
pyspark # this drops you into the PySpark shell; it works for me
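as a quick sanity check outside the shell, here is a minimal standalone sketch (the file name sanity_check.py and the local[*] master are my own choices, not part of the setup above); run it with spark-submit sanity_check.py, or plain python if the pyspark package is on your PYTHONPATH
# sanity_check.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()  # local mode, all cores
print(spark.version)  # the installed Spark version
print(spark.sparkContext.parallelize(range(5)).sum())  # 10 -> a job actually ran
spark.stop()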
the main concepts of Spark
- the RDD (Resilient Distributed Dataset)
- the operations on an RDD come in two kinds, transformations and actions: a transformation such as filter or map returns another RDD (the same kind of data), while an action does not
# in the pyspark shell
input = sc.textFile("RM.md") # read this `RM.md` file (the one you are looking at) from the current folder
warn = input.filter(lambda x: "jdk" in x)
error = input.filter(lambda x: "spark" in x)
bad = warn.union(error)
# functions like filter and union above are transformations; they don't change the type of warn, error, or bad, which are all RDDs
bad.take(1)
bad.count()
# while these two are actions: they change the type of the result to a plain Python value (a list of lines, an int)
# notice that transformations are lazy: nothing is computed until an action asks for the result
# and most of the RDD functions we use take a function as a parameter,
# so you can pass a lambda for that parameter
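to see the laziness in action, a tiny sketch assuming the same sc from the pyspark shell above (the numbers are just made up)
squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)  # transformation: returns instantly, nothing computed yet
squares.collect()  # action: now the job runs and returns [1, 4, 9, 16]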
the main RDD operation functions (a small sketch follows this list)
- map() takes a function and applies it to every element of the RDD (the result is whatever that function returns), just like Python's map(); usually paired with collect()
- filter() takes a boolean function, as you already know
- union() takes another RDD of the same type and returns an RDD of that type
- intersection() takes an RDD of the same type and returns the same type
- subtract() does what you would expect from the name
- reduce() takes a function that combines two elements at a time and returns a single value, e.g. summing them all up
- take(n) returns the first n elements of the RDD as a Python list
- top(n) behaves like take(n) but returns the n largest elements; n is an int
- count() returns the number of elements in the RDD
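a small sketch tying the functions above together, again assuming the pyspark shell so sc already exists (the two toy RDDs are my own made-up data)
nums = sc.parallelize([1, 2, 3, 4])
other = sc.parallelize([3, 4, 5, 6])
nums.map(lambda x: x * 2).collect()          # [2, 4, 6, 8]
nums.filter(lambda x: x % 2 == 0).collect()  # [2, 4]
nums.union(other).collect()                  # [1, 2, 3, 4, 3, 4, 5, 6] (union keeps duplicates)
nums.intersection(other).collect()           # [3, 4] (order may vary)
nums.subtract(other).collect()               # [1, 2] (order may vary)
nums.reduce(lambda a, b: a + b)              # 10
nums.take(2)                                 # [1, 2]
nums.top(2)                                  # [4, 3] (largest first)
nums.count()                                 # 4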