fast overview of pyspark
install Spark on macOS and set up JAVA_HOME
# if you have a JDK under /Library/Java/JavaVirtualMachines,
# set JAVA_HOME like I do below, replacing the JDK folder name with your version
# add the following to your ~/.zshrc file
export OPENJDK_JAVA_17_HOME="/Library/Java/JavaVirtualMachines/amazon-corretto-17.jdk/Contents/Home"
export JAVA_HOME=$OPENJDK_JAVA_17_HOME
alias openjdk17='export JAVA_HOME=$OPENJDK_JAVA_17_HOME'
export PATH="/usr/local/bin:/usr/local/sbin:$PATH"
############################# divider
# install spark (the Homebrew formula is apache-spark)
brew install apache-spark
# test the install by typing pyspark
pyspark # this drops you into the PySpark shell; it works for me
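as a quick sanity check outside the shell, here is a minimal standalone sketch (the file name sanity_check.py and the local[*] master are my own choices, not part of the setup above); run it with spark-submit sanity_check.py, or plain python if the pyspark package is on your PYTHONPATH
# sanity_check.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()  # local mode, all cores
print(spark.version)  # the installed Spark version
print(spark.sparkContext.parallelize(range(5)).sum())  # 10 -> a job actually ran
spark.stop()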
the main concepts of Spark
- the RDD (Resilient Distributed Dataset)
- the operations on an RDD come in two kinds, transformations and actions: a transformation such as filter or map returns another RDD (the same kind of data), while an action does not
# in the pyspark shell
input = sc.textFile("RM.md") # read this `RM.md` file (the one you are looking at) from the current folder
warn = input.filter(lambda x: "jdk" in x)
error = input.filter(lambda x: "spark" in x)
bad = warn.union(error)
# functions like filter and union above are transformations; they don't change the type of warn, error, or bad, which are all RDDs
bad.take(1)
bad.count()
# while these two are actions: they change the type of the result to a plain Python value (a list of lines, an int)
# notice that transformations are lazy: nothing is computed until an action asks for the result
# and most of the RDD functions we use take a function as a parameter,
# so you can pass a lambda for that parameter
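to see the laziness in action, a tiny sketch assuming the same sc from the pyspark shell above (the numbers are just made up)
squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)  # transformation: returns instantly, nothing computed yet
squares.collect()  # action: now the job runs and returns [1, 4, 9, 16]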
the main RDD operation functions (a small sketch follows this list)
- map() takes a function and applies it to every element of the RDD (the result is whatever that function returns), just like Python's map(); usually paired with collect()
- filter() takes a boolean function, as you already know
- union() takes another RDD of the same type and returns an RDD of that type
- intersection() takes an RDD of the same type and returns the same type
- subtract() does what you would expect from the name
- reduce() takes a function that combines two elements at a time and returns a single value, e.g. summing them all up
- take(n) returns the first n elements of the RDD as a Python list
- top(n) behaves like take(n) but returns the n largest elements; n is an int
- count() returns the number of elements in the RDD
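a small sketch tying the functions above together, again assuming the pyspark shell so sc already exists (the two toy RDDs are my own made-up data)
nums = sc.parallelize([1, 2, 3, 4])
other = sc.parallelize([3, 4, 5, 6])
nums.map(lambda x: x * 2).collect()          # [2, 4, 6, 8]
nums.filter(lambda x: x % 2 == 0).collect()  # [2, 4]
nums.union(other).collect()                  # [1, 2, 3, 4, 3, 4, 5, 6] (union keeps duplicates)
nums.intersection(other).collect()           # [3, 4] (order may vary)
nums.subtract(other).collect()               # [1, 2] (order may vary)
nums.reduce(lambda a, b: a + b)              # 10
nums.take(2)                                 # [1, 2]
nums.top(2)                                  # [4, 3] (largest first)
nums.count()                                 # 4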