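I: What Is a Set

A set is an unordered collection of unique elements.

Unlike lists and other sequences, a set has no element order: you cannot index into it, and the order in which its elements are printed or iterated is arbitrary and may differ from one run of the program to the next. In addition, the elements of a set are strictly distinct, so each value appears in the set at most once.

Because of this, a set is a good choice for deduplication, that is, whenever the number of times a value occurs does not affect the result. A dictionary can also deduplicate, since its keys are unique, and it can keep a count for each value as well.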
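
The contrast between the two approaches can be sketched as follows. The word list is hypothetical sample data, and collections.Counter (a dict subclass from the standard library) is used here as one convenient way to count; a plain dict would work too.

```python
from collections import Counter

# Hypothetical sample data, for illustration only.
words = ["apple", "cat", "apple", "human", "cat", "apple"]

# A set keeps one copy of each value; the original order is not preserved.
unique_words = set(words)

# A Counter stores each distinct value once as a key, plus how often it occurred.
word_counts = Counter(words)

print(sorted(unique_words))   # sorted() gives a stable order for display
print(word_counts["apple"])   # number of times "apple" appeared
```

If only membership matters, the set is enough; once the counts matter, the dictionary-based version is the one to reach for.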

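II: Set Operations

1: Creating a Set

Tuples are created with parentheses, lists with square brackets, and dictionaries with curly braces. What about sets?

Sets are also created with curly braces. The interpreter tells the two apart by their contents: key: value pairs produce a dictionary, while bare elements produce a set. The one exception is an empty pair of braces, {}, which always creates a dictionary; an empty set is written set().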

s1 = {'apple', 'cat', 'human', 123456}

print("s1 = ", s1)

# Sample output (the element order may differ on your machine):
# s1 =  {123456, 'apple', 'human', 'cat'}
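
As a small sketch of how Python decides between a set and a dictionary, the snippet below checks the type of a few literals. The variable names are made up for illustration.

```python
empty_braces = {}                          # empty braces: this is a dict, not a set
empty_set = set()                          # the way to create an empty set
from_list = set(['cat', 'cat', 'apple'])   # build a set from a list; duplicates are dropped
pairs = {'apple': 1}                       # key: value pairs make a dict
elements = {'apple', 1}                    # bare elements make a set

for value in (empty_braces, empty_set, from_list, pairs, elements):
    print(type(value).__name__, value)
```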

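2: Set Arithmetic

Set arithmetic covers intersection, union, difference (relative complement), symmetric difference, and subset tests.

In Python, each of these operations has both an operator and an equivalent method, as shown here: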

s1 = {'apple', 'cat', 'human', 123456}
s2 = {'apple', 'cat', 'cat', 888888}  # the duplicate 'cat' is stored only once

# Elements in s1 but not in s2 (difference)
print("Elements in s1 but not in s2:", s1 - s2)
print("Elements in s1 but not in s2:", s1.difference(s2))

# Elements in s2 but not in s1 (difference)
print("Elements in s2 but not in s1:", s2 - s1)
print("Elements in s2 but not in s1:", s2.difference(s1))

# Union of s1 and s2
print("All elements in s1 or s2:", s1 | s2)
print("All elements in s1 or s2:", s1.union(s2))

# Intersection of s1 and s2
print("Elements in both s1 and s2:", s1 & s2)
print("Elements in both s1 and s2:", s1.intersection(s2))

# Symmetric difference of s1 and s2
print("Elements in exactly one of s1 and s2:", s1 ^ s2)
print("Elements in exactly one of s1 and s2:", s1.symmetric_difference(s2))

# Check whether s1 and s2 share any element
print("Do s1 and s2 have an intersection?", not s1.isdisjoint(s2))

# Check whether s1 is a subset of s2
print("Is s1 a subset of s2?", s1.issubset(s2))
s3 = {888888, 'cat'}
print("Is s3 a subset of s2?", s3.issubset(s2))
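
The operators and the methods are not fully interchangeable. As a minimal sketch (the list named extra is a made-up example), the methods accept any iterable, while the operators require both operands to be sets. The comparison operators <= and < can also stand in for the subset test.

```python
s1 = {'apple', 'cat', 'human', 123456}
extra = ['cat', 888888]   # a list, not a set

# Methods accept any iterable as their argument.
merged = s1.union(extra)
common = s1.intersection(extra)

# Operators require both operands to be sets, so this raises TypeError.
try:
    s1 | extra
    operator_failed = False
except TypeError:
    operator_failed = True

# Comparison operators mirror the subset tests.
is_subset = {'cat'} <= s1      # same as {'cat'}.issubset(s1)
is_proper_subset = s1 < s1     # False: a set is not a proper subset of itself

print(merged, common, operator_failed, is_subset, is_proper_subset)
```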

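3: Element Operations

# Add elements
s1.add('element')  # add a single element
print("New set s1:", s1)

s2.update(['element', 'list'])  # add every item from an iterable
print("New set s2:", s2)

# Remove elements
s1.remove('element')  # raises KeyError if the element is missing
print("New set s1:", s1)

s2.discard(123456)  # 123456 is not in s2, so nothing happens and no error is raised
print("New set s2:", s2)

s1.pop()  # removes and returns an arbitrary element
print("New set s1:", s1)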

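Note that remove raises a KeyError if the element is not in the set, whereas discard does nothing in that case. pop removes and returns an arbitrary element: a set has no defined order, so there is no "leftmost" element and you cannot predict which one will be removed. Calling pop on an empty set also raises a KeyError.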
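
The three removal methods can be compared side by side. This is a small sketch on a throwaway set named s; the flag variables exist only to record which calls raised an error.

```python
s = {'apple', 'cat'}

s.discard('missing')        # absent element: nothing happens

try:
    s.remove('missing')     # absent element: KeyError
    remove_raised = False
except KeyError:
    remove_raised = True

removed = s.pop()           # removes and returns an arbitrary element
remaining = len(s)          # one element is left

s.pop()                     # the set is now empty
try:
    s.pop()                 # popping from an empty set: KeyError
    pop_raised = False
except KeyError:
    pop_raised = True

print(remove_raised, removed, remaining, pop_raised)
```

A common rule of thumb is to use discard when a missing element is acceptable and remove when it would indicate a bug.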

4: Whole-Set Operations

A whole set can be copied or cleared in a single call.

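s1 = {'apple', 'cat', 'human', 123456}
s2 = {'apple', 'cat', 'cat', 888888}
# len(s2) is 3, not 4, because the duplicate 'cat' is stored only once
print("Length of s1:", len(s1), "\nLength of s2:", len(s2))

# Copy the set (s3 is a new, independent set)
s3 = s2.copy()

# Clear the set (removes every element in place)
s2.clear()
print("New set s2:", s2)  # prints set()
print("Copy s3:", s3)     # the copy still holds the original elements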
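
To show why copy matters, the sketch below compares a copy with a plain assignment. The name alias is introduced here for illustration; assignment only creates a second name for the same set, so clearing the original empties the alias too, while the copy is unaffected.

```python
s2 = {'apple', 'cat', 888888}

alias = s2        # same object under a second name
s3 = s2.copy()    # new set with the same elements

s2.clear()        # empties the set in place

print("alias:", alias)   # prints set(), because alias and s2 are the same object
print("copy:", s3)       # still contains the original three elements
```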