[Spark Basics] -- The Official Explanation of the Spark RDD collect Operation


The official text reads as follows:

Printing elements of an RDD

Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD’s elements. However, in cluster mode, the output to stdout being called by the executors is now writing to the executor’s stdout instead, not the one on the driver, so stdout on the driver won’t show these! To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
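To make the difference concrete, here is a minimal sketch of the take() approach. It is only an illustration under a few assumptions: the object name PrintRddSample, the helper sample, the local[2] master, and the 1-to-1,000,000 test data are all made up for this example and do not come from the official text.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PrintRddSample {
  // Hypothetical helper: brings at most n elements back to the driver.
  def sample[T](rdd: RDD[T], n: Int): Array[T] = rdd.take(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PrintRddSample")
      .master("local[2]") // assumption: local run, for illustration only
      .getOrCreate()

    // Illustrative data set; any RDD would do.
    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // take() has already returned a small local Array, so this println
    // runs on the driver and the output shows up in the driver's stdout.
    sample(rdd, 10).foreach(println)

    // By contrast, rdd.collect().foreach(println) would first copy all
    // 1,000,000 elements to the driver before printing anything.

    spark.stop()
  }
}
```

Note that in local mode the driver and executors share one console, so rdd.foreach(println) would also appear to work there; the difference described above only becomes visible on a real cluster.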

 

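The gist:

When printing the elements of an RDD (resilient distributed dataset), take care not to run the driver out of memory!

Prefer take(), as in rdd.take(100).foreach(println), over rdd.collect().foreach(println).

The latter can exhaust the driver's memory, because collect() pulls the entire RDD onto a single machine!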

 

 

 
