Use case
- To develop Spark programs inside Jupyter, this post records the process of configuring a Spark development environment in Jupyter.
- Many existing blog posts were consulted but did not yield a working Jupyter + Spark setup, hence this write-up.
Prerequisites
- Spark download
  - spark-2.3.0-bin-hadoop2.7.tgz
- Apache Toree
  - "Apache Toree has one main goal: provide the foundation for interactive applications to connect and use Apache Spark."
  - Download links
    - apache/incubator-toree GitHub source code
    - incubator/toree tar package
- Notes:
  - Scala and Hadoop are not installed on this system.
[root@localhost bin]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
Installation
- Online install
- Make sure the Anaconda bin directory is on the Linux PATH (add it if it is not), or change into the Anaconda bin directory and run pip from there.

# your-spark-home : path to the Spark installation directory
pip install toree
jupyter toree install --spark_home=your-spark-home
- Offline install
- Either the GitHub source code or the tar package can be used for an offline install.
- Installing from source:

/root/anaconda2/bin/python setup.py install
jupyter toree install --spark_home=your-spark-home
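After either install path, the kernel registration can be checked with Jupyter's own tooling (assuming `jupyter` is on the PATH):

```shell
# List the kernels Jupyter knows about; a Toree entry should appear
jupyter kernelspec list
```

If no Toree kernel is listed, re-run `jupyter toree install` with the correct `--spark_home`.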
Test code
- Use the following snippet to verify that the environment works.
import org.apache.spark.sql.SparkSession

object sparkSqlDemo {
  val sparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("spark session example")
    .getOrCreate()

  def main(args: Array[String]) {
    val input = sparkSession.read.json("cars1.json")
    input.createOrReplaceTempView("Cars1")
    val result = sparkSession.sql("select * from Cars1")
    result.show()
  }
}

sparkSqlDemo.main(Array())  // invoke main to run the demo
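The demo reads a cars1.json file from the working directory. Spark's JSON reader expects one JSON object per line (JSON Lines); a minimal example file could look like the following, with field names made up purely for illustration:

```json
{"name": "car1", "speed": 120}
{"name": "car2", "speed": 150}
```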
- Execution result
Extension: installing multiple kernels
- Options
--interpreters=<Unicode> (ToreeInstall.interpreters)
Default: 'Scala'
A comma separated list of the interpreters to install. The names of the
interpreters are case sensitive.
jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL
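After this command, each interpreter becomes its own Jupyter kernel. With Toree the registered kernel names are typically of the form `apache_toree_scala`, `apache_toree_pyspark`, and so on, though the exact names may vary by Toree version:

```shell
# Verify that one kernel was registered per interpreter
jupyter kernelspec list
# Output may include entries such as:
#   apache_toree_scala     .../jupyter/kernels/apache_toree_scala
#   apache_toree_pyspark   .../jupyter/kernels/apache_toree_pyspark
```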
References
- Configuring a PySpark development environment in Jupyter Notebook via Toree, and how it works (in Chinese). Recommended.
- Installing Jupyter Notebook for PySpark and Scala Spark (in Chinese). Recommended.
- Apache Spark in Python: Beginner's Guide. Recommended.
- toree-0.2.0.dev1.tar.gz
- Apache Toree
- hadoop-common-2.2.0-bin
FAQs
- How do I visualize data?
- "Only one SparkContext may be running in this JVM" error
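On the SparkContext FAQ: a Toree notebook already starts a Spark context, so constructing a second one raises this error. A sketch of the usual workarounds in Scala, assuming the kernel exposes the standard `sc` binding:

```scala
// Reuse the session that already exists instead of building a new context:
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// Or, if a fresh context is really needed, stop the running one first:
// sc.stop()
```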
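On the visualization FAQ: the PySpark kernel runs ordinary Python, so a common approach is to collect a small Spark result into pandas with `toPandas()` and plot it locally. A minimal sketch, using a plain pandas DataFrame to stand in for the collected Spark result (column names are hypothetical):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; also works outside a notebook

# In a PySpark notebook this would come from: pdf = spark_df.toPandas()
pdf = pd.DataFrame({"model": ["car1", "car2", "car3"],
                    "speed": [120, 150, 135]})

ax = pdf.plot.bar(x="model", y="speed")  # pandas delegates to matplotlib
ax.figure.savefig("speeds.png")
```

Only collect small results this way: `toPandas()` pulls all rows to the driver.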