Use case
- To develop Spark programs inside Jupyter, this post records the process of configuring a Spark development environment in Jupyter.
- Many existing blog posts were consulted but did not yield a working Jupyter + Spark setup, hence this write-up.
Prerequisites
- Spark download
  - spark-2.3.0-bin-hadoop2.7.tgz
- Apache Toree
  - "Apache Toree has one main goal: provide the foundation for interactive applications to connect and use Apache Spark."
  - Download links
    - apache/incubator-toree GitHub source code
    - incubator/toree tar package
- Notes:
  - Scala and Hadoop are not installed on this system.
[root@localhost bin]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
Installation
- Online install
- Make sure the Anaconda bin directory is on the Linux PATH (add it if it is not), or change into the Anaconda bin directory and run pip from there.

# your-spark-home : path to the Spark installation directory
pip install toree
jupyter toree install --spark_home=your-spark-home
- Offline install
- Either the GitHub source code or the tar package can be used for an offline install.
- Installing from source:

/root/anaconda2/bin/python setup.py install
jupyter toree install --spark_home=your-spark-home
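After either install path, the kernel registration can be checked with Jupyter's own tooling (assuming `jupyter` is on the PATH):

```shell
# List the kernels Jupyter knows about; a Toree entry should appear
jupyter kernelspec list
```

If no Toree kernel is listed, re-run `jupyter toree install` with the correct `--spark_home`.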
Test code
- Use the following snippet to verify that the environment works.
import org.apache.spark.sql.SparkSession

object sparkSqlDemo {
  val sparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("spark session example")
    .getOrCreate()

  def main(args: Array[String]) {
    val input = sparkSession.read.json("cars1.json")
    input.createOrReplaceTempView("Cars1")
    val result = sparkSession.sql("select * from Cars1")
    result.show()
  }
}

sparkSqlDemo.main(Array())  // invoke main to run the demo
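The demo reads a cars1.json file from the working directory. Spark's JSON reader expects one JSON object per line (JSON Lines); a minimal example file could look like the following, with field names made up purely for illustration:

```json
{"name": "car1", "speed": 120}
{"name": "car2", "speed": 150}
```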
- Execution result
Extension: installing multiple kernels
- Options
--interpreters=<Unicode> (ToreeInstall.interpreters)
Default: 'Scala'
A comma separated list of the interpreters to install. The names of the
interpreters are case sensitive.
jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL
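After this command, each interpreter becomes its own Jupyter kernel. With Toree the registered kernel names are typically of the form `apache_toree_scala`, `apache_toree_pyspark`, and so on, though the exact names may vary by Toree version:

```shell
# Verify that one kernel was registered per interpreter
jupyter kernelspec list
# Output may include entries such as:
#   apache_toree_scala     .../jupyter/kernels/apache_toree_scala
#   apache_toree_pyspark   .../jupyter/kernels/apache_toree_pyspark
```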
References
- Configuring a PySpark development environment in Jupyter Notebook via Toree, and how it works (in Chinese). Recommended.
- Installing Jupyter Notebook for PySpark and Scala Spark (in Chinese). Recommended.
- Apache Spark in Python: Beginner's Guide. Recommended.
- toree-0.2.0.dev1.tar.gz
- Apache Toree
- hadoop-common-2.2.0-bin
FAQs
- How do I visualize data?
- "Only one SparkContext may be running in this JVM" error
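On the SparkContext FAQ: a Toree notebook already starts a Spark context, so constructing a second one raises this error. A sketch of the usual workarounds in Scala, assuming the kernel exposes the standard `sc` binding:

```scala
// Reuse the session that already exists instead of building a new context:
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// Or, if a fresh context is really needed, stop the running one first:
// sc.stop()
```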
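On the visualization FAQ: the PySpark kernel runs ordinary Python, so a common approach is to collect a small Spark result into pandas with `toPandas()` and plot it locally. A minimal sketch, using a plain pandas DataFrame to stand in for the collected Spark result (column names are hypothetical):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; also works outside a notebook

# In a PySpark notebook this would come from: pdf = spark_df.toPandas()
pdf = pd.DataFrame({"model": ["car1", "car2", "car3"],
                    "speed": [120, 150, 135]})

ax = pdf.plot.bar(x="model", y="speed")  # pandas delegates to matplotlib
ax.figure.savefig("speeds.png")
```

Only collect small results this way: `toPandas()` pulls all rows to the driver.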