Application Scenarios
- Using Python in Jupyter for data analysis has become a mainstream trend.
- How to bring Spark into Jupyter for big-data cleaning and mining is a question worth investigating.
- There are many candidate approaches; for a variety of practical reasons, you ultimately have to find the one that fits your own environment.
Implementation Options
- Option 1
- Use Apache Toree to bring Spark into Jupyter, providing Scala, PySpark, SparkR, and SQL kernels.
- Kernel installation command
-
jupyter toree install --interpreters=PySpark
- Does not support magic commands, is quite limited in use, and starts slowly. Not recommended (though worth knowing about).
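- For reference, a minimal sketch of installing Toree itself before registering the kernels; this assumes the toree package is available from PyPI, and the Spark path is a placeholder to adapt to your install:
pip install toree
jupyter toree install --spark_home=/usr/local/spark --interpreters=Scala,PySpark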
- Option 2
- findspark (the whl package is recommended)
- spark-2.3.0-bin-hadoop2.7.tgz
- Configure the SPARK_HOME environment variable in ~/.bashrc
- Recommended: fast startup, magics support, and convenient to use.
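- For background, findspark simply adds Spark's Python libraries to sys.path so pyspark becomes importable without a pip install; a minimal sketch (the path is a placeholder, and init() falls back to the SPARK_HOME variable when called with no argument; the full setup follows below):
import findspark
findspark.init('/opt/spark-2.3.0-bin-hadoop2.7')  # placeholder path
import pyspark  # now importable, even though pyspark itself was never pip-installed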
Setup Process
- Step 1: Download the required packages
- findspark-1.1.0-py2.py3-none-any.whl
- spark-2.3.0-bin-hadoop2.7.tgz
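- If you need to stage these packages for an offline machine, both are publicly downloadable; a sketch assuming the standard Apache archive and PyPI sources:
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
pip download findspark==1.1.0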
- Step 2: Install and configure
- Install findspark
-
/root/anaconda2/bin/pip install findspark-1.1.0-py2.py3-none-any.whl
- Extract spark-2.3.0-bin-hadoop2.7.tgz
-
tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz -C your-spark-home
- Configure SPARK_HOME in ~/.bashrc
export SPARK_HOME=your-spark-home
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
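- After editing ~/.bashrc, reload it; note that with the two PYSPARK_DRIVER_* variables above, the pyspark launcher itself opens a Jupyter Notebook server:
source ~/.bashrc   # reload the environment variables
pyspark            # with the settings above, this opens Jupyter Notebook as the driver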
- Step 3: Testing
- Test data in JSON format (one record per line, the layout Spark's JSON reader expects by default)
[{"itemNo" : 1, "name" : "ferrari", "speed" : 259, "weight": 800}, {"itemNo" : 2, "name" : "jaguar", "speed" : 274, "weight":998}, {"itemNo" : 3, "name" : "mercedes", "speed" : 340, "weight": 1800}, {"itemNo" : 4, "name" : "audi", "speed" : 345, "weight": 875}, {"itemNo" : 5, "name" : "lamborghini", "speed" : 355, "weight": 1490},{"itemNo" : 6, "name" : "chevrolet", "speed" : 260, "weight": 900}, {"itemNo" : 7, "name" : "ford", "speed" : 250, "weight": 1061}, {"itemNo" : 8, "name" : "porche", "speed" : 320, "weight": 1490}, {"itemNo" : 9, "name" : "bmw", "speed" : 325, "weight": 1190}, {"itemNo" : 10, "name" : "mercedes-benz", "speed" : 312, "weight": 1567}]
- Test code
import findspark
findspark.init()  # locate Spark via SPARK_HOME and make pyspark importable

# enable inline plotting in the notebook
%matplotlib inline

from pyspark.sql import SparkSession

# create (or reuse) the SparkSession, the DataFrame/SQL entry point in Spark 2.x
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read the JSON test data; each line must hold one self-contained record
df = spark.read.json('./cars_datas.json')
filtered = df[['speed']]  # project down to the speed column
filtered.show()
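- As a follow-up sketch (an assumption about intent, since %matplotlib inline is enabled but otherwise unused above), the small result set can be pulled into pandas and plotted inline:
pdf = filtered.toPandas()      # collect the result to the driver as a pandas DataFrame
pdf['speed'].plot(kind='bar')  # render an inline bar chart of the speeds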
- Test results
Other Commands
- List the installed kernels
-
jupyter kernelspec list
- Uninstall a Jupyter kernel
-
sudo jupyter kernelspec uninstall your-kernel-name
References
- Jupyter Spark environment setup (covers both online and offline installs). Comparison article.
- Apache Spark in Python: Beginner's Guide. Recommended.
- Jupyter pyspark: no module named pyspark.
- How to use SparkSession in Apache Spark 2.0. Recommended.
- Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. Recommended.
- Spark-Tutorial provides a quick introduction to using Spark. It demonstrates the basic functionality of the RDD and DataFrame APIs. GitHub.
- Welcome to Spark Python API Docs. 2.2.0.
- How to import PySpark into Python.
Links
- PySpark实战指南-InstallingSpark.pdf