Setting Up a Jupyter PySpark Development Environment (Online and Offline)


Use Cases

  • Using Python in Jupyter for data analysis has become a mainstream trend.
  • How to bring Spark into Jupyter for large-scale data cleaning, mining, and similar tasks is a question worth exploring.
  • There are many candidate approaches; for various reasons, you ultimately have to find the one that fits your own situation.

Implementation Options

  • Option 1
  • Use Apache Toree to bring Spark into Jupyter, creating Scala, PySpark, SparkR, and SQL kernels
  • Kernel installation command
  • jupyter toree install --interpreters=PySpark
  • Magics are not supported, usage is limited, and startup is slow. Not recommended (but worth knowing about).
  • Option 2
  • findspark (install from the whl package; recommended)
  • spark-2.3.0-bin-hadoop2.7.tgz
  • Configure the SPARK_HOME environment variable in ~/.bashrc
  • Recommended: fast startup, supports magics, convenient to use.

Setup Steps

  • Step 1: prepare the installation packages
  • findspark-1.1.0-py2.py3-none-any.whl
  • spark-2.3.0-bin-hadoop2.7.tgz
  • Step 2: install and configure
  • Install findspark
  • /root/anaconda2/bin/pip install findspark-1.1.0-py2.py3-none-any.whl
  • Unpack spark-2.3.0-bin-hadoop2.7.tgz
  • tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz -C your-spark-home
  • Configure SPARK_HOME (append to ~/.bashrc)
    export SPARK_HOME=your-spark-home
    export PATH=$SPARK_HOME/bin:$PATH
    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
  • Step 3: test the setup
  • JSON-format test data

[
  {"itemNo": 1, "name": "ferrari", "speed": 259, "weight": 800},
  {"itemNo": 2, "name": "jaguar", "speed": 274, "weight": 998},
  {"itemNo": 3, "name": "mercedes", "speed": 340, "weight": 1800},
  {"itemNo": 4, "name": "audi", "speed": 345, "weight": 875},
  {"itemNo": 5, "name": "lamborghini", "speed": 355, "weight": 1490},
  {"itemNo": 6, "name": "chevrolet", "speed": 260, "weight": 900},
  {"itemNo": 7, "name": "ford", "speed": 250, "weight": 1061},
  {"itemNo": 8, "name": "porche", "speed": 320, "weight": 1490},
  {"itemNo": 9, "name": "bmw", "speed": 325, "weight": 1190},
  {"itemNo": 10, "name": "mercedes-benz", "speed": 312, "weight": 1567}
]
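Before handing the file to Spark, it can be worth a quick sanity check with Python's standard json module. The sketch below inlines the first three records from the test data above; in practice you would read the saved cars_datas.json instead:

```python
import json

# Three records from the test data above, inlined for a quick structural check.
cars = json.loads("""
[{"itemNo": 1, "name": "ferrari", "speed": 259, "weight": 800},
 {"itemNo": 2, "name": "jaguar", "speed": 274, "weight": 998},
 {"itemNo": 3, "name": "mercedes", "speed": 340, "weight": 1800}]
""")

# Every record exposes the same four fields, which Spark will infer as columns.
assert all(set(c) == {"itemNo", "name", "speed", "weight"} for c in cars)
print(len(cars))  # number of records parsed
```

If the JSON were malformed, json.loads would raise an error here rather than leaving you to debug an opaque schema-inference failure inside Spark.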


  • Test code

import findspark
findspark.init()  # locate Spark via SPARK_HOME; must run before importing pyspark

%matplotlib inline

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for the DataFrame and SQL APIs.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Load the JSON test data; Spark infers the schema from the records.
df = spark.read.json('./cars_datas.json')

# Keep only the "speed" column.
filtered = df[['speed']]
filtered.show()
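For readers new to the DataFrame API, df[['speed']] is a column projection: it keeps only the speed column and drops the rest. The same operation expressed in plain Python over the parsed JSON records (a Spark-free sketch, just for intuition) looks like this:

```python
# Plain-Python equivalent of the projection df[['speed']]:
# keep only the "speed" field from each record.
records = [
    {"itemNo": 1, "name": "ferrari", "speed": 259, "weight": 800},
    {"itemNo": 2, "name": "jaguar", "speed": 274, "weight": 998},
]

filtered = [{"speed": r["speed"]} for r in records]
print(filtered)  # [{'speed': 259}, {'speed': 274}]
```

The difference is that Spark evaluates the projection lazily and distributes it across the cluster, while this list comprehension runs eagerly in a single process.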


  • Test results

Other Commands

  • List the installed kernels
  • jupyter kernelspec list
  • Uninstall a Jupyter kernel
  • sudo jupyter kernelspec uninstall your-kernel-name

References

  • jupyter spark environment setup (online and offline). Comparison article
  • Apache Spark in Python: Beginner's Guide. Recommended
  • Jupyter pyspark: no module named pyspark
  • How to use SparkSession in Apache Spark 2.0. Recommended
  • Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. Recommended
  • Spark-Tutorial provides a quick introduction to using Spark. It demonstrates the basic functionality of the RDD and DataFrame APIs. GitHub
  • Welcome to Spark Python API Docs. 2.2.0
  • How to import PySpark into Python

Links

  • PySpark实战指南-InstallingSpark.pdf

