[Introduction to PySpark] Part 2: Setting Up a PySpark Environment

Introduction

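Hello. I'm Kong, an 80th-generation descendant of Confucius and a member of the Development Services section. I went to Shinjuku today for the first time in a while. Summer means Shinjuku Gyoen, and it feels great to stroll surrounded by greenery in the middle of the city.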

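In the previous post, we took a quick look at what PySpark is. Spark, a framework for analyzing large volumes of data with distributed processing, provides an API so that it can be used from Python. That API for using Spark from Python is what we call PySpark. (See the previous post for details.)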

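In this introductory series we will use a variety of APIs in practice to see how PySpark works, but first we need a platform to run PySpark on. You can build a standalone PySpark environment suited to your OS, but this series focuses on how to operate PySpark while reviewing the concepts used in Spark. If you want to build a clustered Spark environment or a standalone one, please follow the official Spark documentation (※1).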

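This post shows how to build a PySpark environment with Docker, which makes it easy to try PySpark out, and how to set up an execution environment with the pip command. I hope you find it useful.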

Setting up the environment with Docker

Preparation

First, you need to install Docker. If you have not installed it yet, the link below explains how.

docs.docker.com

Once Docker is installed, you are ready to go. For reference, the environment I use this time is as follows.

OS: Windows 10
Docker Desktop: 3.5.2
Docker Engine: 20.10.7
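
If you want to compare your own setup with the versions listed above, the following is a minimal Python sketch that asks the Docker CLI for its version. It assumes that `docker` is on your PATH; the helper name `docker_version` is hypothetical and only used for illustration.

```python
import shutil
import subprocess
from typing import Optional


def docker_version() -> Optional[str]:
    """Return the output of `docker --version`, or None if Docker is unavailable."""
    docker_path = shutil.which("docker")  # look up the docker executable on PATH
    if docker_path is None:
        return None
    try:
        result = subprocess.run(
            [docker_path, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # The executable exists but could not be run successfully.
        return None
    return result.stdout.strip()


version = docker_version()
print(version if version else "docker is not available in this environment")
```

If the script reports that Docker is not available, check the installation before moving on to the next step.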

Downloading the image

Let's download the Docker image for running PySpark and create a container. For Docker concepts and basic usage, I have left a few URLs in the references, so take a look if you are interested (※2, ※3).

The Docker image we use this time is jupyter/pyspark-notebook. It is a handy image that sets up both PySpark and Jupyter (※4) at the same time. The command to download it with Docker is below. (The image is about 3.3 GB, so be careful.)

$ docker pull jupyter/pyspark-notebook

At the time of writing, the latest version is the one below. This blog series uses this version.

https://hub.docker.com/layers/jupyter/pyspark-notebook/ubuntu-20.04/images/sha256-a9d7a48492e5c76d0e23d4c58778f277134a4d2de5d4ece23fd4f0088ad266d6?context=explore

Once the image has finished downloading, let's create the container. Running the run command below prints some log output and, at the end, the URL for accessing Jupyter.

$ docker run -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes --name pyspark jupyter/pyspark-notebook
/usr/local/bin/start-notebook.sh: running hooks in /usr/local/bin/before-notebook.d
/usr/local/bin/start-notebook.sh: running /usr/local/bin/before-notebook.d/spark-config.sh
/usr/local/bin/start-notebook.sh: done running hooks in /usr/local/bin/before-notebook.d
Executing the command: jupyter lab
[I 2021-07-25 14:07:30.921 ServerApp] jupyterlab | extension was successfully linked.
[W 2021-07-25 14:07:30.925 NotebookApp] 'ip' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-07-25 14:07:30.925 NotebookApp] 'port' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-07-25 14:07:30.925 NotebookApp] 'port' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2021-07-25 14:07:30.933 ServerApp] Writing Jupyter server cookie secret to /home/jovyan/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2021-07-25 14:07:31.079 ServerApp] nbclassic | extension was successfully linked.
[I 2021-07-25 14:07:31.102 ServerApp] nbclassic | extension was successfully loaded.
[I 2021-07-25 14:07:31.103 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.9/site-packages/jupyterlab
[I 2021-07-25 14:07:31.103 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 2021-07-25 14:07:31.107 ServerApp] jupyterlab | extension was successfully loaded.
[I 2021-07-25 14:07:31.108 ServerApp] Serving notebooks from local directory: /home/jovyan
[I 2021-07-25 14:07:31.108 ServerApp] Jupyter Server 1.9.0 is running at:
[I 2021-07-25 14:07:31.108 ServerApp] http://xxxx:8888/lab?token=xxxxxxx
[I 2021-07-25 14:07:31.108 ServerApp]  or http://127.0.0.1:8888/lab?token=xxxxxxx
[I 2021-07-25 14:07:31.108 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2021-07-25 14:07:31.111 ServerApp]

    To access the server, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/jpserver-8-open.html
    Or copy and paste one of these URLs:  # the URL for accessing Jupyter is issued here
        http://xxxxxxx:8888/lab?token=xxxxxxx
     or http://127.0.0.1:8888/lab?token=xxxxxxx
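
The log above is long, and the only part you really need is the URL with the token. As a minimal sketch, the Python snippet below pulls the `127.0.0.1` URL out of the log text with a regular expression. The function name `extract_jupyter_url` is hypothetical, and the pattern assumes the `/lab?token=...` form shown in the log above (which is what appears when JupyterLab is enabled).

```python
import re
from typing import Optional

# Matches the loopback URL with its token, in the form printed in the startup log.
# Assumption: the path is /lab, as in the log output shown in this post.
_URL_PATTERN = re.compile(r"http://127\.0\.0\.1:\d+/lab\?token=\S+")


def extract_jupyter_url(log_text: str) -> Optional[str]:
    """Return the first tokened 127.0.0.1 URL found in the log, or None."""
    match = _URL_PATTERN.search(log_text)
    return match.group(0) if match else None


# Two lines copied from the log above (the token is masked in the post).
sample_log = """\
[I 2021-07-25 14:07:31.108 ServerApp] http://xxxx:8888/lab?token=xxxxxxx
[I 2021-07-25 14:07:31.108 ServerApp]  or http://127.0.0.1:8888/lab?token=xxxxxxx
"""

print(extract_jupyter_url(sample_log))
```

If you have lost the terminal output, `docker logs pyspark` (using the container name given to `--name` above) should print the same log again, and its text can be passed to the function.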

When you open the URL, a screen like the one below is displayed.

(Screenshot: the JupyterLab Launcher screen)

Click Terminal under Other in the Launcher tab, then enter the command below in the tab that opens. If you see the same log output followed by a prompt, the setup is complete.

(base) jovyan@3eecdb8ae16d:~$ pyspark
Python 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/07/25 14:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/07/25 14:21:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.9.5 (default, Jun 19 2021 00:32:32)
Spark context Web UI available at http://3eecdb8ae16d:4041
Spark context available as 'sc' (master = local[*], app id = local-1627222886023).
SparkSession available as 'spark'.
>>>

(Screenshot: the pyspark shell running in the JupyterLab terminal)
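
Besides the `pyspark` shell, you may also want to confirm that PySpark works from a notebook or a plain Python script. The following is a minimal sketch of such a smoke test. The function names and the app name `smoke-test` are hypothetical; `local[*]` is the same master shown in the shell output above.

```python
import importlib.util


def pyspark_available() -> bool:
    """Return True if the pyspark package can be imported by this interpreter."""
    return importlib.util.find_spec("pyspark") is not None


def run_smoke_test() -> int:
    """Start a local SparkSession, count a 5-row DataFrame, then stop the session."""
    # Imported lazily so that pyspark_available() also works without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")      # run locally, using all available cores
        .appName("smoke-test")   # hypothetical app name, for illustration only
        .getOrCreate()
    )
    try:
        # spark.range(5) builds a DataFrame with ids 0 to 4, so the count should be 5.
        return spark.range(5).count()
    finally:
        spark.stop()


print("pyspark importable:", pyspark_available())
```

Inside the container, where PySpark and Java are already set up, calling `run_smoke_test()` in a notebook cell should return 5. The Spark job is kept in its own function so that the import check can run anywhere, even on a machine without Spark.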

Closing

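Environment setup is where people most often give up when learning a new technology, isn't it? For the record, I once tried to build a standalone PySpark environment on Windows 10 and was stuck for two days. That's enough of the boring stuff. Starting next time, let's actually write code with PySpark! Along the way I'd like to touch on the main concepts used in PySpark. See you next time!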