Deploying a Python 3 Environment on a CDH Cluster and Running PySpark Jobs

Anaconda to Python version mapping:

| Anaconda 2/3 release | Python 3 | Python 2 |
| -------------------- | -------- | -------- |
| 5.2.0                | 3.6.5    | 2.7.14   |
| 5.1.0                | 3.6.4    | 2.7.14   |
| 5.0.1                | 3.6.3    | 2.7.14   |
| 5.0.0                | 3.6.2    | 2.7.13   |
| 4.4.0                | 3.6.1    | 2.7.13   |
| 4.3.1                | 3.6.0    | 2.7.13   |
| 4.3.0                | 3.6.0    | 2.7.13   |
| 4.2.0                | 3.5.2    | 2.7.12   |
| 4.1.1                | 3.5.2    | 2.7.12   |
| 4.1.0                | 3.5.1    | 2.7.11   |
| 4.0.0                | 3.5.1    | 2.7.11   |

The steps below install Anaconda3 4.4.0, which ships Python 3.6.1.

  1. Download the Anaconda installer

    wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
  2. Install Anaconda

    bash Anaconda3-4.4.0-Linux-x86_64.sh

    Press Enter:

    [root@node00 ~]# bash Anaconda3-4.4.0-Linux-x86_64.sh 
    
    Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)
    
    In order to continue the installation process, please review the license
    agreement.
    Please, press ENTER to continue
    >>>                                                                                            # (press Enter)
    ===================================
    Anaconda End User License Agreement
    ===================================

    Enter yes:

    Copyright 2017, Continuum Analytics, Inc.
    ...                                                                                            # omitted
    kerberos (krb5, non-Windows platforms)
    A network authentication protocol designed to provide strong authentication
    for client/server applications by using secret-key cryptography.
    
    cryptography
    A Python library which exposes cryptographic recipes and primitives.
    
    Do you approve the license terms? [yes|no]
    >>> yes                                                                                        # enter yes
    Anaconda3 will now be installed into this location:
    /root/anaconda3

    Enter the installation path /opt/cloudera/anaconda3:

    If you see "tar (child): bzip2: Cannot exec: No such file or directory", install bzip2 first: sudo yum -y install bzip2

      - Press ENTER to confirm the location
      - Press CTRL-C to abort the installation
      - Or specify a different location below
    
    [/root/anaconda3] >>> /opt/cloudera/anaconda3         # enter the installation path /opt/cloudera/anaconda3
    PREFIX=/opt/cloudera/anaconda3
    installing: python-3.6.1-2 ...
    installing: _license-1.1-py36_1 ...

    Set the Anaconda PATH:

    To make sure submitted PySpark jobs use Python 3, answer no at the prompt below and set PATH yourself in the next step.

    installing: alabaster-0.7.10-py36_0 ...
    ...                                                                                 # omitted
    installing: zlib-1.2.8-3 ...
    installing: anaconda-4.4.0-np112py36_0 ...
    installing: conda-4.3.21-py36_0 ...
    installing: conda-env-2.6.0-0 ...
    Python 3.6.1 :: Continuum Analytics, Inc.
    creating default environment...
    installation finished.
    Do you wish the installer to prepend the Anaconda3 install location
    to PATH in your /root/.bashrc ? [yes|no]
    [no] >>> no                                                           # enter no
    
    You may wish to edit your .bashrc or prepend the Anaconda3 install location:
    
    $ export PATH=/opt/cloudera/anaconda3/bin:$PATH
    
    Thank you for installing Anaconda3!
    
    Share your notebooks and packages on Anaconda Cloud!
    Sign up for free: https://anaconda.org
  3. Set the Anaconda3 environment variable

    [root@node00 ~]# echo 'export PATH=/opt/cloudera/anaconda3/bin:$PATH' >> /etc/profile    # single quotes keep $PATH unexpanded in /etc/profile
    [root@node00 ~]# source /etc/profile
    [root@node00 ~]# env |grep PATH
    PATH=/opt/cloudera/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
  4. Verify the Python version

    [root@node00 ~]# python
    Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 

    [root@node00 ~]# python -V
    Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
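
    Anaconda also bundles the scientific Python libraries that PySpark jobs commonly rely on. As a quick sanity check (a minimal sketch assuming the default Anaconda 4.4.0 package set; adjust the imports to whatever your jobs actually need):

    import numpy
    import pandas

    # Both ship with Anaconda by default; an ImportError here usually means the
    # system Python is still first on PATH rather than /opt/cloudera/anaconda3.
    print(numpy.__version__, pandas.__version__)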
  5. Configure Spark's Python environment in Cloudera Manager (CM)

    In CM, add the following to the Spark service's spark-env.sh configuration snippet (the screenshot below shows the corresponding field):

    export PYSPARK_PYTHON=/opt/cloudera/anaconda3/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/cloudera/anaconda3/bin/python

    (Screenshot of the CM configuration from the original post: https://i-blog.csdnimg.cn/blog_migrate/4c985369e1a4ea7454e0c5c225048001.png)

    Restart the affected services.
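
    After the restart, it is worth confirming that both the driver and the executors actually pick up the Anaconda interpreter. A minimal sketch, to be run from the pyspark shell started in the next step (it relies on the shell's built-in SparkContext sc):

    import sys

    # Interpreter used by the driver process
    print(sys.executable)

    # Interpreter used on the executors; every returned path should point at
    # /opt/cloudera/anaconda3/bin/python once the CM settings are in effect
    print(sc.parallelize([0, 1], 2)
            .map(lambda _: __import__("sys").executable)
            .collect())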

  6. Test with the pyspark command

    x = sc.parallelize([1,2,3])
    y = x.flatMap(lambda x: (x, 100*x, x**2))
    print(x.collect())
    print(y.collect())
    root@bigdata-dev-41:/home/charles# pyspark
    Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/_\   version 1.6.0
          /_/
    
    Using Python version 3.6.1 (default, May 11 2017 13:09:58)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>> x = sc.parallelize([1,2,3])
    >>> y = x.flatMap(lambda x: (x, 100*x, x**2))
    >>> print(x.collect())
    [1, 2, 3]                                                                       
    >>> print(y.collect())
    [1, 100, 1, 2, 200, 4, 3, 300, 9]                                               
    >>> 
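
    The interactive shell only exercises the driver on the local node. To confirm that executors launched on the cluster also use Python 3, a small script can be submitted with spark-submit. The file name test_python3.py and its contents are only an illustrative sketch:

    # test_python3.py -- illustrative sketch; the file name is just an example
    import sys

    from pyspark import SparkContext

    sc = SparkContext(appName="python3-check")

    # Python version and interpreter used by the driver
    print(sys.version, sys.executable)

    # Interpreter used by each executor; every entry should be
    # /opt/cloudera/anaconda3/bin/python if the configuration took effect
    print(sc.parallelize([0, 1, 2, 3], 2)
            .map(lambda _: __import__("sys").executable)
            .distinct()
            .collect())

    sc.stop()

    Submit it with, for example, spark-submit test_python3.py (adding --master yarn-client or whatever options your cluster requires); all printed interpreter paths should point into /opt/cloudera/anaconda3.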
