Deploying a Python 3 Environment on a CDH Cluster and Running PySpark Jobs

Anaconda to Python version mapping:

| Anaconda 2/3 release | Python 3 | Python 2 |
| -------------------- | -------- | -------- |
| 5.2.0                | 3.6.5    | 2.7.14   |
| 5.1.0                | 3.6.4    | 2.7.14   |
| 5.0.1                | 3.6.3    | 2.7.14   |
| 5.0.0                | 3.6.2    | 2.7.13   |
| 4.4.0                | 3.6.1    | 2.7.13   |
| 4.3.1                | 3.6.0    | 2.7.13   |
| 4.3.0                | 3.6.0    | 2.7.13   |
| 4.2.0                | 3.5.2    | 2.7.12   |
| 4.1.1                | 3.5.2    | 2.7.12   |
| 4.1.0                | 3.5.1    | 2.7.11   |
| 4.0.0                | 3.5.1    | 2.7.11   |

The steps below install Anaconda3 4.4.0, which ships Python 3.6.1.

  1. Download the Anaconda installer

    wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
  2. Install Anaconda

    bash Anaconda3-4.4.0-Linux-x86_64.sh

    Press Enter:

    [root@node00 ~]# bash Anaconda3-4.4.0-Linux-x86_64.sh 
    
    Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)
    
    In order to continue the installation process, please review the license
    agreement.
    Please, press ENTER to continue
    >>>                                                                                            # (press Enter)
    ===================================
    Anaconda End User License Agreement
    ===================================

    Enter yes:

    Copyright 2017, Continuum Analytics, Inc.
    ...                                                                                            # omitted
    kerberos (krb5, non-Windows platforms)
    A network authentication protocol designed to provide strong authentication
    for client/server applications by using secret-key cryptography.
    
    cryptography
    A Python library which exposes cryptographic recipes and primitives.
    
    Do you approve the license terms? [yes|no]
    >>> yes                                                                                        # enter yes
    Anaconda3 will now be installed into this location:
    /root/anaconda3

    Enter the installation path /opt/cloudera/anaconda3:

    If you see "tar (child): bzip2: Cannot exec: No such file or directory", install bzip2 first: sudo yum -y install bzip2

      - Press ENTER to confirm the location
      - Press CTRL-C to abort the installation
      - Or specify a different location below
    
    [/root/anaconda3] >>> /opt/cloudera/anaconda3         # enter the installation path /opt/cloudera/anaconda3
    PREFIX=/opt/cloudera/anaconda3
    installing: python-3.6.1-2 ...
    installing: _license-1.1-py36_1 ...

    Set the Anaconda PATH:

    To make sure submitted PySpark jobs use Python 3, answer no at the prompt below and set PATH yourself in the next step.

    installing: alabaster-0.7.10-py36_0 ...
    ...                                                                                 # omitted
    installing: zlib-1.2.8-3 ...
    installing: anaconda-4.4.0-np112py36_0 ...
    installing: conda-4.3.21-py36_0 ...
    installing: conda-env-2.6.0-0 ...
    Python 3.6.1 :: Continuum Analytics, Inc.
    creating default environment...
    installation finished.
    Do you wish the installer to prepend the Anaconda3 install location
    to PATH in your /root/.bashrc ? [yes|no]
    [no] >>> no                                                           # enter no
    
    You may wish to edit your .bashrc or prepend the Anaconda3 install location:
    
    $ export PATH=/opt/cloudera/anaconda3/bin:$PATH
    
    Thank you for installing Anaconda3!
    
    Share your notebooks and packages on Anaconda Cloud!
    Sign up for free: https://anaconda.org
  3. Set the Anaconda3 environment variable

    [root@node00 ~]# echo 'export PATH=/opt/cloudera/anaconda3/bin:$PATH' >> /etc/profile    # single quotes keep $PATH unexpanded in /etc/profile
    [root@node00 ~]# source /etc/profile
    [root@node00 ~]# env |grep PATH
    PATH=/opt/cloudera/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
  4. Verify the Python version

    [root@node00 ~]# python
    Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 

    [root@node00 ~]# python -V
    Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
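
    Anaconda also bundles the scientific Python libraries that PySpark jobs commonly rely on. As a quick sanity check (a minimal sketch assuming the default Anaconda 4.4.0 package set; adjust the imports to whatever your jobs actually need):

    import numpy
    import pandas

    # Both ship with Anaconda by default; an ImportError here usually means the
    # system Python is still first on PATH rather than /opt/cloudera/anaconda3.
    print(numpy.__version__, pandas.__version__)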
  5. Configure Spark's Python environment in Cloudera Manager (CM)

    In CM, add the following to the Spark service's spark-env.sh configuration snippet (the screenshot below shows the corresponding field):

    export PYSPARK_PYTHON=/opt/cloudera/anaconda3/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/cloudera/anaconda3/bin/python

    (Screenshot of the CM configuration from the original post: https://i-blog.csdnimg.cn/blog_migrate/4c985369e1a4ea7454e0c5c225048001.png)

    Restart the affected services.
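
    After the restart, it is worth confirming that both the driver and the executors actually pick up the Anaconda interpreter. A minimal sketch, to be run from the pyspark shell started in the next step (it relies on the shell's built-in SparkContext sc):

    import sys

    # Interpreter used by the driver process
    print(sys.executable)

    # Interpreter used on the executors; every returned path should point at
    # /opt/cloudera/anaconda3/bin/python once the CM settings are in effect
    print(sc.parallelize([0, 1], 2)
            .map(lambda _: __import__("sys").executable)
            .collect())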

  6. Test with the pyspark command

    x = sc.parallelize([1,2,3])
    y = x.flatMap(lambda x: (x, 100*x, x**2))
    print(x.collect())
    print(y.collect())
    root@bigdata-dev-41:/home/charles# pyspark
    Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/_\   version 1.6.0
          /_/
    
    Using Python version 3.6.1 (default, May 11 2017 13:09:58)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>> x = sc.parallelize([1,2,3])
    >>> y = x.flatMap(lambda x: (x, 100*x, x**2))
    >>> print(x.collect())
    [1, 2, 3]                                                                       
    >>> print(y.collect())
    [1, 100, 1, 2, 200, 4, 3, 300, 9]                                               
    >>> 
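
    The interactive shell only exercises the driver on the local node. To confirm that executors launched on the cluster also use Python 3, a small script can be submitted with spark-submit. The file name test_python3.py and its contents are only an illustrative sketch:

    # test_python3.py -- illustrative sketch; the file name is just an example
    import sys

    from pyspark import SparkContext

    sc = SparkContext(appName="python3-check")

    # Python version and interpreter used by the driver
    print(sys.version, sys.executable)

    # Interpreter used by each executor; every entry should be
    # /opt/cloudera/anaconda3/bin/python if the configuration took effect
    print(sc.parallelize([0, 1, 2, 3], 2)
            .map(lambda _: __import__("sys").executable)
            .distinct()
            .collect())

    sc.stop()

    Submit it with, for example, spark-submit test_python3.py (adding --master yarn-client or whatever options your cluster requires); all printed interpreter paths should point into /opt/cloudera/anaconda3.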
