Data wrangling with iPython

I found iPython to be the best way to learn Spark. It is also a very good choice for data scientists and data engineers to explore, model, and reason with data.

  • The exploration step includes understanding the data, experimenting with multiple transformations, and extracting features for aggregation and machine learning, as well as working out ETL strategies
  • The modeling and reasoning steps (reasoning about the relationships and distributions between the variables) require fast iteration over the data and extracted features with different algorithms, experimenting with different parameters, and arriving at a set of ML algorithms with which to develop an analytics app

The iPython installation for your system (depending on OS, CPU, and so on) is best described at the iPython site, http://ipython.org/install.html and https://ipython.readthedocs.org/en/stable/install/install.html. The notebook interface used here requires the Jupyter notebook system in addition to the iPython libraries. Of course, you also need to have Python installed on your system.
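
Before moving on, it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch, not part of the official installation instructions; the helper name have_tool is hypothetical, and the executable names python, ipython, and jupyter are assumptions that may differ on your system (for example, python3).

```shell
# Hypothetical helper: succeeds if the named executable is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# The tool names below are assumptions; adjust them to match your installation.
for tool in python ipython jupyter; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found"
    fi
done
```

If any of the three is reported as not found, revisit the installation pages listed above before trying to start pyspark.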

Once iPython is working, starting Spark development with iPython is very easy. The iPython IDE hooks up to pyspark, and you work with it through the web browser. The steps are as follows:

  • Use cd to change into the directory where your notebooks are; for example, assuming that you have downloaded the fdps-v3 repository from GitHub into your home directory, enter the following:
 cd ~/fdps-v3
 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/Downloads/spark-2.0.0-preview/bin/pyspark
  • I have Spark in my Downloads directory. If you have Spark in your /opt directory, the command would be as follows:
 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /opt/spark/bin/pyspark
  • What you are doing is invoking pyspark with iPython as its driver, which opens the iPython IDE.
  • You will see the IDE in the browser as shown in the following screenshot:
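
If you start the notebook often, the two commands above can be wrapped in small shell functions so that the Spark location is set in one place. This is only a sketch under stated assumptions: the function names pyspark_notebook_cmd and start_pyspark_notebook are hypothetical, and the /opt/spark default for SPARK_HOME is an assumption you should replace with your own Spark directory.

```shell
# Assumed default location; set SPARK_HOME to wherever your Spark lives.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Print the launch command instead of running it, so it can be inspected first.
pyspark_notebook_cmd() {
    printf '%s %s %s\n' \
        'PYSPARK_DRIVER_PYTHON=ipython' \
        'PYSPARK_DRIVER_PYTHON_OPTS="notebook"' \
        "$SPARK_HOME/bin/pyspark"
}

# Start pyspark with the notebook driver from the notebook directory given
# as the first argument. The subshell keeps the caller's directory unchanged.
start_pyspark_notebook() {
    (
        cd "$1" || exit 1
        PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
            "$SPARK_HOME/bin/pyspark"
    )
}
```

For example, after setting SPARK_HOME=~/Downloads/spark-2.0.0-preview, calling start_pyspark_notebook ~/fdps-v3 runs the same command as the first step above, and pyspark_notebook_cmd shows what would be executed without starting anything.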