- Hands-On Big Data Analytics with PySpark
- Rudy Lai, Bartłomiej Potaczek
Getting data into Spark
- Next, load the KDD cup data into PySpark using sc, as shown in the following command:
raw_data = sc.textFile("./kddcup.data.gz")
- Entering the variable name confirms that the data is now held in the raw_data variable:
raw_data
This produces output like the following:
./kddcup.data.gz MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0
Evaluating the raw_data variable shows the path of the underlying kddcup.data.gz file and tells us that it is backed by a MapPartitionsRDD: Spark's lazy, partitioned representation of the file's lines. Note that no data has actually been read yet; textFile only records how to read the file, and the work happens when an action is triggered.
Now that we know how to load the data into Spark, let's learn about parallelization with Spark RDDs.
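Since running the snippet above needs a live Spark installation and the KDD Cup file, here is a minimal plain-Python sketch of what sc.textFile does conceptually for a local .gz file: it decompresses and yields lines lazily rather than loading everything into memory. The file name sample.data.gz and the helper text_file are hypothetical, used only for illustration.

```python
import gzip
import os
import tempfile

# Create a tiny gzip-compressed text file standing in for kddcup.data.gz
# (the real KDD Cup file is not bundled here).
path = os.path.join(tempfile.mkdtemp(), "sample.data.gz")
with gzip.open(path, "wt") as f:
    f.write("0,tcp,http,SF,215,45076\n")
    f.write("0,tcp,http,SF,162,4528\n")

def text_file(p):
    """Lazily yield decompressed lines, roughly what sc.textFile
    does for a local .gz file: nothing is read until iteration."""
    with gzip.open(p, "rt") as fh:
        for line in fh:
            yield line.rstrip("\n")

lines = text_file(path)      # no I/O yet, like the lazy RDD
first = next(lines)          # analogous to raw_data.first() in PySpark
print(first)
```

In real PySpark, the same laziness means raw_data = sc.textFile(...) returns instantly even for a multi-gigabyte file; the read is deferred until an action such as count() or first() runs.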