
Getting data into Spark

  1. Next, load the KDD cup data into PySpark using sc, as shown in the following command:
raw_data = sc.textFile("./kddcup.data.gz")

  2. In the following command, we can see that the raw data is now in the raw_data variable:
raw_data

This output is as demonstrated in the following code snippet:

./kddcup.data.gz MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

Entering the raw_data variable shows that it refers to the kddcup.data.gz file, the location of the underlying data, and that Spark has wrapped it in a MapPartitionsRDD.

Now that we know how to load the data into Spark, let's learn about parallelization with Spark RDDs.
