官术网_书友最值得收藏!

Data preprocessing

Taking into account the goals of data preparation, Scala was chosen as an easy and interactive way to manipulate data:

val priceDataFileName: String = "bitstampUSD_1-min_data_2012-01-01_to_2017-10-20.csv"

val spark = SparkSession
.builder()
.master("local[*]")
.config("spark.sql.warehouse.dir", "E:/Exp/")
.appName("Bitcoin Preprocessing")
.getOrCreate()

val data = spark.read.format("com.databricks.spark.csv").option("header", "true").load(priceDataFileName)
data.show(10)
>>>
Figure 5: A glimpse of the Bitcoin historical price dataset
println((data.count(), data.columns.size))

>>>

(3045857, 8)

In the preceding code, we load data from the file downloaded from Kaggle and look at what is inside. There are 3045857 rows in the dataset and 8 columns, described before. Then we create the Delta column, containing the difference between closing and opening prices (that is, to consider only that data where meaningful trading has started to occur):

val dataWithDelta = data.withColumn("Delta", data("Close") - data("Open"))

The following code labels our data by assigning 1 to the rows the Delta value of which was positive; it assigns 0 otherwise:

import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._

val dataWithLabels = dataWithDelta.withColumn("label", when($"Close" - $"Open" > 0, 1).otherwise(0))
rollingWindow(dataWithLabels, 22, outputDataFilePath, outputLabelFilePath)

This code transforms the original dataset into time series data. It takes the Delta values of WINDOW_SIZE rows (22 in this experiment) and makes a new row out of them. In this way, the first row has Delta values from t0 to t21, and the second one has values from t1 to t22. Then we create the corresponding array with labels (1 or 0).

Finally, we save X and Y into files where 612000 rows were cut off from the original dataset; 22 means rolling window size and 2 classes represents that labels are binary 0 and 1:

val dropFirstCount: Int = 612000

def rollingWindow(data: DataFrame, window: Int, xFilename: String, yFilename: String): Unit = {
var i = 0
val xWriter = new BufferedWriter(new FileWriter(new File(xFilename)))
val yWriter = new BufferedWriter(new FileWriter(new File(yFilename)))
val zippedData = data.rdd.zipWithIndex().collect()
System.gc()
val dataStratified = zippedData.drop(dropFirstCount)//slice 612K

while (i < (dataStratified.length - window)) {
val x = dataStratified
.slice(i, i + window)
.map(r => r._1.getAs[Double]("Delta")).toList
val y = dataStratified.apply(i + window)._1.getAs[Integer]("label")
val stringToWrite = x.mkString(",")
xWriter.write(stringToWrite + "n")
yWriter.write(y + "n")
i += 1

if (i % 10 == 0) {
xWriter.flush()
yWriter.flush()
}
}
xWriter.close()
yWriter.close()
}

In the preceding code segment:

val outputDataFilePath: String = "output/scala_test_x.csv"
val outputLabelFilePath: String = "output/scala_test_y.csv"
主站蜘蛛池模板: 濮阳市| 高要市| 朝阳区| 康定县| 内江市| 永城市| 浑源县| 临汾市| 甘孜县| 涟水县| 浦县| 安吉县| SHOW| 揭阳市| 海安县| 沙坪坝区| 深圳市| 且末县| 宜城市| 台中县| 临漳县| 洪洞县| 高淳县| 遂川县| 清远市| 天等县| 甘孜县| 福建省| 当涂县| 莫力| 于田县| 贡嘎县| 扎兰屯市| 贡嘎县| 松阳县| 衡阳县| 都昌县| 榆中县| 拜泉县| 治县。| 高要市|