Title: Mastering Machine Learning with Spark 2.x
Authors: Alex Tellez, Max Pumperla, Michal Malohlava
Updated: 2021-07-02 18:46:05

Data munging

Raw data often comes from multiple sources with different, and often incompatible, formats. The beauty of the Spark programming model is its ability to define data operations that process the incoming data and transform it into a regular form that can be used for further feature engineering and model building. This process is commonly referred to as data munging, and it is where much of the battle is won in data science projects. We keep this section intentionally brief because the best way to showcase the power (and necessity!) of data munging is by example. So take heart: this book offers plenty of practice, with an emphasis on this essential process.
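
As a small taste of the idea, here is a minimal sketch of munging two incompatible sources into one regular form. It is written in plain Python rather than Spark so that it runs anywhere, and everything in it is an illustrative assumption: the sample data, the field names, and the helper functions `normalize`, `from_csv`, `from_json`, and `munge` are hypothetical and do not come from the book. In Spark, the same steps would be expressed as transformations over distributed data.

```python
import csv
import io
import json

# Hypothetical raw inputs: the same kind of record arriving in two formats,
# with different column names, stray whitespace, and a missing value.
CSV_SOURCE = "id,name,age\n1, Alice ,34\n2,Bob,\n"
JSON_SOURCE = '[{"ID": "3", "Name": "carol", "Age": "29"}]'


def normalize(record_id, name, age):
    """Map one raw record onto a single regular schema."""
    age_text = str(age).strip() if age is not None else ""
    return {
        "id": int(record_id),
        "name": str(name).strip().title(),
        "age": int(age_text) if age_text else None,  # missing ages become None
    }


def from_csv(text):
    """Parse the CSV source and normalize each row."""
    rows = csv.DictReader(io.StringIO(text))
    return [normalize(r["id"], r["name"], r["age"]) for r in rows]


def from_json(text):
    """Parse the JSON source, which uses capitalized keys, and normalize it."""
    return [normalize(r["ID"], r["Name"], r.get("Age")) for r in json.loads(text)]


def munge(csv_text, json_text):
    """Combine both sources into one list of uniformly shaped records."""
    return from_csv(csv_text) + from_json(json_text)


records = munge(CSV_SOURCE, JSON_SOURCE)
for record in records:
    print(record)
```

Once every record shares the same keys and types, later stages such as feature engineering can treat the data uniformly, regardless of where each record came from.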
