Chapter 2. Data Preparation for Spark ML

Machine learning professionals and data scientists often spend 70% or 80% of their time preparing data for their machine learning projects. Data preparation can be very hard work, but it is necessary and extremely important because it affects everything that follows. In this chapter, we will therefore cover the data preparation work needed for machine learning, which typically runs from accessing data, through cleaning data and joining datasets, to feature development, so that our datasets are ready for developing ML models on Spark. Specifically, we will discuss the following six data preparation tasks and then end the chapter with a discussion of repeatability and automation:

  • Accessing and loading datasets
    • Publicly available datasets for ML
    • Loading datasets into Spark easily
    • Exploring and visualizing data with Spark
  • Data cleaning
    • Dealing with missing cases and incompleteness
    • Data cleaning on Spark
    • Data cleaning made easy
  • Identity matching
    • Dealing with identity issues
    • Data matching on Spark
    • Data matching made better
  • Data reorganizing
    • Data reorganizing tasks
    • Data reorganizing on Spark
    • Data reorganizing made easy
  • Joining data
    • Spark SQL to join datasets
    • Joining data with Spark SQL
    • Joining data made easy
  • Feature extraction
    • Feature extraction challenges
    • Feature extraction on Spark
    • Feature extraction made easy
  • Repeatability and automation
    • Dataset preprocessing workflows
    • Spark pipelines for preprocessing
    • Dataset preprocessing automation