- Python: Advanced Predictive Analytics
- Ashish Kumar, Joseph Babcock
Chapter 3. Data Wrangling
I assume that by now you are at ease with importing datasets from various sources and exploring the look and feel of the data. Handling missing values, creating dummy variables, and plotting are tasks that an analyst (predictive modeller) performs on almost every dataset to make it model-worthy, so an aspiring analyst would do well to master these tasks as well.
Next in the line of skills to master in order to juggle data like a pro is data wrangling. Put simply, it is just a fancy term for the slicing and dicing of data. If you compare the entire predictive modelling process to a complex operation or surgery performed on a patient, then the preliminary examination with a stethoscope and the diagnostic checks are the data cleaning and exploration process, zeroing in on the ailing area and deciding which body part to operate on is data wrangling, and performing the surgery itself is the modelling process.

A surgeon can vouch for the fact that zeroing in on the specific body part is the most critical piece of the puzzle to crack before one gets to the root of the ailment. The same is the case with data wrangling. The data is not always in one place or in one table; the information you need for your model may be scattered across different datasets. What does one do in such cases? One doesn't always need the entire dataset, either. Often, one needs only a column, a few rows, or a combination of a few rows and columns. How does one do all this juggling? That is the crux of this chapter. Beyond this, the chapter tries to equip the reader with all the tools needed on their journey through predictive modelling.
At the end of the chapter, the reader should be comfortable with the following functions:
- Subsetting a dataset: Slicing and dicing data, selecting a few rows and columns based on certain conditions, similar to filtering in Excel
- Generating random numbers: An important tool when performing simulations and creating dummy data frames
- Aggregating data: A technique that groups data by the categories of a categorical variable
- Sampling data: Very important before venturing into actual modelling; dividing a dataset into training and testing data is essential
- Merging/appending/concatenating datasets: The solution to the problem that arises when the data required for modelling is scattered across different datasets
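As a quick preview, the tasks above can be sketched with pandas on a tiny hand-made data frame. This is only an illustrative sketch; the `sales` and `prices` frames and their column names are made up for the example, and the chapter walks through each task in detail:

```python
import pandas as pd

# Hypothetical toy data; the chapter uses real public datasets later
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "units": [10, 20, 15, 5, 30],
})

# Subsetting: select rows and columns by condition (like an Excel filter)
east = sales[sales["region"] == "East"][["units"]]

# Aggregating: group by the categories of a categorical variable
totals = sales.groupby("region")["units"].sum()

# Sampling: divide the dataset into training and testing subsets
train = sales.sample(frac=0.8, random_state=1)
test = sales.drop(train.index)

# Merging: combine information scattered across different datasets
prices = pd.DataFrame({"region": ["East", "West"], "price": [2.0, 3.0]})
merged = sales.merge(prices, on="region")
```

Each of these one-liners hides options (join types, sampling without replacement, multiple grouping keys) that the coming sections unpack.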
We will be using a variety of public datasets in this chapter. Another good way of demonstrating these concepts is to use dummy datasets created from random numbers; in fact, random numbers are used heavily for this purpose, so we will use a mix of both.
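For instance, a dummy data frame can be built from random numbers in a couple of lines. A minimal sketch, where the column names and distributions are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed the generator for reproducibility
dummy = pd.DataFrame({
    "x": rng.normal(loc=0, scale=1, size=100),    # 100 draws from N(0, 1)
    "y": rng.integers(low=0, high=10, size=100),  # 100 random ints in [0, 10)
})
```

Because the seed is fixed, the "random" frame is the same on every run, which makes examples reproducible.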
Let us now kick-start the chapter by learning about subsetting a dataset. As the chapter unfolds, one will realize how ubiquitous and indispensable this operation is.