
The need for data transformation

The crucial question is this: why do we need data to be transformed for data science? There are two principal reasons. The first is to obtain datasets, or subsets of datasets, because data science models are commonly built on a dataset representing the whole statistical population. We could, for example, execute JOINs over our data every time it is analyzed or used for machine learning training, but this often leads to unnecessary complications in the model, and it can also have a performance impact on the training time.
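As a minimal sketch of this idea, the following T-SQL materializes the join once into a flat table that training can read directly, instead of repeating the JOIN on every run. The Sales.Orders and Sales.Customers tables and their columns are hypothetical, used purely for illustration:

-- Execute the JOIN once and persist the denormalized result,
-- so that model training reads a single flat dataset.
SELECT
    o.OrderId,
    o.OrderDate,
    o.Amount,
    c.Region,
    c.Segment
INTO dbo.TrainingDataset        -- materialized table for model training
FROM Sales.Orders AS o
INNER JOIN Sales.Customers AS c
    ON c.CustomerId = o.CustomerId;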

The second reason is a bit more complicated. The world is full of data, and its volume is always growing. The previous chapter, Chapter 3, Data Sources for Analytics, described many data sources and data creation methods. Let's summarize the growth of data from a different point of view. We can think about data from the perspective of the speed of data creation, as well as from the perspective of the data model used for its storage and manipulation.

First of all, a very traditional and also rather slow method of data creation is simply writing it manually. Many systems, such as accounting or inventory systems, are developed using a client-server model with one of many relational databases, such as Microsoft SQL Server, Oracle, Teradata, and others. Some of these systems are rather old, having been created and used for ten or more years. This leads to some inaccurate historical data, because the systems were often not designed to check data quality sufficiently and their development was not continuous.

Another type of data is produced by machines or whole production lines. This type of data production is growing rapidly, but the data tends to be simpler than in the previous case: it is usually of the same kind and describes the same measure or measures. The use of relational databases is also very common here.

Furthermore, a very modern way of generating data is through IoT and similar applications. Here, the main challenge is the speed of data generation, because many simple devices create records at the same time. For applications like this, the database should in most cases not be relational, because data processing on the database side would be too slow. This need for processing speed is why we use the NoSQL concept rather than relational databases, with servers such as MongoDB or Cosmos DB. NoSQL databases are intended for very fast data processing in applications such as gaming or telemetry, where data is acquired extremely quickly. The NoSQL concept can be challenging, especially for T-SQL developers who are not familiar with this style of data storage.
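To make this more tangible for T-SQL developers, here is a minimal sketch of how a NoSQL-style document can be shredded into a relational rowset with SQL Server's OPENJSON function (available since SQL Server 2016). The telemetry document and its fields are hypothetical:

-- A hypothetical telemetry document, similar to what a NoSQL store might hold
DECLARE @telemetry NVARCHAR(MAX) = N'{
    "deviceId": "sensor-42",
    "readings": [
        { "ts": "2019-05-01T10:00:00", "temperature": 21.5 },
        { "ts": "2019-05-01T10:01:00", "temperature": 21.7 }
    ]
}';

-- OPENJSON turns the readings array into rows with typed columns
SELECT
    JSON_VALUE(@telemetry, '$.deviceId') AS DeviceId,
    r.ts,
    r.temperature
FROM OPENJSON(@telemetry, '$.readings')
WITH (
    ts          DATETIME2     '$.ts',
    temperature DECIMAL(5, 2) '$.temperature'
) AS r;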

The speed of data creation and the data storage model are not the only potential concerns for data scientists. Many organizations also use several systems that are not integrated with one another. This often leads to big problems, such as duplicated data describing the same entity, as well as data stored in different formats with differing quality and accessibility.

Let's summarize the factors that have an impact on the need for data transformations, as described in the previous paragraphs:

  • Speed of data creation
  • Data models used for data manipulation and storage
  • Data accuracy
  • The need to combine several data sources to obtain data that is eligible for data science modeling

Let's now summarize some data science task requirements that follow from the data source characteristics given in the preceding list:

  • Incoming data from data sources should be used regularly for training machine learning models
  • The delay between the time a new record is added to the source data and the time that record is analyzed using a trained machine learning model should stay within acceptable limits

The previous paragraphs have given some examples of why we should not entirely trust data sources, as well as the challenges that data scientists can meet when they want to tackle them. In the following chapters, we will explore technologies that help us extract data from its source and transform it to fit our needs.

The crucial question here is this: how do we use T-SQL to transform, clean, and deduplicate data correctly and efficiently? Unfortunately, this question does not have one simple answer, because no one-size-fits-all solution exists. Consequently, in the next section, we will go through several architectures that help us get data into a consolidated, reliable schema for subsequent data science tasks.
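As a small preview of one common building block, the following sketch deduplicates rows that describe the same entity using the well-known ROW_NUMBER() pattern. The dbo.Customers table, its columns, and the rule that the e-mail address identifies an entity are all hypothetical assumptions made for the sake of the example:

-- Rows sharing the same e-mail address are treated as duplicates;
-- only the most recently modified row of each group is kept.
WITH NumberedCustomers AS
(
    SELECT
        CustomerId,
        Email,
        ModifiedDate,
        ROW_NUMBER() OVER (
            PARTITION BY Email            -- one group per logical entity
            ORDER BY ModifiedDate DESC    -- newest record gets row number 1
        ) AS RowNum
    FROM dbo.Customers
)
DELETE FROM NumberedCustomers
WHERE RowNum > 1;    -- remove everything except the newest record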
