- Hands-On Data Science with SQL Server 2017
- Marek Chmel Vladimír Mu?n?
- 276字
- 2021-06-10 19:14:03
Data Transforming and Cleaning with T-SQL
`Data comes from a wide range of sources. It can be relational or non-relational, the connectivity can be unstable, and there are also many other issues when data has to be extracted from data sources. This is why developers, statisticians, and data scientists should never entirely believe in the quality of the source data. This chapter explains the techniques for data transformation and cleansing using Transact-SQL (T-SQL) language.
The following topics will be covered in this chapter:
- The need for data transformation: This section presents the main goal of data transformation for data science purposes and, using examples, also provides several cases of what could happen to incoming data.
- Database architectures for data transformations: Data transformations can vary from very simple to very complex. That's why it's necessary to find the right architecture to find the most reliable set of transform tasks.
- Transforming data: This includes accuracy checks, deduplication, high-watermark for incremental loads, and so on. There are also many other actions that could be seen as transformations.
- Denormalizing data: As a lot of data comes from relational databases, its format is strongly normalized. Denormalization is a part of data transformation, which is useful for fitting data better for analytical purposes.
- Using views and stored procedures: Views and stored procedures are very common database objects. This is the same when these objects are used for data transformations.
- Performance considerations: It would not be feasible to transform data longer than the analysis itself is executed. Another aspect of performance is the impact on source systems. That's why it's very important to be aware of data transformation performance.