- Machine Learning in Java
- AshishSingh Bhatia, Bostjan Kaluza
- 2021-06-10 19:29:57
Data transformation
Data transformation techniques bring the dataset into the format that a machine learning algorithm expects as input, and may even help the algorithm learn faster and achieve better performance. This process is also known as data munging or data wrangling. Standardization, for instance, assumes that the data follows a Gaussian distribution and transforms the values so that the mean is 0 and the standard deviation is 1, as follows:
$$x' = \frac{x - \mu}{\sigma}$$
Here, $\mu$ is the mean of the attribute and $\sigma$ is its standard deviation.
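The standardization step described above can be sketched in plain Java. This is a minimal illustration, not the API of any particular toolbox: the class name `Standardizer` and the method `standardize` are hypothetical, and the sketch assumes the population standard deviation (dividing by n), whereas some tools divide by n - 1.

```java
import java.util.Arrays;

/** Hypothetical helper: z-score standardization of one numeric attribute. */
final class Standardizer {

    /** Returns a copy of values rescaled to mean 0 and standard deviation 1. */
    static double[] standardize(double[] values) {
        if (values.length == 0) {
            return new double[0];
        }
        double mean = Arrays.stream(values).average().orElse(0.0);
        // Assumption: population variance (divide by n), not the sample variance.
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .average()
                .orElse(0.0);
        double stdDev = Math.sqrt(variance);

        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant attribute has zero deviation; map it to 0 instead of dividing by zero.
            result[i] = stdDev == 0.0 ? 0.0 : (values[i] - mean) / stdDev;
        }
        return result;
    }
}
```

Calling `Standardizer.standardize(column)` on one attribute column returns a new array and leaves the input untouched, so the original values stay available for other transformations.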
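Normalization, on the other hand, scales the values of an attribute to a small, specified range, usually between 0 and 1:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Here, $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the attribute.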
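A minimal sketch of this scaling to the range between 0 and 1 might look as follows. The class name `MinMaxNormalizer` and the method `normalize` are hypothetical names chosen for illustration, and mapping a constant attribute to 0 is an assumption of the sketch rather than a fixed convention.

```java
/** Hypothetical helper: min-max normalization of one numeric attribute. */
final class MinMaxNormalizer {

    /** Returns a copy of values rescaled linearly into the range [0, 1]. */
    static double[] normalize(double[] values) {
        if (values.length == 0) {
            return new double[0];
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;

        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant attribute has no spread; map it to 0 instead of dividing by zero.
            result[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return result;
    }
}
```

Because the minimum and maximum come from the data itself, a single outlier can compress all other values into a narrow band, which is worth checking before relying on the result.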
Many machine learning toolboxes automatically normalize and standardize the data for you.
The last transformation technique is discretization, which divides the range of a continuous attribute into intervals. Why should we care? Some algorithms, such as decision trees and Naive Bayes, prefer discrete attributes. The most common ways to select the intervals are as follows:
- Equal width: The range of the continuous variable is divided into k intervals of equal width
- Equal frequency: Supposing there are N instances, each of the k intervals contains approximately N/k instances
- Min entropy: This approach recursively splits the intervals as long as the decrease in entropy, which measures disorder, exceeds the cost introduced by the additional split (Fayyad and Irani, 1993)
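The equal width and equal frequency strategies from the list above can be sketched as follows. The class `Discretizer` and its methods are hypothetical names used only for illustration; both methods return, for each instance, the index of the interval (0 to k - 1) that it falls into.

```java
import java.util.Arrays;

/** Hypothetical helper: unsupervised discretization of one numeric attribute. */
final class Discretizer {

    /** Equal width: split [min, max] into k intervals of the same width. */
    static int[] equalWidth(double[] values, int k) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double width = (max - min) / k;

        int[] bins = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int bin = width == 0.0 ? 0 : (int) ((values[i] - min) / width);
            // The maximum value would land in bin k, so clamp it into the last bin.
            bins[i] = Math.min(bin, k - 1);
        }
        return bins;
    }

    /** Equal frequency: assign bins by rank so that each holds roughly N / k instances. */
    static int[] equalFrequency(double[] values, int k) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort instance indices by attribute value.
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        int[] bins = new int[n];
        for (int rank = 0; rank < n; rank++) {
            // The instance at this rank goes into bin floor(rank * k / n).
            bins[order[rank]] = (int) ((long) rank * k / n);
        }
        return bins;
    }
}
```

One simplification to be aware of: the rank-based equal frequency variant can place identical values in different intervals when ties straddle a boundary. A production implementation would usually move the boundary so that equal values stay together.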
The first two methods require us to specify the number of intervals, while the last method sets the number of intervals automatically; however, it requires the class variable, which means it won't work for unsupervised machine learning tasks.
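To illustrate why the entropy-based method needs the class variable, the following sketch performs a single splitting step: it picks the cut point that minimizes the weighted class entropy of the two resulting intervals. It is a deliberate simplification, not the full method of Fayyad and Irani; the recursion and the stopping criterion are omitted, and the names `EntropySplit`, `entropy`, and `bestCut` are hypothetical. Class labels are assumed to be integers from 0 to numClasses - 1.

```java
import java.util.Arrays;

/** Hypothetical helper: one step of entropy-based (supervised) discretization. */
final class EntropySplit {

    /** Shannon entropy, in bits, of a vector of class counts. */
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * Returns the cut point that minimizes the weighted class entropy of the
     * two resulting intervals, or NaN if all values are identical.
     */
    static double bestCut(double[] values, int[] labels, int numClasses) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        // Class counts on each side of the candidate cut; everything starts on the right.
        int[] left = new int[numClasses];
        int[] right = new int[numClasses];
        for (int label : labels) {
            right[label]++;
        }

        double cut = Double.NaN;
        double bestEntropy = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n - 1; i++) {
            // Move the i-th smallest instance from the right interval to the left one.
            int label = labels[order[i]];
            left[label]++;
            right[label]--;

            double lo = values[order[i]];
            double hi = values[order[i + 1]];
            if (lo == hi) {
                continue; // no cut is possible between identical values
            }
            int leftSize = i + 1;
            double weighted =
                    (leftSize * entropy(left) + (n - leftSize) * entropy(right)) / n;
            if (weighted < bestEntropy) {
                bestEntropy = weighted;
                cut = (lo + hi) / 2.0; // midpoint between neighbouring values
            }
        }
        return cut;
    }
}
```

The `labels` argument is what makes this approach supervised: without class labels there is no entropy to minimize, which is why it does not apply to unsupervised tasks.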