
  • Machine Learning in Java
  • AshishSingh Bhatia, Bostjan Kaluza

Data reduction

Data reduction deals with an abundance of attributes and instances. The number of attributes corresponds to the number of dimensions in our dataset. Dimensions with low predictive power contribute very little to the overall model and may even harm it. For instance, an attribute with random values can introduce random patterns that a machine learning algorithm will pick up. Data may also contain a large number of missing values; in that case, we first have to find out why so many values are missing and, on that basis, either fill them with an alternate value, impute them, or remove the attribute altogether. If 40% or more of an attribute's values are missing, it may be advisable to remove the attribute, as keeping it will hurt model performance.
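As an illustration, the following is a minimal sketch of dropping attributes with too many missing values, assuming the Weka library is available; the dataset file name and the 40% threshold are placeholders for this example.

```java
import java.util.ArrayList;
import java.util.List;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropSparseAttributes {
    public static void main(String[] args) throws Exception {
        // Load a dataset; the file name is a placeholder.
        Instances data = DataSource.read("dataset.arff");

        // Collect indices of attributes with 40% or more missing values.
        List<Integer> toRemove = new ArrayList<>();
        for (int i = 0; i < data.numAttributes(); i++) {
            double missingRatio =
                (double) data.attributeStats(i).missingCount / data.numInstances();
            if (missingRatio >= 0.4) {
                toRemove.add(i);
            }
        }

        // Drop the sparse attributes with Weka's Remove filter.
        Remove remove = new Remove();
        remove.setAttributeIndicesArray(
            toRemove.stream().mapToInt(Integer::intValue).toArray());
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        System.out.println("Attributes before: " + data.numAttributes()
            + ", after: " + reduced.numAttributes());
    }
}
```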

Another factor is variance: a near-constant attribute has low variance, which means its values are very close to each other and there is very little variation in the data, so the attribute carries little information for the model.
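For numeric attributes, a quick variance check can flag such near-constant columns. The following minimal sketch assumes Weka; the file name and the variance cut-off are illustrative, and Weka's RemoveUseless filter could play a similar role for truly constant attributes.

```java
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LowVarianceCheck {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // placeholder file name
        double threshold = 1e-6; // illustrative cut-off for "almost constant"

        // Report numeric attributes whose variance is below the cut-off.
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute att = data.attribute(i);
            if (att.isNumeric()) {
                double variance = data.variance(i);
                if (variance < threshold) {
                    System.out.println("Near-constant attribute: " + att.name()
                        + " (variance = " + variance + ")");
                }
            }
        }
    }
}
```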

To deal with this problem, the first set of techniques removes such attributes and selects only the most promising ones. This process is known as feature selection, or attribute selection, and includes methods such as ReliefF, information gain, and the Gini index. These methods are mainly focused on discrete attributes.
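As a hedged example of attribute selection, the following sketch ranks attributes by information gain using Weka's InfoGainAttributeEval evaluator and Ranker search; the dataset path, the class index, and the choice to keep five attributes are assumptions for this example.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // placeholder
        data.setClassIndex(data.numAttributes() - 1);     // assume the last attribute is the class

        // Rank attributes by information gain with respect to the class.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5); // keep the 5 most informative attributes (illustrative)
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());
    }
}
```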

Another set of tools, focused on continuous attributes, transforms the dataset from its original dimensions into a lower-dimensional space. For example, if we have a set of points in three-dimensional space, we can project them onto a two-dimensional plane. Some information is lost, but if the third dimension is irrelevant, we don't lose much, as the data structure and relationships are almost perfectly preserved. This can be performed with the following methods (a PCA sketch follows the list):

  • Singular value decomposition (SVD)
  • Principal component analysis (PCA)
  • Backward/forward feature elimination
  • Factor analysis
  • Linear discriminant analysis (LDA)
  • Neural network autoencoders
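As an illustration of such a projection, the following is a minimal PCA sketch using Weka's PrincipalComponents evaluator; the dataset path, class index, and the 95% variance-covered threshold are assumptions for this example.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PcaProjection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // placeholder
        data.setClassIndex(data.numAttributes() - 1);     // assume the last attribute is the class

        // Keep enough principal components to cover 95% of the variance.
        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(pca);
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        // Transform the data into the lower-dimensional component space.
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Dimensions before: " + data.numAttributes()
            + ", after: " + reduced.numAttributes());
    }
}
```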

The second problem in data reduction is having too many instances; for example, they may be duplicates or may come from a very frequent data stream. The main idea is to select a subset of instances in such a way that the distribution of the selected data still resembles the original data distribution and, more importantly, the observed process. Techniques to reduce the number of instances include random data sampling, stratification, and others.
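As an example of instance reduction, the following minimal sketch draws a class-preserving sample with Weka's supervised Resample filter; the dataset path, the 10% sample size, and sampling without replacement are assumptions for this example.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class InstanceSampling {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // placeholder
        data.setClassIndex(data.numAttributes() - 1);     // required for class-aware sampling

        // Draw a 10% sample without replacement, preserving the class distribution.
        Resample resample = new Resample();
        resample.setSampleSizePercent(10.0);
        resample.setNoReplacement(true);
        resample.setInputFormat(data);

        Instances sample = Filter.useFilter(data, resample);
        System.out.println("Instances before: " + data.numInstances()
            + ", after: " + sample.numInstances());
    }
}
```

Because the filter is class-aware, the class proportions in the sample stay close to those in the original data. Once the data is prepared, we can start with the data analysis and modeling.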
