
Pearson correlation example

Let's return to our example of shoppers on the e-commerce store and consider how we might use the Pearson correlation coefficient to select data features. Consider the following example data, which records each shopper's purchase amount along with their time spent on site and the amount of money they had spent on previous purchases:

Of course, in a real application of this problem you may have thousands or hundreds of thousands of rows, and dozens of columns, each representing a different dimension of data.

Let's now select features for this data manually. The purchase amount column is our output data, or the data that we want our algorithm to predict given other features. In this exercise, we can choose to train the model using both time on site and previous purchase amount, time on site alone, or previous purchase amount alone.

When using a filter method for feature selection, we consider one feature at a time, so we must look at time on site's relation to purchase amount independently of past purchase amount's relation to purchase amount. One manual approach to this problem would be to chart each of our two candidate features against the purchase amount column and calculate a correlation coefficient to determine how strongly each feature is related to the purchase amount data.

First, we'll chart time on site versus purchase amount, and use our spreadsheet tool to calculate the Pearson correlation coefficient:

Even a simple visual inspection of the data hints that there is only a small relationship, if any at all, between time on site and purchase amount. Calculating the Pearson correlation coefficient yields a value of about +0.1: a very weak, essentially insignificant correlation between the two sets of data.
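If you'd rather compute the coefficient in code than in a spreadsheet, here is a minimal sketch in Python. The two arrays are placeholder values standing in for the time on site and purchase amount columns, not the actual figures from the table above:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: covariance; denominator: product of standard deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder columns: minutes on site and purchase amount in dollars.
time_on_site = [3.5, 12.0, 5.2, 8.4, 2.1, 9.7]
purchase_amount = [42.0, 55.0, 13.0, 71.0, 9.0, 38.0]

print(pearson(time_on_site, purchase_amount))
```

A result near 0 indicates no linear relationship, while values near +1 or -1 indicate a strong positive or negative linear relationship.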

However, if we chart the past purchase amount versus current purchase amount, we see a very different relationship:

In this case, our visual inspection tells us that there is a linear but somewhat noisy relationship between the past purchase amount and the current purchase amount. Calculating the correlation coefficient gives us a correlation value of +0.9, quite a strong linear relationship!

This type of analysis tells us that we can ignore the time on site data when training our model, as there appears to be little or no statistically significant relationship between it and the purchase amount. By ignoring the time on site data, we reduce the number of dimensions we need to train our model on by one, allowing the model to generalize better and improving performance.

If we had 48 other numerical dimensions to consider, we could simply calculate the correlation coefficient for each of them and discard every dimension whose correlation falls beneath some threshold. Not every feature can be analyzed using correlation coefficients, however; the Pearson correlation only applies to features where such a statistical analysis makes sense. It would not make sense, for instance, to apply it to a feature that lists the most recently browsed product category. You can, and should, use other types of feature selection filters for dimensions representing different types of data. Over time, you will develop a toolkit of analysis techniques that apply to different types of data.
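A minimal sketch of that thresholding filter, reusing the pearson function and placeholder columns from the earlier sketch; the 0.2 cutoff is an arbitrary example value:

```python
def select_features(columns, target, threshold=0.2):
    """Keep the columns whose absolute Pearson correlation with the
    target column meets the threshold."""
    return [
        name for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]

# Hypothetical candidate features mapped to their column values.
candidates = {
    "time_on_site": time_on_site,
    "past_purchase_amount": [40.0, 50.0, 12.0, 65.0, 11.0, 35.0],
}
print(select_features(candidates, purchase_amount))
```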

Unfortunately, a thorough explanation of all the possible feature extraction and feature selection algorithms and tools is not possible here; you will have to research various techniques and determine which ones fit the shape and style of your features and data.

Some algorithms to consider for filter techniques are the Pearson and Spearman correlation coefficients, the chi-squared test, and information gain measures such as the Kullback–Leibler divergence.
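SciPy provides ready-made implementations of two of these filters. A brief sketch, assuming SciPy is installed; the columns and the contingency table are placeholder values:

```python
from scipy.stats import spearmanr, chi2_contingency

# Placeholder numeric columns, as in the earlier sketches.
time_on_site = [3.5, 12.0, 5.2, 8.4, 2.1, 9.7]
purchase_amount = [42.0, 55.0, 13.0, 71.0, 9.0, 38.0]

# Spearman is Pearson applied to ranks, so it also captures
# monotonic relationships that are not strictly linear.
rho, p_value = spearmanr(time_on_site, purchase_amount)
print(rho, p_value)

# The chi-squared test suits categorical features, such as browsed
# product category (rows) versus purchased-or-not counts (columns).
table = [[20, 30], [35, 15]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```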

Approaches to consider for wrapper techniques include optimization methods such as genetic algorithms, tree-search algorithms such as best-first search, stochastic methods such as random hill climbing, and heuristics such as recursive feature elimination and simulated annealing. All of these aim to select the set of features that optimizes the output of your model, so any optimization technique can be a candidate; genetic algorithms, however, are particularly effective and popular.
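As one concrete wrapper example, scikit-learn ships recursive feature elimination. A minimal sketch with a linear model and randomly generated placeholder data:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Placeholder data: 200 samples, 10 candidate features, where only
# features 0 and 3 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)

# RFE repeatedly fits the model and drops the weakest feature until
# the requested number of features remains.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)
print(selector.support_)   # Boolean mask of the selected features
print(selector.ranking_)   # 1 = kept; larger = eliminated earlier
```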

Feature extraction has many algorithms to consider and generally focuses on the cross-correlation of features in order to derive new features that minimize some error function; that is, it asks how two or more features can be combined such that as little information as possible is lost. Relevant algorithms include PCA, partial least squares, and autoencoders. In NLP, latent semantic analysis is popular. Image processing has many specialized feature extraction algorithms, such as edge detection, corner detection, and thresholding, and further specializations based on the problem domain, such as face identification or motion detection.
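As a brief extraction example, scikit-learn's PCA projects a feature matrix onto a smaller set of new, decorrelated dimensions; again, the input matrix here is placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder matrix: 200 samples with 10 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Project the 10 original dimensions down to 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 3)
print(pca.explained_variance_ratio_)  # Variance retained per component
```

In practice, you would choose the number of components by examining how much of the original variance each component retains.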
