SQL Server has integrated machine learning with the version of 2016, when the first R services were introduced and with the 2017 version, when Python language support was added to SQL Server, and the feature was renamed machine-learning services. Machine-learning services can be used to solve complex problems and the tasks and algorithms are chosen based on the data and the expected prediction.
There are two main flavors of machine learning:
Supervised
Unsupervised
The main difference between the two is that supervised learning is performed by having a ground truth available. This means that the predictive model is building the capabilities based on the prior knowledge of what the output values for a sample should be. Unsupervised learning, on the other hand, does not have any sort of this outputs and the goal is to find natural structures present in the datasets. This can mean grouping datapoints into clusters or finding different ways of looking at complex data so that it appears simpler or more organized.
Each type of machine learning is using different algorithms to solve the problem. Typical supervised learning algorithms would include the following:
Classification
Regression
When the data is being used to predict a category, supervised learning is also called classification. This is the case when assigning an image as a picture of either a car or a boat, for example. When there are only two options, it's called two-classclassification. When there are more categories, such as when predicting the winner of the World Cup, this problem is known as multi-class classification.
The major algorithms used here would include the following:
Logistic regression
Decision tree/forest/jungle
Support vector machine
Bayes point machine
Regression algorithms can be used for both types of variables, continuous and discrete, to predict a value, where for continuous values you'll apply Linear Regression and on discrete values, Logistic Regression.In linear regression, the relationship between the input variable (x) and output variable (y) is expressed as an equation of the form y = a + bx. Thus, the goal of linear regression is to find out the values of coefficients a and b and fit a line that is nearest to most of the points. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible. It is a fast and simple technique and a good first algorithm to try.
Logistic regression is best suited for binary classification (datasets where y = 0 or 1, where 1 denotes the default class. So, if you're predicting whether an event will occur, the instance of event occurring will be classified as 1. It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.
There are numerous kinds of regressions available, depending on the Python or R package that is used; for example, we could use any of the following:
Bayesian linear regression
Linear regression
Poisson regression
Decision forest regression
Many others
We'll focus on implementing the machine-learning algorithms and models in the following chapters with SQL Server Machine Learning Services to train and predict the values from the dataset.