官术网_书友最值得收藏!

Different Types of Data Science Problems

Much of your time as a data scientist is likely to be spent wrangling data: figuring out how to get it, getting it, examining it, making sure it's correct and complete, and joining it with other types of data. pandas will facilitate this process for you. However, if you aspire to be a machine learning data scientist, you will need to master the art and science of predictive modeling. This means using a mathematical model, or idealized mathematical formulation, to learn the relationships within the data, in the hope of making accurate and useful predictions when new data comes in.

For this purpose, data is typically organized in a tabular structure, with features and a response variable. For example, if you want to predict the price of a house based on some characteristics about it, such as area and number of bedrooms, these attributes would be considered the features and the price of the house would be the response variable. The response variable is sometimes called the target variable or dependent variable, while the features may also be called the independent variables.

If you have a dataset of 1,000 houses including the values of these features and the prices of the houses, you can say you have 1,000 samples of labeled data, where the labels are the known values of the response variable: the prices of different houses. Most commonly, the tabular data structure is organized so that different rows are different samples, while features and the response occupy different columns, along with other metadata such as sample IDs, as shown in Figure 1.9.

Figure 1.9: Labeled data (the house prices are the known target variable)

Regression Problem

Once you have trained a model to learn the relationship between the features and response using your labeled data, you can then use it to make predictions for houses where you don't know the price, based on the information contained in the features. The goal of predictive modeling in this case is to be able to make a prediction that is close to the true value of the house. Since we are predicting a numerical value on a continuous scale, this is called a regression problem.

Classification Problem

On the other hand, if we were trying to make a qualitative prediction about the house, to answer a yes or no question such as "will this house go on sale within the next five years?" or "will the owner default on the mortgage?", we would be solving what is known as a classification problem. Here, we would hope to answer the yes or no question correctly. The following figure is a schematic illustrating how model training works, and what the outcomes of regression or classification models might be:

Figure 1.10: Schematic of model training and prediction for regression and classification

Classification and regression tasks are called supervised learning, which is a class of problems that relies on labeled data. These problems can be thought of a needing "supervision" by the known values of the target variable. By contrast, there is also unsupervised learning, which relates to more open-ended questions of trying to find some sort of structure in a dataset that does not necessarily have labels. Taking a broader view, any kind of applied math problem, including fields as varied as optimization, statistical inference, and time series modeling, may potentially be considered an appropriate responsibility for a data scientist.

主站蜘蛛池模板: 白银市| 尚志市| 奉节县| 洞口县| 乌拉特后旗| 林芝县| 晋宁县| 伽师县| 白朗县| 六枝特区| 元谋县| 和田县| 佛冈县| 郓城县| 乐平市| 齐齐哈尔市| 临沂市| 丰县| 沁源县| 噶尔县| 登封市| 黔南| 永州市| 崇礼县| 神木县| 双牌县| 巴林左旗| 繁峙县| 白玉县| 运城市| 潜江市| 高尔夫| 长沙市| 九江市| 涪陵区| 绍兴县| 沁源县| 即墨市| 方山县| 鹤壁市| 新丰县|