
Book title: The Data Science Workshop
Authors: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
Chapter word count: 1123
Last updated: 2021-06-11 18:27:18

Application of Data Science

As mentioned in the introduction, data science is a multidisciplinary approach to analyzing and identifying complex patterns and extracting valuable insights from data. Running a data science project usually involves multiple steps, including the following:

Defining the business problem to be solved
Collecting or extracting existing data
Analyzing, visualizing, and preparing data
Training a model to spot patterns in data and make predictions
Assessing a model's performance and making improvements
Communicating and presenting findings and gained insights
Deploying and maintaining a model

As the name implies, data science projects require data, but it is even more important to define a clear business problem first. If the problem is not framed correctly, the project may produce incorrect results because you used the wrong information, did not prepare the data properly, or led a model to learn the wrong patterns. So, it is critical to properly define the scope and objective of a data science project with your stakeholders.

There are many applications of data science in real-world situations and business environments. For example, healthcare providers may train a model to predict a medical outcome or its severity based on medical measurements, or a high school may want to predict which students are at risk of dropping out within a year based on their historical grades and past behaviors. Corporations may be interested in knowing the likelihood of a customer buying a certain product based on their past purchases. They may also need to better understand which customers are more likely to stop using existing services and churn. These are examples where data science can be used to achieve a clearly defined goal, such as increasing the number of patients whose heart condition is detected at an early stage or reducing the number of customers canceling their subscriptions after six months. That sounds exciting, right? Soon enough, you will be working on such interesting projects.

What Is Machine Learning?

When we mention data science, we usually think about machine learning, and some people may not understand the difference between them. Machine learning is the field of building algorithms that can learn patterns by themselves without being programmed explicitly. So machine learning is a family of techniques that can be used at the modeling stage of a data science project.

Machine learning is composed of three different types of learning:

Supervised learning
Unsupervised learning
Reinforcement learning

Supervised Learning

Supervised learning refers to a type of task where an algorithm is trained to learn patterns based on prior knowledge. This kind of learning requires that the outcome you want to predict (also called the response variable, dependent variable, or target variable) be labeled beforehand. For instance, if you want to train a model that will predict whether a customer will cancel their subscription, you will need a dataset with a column (or variable) that already contains the churn outcome (cancel or not cancel) for past or existing customers. This outcome has to be labeled by someone before the model is trained. If this dataset contains 5,000 observations, then all of them need to have the outcome populated. The objective of the model is to learn the relationship between this outcome column and the other features (also called independent variables or predictor variables). The following is an example of such a dataset:

Figure 1.1: Example of customer churn dataset

The Cancel column is the response variable. This is the column you are interested in, and you want the model to predict accurately the outcome for new input data (in this case, new customers). All the other columns are the predictor variables.
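To make the split between the response variable and the predictor variables concrete, here is a minimal Python sketch. The rows and the column names (`months_subscribed`, `avg_monthly_spend`, `cancel`) are invented for illustration and are not the values shown in Figure 1.1.

```python
# Hypothetical customer records; the values are made up for illustration only.
customers = [
    {"months_subscribed": 3, "avg_monthly_spend": 20.0, "cancel": "no"},
    {"months_subscribed": 14, "avg_monthly_spend": 65.0, "cancel": "yes"},
    {"months_subscribed": 24, "avg_monthly_spend": 30.0, "cancel": "no"},
    {"months_subscribed": 18, "avg_monthly_spend": 90.0, "cancel": "yes"},
]

TARGET = "cancel"  # the response variable the model should learn to predict

# Predictor variables: every column except the target.
features = [
    {name: value for name, value in row.items() if name != TARGET}
    for row in customers
]

# Response variable: one label per observation, known before training.
labels = [row[TARGET] for row in customers]

print(features[1])
print(labels)
```

Every observation contributes one set of predictor values and exactly one label, which is what makes the dataset usable for supervised learning.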

Once trained, the model may find a pattern such as the following: a customer is more likely to cancel their subscription if they have been subscribed for more than 12 months and their average monthly spend is over $50. So, if a new customer has been subscribed for 15 months and is spending $85 per month, the model will predict that this customer will cancel their contract in the future.
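The pattern described above can be written out by hand as a small Python function. This is only a stand-in for what a trained model might learn: the function name `predict_cancel` is hypothetical, the thresholds are the ones quoted in the text, and a real model would estimate them from labeled data rather than have them hard-coded.

```python
def predict_cancel(months_subscribed: int, avg_monthly_spend: float) -> bool:
    """Hand-written stand-in for a pattern a trained model might learn.

    The thresholds (12 months, $50) come from the example in the text;
    a real model would estimate them from labeled data.
    """
    return months_subscribed > 12 and avg_monthly_spend > 50


# The new customer from the text: 15 months of subscription, $85 per month.
print(predict_cancel(15, 85.0))  # True: both conditions hold

# A hypothetical customer who spends a lot but subscribed only recently.
print(predict_cancel(6, 85.0))   # False: subscribed for 12 months or less
```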

When the response variable contains a limited number of possible values (or classes), it is a classification problem (you will learn more about this in Chapter 3, Binary Classification, and Chapter 4, Multiclass Classification with RandomForest). The model will learn how to predict the right class given the values of the independent variables. The churn example we just mentioned is a classification problem as the response variable can only take two different values: yes or no.

On the other hand, if the response variable can have a value from an infinite number of possibilities, it is called a regression problem.

An example of a regression problem is where you are trying to predict the exact number of mobile phones produced every day for some manufacturing plants. This value can potentially range from 0 to an infinite number (or a number big enough to have a large range of potential values), as shown in Figure 1.2.

Figure 1.2: Example of a mobile phone production dataset

In the preceding figure, you can see that Daily output ranges from 15,000 to more than 50,000. This is a regression problem, which we will look at in Chapter 2, Regression.
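As a rough illustration of regression, the following Python sketch fits a straight line by ordinary least squares and uses it to predict a numeric output. The predictor `workers_on_shift` and all the numbers are assumptions made up for this example (they are not taken from Figure 1.2), and the data is deliberately perfectly linear so the result is easy to check by hand.

```python
# Hypothetical data: workers on shift versus phones produced that day.
workers_on_shift = [100, 200, 300, 400]
daily_output = [15000, 27000, 39000, 51000]

n = len(workers_on_shift)
mean_x = sum(workers_on_shift) / n
mean_y = sum(daily_output) / n

# Ordinary least squares for one predictor:
# slope = covariance(x, y) / variance(x), intercept = mean_y - slope * mean_x
covariance = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(workers_on_shift, daily_output)
)
variance = sum((x - mean_x) ** 2 for x in workers_on_shift)
slope = covariance / variance
intercept = mean_y - slope * mean_x


def predict_output(workers: float) -> float:
    """Predict the daily output for a given number of workers."""
    return intercept + slope * workers


print(slope, intercept)     # 120.0 3000.0
print(predict_output(250))  # 33000.0
```

Unlike the churn example, the prediction here is a number on a continuous scale rather than one of a few classes.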

Unsupervised Learning

Unsupervised learning is a type of algorithm that doesn't require any response variables at all. In this case, the model will learn patterns from the data by itself. You may ask what kind of pattern it can find if there is no target specified beforehand.

This type of algorithm can usually detect similarities between variables or records, and it tries to group together those that are very close to each other. It can be used for clustering (grouping records) or dimensionality reduction (reducing the number of variables). Clustering is very popular for customer segmentation, where the algorithm looks for groups of customers with similar behaviors in the data. Chapter 5, Performing Your First Cluster Analysis, will walk you through an example of clustering analysis.
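To give a feel for how clustering groups similar records without any labels, here is a minimal Python sketch of k-means on one-dimensional data. k-means is one common clustering technique, chosen here only as an illustration; the function `kmeans_1d`, its even-spread initialization, and the spending figures are all assumptions for this example.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Very small k-means for one-dimensional data (illustrative only).

    Requires k >= 2 because the starting centroids are spread over the range.
    """
    # Deterministic start: spread the initial centroids evenly over the range.
    lo, hi = min(values), max(values)
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c]
            )
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical average monthly spend for six customers.
monthly_spend = [10, 12, 11, 80, 85, 90]
labels, centroids = kmeans_1d(monthly_spend, k=2)
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # [11.0, 85.0]
```

No response variable is supplied anywhere: the two segments (low spenders and high spenders) emerge only from how close the values are to each other.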

Reinforcement Learning

Reinforcement learning is another type of algorithm that learns how to act in a specific environment based on the feedback it receives. You may have seen videos where algorithms are trained to play Atari games by themselves. Reinforcement learning techniques are used to teach the agent how to act in the game based on the rewards or penalties it receives from the game.

For instance, in the game Pong, the agent will learn not to let the ball drop after multiple rounds of training in which it receives a high penalty every time the ball drops.
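The idea of learning from rewards and penalties can be sketched with a toy example in Python. This is not Pong and not the method used by Atari-playing systems: it is a one-step game invented for illustration (strictly speaking a contextual bandit rather than a full sequential reinforcement learning problem), and the names, reward values, learning rate, and exploration rate are all assumptions.

```python
import random

N_POSITIONS = 3   # where the ball can land, and where the paddle can go
ALPHA = 0.5       # learning rate for the value update
EPSILON = 0.1     # probability of trying a random action

rng = random.Random(0)  # fixed seed so the run is repeatable

# value[ball][action]: the agent's running estimate of the reward for moving
# the paddle to `action` when the ball is heading to `ball`.
value = [[0.0] * N_POSITIONS for _ in range(N_POSITIONS)]


def reward(ball: int, action: int) -> float:
    """+1 when the paddle meets the ball, -1 (a penalty) when the ball drops."""
    return 1.0 if action == ball else -1.0


def greedy_action(ball: int) -> int:
    """Pick the action with the highest estimated value for this ball position."""
    return max(range(N_POSITIONS), key=lambda a: value[ball][a])


for round_number in range(300):
    ball = round_number % N_POSITIONS
    if rng.random() < EPSILON:
        action = rng.randrange(N_POSITIONS)  # explore
    else:
        action = greedy_action(ball)         # exploit what was learned so far
    # Nudge the estimate toward the reward that was actually received.
    value[ball][action] += ALPHA * (reward(ball, action) - value[ball][action])

policy = [greedy_action(ball) for ball in range(N_POSITIONS)]
print(policy)
```

After training, the learned policy should move the paddle to wherever the ball is heading, because every missed ball pushed the estimate for that action below the estimate for the correct one.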

Note

Reinforcement learning algorithms are out of scope and will not be covered in this book.
