官术网_书友最值得收藏!

Application of Data Science

As mentioned in the introduction, data science is a multidisciplinary approach to analyzing and identifying complex patterns and extracting valuable insights from data. Running a data science project usually involves multiple steps, including the following:

  1. Defining the business problem to be solved
  2. Collecting or extracting existing data
  3. Analyzing, visualizing, and preparing data
  4. Training a model to spot patterns in data and make predictions
  5. Assessing a model's performance and making improvements
  6. Communicating and presenting findings and gained insights
  7. Deploying and maintaining a model

As its name implies, data science projects require data, but it is actually more important to have defined a clear business problem to solve first. If it's not framed correctly, a project may lead to incorrect results as you may have used the wrong information, not prepared the data properly, or led a model to learn the wrong patterns. So, it is absolutely critical to properly define the scope and objective of a data science project with your stakeholders.

There are a lot of data science applications in real-world situations or in business environments. For example, healthcare providers may train a model for predicting a medical outcome or its severity based on medical measurements, or a high school may want to predict which students are at risk of dropping out within a year's time based on their historical grades and past behaviors. Corporations may be interested to know the likelihood of a customer buying a certain product based on his or her past purchases. They may also need to better understand which customers are more likely to stop using existing services and churn. These are examples where data science can be used to achieve a clearly defined goal, such as increasing the number of patients detected with a heart condition at an early stage or reducing the number of customers canceling their subscriptions after six months. That sounds exciting, right? Soon enough, you will be working on such interesting projects.

What Is Machine Learning?

When we mention data science, we usually think about machine learning, and some people may not understand the difference between them. Machine learning is the field of building algorithms that can learn patterns by themselves without being programmed explicitly. So machine learning is a family of techniques that can be used at the modeling stage of a data science project.

Machine learning is composed of three different types of learning:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

Supervised Learning

Supervised learning refers to a type of task where an algorithm is trained to learn patterns based on prior knowledge. That means this kind of learning requires the labeling of the outcome (also called the response variable, dependent variable, or target variable) to be predicted beforehand. For instance, if you want to train a model that will predict whether a customer will cancel their subscription, you will need a dataset with a column (or variable) that already contains the churn outcome (cancel or not cancel) for past or existing customers. This outcome has to be labeled by someone prior to the training of a model. If this dataset contains 5,000 observations, then all of them need to have the outcome being populated. The objective of the model is to learn the relationship between this outcome column and the other features (also called independent variables or predictor variables). Following is an example of such a dataset:

Figure 1.1: Example of customer churn dataset

The Cancel column is the response variable. This is the column you are interested in, and you want the model to predict accurately the outcome for new input data (in this case, new customers). All the other columns are the predictor variables.

The model, after being trained, may find the following pattern: a customer is more likely to cancel their subscription after 12 months and if their average monthly spent is over $50. So, if a new customer has gone through 15 months of subscription and is spending $85 per month, the model will predict this customer will cancel their contract in the future.

When the response variable contains a limited number of possible values (or classes), it is a classification problem (you will learn more about this in Chapter 3, Binary Classification, and Chapter 4, Multiclass Classification with RandomForest). The model will learn how to predict the right class given the values of the independent variables. The churn example we just mentioned is a classification problem as the response variable can only take two different values: yes or no.

On the other hand, if the response variable can have a value from an infinite number of possibilities, it is called a regression problem.

An example of a regression problem is where you are trying to predict the exact number of mobile phones produced every day for some manufacturing plants. This value can potentially range from 0 to an infinite number (or a number big enough to have a large range of potential values), as shown in Figure 1.2.

Figure 1.2: Example of a mobile phone production dataset

In the preceding figure, you can see that the values for Daily output can take any value from 15000 to more than 50000. This is a regression problem, which we will look at in Chapter 2, Regression.

Unsupervised Learning

Unsupervised learning is a type of algorithm that doesn't require any response variables at all. In this case, the model will learn patterns from the data by itself. You may ask what kind of pattern it can find if there is no target specified beforehand.

This type of algorithm usually can detect similarities between variables or records, so it will try to group those that are very close to each other. This kind of algorithm can be used for clustering (grouping records) or dimensionality reduction (reducing the number of variables). Clustering is very popular for performing customer segmentation, where the algorithm will look to group customers with similar behaviors together from the data. Chapter 5, Performing Your First Cluster Analysis, will walk you through an example of clustering analysis.

Reinforcement Learning

Reinforcement learning is another type of algorithm that learns how to act in a specific environment based on the feedback it receives. You may have seen some videos where algorithms are trained to play Atari games by themselves. Reinforcement learning techniques are being used to teach the agent how to act in the game based on the rewards or penalties it receives from the game.

For instance, in the game Pong, the agent will learn to not let the ball drop after multiple rounds of training in which it receives high penalties every time the ball drops.

Note

Reinforcement learning algorithms are out of scope and will not be covered in this book.

主站蜘蛛池模板: 筠连县| 抚松县| 株洲县| 宜城市| 新野县| 兴隆县| 舞阳县| 鸡东县| 夏河县| 巴南区| 张家口市| 乌兰县| 大渡口区| 襄城县| 曲阳县| 肥西县| 射洪县| 怀集县| 新郑市| 昭平县| 抚顺市| 武义县| 马尔康县| 巴马| 迁西县| 亳州市| 含山县| 平果县| 凯里市| 綦江县| 宜昌市| 信宜市| 来宾市| 巴楚县| 北海市| 温州市| 闽侯县| 嘉荫县| 乌恰县| 宜丰县| 定西市|