Train-test-splitting your data

In machine learning, our goal is to create a program that can perform tasks it has never been explicitly taught to perform. We do that by using data we have collected to train, or fit, a mathematical or statistical model. The data used to fit the model is referred to as training data. The resulting trained model is then used to make predictions on new, previously unseen data. In this way, the program is able to handle new situations without human intervention.

One of the major challenges for a machine learning practitioner is the danger of overfitting: creating a model that performs well on the training data but is not able to generalize to new, previously unseen data. To combat overfitting, practitioners set aside a portion of the data, called test data, and use it only to assess the performance of the trained model, never as part of the training dataset. This careful setting aside of test sets is key to training classifiers in cybersecurity, where overfitting is an omnipresent danger. One small oversight, such as using only benign data from one locale, can lead to a poor classifier.
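
The idea of holding out test data can be illustrated with a minimal sketch that uses only the Python standard library. The function name `split_train_test`, its parameters, and the toy data are assumptions made for illustration, not part of the original text; in practice, libraries such as scikit-learn provide a `train_test_split` utility for the same purpose.

```python
import random


def split_train_test(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle the data and hold out `test_fraction` of it as a test set.

    Hypothetical helper for illustration only.
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")

    # Shuffle indices (not the data itself) so samples and labels stay aligned.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)

    # The first n_test shuffled indices become the test set; the rest are for training.
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: 20 made-up feature vectors with alternating 0/1 labels.
samples = [[i, i % 3] for i in range(20)]
labels = [i % 2 for i in range(20)]

X_train, X_test, y_train, y_test = split_train_test(
    samples, labels, test_fraction=0.25, seed=42
)
print(len(X_train), len(X_test))  # 15 training samples, 5 test samples
```

The model would be fit on `X_train` and `y_train` only, and `X_test` and `y_test` would be used solely to measure how well it generalizes. Fixing the seed keeps the split reproducible between runs.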

There are various other ways to validate model performance, such as cross-validation. For simplicity, we will focus mainly on train-test splitting.
