
Engineering data versus model variety

Having a wide choice of algorithms for your predictions is always a good thing, but at the end of the day, domain knowledge and the ability to extract meaningful features from clean data are often what win the game.

Kaggle is a well-known platform for predictive analytics competitions, where the best data scientists across the world compete to make predictions on complex datasets. In these competitions, gaining a few decimal places on your prediction score makes the difference between earning the prize and being just an extra line on the public leaderboard among thousands of other competitors. One thing Kagglers quickly learn is that choosing and tuning the model is only half the battle. Feature extraction, or how to extract relevant predictors from the dataset, is often the key to winning the competition.

In real life, when working on business-related problems, the quality of the data processing phase and the ability to extract meaningful signal out of raw data are the most important and time-consuming parts of building an efficient predictive model. It is well known that "data preparation accounts for about 80% of the work of data scientists" (http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/). Model selection and algorithm optimization remain an important part of the work but are often not the deciding factor where implementation is concerned.

A solid and robust implementation that is easy to maintain and connects seamlessly to your ecosystem is often preferred to an overly complex model developed and coded in-house, especially when the scripted model produces only small gains over a service-based implementation.
