
Building datasets

Data scientists often need hundreds of thousands of data points to build, train, and test machine learning models. In some cases, this data is already pre-packaged and ready for consumption, but most of the time the scientist needs to venture out and build a custom dataset. This is often done by building a web scraper that collects raw data from various sources of interest and refines it so it can be processed later on. These web scrapers also need to collect fresh data periodically, so the predictive models stay updated with the most relevant information.

A common use case that data scientists run into is determining how people feel about a specific subject, a process known as sentiment analysis. Through this process, a company could look for discussions surrounding one of its products, or its overall presence, and gather a general consensus. To do this, the model must be trained on what positive and negative comments look like, which can take thousands of individual comments to make a well-balanced training set. Building a web scraper to collect comments from relevant forums, reviews, and social media sites would be helpful in constructing such a dataset.
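A sentiment training set like the one described above boils down to scraped text paired with a label. The sketch below shows one way to represent and serialize such a dataset; the `labeledComment` type, the two sample comments, and the CSV layout are all illustrative assumptions, not a prescribed format.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// labeledComment pairs a scraped piece of text with a sentiment label.
// A real dataset would hold thousands of entries collected from
// forums, reviews, and social media sites.
type labeledComment struct {
	Text  string
	Label string // "positive" or "negative"
}

// toCSV renders the dataset as CSV for later processing, using the
// standard library's encoding/csv to handle quoting correctly.
func toCSV(comments []labeledComment) string {
	var b strings.Builder
	w := csv.NewWriter(&b)
	w.Write([]string{"text", "label"})
	for _, c := range comments {
		w.Write([]string{c.Text, c.Label})
	}
	w.Flush()
	return b.String()
}

func main() {
	// Two invented placeholder comments, one per class.
	dataset := []labeledComment{
		{"Great product, works as advertised", "positive"},
		{"Broke after two days", "negative"},
	}
	fmt.Print(toCSV(dataset))
}
```

Keeping the labels balanced across classes, as the text notes, is what makes the resulting training set useful.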

These are just a few examples of web scrapers that drive large businesses such as Google, Mozenda, and Cheapflights.com. There are also companies that will scrape the web for whatever data you need, for a fee. To run scrapers at such a large scale, you need a language that is fast, scalable, and easy to maintain.
