
A day in the life of a data scientist

This will probably come as a shock to some of you: being a data scientist is more than reading academic papers, researching new tools, and building models until the wee hours of the morning, fueled by espresso. In fact, these activities make up only a small percentage of the time in which a data scientist gets to truly play (the espresso part, however, is 100% true for everyone)! Most of the day is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering and feature extraction tasks), and working out how best to present the findings to non-data-sciencey people. This is where the true sausage-making process takes place, and the best data scientists are the ones who relish this process because they are gaining more understanding of the requirements and benchmarks for success. In fact, we could write a whole new book describing this process from top to tail!

So, what (and who) is involved in asking questions about data? Sometimes, it is a process of saving data into a relational database and running SQL queries to find insights into the data: "For the millions of users that bought this particular product, what are the top 3 OTHER products also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects, and it is also where we see a proliferation of emerging technologies that look to make this Q and A system easier, with more functionality.
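To make the first, simpler kind of question concrete, here is a minimal sketch of the "top 3 other products" query using Python's built-in sqlite3 module. The `purchases` table, its `user_id` and `product` columns, the `top_co_purchased` helper, and the sample rows are all hypothetical and chosen only for illustration; a real schema would almost certainly look different.

```python
import sqlite3


def top_co_purchased(conn, product, k=3):
    """Return the k products most often bought by users who also bought `product`.

    Assumes a hypothetical table purchases(user_id, product).
    """
    query = """
        SELECT other.product, COUNT(DISTINCT other.user_id) AS buyers
        FROM purchases AS target
        JOIN purchases AS other
          ON other.user_id = target.user_id
         AND other.product <> target.product
        WHERE target.product = ?
        GROUP BY other.product
        ORDER BY buyers DESC, other.product ASC
        LIMIT ?
    """
    return conn.execute(query, (product, k)).fetchall()


# Tiny in-memory dataset, invented purely for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, product TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [
        (1, "laptop"), (1, "mouse"), (1, "bag"),
        (2, "laptop"), (2, "mouse"),
        (3, "laptop"), (3, "mouse"), (3, "dock"),
        (4, "laptop"), (4, "bag"),
        (5, "phone"), (5, "case"),
    ],
)

result = top_co_purchased(conn, "laptop")
print(result)  # [('mouse', 3), ('bag', 2), ('dock', 1)]
```

The self-join pairs every purchase of the target product with the other purchases made by the same user, and the grouped count ranks those other products by the number of distinct buyers. The second kind of question (is this review positive or negative?) has no such one-query answer, which is why it needs the modeling techniques this book focuses on.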

Some of the most popular open source frameworks that look to help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it's worth stopping and pointing out a clear distinction between big and small data. Our general rule of thumb in the office goes as follows:

If you can open your dataset using Excel, you are working with small data.
