A day in the life of a data scientist

This will probably come as a shock to some of you: being a data scientist is more than reading academic papers, researching new tools, and building models until the wee hours of the morning, fueled by espresso. In fact, this is only a small percentage of the time in which a data scientist gets to truly play (the espresso part, however, is 100% true for everyone)! Most of the day is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering and feature extraction tasks), and working out how best to present the findings to non-data-sciencey people. This is where the true sausage-making process takes place, and the best data scientists are the ones who relish this process because they gain a deeper understanding of the requirements and benchmarks for success. In fact, we could literally write a whole new book describing this process from top to tail!

So, what (and who) is involved in asking questions about data? Sometimes, it is a process of saving data into a relational database and running SQL queries to find insights into the data: "For the millions of users who bought this particular product, what are the top 3 other products they also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects, and it is also where we see a proliferation of emerging technologies that aim to make this question-and-answer process easier and more capable.
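To make the first kind of question concrete, here is a minimal sketch of the "top 3 other products" query, using Python's built-in sqlite3 module with an in-memory database. The `purchases` table, its columns, the sample rows, and the `top_other_products` helper are all hypothetical and invented for illustration; a real schema would likely look different.

```python
import sqlite3

# Hypothetical schema: one row per (user, product) purchase.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, product TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [
        (1, "camera"), (1, "tripod"), (1, "sd_card"),
        (2, "camera"), (2, "sd_card"),
        (3, "camera"), (3, "sd_card"), (3, "bag"),
        (4, "camera"), (4, "tripod"), (4, "lens"),
        (5, "phone"), (5, "sd_card"),
    ],
)


def top_other_products(conn, product, k=3):
    """Return the k products most often bought by buyers of `product`."""
    query = """
        SELECT other.product, COUNT(DISTINCT other.user_id) AS buyers
        FROM purchases AS target
        JOIN purchases AS other
          ON other.user_id = target.user_id      -- same buyer
         AND other.product <> target.product     -- but a different product
        WHERE target.product = ?
        GROUP BY other.product
        ORDER BY buyers DESC, other.product ASC  -- ties broken alphabetically
        LIMIT ?
    """
    return conn.execute(query, (product, k)).fetchall()


result = top_other_products(conn, "camera")
print(result)
```

The self-join pairs every purchase of the target product with the same user's other purchases, and counting distinct users keeps repeat purchases from inflating the ranking. The second kind of question (is this review positive or negative?) cannot be expressed as a query like this, which is why it calls for the modeling techniques this book focuses on.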

Some of the most popular open source tools that help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it is worth pausing to draw a clear distinction between big and small data. Our general rule of thumb in the office goes as follows:

If you can open your dataset using Excel, you are working with small data.
