
Data versioning

As mentioned, machine learning models can produce very different results depending on the training data you use, the choice of parameters, and the input data. It is essential to be able to reproduce results, for reasons of collaboration, creativity, and compliance:

  • Collaboration: Despite what you see on social media, there are no data science and machine learning unicorns (that is, people with knowledge and capabilities in every area of data science and machine learning). We need our colleagues to review and improve on our work, and this is impossible if they aren't able to reproduce our model results and analyses.
  • Creativity: I don't know about you, but I have trouble remembering even what I did yesterday. We can't trust ourselves to always remember our reasoning and logic, especially when we are dealing with machine learning workflows. We need to track exactly what data we are using, what results we created, and how we created them. This is the only way we will be able to continually improve our models and techniques.
  • Compliance: Finally, very soon we may not have a choice regarding data versioning and reproducibility in machine learning. Laws are being passed around the world (for example, the General Data Protection Regulation (GDPR) in the European Union) that give users a right to an explanation for algorithmically made decisions. We simply cannot hope to comply with these regulations if we don't have a robust way of tracking what data we are processing and what results we are producing.

There are multiple open source data versioning projects. Some focus on security and peer-to-peer distributed storage of data; others focus on data science workflows. In this book, we will use Pachyderm (http://pachyderm.io/), an open source framework for data versioning and data pipelining. Some of the reasons for this choice will become clear later in the book, when we talk about production deployments and managing ML pipelines. For now, I will just summarize the features of Pachyderm that make it an attractive choice for data versioning in Go-based (and other) ML projects:

  • A convenient Go client, github.com/pachyderm/pachyderm/src/client
  • The ability to version any type and format of data
  • A flexible object store backing for the versioned data
  • Integration with a data pipelining system for driving versioned ML workflows