
Batch versus real time

In the previous sections, we outlined the common batch processing approach, where the model is periodically retrained on all of the data or a subset of it. Because this pipeline takes some time to complete, it might not be possible to use this approach to update models immediately as new data arrives.

While this book mostly covers batch machine learning approaches, there is a class of machine learning algorithms known as online learning algorithms. They update the model immediately as new data is fed into it, thus enabling a real-time system. A common example is an online optimization algorithm for a linear model, such as stochastic gradient descent, which can adjust the model's weights one example at a time. These methods have two advantages: the system can react very quickly to new information, and it can adapt to changes in the underlying behavior (that is, when the characteristics and distribution of the input data change over time, which is almost always the case in real-world situations).
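As an illustration of the online update described above, the following is a minimal sketch of stochastic gradient descent for a linear model that processes one example at a time. It is plain Python rather than Spark code, and the class name `OnlineLinearRegression`, the `simulate_stream` helper, the learning rate, and the synthetic data are all assumptions made for this example.

```python
import random


class OnlineLinearRegression:
    """Linear model y ~ w.x + b, trained one example at a time with SGD.

    Hypothetical illustration only; not an API from Spark or any library.
    """

    def __init__(self, n_features, learning_rate=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        """Apply a single SGD step on the squared error of one example."""
        error = self.predict(x) - y
        # Gradient of 0.5 * error^2 is error * x for the weights
        # and error for the bias.
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error
        return error


def simulate_stream(n, seed=0):
    """Yield (features, label) pairs from an assumed noise-free linear rule."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2.0 * x[0] - 3.0 * x[1] + 0.5


model = OnlineLinearRegression(n_features=2)
for features, label in simulate_stream(5000):
    # The model is usable for prediction after every single update.
    model.update(features, label)
```

Because each update touches only the current example, the model never needs to revisit earlier data, which is what lets it react as soon as a new record arrives.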

However, online learning models come with their own challenges in a production context. For example, it might be difficult to ingest and transform data in real time. It can also be complex to perform model selection properly in a purely online setting. The combined latency of online training, model selection, and deployment might be too high for true real-time requirements (for example, in online advertising, latency requirements are measured in single-digit milliseconds). Finally, batch-oriented frameworks might make it awkward to handle real-time processes of a streaming nature.

Fortunately, Spark's real-time stream processing is a good potential fit for real-time machine learning workflows. We will explore Spark Streaming and online learning in Chapter 11, Real-time Machine Learning with Spark Streaming.

Due to the complexities inherent in a true real-time machine learning system, many systems in practice target near real-time operation. This is essentially a hybrid approach where models are not necessarily updated immediately as new data arrives; instead, the new data is collected into mini batches, each a small set of training examples. These mini batches can be fed to an online learning algorithm. In many cases, this approach is combined with a periodic batch process that might recompute the model on the entire dataset and perform more complex processing and model selection. This can help ensure that the real-time model does not degrade over time.
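The hybrid approach described above can be sketched as follows. This is a simplified, single-process illustration in plain Python, not a Spark implementation: the class `NearRealTimeModel`, its parameters (`batch_size`, `refit_every`, `learning_rate`), and the synthetic stream are all hypothetical. Incoming records are buffered into mini batches that drive a gradient step, and a full least-squares refit over all stored data stands in for the periodic batch job, which in a real system would more likely run as a separate scheduled process.

```python
import random


class NearRealTimeModel:
    """One-feature linear model: mini-batch updates plus periodic full refit.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, batch_size=10, refit_every=100, learning_rate=0.1):
        self.slope = 0.0
        self.intercept = 0.0
        self.batch_size = batch_size
        self.refit_every = refit_every
        self.lr = learning_rate
        self.buffer = []    # records waiting for the next mini batch
        self.history = []   # all records, used by the batch refit
        self.refits = 0

    def predict(self, x):
        return self.slope * x + self.intercept

    def _minibatch_step(self, batch):
        """One gradient step on the mean squared error of a mini batch."""
        n = len(batch)
        errors = [(self.predict(x) - y, x) for x, y in batch]
        grad_slope = sum(e * x for e, x in errors) / n
        grad_intercept = sum(e for e, _ in errors) / n
        self.slope -= self.lr * grad_slope
        self.intercept -= self.lr * grad_intercept

    def _batch_refit(self):
        """Recompute the model on all stored data (closed-form least squares)."""
        n = len(self.history)
        mean_x = sum(x for x, _ in self.history) / n
        mean_y = sum(y for _, y in self.history) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.history)
        if var_x == 0:
            return  # not enough variation in x to fit a slope
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in self.history)
        self.slope = cov_xy / var_x
        self.intercept = mean_y - self.slope * mean_x
        self.refits += 1

    def observe(self, x, y):
        """Accept one new record; update only when a mini batch is full."""
        self.buffer.append((x, y))
        self.history.append((x, y))
        if len(self.buffer) >= self.batch_size:
            self._minibatch_step(self.buffer)
            self.buffer = []
        if len(self.history) % self.refit_every == 0:
            self._batch_refit()


rng = random.Random(42)
model = NearRealTimeModel()
for _ in range(1000):
    x = rng.uniform(-1, 1)
    model.observe(x, 1.5 * x - 0.7)  # assumed noise-free target for the demo
```

Between refits the model is only as accurate as the mini-batch steps have made it; each refit replaces those approximate weights with the solution computed on the full history.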

A similar approach involves making approximate updates to a more complex model as new data arrives, while periodically recomputing the entire model in a batch process. In this way, the model can learn from new data with a short delay (usually measured in seconds or, perhaps, a few minutes), but it becomes progressively less accurate over time because of the approximations applied. The periodic recomputation corrects this by retraining the model on all available data.
