官术网_书友最值得收藏!

Introduction to scikit-learn

Since you have already picked up this book, you probably don't need me to convince you why machine learning is important. However, you may still have doubts about why to use scikit-learn in particular. You may encounter names such as TensorFlow, PyTorch, and Spark more often during your daily news consumption than scikit-learn. So, let me convince you of my preference for the latter.

It plays well with the Python data ecosystem

scikit-learn is a Python toolkit built on top of NumPy, SciPy, and Matplotlib. These choices mean that it fits well into your daily data pipeline. As a data scientist, Python is most likely your language of choice since it is good for both offline analysis and real-time implementations. You will also be using tools such as pandas to load data from your database, which allows you to perform a vast amount of transformation to your data. Since both pandas and scikit-learn are built on top of NumPy, they play very well with each other. Matplotlib is the de facto data visualization tool for Python, which means you can use its sophisticated data visualization capabilities to explore your data and unravel your model's ins and outs.

Since it is an open source tool that is heavily used in the community, it is very common to see other data tools use an almost identical interface to scikit-learn. Many of these tools are built on top of the same scientific Python libraries, and they are collectively known as SciKits (short for SciPyToolkits)—hence, the scikit prefix in scikit-learn. For example, scikit-image is a library for image processing, while categorical-encoding and imbalanced-learn are separate libraries for data preprocessing that are built as add-ons to scikit-learn.

We are going to use some of these tools in this book, and you will notice how easy it is to integrate these different tools into your workflow when using scikit-learn.

Being a key player in the Python data ecosystem is what makes scikit-learn the de facto toolset for machine learning. This is the tool that you will most likely hand your job application assignment to, as well as use for Kaggle competitions and to solve most of your professional day-to-day machine learning problems for your job.

Practical level of abstraction

scikit-learn implements a vast amount of machine learning, data processing, and model selection algorithms. These implementations are abstract enough, so you only need to apply minor changes when switching from one algorithm to another. This is a key feature since you will need to quickly iterate between different algorithms when developing a model to pick the best one for your problem. Having that said, this abstraction doesn't shield you from the algorithms' configurations. In other words, you are still in full control of your hyperparameters and settings.

When not to use scikit-learn

Most likely, the reasons to not use scikit-learn will include combinations of deep learning or scale. scikit-learn's implementation of neural networks is limited. Unlike scikit-learn, TensorFlow and PyTorch allow you to use a custom architecture, and they support GPUs for a massive training scale. All of scikit-learn's implementations run in memory on a single machine. I'd say that way more than 90% of businesses are at a scale where these constraints are fine. Data scientists can still fit their data in memory in large enough machines thanks to the cloud optionsavailable. They can cleverly engineer workarounds to deal with scaling issues, but if these limitations become something that they can no longer deal with, then they will need other tools to do the trick for them.

There are solutions being developed that allow scikit-learn to scale to multiple machines, such as Dask. Many scikit-learn algorithms allow parallel execution using joblib, which natively provides thread-based and process-based parallelism. Dask can scale these joblib-backed algorithms out to a cluster of machines by providing an alternative joblib backend.
主站蜘蛛池模板: 晋城| 遵化市| 葵青区| 金阳县| 开阳县| 双江| 连州市| 沽源县| 色达县| 富阳市| 甘谷县| 江陵县| 普洱| 枣阳市| 临高县| 韶山市| 睢宁县| 秦安县| 佛学| 新乡市| 邢台市| 武功县| 垣曲县| 桃江县| 阿合奇县| 营口市| 辽宁省| 饶阳县| 永新县| 枣强县| 宝坻区| 东明县| 九江市| 塔河县| 惠安县| 松阳县| 海伦市| 大连市| 延边| 叶城县| 凤庆县|