Understanding the basics of data and machine learning

When we talk about data, we are generally dealing with tabular data, that is, data organized into rows and columns. Think of it as anything that could be opened in a spreadsheet application such as Microsoft Excel. Each row of data, otherwise known as an observation, represents a single instance or example of a problem. If our data belongs to the domain of day-trading in the stock market, an observation might represent an hour’s worth of changes in the overall market and price.

Similarly, in the domain of network security, an observation could represent a possible attack or a packet of data sent over a wireless system.

The following shows sample tabular data in the domain of cyber security and more specifically, network intrusion:

We see that each row, or observation, consists of a network connection, and that each observation has four attributes: DateTime, Protocol, Urgent, and Malicious. While we will not dive into these specific attributes, we should notice the structure of the data given to us in a tabular format.
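To make the row-and-column structure concrete, here is a minimal sketch using only Python's standard library. Only the four column names come from the text; every value below is invented for illustration and is not the original sample data.

```python
# Hypothetical rows: only the four column names come from the text,
# the values are made up purely to show the tabular structure.
observations = [
    {"DateTime": "2018-01-01 09:00:00", "Protocol": "TCP", "Urgent": False, "Malicious": False},
    {"DateTime": "2018-01-01 09:00:05", "Protocol": "HTTP", "Urgent": False, "Malicious": True},
    {"DateTime": "2018-01-01 09:00:09", "Protocol": "UDP", "Urgent": True, "Malicious": False},
]

# The columns (attributes) are the keys shared by every row.
columns = list(observations[0].keys())

# The shape of the table: number of observations by number of attributes.
shape = (len(observations), len(columns))

# Reading down one column gives each observation's value for that attribute.
protocols = [row["Protocol"] for row in observations]

print(columns)
print(shape)
print(protocols)
```

Each dictionary plays the role of one row (one network connection), and each key plays the role of one column.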

Because we will, for the most part, consider our data to be tabular, we can also look at specific instances where the matrix of data has only one column/attribute. For example, suppose we are building a piece of software that takes in a single image of a room and outputs whether or not there is a human in that room. The input data might be represented as a matrix with a single column, where that column is simply a URL to a photo of a room and nothing else.

For example, consider the following table of data that has only a single column, titled Photo URL. The values of the table are URLs of photos that are relevant to the data scientist (these are fake, do not lead anywhere, and are purely for example):

The data that is input into the system might only be a single column, as in this case. If we are creating a system that can analyze images, the input might simply be a URL to the image in question. It would be up to us as data scientists to engineer features from the URL.
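As a rough illustration of what engineering features from a single URL column could look like, the sketch below turns each URL string into a small row of attributes. The URLs, the `url_features` helper, and the choice of features are all assumptions made for this example; they are not taken from the original table.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical URLs, invented for illustration; they do not lead anywhere.
photo_urls = [
    "http://example.com/photos/room_001.jpg",
    "https://example.org/images/kitchen.PNG",
]


def url_features(url):
    """Derive a few simple attributes from the URL string alone."""
    parsed = urlparse(url)
    filename = posixpath.basename(parsed.path)
    # splitext returns (name, ".ext"); normalise to a lowercase "ext".
    extension = posixpath.splitext(filename)[1].lower().lstrip(".")
    return {
        "host": parsed.netloc,
        "is_https": parsed.scheme == "https",
        "file_extension": extension,
        "url_length": len(url),
    }


# One single-column input becomes a table with several columns.
feature_table = [url_features(url) for url in photo_urls]

for row in feature_table:
    print(row)
```

The point is only that one narrow column can be widened into several. Which attributes are actually useful depends on the task.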

As data scientists, we must be ready to ingest and handle data that might be large, small, wide, or narrow (in terms of attributes), or sparse in completion (there might be missing values), and be ready to utilize this data for the purposes of machine learning. Now’s a good time to talk more about that. Machine learning algorithms belong to a class of algorithms that are defined by their ability to extract and exploit patterns in data to accomplish a task based on historical training data. Vague, right? Machine learning can handle many types of tasks, and therefore we will leave the definition of machine learning as is and dive a bit deeper.
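Before any modeling, it can help to measure how large, wide, and sparse a dataset is. The following is a minimal sketch of such a check, assuming missing values are represented as `None`. The rows and the `profile` helper are hypothetical and exist only for this example.

```python
# Hypothetical dataset; None marks a missing value.
rows = [
    {"Protocol": "TCP", "Urgent": False, "Malicious": False},
    {"Protocol": None, "Urgent": False, "Malicious": True},
    {"Protocol": "UDP", "Urgent": None, "Malicious": False},
    {"Protocol": None, "Urgent": True, "Malicious": True},
]


def profile(rows):
    """Report row count, column count, and the fraction missing per column."""
    columns = list(rows[0].keys())
    n_rows = len(rows)
    missing_fraction = {
        col: sum(1 for row in rows if row.get(col) is None) / n_rows
        for col in columns
    }
    return {
        "n_rows": n_rows,
        "n_columns": len(columns),
        "missing_fraction": missing_fraction,
    }


summary = profile(rows)
print(summary)
```

Here half of the Protocol values and a quarter of the Urgent values are missing, which is the kind of sparsity that later processing has to account for.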

We generally separate machine learning into two main types: supervised and unsupervised learning. Each type of machine learning algorithm can benefit from feature engineering, and therefore it is important that we understand each type.
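In terms of the data itself, one common way to see the difference is whether a column is set aside as the label to be predicted. The sketch below assumes the Malicious column plays that role; the rows are hypothetical and no learning algorithm is run.

```python
# Hypothetical rows, reusing the column names from the network example.
rows = [
    {"Protocol": "TCP", "Urgent": False, "Malicious": False},
    {"Protocol": "HTTP", "Urgent": False, "Malicious": True},
    {"Protocol": "UDP", "Urgent": True, "Malicious": False},
]

# Assumption for this sketch: Malicious is the value we want to predict.
label_column = "Malicious"

# Supervised setup: features and a known label for every observation.
features = [
    {key: value for key, value in row.items() if key != label_column}
    for row in rows
]
labels = [row[label_column] for row in rows]

# Unsupervised setup: the algorithm would see only the features, no labels.
unlabeled = features

print(features)
print(labels)
```

In both setups the feature columns are what feature engineering works on. The label, when there is one, is kept separate.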
