Janitorial work

A large part of data science work is cleanup. In production systems, the data would typically be fetched directly from a database and arrive relatively clean (high-quality production data science work requires a database of clean data). However, we're not in production mode yet; we're still in the model-building phase. It helps to imagine writing a program solely for cleaning data.

Let's look at our requirements. In our data, each column is a variable. Most of them are independent variables; the exception is the last column, which is the dependent variable. Some variables are categorical, and some are continuous. Our task is to write a function that converts the data from its current form, [][]string, to [][]float64.

To do that, every value has to be converted into a float64. For the continuous variables, this is easy: simply parse the string into a float. There are oddities that need to be handled, which I hope you spotted when you opened the file in a spreadsheet. But the main pain is in converting categorical data to float64.
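As a rough illustration of the easy case, parsing a continuous variable might look like the sketch below. The function name parseContinuous is hypothetical, and treating an empty cell or the marker "NA" as missing (and mapping it to 0) is an assumption made for this example only; the actual oddities in the file may call for different handling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseContinuous converts one raw cell of a continuous variable into a float64.
// Assumption for this sketch: an empty cell or the marker "NA" means the value
// is missing, and a missing value is represented as 0.
func parseContinuous(cell string) (float64, error) {
	cell = strings.TrimSpace(cell)
	if cell == "" || cell == "NA" {
		return 0, nil
	}
	f, err := strconv.ParseFloat(cell, 64)
	if err != nil {
		// Wrap the error so the offending cell is visible to the caller.
		return 0, fmt.Errorf("cannot parse %q as float64: %w", cell, err)
	}
	return f, nil
}
```

Calling parseContinuous(" 3.5 ") yields 3.5 with a nil error, while a cell such as "abc" yields a non-nil error, so bad cells surface during cleaning instead of silently becoming zeros.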

Fortunately for us, people much smarter than us figured this out decades ago. There exists an encoding scheme that allows categorical data to play nicely with linear regression algorithms.
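The passage above does not name the scheme. One widely used scheme of this kind is one-hot (dummy) encoding, and the sketch below assumes that is what is meant. Each distinct category becomes its own column holding 1 where the row has that category and 0 elsewhere. The function name oneHot and the choice to sort the categories alphabetically are assumptions made for illustration.

```go
package main

import "sort"

// oneHot encodes a categorical column as one 0/1 column per distinct category.
// It returns the sorted category names (the meaning of each output column)
// and the encoded rows, in the same order as the input.
func oneHot(column []string) (categories []string, encoded [][]float64) {
	// Collect the distinct categories.
	seen := make(map[string]bool)
	for _, v := range column {
		if !seen[v] {
			seen[v] = true
			categories = append(categories, v)
		}
	}
	// Sort so that the column order is deterministic across runs.
	sort.Strings(categories)

	// Map each category to its column index.
	index := make(map[string]int, len(categories))
	for i, c := range categories {
		index[c] = i
	}

	// Build one row per input value with a single 1 in the matching column.
	encoded = make([][]float64, len(column))
	for i, v := range column {
		row := make([]float64, len(categories))
		row[index[v]] = 1
		encoded[i] = row
	}
	return categories, encoded
}
```

For the input []string{"b", "a", "b"}, the categories are [a b] and the rows are [0 1], [1 0], [0 1]. When the regression also fits an intercept, the full set of one-hot columns is perfectly collinear with it, so a common practice is to drop one category's column.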