
Janitorial work

A large part of data science work is cleanup. In productionized systems, the data would typically be fetched directly from a database and would already be relatively clean (high-quality production data science work requires a database of clean data). However, we're not in production mode yet; we're still in the model-building phase. It helps to think of this step as writing a program whose sole job is to clean data.

Let's look at our requirements. In our data, each column is a variable. Most of them are independent variables; the last column is the dependent variable. Some variables are categorical, and some are continuous. Our task is to write a function that converts the data from its current form, [][]string, into [][]float64.

To do that, every value has to be converted into a float64. For the continuous variables, this is an easy task: simply parse the string into a float. There are oddities that need to be handled, which I hope you spotted when you opened the file in a spreadsheet. The main pain, however, is in converting categorical data to float64.
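As a rough illustration of the continuous case, the following is a minimal sketch in Go. The function names parseContinuous and convertColumn are hypothetical, the sample rows are made up, and treating empty cells and the string "NA" as missing values is an assumption about what the oddities in the file might look like, not something the data is confirmed to contain.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseContinuous converts one raw cell into a float64.
// Assumption: empty cells and "NA" mark missing values. They are mapped to
// NaN so that a later step can decide how to impute or drop them.
func parseContinuous(s string) (float64, error) {
	s = strings.TrimSpace(s)
	if s == "" || s == "NA" {
		return math.NaN(), nil
	}
	return strconv.ParseFloat(s, 64)
}

// convertColumn parses column col of every row in data (hypothetical helper).
func convertColumn(data [][]string, col int) ([]float64, error) {
	out := make([]float64, 0, len(data))
	for i, row := range data {
		if col < 0 || col >= len(row) {
			return nil, fmt.Errorf("row %d has no column %d", i, col)
		}
		f, err := parseContinuous(row[col])
		if err != nil {
			return nil, fmt.Errorf("row %d, column %d: %w", i, col, err)
		}
		out = append(out, f)
	}
	return out, nil
}

func main() {
	// Made-up rows: a numeric column followed by a categorical one.
	data := [][]string{{"8450", "RL"}, {"NA", "RM"}, {" 11250 ", "RL"}}
	vals, err := convertColumn(data, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(vals)
}
```

Returning an error with the row and column position, rather than silently substituting zero, makes it easier to find the cells that need special handling.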

Fortunately for us, people much smarter than us figured this out decades ago. There exists an encoding scheme that allows categorical data to play nicely with linear regression algorithms.
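The passage above does not name the scheme. One widely used scheme of this kind is one-hot (dummy) encoding, where each distinct category becomes its own 0/1 column; assuming that is the kind of scheme meant here, a minimal sketch might look like the following. The function name oneHot and the sample category labels are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// oneHot encodes a categorical column as indicator columns, one per distinct
// level. It returns the sorted levels and, for each input value, a row with a
// 1 in the position of that value's level and 0 elsewhere.
func oneHot(values []string) ([]string, [][]float64) {
	index := make(map[string]int)
	var levels []string
	for _, v := range values {
		if _, seen := index[v]; !seen {
			index[v] = 0 // placeholder; real positions are assigned after sorting
			levels = append(levels, v)
		}
	}
	// Sort so that the column order does not depend on input order.
	sort.Strings(levels)
	for i, l := range levels {
		index[l] = i
	}
	encoded := make([][]float64, len(values))
	for i, v := range values {
		row := make([]float64, len(levels))
		row[index[v]] = 1
		encoded[i] = row
	}
	return levels, encoded
}

func main() {
	// Made-up category labels for illustration.
	levels, enc := oneHot([]string{"RL", "RM", "RL", "FV"})
	fmt.Println(levels)
	for _, row := range enc {
		fmt.Println(row)
	}
}
```

For linear regression with an intercept, one of the indicator columns is usually dropped, because the full set of columns always sums to 1 and is therefore perfectly collinear with the intercept.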
