
Cleaning raw data with generator functions

One of the tasks that arise in exploratory data analysis is cleaning up raw source data. This is often done as a composite operation applying several scalar functions to each piece of input data to create a usable dataset.

Let's look at a simplified set of data. This data is commonly used to show techniques in exploratory data analysis. It's called Anscombe's quartet, and it comes from the article, Graphs in Statistical Analysis, by F. J. Anscombe that appeared in American Statistician in 1973. The following are the first few rows of a downloaded file with this dataset:

Anscombe's quartet
I          II         III        IV
x    y     x    y     x    y     x    y
10.0 8.04  10.0 9.14  10.0 7.46  8.0  6.58
8.0  6.95  8.0  8.14  8.0  6.77  8.0  5.76
13.0 7.58  13.0 8.74  13.0 12.74 8.0  7.71

Sadly, we can't trivially process this with the csv module. We have to do a little bit of parsing to extract the useful information from this file. Since the data is properly tab-delimited, we can use the csv.reader() function to iterate through the various rows. We can define a data iterator as follows:

import csv
from typing import IO, Iterator, List, Text

def row_iter(source: IO) -> Iterator[List[Text]]:
    return csv.reader(source, delimiter="\t")

We simply wrapped a file in the csv.reader() function to create an iterator over rows. The typing module provides a handy definition, IO, for file objects. The csv.reader() function produces an iterator over the rows, where each row is a list of text values. It can be helpful to define an additional type, Row = List[Text], to make this more explicit.
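Such a type alias is a minimal sketch; the name Row is our own shorthand here, not something the csv module defines:

```python
import csv
from typing import IO, Iterator, List, Text

# A type alias that makes the row structure explicit.
Row = List[Text]

def row_iter(source: IO) -> Iterator[Row]:
    # Each row of the tab-delimited file becomes a list of strings.
    return csv.reader(source, delimiter="\t")
```

The alias changes nothing at runtime; it only makes the signature easier to read and reuse in later function definitions.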

We can use this row_iter() function in the following context:

with open("Anscombe.txt") as source:
    print(list(row_iter(source)))  

While this will display useful information, the problem is that the first three items in the resulting iterable aren't data. The Anscombe's quartet file starts with the following rows:

[["Anscombe's quartet"], 
['I', 'II', 'III', 'IV'],
['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],

We need to filter these three non-data rows from the iterable. Here is a function that will neatly excise three expected title rows, and return an iterator over the remaining rows:

def head_split_fixed(
        row_iter: Iterator[List[Text]]
) -> Iterator[List[Text]]:
    title = next(row_iter)
    assert (len(title) == 1
            and title[0] == "Anscombe's quartet")
    heading = next(row_iter)
    assert (len(heading) == 4
            and heading == ['I', 'II', 'III', 'IV'])
    columns = next(row_iter)
    assert (len(columns) == 8
            and columns == ['x','y', 'x','y', 'x','y', 'x','y'])
    return row_iter

This function plucks three rows from the source data, an iterator. It asserts that each row has an expected value. If the file doesn't meet these basic expectations, it's a sign that the file was damaged or perhaps our analysis is focused on the wrong file.

Since both the row_iter() and the head_split_fixed() functions expect an iterator as an argument value, they can be trivially combined, as follows:

with open("Anscombe.txt") as source:
    print(list(head_split_fixed(row_iter(source))))

We've simply applied one iterator to the results of another iterator. In effect, this defines a composite function. We're not done, of course; we still need to convert the string values to float values, and we also need to pick apart the four parallel series of data in each row.
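As a preview, that conversion can be sketched with a small generator function. The name float_pairs and its pairing logic are our own illustration of the idea, not the chapter's final design:

```python
from typing import Iterator, List, Text, Tuple

def float_pairs(
        rows: Iterator[List[Text]], n: int
) -> Iterator[Tuple[float, float]]:
    # Yield the (x, y) pair for series n (0 through 3) from each
    # eight-column row, converting the text values to float.
    for row in rows:
        yield float(row[2*n]), float(row[2*n + 1])
```

Applying float_pairs(head_split_fixed(row_iter(source)), 0) would then produce the (x, y) pairs for series I as proper float values.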

The final conversions and data extractions are more easily done with higher-order functions, such as map() and filter(). We'll return to those in Chapter 5, Higher-Order Functions.
