
Importing data from tab-delimited files

Another very common flat data file format is the tab-delimited file. It can come from an Excel export, or it can be the output of some custom software that we must take our input from.

The good thing is that this format can usually be read in almost the same way as CSV files. The Python csv module supports so-called dialects, which let us apply the same principles to variations of similar file formats, one of them being the tab-delimited format.

Getting ready

You should already be able to read CSV files. If not, please refer to the Importing data from CSV recipe first.

How to do it...

We will reuse the code from the Importing data from CSV recipe; all we need to change is the dialect we are using, as shown in the following code:

import csv
import sys

filename = 'ch02-data.tab'

header = []
data = []
try:
    # newline='' lets the csv module handle line endings itself
    with open(filename, newline='') as f:
        reader = csv.reader(f, dialect=csv.excel_tab)
        # the first row holds the column names
        header = next(reader)
        data = [row for row in reader]
except csv.Error as e:
    print("Error reading CSV file at line %s: %s" % (reader.line_num, e))
    sys.exit(-1)

if header:
    print(header)
    print('===================')

for datarow in data:
    print(datarow)

How it works...

The dialect-based approach is very similar to what we already did in the Importing data from CSV recipe. The only difference is the line where we instantiate the csv reader object: we pass it the dialect parameter and specify the excel_tab dialect.
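
To see what the dialect actually changes, the following minimal sketch parses an in-memory tab-delimited string instead of a file. The sample rows are made up for illustration and are not taken from ch02-data.tab.

```python
import csv
import io

# Hypothetical sample data, standing in for a tab-delimited file
sample = "day\tamount\n2013-01-24\t323\n2013-01-25\t233\n"

# excel_tab is the excel dialect with the delimiter set to a tab
print(repr(csv.excel_tab.delimiter))

# io.StringIO wraps the string so it can be read like an open file
reader = csv.reader(io.StringIO(sample), dialect=csv.excel_tab)
header = next(reader)
rows = list(reader)

print(header)
print(rows)
```

The dialect can also be passed by its registered name, 'excel-tab', which is one of the names returned by csv.list_dialects().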

There's more...

The csv-based approach breaks down if the data is "dirty", that is, if some lines do not end with just a newline character but carry additional \t (tab) characters before it. The csv reader treats each of those stray tabs as a delimiter and returns extra empty fields, so such lines have to be cleaned before they are split. A sample "dirty" tab-delimited file can be found in ch02-data-dirty.tab. The following code sample cleans the data as it reads it:

datafile = 'ch02-data-dirty.tab'

with open(datafile, 'r') as f:
    for line in f:
        # uncomment the next line to see the line before cleanup
        # print('DIRTY:', line.split('\t'))

        # strip whitespace (stray tabs and the newline included)
        # from both ends of the line
        line = line.strip()

        # now we split the line by the tab delimiter
        print(line.split('\t'))

This also shows another way to split tab-delimited lines: calling the split('\t') method on each line instead of using the csv module.
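
A short sketch of why the cleanup matters, assuming a "dirty" line is one with stray tabs before the newline. The sample line is made up and is not taken from ch02-data-dirty.tab, and read_clean is a hypothetical helper name. The csv reader turns each trailing tab into an empty field, whereas stripping the line first leaves only the real values:

```python
import csv
import io

# Hypothetical dirty line: two stray tabs before the newline
dirty_line = "2013-01-24\t323\t\t\n"

# The csv reader keeps the stray tabs as empty trailing fields
raw_fields = next(csv.reader(io.StringIO(dirty_line), dialect=csv.excel_tab))
print(raw_fields)

# Stripping whitespace first removes the stray tabs and the newline
clean_fields = dirty_line.strip().split('\t')
print(clean_fields)


# csv.reader accepts any iterable of strings, so both can be combined
def read_clean(lines):
    """Strip each line before handing it to the csv reader."""
    return list(csv.reader((line.strip() for line in lines),
                           dialect=csv.excel_tab))


print(read_clean([dirty_line]))
```

Note that strip() also removes a leading tab, so a row whose first field is legitimately empty would lose a column. Where that can happen, a narrower cleanup such as rstrip() may be the safer choice.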

The advantage of the csv module over split() is that we can reuse the same reading code and change only the dialect, which we can select from the file extension (.csv or .tab) or detect by some other method (for example, using the csv.Sniffer class).
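
One possible way to wire this up is sketched below. The names DIALECTS_BY_EXTENSION, pick_dialect, and read_rows are hypothetical helpers introduced for illustration. The sketch assumes that only comma and tab delimiters are of interest and that the first 1024 characters of a file are a representative sample for csv.Sniffer.

```python
import csv
import os

# Hypothetical mapping from file extension to csv dialect
DIALECTS_BY_EXTENSION = {'.csv': csv.excel, '.tab': csv.excel_tab}


def pick_dialect(filename, sample=None):
    """Pick a dialect from the extension, else sniff it from a text sample."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in DIALECTS_BY_EXTENSION:
        return DIALECTS_BY_EXTENSION[extension]
    if sample:
        # sniff() raises csv.Error if it cannot determine the delimiter
        return csv.Sniffer().sniff(sample, delimiters=',\t')
    # Fall back to the default comma-separated dialect
    return csv.excel


def read_rows(filename):
    """Read all rows of a delimited file using the detected dialect."""
    with open(filename, newline='') as f:
        sample = f.read(1024)
        f.seek(0)  # rewind so the reader starts at the first row
        return list(csv.reader(f, dialect=pick_dialect(filename, sample)))
```

With these helpers, read_rows('ch02-data.tab') would use the excel_tab dialect because of the extension, while a file with an unrecognized extension would be sniffed from its contents.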
