
Importing data from tab-delimited files

Another very common flat-file format is the tab-delimited file. Such files can come from an Excel export, but they can also be the output of custom software that we must take our input from.

The good thing is that this format can usually be read in almost the same way as CSV files. The Python csv module supports so-called dialects, which let us use the same principles to read variations of similar file formats, one of them being the tab-delimited format.

Getting ready

By now you should already be able to read CSV files. If not, please refer to the Importing data from CSV recipe first.

How to do it...

We will reuse the code from the Importing data from CSV recipe; all we need to change is the dialect, as shown in the following code:

import csv
import sys

filename = 'ch02-data.tab'

data = []
header = None
try:
    # newline='' lets the csv module handle line endings itself
    with open(filename, newline='') as f:
        reader = csv.reader(f, dialect=csv.excel_tab)
        # the first row holds the column names
        header = next(reader)
        data = [row for row in reader]
except csv.Error as e:
    print("Error reading CSV file at line %s: %s" % (reader.line_num, e))
    sys.exit(-1)

if header:
    print(header)
    print('===================')

for datarow in data:
    print(datarow)

How it works...

The dialect-based approach is very similar to what we already did in the Importing data from CSV recipe, except for the line where we instantiate the csv reader object: there we pass the dialect parameter and specify the excel_tab dialect.
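To see what the excel_tab dialect changes, the following minimal sketch reads a few tab-separated lines from an in-memory string instead of from ch02-data.tab. The sample rows and the name sample are made up for illustration, and the code assumes Python 3.

```python
import csv
import io

# Hypothetical sample data standing in for a real tab-delimited file.
sample = "day\tamount\n2013-01-24\t323\n2013-01-25\t233\n"

# excel_tab is the excel dialect with a tab as the field delimiter.
print(repr(csv.excel_tab.delimiter))
print(issubclass(csv.excel_tab, csv.excel))

# Read the sample exactly as we would read an open file.
reader = csv.reader(io.StringIO(sample, newline=''), dialect=csv.excel_tab)
header = next(reader)
rows = list(reader)

print(header)
print(rows)

# Dialects registered by name; 'excel-tab' is one of them.
print(csv.list_dialects())
```

Passing dialect='excel-tab' (the registered name) instead of the csv.excel_tab class should behave the same way.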

There's more...

The csv-based approach will not work cleanly if the data is "dirty", that is, if some lines do not end with just a newline character but carry additional \t (tab) markers before it. Such lines need to be cleaned before we split them. A sample "dirty" tab-delimited file can be found in ch02-data-dirty.tab. The following code sample cleans the data as it reads it:

datafile = 'ch02-data-dirty.tab'

with open(datafile, 'r') as f:
    for line in f:
        # uncomment the next line to see the line before cleanup
        # print('DIRTY: ', line.split('\t'))

        # strip whitespace (including stray tabs and the newline)
        # from both ends of the line
        line = line.strip()

        # now we split the line by the tab delimiter
        print(line.split('\t'))

This also shows another way to read tab-delimited data: calling the split('\t') method on each line.
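To make the difference concrete, the sketch below runs both approaches over the same "dirty" input. The lines in dirty are invented for illustration (they are not the contents of ch02-data-dirty.tab), and the code assumes Python 3.

```python
import csv
import io

# Hypothetical "dirty" lines: some end with extra tab characters.
dirty = "day\tamount\t\t\n2013-01-24\t323\n2013-01-25\t233\t\n"

# The csv reader keeps each trailing tab as an extra empty field.
csv_rows = list(csv.reader(io.StringIO(dirty, newline=''),
                           dialect=csv.excel_tab))

# Stripping each line first removes the stray tabs before the split.
clean_rows = [line.strip().split('\t') for line in dirty.splitlines()]

print(csv_rows)
print(clean_rows)
```

Note that strip() removes whitespace from both ends, so a row whose first or last field is legitimately empty would lose that field as well. Whether that is acceptable depends on the data.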

The advantage of the csv module approach over split() is that we can reuse the same reading code and change only the dialect, which we can choose from the file extension (.csv or .tab) or detect by some other method (for example, using the csv.Sniffer class).
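One possible way to pick the dialect is sketched below. The DIALECTS mapping and the helper names dialect_for and read_rows are hypothetical, introduced only for this example; csv.Sniffer is the standard library class mentioned above. The sample text is made up, and the code assumes Python 3.

```python
import csv
import io
import os

# Hypothetical mapping from file extension to csv dialect.
DIALECTS = {'.csv': csv.excel, '.tab': csv.excel_tab}


def dialect_for(filename):
    """Pick a dialect from the file extension, defaulting to csv.excel."""
    ext = os.path.splitext(filename)[1].lower()
    return DIALECTS.get(ext, csv.excel)


def read_rows(f, dialect):
    """Read all rows from an open text file using the given dialect."""
    return list(csv.reader(f, dialect=dialect))


# Made-up sample standing in for the contents of a .tab file.
sample = "day\tamount\n2013-01-24\t323\n2013-01-25\t233\n"

# Option 1: choose the dialect from the file name.
rows = read_rows(io.StringIO(sample, newline=''),
                 dialect_for('ch02-data.tab'))
print(rows)

# Option 2: let csv.Sniffer guess the delimiter from a sample of the text,
# restricting the candidates to tab and comma.
sniffed = csv.Sniffer().sniff(sample, delimiters='\t,')
print(repr(sniffed.delimiter))
```

The extension lookup is predictable but trusts the file name, while the sniffer inspects the content and may guess wrong on small or irregular samples, so it can be worth validating the result before relying on it.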
