
Importing data from tab-delimited files

Another very common flat-file format is the tab-delimited file. Such files can come from an Excel export, or they can be the output of custom software that we must take our input from.

The good news is that this format can usually be read in almost the same way as CSV files, because the Python csv module supports so-called dialects. Dialects let us use the same principles to read variations of similar file formats, one of them being the tab-delimited format.

Getting ready

You should already be able to read CSV files. If not, please refer to the Importing data from CSV recipe first.

How to do it...

We will reuse the code from the Importing data from CSV recipe; all we need to change is the dialect we are using, as shown in the following code:

import csv
import sys

filename = 'ch02-data.tab'

data = []
header = None
try:
    # newline='' lets the csv module handle line endings itself
    with open(filename, newline='') as f:
        reader = csv.reader(f, dialect=csv.excel_tab)
        # the first row is the header; None if the file is empty
        header = next(reader, None)
        data = [row for row in reader]
except csv.Error as e:
    print("Error reading CSV file at line %s: %s" % (reader.line_num, e))
    sys.exit(-1)

if header:
    print(header)
    print('===================')

for datarow in data:
    print(datarow)

How it works...

The dialect-based approach is very similar to what we already did in the Importing data from CSV recipe. The only difference is the line where we instantiate the csv reader object: we pass it the dialect parameter and specify the excel_tab dialect.
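To make the role of the dialect concrete, here is a minimal sketch that parses tab-delimited text held in memory rather than in a file. The sample rows and the helper name read_tab_text are made up for illustration and are not part of the recipe's data files.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a .tab file
sample = "day\tamount\n2013-01-24\t323\n2013-01-25\t233\n"


def read_tab_text(text):
    """Parse tab-delimited text into a header row and a list of data rows."""
    # excel_tab is the excel dialect with the delimiter set to a tab
    reader = csv.reader(io.StringIO(text, newline=''), dialect=csv.excel_tab)
    header = next(reader, None)  # None if the text is empty
    rows = [row for row in reader]
    return header, rows


header, rows = read_tab_text(sample)
print(repr(csv.excel_tab.delimiter))
print(header)
print(rows)
```

Every field comes back as a string, exactly as with the CSV recipe; any conversion to numbers or dates is left to the caller.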

There's more...

A CSV-based approach will not give clean results if the data is "dirty", that is, if certain lines do not end with just a newline character but carry additional \t (Tab) markers before it. We need to clean such lines before splitting them. A sample "dirty" tab-delimited file can be found in ch02-data-dirty.tab. The following code sample cleans the data as it reads it:

datafile = 'ch02-data-dirty.tab'

with open(datafile, 'r') as f:
    for line in f:
        # uncomment the next line to see the line before cleanup
        # print('DIRTY: ', line.split('\t'))

        # strip whitespace (including stray tabs) from both ends of the line
        line = line.strip()

        # now we split the line by the tab delimiter
        print(line.split('\t'))

This sample also shows another approach to reading the file: splitting each line with the split('\t') method.
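The difference between the two approaches on dirty input can be sketched as follows. The text below is an invented stand-in for a dirty file (the real contents of ch02-data-dirty.tab are not reproduced here), and the two helper functions are hypothetical names used only for this comparison.

```python
import csv
import io

# Hypothetical dirty content: the first data line carries stray trailing tabs
dirty_text = "day\tamount\n2013-01-24\t323\t\t\n2013-01-25\t233\n"


def read_with_csv(text):
    """Read the text with the csv module and the excel_tab dialect."""
    return list(csv.reader(io.StringIO(text, newline=''), dialect=csv.excel_tab))


def read_with_split(text):
    """Strip each line first, then split it on the tab delimiter."""
    return [line.strip().split('\t') for line in io.StringIO(text)]


# The csv reader keeps the trailing tabs as extra empty fields
print(read_with_csv(dirty_text)[1])
# Stripping before splitting drops them
print(read_with_split(dirty_text)[1])
```

Note that strip() removes leading whitespace as well, so a line that legitimately starts with an empty field would lose it; rstrip() may be the safer choice for such data.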

The advantage of the csv module approach over split() is that we can reuse the same reading code and change only the dialect, which we can choose from the file extension (.csv or .tab) or detect by some other method (for example, using the csv.Sniffer class).
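One possible way to select the dialect is sketched here, assuming that the extension is trustworthy when it is known and that csv.Sniffer is only a fallback. The mapping DIALECT_BY_EXTENSION and the function pick_dialect are hypothetical names introduced for this example.

```python
import csv
import os

# Hypothetical mapping from file extension to csv dialect
DIALECT_BY_EXTENSION = {'.csv': csv.excel, '.tab': csv.excel_tab}


def pick_dialect(filename, sample_text=''):
    """Choose a dialect from the extension, else sniff it from a text sample."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in DIALECT_BY_EXTENSION:
        return DIALECT_BY_EXTENSION[ext]
    # Unknown extension: let the Sniffer guess between comma and tab.
    # sniff() raises csv.Error if it cannot determine the delimiter.
    return csv.Sniffer().sniff(sample_text, delimiters=',\t')


print(pick_dialect('ch02-data.tab') is csv.excel_tab)
guessed = pick_dialect('data.txt', "a\tb\tc\n1\t2\t3\n4\t5\t6\n")
print(repr(guessed.delimiter))
```

The returned dialect can be passed straight to csv.reader(f, dialect=...), so the reading code itself stays the same for both formats. Sniffing is a heuristic, so it is worth handling csv.Error and falling back to a default when the sample is too short or ambiguous.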
