
Data formats

When we work with data meant for human consumption, the easiest way to store it is in text files. In this section, we present parsing examples for the most common formats, such as CSV, JSON, and XML. These examples will be very helpful in the following chapters.

Tip

The dataset used for these examples is a list of Pokémon characters by National Pokedex number, obtained at the URL http://bulbapedia.bulbagarden.net/.

All the scripts and dataset files can be found in the author's GitHub repository available at the URL https://github.com/hmcuesta/PDA_Book/tree/master/Chapter3/.

CSV

CSV (comma-separated values) is a very simple and common open format for tabular data, which can be exported and imported by most data analysis tools. CSV is a plain text format; this means that the file is a sequence of characters, with no data that has to be interpreted in some other way, such as binary numbers.

There are many ways to parse a CSV file from Python; in a moment we will discuss two of them.

The first eight records of the CSV file (pokemon.csv) look as follows:

 id, typeTwo, name, type
 001, Poison, Bulbasaur, Grass
 002, Poison, Ivysaur, Grass
 003, Poison, Venusaur, Grass
 006, Flying, Charizard, Fire
 012, Flying, Butterfree, Bug
 013, Poison, Weedle, Bug
 014, Poison, Kakuna, Bug
 015, Poison, Beedrill, Bug
. . .

Parsing a CSV file with the csv module

Firstly, we need to import the csv module:

import csv

Then we open the file pokemon.csv and parse it with the function csv.reader(f):

with open("pokemon.csv", newline="") as f:
    data = csv.reader(f)
    # Now we just iterate over the reader; each line is a list of strings
    for line in data:
        print(" id: {0} , typeTwo: {1}, name:  {2}, type: {3}"
              .format(line[0], line[1], line[2], line[3]))

Output:
id: id , typeTwo: typeTwo, name: name, type: type
id: 001 , typeTwo: Poison, name: Bulbasaur, type: Grass
id: 002 , typeTwo: Poison, name: Ivysaur, type: Grass
id: 003 , typeTwo: Poison, name: Venusaur, type: Grass
id: 006 , typeTwo: Flying, name: Charizard, type: Fire
. . .
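
The csv module also offers csv.DictReader, which uses the header line as field names so that columns can be read by name instead of by position. The following is only a minimal sketch under stated assumptions: the SAMPLE string and the read_rows helper are hypothetical stand-ins for pokemon.csv, and skipinitialspace=True is used to drop the blank that follows each comma in the file.

```python
import csv
import io

# Hypothetical in-memory sample mirroring a few rows of pokemon.csv.
SAMPLE = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
006, Flying, Charizard, Fire
"""


def read_rows(f):
    """Return every record as a dict keyed by the header names."""
    # skipinitialspace=True ignores the blank that follows each comma.
    reader = csv.DictReader(f, skipinitialspace=True)
    return [dict(row) for row in reader]


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print("id: {id}, typeTwo: {typeTwo}, name: {name}, type: {type}"
          .format(**row))
```

To read the real file, the same helper could be called as read_rows(open("pokemon.csv", newline="")) inside a with block.
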
Parsing a CSV file using NumPy

Perform the following steps for parsing a CSV file:

  1. Firstly, we need to import the numpy library:
    import numpy as np
  2. NumPy provides us with the genfromtxt function, which we call here with four arguments. First, we provide the name of the file, pokemon.csv. Then we skip the first line, which is the header (skip_header). Next, we specify the data type (dtype); passing None lets NumPy infer the type of each column. Finally, we define the comma as the delimiter (delimiter).
    data = np.genfromtxt("pokemon.csv"
                            ,skip_header=1
                            ,dtype=None
                            ,delimiter=',')
  3. Then just print the result.
    print(data)
    
    Output:
    [(1, b' Poison', b' Bulbasaur', b' Grass')
     (2, b' Poison', b' Ivysaur', b' Grass')
     (3, b' Poison', b' Venusaur', b' Grass')
     (6, b' Flying', b' Charizard', b' Fire')
     (12, b' Flying', b' Butterfree', b' Bug')
     . . .]
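
As a variation on the steps above, genfromtxt can keep the header as field names instead of skipping it. This is a hedged sketch, not the book's script: the SAMPLE string is a hypothetical stand-in for pokemon.csv, names=True reads the column names from the first line, autostrip=True removes the blanks around each value, and encoding="utf-8" asks for regular strings rather than bytes.

```python
import io

import numpy as np

# Hypothetical in-memory sample mirroring a few rows of pokemon.csv.
SAMPLE = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
006, Flying, Charizard, Fire
"""

data = np.genfromtxt(io.StringIO(SAMPLE),
                     delimiter=",",
                     names=True,        # take field names from the header
                     dtype=None,        # infer a type for each column
                     autostrip=True,    # strip blanks around each value
                     encoding="utf-8")  # decode text as str, not bytes

# The result is a structured array, so a column is selected by its name.
print(data.dtype.names)
print(data["name"])
```

Because the id column is inferred as an integer, its leading zeros are not preserved, just as in the output shown above.
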
    

JSON

JSON is a common format to exchange data. Although it is derived from JavaScript, Python provides us with a library to parse JSON.

Parsing a JSON file using the json module

The first three records of the JSON file (pokemon.json) look as follows:

 [
    {
        "id": " 001",
        "typeTwo": " Poison",
        "name": " Bulbasaur",
        "type": " Grass"
    },
    {
        "id": " 002",
        "typeTwo": " Poison",
        "name": " Ivysaur",
        "type": " Grass"
    },
    {
        "id": " 003",
        "typeTwo": " Poison",
        "name": " Venusaur",
        "type": " Grass"
    },
. . .]

Firstly, we need to import the json module and pprint (pretty-print) module.

import json
from pprint import pprint

Then we open the file pokemon.json and parse its contents with the function json.loads.

with open("pokemon.json") as f:
    data = json.loads(f.read())

Finally, just print the result with the function pprint.

pprint(data)

Output:

[{'id': ' 001', 'name': ' Bulbasaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 002', 'name': ' Ivysaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 003', 'name': ' Venusaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 006', 'name': ' Charizard', 'type': ' Fire', 'typeTwo': ' Flying'},
 {'id': ' 012', 'name': ' Butterfree', 'type': ' Bug', 'typeTwo': ' Flying'}, . . . ]
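
The values in the output keep the leading blank that is stored in the file. A possible cleanup step is sketched below, assuming a hypothetical load_records helper and an in-memory SAMPLE in place of pokemon.json; it uses json.load, which reads directly from a file object, and strips the whitespace from every value.

```python
import io
import json

# Hypothetical in-memory sample mirroring two records of pokemon.json.
SAMPLE = """[
    {"id": " 001", "typeTwo": " Poison", "name": " Bulbasaur", "type": " Grass"},
    {"id": " 002", "typeTwo": " Poison", "name": " Ivysaur", "type": " Grass"}
]"""


def load_records(f):
    """Parse a JSON array of objects and strip blanks from each value."""
    records = json.load(f)  # json.load reads from a file-like object
    return [{key: value.strip() for key, value in record.items()}
            for record in records]


records = load_records(io.StringIO(SAMPLE))
# json.dumps turns a Python object back into JSON text.
print(json.dumps(records[0], sort_keys=True))
```

With the real file, the call would be load_records(f) inside the with open("pokemon.json") as f: block.
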

XML

According to the World Wide Web Consortium (W3C), at http://www.w3.org/XML/:

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

The first three records of the XML file (pokemon.xml) look as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<pokemon>
  <row>
    <id> 001</id>
    <typeTwo> Poison</typeTwo>
    <name> Bulbasaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 002</id>
    <typeTwo> Poison</typeTwo>
    <name> Ivysaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 003</id>
    <typeTwo> Poison</typeTwo>
    <name> Venusaur</name>
    <type> Grass</type>
  </row>
. . .
</pokemon>

Parsing an XML file in Python using the xml module

Firstly, we need to import the ElementTree module from the xml.etree package.

from xml.etree import ElementTree

Then we open the file pokemon.xml and parse it with the function ElementTree.parse.

with open("pokemon.xml") as f:
    doc = ElementTree.parse(f)

Finally, just print each 'row' element with the findall function:

for node in doc.findall('row'):
    print("")
    print("id: {0}".format(node.find('id').text))
    print("typeTwo: {0}".format(node.find('typeTwo').text))
    print("name: {0}".format(node.find('name').text))
    print("type: {0}".format(node.find('type').text))
        
Output:

id: 001
typeTwo: Poison
name: Bulbasaur
type: Grass

id: 002
typeTwo: Poison
name: Ivysaur
type: Grass

id: 003
typeTwo: Poison
name: Venusaur
type: Grass

. . .
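
Instead of printing each field, the rows can also be collected into a list of dictionaries, which is often easier to process later. The sketch below is an illustration under assumptions: SAMPLE is a hypothetical in-memory excerpt of pokemon.xml, parsed with ElementTree.fromstring, and each child element of a row becomes one key with its text stripped of blanks.

```python
from xml.etree import ElementTree

# Hypothetical in-memory sample mirroring two rows of pokemon.xml.
SAMPLE = """<pokemon>
  <row>
    <id> 001</id>
    <typeTwo> Poison</typeTwo>
    <name> Bulbasaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 002</id>
    <typeTwo> Poison</typeTwo>
    <name> Ivysaur</name>
    <type> Grass</type>
  </row>
</pokemon>"""

# fromstring parses XML held in a string and returns the root element.
root = ElementTree.fromstring(SAMPLE)

# Iterating over an element yields its children (id, typeTwo, name, type).
records = [{child.tag: child.text.strip() for child in row}
           for row in root.findall("row")]

for record in records:
    print(record)
```
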

YAML

YAML Ain't Markup Language (YAML) is a human-friendly data serialization format. It is not as popular as JSON or XML, but it was designed to map easily to the data types common to most high-level languages. A Python parser implementation called PyYAML is available in the PyPI repository, and its usage is very similar to that of the json module.

The first three records of the YAML file (pokemon.yaml) look as follows:

Pokemon:
 - id : 001
   typeTwo : Poison
   name : Bulbasaur
   type : Grass
 - id : 002
   typeTwo : Poison
   name : Ivysaur
   type : Grass
 - id : 003
   typeTwo : Poison
   name : Venusaur
   type : Grass
. . .
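
To illustrate the similarity with the json module, here is a minimal sketch that assumes the third-party PyYAML package is installed (it provides the yaml module). SAMPLE is a hypothetical in-memory excerpt of pokemon.yaml; the ids are quoted here on purpose, an assumption made for this example, because an unquoted 001 would be loaded as a number and lose its leading zeros.

```python
import yaml  # provided by the third-party PyYAML package

# Hypothetical in-memory sample mirroring two records of pokemon.yaml.
# The ids are quoted so that they are loaded as strings.
SAMPLE = """
Pokemon:
  - id: "001"
    typeTwo: Poison
    name: Bulbasaur
    type: Grass
  - id: "002"
    typeTwo: Poison
    name: Ivysaur
    type: Grass
"""

# safe_load builds only plain Python objects (dict, list, str, int, ...).
data = yaml.safe_load(SAMPLE)

for record in data["Pokemon"]:
    print("id: {0}, name: {1}".format(record["id"], record["name"]))
```

yaml.safe_load also accepts an open file object, so the real file could be parsed inside a with open("pokemon.yaml") as f: block in the same way as the JSON example.
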