
  • Data Wrangling with Python
  • Dr. Tirthajyoti Sarkar Shubhadeep Roychowdhury
  • 2021-06-11 13:40:26

Basic File Operations in Python

In the previous topic, we investigated a few advanced data structures and learned useful functional programming methods to manipulate them without side effects. In this topic, we will learn about a few operating system (OS)-level functions in Python. We will concentrate mainly on file-related functions and learn how to open a file, read the data line by line or all at once, and finally how to cleanly close the file we opened. We will then apply a few of the techniques we have learned to a file that we read, to practice our data wrangling skills further.

Exercise 22: File Operations

In this exercise, we will learn about the os module of Python, and we will also see two very useful ways to write and read environment variables. The ability to write and read environment variables is often important when designing and developing data wrangling pipelines.

Note

In fact, one of the factors of the famous 12-factor app design is the very idea of storing configuration in the environment. You can check it out at this URL: https://12factor.net/config.

The purpose of the os module is to give you ways to interact with operating system-dependent functionality. It is fairly low-level, and most of its functions are not needed on a day-to-day basis, but some are worth learning. os.environ is the mapping Python maintains of the environment variables that were present when the interpreter started. Assigning to it creates or updates a variable for the current process and any child processes it starts. The os.getenv function reads an environment variable and returns None if the variable is not set:

  1. Import the os module.

    import os

  2. Set an environment variable and read it back:

    os.environ['MY_KEY'] = "MY_VAL"

    os.getenv('MY_KEY')

    The output is as follows:

    'MY_VAL'

    Print the environment variable when it is not set:

    print(os.getenv('MY_KEY_NOT_SET'))

    The output is as follows:

    None

  3. Print the os environment:

    print(os.environ)

    Note

    The output has not been added for security reasons.

    After executing the preceding code, you will be able to see that you have successfully printed the value of MY_KEY, and when you tried to print MY_KEY_NOT_SET, it printed None.
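When configuration is read from the environment, a missing variable usually needs a fallback. The following is a minimal sketch of that idea using the optional second argument of os.getenv. The variable names PIPELINE_STAGE and PIPELINE_BATCH_SIZE_NOT_SET are hypothetical and exist only for this illustration.

```python
import os

# Hypothetical variable name, used only for this illustration.
os.environ["PIPELINE_STAGE"] = "staging"

# os.getenv returns the stored string when the variable exists.
stage = os.getenv("PIPELINE_STAGE")

# The optional second argument is returned when the variable is missing.
batch_size = os.getenv("PIPELINE_BATCH_SIZE_NOT_SET", "100")

# Environment values are always strings, so convert explicitly.
batch_size = int(batch_size)

# Indexing os.environ directly raises KeyError for a missing variable.
try:
    os.environ["PIPELINE_BATCH_SIZE_NOT_SET"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(stage, batch_size, missing_raises)
```

Whether to use a default or to fail loudly on a missing variable is a design choice: a default suits optional settings, whereas a KeyError is often preferable for settings the pipeline cannot run without.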

File Handling

In this section, we will learn how to open a file in Python. We will learn about the different modes that we can use and what they stand for. Python has a built-in open function that we will use to open a file. The open function takes a few arguments as input. Among them, the first one, which stands for the name of the file you want to open, is the only one that's mandatory. Everything else has a default value. When you call open, Python uses underlying system-level calls to open a file handler and returns it to the caller.

Usually, a file is opened either for reading or for writing. If we open a file in one of these modes, the other operation is not supported unless we add + to the mode. Whereas reading usually means we start to read from the beginning of an existing file, writing can mean either starting a new file and writing from the beginning or opening an existing file and appending to it. Here is a table showing you all the different modes Python supports for opening a file:

Figure 2.5 Modes to read a file

There was also a deprecated mode, U, which had no effect in Python 3 and was removed in Python 3.11. One thing we must remember here is that Python will always differentiate between t and b modes, even if the underlying OS doesn't. This is because in b mode, Python does not try to decode what it is reading and gives us back a bytes object instead, whereas in t mode, it does decode the stream and gives us back a string.
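The difference between t and b modes can be seen with a small sketch. It writes a throwaway file (the name mode_demo.txt and the temporary directory are assumptions made for this example) and reads it back in both modes.

```python
import os
import tempfile

# Write a small sample file into a temporary directory (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "mode_demo.txt")
with open(path, "w", encoding="utf-8") as fd:
    fd.write("caf\u00e9")  # four characters, the last one is non-ASCII

# Text mode ("rt" is the default): bytes are decoded and we get a str.
with open(path, "rt", encoding="utf-8") as fd:
    text_data = fd.read()

# Binary mode ("rb"): no decoding happens and we get a bytes object.
with open(path, "rb") as fd:
    binary_data = fd.read()

print(type(text_data), len(text_data))      # str, counted in characters
print(type(binary_data), len(binary_data))  # bytes, counted in bytes
```

The two lengths differ because the accented character occupies two bytes in UTF-8 but counts as a single character once decoded.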

You can open a file for reading like so:

fd = open("Alice's Adventures in Wonderland, by Lewis Carroll")

By default, the file is opened in rt mode (read, text). You can open the same file in binary mode if you want. To open the file in binary mode, use the rb mode:

fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")

fd

The output is as follows:

<_io.BufferedReader name="Alice's Adventures in Wonderland, by Lewis Carroll">

This is how we open a file for writing:

fd = open("interesting_data.txt", "w")

fd

The output is as follows (the encoding shown depends on your platform and locale):

<_io.TextIOWrapper name='interesting_data.txt' mode='w' encoding='cp1252'>

Exercise 23: Opening and Closing a File

In this exercise, we will learn how to close an open file. It is very important that we close a file once we are done with it. A dangling file handler can cause system-level bugs: it holds on to operating system resources, and data written through it may not yet have been flushed to disk. Once we close a file, no further operations can be performed on that file using that specific file handler:

  1. Open a file in binary mode:

    fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")

  2. Close a file using close():

    fd.close()

  3. Python also gives us a closed flag with the file handler. If we print it before closing, then we will see False, whereas if we print it after closing, then we will see True. If our logic checks whether a file is properly closed or not, then this is the flag we want to use.
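The closed flag described in step 3 can be checked with a short sketch. To keep it self-contained, it creates its own throwaway file (closed_flag_demo.txt is a hypothetical name) instead of relying on the book's text file.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "closed_flag_demo.txt")
with open(path, "w") as fd:
    fd.write("some data\n")

fd = open(path, "rb")
before_close = fd.closed   # the file is still open here
fd.close()
after_close = fd.closed    # the file has been closed

# Any further read on the closed file object raises ValueError.
try:
    fd.read()
    read_failed = False
except ValueError:
    read_failed = True

print(before_close, after_close, read_failed)
```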

The with Statement

In this section, we will learn about the with statement in Python and how we can effectively use it in the context of opening and closing files.

The with statement is a compound statement in Python. Like any compound statement, with affects the execution of the code enclosed by it. In the case of with, it is used to wrap a block of code in the scope of what we call a context manager in Python. A detailed discussion of context managers is out of the scope of this topic, but it is sufficient to say that the file object returned by open is itself a context manager, so it is guaranteed that close will be called automatically when we open the file inside a with statement, even if the block raises an exception.

Note

There is an entire PEP for with at https://www.python.org/dev/peps/pep-0343/. We encourage you to look into it.

Opening a File Using the with Statement

Open a file using the with statement:

with open("Alice's Adventures in Wonderland, by Lewis Carroll") as fd:
    print(fd.closed)

print(fd.closed)

The output is as follows:

False

True

If we execute the preceding code, we will see that the first print will end up printing False, whereas the second one will print True. This means that as soon as the control goes out of the with block, the file descriptor is automatically closed.

Note

This is by far the cleanest and most Pythonic way to open a file and obtain a file descriptor for it. We encourage you to use this pattern whenever you need to open a file by yourself.
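One reason this pattern is preferred is that the file is closed even when the code inside the block fails. The sketch below illustrates this with a deliberately raised RuntimeError. The file name with_demo.txt and the simulated failure are assumptions made for the example.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as fd:
    fd.write("first line\n")

try:
    with open(path) as fd:
        fd.readline()
        # Simulate a failure in the middle of processing.
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

# The name fd is still bound after the block, and the file is closed.
closed_after_error = fd.closed
print(closed_after_error)
```

With a plain open and close pair, the same guarantee would need an explicit try and finally around the processing code.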

Exercise 24: Reading a File Line by Line

  1. Open a file and then read the file line by line and print it as we read it:

    with open("Alice's Adventures in Wonderland, by Lewis Carroll",
              encoding="utf8") as fd:
        for line in fd:
            print(line)

    The output is as follows:

    Figure 2.6: Screenshot from the Jupyter notebook

  2. Looking at the preceding code, we can see why iterating over the file object matters: lines are read one at a time, so with this small snippet you can open and read files that are many GBs in size without flooding or overrunning the system memory.

    The file object also has an explicit method called readline, which reads one line at a time from a file.

  3. Duplicate the same for loop, just after the first one:

    with open("Alice's Adventures in Wonderland, by Lewis Carroll",
              encoding="utf8") as fd:
        for line in fd:
            print(line)
        print("Ended first loop")
        for line in fd:
            print(line)

    The output is as follows:

Figure 2.7: Section of the open file
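The second loop in step 3 prints nothing because the first loop has already moved the file position to the end of the file, so there is nothing left to iterate over. The following sketch shows this on a small throwaway file (rewind_demo.txt is a hypothetical name) and uses seek(0) to move back to the beginning. It also shows the readline method mentioned in step 2.

```python
import os
import tempfile

# Create a three-line throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "rewind_demo.txt")
with open(path, "w", encoding="utf8") as fd:
    fd.write("line one\nline two\nline three\n")

with open(path, encoding="utf8") as fd:
    first_pass = [line for line in fd]    # reads every line
    second_pass = [line for line in fd]   # nothing left to read

    # readline returns an empty string once the end of the file is reached.
    at_end = fd.readline()

    # Move the file position back to the start and read again.
    fd.seek(0)
    first_line = fd.readline()
    remaining = [line for line in fd]

print(len(first_pass), len(second_pass), repr(at_end))
print(repr(first_line), len(remaining))
```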

Exercise 25: Write to a File

We will end this topic on file operations by showing you how to write to a file. We will write a few lines to a file and read the file:

  1. Use the write function from the file descriptor object:

    data_dict = {"India": "Delhi", "France": "Paris", "UK": "London",
                 "USA": "Washington"}

    with open("data_temporary_files.txt", "w") as fd:
        for country, capital in data_dict.items():
            fd.write("The capital of {} is {}\n".format(
                country, capital))

  2. Read the file using the following command:

    with open("data_temporary_files.txt", "r") as fd:
        for line in fd:
            print(line)

    The output is as follows:

    The capital of India is Delhi

    The capital of France is Paris

    The capital of UK is London

    The capital of USA is Washington

  3. Use the print function to write to a file using the following command:

    data_dict_2 = {"China": "Beijing", "Japan": "Tokyo"}

    with open("data_temporary_files.txt", "a") as fd:
        for country, capital in data_dict_2.items():
            print("The capital of {} is {}".format(
                country, capital), file=fd)

  4. Read the file using the following command:

    with open("data_temporary_files.txt", "r") as fd:
        for line in fd:
            print(line)

    The output is as follows:

    The capital of India is Delhi

    The capital of France is Paris

    The capital of UK is London

    The capital of USA is Washington

    The capital of China is Beijing

    The capital of Japan is Tokyo

    Note:

    In the second case, we did not add an extra newline character, \n, at the end of the string to be written. The print function does that automatically for us.
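Step 3 opens the file in a mode rather than w mode, and the choice matters: a keeps the existing content and writes at the end, whereas w empties the file first. The following is a minimal sketch of the difference on a throwaway file (append_demo.txt is a hypothetical name).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "append_demo.txt")

# "w" creates the file (or empties an existing one) before writing.
with open(path, "w") as fd:
    fd.write("first\n")

# "a" keeps the existing content and writes at the end.
with open(path, "a") as fd:
    print("second", file=fd)  # print adds the trailing newline itself

with open(path) as fd:
    after_append = fd.read().splitlines()

# Opening with "w" again discards what was there.
with open(path, "w") as fd:
    fd.write("third\n")

with open(path) as fd:
    after_overwrite = fd.read().splitlines()

print(after_append)
print(after_overwrite)
```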

With this, we will end this topic. Just like the previous topics, we have designed an activity for you to practice your newly acquired skills.

Activity 4: Design Your Own CSV Parser

A CSV file is something you will encounter a lot in your life as a data practitioner. CSV stands for comma-separated values: a file in which tabular data is stored with the fields of each row separated by commas, although other delimiter characters can also be used.

In this activity, we will be tasked with building our own CSV reader and parser. Although it is a big task if we try to cover all use cases and edge cases, along with escape characters and all, for the sake of this small activity, we will keep our requirements small. We will assume that there is no escape character, meaning that a comma anywhere in a row starts a new column. We will also assume that the only functionality we need is reading a CSV file line by line, where each read generates a new dict with the column names as keys and the values of that row as values.

Here is an example:

Figure 2.8 Table with sample data

We can convert the data in the preceding table into a Python dictionary, which would look as follows: {"Name": "Bob", "Age": "24", "Location": "California"}:

  1. Import zip_longest from itertools. Create a function that zips the header and a line together, using fillvalue=None.
  2. Open the accompanying sales_record.csv file from the GitHub link by using r mode inside a with block and first check that it is opened.
  3. Read the first line and use string methods to generate a list of all the column names.
  4. Start reading the file. Read it line by line.
  5. Read each line and pass that line to a function, along with the list of the headers. The work of the function is to construct a dict out of these two and fill up the key:values. Keep in mind that a missing value should result in None.

    Note

    The solution for this activity can be found on page 291.
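As a rough illustration of the zip_longest idea in the steps above, and not the book's own solution, the sketch below parses a few lines of in-memory sample data. The function name row_to_dict and the sample rows are hypothetical, and io.StringIO stands in for the opened sales_record.csv file.

```python
import io
from itertools import zip_longest


def row_to_dict(header, line):
    """Pair column names with the values of one row (hypothetical helper)."""
    # No escape characters are assumed, so every comma starts a new column.
    values = line.rstrip("\n").split(",")
    # zip_longest pads the shorter sequence, so a row with fewer fields
    # than the header gets None for the missing trailing columns.
    return dict(zip_longest(header, values, fillvalue=None))


# Sample data made up for this sketch; the second row lacks a Location.
sample = "Name,Age,Location\nBob,24,California\nAlice,31\n"

with io.StringIO(sample) as fd:
    # The first line holds the column names.
    header = fd.readline().rstrip("\n").split(",")
    # Every remaining line becomes one dict.
    rows = [row_to_dict(header, line) for line in fd]

for row in rows:
    print(row)
```

Note that this sketch only handles fields missing from the end of a row. An empty field in the middle of a row yields an empty string, and a row with more fields than the header would produce a None key, so both cases need extra handling in a fuller parser.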
