- Python: Advanced Predictive Analytics
- Ashish Kumar, Joseph Babcock
The read_csv method
The name of this method doesn't reveal its full power. It is something of a misnomer, in that it suggests the method can read only CSV files, which is not the case. It can read many kinds of files, including .txt files with delimiters of various kinds.
Let's learn a little more about the various arguments of this method in order to assess its true potential. Although the read_csv method has several dozen arguments, the ones listed in the next section are the most commonly used.
The general form of a read_csv statement looks something like this:
pd.read_csv(filepath, sep=',', dtype=None, header='infer', skiprows=None, index_col=None, skip_blank_lines=True, na_filter=True)
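As a quick illustration before going argument by argument, here is a minimal sketch with every argument left at its default. An io.StringIO object stands in for a file path, since read_csv accepts any file-like object; the column names and values are invented for the example.

```python
import io
import pandas as pd

# A small comma-separated dataset; in practice you would pass a file path,
# for example pd.read_csv('C:/data/sales.csv') -- note the forward slashes.
data = io.StringIO("name,age\nAlice,34\nBob,29\n")

df = pd.read_csv(data)      # all arguments left at their defaults
print(df.shape)             # 2 rows, 2 columns
print(list(df.columns))     # the header row is inferred automatically
```

With the defaults, the first line is treated as the header and the comma as the delimiter, so no arguments beyond the file itself are needed.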
Now, let us understand the significance and usage of each of these arguments one by one:
- filepath: The complete address of the dataset or file that you are trying to read. The complete address includes the directory in which the file is stored and the full name of the file with its extension. Remember to use a forward slash (/) in the directory address. Later in this chapter, we will see that filepath can be a URL as well.
- sep: The delimiter of the dataset being read. By default, the method assumes the delimiter is a comma (,). Other commonly used delimiters are the blank space ( ) and the tab (\t), giving space-delimited and tab-delimited datasets, and the pipe (|). This argument also accepts regular expressions as a value.
- dtype: Sometimes certain columns of the dataset need to be converted to another type before certain operations can be applied. One example is date variables: very often they are read as strings and need to be converted to a date type before date-related operations can be used. The dtype argument specifies the data types of the columns of the dataset. Suppose two columns, a and b, need to be formatted to the types int32 and float64; this can be achieved by passing {'a': np.int32, 'b': np.float64} as the value of dtype. If not specified, the columns are left in the format in which they were originally parsed.
- header: The value of the header argument can be an integer or a list. Most of the time, datasets have a header row containing the column names, and this argument specifies which row to use as the header. By default, the first row is the header, which corresponds to header=0; not specifying the header argument is equivalent to specifying header=0. If one specifies header=None, the method reads the data without treating any row as column names.
- names: The column names for the dataset can be passed in as a list using this argument, which takes lists or arrays as its values. This is very helpful when there are many columns and the column names are already available as a separate list: we pass the list as the value of this argument, and the names in the list are applied to the columns.
- skiprows: The value of the skiprows argument can be an integer or a list. Using this argument, one can skip the given number of rows at the start of the file; for example, skiprows=10 ignores the first 10 rows and reads the data from the 11th row onward.
- index_col: The value of the index_col argument can be an integer or a sequence. By default, no row labels are applied. This argument allows one to use a column as the row labels for the rows in the dataset.
- skip_blank_lines: This argument takes Boolean values only. If its value is True (the default), blank lines are skipped rather than interpreted as rows of NaN (Not a Number, the marker for missing values; we shall discuss these in detail soon).
- na_filter: This argument takes Boolean values only. If True (the default), it detects the markers for missing values (empty strings and NA values) and converts them to NaN; setting it to False skips that detection, which can make a significant difference in speed while importing large datasets known to contain no missing values.