- Python: Advanced Predictive Analytics
- Ashish Kumar, Joseph Babcock
- 2021-07-02 20:09:21
The read_csv method
The name of the method does not reveal its full power. It is something of a misnomer, because it suggests that the method can read only CSV files, which is not the case. Many kinds of files, including .txt files with delimiters of various kinds, can be read using this method.
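As a minimal sketch of this point, the snippet below reads the same small table from tab-, pipe-, and space-delimited text. The data and variable names are made up for illustration, and io.StringIO stands in for files on disk so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical in-memory text standing in for .txt files on disk.
tab_text = "name\tage\nAna\t31\nBen\t45\n"
pipe_text = "name|age\nAna|31\nBen|45\n"
space_text = "name   age\nAna  31\nBen    45\n"

# The sep argument tells read_csv which delimiter to split on.
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")

# sep can also be a regular expression: here, one or more whitespace characters.
space_df = pd.read_csv(io.StringIO(space_text), sep=r"\s+")

print(tab_df)
```

All three calls should produce the same two-column data frame; only the value of sep changes.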
Let's learn a little more about the arguments of this method in order to assess its true potential. Although the read_csv method has dozens of arguments (the exact number depends on the pandas version), the ones listed in the next section are the most commonly used.
The general form of a read_csv statement is something similar to:
pd.read_csv(filepath, sep=',', dtype=None, header='infer', names=None, skiprows=None, index_col=None, skip_blank_lines=True, na_filter=True)
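The argument list and its defaults differ somewhat between pandas versions. One way to check what the installed version uses, sketched here with the standard inspect module, is to look at the signature of read_csv directly:

```python
import inspect

import pandas as pd

# Collect the parameters of read_csv as defined by the installed pandas version.
params = inspect.signature(pd.read_csv).parameters

print(len(params), "arguments in pandas", pd.__version__)

# Show the defaults of a few of the arguments discussed in this section.
for name in ("header", "skip_blank_lines", "na_filter"):
    print(name, "=", repr(params[name].default))
```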
Now, let us understand the significance and usage of each of these arguments one by one:
filepath: filepath is the complete address of the dataset or file that you are trying to read. The complete address includes the directory in which the file is stored and the full name of the file with its extension. Remember to use a forward slash (/) in the directory address. Later in this chapter, we will see that the filepath can be a URL as well.

sep: sep allows us to specify the delimiter of the dataset to be read. By default, the method assumes that the delimiter is a comma (,). Other commonly used delimiters are blank spaces ( ), tabs (\t), and pipes (|); datasets using the first two are called space-delimited and tab-delimited datasets. This argument also accepts a regular expression as its value.

dtype: Sometimes certain columns of the dataset need to be read as a particular type in order to apply certain operations successfully. The dtype argument specifies the data type of the columns of the dataset. Suppose two columns, a and b, need to be formatted to the types int32 and float64; this can be achieved by passing {'a': np.int32, 'b': np.float64} as the value of dtype. If dtype is not specified, the method infers a type for each column from its contents. Date variables are a related case: very often they are read in as strings and need to be converted to a date type before date-related operations can be applied to them (read_csv has a separate parse_dates argument for that conversion).

header: The value of the header argument can be an integer or a list of integers. Most of the time, datasets have a header containing the column names. The header argument specifies which row is to be used as the header. By default, the first row is the header, which corresponds to header=0. As long as names is not passed, leaving out the header argument is as good as specifying header=0. If one specifies header=None, the method reads the data without treating any row as a header containing the column names.

names: The column names of a dataset can be passed as a list using this argument. It takes lists or arrays as its values. This argument is very helpful when there are many columns and the column names are available separately as a list. We can pass the list of column names as the value of this argument, and the column names in the list will be applied.

skiprows: The value of the skiprows argument can be an integer or a list. With an integer, the given number of rows at the start of the file is skipped; for example, skiprows=10 will read in the data from the 11th row, and the rows before that will be ignored. With a list, the rows at the listed (0-based) positions are skipped.

index_col: The value of the index_col argument can be an integer or a sequence. By default, no column is used for row labels, and the rows are simply numbered from 0. This argument allows one to use a column as the row labels for the rows in a dataset.

skip_blank_lines: The skip_blank_lines argument takes Boolean values only. If its value is True, blank lines are skipped rather than interpreted as NaN values (Not a Number, the marker pandas uses for missing values; we shall discuss them in detail soon). By default, its value is set to True.

na_filter: The na_filter argument takes Boolean values only. When it is True, which is the default, the method detects the markers for missing values (empty strings and NA values) and reads them as NaN. Setting it to False turns this detection off, which can make a significant difference in speed while importing large datasets that are known to contain no missing values.