Working interactively with IPython

In this section, we introduce IPython, an interactive Python console: a command-line shell that lets us explore concepts and methods interactively.

To run IPython, call it from the command line by typing ipython.

On startup, IPython prints a banner with some initial quick help. The most interesting part is the last line, the In [1]: prompt, where you can import libraries, execute commands, and see the resulting objects. An additional and convenient feature of IPython is that you can redefine variables on the fly to see how the results differ with different inputs.

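In the current examples, we are using the standard Python version (Python 2) for the most supported Linux distribution at the time of writing (Ubuntu 16.04). The examples should carry over to Python 3 with minor changes: print is a function there and must be called with parentheses, and the u prefix no longer appears in the displayed strings.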

First of all, let's import pandas and load a sample .csv file (a very common format with one record per line and fields separated by commas). It contains a very famous dataset for classification problems: four measurements (sepal and petal length and width) for 150 instances of iris plants, plus a column indicating which of the three classes each instance belongs to:

In [1]: import pandas as pd #Import the pandas library with pd alias

In this line, we import pandas in the usual way, making its methods available through the import statement. The as modifier gives us a succinct name, pd, for all the objects and methods in the library:

In [2]: df = pd.read_csv("data/iris.csv") # Load the iris data as a DataFrame

In this line, we use the read_csv method, relying on its default item separator (a comma) for the .csv file, and store the result in a DataFrame object.
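As a minimal sketch of this loading step, assuming Python 3 with pandas installed, the snippet below reads a small in-memory sample instead of data/iris.csv. The StringIO buffer and the name sample_df are illustrative stand-ins, and the three rows are the first iris records shown in this section. The separator is passed explicitly only to make the default visible.

```python
import io

import pandas as pd

# Illustrative stand-in for data/iris.csv: a header plus the first three records.
sample = io.StringIO(
    "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)

# sep="," is already the default; it is spelled out here for clarity.
sample_df = pd.read_csv(sample, sep=",")

print(sample_df.shape)
print(list(sample_df.columns))
```

For a file that uses another delimiter, such as a semicolon or a tab, the same sep argument would carry that character instead.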

Let's perform some simple exploration of the dataset:

In [3]: df.columns
Out[3]:
Index([u'Sepal.Length', u'Sepal.Width', u'Petal.Length', u'Petal.Width',
       u'Species'],
      dtype='object')

In [4]: df.head(3)
Out[4]:
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa

We are now able to see the column names of the dataset and explore its first n instances (here, n = 3). Looking at these first records, you can see the varying measurements for the setosa iris class.

Now, let's access a single column by name and display its first three elements:

In [19]: df[u'Sepal.Length'].head(3)
Out[19]:
0 5.1
1 4.9
2 4.7
Name: Sepal.Length, dtype: float64
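
The same bracket syntax accepts a list of names when more than one column is needed. The following is a short sketch, assuming Python 3 with pandas installed; iris_head is a hypothetical, hand-built stand-in for df that holds the first three records shown in this section.

```python
import pandas as pd

# Hypothetical stand-in for df: the first three iris records from this section.
iris_head = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Species": ["setosa", "setosa", "setosa"],
})

# A single name returns a Series; a list of names returns a DataFrame.
lengths = iris_head["Sepal.Length"]
sepals = iris_head[["Sepal.Length", "Sepal.Width"]]

print(type(lengths).__name__)
print(sepals.head(2))
```

Note the double brackets in the second selection: the outer pair indexes the DataFrame, and the inner pair is an ordinary Python list of column names.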

Pandas includes many related methods for importing other data formats, such as HDF5 (read_hdf), JSON (read_json), and Excel (read_excel). For a complete list of formats, visit http://pandas.pydata.org/pandas-docs/stable/io.html.
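
These readers follow the same pattern as read_csv. Here is a minimal sketch for read_json, assuming Python 3 with pandas installed; the JSON text is a hypothetical in-memory sample built from the first iris records, not a file that accompanies this section.

```python
import io

import pandas as pd

# Hypothetical JSON sample: a list of records, one object per iris instance.
json_text = (
    '[{"Sepal.Length": 5.1, "Species": "setosa"},'
    ' {"Sepal.Length": 4.9, "Species": "setosa"}]'
)

# Wrapping the text in StringIO lets read_json treat it like an open file.
json_df = pd.read_json(io.StringIO(json_text))

print(json_df.dtypes)
```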

In addition to these simple exploration methods, we will now use pandas to compute the descriptive statistics we've seen so far, in order to characterize the distribution of the Sepal.Length column:

# Describe the sepal length column
print("Mean: " + str(df[u'Sepal.Length'].mean()))
print("Standard deviation: " + str(df[u'Sepal.Length'].std()))
print("Kurtosis: " + str(df[u'Sepal.Length'].kurtosis()))
print("Skewness: " + str(df[u'Sepal.Length'].skew()))

And here are the main metrics of this distribution:

Mean: 5.84333333333
Standard deviation: 0.828066127978
Kurtosis: -0.552064041316
Skewness: 0.314910956637
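
To make explicit what these four numbers measure, here is a pure-Python sketch of the underlying formulas. It uses the bias-corrected sample estimators (standard deviation with n - 1, adjusted skewness, and excess kurtosis, for which a normal distribution scores 0), which we assume match what pandas computes. The function name shape_metrics and the toy lists are illustrative and are not part of the iris example.

```python
import math


def shape_metrics(values):
    """Return (mean, sample std, adjusted skewness, excess kurtosis).

    Needs at least four values, because the kurtosis correction
    divides by (n - 2) * (n - 3).
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments, each divided by n.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    # Sample standard deviation (divides by n - 1).
    std = math.sqrt(m2 * n / (n - 1))
    # Uncorrected skewness and excess kurtosis.
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3.0
    # Small-sample corrections.
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return mean, std, skew, kurt


symmetric = shape_metrics([1, 2, 3, 4, 5])
long_right_tail = shape_metrics([1, 1, 2, 2, 3, 10])

print(symmetric)
print(long_right_tail)
```

The symmetric list has zero skewness, while the list with one large value on the right has positive skewness, which is the same sign we obtained for Sepal.Length.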

Now we will check these metrics graphically by looking at the histogram of this distribution, this time using the plot.hist method built into pandas:

#Plot the data histogram to illustrate the measures
import matplotlib.pyplot as plt
%matplotlib inline
df[u'Sepal.Length'].plot.hist()

Figure: Histogram of the Iris Sepal Length

As the metrics show, the distribution is right skewed, because the skewness is positive. It is also platykurtic (flatter than a normal distribution, with lighter tails), as the negative kurtosis value indicates; pandas reports excess kurtosis, for which a normal distribution scores 0.
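
This reading of the signs can be sketched as a small helper. It is only an illustration: the function name describe_shape and the tol parameter are hypothetical, and the two input values are the skewness and kurtosis printed earlier in this section.

```python
def describe_shape(skewness, excess_kurtosis, tol=0.0):
    """Turn the signs of the two shape metrics into descriptive labels."""
    if skewness > tol:
        skew_label = "right-skewed"
    elif skewness < -tol:
        skew_label = "left-skewed"
    else:
        skew_label = "symmetric"

    if excess_kurtosis > tol:
        kurt_label = "leptokurtic"   # heavier tails than a normal distribution
    elif excess_kurtosis < -tol:
        kurt_label = "platykurtic"   # lighter tails than a normal distribution
    else:
        kurt_label = "mesokurtic"    # tails comparable to a normal distribution
    return skew_label, kurt_label


# Skewness and kurtosis of Sepal.Length, as printed earlier in this section.
labels = describe_shape(0.314910956637, -0.552064041316)
print(labels)
```

In practice, a small positive tol avoids labelling values that are very close to zero as skewed or heavy-tailed.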
