
Putting data together

Finally, pandas DataFrames can be created by merging Series or other list-like data. Note that scalar values are broadcast to fill an entire column, as follows:

In: import pandas as pd
    my_own_dataset = pd.DataFrame({'Col1': range(5),
                                   'Col2': [1.0] * 5,
                                   'Col3': 1.0,
                                   'Col4': 'Hello World!'})
    my_own_dataset

Here is the output for my_own_dataset:
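
Out:
   Col1  Col2  Col3          Col4
0     0   1.0   1.0  Hello World!
1     1   1.0   1.0  Hello World!
2     2   1.0   1.0  Hello World!
3     3   1.0   1.0  Hello World!
4     4   1.0   1.0  Hello World!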

For each of the columns you want stacked together, you provide its name (as the dictionary key) and its values (as the dictionary value for that key). As seen in the preceding example, Col2 and Col3 are created in two different ways, but they provide the same resulting column of values. In this way, you can create a pandas DataFrame that contains multiple types of data with a very simple function.

In this process, please ensure that you don't mix lists of different sizes; otherwise, an exception will be raised, as shown here:

In: my_wrong_own_dataset = pd.DataFrame({'Col1': range(5),
                                         'Col2': 'string',
                                         'Col3': range(2)})

Out: ...
ValueError: arrays must all be same length

In order to assemble entire, already existing DataFrames, you have to use a different approach based on concatenation. The pandas package offers the concat function, which operates on pandas data structures (Series and DataFrames) by stacking rows when working on axis 0 (the default option) or stacking columns when concatenating on axis 1:

In: col5 = pd.Series([4, 3, 2, 1, 0])
    col6 = pd.Series([0, 0, 1, 1, 1])
    a_new_dataset = pd.concat([col5, col6], axis=1,
                              keys=['Col5', 'Col6'])

    my_new_dataset = pd.concat([my_own_dataset, a_new_dataset], axis=1)
    my_new_dataset

The resulting my_new_dataset is my_own_dataset with the col5 and col6 series appended as the Col5 and Col6 columns:
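
Out:
   Col1  Col2  Col3          Col4  Col5  Col6
0     0   1.0   1.0  Hello World!     4     0
1     1   1.0   1.0  Hello World!     3     0
2     2   1.0   1.0  Hello World!     2     1
3     3   1.0   1.0  Hello World!     1     1
4     4   1.0   1.0  Hello World!     0     1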

In the preceding example, we created a new DataFrame, a_new_dataset, based on two Series, and we used the keys parameter to name the resulting columns. Note that keys only takes effect if you leave ignore_index at its default value of False; setting ignore_index=True resets the labels along the concatenation axis (the columns, when axis=1), discarding the names you provided. When concatenating on axis 1, the rows are matched according to the indexes of the two Series: by default, you get a new DataFrame based on the union of the two indexes, or on only the index elements that match if you pass join='inner'.

Joining two distinct datasets on the basis of their common index labels is achieved in pd.concat by adding the parameter join='inner', which is roughly equivalent to a SQL inner join on the index (more on the topic of joins will be dealt with after the following example).
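
As a quick sketch (s1 and s2 are ad hoc Series, not part of our running example), an inner concatenation keeps only the index labels present in both objects:

In: s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
    s2 = pd.Series([4, 5, 6], index=[1, 2, 3])
    pd.concat([s1, s2], axis=1, join='inner')

Out:
   0  1
1  2  4
2  3  5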

Matching based on indexes may sometimes not be enough for your needs. Sometimes, you may need to match different Series or DataFrames on the values of one or more specific columns. In that case, you need the merge method, which is available on every DataFrame.

In order to see the merge method in action, we will create a reference table containing some values to be matched based on Col5:

In: key = pd.Series([1, 2, 4])
    value = pd.Series(['alpha', 'beta', 'gamma'])
    reference_table = pd.concat([key, value], axis=1,
                                keys=['Col5', 'Col7'])
    reference_table

Here is the concatenation between key and value into a DataFrame:
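
Out:
   Col5   Col7
0     1  alpha
1     2   beta
2     4  gamma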

The merge is performed by setting the how parameter to left, thus achieving the equivalent of a SQL left outer join. Apart from left, the other possible settings of this parameter are as follows:

  • right: Equivalent to a SQL right outer join
  • outer: Equivalent to a SQL full outer join
  • inner: Equivalent to a SQL inner join (as previously mentioned)

In: my_new_dataset.merge(reference_table, on='Col5', how='left')

The resulting DataFrame is a left outer join:
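
Out:
   Col1  Col2  Col3          Col4  Col5  Col6   Col7
0     0   1.0   1.0  Hello World!     4     0  gamma
1     1   1.0   1.0  Hello World!     3     0    NaN
2     2   1.0   1.0  Hello World!     2     1   beta
3     3   1.0   1.0  Hello World!     1     1  alpha
4     4   1.0   1.0  Hello World!     0     1    NaN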

Getting back to our initial my_own_dataset, in order to check the type of data present in each column, you can inspect the dtypes attribute:

In: my_own_dataset.dtypes

Out: Col1      int64
     Col2    float64
     Col3    float64
     Col4     object
     dtype: object

The dtypes attribute is very handy if you wish to check whether a datum is categorical, integer numerical, or floating point, and to see its precision. In fact, sometimes it is possible to increase the processing speed by rounding floats to integers, casting double-precision floats to single-precision floats, or using only a single type of data. Let's see how you can cast a column's type in the following example, which can also be read as a broad example of how to reassign column data:

In: my_own_dataset['Col1'] = my_own_dataset['Col1'].astype(float)
    my_own_dataset.dtypes

Out: Col1    float64
     Col2    float64
     Col3    float64
     Col4     object
     dtype: object
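
If single precision is enough for your data, a minimal sketch of such a down-cast (here only inspecting the resulting dtype, without reassigning the column) looks like this:

In: my_own_dataset['Col2'].astype('float32').dtype

Out: dtype('float32')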

You can also obtain information about your DataFrame's structure and data types by using the info() method, as shown in the following example (the exact layout and memory figure depend on your pandas version):
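
In: my_own_dataset.info()

Out: <class 'pandas.core.frame.DataFrame'>
     RangeIndex: 5 entries, 0 to 4
     Data columns (total 4 columns):
      #   Column  Non-Null Count  Dtype
     ---  ------  --------------  -----
      0   Col1    5 non-null      float64
      1   Col2    5 non-null      float64
      2   Col3    5 non-null      float64
      3   Col4    5 non-null      object
     dtypes: float64(3), object(1)
     memory usage: 288.0+ bytes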