Skip to content

================ by Jawad Haider

02 - DataFrames


Image
Copyright Qalmaqihir
For more information, visit us at www.github.com/qalmaqihir/



DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!

import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Selection and Indexing

Let’s learn the various methods to grab data from a DataFrame

df['W']
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
# Pass a list of column names
df[['W','Z']]
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509
# SQL Syntax (NOT RECOMMENDED!)
df.W
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

DataFrame Columns are just Series

type(df['W'])
pandas.core.series.Series

Creating a new column:

df['new'] = df['W'] + df['Y']
df
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762

Removing Columns

df.drop('new',axis=1)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
# Not inplace unless specified!
df
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
df.drop('new',axis=1,inplace=True)
df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Can also drop rows this way:

df.drop('E',axis=0)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057

Selecting Rows

df.loc['A']
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based off of position instead of label

df.iloc[2]
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Selecting subset of rows and columns

df.loc['B','Y']
-0.8480769834036315
df.loc[['A','B'],['W','Y']]
W Y
A 2.706850 0.907969
B 0.651118 -0.848077

Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df>0
W X Y Z
A True True True True
B True False False True
C False True True False
D True False False True
E True True True True
df[df>0]
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 NaN NaN 0.605965
C NaN 0.740122 0.528813 NaN
D 0.188695 NaN NaN 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]['Y']
A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757

For two conditions you can use | and & with parenthesis:

df[(df['W']>0) & (df['Y'] > 1)]
W X Y Z
E 0.190794 1.978757 2.605967 0.683509

More Index Details

Let’s discuss some more features of indexing, including resetting the index or setting it something else. We’ll also talk about index hierarchy!

df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
# Reset to default 0,1...n index
df.reset_index()
index W X Y Z
0 A 2.706850 0.628133 0.907969 0.503826
1 B 0.651118 -0.319318 -0.848077 0.605965
2 C -2.018168 0.740122 0.528813 -0.589001
3 D 0.188695 -0.758872 -0.933237 0.955057
4 E 0.190794 1.978757 2.605967 0.683509
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States')
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
df
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States',inplace=True)
df
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509

DataFrame Summaries

There are a couple of ways to obtain summary data on DataFrames.
df.describe() provides summary statistics on all numerical columns.
df.info and df.dtypes displays the data type of all columns.

df.describe()
W X Y Z
count 5.000000 5.000000 5.000000 5.000000
mean 0.343858 0.453764 0.452287 0.431871
std 1.681131 1.061385 1.454516 0.594708
min -2.018168 -0.758872 -0.933237 -0.589001
25% 0.188695 -0.319318 -0.848077 0.503826
50% 0.190794 0.628133 0.528813 0.605965
75% 0.651118 0.740122 0.907969 0.683509
max 2.706850 1.978757 2.605967 0.955057
df.dtypes
W    float64
X    float64
Y    float64
Z    float64
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, CA to CO
Data columns (total 4 columns):
W    5 non-null float64
X    5 non-null float64
Y    5 non-null float64
Z    5 non-null float64
dtypes: float64(4)
memory usage: 200.0+ bytes

Great Job! Thats the end of this part.

Don't forget to give a star on github and follow for more curated Computer Science, Machine Learning materials