================
by Jawad Haider
02 - DataFrames
Copyright Qalmaqihir
For more information, visit us at
www.github.com/qalmaqihir/
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the
R programming language. We can think of a DataFrame as a bunch of Series
objects put together to share the same index. Let’s use pandas to
explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np . random . seed ( 101 )
df = pd . DataFrame ( randn ( 5 , 4 ), index = 'A B C D E' . split (), columns = 'W X Y Z' . split ())
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
Selection and Indexing
Let’s learn the various methods to grab data from a DataFrame
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
# Pass a list of column names
df [[ 'W' , 'Z' ]]
W
Z
A
2.706850
0.503826
B
0.651118
0.605965
C
-2.018168
-0.589001
D
0.188695
0.955057
E
0.190794
0.683509
# SQL Syntax (NOT RECOMMENDED!)
df . W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
DataFrame Columns are just Series
pandas.core.series.Series
Creating a new column:
df [ 'new' ] = df [ 'W' ] + df [ 'Y' ]
W
X
Y
Z
new
A
2.706850
0.628133
0.907969
0.503826
3.614819
B
0.651118
-0.319318
-0.848077
0.605965
-0.196959
C
-2.018168
0.740122
0.528813
-0.589001
-1.489355
D
0.188695
-0.758872
-0.933237
0.955057
-0.744542
E
0.190794
1.978757
2.605967
0.683509
2.796762
Removing Columns
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
# Not inplace unless specified!
df
W
X
Y
Z
new
A
2.706850
0.628133
0.907969
0.503826
3.614819
B
0.651118
-0.319318
-0.848077
0.605965
-0.196959
C
-2.018168
0.740122
0.528813
-0.589001
-1.489355
D
0.188695
-0.758872
-0.933237
0.955057
-0.744542
E
0.190794
1.978757
2.605967
0.683509
2.796762
df . drop ( 'new' , axis = 1 , inplace = True )
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
Can also drop rows this way:
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
Selecting Rows
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
Or select based off of position instead of label
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
Selecting subset of rows and columns
df . loc [[ 'A' , 'B' ],[ 'W' , 'Y' ]]
W
Y
A
2.706850
0.907969
B
0.651118
-0.848077
Conditional Selection
An important feature of pandas is conditional selection using bracket
notation, very similar to numpy:
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
W
X
Y
Z
A
True
True
True
True
B
True
False
False
True
C
False
True
True
False
D
True
False
False
True
E
True
True
True
True
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
NaN
NaN
0.605965
C
NaN
0.740122
0.528813
NaN
D
0.188695
NaN
NaN
0.955057
E
0.190794
1.978757
2.605967
0.683509
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
Y
X
A
0.907969
0.628133
B
-0.848077
-0.319318
D
-0.933237
-0.758872
E
2.605967
1.978757
For two conditions you can use | and & with parenthesis:
df [( df [ 'W' ] > 0 ) & ( df [ 'Y' ] > 1 )]
W
X
Y
Z
E
0.190794
1.978757
2.605967
0.683509
More Index Details
Let’s discuss some more features of indexing, including resetting the
index or setting it something else. We’ll also talk about index
hierarchy!
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
# Reset to default 0,1...n index
df . reset_index ()
index
W
X
Y
Z
0
A
2.706850
0.628133
0.907969
0.503826
1
B
0.651118
-0.319318
-0.848077
0.605965
2
C
-2.018168
0.740122
0.528813
-0.589001
3
D
0.188695
-0.758872
-0.933237
0.955057
4
E
0.190794
1.978757
2.605967
0.683509
newind = 'CA NY WY OR CO' . split ()
W
X
Y
Z
States
A
2.706850
0.628133
0.907969
0.503826
CA
B
0.651118
-0.319318
-0.848077
0.605965
NY
C
-2.018168
0.740122
0.528813
-0.589001
WY
D
0.188695
-0.758872
-0.933237
0.955057
OR
E
0.190794
1.978757
2.605967
0.683509
CO
W
X
Y
Z
States
CA
2.706850
0.628133
0.907969
0.503826
NY
0.651118
-0.319318
-0.848077
0.605965
WY
-2.018168
0.740122
0.528813
-0.589001
OR
0.188695
-0.758872
-0.933237
0.955057
CO
0.190794
1.978757
2.605967
0.683509
W
X
Y
Z
States
A
2.706850
0.628133
0.907969
0.503826
CA
B
0.651118
-0.319318
-0.848077
0.605965
NY
C
-2.018168
0.740122
0.528813
-0.589001
WY
D
0.188695
-0.758872
-0.933237
0.955057
OR
E
0.190794
1.978757
2.605967
0.683509
CO
df . set_index ( 'States' , inplace = True )
W
X
Y
Z
States
CA
2.706850
0.628133
0.907969
0.503826
NY
0.651118
-0.319318
-0.848077
0.605965
WY
-2.018168
0.740122
0.528813
-0.589001
OR
0.188695
-0.758872
-0.933237
0.955057
CO
0.190794
1.978757
2.605967
0.683509
DataFrame Summaries
There are a couple of ways to obtain summary data on DataFrames.
df.describe() provides summary statistics on
all numerical columns. df.info and
df.dtypes displays the data type of all columns.
W
X
Y
Z
count
5.000000
5.000000
5.000000
5.000000
mean
0.343858
0.453764
0.452287
0.431871
std
1.681131
1.061385
1.454516
0.594708
min
-2.018168
-0.758872
-0.933237
-0.589001
25%
0.188695
-0.319318
-0.848077
0.503826
50%
0.190794
0.628133
0.528813
0.605965
75%
0.651118
0.740122
0.907969
0.683509
max
2.706850
1.978757
2.605967
0.955057
W float64
X float64
Y float64
Z float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, CA to CO
Data columns (total 4 columns):
W 5 non-null float64
X 5 non-null float64
Y 5 non-null float64
Z 5 non-null float64
dtypes: float64(4)
memory usage: 200.0+ bytes
Great Job! Thats the end of this part.
Don't forget to give a star on github and follow for more curated Computer Science, Machine Learning materials