
by Jawad Haider

Chpt 2 - Data Manipulation with Pandas

02 - Data Indexing and Selection



Data Indexing and Selection

In Chapter 2, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays. These included indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]). Here we’ll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.

Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:

import pandas as pd
data = pd.Series([0.25,0.5,0.75,1.0], index=['a','b','c','d'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
data['b']
0.5
# dictionary-like Python expressions and methods can also be used to examine the keys/indices and values
'a' in data
True
data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

data['e']=1.25
data
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
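
The dictionary analogy can be sketched end to end as follows. This is a minimal illustration that rebuilds the same Series; the label 'z' and the fallback value -1.0 are made up for the example and do not come from the text above.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Assigning to a new label extends the Series, like adding a dict key.
data['e'] = 1.25

# Assigning to an existing label overwrites the stored value.
data['a'] = 0.0

labels = list(data.keys())      # the index labels, in order
has_e = 'e' in data             # membership is tested against the index
missing = data.get('z', -1.0)   # like dict.get: fallback for an absent label

print(labels, has_e, missing)
```

As with a plain dictionary, membership tests and get() look at the keys (the index), not at the stored values.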

Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays: slices, masking, and fancy indexing. Examples of these are as follows:

# slicing by explicit index
data['a':'c']
a    0.25
b    0.50
c    0.75
dtype: float64
# slicing by implicit integer index
data[0:2]
a    0.25
b    0.50
dtype: float64
# masking
data[(data > 0.3) & (data < 0.8)]
b    0.50
c    0.75
dtype: float64
# fancy indexing
data[['a', 'e']]
a    0.25
e    1.25
dtype: float64

Among these, slicing may be the source of the most confusion. Notice that when you slice with an explicit index (e.g., data['a':'c']), the final index is included in the slice, while when you slice with an implicit index (e.g., data[0:2]), the final index is excluded from the slice.
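
The difference between the two slicing conventions can be checked directly. The sketch below rebuilds the five-element Series from the examples above and compares the labels each slice returns; the variable names by_label and by_position are chosen here for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the end label 'c' is included.
by_label = data['a':'c']

# Position-based slice: the end position 2 is excluded, as in plain Python.
by_position = data[0:2]

print(list(by_label.index))     # three labels
print(list(by_position.index))  # two labels
```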

Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1    a
3    b
5    c
dtype: object
# explicit index when indexing
data[1]
'a'
data[3]
'b'
# implicit index when slicing
data[1:3]
3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

# First, the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1]
'a'
data.loc[1:3]
1    a
3    b
dtype: object
# Second, the iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1]
'b'
data.iloc[0]
'a'
data.loc[0]
KeyError: 0
data.iloc[1:3]
3    b
5    c
dtype: object
data.loc[1:3]
1    a
3    b
dtype: object
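
To summarize the contrast, the following sketch runs the same integer-indexed Series through both indexers side by side. The variable names are illustrative only; the behavior shown is that loc always uses labels (with an inclusive end) and iloc always uses positions (with an exclusive end).

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = data.loc[1]       # label 1
position_item = data.iloc[1]   # position 1, i.e. the second element

label_slice = data.loc[1:3]    # labels 1 through 3, end label included
position_slice = data.iloc[1:3]  # positions 1 and 2, end position excluded

# There is no label 0, so loc raises KeyError where iloc[0] would succeed.
try:
    data.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(label_item, position_item, label_zero_found)
```

Using loc and iloc explicitly, rather than bare square brackets, keeps code on integer-indexed data unambiguous to read.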

Data Selection in DataFrame

DataFrame as a Dictionary

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
data['area']
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
data.area
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
data.area is data['area']
True
data.pop is data['pop']  # data.pop refers to the DataFrame.pop() method, not to the 'pop' column
False
data['density'] = data['pop'] / data['area']
data
              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
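
The name clash between the 'pop' column and the pop() method can be made explicit with a short check. This is a sketch that rebuilds the same DataFrame; the variable names pop_attr_is_method and pop_column are introduced here for illustration.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})

# Attribute access resolves to the DataFrame.pop method, not the column.
pop_attr_is_method = callable(data.pop)

# Dictionary-style access always returns the column.
pop_column = data['pop']

# Dictionary-style assignment is also the way to add a new column.
data['density'] = data['pop'] / data['area']

print(pop_attr_is_method, type(pop_column).__name__, list(data.columns))
```

For this reason, dictionary-style access such as data['pop'] is the safer habit whenever a column name might collide with an existing DataFrame attribute or method.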

DataFrame as 2D array

data.values
array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
# Transposing the values
data.T
           California         Texas      New York       Florida      Illinois
area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01
data.values[0]
array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
data['area']
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
data.iloc[:3,:2]
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
data.loc[:'Illinois',:'pop']
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
data.loc[data.density > 100, ['pop', 'density']]
               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
data.iloc[0, 2] = 90
data
              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
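
The array-style selections above can be collected into one self-contained sketch. It rebuilds the same DataFrame and applies position-based slicing, label-based slicing, a combined mask with fancy indexing, and a single-cell assignment; the result variable names are illustrative.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# Position-based: first three rows, first two columns (ends excluded).
first_block = data.iloc[:3, :2]

# Label-based: rows up to and including 'Illinois', columns up to 'pop'.
up_to_illinois = data.loc[:'Illinois', :'pop']

# Boolean mask on rows combined with a list of column labels.
dense = data.loc[data['density'] > 100, ['pop', 'density']]

# Any of these indexers can also be used to set values.
data.iloc[0, 2] = 90

print(first_block.shape, list(dense.index), data.iloc[0, 2])
```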