by Jawad Haider
Chpt 2 - Data Manipulation with Pandas¶
02 - Data Indexing and Selection¶
In the NumPy chapter, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays. These included indexing (e.g., `arr[2, 1]`), slicing (e.g., `arr[:, 1:5]`), masking (e.g., `arr[arr > 0]`), fancy indexing (e.g., `arr[0, [1, 5]]`), and combinations thereof (e.g., `arr[:, [1, 5]]`). Here we’ll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
Data Selection in Series¶
As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.
Series as dictionary¶
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
0.5
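The code cell that produced the output above is not shown. A minimal sketch that is consistent with it, assuming the usual `import pandas as pd` alias and a Series named `data` (both assumptions, chosen to match the later snippets), might look like this:

```python
import pandas as pd

# Assumed construction: four float values labeled 'a' through 'd'
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)

# Dictionary-style lookup: the index label plays the role of the key
print(data['b'])
```

The index labels act as the dictionary keys here, so `data['b']` looks up the value stored under the label `'b'`.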
# We can also use dictionary-like Python expressions and methods to examine the keys/indices and values
'a' in data
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
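The last two outputs above look like the results of `data.keys()` and `list(data.items())`. The calls themselves are not shown, so the following is only a sketch, assuming the same four-element `data` Series:

```python
import pandas as pd

# Assumed Series, matching the values shown earlier
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels, not at the values
has_label = 'a' in data
has_value_as_label = 0.25 in data

# keys() returns the index; items() yields (label, value) pairs
keys = data.keys()
pairs = list(data.items())

print(has_label, has_value_as_label)
print(keys)
print(pairs)
```

As with a plain dictionary, `in` checks the keys only, which is why testing for the value `0.25` does not succeed.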
Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
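A sketch of the assignment that would produce the five-element output above, again assuming the `data` Series from before:

```python
import pandas as pd

# Assumed starting Series
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Assigning to a label that does not exist yet appends a new entry
data['e'] = 1.25
print(data)
```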
Series as one-dimensional array¶
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays, namely slices, masking, and fancy indexing. Examples of these are as follows:
a 0.25
b 0.50
c 0.75
dtype: float64
a 0.25
b 0.50
dtype: float64
b 0.50
c 0.75
dtype: float64
a 0.25
e 1.25
dtype: float64
Among these, slicing may be the source of the most confusion. Notice that when you are slicing with an explicit index (i.e., `data['a':'c']`), the final index is included in the slice, while when you’re slicing with an implicit index (i.e., `data[0:2]`), the final index is excluded from the slice.
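The four outputs in this section can be reproduced with a sketch like the one below. The slicing expressions come from the text. The mask and the fancy index are assumptions chosen to match the printed results.

```python
import pandas as pd

# Assumed Series, including the 'e' entry added earlier
data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Slicing by explicit index: the end label 'c' is included
by_label = data['a':'c']

# Slicing by implicit integer position: position 2 is excluded
by_position = data[0:2]

# Masking: keep entries whose value lies strictly between 0.3 and 0.8
masked = data[(data > 0.3) & (data < 0.8)]

# Fancy indexing: select an explicit list of labels
fancy = data[['a', 'e']]

print(by_label, by_position, masked, fancy, sep='\n\n')
```

Comparing `by_label` with `by_position` makes the difference visible: the label slice has three entries, the positional slice only two.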
Indexers: loc and iloc¶
These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as `data[1]` will use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit Python-style index.
1 a
3 b
5 c
dtype: object
'a'
'b'
3 b
5 c
dtype: object
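A sketch of the integer-index case described above. The Series construction is an assumption inferred from the printed output, and the plain-bracket behavior is the one the text describes, which may differ between pandas versions:

```python
import pandas as pd

# Assumed Series with an explicit integer index that does not start at 0
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Plain indexing uses the explicit label 1
by_label = data[1]

# Plain slicing uses implicit positions 1 and 2, i.e. labels 3 and 5
by_position = data[1:3]

print(by_label)
print(by_position)
```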
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
# First, the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1]
'a'
1 a
3 b
dtype: object
# The iloc attribute allows indexing and slicing that always references the implicit Python-style index
data.iloc[1]
'b'
'a'
KeyError: 0
3 b
5 c
dtype: object
1 a
3 b
dtype: object
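Putting the two indexers side by side makes the distinction easier to see. This is a sketch on the same assumed integer-indexed Series. The exact cells behind some of the outputs above are not shown, so the calls here are illustrative rather than a reconstruction.

```python
import pandas as pd

# Assumed Series with explicit integer labels 1, 3, 5
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit labels; label slices include the end label
loc_item = data.loc[1]
loc_slice = data.loc[1:3]

# iloc always uses implicit positions; position slices exclude the end
iloc_item = data.iloc[1]
iloc_slice = data.iloc[1:3]

# There is no label 0, so loc raises KeyError, while position 0 exists
try:
    data.loc[0]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True
first_by_position = data.iloc[0]

print(loc_item, iloc_item, first_by_position, missing_label_raises)
```

Using `loc` and `iloc` consistently, rather than plain brackets, states in the code itself which of the two schemes is intended.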
Data Selection in DataFrame¶
DataFrame as a Dictionary¶
import pandas as pd

# Two Series sharing the same state labels
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
# Each Series becomes one column of the DataFrame
data = pd.DataFrame({'area': area, 'pop': pop})
data
| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64
True
False
| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Texas | 695662 | 26448193 | 38.018740 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
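The cells behind the outputs in this section are not shown. The sketch below is one plausible reading: dictionary-style and attribute-style column access, followed by adding a `density` column. The DataFrame is built in a compact form for brevity. The sketch compares columns with `equals` rather than an identity check, because whether two accesses return the very same object may depend on the pandas version.

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)

# Dictionary-style and attribute-style access return the same column
by_key = data['area']
by_attr = data.area
same_values = by_attr.equals(by_key)

# Attribute access fails for names that clash with DataFrame methods:
# data.pop is the pop() method, not the 'pop' column
pop_is_method = callable(data.pop)

# Dictionary-style assignment adds a new column
data['density'] = data['pop'] / data['area']
print(data)
```

Because of clashes like `data.pop`, dictionary-style access is the safer habit, especially for assignment.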
DataFrame as 2D array¶
array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
[6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
[1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
[1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
[1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
| | California | Texas | New York | Florida | Illinois |
|---|---|---|---|---|---|
| area | 4.239670e+05 | 6.956620e+05 | 1.412970e+05 | 1.703120e+05 | 1.499950e+05 |
| pop | 3.833252e+07 | 2.644819e+07 | 1.965113e+07 | 1.955286e+07 | 1.288214e+07 |
| density | 9.041393e+01 | 3.801874e+01 | 1.390767e+02 | 1.148061e+02 | 8.588376e+01 |
array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64
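The outputs above appear to come from viewing the DataFrame as a two-dimensional array. A sketch under that assumption, using the `values` attribute, the transpose `T`, and plain indexing (the DataFrame is rebuilt compactly so the block runs on its own):

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)
data['density'] = data['pop'] / data['area']

# The underlying data as a 2D NumPy array (rows are states)
raw = data.values

# Transposing swaps rows and columns
transposed = data.T

# A single index on the raw array selects a row ...
first_row = data.values[0]

# ... while a single index on the DataFrame selects a column
area_column = data['area']

print(raw.shape, transposed.shape)
print(first_row)
```

This row-versus-column difference is the main reason the dictionary analogy does not carry over directly to array-style indexing on a DataFrame.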
| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
| | pop | density |
|---|---|---|
| New York | 19651127 | 139.076746 |
| Florida | 19552860 | 114.806121 |
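The three tables above are consistent with `iloc` and `loc` selections on the DataFrame. The exact expressions are not shown, so the ones below are assumptions chosen to reproduce those results:

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)
data['density'] = data['pop'] / data['area']

# Implicit positions: first three rows, first two columns
by_position = data.iloc[:3, :2]

# Explicit labels: both end labels are included in the slices
by_label = data.loc[:'Illinois', :'pop']

# loc also combines a boolean mask on rows with a list of column labels
dense = data.loc[data.density > 100, ['pop', 'density']]

print(by_position, by_label, dense, sep='\n\n')
```

The same inclusive-versus-exclusive rule seen for Series applies here: `loc` slices include the end label, `iloc` slices exclude the end position.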
| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.000000 |
| Texas | 695662 | 26448193 | 38.018740 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
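In the final table, California's density reads 90.000000 instead of 90.413926, which suggests a single value was overwritten through an indexer. A sketch of one way to do that, assuming positional assignment with `iloc`:

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)
data['density'] = data['pop'] / data['area']

# Row 0 is California, column 2 is density; assignment modifies in place
data.iloc[0, 2] = 90
print(data)
```

The equivalent label-based form would be `data.loc['California', 'density'] = 90`.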