by Jawad Haider
04b - Full Artificial Neural Network Code Along - CLASSIFICATION¶
- Full Artificial Neural Network Code Along - CLASSIFICATION
- Working with tabular data
- Perform standard imports
- Load the NYC Taxi Fares dataset
- Calculate the distance traveled
- Add a datetime column and derive useful statistics
- Separate categorical from continuous columns
- Categorify
- Convert numpy arrays to tensors
- Set an embedding size
- Define a TabularModel
- Define loss function & optimizer
- Perform train/test splits
- Train the model
- Plot the loss function
- Validate the model
- Save the model
- Loading a saved model (starting from scratch)
- Feed new data through the trained model
Full Artificial Neural Network Code Along - CLASSIFICATION¶
In the last section we took in four continuous variables (lengths) to perform a classification. In this section we’ll combine continuous and categorical data to perform a similar classification. The goal is to estimate the relative cost of a New York City cab ride from several inputs. The inspiration behind this code along is a recent Kaggle competition.
Working with tabular data¶
Deep learning with neural networks is often associated with sophisticated image recognition, and in upcoming sections we'll train models based on properties like pixel patterns and colors.
Here we're working with tabular data (spreadsheets, SQL tables, etc.) with columns of values that may or may not be relevant. As it happens, neural networks can learn to make connections we probably wouldn't have developed on our own. However, to do this we have to handle categorical values separately from continuous ones. Make sure to watch the theory lectures! You'll want to be comfortable with:
- continuous vs. categorical values
- embeddings
- batch normalization
- dropout layers
Perform standard imports¶
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Load the NYC Taxi Fares dataset¶
The Kaggle competition provides a dataset with about 55 million records. The data contains only the pickup date & time, the latitude & longitude (GPS coordinates) of the pickup and dropoff locations, and the number of passengers. It is up to the contest participant to extract any further information. For instance, does the time of day matter? The day of the week? How do we determine the distance traveled from pairs of GPS coordinates?
For this exercise we've whittled the dataset down to just 120,000 records from April 11 to April 24, 2010. The records are randomly sorted. We'll show how to calculate distance from GPS coordinates, and how to create a pandas datetime object from a text column. This will let us quickly get information like day of the week, am vs. pm, etc.
Let’s get started!
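A minimal sketch of loading the data, assuming the course CSV is saved as 'NYCTaxiFares.csv' (the filename is a placeholder; point it at your local copy):

df = pd.read_csv('NYCTaxiFares.csv')   # placeholder filename
df.head()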
  | pickup_datetime | fare_amount | fare_class | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
---|---|---|---|---|---|---|---|---|
0 | 2010-04-19 08:17:56 UTC | 6.5 | 0 | -73.992365 | 40.730521 | -73.975499 | 40.744746 | 1 |
1 | 2010-04-17 15:43:53 UTC | 6.9 | 0 | -73.990078 | 40.740558 | -73.974232 | 40.744114 | 1 |
2 | 2010-04-17 11:23:26 UTC | 10.1 | 1 | -73.994149 | 40.751118 | -73.960064 | 40.766235 | 2 |
3 | 2010-04-11 21:25:03 UTC | 8.9 | 0 | -73.990485 | 40.756422 | -73.971205 | 40.748192 | 1 |
4 | 2010-04-17 02:19:01 UTC | 19.7 | 1 | -73.990976 | 40.734202 | -73.905956 | 40.743115 | 1 |
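The class counts shown below presumably come from pandas' value_counts:

df['fare_class'].value_counts()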
0 80000
1 40000
Name: fare_class, dtype: int64
Conveniently, ⅔ of the data have fares under \$10, and ⅓ have fares \$10 and above.
Fare classes correspond to fare amounts as follows:
Class | Values
---|---
0 | < \$10.00
1 | >= \$10.00
Calculate the distance traveled¶
The haversine formula calculates the distance on a sphere between two sets of GPS coordinates.
Here we assign latitude values with $\varphi$ (phi) and longitude with $\lambda$ (lambda).
The distance formula works out to

$$d = 2r\arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2-\varphi_1}{2}\right)+\cos(\varphi_1)\cos(\varphi_2)\sin^2\left(\frac{\lambda_2-\lambda_1}{2}\right)}\right)$$

where

$r$ is the radius of the sphere (here, Earth's average radius of 6371 km),
$\varphi_1, \varphi_2$ are the latitudes of the two points, and
$\lambda_1, \lambda_2$ are the longitudes of the two points.
def haversine_distance(df, lat1, long1, lat2, long2):
"""
Calculates the haversine distance between 2 sets of GPS coordinates in df
"""
r = 6371 # average radius of Earth in kilometers
phi1 = np.radians(df[lat1])
phi2 = np.radians(df[lat2])
delta_phi = np.radians(df[lat2]-df[lat1])
delta_lambda = np.radians(df[long2]-df[long1])
a = np.sin(delta_phi/2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
d = (r * c) # in kilometers
return d
df['dist_km'] = haversine_distance(df,'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude')
df.head()
  | pickup_datetime | fare_amount | fare_class | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | dist_km |
---|---|---|---|---|---|---|---|---|---|
0 | 2010-04-19 08:17:56 UTC | 6.5 | 0 | -73.992365 | 40.730521 | -73.975499 | 40.744746 | 1 | 2.126312 |
1 | 2010-04-17 15:43:53 UTC | 6.9 | 0 | -73.990078 | 40.740558 | -73.974232 | 40.744114 | 1 | 1.392307 |
2 | 2010-04-17 11:23:26 UTC | 10.1 | 1 | -73.994149 | 40.751118 | -73.960064 | 40.766235 | 2 | 3.326763 |
3 | 2010-04-11 21:25:03 UTC | 8.9 | 0 | -73.990485 | 40.756422 | -73.971205 | 40.748192 | 1 | 1.864129 |
4 | 2010-04-17 02:19:01 UTC | 19.7 | 1 | -73.990976 | 40.734202 | -73.905956 | 40.743115 | 1 | 7.231321 |
Add a datetime column and derive useful statistics¶
By creating a datetime object, we can extract information like "day of the week", "am vs. pm" etc. Note that the data was saved in UTC time. Our data falls in April 2010, which is during Daylight Saving Time in New York. For that reason, we'll make an adjustment to EDT using UTC-4 (subtracting four hours).
df['EDTdate'] = pd.to_datetime(df['pickup_datetime'].str[:19]) - pd.Timedelta(hours=4)
df['Hour'] = df['EDTdate'].dt.hour
df['AMorPM'] = np.where(df['Hour']<12,'am','pm')
df['Weekday'] = df['EDTdate'].dt.strftime("%a")
df.head()
  | pickup_datetime | fare_amount | fare_class | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | dist_km | EDTdate | Hour | AMorPM | Weekday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010-04-19 08:17:56 UTC | 6.5 | 0 | -73.992365 | 40.730521 | -73.975499 | 40.744746 | 1 | 2.126312 | 2010-04-19 04:17:56 | 4 | am | Mon |
1 | 2010-04-17 15:43:53 UTC | 6.9 | 0 | -73.990078 | 40.740558 | -73.974232 | 40.744114 | 1 | 1.392307 | 2010-04-17 11:43:53 | 11 | am | Sat |
2 | 2010-04-17 11:23:26 UTC | 10.1 | 1 | -73.994149 | 40.751118 | -73.960064 | 40.766235 | 2 | 3.326763 | 2010-04-17 07:23:26 | 7 | am | Sat |
3 | 2010-04-11 21:25:03 UTC | 8.9 | 0 | -73.990485 | 40.756422 | -73.971205 | 40.748192 | 1 | 1.864129 | 2010-04-11 17:25:03 | 17 | pm | Sun |
4 | 2010-04-17 02:19:01 UTC | 19.7 | 1 | -73.990976 | 40.734202 | -73.905956 | 40.743115 | 1 | 7.231321 | 2010-04-16 22:19:01 | 22 | pm | Fri |
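The two timestamps below appear to be the earliest and latest values of the new EDTdate column, presumably obtained with:

df['EDTdate'].min()
df['EDTdate'].max()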
Timestamp('2010-04-11 00:00:10')
Timestamp('2010-04-24 23:59:42')
Separate categorical from continuous columns¶
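The column listing below presumably comes from:

df.columns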
Index(['pickup_datetime', 'fare_amount', 'fare_class', 'pickup_longitude',
'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
'passenger_count', 'dist_km', 'EDTdate', 'Hour', 'AMorPM', 'Weekday'],
dtype='object')
cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']
y_col = ['fare_class'] # this column contains the labels
Here we entered the continuous columns explicitly because there are columns we're not running through the model (fare_amount and EDTdate). A list comprehension like cont_cols = [col for col in df.columns if col not in cat_cols + y_col] would have pulled those in as well.
Categorify¶
Pandas offers a category dtype for converting categorical values to numerical codes. A dataset containing months of the year will be assigned 12 codes, one for each month. These will usually be the integers 0 to 11. Pandas replaces the column values with codes, and retains an index list of category values. In the steps ahead we’ll call the categorical values “names” and the encodings “codes”.
# Convert our three categorical columns to category dtypes.
for cat in cat_cols:
df[cat] = df[cat].astype('category')
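The dtype listing that follows presumably comes from:

df.dtypes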
pickup_datetime object
fare_amount float64
fare_class int64
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dist_km float64
EDTdate datetime64[ns]
Hour category
AMorPM category
Weekday category
dtype: object
We can see that df['Hour'] is a categorical feature by displaying some of the rows:
0 4
1 11
2 7
3 17
4 22
Name: Hour, dtype: category
Categories (24, int64): [0, 1, 2, 3, ..., 20, 21, 22, 23]
Here our categorical names are the integers 0 through 23, for a total of 24 unique categories. These values also correspond to the codes assigned to each name.
We can access the category names with Series.cat.categories, or just the codes with Series.cat.codes. This will make more sense if we look at df['AMorPM']:
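Here is a sketch of the calls behind the outputs that follow (in the notebook each would be its own cell):

df['AMorPM'].head()             # first five category values
df['AMorPM'].cat.categories     # the category names
df['AMorPM'].cat.codes.head()   # the integer codes
df['Weekday'].cat.categories
df['Weekday'].cat.codes.head()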
0 am
1 am
2 am
3 pm
4 pm
Name: AMorPM, dtype: category
Categories (2, object): [am, pm]
Index(['am', 'pm'], dtype='object')
0 0
1 0
2 0
3 1
4 1
dtype: int8
Index(['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed'], dtype='object')
0 1
1 2
2 2
3 3
4 0
dtype: int8
Now we want to combine the three categorical columns into one input array using numpy.stack. We don't want the Series index, just the values.
hr = df['Hour'].cat.codes.values
ampm = df['AMorPM'].cat.codes.values
wkdy = df['Weekday'].cat.codes.values
cats = np.stack([hr, ampm, wkdy], 1)
cats[:5]
array([[ 4, 0, 1],
[11, 0, 2],
[ 7, 0, 2],
[17, 1, 3],
[22, 1, 0]], dtype=int8)
The same stack can be written more compactly as cats = np.stack([df[col].cat.codes.values for col in cat_cols], 1). Don't worry about the dtype for now; we can make it int64 when we convert it to a tensor.
Convert numpy arrays to tensors¶
# Convert categorical variables to a tensor
cats = torch.tensor(cats, dtype=torch.int64)
# this syntax is ok, since the source data is an array, not an existing tensor
cats[:5]
tensor([[ 4, 0, 1],
[11, 0, 2],
[ 7, 0, 2],
[17, 1, 3],
[22, 1, 0]])
We can feed all of our continuous variables into the model as a tensor. We’re not normalizing the values here; we’ll let the model perform this step.
# Convert continuous variables to a tensor
conts = np.stack([df[col].values for col in cont_cols], 1)
conts = torch.tensor(conts, dtype=torch.float)
conts[:5]
tensor([[ 40.7305, -73.9924, 40.7447, -73.9755, 1.0000, 2.1263],
[ 40.7406, -73.9901, 40.7441, -73.9742, 1.0000, 1.3923],
[ 40.7511, -73.9941, 40.7662, -73.9601, 2.0000, 3.3268],
[ 40.7564, -73.9905, 40.7482, -73.9712, 1.0000, 1.8641],
[ 40.7342, -73.9910, 40.7431, -73.9060, 1.0000, 7.2313]])
'torch.FloatTensor'
Note: the CrossEntropyLoss function we’ll use below expects a 1d y-tensor, so we’ll replace .reshape(-1,1) with .flatten() this time.
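A sketch of the label tensor under that note (y_col contains 'fare_class'); the shapes printed afterwards are cats.shape, conts.shape and y.shape:

y = torch.tensor(df[y_col].values).flatten()
y[:5]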
tensor([0, 0, 1, 0, 1])
torch.Size([120000, 3])
torch.Size([120000, 6])
torch.Size([120000])
Set an embedding size¶
The rule of thumb for determining the embedding size is to divide the number of unique entries in each column by 2 (rounding up), but not to exceed 50.
# This will set embedding sizes for Hours, AMvsPM and Weekdays
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs
[(24, 12), (2, 1), (7, 4)]
Define a TabularModel¶
This somewhat follows the fast.ai library. The goal is to define a model based on the number of continuous columns (given by conts.shape[1]) plus the number of categorical columns and their embeddings (given by len(emb_szs) and emb_szs respectively). The output would either be a regression (a single float value), or a classification (a group of bins and their softmax values). For this exercise our output will be a classification into two fare classes (under \$10, and \$10 or above). Note that we'll assume our data contains both categorical and continuous data. You can add boolean parameters to your own model class to handle a variety of datasets.
1. Extend the base Module class and set up the following parameters:
- emb_szs: list of tuples: each categorical variable size is paired with an embedding size
- n_cont: int: number of continuous variables
- out_sz: int: output size
- layers: list of ints: layer sizes
- p: float: dropout probability for each layer (for simplicity we'll use the same value throughout)

class TabularModel(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
2. Set up the embedded layers with torch.nn.ModuleList() and torch.nn.Embedding()
Categorical data will be filtered through these Embeddings in the forward section.
self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
3. Set up a dropout function for the embeddings with torch.nn.Dropout(). The default p-value=0.5

self.emb_drop = nn.Dropout(p)
4. Set up a normalization function for the continuous variables with torch.nn.BatchNorm1d()
self.bn_cont = nn.BatchNorm1d(n_cont)
5. Set up a sequence of neural network layers where each level includes a Linear function, an activation function (we’ll use ReLU), a normalization step, and a dropout layer. We’ll combine the list of layers with torch.nn.Sequential()
layerlist = []
n_emb = sum((nf for ni,nf in emb_szs))
n_in = n_emb + n_cont

for i in layers:
    layerlist.append(nn.Linear(n_in,i))
    layerlist.append(nn.ReLU(inplace=True))
    layerlist.append(nn.BatchNorm1d(i))
    layerlist.append(nn.Dropout(p))
    n_in = i
layerlist.append(nn.Linear(layers[-1],out_sz))

self.layers = nn.Sequential(*layerlist)
6. Define the forward method. Preprocess the embeddings and normalize the continuous variables before passing them through the layers.
Use torch.cat() to combine multiple tensors into one.
def forward(self, x_cat, x_cont):
    embeddings = []
    for i,e in enumerate(self.embeds):
        embeddings.append(e(x_cat[:,i]))
    x = torch.cat(embeddings, 1)
    x = self.emb_drop(x)

    x_cont = self.bn_cont(x_cont)
    x = torch.cat([x, x_cont], 1)
    x = self.layers(x)
    return x
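Before assembling the full class, let's step through these pieces by hand on the first four rows of our categorical data. The names catz, selfembeds and embeddingz below are stand-ins for the values and attributes used inside the class:

catz = cats[:4]
catz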
tensor([[ 4, 0, 1],
[11, 0, 2],
[ 7, 0, 2],
[17, 1, 3]])
[(24, 12), (2, 1), (7, 4)]
# This is assigned inside the __init__() method
selfembeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
selfembeds
ModuleList(
(0): Embedding(24, 12)
(1): Embedding(2, 1)
(2): Embedding(7, 4)
)
[(0, Embedding(24, 12)), (1, Embedding(2, 1)), (2, Embedding(7, 4))]
# This happens inside the forward() method
embeddingz = []
for i,e in enumerate(selfembeds):
embeddingz.append(e(catz[:,i]))
embeddingz
[tensor([[ 0.0347, 0.3536, -1.2988, 1.6375, -0.0542, -0.2099, 0.3044, -1.2855,
0.8831, -0.7109, -0.9646, -0.1356],
[-0.5039, -0.9924, 1.2296, -0.6908, 0.4641, -1.0487, 0.5577, -1.1560,
0.8318, -0.0834, 1.2123, -0.6210],
[ 0.3509, 0.2216, 0.3432, 1.4547, -0.8747, 1.6727, -0.6417, -1.0160,
0.8217, -1.0531, 0.8357, -0.0637],
[ 0.7978, 0.4566, 1.0926, -0.4095, -0.3366, 1.0216, 0.3601, -0.2927,
0.3536, 0.2170, -1.4778, -1.1965]], grad_fn=<EmbeddingBackward>),
tensor([[-0.9676],
[-0.9676],
[-0.9676],
[-1.0656]], grad_fn=<EmbeddingBackward>),
tensor([[-2.1762, 1.0210, 1.3557, -0.1804],
[-1.0131, 0.9989, -0.4746, -0.1461],
[-1.0131, 0.9989, -0.4746, -0.1461],
[-0.3646, -3.2237, -0.9956, 0.2598]], grad_fn=<EmbeddingBackward>)]
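The concatenated tensor shown next presumably comes from joining the embedding outputs along dimension 1:

z = torch.cat(embeddingz, 1)
z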
tensor([[ 0.0347, 0.3536, -1.2988, 1.6375, -0.0542, -0.2099, 0.3044, -1.2855,
0.8831, -0.7109, -0.9646, -0.1356, -0.9676, -2.1762, 1.0210, 1.3557,
-0.1804],
[-0.5039, -0.9924, 1.2296, -0.6908, 0.4641, -1.0487, 0.5577, -1.1560,
0.8318, -0.0834, 1.2123, -0.6210, -0.9676, -1.0131, 0.9989, -0.4746,
-0.1461],
[ 0.3509, 0.2216, 0.3432, 1.4547, -0.8747, 1.6727, -0.6417, -1.0160,
0.8217, -1.0531, 0.8357, -0.0637, -0.9676, -1.0131, 0.9989, -0.4746,
-0.1461],
[ 0.7978, 0.4566, 1.0926, -0.4095, -0.3366, 1.0216, 0.3601, -0.2927,
0.3536, 0.2170, -1.4778, -1.1965, -1.0656, -0.3646, -3.2237, -0.9956,
0.2598]], grad_fn=<CatBackward>)
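Passing that tensor through a dropout layer zeroes out a fraction of the values and rescales the rest; the p=0.4 here is an assumption matching the model we build below:

selfembdrop = nn.Dropout(0.4)
selfembdrop(z)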
tensor([[ 0.0000, 0.0000, -2.1647, 0.0000, -0.0000, -0.3498, 0.5073, -2.1424,
0.0000, -1.1848, -1.6076, -0.2259, -1.6127, -3.6271, 0.0000, 2.2594,
-0.3007],
[-0.8398, -0.0000, 0.0000, -0.0000, 0.7734, -1.7478, 0.0000, -1.9267,
0.0000, -0.1390, 0.0000, -1.0350, -0.0000, -0.0000, 1.6648, -0.0000,
-0.2435],
[ 0.0000, 0.3693, 0.5719, 0.0000, -1.4578, 0.0000, -1.0694, -1.6933,
0.0000, -1.7552, 1.3929, -0.1062, -1.6127, -1.6886, 1.6648, -0.0000,
-0.0000],
[ 1.3297, 0.0000, 0.0000, -0.0000, -0.0000, 0.0000, 0.0000, -0.4879,
0.0000, 0.0000, -2.4631, -1.9941, -1.7760, -0.6077, -5.3728, -1.6593,
0.4330]], grad_fn=<MulBackward0>)
class TabularModel(nn.Module):

    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        # Embedding layers, one per categorical variable
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
        # Dropout applied to the concatenated embeddings
        self.emb_drop = nn.Dropout(p)
        # Batch normalization for the continuous variables
        self.bn_cont = nn.BatchNorm1d(n_cont)

        # Build the sequence of Linear -> ReLU -> BatchNorm -> Dropout blocks
        layerlist = []
        n_emb = sum((nf for ni,nf in emb_szs))
        n_in = n_emb + n_cont
        for i in layers:
            layerlist.append(nn.Linear(n_in,i))
            layerlist.append(nn.ReLU(inplace=True))
            layerlist.append(nn.BatchNorm1d(i))
            layerlist.append(nn.Dropout(p))
            n_in = i
        layerlist.append(nn.Linear(layers[-1],out_sz))
        self.layers = nn.Sequential(*layerlist)

    def forward(self, x_cat, x_cont):
        # Pass each categorical column through its embedding, then concatenate
        embeddings = []
        for i,e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:,i]))
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)

        # Normalize the continuous values and combine with the embeddings
        x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1)
        x = self.layers(x)
        return x
torch.manual_seed(33)
model = TabularModel(emb_szs, conts.shape[1], 2, [200,100], p=0.4) # out_sz = 2
TabularModel(
(embeds): ModuleList(
(0): Embedding(24, 12)
(1): Embedding(2, 1)
(2): Embedding(7, 4)
)
(emb_drop): Dropout(p=0.4)
(bn_cont): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(layers): Sequential(
(0): Linear(in_features=23, out_features=200, bias=True)
(1): ReLU(inplace)
(2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.4)
(4): Linear(in_features=200, out_features=100, bias=True)
(5): ReLU(inplace)
(6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.4)
(8): Linear(in_features=100, out_features=2, bias=True)
)
)
Define loss function & optimizer¶
For our classification we'll replace the MSE loss function with torch.nn.CrossEntropyLoss(). For the optimizer, we'll continue to use torch.optim.Adam().
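A sketch of those definitions (the learning rate of 0.001 is an assumption; feel free to experiment):

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)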
Perform train/test splits¶
At this point our batch size is the entire dataset of 120,000 records. To save time we’ll use the first 60,000. Recall that our tensors are already randomly shuffled.
batch_size = 60000
test_size = 12000
cat_train = cats[:batch_size-test_size]
cat_test = cats[batch_size-test_size:batch_size]
con_train = conts[:batch_size-test_size]
con_test = conts[batch_size-test_size:batch_size]
y_train = y[:batch_size-test_size]
y_test = y[batch_size-test_size:batch_size]
48000
12000
Train the model¶
Expect this to take a while (the run below took about 12 minutes). We've added code to report the duration at the end.
import time
start_time = time.time()
epochs = 300
losses = []
for i in range(epochs):
    i+=1
    y_pred = model(cat_train, con_train)
    loss = criterion(y_pred, y_train)
    losses.append(loss.item())  # store the loss as a Python float so it can be plotted later

    # a neat trick to save screen space:
    if i%25 == 1:
        print(f'epoch: {i:3} loss: {loss.item():10.8f}')

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f'epoch: {i:3} loss: {loss.item():10.8f}') # print the last line
print(f'\nDuration: {time.time() - start_time:.0f} seconds') # print the time elapsed
epoch: 1 loss: 0.73441482
epoch: 26 loss: 0.45090991
epoch: 51 loss: 0.35915938
epoch: 76 loss: 0.31940848
epoch: 101 loss: 0.29913244
epoch: 126 loss: 0.28824982
epoch: 151 loss: 0.28091952
epoch: 176 loss: 0.27713534
epoch: 201 loss: 0.27236161
epoch: 226 loss: 0.27171907
epoch: 251 loss: 0.26830241
epoch: 276 loss: 0.26365638
epoch: 300 loss: 0.25949642
Duration: 709 seconds
Plot the loss function¶
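A minimal sketch of the plot, assuming losses holds one value per epoch:

plt.plot(range(epochs), losses)
plt.ylabel('Cross Entropy Loss')
plt.xlabel('epoch')
plt.show()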
Validate the model¶
# TO EVALUATE THE ENTIRE TEST SET
with torch.no_grad():
y_val = model(cat_test, con_test)
loss = criterion(y_val, y_test)
print(f'CE Loss: {loss:.8f}')
CE Loss: 0.25455481
Now let’s look at the first 50 predicted values
rows = 50
correct = 0
print(f'{"MODEL OUTPUT":26} ARGMAX Y_TEST')
for i in range(rows):
print(f'{str(y_val[i]):26} {y_val[i].argmax():^7}{y_test[i]:^7}')
if y_val[i].argmax().item() == y_test[i]:
correct += 1
print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')
MODEL OUTPUT ARGMAX Y_TEST
tensor([ 1.8140, -1.6443]) 0 0
tensor([-1.8268, 2.6373]) 1 0
tensor([ 1.4028, -1.9248]) 0 0
tensor([-1.9130, 1.4853]) 1 1
tensor([ 1.1757, -2.4964]) 0 0
tensor([ 2.0996, -2.2990]) 0 0
tensor([ 1.3226, -1.8349]) 0 0
tensor([-1.6211, 2.3889]) 1 1
tensor([ 2.2489, -2.4253]) 0 0
tensor([-0.4459, 1.1358]) 1 1
tensor([ 1.5145, -2.1619]) 0 0
tensor([ 0.7704, -1.9443]) 0 0
tensor([ 0.9637, -1.3796]) 0 0
tensor([-1.3527, 1.7322]) 1 1
tensor([ 1.4110, -2.4595]) 0 0
tensor([-1.4455, 2.6081]) 1 1
tensor([ 2.2798, -2.5864]) 0 1
tensor([ 1.4585, -2.7982]) 0 0
tensor([ 0.3342, -0.8995]) 0 0
tensor([ 2.0525, -1.9737]) 0 0
tensor([-1.3571, 2.1911]) 1 1
tensor([-0.4669, 0.2872]) 1 1
tensor([-2.0624, 2.2875]) 1 1
tensor([-2.1334, 2.6416]) 1 1
tensor([-3.1325, 5.1561]) 1 1
tensor([ 2.2128, -2.5172]) 0 0
tensor([ 1.0346, -1.7764]) 0 0
tensor([ 1.1221, -1.6717]) 0 0
tensor([-2.1322, 1.6714]) 1 1
tensor([ 1.5009, -1.6338]) 0 0
tensor([ 2.0387, -1.8475]) 0 0
tensor([-1.6346, 2.8899]) 1 1
tensor([-3.0129, 2.3519]) 1 1
tensor([-1.5746, 2.0000]) 1 1
tensor([ 1.3056, -2.2630]) 0 0
tensor([ 0.6631, -1.4797]) 0 0
tensor([-1.4585, 2.1836]) 1 1
tensor([ 1.0574, -1.5848]) 0 1
tensor([ 0.3376, -0.8050]) 0 1
tensor([ 1.9217, -1.9764]) 0 0
tensor([ 0.1011, -0.5529]) 0 0
tensor([ 0.6703, -0.5540]) 0 0
tensor([-0.6733, 0.8777]) 1 1
tensor([ 2.2017, -2.0445]) 0 0
tensor([-0.0442, -0.4276]) 0 0
tensor([-1.1204, 1.2558]) 1 1
tensor([-1.8170, 2.7124]) 1 1
tensor([ 1.7404, -2.0341]) 0 0
tensor([ 1.3266, -2.3039]) 0 0
tensor([-0.0671, 0.3291]) 1 0
45 out of 50 = 90.00% correct
Save the model¶
Save the trained model to a file in case you want to come back later and feed new data through it.
# Make sure to save the model only after the training has happened!
if len(losses) == epochs:
torch.save(model.state_dict(), 'TaxiFareClssModel.pt')
else:
print('Model has not been trained. Consider loading a trained model instead.')
Loading a saved model (starting from scratch)¶
We can load the trained weights and biases from a saved model. If we’ve just opened the notebook, we’ll have to run standard imports and function definitions. To demonstrate, restart the kernel before proceeding.
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
def haversine_distance(df, lat1, long1, lat2, long2):
r = 6371
phi1 = np.radians(df[lat1])
phi2 = np.radians(df[lat2])
delta_phi = np.radians(df[lat2]-df[lat1])
delta_lambda = np.radians(df[long2]-df[long1])
a = np.sin(delta_phi/2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
return r * c
class TabularModel(nn.Module):
def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
super().__init__()
self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
self.emb_drop = nn.Dropout(p)
self.bn_cont = nn.BatchNorm1d(n_cont)
layerlist = []
n_emb = sum((nf for ni,nf in emb_szs))
n_in = n_emb + n_cont
for i in layers:
layerlist.append(nn.Linear(n_in,i))
layerlist.append(nn.ReLU(inplace=True))
layerlist.append(nn.BatchNorm1d(i))
layerlist.append(nn.Dropout(p))
n_in = i
layerlist.append(nn.Linear(layers[-1],out_sz))
self.layers = nn.Sequential(*layerlist)
def forward(self, x_cat, x_cont):
embeddings = []
for i,e in enumerate(self.embeds):
embeddings.append(e(x_cat[:,i]))
x = torch.cat(embeddings, 1)
x = self.emb_drop(x)
x_cont = self.bn_cont(x_cont)
x = torch.cat([x, x_cont], 1)
return self.layers(x)
Now define the model. Before we can load the saved settings, we need to instantiate our TabularModel with the parameters we used before (embedding sizes, number of continuous columns, output size, layer sizes, and dropout layer p-value).
Once the model is set up, loading the saved settings is a snap.
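A sketch of that instantiation, writing the values out explicitly since the kernel was restarted (they match the sizes computed earlier):

emb_szs = [(24,12), (2,1), (7,4)]   # embedding sizes for Hour, AMorPM, Weekday
model2 = TabularModel(emb_szs, 6, 2, [200,100], p=0.4)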
model2.load_state_dict(torch.load('TaxiFareClssModel.pt'));
model2.eval() # be sure to run this step!
TabularModel(
(embeds): ModuleList(
(0): Embedding(24, 12)
(1): Embedding(2, 1)
(2): Embedding(7, 4)
)
(emb_drop): Dropout(p=0.4)
(bn_cont): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(layers): Sequential(
(0): Linear(in_features=23, out_features=200, bias=True)
(1): ReLU(inplace)
(2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.4)
(4): Linear(in_features=200, out_features=100, bias=True)
(5): ReLU(inplace)
(6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.4)
(8): Linear(in_features=100, out_features=2, bias=True)
)
)
Next we’ll define a function that takes in new parameters from the user, performs all of the preprocessing steps above, and passes the new data through our trained model.
def test_data(mdl): # pass in the name of the new model
# INPUT NEW DATA
plat = float(input('What is the pickup latitude? '))
plong = float(input('What is the pickup longitude? '))
dlat = float(input('What is the dropoff latitude? '))
dlong = float(input('What is the dropoff longitude? '))
psngr = int(input('How many passengers? '))
dt = input('What is the pickup date and time?\nFormat as YYYY-MM-DD HH:MM:SS ')
# PREPROCESS THE DATA
dfx_dict = {'pickup_latitude':plat,'pickup_longitude':plong,'dropoff_latitude':dlat,
'dropoff_longitude':dlong,'passenger_count':psngr,'EDTdate':dt}
dfx = pd.DataFrame(dfx_dict, index=[0])
dfx['dist_km'] = haversine_distance(dfx,'pickup_latitude', 'pickup_longitude',
'dropoff_latitude', 'dropoff_longitude')
dfx['EDTdate'] = pd.to_datetime(dfx['EDTdate'])
# We can skip the .astype(category) step since our fields are small,
# and encode them right away
dfx['Hour'] = dfx['EDTdate'].dt.hour
dfx['AMorPM'] = np.where(dfx['Hour']<12,0,1)
dfx['Weekday'] = dfx['EDTdate'].dt.strftime("%a")
dfx['Weekday'] = dfx['Weekday'].replace(['Fri','Mon','Sat','Sun','Thu','Tue','Wed'],
[0,1,2,3,4,5,6]).astype('int64')
# CREATE CAT AND CONT TENSORS
cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
'dropoff_longitude', 'passenger_count', 'dist_km']
xcats = np.stack([dfx[col].values for col in cat_cols], 1)
xcats = torch.tensor(xcats, dtype=torch.int64)
xconts = np.stack([dfx[col].values for col in cont_cols], 1)
xconts = torch.tensor(xconts, dtype=torch.float)
# PASS NEW DATA THROUGH THE MODEL WITHOUT PERFORMING A BACKPROP
with torch.no_grad():
z = mdl(xcats, xconts).argmax().item()
print(f'\nThe predicted fare class is {z}')
Feed new data through the trained model¶
For convenience, here are the max and min values for each of the variables:
Column | Minimum | Maximum |
---|---|---|
pickup_latitude | 40 | 41 |
pickup_longitude | -74.5 | -73.3 |
dropoff_latitude | 40 | 41 |
dropoff_longitude | -74.5 | -73.3 |
passenger_count | 1 | 5 |
EDTdate | 2010-04-11 00:00:00 | 2010-04-24 23:59:42 |
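For example, calling the function on our reloaded model (it will prompt for each value interactively):

test_data(model2)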