
================ by Jawad Haider

04b - Full Artificial Neural Network Code Along - CLASSIFICATION


Copyright Qalmaqihir
For more information, visit us at www.github.com/qalmaqihir/



Full Artificial Neural Network Code Along - CLASSIFICATION

In the last section we took in four continuous variables (lengths) to perform a classification. In this section we’ll combine continuous and categorical data to perform a similar classification. The goal is to estimate the relative cost of a New York City cab ride from several inputs. The inspiration behind this code along is a recent Kaggle competition.

NOTE: This notebook differs from the previous regression notebook in that it uses 'fare_class' for the y set, and the output contains two values instead of one. In this exercise we're training our model to perform a binary classification, and predict whether a fare is greater or less than $10.00.

Working with tabular data

Deep learning with neural networks is often associated with sophisticated image recognition, and in upcoming sections we'll train models based on properties like pixel patterns and colors.

Here we're working with tabular data (spreadsheets, SQL tables, etc.) with columns of values that may or may not be relevant. As it happens, neural networks can learn to make connections we probably wouldn't have developed on our own. However, to do this we have to handle categorical values separately from continuous ones. Make sure to watch the theory lectures! You'll want to be comfortable with:
* continuous vs. categorical values
* embeddings
* batch normalization
* dropout layers

Perform standard imports

import torch
import torch.nn as nn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Load the NYC Taxi Fares dataset

The Kaggle competition provides a dataset with about 55 million records. The data contains only the pickup date & time, the latitude & longitude (GPS coordinates) of the pickup and dropoff locations, and the number of passengers. It is up to the contest participant to extract any further information. For instance, does the time of day matter? The day of the week? How do we determine the distance traveled from pairs of GPS coordinates?

For this exercise we've whittled the dataset down to just 120,000 records from April 11 to April 24, 2010. The records are randomly sorted. We'll show how to calculate distance from GPS coordinates, and how to create a pandas datetime object from a text column. This will let us quickly get information like day of the week, am vs. pm, etc.

Let’s get started!

df = pd.read_csv('../Data/NYCTaxiFares.csv')
df.head()
pickup_datetime fare_amount fare_class pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2010-04-19 08:17:56 UTC 6.5 0 -73.992365 40.730521 -73.975499 40.744746 1
1 2010-04-17 15:43:53 UTC 6.9 0 -73.990078 40.740558 -73.974232 40.744114 1
2 2010-04-17 11:23:26 UTC 10.1 1 -73.994149 40.751118 -73.960064 40.766235 2
3 2010-04-11 21:25:03 UTC 8.9 0 -73.990485 40.756422 -73.971205 40.748192 1
4 2010-04-17 02:19:01 UTC 19.7 1 -73.990976 40.734202 -73.905956 40.743115 1
df['fare_class'].value_counts()
0    80000
1    40000
Name: fare_class, dtype: int64

Conveniently, ⅔ of the data have fares under $10, and ⅓ have fares $10 and above.

Fare classes correspond to fare amounts as follows:

Class   Values
0       <  $10.00
1       >= $10.00
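The dataset already ships with fare_class computed, but as a quick illustration of the rule in the table above, here is a hedged sketch of how it could be derived from fare_amount (the name fare_class_check is hypothetical, used only for this check):

# Hypothetical check: derive the class from fare_amount using the rule above
fare_class_check = (df['fare_amount'] >= 10.0).astype('int64')
(fare_class_check == df['fare_class']).all()   # should be True if the table's rule holds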

Calculate the distance traveled

The haversine formula calculates the distance on a sphere between two sets of GPS coordinates.
Here we assign latitude values with $\varphi$ (phi) and longitude with $\lambda$ (lambda).

The distance formula works out to

$$d = 2r\arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2-\varphi_1}{2}\right)+\cos(\varphi_1)\,\cos(\varphi_2)\,\sin^2\left(\frac{\lambda_2-\lambda_1}{2}\right)}\right)$$

where

$$\begin{split}
r &: \textrm{radius of the sphere (Earth's radius averages 6371 km)}\\
\varphi_1, \varphi_2 &: \textrm{latitudes of point 1 and point 2}\\
\lambda_1, \lambda_2 &: \textrm{longitudes of point 1 and point 2}
\end{split}$$

def haversine_distance(df, lat1, long1, lat2, long2):
    """
    Calculates the haversine distance between 2 sets of GPS coordinates in df
    """
    r = 6371  # average radius of Earth in kilometers

    phi1 = np.radians(df[lat1])
    phi2 = np.radians(df[lat2])

    delta_phi = np.radians(df[lat2]-df[lat1])
    delta_lambda = np.radians(df[long2]-df[long1])

    a = np.sin(delta_phi/2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = (r * c) # in kilometers

    return d
df['dist_km'] = haversine_distance(df,'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude')
df.head()
pickup_datetime fare_amount fare_class pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count dist_km
0 2010-04-19 08:17:56 UTC 6.5 0 -73.992365 40.730521 -73.975499 40.744746 1 2.126312
1 2010-04-17 15:43:53 UTC 6.9 0 -73.990078 40.740558 -73.974232 40.744114 1 1.392307
2 2010-04-17 11:23:26 UTC 10.1 1 -73.994149 40.751118 -73.960064 40.766235 2 3.326763
3 2010-04-11 21:25:03 UTC 8.9 0 -73.990485 40.756422 -73.971205 40.748192 1 1.864129
4 2010-04-17 02:19:01 UTC 19.7 1 -73.990976 40.734202 -73.905956 40.743115 1 7.231321

Add a datetime column and derive useful statistics

By creating a datetime object, we can extract information like "day of the week", "am vs. pm", etc. Note that the data was saved in UTC time. Our data falls in April of 2010, which was during Daylight Saving Time in New York. For that reason, we'll make an adjustment to EDT using UTC-4 (subtracting four hours).

df['EDTdate'] = pd.to_datetime(df['pickup_datetime'].str[:19]) - pd.Timedelta(hours=4)
df['Hour'] = df['EDTdate'].dt.hour
df['AMorPM'] = np.where(df['Hour']<12,'am','pm')
df['Weekday'] = df['EDTdate'].dt.strftime("%a")
df.head()
pickup_datetime fare_amount fare_class pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count dist_km EDTdate Hour AMorPM Weekday
0 2010-04-19 08:17:56 UTC 6.5 0 -73.992365 40.730521 -73.975499 40.744746 1 2.126312 2010-04-19 04:17:56 4 am Mon
1 2010-04-17 15:43:53 UTC 6.9 0 -73.990078 40.740558 -73.974232 40.744114 1 1.392307 2010-04-17 11:43:53 11 am Sat
2 2010-04-17 11:23:26 UTC 10.1 1 -73.994149 40.751118 -73.960064 40.766235 2 3.326763 2010-04-17 07:23:26 7 am Sat
3 2010-04-11 21:25:03 UTC 8.9 0 -73.990485 40.756422 -73.971205 40.748192 1 1.864129 2010-04-11 17:25:03 17 pm Sun
4 2010-04-17 02:19:01 UTC 19.7 1 -73.990976 40.734202 -73.905956 40.743115 1 7.231321 2010-04-16 22:19:01 22 pm Fri
df['EDTdate'].min()
Timestamp('2010-04-11 00:00:10')
df['EDTdate'].max()
Timestamp('2010-04-24 23:59:42')
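As an aside, pandas can also handle the timezone conversion directly instead of subtracting a fixed four hours; a minimal sketch (not used in the rest of this notebook), assuming the zone name 'America/New_York':

# Alternative sketch: timezone-aware conversion instead of a fixed UTC-4 offset
utc = pd.to_datetime(df['pickup_datetime'].str[:19]).dt.tz_localize('UTC')
local = utc.dt.tz_convert('America/New_York').dt.tz_localize(None)  # drop tz info for comparison
(local == df['EDTdate']).all()   # True here, since all of April 2010 falls within EDT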

Separate categorical from continuous columns

df.columns
Index(['pickup_datetime', 'fare_amount', 'fare_class', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count', 'dist_km', 'EDTdate', 'Hour', 'AMorPM', 'Weekday'],
      dtype='object')
cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']
y_col = ['fare_class']  # this column contains the labels
NOTE: If you plan to use all of the columns in the data table, there’s a shortcut to grab the remaining continuous columns:
cont_cols = [col for col in df.columns if col not in cat_cols + y_col]
Here we entered the continuous columns explicitly because there are columns we're not running through the model (pickup_datetime, fare_amount and EDTdate).

Categorify

Pandas offers a category dtype for converting categorical values to numerical codes. A dataset containing months of the year will be assigned 12 codes, one for each month. These will usually be the integers 0 to 11. Pandas replaces the column values with codes, and retains an index list of category values. In the steps ahead we’ll call the categorical values “names” and the encodings “codes”.
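As a small illustration (toy data, not part of the taxi set), here is how names and codes relate for a handful of month labels; note that pandas sorts the category names alphabetically by default, which is why the Weekday codes below won't follow calendar order:

# Toy example: category names vs. integer codes
months = pd.Series(['Jan', 'Mar', 'Jan', 'Dec']).astype('category')
months.cat.categories   # Index(['Dec', 'Jan', 'Mar'], dtype='object') -- sorted alphabetically
months.cat.codes        # 1, 2, 1, 0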

# Convert our three categorical columns to category dtypes.
for cat in cat_cols:
    df[cat] = df[cat].astype('category')
df.dtypes
pickup_datetime              object
fare_amount                 float64
fare_class                    int64
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
passenger_count               int64
dist_km                     float64
EDTdate              datetime64[ns]
Hour                       category
AMorPM                     category
Weekday                    category
dtype: object

We can see that df['Hour'] is a categorical feature by displaying some of the rows:

df['Hour'].head()
0     4
1    11
2     7
3    17
4    22
Name: Hour, dtype: category
Categories (24, int64): [0, 1, 2, 3, ..., 20, 21, 22, 23]

Here our categorical names are the integers 0 through 23, for a total of 24 unique categories. These values also correspond to the codes assigned to each name.

We can access the category names with Series.cat.categories or just the codes with Series.cat.codes. This will make more sense if we look at df['AMorPM']:

df['AMorPM'].head()
0    am
1    am
2    am
3    pm
4    pm
Name: AMorPM, dtype: category
Categories (2, object): [am, pm]
df['AMorPM'].cat.categories
Index(['am', 'pm'], dtype='object')
df['AMorPM'].head().cat.codes
0    0
1    0
2    0
3    1
4    1
dtype: int8
df['Weekday'].cat.categories
Index(['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed'], dtype='object')
df['Weekday'].head().cat.codes
0    1
1    2
2    2
3    3
4    0
dtype: int8
NOTE: NaN values in categorical data are assigned a code of -1. We don’t have any in this particular dataset.
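To see the -1 behavior on a toy Series (illustration only):

# Illustration: NaN values receive code -1 in a categorical Series
s = pd.Series(['am', np.nan, 'pm']).astype('category')
s.cat.codes   # 0, -1, 1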

Now we want to combine the three categorical columns into one input array using numpy.stack(). We don't want the Series index, just the values.

hr = df['Hour'].cat.codes.values
ampm = df['AMorPM'].cat.codes.values
wkdy = df['Weekday'].cat.codes.values

cats = np.stack([hr, ampm, wkdy], 1)

cats[:5]
array([[ 4,  0,  1],
       [11,  0,  2],
       [ 7,  0,  2],
       [17,  1,  3],
       [22,  1,  0]], dtype=int8)
NOTE: This can be done in one line of code using a list comprehension:
cats = np.stack([df[col].cat.codes.values for col in cat_cols], 1)
Don’t worry about the dtype for now, we can make it int64 when we convert it to a tensor.

Convert numpy arrays to tensors

# Convert categorical variables to a tensor
cats = torch.tensor(cats, dtype=torch.int64)
# this syntax is ok, since the source data is an array, not an existing tensor

cats[:5]
tensor([[ 4,  0,  1],
        [11,  0,  2],
        [ 7,  0,  2],
        [17,  1,  3],
        [22,  1,  0]])

We can feed all of our continuous variables into the model as a tensor. We’re not normalizing the values here; we’ll let the model perform this step.

NOTE: We have to store conts as a Float (float32) tensor, not Double (float64), in order for batch normalization to work properly. The labels y will be stored as a Long (int64) tensor, which is what CrossEntropyLoss expects.
# Convert continuous variables to a tensor
conts = np.stack([df[col].values for col in cont_cols], 1)
conts = torch.tensor(conts, dtype=torch.float)
conts[:5]
tensor([[ 40.7305, -73.9924,  40.7447, -73.9755,   1.0000,   2.1263],
        [ 40.7406, -73.9901,  40.7441, -73.9742,   1.0000,   1.3923],
        [ 40.7511, -73.9941,  40.7662, -73.9601,   2.0000,   3.3268],
        [ 40.7564, -73.9905,  40.7482, -73.9712,   1.0000,   1.8641],
        [ 40.7342, -73.9910,  40.7431, -73.9060,   1.0000,   7.2313]])
conts.type()
'torch.FloatTensor'

Note: the CrossEntropyLoss function we’ll use below expects a 1d y-tensor, so we’ll replace .reshape(-1,1) with .flatten() this time.

# Convert labels to a tensor
y = torch.tensor(df[y_col].values).flatten()

y[:5]
tensor([0, 0, 1, 0, 1])
cats.shape
torch.Size([120000, 3])
conts.shape
torch.Size([120000, 6])
y.shape
torch.Size([120000])

Set an embedding size

The rule of thumb for determining the embedding size is to divide the number of unique entries in each column by 2 (rounding up), but not to exceed 50.

# This will set embedding sizes for Hours, AMvsPM and Weekdays
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs
[(24, 12), (2, 1), (7, 4)]
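These widths (12 + 1 + 4 = 17 embedding values), plus the 6 continuous columns, give the 23 input features of the first linear layer in the model defined below. A quick check:

# Input width of the first Linear layer: embedding widths + continuous columns
n_emb = sum(nf for ni, nf in emb_szs)   # 12 + 1 + 4 = 17
n_emb + conts.shape[1]                  # 17 + 6 = 23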

Define a TabularModel

This somewhat follows the fast.ai library. The goal is to define a model based on the number of continuous columns (given by conts.shape[1]) plus the number of categorical columns and their embeddings (given by len(emb_szs) and emb_szs respectively). The output will either be a regression (a single float value) or a classification (a group of bins and their softmax values). For this exercise our output will be a classification with two values, one for each fare class. Note that we'll assume our data contains both categorical and continuous data. You can add boolean parameters to your own model class to handle a variety of datasets.

Let’s walk through the steps we’re about to take. See below for more detailed illustrations of the steps.
1. Extend the base Module class and set up the following parameters:
   * emb_szs: list of tuples: each categorical variable size is paired with an embedding size
   * n_cont: int: number of continuous variables
   * out_sz: int: output size
   * layers: list of ints: layer sizes
   * p: float: dropout probability for each layer (for simplicity we'll use the same value throughout)

class TabularModel(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()

2. Set up the embedded layers with torch.nn.ModuleList() and torch.nn.Embedding()
Categorical data will be filtered through these Embeddings in the forward section.
    self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])

3. Set up a dropout function for the embeddings with torch.nn.Dropout(). The default p-value is 0.5.
    self.emb_drop = nn.Dropout(p)

4. Set up a normalization function for the continuous variables with torch.nn.BatchNorm1d()
    self.bn_cont = nn.BatchNorm1d(n_cont)

5. Set up a sequence of neural network layers where each level includes a Linear function, an activation function (we’ll use ReLU), a normalization step, and a dropout layer. We’ll combine the list of layers with torch.nn.Sequential()
    layerlist = []
    n_emb = sum((nf for ni,nf in emb_szs))
    n_in = n_emb + n_cont

    for i in layers:
        layerlist.append(nn.Linear(n_in,i))
        layerlist.append(nn.ReLU(inplace=True))
        layerlist.append(nn.BatchNorm1d(i))
        layerlist.append(nn.Dropout(p))
        n_in = i
    layerlist.append(nn.Linear(layers[-1],out_sz))

    self.layers = nn.Sequential(*layerlist)


6. Define the forward method. Preprocess the embeddings and normalize the continuous variables before passing them through the layers.
Use torch.cat() to combine multiple tensors into one.
def forward(self, x_cat, x_cont):
    embeddings = []
    for i,e in enumerate(self.embeds):
        embeddings.append(e(x_cat[:,i]))
    x = torch.cat(embeddings, 1)
    x = self.emb_drop(x)

    x_cont = self.bn_cont(x_cont)
    x = torch.cat([x, x_cont], 1)
    x = self.layers(x)
    return x
Breaking down the embedding steps (the following code is for illustration purposes only):
# This is our source data
catz = cats[:4]
catz
tensor([[ 4,  0,  1],
        [11,  0,  2],
        [ 7,  0,  2],
        [17,  1,  3]])
# This is passed in when the model is instantiated
emb_szs
[(24, 12), (2, 1), (7, 4)]
# This is assigned inside the __init__() method
selfembeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
selfembeds
ModuleList(
  (0): Embedding(24, 12)
  (1): Embedding(2, 1)
  (2): Embedding(7, 4)
)
list(enumerate(selfembeds))
[(0, Embedding(24, 12)), (1, Embedding(2, 1)), (2, Embedding(7, 4))]
# This happens inside the forward() method
embeddingz = []
for i,e in enumerate(selfembeds):
    embeddingz.append(e(catz[:,i]))
embeddingz
[tensor([[ 0.0347,  0.3536, -1.2988,  1.6375, -0.0542, -0.2099,  0.3044, -1.2855,
           0.8831, -0.7109, -0.9646, -0.1356],
         [-0.5039, -0.9924,  1.2296, -0.6908,  0.4641, -1.0487,  0.5577, -1.1560,
           0.8318, -0.0834,  1.2123, -0.6210],
         [ 0.3509,  0.2216,  0.3432,  1.4547, -0.8747,  1.6727, -0.6417, -1.0160,
           0.8217, -1.0531,  0.8357, -0.0637],
         [ 0.7978,  0.4566,  1.0926, -0.4095, -0.3366,  1.0216,  0.3601, -0.2927,
           0.3536,  0.2170, -1.4778, -1.1965]], grad_fn=<EmbeddingBackward>),
 tensor([[-0.9676],
         [-0.9676],
         [-0.9676],
         [-1.0656]], grad_fn=<EmbeddingBackward>),
 tensor([[-2.1762,  1.0210,  1.3557, -0.1804],
         [-1.0131,  0.9989, -0.4746, -0.1461],
         [-1.0131,  0.9989, -0.4746, -0.1461],
         [-0.3646, -3.2237, -0.9956,  0.2598]], grad_fn=<EmbeddingBackward>)]
# We concatenate the embedding sections (12,1,4) into one (17)
z = torch.cat(embeddingz, 1)
z
tensor([[ 0.0347,  0.3536, -1.2988,  1.6375, -0.0542, -0.2099,  0.3044, -1.2855,
          0.8831, -0.7109, -0.9646, -0.1356, -0.9676, -2.1762,  1.0210,  1.3557,
         -0.1804],
        [-0.5039, -0.9924,  1.2296, -0.6908,  0.4641, -1.0487,  0.5577, -1.1560,
          0.8318, -0.0834,  1.2123, -0.6210, -0.9676, -1.0131,  0.9989, -0.4746,
         -0.1461],
        [ 0.3509,  0.2216,  0.3432,  1.4547, -0.8747,  1.6727, -0.6417, -1.0160,
          0.8217, -1.0531,  0.8357, -0.0637, -0.9676, -1.0131,  0.9989, -0.4746,
         -0.1461],
        [ 0.7978,  0.4566,  1.0926, -0.4095, -0.3366,  1.0216,  0.3601, -0.2927,
          0.3536,  0.2170, -1.4778, -1.1965, -1.0656, -0.3646, -3.2237, -0.9956,
          0.2598]], grad_fn=<CatBackward>)
# This was assigned under the __init__() method
selfembdrop = nn.Dropout(.4)
z = selfembdrop(z)
z
tensor([[ 0.0000,  0.0000, -2.1647,  0.0000, -0.0000, -0.3498,  0.5073, -2.1424,
          0.0000, -1.1848, -1.6076, -0.2259, -1.6127, -3.6271,  0.0000,  2.2594,
         -0.3007],
        [-0.8398, -0.0000,  0.0000, -0.0000,  0.7734, -1.7478,  0.0000, -1.9267,
          0.0000, -0.1390,  0.0000, -1.0350, -0.0000, -0.0000,  1.6648, -0.0000,
         -0.2435],
        [ 0.0000,  0.3693,  0.5719,  0.0000, -1.4578,  0.0000, -1.0694, -1.6933,
          0.0000, -1.7552,  1.3929, -0.1062, -1.6127, -1.6886,  1.6648, -0.0000,
         -0.0000],
        [ 1.3297,  0.0000,  0.0000, -0.0000, -0.0000,  0.0000,  0.0000, -0.4879,
          0.0000,  0.0000, -2.4631, -1.9941, -1.7760, -0.6077, -5.3728, -1.6593,
          0.4330]], grad_fn=<MulBackward0>)
This is how the categorical embeddings are passed into the layers.
class TabularModel(nn.Module):

    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)

        layerlist = []
        n_emb = sum((nf for ni,nf in emb_szs))
        n_in = n_emb + n_cont

        for i in layers:
            layerlist.append(nn.Linear(n_in,i)) 
            layerlist.append(nn.ReLU(inplace=True))
            layerlist.append(nn.BatchNorm1d(i))
            layerlist.append(nn.Dropout(p))
            n_in = i
        layerlist.append(nn.Linear(layers[-1],out_sz))

        self.layers = nn.Sequential(*layerlist)

    def forward(self, x_cat, x_cont):
        embeddings = []
        for i,e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:,i]))
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)

        x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1)
        x = self.layers(x)
        return x
torch.manual_seed(33)
model = TabularModel(emb_szs, conts.shape[1], 2, [200,100], p=0.4) # out_sz = 2
model
TabularModel(
  (embeds): ModuleList(
    (0): Embedding(24, 12)
    (1): Embedding(2, 1)
    (2): Embedding(7, 4)
  )
  (emb_drop): Dropout(p=0.4)
  (bn_cont): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=23, out_features=200, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4)
    (8): Linear(in_features=100, out_features=2, bias=True)
  )
)
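If you're curious about the model's size, an optional one-liner counts the trainable parameters:

# Optional: total number of trainable parameters
sum(p.numel() for p in model.parameters() if p.requires_grad)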

Define loss function & optimizer

For our classification we’ll replace the MSE loss function with torch.nn.CrossEntropyLoss()
For the optimizer, we’ll continue to use torch.optim.Adam()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
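CrossEntropyLoss applies LogSoftmax and NLLLoss internally, so the model should output raw scores (two per row) and the targets should be integer class labels. A toy sketch of the call (made-up numbers, just to show the shapes):

# Toy example: 3 samples, 2 classes; raw scores in, integer labels in, scalar loss out
toy_scores = torch.tensor([[ 2.0, -1.0],
                           [-0.5,  1.5],
                           [ 0.1,  0.2]])
toy_labels = torch.tensor([0, 1, 0])
criterion(toy_scores, toy_labels)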

Perform train/test splits

At this point our batch size is the entire dataset of 120,000 records. To save time we’ll use the first 60,000. Recall that our tensors are already randomly shuffled.

batch_size = 60000
test_size = 12000

cat_train = cats[:batch_size-test_size]
cat_test = cats[batch_size-test_size:batch_size]
con_train = conts[:batch_size-test_size]
con_test = conts[batch_size-test_size:batch_size]
y_train = y[:batch_size-test_size]
y_test = y[batch_size-test_size:batch_size]
len(cat_train)
48000
len(cat_test)
12000
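For reference, the same data could be fed to the model in mini-batches with TensorDataset and DataLoader; a minimal sketch (we stick with full-batch training below):

# Alternative sketch: mini-batch training setup (not used in this notebook)
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(cat_train, con_train, y_train)
train_loader = DataLoader(train_ds, batch_size=1024, shuffle=True)

for cat_b, con_b, y_b in train_loader:
    pass   # inside an epoch loop you would call model(cat_b, con_b) here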

Train the model

Expect this to take several minutes (the run below took about 12 minutes). We've added code to report the duration at the end.

import time
start_time = time.time()

epochs = 300
losses = []

for i in range(epochs):
    i+=1
    y_pred = model(cat_train, con_train)
    loss = criterion(y_pred, y_train)
    losses.append(loss.item())  # store the scalar loss value so it can be plotted later

    # a neat trick to save screen space:
    if i%25 == 1:
        print(f'epoch: {i:3}  loss: {loss.item():10.8f}')

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f'epoch: {i:3}  loss: {loss.item():10.8f}') # print the last line
print(f'\nDuration: {time.time() - start_time:.0f} seconds') # print the time elapsed
epoch:   1  loss: 0.73441482
epoch:  26  loss: 0.45090991
epoch:  51  loss: 0.35915938
epoch:  76  loss: 0.31940848
epoch: 101  loss: 0.29913244
epoch: 126  loss: 0.28824982
epoch: 151  loss: 0.28091952
epoch: 176  loss: 0.27713534
epoch: 201  loss: 0.27236161
epoch: 226  loss: 0.27171907
epoch: 251  loss: 0.26830241
epoch: 276  loss: 0.26365638
epoch: 300  loss: 0.25949642

Duration: 709 seconds

Plot the loss function

plt.plot(range(epochs), losses)
plt.ylabel('Cross Entropy Loss')
plt.xlabel('epoch');

Validate the model

# TO EVALUATE THE ENTIRE TEST SET
with torch.no_grad():
    y_val = model(cat_test, con_test)
    loss = criterion(y_val, y_test)
print(f'CE Loss: {loss:.8f}')
CE Loss: 0.25455481

Now let’s look at the first 50 predicted values

rows = 50
correct = 0
print(f'{"MODEL OUTPUT":26} ARGMAX  Y_TEST')
for i in range(rows):
    print(f'{str(y_val[i]):26} {y_val[i].argmax():^7}{y_test[i]:^7}')
    if y_val[i].argmax().item() == y_test[i]:
        correct += 1
print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')
MODEL OUTPUT               ARGMAX  Y_TEST
tensor([ 1.8140, -1.6443])    0      0   
tensor([-1.8268,  2.6373])    1      0   
tensor([ 1.4028, -1.9248])    0      0   
tensor([-1.9130,  1.4853])    1      1   
tensor([ 1.1757, -2.4964])    0      0   
tensor([ 2.0996, -2.2990])    0      0   
tensor([ 1.3226, -1.8349])    0      0   
tensor([-1.6211,  2.3889])    1      1   
tensor([ 2.2489, -2.4253])    0      0   
tensor([-0.4459,  1.1358])    1      1   
tensor([ 1.5145, -2.1619])    0      0   
tensor([ 0.7704, -1.9443])    0      0   
tensor([ 0.9637, -1.3796])    0      0   
tensor([-1.3527,  1.7322])    1      1   
tensor([ 1.4110, -2.4595])    0      0   
tensor([-1.4455,  2.6081])    1      1   
tensor([ 2.2798, -2.5864])    0      1   
tensor([ 1.4585, -2.7982])    0      0   
tensor([ 0.3342, -0.8995])    0      0   
tensor([ 2.0525, -1.9737])    0      0   
tensor([-1.3571,  2.1911])    1      1   
tensor([-0.4669,  0.2872])    1      1   
tensor([-2.0624,  2.2875])    1      1   
tensor([-2.1334,  2.6416])    1      1   
tensor([-3.1325,  5.1561])    1      1   
tensor([ 2.2128, -2.5172])    0      0   
tensor([ 1.0346, -1.7764])    0      0   
tensor([ 1.1221, -1.6717])    0      0   
tensor([-2.1322,  1.6714])    1      1   
tensor([ 1.5009, -1.6338])    0      0   
tensor([ 2.0387, -1.8475])    0      0   
tensor([-1.6346,  2.8899])    1      1   
tensor([-3.0129,  2.3519])    1      1   
tensor([-1.5746,  2.0000])    1      1   
tensor([ 1.3056, -2.2630])    0      0   
tensor([ 0.6631, -1.4797])    0      0   
tensor([-1.4585,  2.1836])    1      1   
tensor([ 1.0574, -1.5848])    0      1   
tensor([ 0.3376, -0.8050])    0      1   
tensor([ 1.9217, -1.9764])    0      0   
tensor([ 0.1011, -0.5529])    0      0   
tensor([ 0.6703, -0.5540])    0      0   
tensor([-0.6733,  0.8777])    1      1   
tensor([ 2.2017, -2.0445])    0      0   
tensor([-0.0442, -0.4276])    0      0   
tensor([-1.1204,  1.2558])    1      1   
tensor([-1.8170,  2.7124])    1      1   
tensor([ 1.7404, -2.0341])    0      0   
tensor([ 1.3266, -2.3039])    0      0   
tensor([-0.0671,  0.3291])    1      0

45 out of 50 = 90.00% correct
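The first 50 rows only give a rough idea; a short follow-up computes accuracy over the full 12,000-row test set:

# Accuracy over the entire test set
preds = y_val.argmax(dim=1)
correct_total = (preds == y_test).sum().item()
print(f'{correct_total} out of {len(y_test)} = {100*correct_total/len(y_test):.2f}% correct')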

Save the model

Save the trained model to a file in case you want to come back later and feed new data through it.

# Make sure to save the model only after the training has happened!
if len(losses) == epochs:
    torch.save(model.state_dict(), 'TaxiFareClssModel.pt')
else:
    print('Model has not been trained. Consider loading a trained model instead.')

Loading a saved model (starting from scratch)

We can load the trained weights and biases from a saved model. If we’ve just opened the notebook, we’ll have to run standard imports and function definitions. To demonstrate, restart the kernel before proceeding.

import torch
import torch.nn as nn
import numpy as np
import pandas as pd

def haversine_distance(df, lat1, long1, lat2, long2):
    r = 6371
    phi1 = np.radians(df[lat1])
    phi2 = np.radians(df[lat2])
    delta_phi = np.radians(df[lat2]-df[lat1])
    delta_lambda = np.radians(df[long2]-df[long1])
    a = np.sin(delta_phi/2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    return r * c

class TabularModel(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        layerlist = []
        n_emb = sum((nf for ni,nf in emb_szs))
        n_in = n_emb + n_cont
        for i in layers:
            layerlist.append(nn.Linear(n_in,i)) 
            layerlist.append(nn.ReLU(inplace=True))
            layerlist.append(nn.BatchNorm1d(i))
            layerlist.append(nn.Dropout(p))
            n_in = i
        layerlist.append(nn.Linear(layers[-1],out_sz))
        self.layers = nn.Sequential(*layerlist)
    def forward(self, x_cat, x_cont):
        embeddings = []
        for i,e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:,i]))
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)
        x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1)
        return self.layers(x)

Now define the model. Before we can load the saved settings, we need to instantiate our TabularModel with the parameters we used before (embedding sizes, number of continuous columns, output size, layer sizes, and dropout layer p-value).

emb_szs = [(24, 12), (2, 1), (7, 4)]
model2 = TabularModel(emb_szs, 6, 2, [200,100], p=0.4)

Once the model is set up, loading the saved settings is a snap.

model2.load_state_dict(torch.load('TaxiFareClssModel.pt'));
model2.eval() # be sure to run this step!
TabularModel(
  (embeds): ModuleList(
    (0): Embedding(24, 12)
    (1): Embedding(2, 1)
    (2): Embedding(7, 4)
  )
  (emb_drop): Dropout(p=0.4)
  (bn_cont): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=23, out_features=200, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4)
    (8): Linear(in_features=100, out_features=2, bias=True)
  )
)
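Calling eval() turns off the Dropout layers and makes BatchNorm1d use its stored running statistics, which is what we want at inference time. A quick way to confirm the mode:

# Confirm the model is in evaluation mode (affects Dropout and BatchNorm1d)
model2.training   # False after model2.eval()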

Next we’ll define a function that takes in new parameters from the user, performs all of the preprocessing steps above, and passes the new data through our trained model.

def test_data(mdl): # pass in the name of the new model
    # INPUT NEW DATA
    plat = float(input('What is the pickup latitude?  '))
    plong = float(input('What is the pickup longitude? '))
    dlat = float(input('What is the dropoff latitude?  '))
    dlong = float(input('What is the dropoff longitude? '))
    psngr = int(input('How many passengers? '))
    dt = input('What is the pickup date and time?\nFormat as YYYY-MM-DD HH:MM:SS     ')

    # PREPROCESS THE DATA
    dfx_dict = {'pickup_latitude':plat,'pickup_longitude':plong,'dropoff_latitude':dlat,
         'dropoff_longitude':dlong,'passenger_count':psngr,'EDTdate':dt}
    dfx = pd.DataFrame(dfx_dict, index=[0])
    dfx['dist_km'] = haversine_distance(dfx,'pickup_latitude', 'pickup_longitude',
                                        'dropoff_latitude', 'dropoff_longitude')
    dfx['EDTdate'] = pd.to_datetime(dfx['EDTdate'])

    # We can skip the .astype(category) step since our fields are small,
    # and encode them right away
    dfx['Hour'] = dfx['EDTdate'].dt.hour
    dfx['AMorPM'] = np.where(dfx['Hour']<12,0,1) 
    dfx['Weekday'] = dfx['EDTdate'].dt.strftime("%a")
    dfx['Weekday'] = dfx['Weekday'].replace(['Fri','Mon','Sat','Sun','Thu','Tue','Wed'],
                                            [0,1,2,3,4,5,6]).astype('int64')
    # CREATE CAT AND CONT TENSORS
    cat_cols = ['Hour', 'AMorPM', 'Weekday']
    cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
                 'dropoff_longitude', 'passenger_count', 'dist_km']
    xcats = np.stack([dfx[col].values for col in cat_cols], 1)
    xcats = torch.tensor(xcats, dtype=torch.int64)
    xconts = np.stack([dfx[col].values for col in cont_cols], 1)
    xconts = torch.tensor(xconts, dtype=torch.float)

    # PASS NEW DATA THROUGH THE MODEL WITHOUT PERFORMING A BACKPROP
    with torch.no_grad():
        z = mdl(xcats, xconts).argmax().item()
    print(f'\nThe predicted fare class is {z}')

Feed new data through the trained model

For convenience, here are the min and max values for each of the input variables:

Column             Minimum              Maximum
pickup_latitude    40                   41
pickup_longitude   -74.5                -73.3
dropoff_latitude   40                   41
dropoff_longitude  -74.5                -73.3
passenger_count    1                    5
EDTdate            2010-04-11 00:00:00  2010-04-24 23:59:42

Use caution! The distance between 1 degree of latitude (from 40 to 41) is 111 km (69 mi), and between 1 degree of longitude (from -73 to -74) it is 85 km (53 mi). The longest cab ride in the dataset spanned a difference of only 0.243 degrees latitude and 0.284 degrees longitude. The mean difference for both latitude and longitude was about 0.02. To get a fair prediction, use values that fall close to one another.

test_data(model2)
What is the pickup latitude?  40.5
What is the pickup longitude? -73.9
What is the dropoff latitude?  40.52
What is the dropoff longitude? -73.92
How many passengers? 2
What is the pickup date and time?
Format as YYYY-MM-DD HH:MM:SS     2010-04-15 16:00:00

The predicted fare class is 1

Perfect! Where our regression model predicted a fare value of ~$14, our binary classification model predicts a fare greater than $10.

Great job!