by Jawad Haider
06 - Neural Network Exercises - SOLUTIONS¶
- Neural Network Exercises - SOLUTIONS
- Census Income Dataset
- Perform standard imports
- 1. Separate continuous, categorical and label column names
- 2. Convert categorical columns to category dtypes
- Optional: Shuffle the dataset
- 3. Set the embedding sizes
- 4. Create an array of categorical values
- 5. Convert “cats” to a tensor
- 6. Create an array of continuous values
- 7. Convert “conts” to a tensor
- 8. Create a label tensor
- 9. Create train and test sets from cats, conts, and y
- Define the model class
- 10. Set the random seed
- 11. Create a TabularModel instance
- 12. Define the loss and optimization functions
- Train the model
- 13. Plot the Cross Entropy Loss against epochs
- 14. Evaluate the test set
- 15. Calculate the overall percent accuracy
- BONUS: Feed new data through the trained model
- Great job!
Neural Network Exercises - SOLUTIONS¶
For these exercises we’ll perform a binary classification on the Census Income dataset, available from the UC Irvine Machine Learning Repository. The goal is to determine whether an individual earns more than $50K based on a set of continuous and categorical variables.
IMPORTANT NOTE: Make sure you don’t run the cells directly above the example output shown, otherwise you will end up writing over the example output!
Census Income Dataset¶
For this exercise we’re using the Census Income dataset available from the UC Irvine Machine Learning Repository.
The full dataset has 48,842 entries. For this exercise we have reduced the number of records, fields and field entries, and have removed entries with null values. The file income.csv has 30,000 entries.
Each entry contains the following information about an individual:
- age: the age of an individual as an integer from 18 to 90 (continuous)
- sex: Male or Female (categorical)
- education: represents the highest level of education achieved by an individual (categorical)
- education-num: represents education as an integer from 3 to 16 (categorical)
education-num | education | education-num | education | education-num | education
---|---|---|---|---|---
3 | 5th-6th | 8 | 12th | 13 | Bachelors
4 | 7th-8th | 9 | HS-grad | 14 | Masters
5 | 9th | 10 | Some-college | 15 | Prof-school
6 | 10th | 11 | Assoc-voc | 16 | Doctorate
7 | 11th | 12 | Assoc-acdm | |
- marital-status: marital status of an individual (categorical)
  Married, Divorced, Married-spouse-absent, Separated, Widowed, Never-married
- workclass: a general term to represent the employment status of an individual (categorical)
  Local-gov, Private, State-gov, Self-emp, Federal-gov
- occupation: the general type of occupation of an individual (categorical)
  Adm-clerical, Handlers-cleaners, Protective-serv, Craft-repair, Machine-op-inspct, Sales, Exec-managerial, Other-service, Tech-support, Farming-fishing, Prof-specialty, Transport-moving
- hours-per-week: the hours an individual has reported to work per week as an integer from 20 to 90 (continuous)
- income: whether or not an individual makes more than $50,000 annually (label)
- label: income represented as an integer (0: <=$50K, 1: >$50K) (optional label)
Perform standard imports¶
Run the cell below to load the libraries needed for this exercise and the Census Income dataset.
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
%matplotlib inline
df = pd.read_csv('../Data/income.csv')
len(df)
30000
df.head()
age | sex | education | education-num | marital-status | workclass | occupation | hours-per-week | income | label | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 27 | Male | HS-grad | 9 | Never-married | Private | Craft-repair | 40 | <=50K | 0 |
1 | 47 | Male | Masters | 14 | Married | Local-gov | Exec-managerial | 50 | >50K | 1 |
2 | 59 | Male | HS-grad | 9 | Divorced | Self-emp | Prof-specialty | 20 | <=50K | 0 |
3 | 38 | Female | Prof-school | 15 | Never-married | Federal-gov | Prof-specialty | 57 | >50K | 1 |
4 | 64 | Female | 11th | 7 | Widowed | Private | Farming-fishing | 40 | <=50K | 0 |
df['label'].value_counts()
0    21700
1     8300
Name: label, dtype: int64
1. Separate continuous, categorical and label column names¶
You should find that there are 5 categorical columns, 2 continuous columns and 1 label.
In the case of education and education-num it doesn’t matter which column you use. For the label column, be sure to use label and not income.
Assign the variable names “cat_cols”, “cont_cols” and “y_col” to the lists of names.
# CODE HERE
# RUN THIS CODE TO COMPARE RESULTS:
print(f'cat_cols has {len(cat_cols)} columns')
print(f'cont_cols has {len(cont_cols)} columns')
print(f'y_col has {len(y_col)} column')
# DON'T WRITE HERE
cat_cols = ['sex', 'education', 'marital-status', 'workclass', 'occupation']
cont_cols = ['age', 'hours-per-week']
y_col = ['label']
print(f'cat_cols has {len(cat_cols)} columns') # 5
print(f'cont_cols has {len(cont_cols)} columns') # 2
print(f'y_col has {len(y_col)} column') # 1
cat_cols has 5 columns
cont_cols has 2 columns
y_col has 1 column
2. Convert categorical columns to category dtypes¶
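The solution cell isn’t shown here; a minimal sketch using pandas’ astype('category'), which the .cat accessors used in the later steps depend on:
# DON'T WRITE HERE
for col in cat_cols:
    df[col] = df[col].astype('category')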
Optional: Shuffle the dataset¶
The income.csv dataset is already shuffled. However, if you would like to try different configurations after completing the exercises, this is where you would want to shuffle the entire set.
# THIS CELL IS OPTIONAL
df = shuffle(df, random_state=101)
df.reset_index(drop=True, inplace=True)
df.head()
3. Set the embedding sizes¶
Create a variable “cat_szs” to hold the number of categories in each variable. Then create a variable “emb_szs” to hold the list of (category size, embedding size) tuples.
# DON'T WRITE HERE
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs
[(2, 1), (14, 7), (6, 3), (5, 3), (12, 6)]
4. Create an array of categorical values¶
Create a NumPy array called “cats” that contains a stack of each categorical column’s .cat.codes.values.
Note: your output may contain different values. Ours came after performing the shuffle step shown above.
# DON'T WRITE HERE
sx = df['sex'].cat.codes.values
ed = df['education'].cat.codes.values
ms = df['marital-status'].cat.codes.values
wc = df['workclass'].cat.codes.values
oc = df['occupation'].cat.codes.values
cats = np.stack([sx,ed,ms,wc,oc], 1)
cats[:5]
array([[ 1, 10, 3, 2, 1],
[ 1, 11, 1, 1, 2],
[ 1, 10, 0, 3, 7],
[ 0, 12, 3, 0, 7],
[ 0, 1, 5, 2, 3]], dtype=int8)
5. Convert “cats” to a tensor¶
Convert the “cats” NumPy array to a tensor of dtype int64
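The solution cell isn’t shown; a minimal sketch, passing the NumPy array straight to torch.tensor:
# DON'T WRITE HERE
cats = torch.tensor(cats, dtype=torch.int64)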
6. Create an array of continuous values¶
Create a NumPy array called “conts” that contains a stack of each continuous column.
Note: your output may contain different values. Ours came after performing the shuffle step shown above.
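A minimal sketch of the solution, stacking the raw column values (no .cat.codes needed, since these columns are already numeric):
# DON'T WRITE HERE
conts = np.stack([df[col].values for col in cont_cols], 1)
conts[:5]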
array([[27, 40],
[47, 50],
[59, 20],
[38, 57],
[64, 40]], dtype=int64)
7. Convert “conts” to a tensor¶
Convert the “conts” NumPy array to a tensor of dtype float32
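A minimal sketch, mirroring the conversion used for “cats”:
# DON'T WRITE HERE
conts = torch.tensor(conts, dtype=torch.float32)
conts.dtype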
torch.float32
8. Create a label tensor¶
Create a tensor called “y” from the values in the label column. Be sure to flatten the tensor so that it can be passed into the CE Loss function.
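A minimal sketch of the solution; .flatten() reduces the (30000, 1) column to the 1-D shape that CrossEntropyLoss expects for its targets:
# DON'T WRITE HERE
y = torch.tensor(df[y_col].values).flatten()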
9. Create train and test sets from cats, conts, and y¶
We use the entire batch of 30,000 records, but a smaller batch size will save time during training. We used a test size of 5,000 records, but you can choose another fixed value or a percentage of the batch size. Make sure that your test records remain separate from your training records, without overlap. To make coding slices easier, we recommend assigning batch and test sizes to simple variables like “b” and “t”.
# DON'T WRITE HERE
b = 30000 # suggested batch size
t = 5000 # suggested test size
cat_train = cats[:b-t]
cat_test = cats[b-t:b]
con_train = conts[:b-t]
con_test = conts[b-t:b]
y_train = y[:b-t]
y_test = y[b-t:b]
Define the model class¶
Run the cell below to define the TabularModel model class we’ve used before.
class TabularModel(nn.Module):
def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
# Call the parent __init__
super().__init__()
# Set up the embedding, dropout, and batch normalization layer attributes
self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
self.emb_drop = nn.Dropout(p)
self.bn_cont = nn.BatchNorm1d(n_cont)
# Assign a variable to hold a list of layers
layerlist = []
# Assign a variable to store the number of embedding and continuous layers
n_emb = sum((nf for ni,nf in emb_szs))
n_in = n_emb + n_cont
# Iterate through the passed-in "layers" parameter (e.g. [200,100]) to build a list of layers
for i in layers:
layerlist.append(nn.Linear(n_in,i))
layerlist.append(nn.ReLU(inplace=True))
layerlist.append(nn.BatchNorm1d(i))
layerlist.append(nn.Dropout(p))
n_in = i
layerlist.append(nn.Linear(layers[-1],out_sz))
# Convert the list of layers into an attribute
self.layers = nn.Sequential(*layerlist)
def forward(self, x_cat, x_cont):
# Extract embedding values from the incoming categorical data
embeddings = []
for i,e in enumerate(self.embeds):
embeddings.append(e(x_cat[:,i]))
x = torch.cat(embeddings, 1)
# Perform an initial dropout on the embeddings
x = self.emb_drop(x)
# Normalize the incoming continuous data
x_cont = self.bn_cont(x_cont)
x = torch.cat([x, x_cont], 1)
# Set up model layers
x = self.layers(x)
return x
10. Set the random seed¶
To obtain results that can be recreated, set a torch manual_seed (we used 33).
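A one-line solution (the Generator object printed below is simply the cell’s return value):
# DON'T WRITE HERE
torch.manual_seed(33)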
<torch._C.Generator at 0x1e5e64e5e30>
11. Create a TabularModel instance¶
Create an instance called “model” with one hidden layer containing 50 neurons and a dropout layer p-value of 0.4
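A plausible instantiation matching the printed architecture below (out_sz=2 for the two label classes, one hidden layer of 50 neurons, dropout p=0.4):
# DON'T WRITE HERE
model = TabularModel(emb_szs, conts.shape[1], 2, [50], p=0.4)
model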
TabularModel(
(embeds): ModuleList(
(0): Embedding(2, 1)
(1): Embedding(14, 7)
(2): Embedding(6, 3)
(3): Embedding(5, 3)
(4): Embedding(12, 6)
)
(emb_drop): Dropout(p=0.4)
(bn_cont): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(layers): Sequential(
(0): Linear(in_features=22, out_features=50, bias=True)
(1): ReLU(inplace)
(2): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.4)
(4): Linear(in_features=50, out_features=2, bias=True)
)
)
12. Define the loss and optimization functions¶
Create a loss function called “criterion” using CrossEntropyLoss. Create an optimization function called “optimizer” using Adam, with a learning rate of 0.001.
# DON'T WRITE HERE
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Train the model¶
Run the cell below to train the model through 300 epochs. Remember, results may vary! After completing the exercises, feel free to come back to this section and experiment with different parameters.
import time
start_time = time.time()
epochs = 300
losses = []
for i in range(epochs):
i+=1
y_pred = model(cat_train, con_train)
loss = criterion(y_pred, y_train)
    losses.append(loss.item())  # store the scalar value; appending the tensor itself retains the autograd graph and breaks plotting in newer PyTorch
# a neat trick to save screen space:
if i%25 == 1:
print(f'epoch: {i:3} loss: {loss.item():10.8f}')
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f'epoch: {i:3} loss: {loss.item():10.8f}') # print the last line
print(f'\nDuration: {time.time() - start_time:.0f} seconds') # print the time elapsed
epoch: 1 loss: 0.65308946
epoch: 26 loss: 0.54059124
epoch: 51 loss: 0.46917316
epoch: 76 loss: 0.41288978
epoch: 101 loss: 0.37744597
epoch: 126 loss: 0.35649022
epoch: 151 loss: 0.34338138
epoch: 176 loss: 0.33378774
epoch: 201 loss: 0.32601979
epoch: 226 loss: 0.32018784
epoch: 251 loss: 0.31548899
epoch: 276 loss: 0.30901730
epoch: 300 loss: 0.30690485
Duration: 170 seconds
13. Plot the Cross Entropy Loss against epochs¶
Results may vary. The shape of the plot is what matters.
# DON'T WRITE HERE
plt.plot(range(epochs), losses)
plt.ylabel('Cross Entropy Loss')
plt.xlabel('epoch');
14. Evaluate the test set¶
With torch set to no_grad, pass cat_test and con_test through the trained model. Create a validation set called “y_val”. Compare the output to y_test using the loss function defined above. Results may vary.
# TO EVALUATE THE TEST SET
with torch.no_grad():
y_val = model(cat_test, con_test)
loss = criterion(y_val, y_test)
print(f'CE Loss: {loss:.8f}')
CE Loss: 0.30774996
15. Calculate the overall percent accuracy¶
Using a for loop, compare the argmax values of the y_val validation set to the y_test set.
# DON'T WRITE HERE
rows = len(y_test)
correct = 0
# print(f'{"MODEL OUTPUT":26} ARGMAX Y_TEST')
for i in range(rows):
# print(f'{str(y_val[i]):26} {y_val[i].argmax().item():^7}{y_test[i]:^7}')
if y_val[i].argmax().item() == y_test[i]:
correct += 1
print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')
4255 out of 5000 = 85.10% correct
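As a follow-up, the same tally can be computed without a Python loop; a sketch using tensor operations, equivalent to the loop above:
correct = (y_val.argmax(dim=1) == y_test).sum().item()
print(f'{correct} out of {len(y_test)} = {100*correct/len(y_test):.2f}% correct')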
BONUS: Feed new data through the trained model¶
See if you can write a function that allows a user to input their own
values, and generates a prediction.
HINT:
There’s no need to build a DataFrame. You can use inputs to populate column variables, map the categorical entries to integer codes with a dictionary, and pass the encoded values directly into the tensor constructors:
mar = input("What is the person's marital status? ")
mar_d = {'Divorced':0, 'Married':1, 'Married-spouse-absent':2, 'Never-married':3, 'Separated':4, 'Widowed':5}
mar = mar_d[mar]
cats = torch.tensor([..., ..., mar, ..., ...], dtype=torch.int64).reshape(1,-1)
Make sure the category names are in alphabetical order before assigning code numbers, so they match the codes produced by .cat.codes. Also, be sure to run model.eval() before passing new data through. Good luck!
# DON'T WRITE HERE
def test_data(mdl): # pass in the name of the model
# INPUT NEW DATA
age = float(input("What is the person's age? (18-90) "))
sex = input("What is the person's sex? (Male/Female) ").capitalize()
edn = int(input("What is the person's education level? (3-16) "))
mar = input("What is the person's marital status? ").capitalize()
wrk = input("What is the person's workclass? ").capitalize()
occ = input("What is the person's occupation? ").capitalize()
hrs = float(input("How many hours/week are worked? (20-90) "))
# PREPROCESS THE DATA
sex_d = {'Female':0, 'Male':1}
mar_d = {'Divorced':0, 'Married':1, 'Married-spouse-absent':2, 'Never-married':3, 'Separated':4, 'Widowed':5}
wrk_d = {'Federal-gov':0, 'Local-gov':1, 'Private':2, 'Self-emp':3, 'State-gov':4}
occ_d = {'Adm-clerical':0, 'Craft-repair':1, 'Exec-managerial':2, 'Farming-fishing':3, 'Handlers-cleaners':4,
'Machine-op-inspct':5, 'Other-service':6, 'Prof-specialty':7, 'Protective-serv':8, 'Sales':9,
'Tech-support':10, 'Transport-moving':11}
sex = sex_d[sex]
mar = mar_d[mar]
wrk = wrk_d[wrk]
occ = occ_d[occ]
# CREATE CAT AND CONT TENSORS
cats = torch.tensor([sex,edn,mar,wrk,occ], dtype=torch.int64).reshape(1,-1)
conts = torch.tensor([age,hrs], dtype=torch.float).reshape(1,-1)
# SET MODEL TO EVAL (in case this hasn't been done)
mdl.eval()
# PASS NEW DATA THROUGH THE MODEL WITHOUT PERFORMING A BACKPROP
with torch.no_grad():
z = mdl(cats, conts).argmax().item()
print(f'\nThe predicted label is {z}')
test_data(model)
What is the person's age? (18-90) 22
What is the person's sex? (Male/Female) male
What is the person's education level? (3-16) 12
What is the person's marital status? married
What is the person's workclass? private
What is the person's occupation? sales
How many hours/week are worked? (20-90) 40
The predicted label is 0