Skip to content

================ by Jawad Haider

06 - Neural Network Exercises - SOLUTIONS


Image
Copyright Qalmaqihir
For more information, visit us at www.github.com/qalmaqihir/



Neural Network Exercises - SOLUTIONS

For these exercises we’ll perform a binary classification on the Census Income dataset available from the UC Irvine Machine Learning Repository
The goal is to determine if an individual earns more than $50K based on a set of continuous and categorical variables.

IMPORTANT NOTE! Make sure you don’t run the cells directly above the example output shown,
otherwise you will end up writing over the example output!

Census Income Dataset

For this exercises we’re using the Census Income dataset available from the UC Irvine Machine Learning Repository.

The full dataset has 48,842 entries. For this exercise we have reduced the number of records, fields and field entries, and have removed entries with null values. The file income.csv has 30,000 entries

Each entry contains the following information about an individual: * age: the age of an individual as an integer from 18 to 90 (continuous) * sex: Male or Female (categorical) * education: represents the highest level of education achieved by an individual (categorical) * education_num: represents education as an integer from 3 to 16 (categorical)

3 5th-6th 8 12th 13 Bachelors
4 7th-8th 9 HS-grad 14 Masters
5 9th 10 Some-college 15 Prof-school
6 10th 11 Assoc-voc 16 Doctorate
7 11th 12 Assoc-acdm
  • marital-status: marital status of an individual (categorical)

    Married Divorced Married-spouse-absent
    Separated Widowed Never-married

  • workclass: a general term to represent the employment status of an individual (categorical)

    Local-gov Private
    State-gov Self-emp
    Federal-gov

  • occupation: the general type of occupation of an individual (categorical)

    Adm-clerical Handlers-cleaners Protective-serv
    Craft-repair Machine-op-inspct Sales
    Exec-managerial Other-service Tech-support
    Farming-fishing Prof-specialty Transport-moving

  • hours-per-week: the hours an individual has reported to work per week as an integer from 20 to 90 (continuous)

  • income: whether or not an individual makes more than \$50,000 annually (label)
  • label: income represented as an integer (0: \<=\$50K, 1: >\$50K) (optional label)

Perform standard imports

Run the cell below to load the libraries needed for this exercise and the Census Income dataset.

import torch
import torch.nn as nn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
%matplotlib inline

df = pd.read_csv('../Data/income.csv')
print(len(df))
df.head()
30000
age sex education education-num marital-status workclass occupation hours-per-week income label
0 27 Male HS-grad 9 Never-married Private Craft-repair 40 <=50K 0
1 47 Male Masters 14 Married Local-gov Exec-managerial 50 >50K 1
2 59 Male HS-grad 9 Divorced Self-emp Prof-specialty 20 <=50K 0
3 38 Female Prof-school 15 Never-married Federal-gov Prof-specialty 57 >50K 1
4 64 Female 11th 7 Widowed Private Farming-fishing 40 <=50K 0
df['label'].value_counts()
0    21700
1     8300
Name: label, dtype: int64

1. Separate continuous, categorical and label column names

You should find that there are 5 categorical columns, 2 continuous columns and 1 label.
In the case of education and education-num it doesn’t matter which column you use. For the label column, be sure to use label and not income.
Assign the variable names “cat_cols”, “cont_cols” and “y_col” to the lists of names.

df.columns
# CODE HERE







# RUN THIS CODE TO COMPARE RESULTS:
print(f'cat_cols  has {len(cat_cols)} columns')
print(f'cont_cols has {len(cont_cols)} columns')
print(f'y_col     has {len(y_col)} column')
# DON'T WRITE HERE
cat_cols = ['sex', 'education', 'marital-status', 'workclass', 'occupation']
cont_cols = ['age', 'hours-per-week']
y_col = ['label']

print(f'cat_cols  has {len(cat_cols)} columns')  # 5
print(f'cont_cols has {len(cont_cols)} columns') # 2
print(f'y_col     has {len(y_col)} column')      # 1
cat_cols  has 5 columns
cont_cols has 2 columns
y_col     has 1 column

2. Convert categorical columns to category dtypes

# CODE HERE
# DON'T WRITE HERE
for cat in cat_cols:
    df[cat] = df[cat].astype('category')

Optional: Shuffle the dataset

The income.csv dataset is already shuffled. However, if you would like to try different configurations after completing the exercises, this is where you would want to shuffle the entire set.

# THIS CELL IS OPTIONAL
df = shuffle(df, random_state=101)
df.reset_index(drop=True, inplace=True)
df.head()

3. Set the embedding sizes

Create a variable “cat_szs” to hold the number of categories in each variable.
Then create a variable “emb_szs” to hold the list of (category size, embedding size) tuples.

# CODE HERE
# DON'T WRITE HERE
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs
[(2, 1), (14, 7), (6, 3), (5, 3), (12, 6)]

4. Create an array of categorical values

Create a NumPy array called “cats” that contains a stack of each categorical column .cat.codes.values
Note: your output may contain different values. Ours came after performing the shuffle step shown above.

# CODE HERE







# RUN THIS CODE TO COMPARE RESULTS
cats[:5]
# DON'T WRITE HERE
sx = df['sex'].cat.codes.values
ed = df['education'].cat.codes.values
ms = df['marital-status'].cat.codes.values
wc = df['workclass'].cat.codes.values
oc = df['occupation'].cat.codes.values

cats = np.stack([sx,ed,ms,wc,oc], 1)

cats[:5]
array([[ 1, 10,  3,  2,  1],
       [ 1, 11,  1,  1,  2],
       [ 1, 10,  0,  3,  7],
       [ 0, 12,  3,  0,  7],
       [ 0,  1,  5,  2,  3]], dtype=int8)

5. Convert “cats” to a tensor

Convert the “cats” NumPy array to a tensor of dtype int64

# CODE HERE
# DON'T WRITE HERE
cats = torch.tensor(cats, dtype=torch.int64)

6. Create an array of continuous values

Create a NumPy array called “conts” that contains a stack of each continuous column.
Note: your output may contain different values. Ours came after performing the shuffle step shown above.

# CODE HERE


# RUN THIS CODE TO COMPARE RESULTS
conts[:5]
# DON'T WRITE HERE
conts = np.stack([df[col].values for col in cont_cols], 1)
conts[:5]
array([[27, 40],
       [47, 50],
       [59, 20],
       [38, 57],
       [64, 40]], dtype=int64)

7. Convert “conts” to a tensor

Convert the “conts” NumPy array to a tensor of dtype float32

# CODE HERE


# RUN THIS CODE TO COMPARE RESULTS
conts.dtype
# DON'T WRITE HERE
conts = torch.tensor(conts, dtype=torch.float)
conts.dtype
torch.float32

8. Create a label tensor

Create a tensor called “y” from the values in the label column. Be sure to flatten the tensor so that it can be passed into the CE Loss function.

# CODE HERE
# DON'T WRITE HERE
y = torch.tensor(df[y_col].values).flatten()

9. Create train and test sets from cats, conts, and y

We use the entire batch of 30,000 records, but a smaller batch size will save time during training.
We used a test size of 5,000 records, but you can choose another fixed value or a percentage of the batch size.
Make sure that your test records remain separate from your training records, without overlap.
To make coding slices easier, we recommend assigning batch and test sizes to simple variables like “b” and “t”.

# CODE HERE
b = 30000 # suggested batch size
t = 5000  # suggested test size
# DON'T WRITE HERE
b = 30000 # suggested batch size
t = 5000  # suggested test size

cat_train = cats[:b-t]
cat_test  = cats[b-t:b]
con_train = conts[:b-t]
con_test  = conts[b-t:b]
y_train   = y[:b-t]
y_test    = y[b-t:b]

Define the model class

Run the cell below to define the TabularModel model class we’ve used before.

class TabularModel(nn.Module):

    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        # Call the parent __init__
        super().__init__()

        # Set up the embedding, dropout, and batch normalization layer attributes
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)

        # Assign a variable to hold a list of layers
        layerlist = []

        # Assign a variable to store the number of embedding and continuous layers
        n_emb = sum((nf for ni,nf in emb_szs))
        n_in = n_emb + n_cont

        # Iterate through the passed-in "layers" parameter (ie, [200,100]) to build a list of layers
        for i in layers:
            layerlist.append(nn.Linear(n_in,i)) 
            layerlist.append(nn.ReLU(inplace=True))
            layerlist.append(nn.BatchNorm1d(i))
            layerlist.append(nn.Dropout(p))
            n_in = i
        layerlist.append(nn.Linear(layers[-1],out_sz))

        # Convert the list of layers into an attribute
        self.layers = nn.Sequential(*layerlist)

    def forward(self, x_cat, x_cont):
        # Extract embedding values from the incoming categorical data
        embeddings = []
        for i,e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:,i]))
        x = torch.cat(embeddings, 1)
        # Perform an initial dropout on the embeddings
        x = self.emb_drop(x)

        # Normalize the incoming continuous data
        x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1)

        # Set up model layers
        x = self.layers(x)
        return x

10. Set the random seed

To obtain results that can be recreated, set a torch manual_seed (we used 33).

# CODE HERE
# DON'T WRITE HERE
torch.manual_seed(33)
<torch._C.Generator at 0x1e5e64e5e30>

11. Create a TabularModel instance

Create an instance called “model” with one hidden layer containing 50 neurons and a dropout layer p-value of 0.4

# CODE HERE


# RUN THIS CODE TO COMPARE RESULTS
model
# DON'T WRITE HERE
model = TabularModel(emb_szs, conts.shape[1], 2, [50], p=0.4)
model
TabularModel(
  (embeds): ModuleList(
    (0): Embedding(2, 1)
    (1): Embedding(14, 7)
    (2): Embedding(6, 3)
    (3): Embedding(5, 3)
    (4): Embedding(12, 6)
  )
  (emb_drop): Dropout(p=0.4)
  (bn_cont): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=22, out_features=50, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4)
    (4): Linear(in_features=50, out_features=2, bias=True)
  )
)

12. Define the loss and optimization functions

Create a loss function called “criterion” using CrossEntropyLoss
Create an optimization function called “optimizer” using Adam, with a learning rate of 0.001

# CODE HERE
# DON'T WRITE HERE
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Train the model

Run the cell below to train the model through 300 epochs. Remember, results may vary!
After completing the exercises, feel free to come back to this section and experiment with different parameters.

import time
start_time = time.time()

epochs = 300
losses = []

for i in range(epochs):
    i+=1
    y_pred = model(cat_train, con_train)
    loss = criterion(y_pred, y_train)
    losses.append(loss)

    # a neat trick to save screen space:
    if i%25 == 1:
        print(f'epoch: {i:3}  loss: {loss.item():10.8f}')

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f'epoch: {i:3}  loss: {loss.item():10.8f}') # print the last line
print(f'\nDuration: {time.time() - start_time:.0f} seconds') # print the time elapsed
epoch:   1  loss: 0.65308946
epoch:  26  loss: 0.54059124
epoch:  51  loss: 0.46917316
epoch:  76  loss: 0.41288978
epoch: 101  loss: 0.37744597
epoch: 126  loss: 0.35649022
epoch: 151  loss: 0.34338138
epoch: 176  loss: 0.33378774
epoch: 201  loss: 0.32601979
epoch: 226  loss: 0.32018784
epoch: 251  loss: 0.31548899
epoch: 276  loss: 0.30901730
epoch: 300  loss: 0.30690485

Duration: 170 seconds

13. Plot the Cross Entropy Loss against epochs

Results may vary. The shape of the plot is what matters.

# CODE HERE
# DON'T WRITE HERE
plt.plot(range(epochs), losses)
plt.ylabel('Cross Entropy Loss')
plt.xlabel('epoch');

14. Evaluate the test set

With torch set to no_grad, pass cat_test and con_test through the trained model. Create a validation set called “y_val”. Compare the output to y_test using the loss function defined above. Results may vary.

# CODE HERE




# RUN THIS CODE TO COMPARE RESULTS
print(f'CE Loss: {loss:.8f}')
# TO EVALUATE THE TEST SET
with torch.no_grad():
    y_val = model(cat_test, con_test)
    loss = criterion(y_val, y_test)
print(f'CE Loss: {loss:.8f}')
CE Loss: 0.30774996

15. Calculate the overall percent accuracy

Using a for loop, compare the argmax values of the y_val validation set to the y_test set.

# CODE HERE
# DON'T WRITE HERE
rows = len(y_test)
correct = 0

# print(f'{"MODEL OUTPUT":26} ARGMAX  Y_TEST')

for i in range(rows):
    # print(f'{str(y_val[i]):26} {y_val[i].argmax().item():^7}{y_test[i]:^7}')

    if y_val[i].argmax().item() == y_test[i]:
        correct += 1

print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')
4255 out of 5000 = 85.10% correct

BONUS: Feed new data through the trained model

See if you can write a function that allows a user to input their own values, and generates a prediction.
HINT:
There’s no need to build a DataFrame. You can use inputs to populate column variables, convert them to embeddings with a context dictionary, and pass the embedded values directly into the tensor constructors:

mar = input("What is the person's marital status? ")
mar_d = dict(Divorced=0, Married=1, Married-spouse-absent=2, Never-married=3, Separated=4, Widowed=5)
mar = mar_d[mar]
cats = torch.tensor([..., ..., mar, ..., ...], dtype=torch.int64).reshape(1,-1)

Make sure that names are put in alphabetical order before assigning numbers.

Also, be sure to run model.eval() before passing new date through. Good luck!

# WRITE YOUR CODE HERE:
# RUN YOUR CODE HERE:
# DON'T WRITE HERE
def test_data(mdl): # pass in the name of the model
    # INPUT NEW DATA
    age = float(input("What is the person's age? (18-90)  "))
    sex = input("What is the person's sex? (Male/Female) ").capitalize()
    edn = int(input("What is the person's education level? (3-16) "))
    mar = input("What is the person's marital status? ").capitalize()
    wrk = input("What is the person's workclass? ").capitalize()
    occ = input("What is the person's occupation? ").capitalize()
    hrs = float(input("How many hours/week are worked? (20-90)  "))

    # PREPROCESS THE DATA
    sex_d = {'Female':0, 'Male':1}
    mar_d = {'Divorced':0, 'Married':1, 'Married-spouse-absent':2, 'Never-married':3, 'Separated':4, 'Widowed':5}
    wrk_d = {'Federal-gov':0, 'Local-gov':1, 'Private':2, 'Self-emp':3, 'State-gov':4}
    occ_d = {'Adm-clerical':0, 'Craft-repair':1, 'Exec-managerial':2, 'Farming-fishing':3, 'Handlers-cleaners':4,
            'Machine-op-inspct':5, 'Other-service':6, 'Prof-specialty':7, 'Protective-serv':8, 'Sales':9, 
            'Tech-support':10, 'Transport-moving':11}

    sex = sex_d[sex]
    mar = mar_d[mar]
    wrk = wrk_d[wrk]
    occ = occ_d[occ]

    # CREATE CAT AND CONT TENSORS
    cats = torch.tensor([sex,edn,mar,wrk,occ], dtype=torch.int64).reshape(1,-1)
    conts = torch.tensor([age,hrs], dtype=torch.float).reshape(1,-1)

    # SET MODEL TO EVAL (in case this hasn't been done)
    mdl.eval()

    # PASS NEW DATA THROUGH THE MODEL WITHOUT PERFORMING A BACKPROP
    with torch.no_grad():
        z = mdl(cats, conts).argmax().item()

    print(f'\nThe predicted label is {z}')

test_data(model)
What is the person's age? (18-90)  22
What is the person's sex? (Male/Female) male
What is the person's education level? (3-16) 12
What is the person's marital status? married
What is the person's workclass? private
What is the person's occupation? sales
How many hours/week are worked? (20-90)  40

The predicted label is 0

Great job!