by Jawad Haider
00 - PyTorch Gradients¶
- PyTorch Gradients
- Autograd - Automatic Differentiation
- Back-propagation on one step
- Back-propagation on multiple steps
- Turn off tracking
PyTorch Gradients¶
This section covers the PyTorch autograd implementation of gradient descent. Tools include:
* torch.autograd.backward()
* torch.autograd.grad()
Before continuing in this section, be sure to watch the theory lectures
to understand the following concepts:
* Error functions (step and sigmoid)
* One-hot encoding
* Maximum likelihood
* Cross entropy (including multi-class cross entropy)
* Back-propagation (backprop)
Additional Resources:
PyTorch Notes: Autograd mechanics
Autograd - Automatic Differentiation¶
In previous sections we created tensors and performed a variety of operations on them, but we did nothing to store the sequence of operations, or to apply the derivative of a completed function.
In this section we’ll introduce the concept of the dynamic computational graph which is comprised of all the Tensor objects in the network, as well as the Functions used to create them. Note that only the input Tensors we create ourselves will not have associated Function objects.
The PyTorch autograd package provides automatic differentiation for all operations on Tensors. This is because operations become attributes of the tensors themselves. When a Tensor’s .requires_grad attribute is set to True, it starts to track all operations on it. When an operation finishes you can call .backward() and have all the gradients computed automatically. The gradient for a tensor will be accumulated into its .grad attribute.
Let’s see this in practice.
Back-propagation on one step¶
We’ll start by applying a single polynomial function $y = f(x)$ to tensor $x$. Then we’ll backprop and print the gradient $\frac{dy}{dx}$.
Step 1. Perform standard imports¶
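The import cell itself isn’t shown in this export; a minimal sketch is:

    import torch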
Step 2. Create a tensor with requires_grad set to True¶
This sets up computational tracking on the tensor.
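A sketch of this step, assuming the scalar value 2.0 (which is consistent with the outputs further below):

    x = torch.tensor(2.0, requires_grad=True)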
Step 3. Define a function¶
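The defining cell is missing from this export; one polynomial consistent with the output below (63 at $x = 2$, with gradient 93) is:

    y = 2*x**4 + x**3 + 3*x**2 + 5*x + 1
    print(y)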
tensor(63., grad_fn=<AddBackward0>)
Since $y$ was created as a result of an operation, it has an associated gradient function accessible as y.grad_fn

The calculation of $y$ is done as:

$y = 2x^4 + x^3 + 3x^2 + 5x + 1 = 2(2)^4 + (2)^3 + 3(2)^2 + 5(2) + 1 = 32 + 8 + 12 + 10 + 1 = 63$

This is the value of $y$ when $x = 2$.
Step 4. Backprop¶
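A sketch of the backward pass on our scalar $y$:

    y.backward()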
Step 5. Display the resulting gradient¶
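Reading out the accumulated gradient (a sketch):

    print(x.grad)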
tensor(93.)
Note that x.grad is an attribute of tensor $x$, so we don’t use parentheses. The computation is the result of

$\frac{dy}{dx} = 8x^3 + 3x^2 + 6x + 5 = 8(2)^3 + 3(2)^2 + 6(2) + 5 = 64 + 12 + 12 + 5 = 93$

This is the slope of the polynomial at the point $(2, 63)$.
Back-propagation on multiple steps¶
Now let’s do something more complex, involving layers $y$ and $z$ between $x$ and our output layer $out$.
1. Create a tensor¶
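A sketch that reproduces the tensor printed below:

    x = torch.tensor([[1., 2., 3.], [3., 2., 1.]], requires_grad=True)
    print(x)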
tensor([[1., 2., 3.],
[3., 2., 1.]], requires_grad=True)
2. Create the first layer with $y = 3x + 2$¶
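Reconstructed from the heading and the output below (a sketch):

    y = 3*x + 2
    print(y)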
tensor([[ 5., 8., 11.],
[11., 8., 5.]], grad_fn=<AddBackward0>)
3. Create the second layer with $z = 2y^2$¶
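Again reconstructed from the heading and output (a sketch):

    z = 2*y**2
    print(z)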
tensor([[ 50., 128., 242.],
[242., 128., 50.]], grad_fn=<MulBackward0>)
4. Set the output to be the matrix mean¶
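A sketch using the tensor’s mean:

    out = z.mean()
    print(out)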
tensor(140., grad_fn=<MeanBackward1>)
5. Now perform back-propagation to find the gradient of out w.r.t. x¶
(If you haven’t seen it before, w.r.t. is an abbreviation of with respect to)
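A sketch of the backward pass and the gradient read-out:

    out.backward()
    print(x.grad)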
tensor([[10., 16., 22.],
[22., 16., 10.]])
You should see a 2x3 matrix. If we call the final out tensor “$o$”, we can calculate the partial derivative of $o$ with respect to $x_i$ as follows:

$o = \frac{1}{6}\sum_{i=1}^{6} z_i$

$z_i = 2(y_i)^2 = 2(3x_i + 2)^2$

To solve the derivative of $z_i$ we use the chain rule, where the derivative of $f(g(x)) = f'(g(x))\,g'(x)$

In this case

$f(g(x)) = 2(g(x))^2, \quad f'(g(x)) = 4g(x)$

$g(x) = 3x + 2, \quad g'(x) = 3$

$\frac{dz_i}{dx_i} = 4(3x_i + 2) \times 3 = 12(3x_i + 2)$

Therefore,

$\frac{\partial o}{\partial x_i} = \frac{1}{6} \times 12(3x_i + 2) = 2(3x_i + 2)$

For example, at $x_1 = 1$ this gives $2(3 \cdot 1 + 2) = 10$, matching the first entry of the gradient above.
Turn off tracking¶
There may be times when we don’t want or need to track the computational history.
You can reset a tensor’s requires_grad attribute in-place using .requires_grad_(True) (or False) as needed.
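A quick sketch of toggling tracking in-place (variable names are illustrative):

    x = torch.tensor(2.0, requires_grad=True)
    x.requires_grad_(False)   # stop tracking operations on x
    print(x.requires_grad)    # False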
When performing evaluations, it’s often helpful to wrap a set of
operations in with torch.no_grad():
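For example (a sketch):

    x = torch.tensor(2.0, requires_grad=True)
    with torch.no_grad():
        y = x * 2             # computed without building a graph
    print(y.requires_grad)    # False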
A less-used method is to run .detach() on a tensor to prevent future computations from being tracked. This can be handy when cloning a tensor.
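A sketch of detaching:

    x = torch.tensor(2.0, requires_grad=True)
    y = x * 2
    z = y.detach()            # z shares data with y but is cut from the graph
    print(z.requires_grad)    # False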