# PyTorch Fundamentals - Week 4
Brushing up on my PyTorch skills every week. Starting from scratch. Not in a hurry. The goal is to follow along with TorchLeet and work up to karpathy/nanoGPT or karpathy/nanochat.

Now, a summary of week 4.
- Custom loss function: Huber loss
  - A `torch.nn.Module` subclass with a `forward()` method to compute the loss.
  - Huber loss is defined as (a quick numeric check follows this list):

    \[L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta, \\ \delta \cdot (|y - \hat{y}| - \frac{1}{2} \delta) & \text{for } |y - \hat{y}| > \delta, \end{cases}\]

    where:
    - \(y\) is the true value,
    - \(\hat{y}\) is the predicted value,
    - \(\delta\) is a threshold parameter that controls the transition between L1 and L2 loss.
  - More details about the Huber loss
  - Some custom losses in Keras and PyTorch: Loss Function Library - Keras & PyTorch
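As a quick numeric check of the piecewise formula with \(\delta = 1\): an error of 0.5 falls in the L2 branch (\(0.5 \cdot 0.5^2 = 0.125\)), while an error of 3 falls in the L1 branch (\(1 \cdot (3 - 0.5) = 2.5\)). Recent PyTorch versions also ship `torch.nn.functional.huber_loss`, which should agree:

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([0.0, 0.0])
y_pred = torch.tensor([0.5, 3.0])  # one small error, one large error

# By hand: |0.5| <= 1 -> 0.5 * 0.5**2 = 0.125  (L2 branch)
#          |3.0| >  1 -> 1 * (3.0 - 0.5) = 2.5 (L1 branch)
manual = torch.tensor([0.125, 2.5]).mean()

builtin = F.huber_loss(y_pred, y_true, delta=1.0)
print(manual, builtin)  # both 1.3125
```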
- Used the linear regression model to test the custom loss.
- Error 1: `RuntimeError: grad can be implicitly created only for scalar outputs`
  - Reason: the `forward()` function was returning a tensor with length > 1. Got the hint from the PyTorch forums.
  - Fix: returned `loss.mean()` instead of `loss` (see the snippet below).
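A minimal reproduction of the error, with hypothetical tensors of my own:

```python
import torch

y_pred = torch.randn(4, requires_grad=True)
y_true = torch.randn(4)

loss = torch.abs(y_true - y_pred)   # shape (4,), not a scalar
# loss.backward()                   # RuntimeError: grad can be implicitly
                                    # created only for scalar outputs
loss.mean().backward()              # reducing to a scalar fixes it
```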
- Error 2: all the losses were `nan`. This was a genuine bug in my code.
  - Implementation approach 1: use masks (my approach)

    ```python
    def forward(self, y_pred, y_true):
        error = torch.abs(y_true - y_pred)
        flag1 = error <= self.d   # quadratic (L2) region
        flag2 = ~flag1            # complement: linear (L1) region
        l2_loss = 0.5 * error**2 * flag1
        l1_loss = self.d * (error - 0.5 * self.d) * flag2
        loss = l2_loss + l1_loss
        return loss.mean()
    ```

  - Implementation approach 2: use `torch.where()` (solution provided in TorchLeet)

    ```python
    def forward(self, y_pred, y_true):
        error = torch.abs(y_true - y_pred)
        condition = error <= self.d
        loss = torch.where(condition,
                           0.5 * error**2,
                           self.d * (error - 0.5 * self.d))
        return loss.mean()
    ```
  - Turns out, `torch.where()` is the most optimised way of doing this: it is vectorised and GPU-friendly, and it is a cleaner implementation of the same logic. Masking requires extra memory and extra operations (two multiplications and one addition).
- Used TensorBoard to visualise the training results (see the sketch below).
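Putting the pieces together, here is a minimal sketch of the full module and the linear-regression test with TensorBoard logging. The class name, the synthetic data, and the hyperparameters are my own assumptions, not TorchLeet's:

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

class HuberLoss(nn.Module):
    def __init__(self, delta=1.0):
        super().__init__()
        self.d = delta

    def forward(self, y_pred, y_true):
        error = torch.abs(y_true - y_pred)
        return torch.where(error <= self.d,
                           0.5 * error**2,
                           self.d * (error - 0.5 * self.d)).mean()

# Synthetic data for y = 2x + 1 with a little noise
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 2 * x + 1 + 0.1 * torch.randn_like(x)

model = nn.Linear(1, 1)
criterion = HuberLoss(delta=1.0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
writer = SummaryWriter()  # logs to ./runs by default

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    writer.add_scalar("loss/train", loss.item(), epoch)

writer.close()
```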
- Read up more on `optimizer.zero_grad()`.
  - PyTorch accumulates gradients by default: `loss.backward()` adds to the previous gradients (which can be accessed via `weight.grad`).
  - If we don't reset the gradients using `zero_grad()`, the new gradient will be a combination of the old and the newly-computed gradient. Since the old gradient was already used to update the model in the last iteration, the combined gradient will point in a different direction than the minimum (or maximum). [ref] A toy demonstration follows below.
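A toy demonstration of the accumulation behaviour (my own example, not from the linked reference):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

loss = (w * 2).sum()
loss.backward()
print(w.grad)   # tensor([2.])

# backward() again without zeroing: gradients add up
loss = (w * 2).sum()
loss.backward()
print(w.grad)   # tensor([4.]), i.e. 2 + 2

w.grad.zero_()  # what optimizer.zero_grad() does for each parameter
print(w.grad)   # tensor([0.])
```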
- Q: When should we skip `zero_grad()`? A: When we want gradient accumulation on purpose.
- Q: When do we want gradient accumulation on purpose? A: In the following scenarios:
  - Large batch size with limited GPU memory: split the batch into mini-batches, accumulate gradients over all the mini-batches, and then run `optimizer.step()`. Used when training on smaller GPUs (see the sketch after this list).
  - Multiple loss components before a single update: useful for multi-task learning and for losses that require multiple passes.
  - Parallel training: when a model is split across devices, accumulate the gradients across the micro-batches and then update the parameters once.
  - Training with noisy gradients: accumulate over multiple steps to smooth the gradients before updating.
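A sketch of the first scenario: accumulating gradients over micro-batches to simulate a larger batch. The `accum_steps` logic is a common pattern, and the model and data here are placeholders of my own:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4  # effective batch size = 4 x micro-batch size

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 10)  # micro-batch that fits in memory
    y = torch.randn(8, 1)

    # Scale the loss so the accumulated gradient matches the
    # mean over the full effective batch
    loss = criterion(model(x), y) / accum_steps
    loss.backward()          # gradients accumulate in .grad

    if (step + 1) % accum_steps == 0:
        optimizer.step()     # one update per accum_steps micro-batches
        optimizer.zero_grad()
```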