This is the second in my series of attempts to improve the loss on my test dataset
— interventions, as I’m calling them —
for a from-scratch GPT-2 small base model, trained using code based on
Sebastian Raschka’s book
“Build a Large Language Model (from Scratch)”.
Last time around I saw what gradient clipping can do —
it improved loss over the baseline
by 0.014, bringing it down from 3.692 to 3.678. Not much, but it’s something!
This time, I wanted to see what happened if we trained without dropout. Would removing it make
the test loss worse, or better?
Background
In a blog post last summer about
architectural advances in LLMs since GPT-2,
Sebastian Raschka wrote:
Dropout (2012) is a traditional technique to prevent overfitting by randomly
“dropping out” (i.e., setting to zero) a fraction of the layer activations or
attention scores (Figure 3) during training. However, dropout is rarely used
in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). I assume that dropout was originally used in GPT-2 because it was inherited
from the original transformer architecture. Researchers likely noticed that
it does not really improve LLM performance (I observed the same in my
small-scale GPT-2 replication runs). This is likely because LLMs are typically
trained for only a single epoch over massive datasets, which is in contrast to
the multi-hundred-epoch training regimes for which dropout was first
introduced. So, since LLMs see each token only once during training, there is
little risk of overfitting.
That makes quite a lot of sense. My own understanding of dropout was that it was
a bit broader than just preventing overfitting — it seemed to me to be similar
to the
mandatory vacation policies that financial firms use to prevent over-dependence on individuals.
My instinct was that having knowledge distributed across different weights in the
model was good in and of itself, even beyond its benefit in multi-epoch training.
But it is quite a high price to pay.
With the training parameters we’ve been using, we’re literally discarding 10% of our calculations’ results —
attention weights, feed-forward neuron activations, and so on — as we do the forward pass.
It’s easy to see why it would harm training.
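To make that concrete, here’s a quick PyTorch sketch (not taken from my training code, just an illustration) of what a 10% drop rate does to a tensor during training: roughly one element in ten is zeroed, and the survivors are scaled up by 1/(1 − 0.1) so that the expected value stays the same. Modules like this sit on the attention weights and feed-forward outputs inside each transformer block.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

dropout = nn.Dropout(p=0.1)  # the 10% drop rate we've been training with
x = torch.ones(4, 8)         # stand-in for a batch of activations or attention weights

dropout.train()              # training mode: dropout is active
print(dropout(x))            # ~10% of elements zeroed; survivors scaled to 1 / (1 - 0.1) ≈ 1.111

dropout.eval()               # eval/inference mode: dropout is a no-op
print(dropout(x))            # all ones, unchanged
```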
Let’s give it a go.
The training run
The nice thing about this one is that, unlike the gradient clipping experiment,
I didn’t have to write any new code. The dropout level was already controlled by
a setting in the model.json file,
so by setting that to zero for this run, I could just kick it off and let it
do its thing while I worked on something else.
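For reference, the relevant part of the config looks something like this; it’s just an illustrative sketch, with the key names and the non-dropout values taken from the book’s GPT_CONFIG_124M rather than from my actual model.json, so they may not match exactly.

```python
# Illustrative sketch only: key names and non-dropout values follow the book's
# GPT_CONFIG_124M and may not match the real model.json exactly.
config = {
    "vocab_size": 50257,     # BPE vocabulary size
    "context_length": 1024,  # maximum sequence length
    "emb_dim": 768,          # embedding / hidden dimension
    "n_heads": 12,           # attention heads per layer
    "n_layers": 12,          # transformer blocks
    "drop_rate": 0.0,        # was 0.1 for the baseline and gradient-clipping runs
    "qkv_bias": False,       # bias on the Q/K/V projections
}
```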
Here’s what the training run chart looked like (please disregard the stuff about
grad norms in the title and the axis — I’ll remove that for the next train):

[Training run chart: training loss plotted against global step, with grad-norm annotations]
As you can see, we still have loss spikes, including one just after global step 20,000
that lasts for several checkpoint periods of 617 steps. I imagine gradient clipping
might have helped with that, but I’m very deliberately testing each intervention in
isolation.
At the end of the training run, we got this:
Training complete in 11,376.067 seconds
Tokens seen: 3,260,252,160
Throughput: 286,589 tokens/second
Final train loss: 3.621
So, interestingly, it took 967 seconds — about 16 minutes — less time than the
gradient clipping run, and about 15 minutes less than the baseline train. So
while gradient clipping added a small amount of time (or maybe that was just noise),
dropping dropout certainly seems to speed things up! I guess there’s quite a lot of
work involved in generating and applying the random masks that drop things out as we’re
doing the forward pass.
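I haven’t profiled this properly, but a quick-and-dirty timing of dropout on its own (a rough sketch with made-up tensor sizes, not a rigorous benchmark) gives a feel for the cost of generating and applying those masks:

```python
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(16, 1024, 768, device=device)  # a GPT-2-small-ish batch of activations
dropout = nn.Dropout(0.1).train()              # training mode, so the masks are generated

def time_forward(fn, iters=100):
    """Average wall-clock time per call, synchronising around the loop on GPU."""
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("with dropout:   ", time_forward(dropout))
print("without dropout:", time_forward(lambda t: t))  # identity op, for comparison
```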
Anyway, with the model trained, it was time to download it,
upload it to Hugging Face Hub, and run the evals.
Evals
Firstly, the smoke test, where the model just needs to continue the sequence Every effort moves you. It came up with something reasonably coherent:
giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-remove-dropout/model.json runs/8xa100m40-remove-dropout/checkpoints/best/model.safetensors
Every effort moves you to make the world a better place.
As an international student of the arts in the UK,
…but it was on the test of the loss against the test set that it was most impressive:
giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_loss.py datasets/ runs/8xa100m40-remove-dropout/model.json runs/8xa100m40-remove-dropout/checkpoints/best/model.safetensors
Fetching 4 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3200/3200 [04:54
Loss against our test dataset: 3.641
That’s a bigger improvement on the baseline train’s 3.692 than gradient clipping gave us:
0.051, more than three times as much!
Let’s start keeping a table of these:
| Run | Test set loss | Improvement vs baseline |
|---|---|---|
| 8xa100m40-baseline | 3.692 | – |
| 8xa100m40-gradient-clipping | 3.678 | 0.014 |
| 8xa100m40-remove-dropout | 3.641 | 0.051 |
Now, of course, we don’t know how these different interventions combine together —
it would be naive to think that if we did both gradient clipping and dropout
removal, we’d get a total loss reduction of 0.014 + 0.051 — but, especially given that
long-lived loss spike in our training run, it does feel like they might play well
together.
Wrapping up
So, that’s dropout covered. Which one next? I think a nice easy one that I should
be able to get done on a Friday will be adding bias to the attention weight calculations.
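For context, in the book’s MultiHeadAttention that’s controlled by the qkv_bias flag, which just toggles the bias term on the query/key/value projection layers. A sketch (not my actual code) of the change:

```python
import torch.nn as nn

# Sketch only: following the book's MultiHeadAttention, the qkv_bias flag
# switches the bias term on or off for the query/key/value projections.
d_in, d_out = 768, 768
qkv_bias = True  # False in the runs so far; True for the next experiment

W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
```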
Let’s give that a go and see if it makes things worse or better!
Stay tuned…
Here’s a link to the next post in this series.