I’m still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained using code based on
Sebastian Raschka’s book
“Build a Large Language Model (from Scratch)”.
This is the third intervention I’m trying: adding bias to the attention weight matrices.
In the code from the book, we have this:
class MultiHeadAttention(nn.Module):
    def __init__(
        self,
        d_in, d_out,
        context_length,
        dropout,
        num_heads,
        qkv_bias=False
    ):
        ...
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        ...

    def forward(self, x):
        ...
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
So: we initialise the weight matrices $W_q$, $W_k$ and $W_v$ as linear layers rather than
simple matrices of weights, and have a parameter qkv_bias to say whether or not we should
add bias to those. In all of our trains so far we’ve set that to False.
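To make concrete what that flag changes: with bias disabled, each projection is a pure matrix multiply; with it enabled, each layer also learns an offset vector that gets added to every token’s query, key or value. Here’s a minimal sketch of my own (not code from the book), with dimensions chosen to match our config:

import torch
import torch.nn as nn

d_in = d_out = 768  # emb_dim for our GPT-2 small config

# bias=False: the output is just x @ W.T, a plain learned matrix multiply
proj = nn.Linear(d_in, d_out, bias=False)

# bias=True: the same matrix multiply plus a learned offset vector of d_out values
proj_b = nn.Linear(d_in, d_out, bias=True)

x = torch.randn(2, 4, d_in)  # (batch, seq_len, emb_dim)
assert torch.allclose(proj(x), x @ proj.weight.T, atol=1e-5)
assert torch.allclose(proj_b(x), x @ proj_b.weight.T + proj_b.bias, atol=1e-5)

# the only extra parameters are that offset vector
print(sum(p.numel() for p in proj.parameters()))    # 589,824
print(sum(p.numel() for p in proj_b.parameters()))  # 590,592, i.e. 768 more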
Why do we have this parameter, and where did it come from?
The background
In Raschka’s book, the use of nn.Linear for these weights is introduced in section 3.4.2
with the wording:
We can improve the
SelfAttention_v1 implementation further by utilizing PyTorch’s
nn.Linear layers, which effectively perform matrix multiplication when the
bias units are disabled. Additionally, a significant advantage of using nn.Linear
instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear
has an optimized weight initialization scheme, contributing to more stable and
effective model training.
So, it’s presented essentially as a way of getting better weights for our untrained
model, which makes good sense in and of itself — but, if that’s the only reason,
why don’t we just hard-wire it to have bias=False? That would be the sensible thing
to do if the initialisation were the only reason, but clearly there’s more to it
than that.
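As a rough illustration of that initialisation point (my own sketch, not from the book): torch.rand gives weights uniform in [0, 1), so a 768-wide projection blows up the scale of its outputs, while nn.Linear’s default fan-in-scaled Kaiming-uniform initialisation keeps them around order one.

import torch
import torch.nn as nn

torch.manual_seed(42)
d_in = d_out = 768

naive_W = torch.rand(d_in, d_out)            # SelfAttention_v1-style weights, uniform in [0, 1)
linear = nn.Linear(d_in, d_out, bias=False)  # Kaiming-uniform by default, scaled by fan-in

x = torch.randn(8, d_in)
print((x @ naive_W).std())   # roughly 16: activations grow with sqrt(d_in)
print(linear(x).std())       # roughly 0.6: stays at a sensible scale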
Section 4.1 has a bit more information:
qkv_bias determines whether to include a bias vector in the Linear layers
of the multi-head attention … We will initially disable this, following the norms
of modern LLMs, but we will revisit it in chapter 6 when we load pretrained
GPT-2 weights from OpenAI into our model.
That chapter reference looks like a typo, as the real explanation is in chapter 5, section 5
(page 164 in my copy), where we do indeed load the OpenAI weights:
OpenAI used bias vectors in the multi-head attention module’s linear layers to
implement the query, key and value matrix computations. Bias vectors are not
commonly used in LLMs anymore as they don’t improve the modeling performance
and are thus unnecessary.
So, that all makes sense so far. QKV bias was part of the original GPT-2 models,
perhaps just because it was standard at the time, inherited from something else,
or perhaps for some other reason — I can’t find any reference to it in
the actual paper.
But people have found it doesn’t help, so no-one uses it these days.
But… might an LLM of this specific size, or one otherwise similar to the GPT-2 small
model that we’re training, benefit in some way from having bias?
That’s what this experiment is for 🙂
Parameters
One thing that occurred to me while setting this up is that we have been training
on a Chinchilla-optimal number of tokens, 20x the number of parameters. Without
QKV bias, we have 163,009,536 parameters, so we’ve been training on 3,260,190,720 tokens,
rounded up to a whole number of batches, which comes to 3,260,252,160 in our current setup for
these experiments (per-GPU micro-batches of 12 across 8 GPUs, so a total batch size of 96).
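For reference, here’s the arithmetic behind those numbers; the per-step token count assumes each of the 96 sequences in a batch uses the full 1,024-token context:

params_without_bias = 163_009_536
chinchilla_tokens = 20 * params_without_bias          # 3,260,190,720

tokens_per_batch = 96 * 1024                          # 96 sequences of 1,024 tokens per optimizer step
batches = -(-chinchilla_tokens // tokens_per_batch)   # ceiling division: 33,165 batches
print(batches * tokens_per_batch)                     # 3,260,252,160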
These extra bias terms will be parameters, though! We’re essentially making our
model larger by adding them, which changes the Chinchilla calculation. How much?
In [1]: params = {
...: "vocab_size": 50257,
...: "context_length": 1024,
...: "emb_dim": 768,
...: "n_heads": 12,
...: "n_layers": 12,
...: "drop_rate": 0.1,
...: "qkv_bias": True
...: }
In [2]: from gpt import GPTModel
In [3]: model = GPTModel(params)
In [4]: sum(p.numel() for p in model.parameters())
Out[4]: 163037184
OK, that’s essentially nothing: 27,648 extra parameters in total on top of 163 million.
I make it less than two hundredths of a percentage
point larger! The correct number of tokens goes up to 3,260,743,680, so if we wanted
to be very pedantic, we’re under-training. But I feel like training on a larger dataset
is worse in terms of comparability between the baseline and our “intervened-on” model
with QKV bias.
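The breakdown, for completeness: each of the 12 transformer blocks gets three extra bias vectors of emb_dim values, one each for the query, key and value projections.

emb_dim, n_layers = 768, 12
extra_params = 3 * emb_dim * n_layers          # 27,648: three bias vectors per transformer block
print(extra_params / 163_009_536 * 100)        # ~0.017% of the biasless parameter count
print(20 * (163_009_536 + extra_params))       # 3,260,743,680 Chinchilla-optimal tokens with bias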
So: we’ll train a model with QKV bias on 3,260,252,160 tokens, accepting that it’s a tiny bit less than
Chinchilla-optimal. Let’s see how it goes!
The run
Here’s the model.json config file for this train.
Running it gives this training chart:
[Training chart: training loss over the course of the run]
Pretty standard, though the loss spikes look less prominent than they have been in
the other trains. Might QKV bias actually help with model stability in some way…?
The train finished with these stats:
Training complete in 12,329.557 seconds
Tokens seen: 3,260,252,160
Throughput: 264,426 tokens/second
Final train loss: 3.719
Timing-wise, pretty much indistinguishable from the baseline train’s 12,243.523 seconds. The final train
loss looks a tad better, but we can’t rely on that — the test set loss is the important
one.
So it was time to download it, upload it to Hugging Face Hub, and then on
to the evals.
Evals
Firstly, our normal “how should you continue ‘Every effort moves you’” test:
giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-qkv-bias/model.json runs/8xa100m40-qkv-bias/checkpoints/best/model.safetensors
Every effort moves you toward success. The right questions are asked to become your business coach and help shape the future of their
Not bad at all, borderline coherent! Next, the loss on the test set:
giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_loss.py datasets runs/8xa100m40-qkv-bias/model.json runs/8xa100m40-qkv-bias/checkpoints/best/model.safetensors
Loss against our test dataset: 3.669
Well, crap! Now that’s a surprise, given Raschka’s comments (which were undoubtedly
backed up by serious research). Let’s look at it in the context of the other interventions:
| Run | Test set loss | Improvement vs baseline |
|---|---|---|
| 8xa100m40-baseline | 3.692 | – |
| 8xa100m40-gradient-clipping | 3.678 | 0.014 |
| 8xa100m40-qkv-bias | 3.669 | 0.023 |
| 8xa100m40-remove-dropout | 3.641 | 0.051 |
So, adding QKV bias actually improved our test set loss by more than gradient clipping
did!
The loss spikes in the training chart look smaller than in the other trains, so, speculating
wildly, perhaps with a model of this size the bias stabilises things somehow? Or perhaps
what we’re seeing is the model becoming that tiny bit smarter because it has some extra
parameters, albeit less than 0.02 percent more?
I’m not going to spend time investigating things now, but this is a really interesting result.
One extra thing that does occur to me is that research since GPT-2 has moved decisively
towards larger models. The attention weight matrices are
sized $d_{\mathrm{in}} \times d_{\mathrm{out}}$, so excluding bias they have $d_{\mathrm{in}} \times d_{\mathrm{out}}$ weights
each (589,824 with our 768-dimensional embeddings). Bias adds on another $d_{\mathrm{out}}$.
So, as a model scales up, the attention-related non-bias weights will scale quadratically
(doubling those dimensions quadruples their number), while the bias weights will scale linearly.
So perhaps it’s just that the effect — whatever causes it — gets rapidly swamped
as you scale out of toy-model territory. That, at least, seems pretty plausible.
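A quick sanity check of that intuition, using a few illustrative widths (768 is our actual model; 1,600 and 12,288 are roughly GPT-2 XL and GPT-3-scale embedding sizes, included just for comparison):

for d_model in (768, 1600, 12288):               # GPT-2 small, GPT-2 XL, GPT-3-scale widths
    matrix_weights = d_model * d_model           # one d_model x d_model projection matrix
    bias_weights = d_model                       # the bias vector for that projection
    print(d_model, f"bias adds {100 * bias_weights / matrix_weights:.4f}% on top of the matrix")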
One final note to self, though: these improvements are small enough that I do find
myself wondering whether they might be some kind of noise, despite the random seeds
I’m setting:
seed = 42
random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
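(For reference, seeding alone doesn’t make a CUDA training run bit-for-bit reproducible; PyTorch has extra switches for that, sketched below, though I’m not using them in these runs and they can slow things down.)

import torch

# Not used in these runs -- just the extra knobs PyTorch offers for stricter reproducibility
torch.backends.cudnn.deterministic = True     # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False        # disable auto-tuning, which can select different kernels
torch.use_deterministic_algorithms(True)      # raise an error on known-nondeterministic ops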
I think that at the end of this, before I do a final train, it would be worth doing
another baseline train and measuring the test set loss again, and doing another comparison.
If it comes out exactly the same — and I can bump up the number of significant figures
in the output, it’s just a formatting parameter — then I don’t need to worry. But if
they vary to some degree, perhaps I’ll need to update my mental model of what level of
finding is significant, and what isn’t.
Summing up
I think it goes without saying that QKV bias definitely goes onto the list of interventions
we want to add when training our best-possible GPT-2 small-scale model, assuming that the
random seed test goes well. That surprises
me a bit; I was expecting it to have negligible impact! That, of course, is why it’s worth
doing these tests.
Next up, I think, is trying to understand how we can tweak the learning rate, and its associated
parameters like weight decay. This will need a bit of a deep dive, so you can expect the next
post late next week, or perhaps even later. I’m sure you can’t wait 😉