Small Dog, Small Language Model: Training a Transformer for $5
That paragraph above? Generated by a 91-million-parameter transformer I trained from scratch over a Saturday afternoon.
The dog Cookie is real. She's mine. She is currently asleep on the couch, blissfully unaware that there's now a small language model wandering the internet that can write fan-fiction about her.
Total compute cost: about $5 of rented H100 time on runpod.io across three training runs. The headline is generous; the actual final run cost $1.50. I'm calling it $5 because "$1.50" makes a less catchy post.
This is a log of training a language model end-to-end.
The Roadmap:
- What I built (and what it actually means at this scale)
- The journey from gibberish to coherence: three training runs, what worked, what didn't
- The bugs
1. What I built
A 91M-parameter GPT-style transformer, trained from scratch on the TinyStories corpus: about 470 million tokens of synthetic children's stories with a deliberately constrained vocabulary (~3,000 base words).
Architecture:
- 91 million parameters (including embeddings)
- 12 layers
- 12 attention heads with RoPE
- Custom 4,096-word BPE tokenizer (since the corpus only uses ~3,000 base words)
- 512 token context window
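As a sanity check on the headline number, the parameter count follows roughly from the dimensions above. This is a back-of-the-envelope that ignores biases and layer norms, and it assumes an untied LM head; the exact split depends on the implementation:

```python
# Rough parameter count for a GPT block: attention (4 * d^2) + MLP (8 * d^2).
# RoPE means no learned position embeddings; input embedding and LM head
# are counted separately here (i.e. assumed untied).
n_layers, d_model, vocab = 12, 768, 4096

embeddings = vocab * d_model                    # token embedding table
lm_head = vocab * d_model                       # output projection
per_layer = 4 * d_model**2 + 8 * d_model**2     # attention + 4x-wide MLP
total = embeddings + lm_head + n_layers * per_layer

print(f"~{total / 1e6:.0f}M")  # ~91M, matching the headline count
```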
This is built on the Hugging Face stack (transformers, tokenizers, datasets, safetensors). It is the same ecosystem teams use in practice for real-world training pipelines, just at the smallest possible scale.
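For the custom tokenizer, training a small BPE vocabulary with the tokenizers library looks roughly like this (a sketch; the special tokens and whitespace pre-tokenizer are illustrative, not necessarily my exact config):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Illustrative corpus; the real run iterates over TinyStories text.
corpus = ["Once upon a time there was a little dog.",
          "The dog liked to play in the park."] * 100

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=4096, special_tokens=["[UNK]", "[EOS]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

ids = tokenizer.encode("Once upon a time").ids
```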
Perspective on Scale
| Model | Parameters | Training Tokens | Estimated Cost |
|---|---|---|---|
| GPT-2 base (2019) | 124M | ~a few billion | undisclosed |
| Nanochat | ~560M | ~11.2B | ~$100 |
| This model | 91M | ~980M | $1.50 |
| GPT-4-class (rumored) | ~1.7T (est.) | ~13T (est.) | $100M+ (est.) |
2. The journey: from gibberish to coherence
Round 1: Smoke-test on my laptop
- Data: WikiText-2, a 2M-token Wikipedia corpus.
- Model: 28M-param transformer (8 layers × 512 hidden dim, 8 heads).
- Hardware: Apple Silicon MPS, ~3 minutes wall-clock.
- Result: Final perplexity 194. Total gibberish.
To be fair, this wasn't meant to be a real training run; it was a validation exercise to confirm the pipeline ran end-to-end without crashing and that the loss was steadily decreasing.
Evaluating that output, it struck me that Wikipedia is the wrong corpus for a 28M model:
- WikiText-2 is tiny (~2M tokens). The Chinchilla rule of thumb calls for ~20 tokens per parameter, i.e. ~560M tokens for a 28M-param model. The corpus clearly doesn't have enough data to saturate the model.
- WikiText-2 data is complex. It has dates, technical jargon, and a wide range of low-frequency vocabulary cutting across several domains. Even with 560M tokens, a 28M model would lack the capacity to model that distribution.
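The mismatch is stark when you put numbers on it (a quick back-of-the-envelope):

```python
# Chinchilla rule of thumb: ~20 training tokens per model parameter.
params = 28e6                      # round 1 model size
optimal_tokens = 20 * params       # -> 560M tokens
wikitext2_tokens = 2e6

print(f"want ~{optimal_tokens / 1e6:.0f}M tokens, have ~{wikitext2_tokens / 1e6:.0f}M")
print(f"coverage: {wikitext2_tokens / optimal_tokens:.2%}")  # well under 1%
```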
Time to move on to a different corpus. A bit of research revealed that TinyStories was a great fit: ~470M tokens of synthetic, simple-vocabulary (3000 words) children's stories that a tiny model can actually saturate.
Round 2: The first real attempt
Switched the corpus to TinyStories (~470M tokens), kept the model at 28M params and trained for 15,000 steps with bf16 autocast and torch.compile.
- Hardware: RunPod H100 SXM 80GB.
- Wall-clock: ~6 minutes.
- Cost: ~$0.30.
- Result: Perplexity 4.0.
Step :     0/15000 | val: 8.40 | ppl: 4054
Step :  1000/15000 | val: 2.16 | ppl: 8.6
Step :  5000/15000 | val: 1.61 | ppl: 5.0
Step : 14999/15000 | val: 1.40 | ppl: 4.0

From literally random to decent prediction in 6 minutes of GPU time.
But when I sampled:
It spoke "English words" interleaved with baby talk (let's call it that: "Igaintto-doo", "Annabutterram", "girlsmagulled"). It also suffered from mode collapse: every other prompt got stuck in loops like "Once upon a lotion. Once upon a lotion."
It took a few minutes to figure out that there were two separate problems:
The model was undertrained. In the last third of the loss curve, the validation loss was still dropping when I stopped training; the number of steps was clearly inadequate.
The sampling was uncontrolled. The "Once upon a lotion" loops weren't a model bug; they were a decoder bug. With default settings, model.generate() will happily emit the same high-probability sequence over and over unless two parameters are tuned:
repetition_penalty: lowers the probability of tokens that have already appeared, and
top_p (aka nucleus sampling): keeps generation diverse without wandering into low-probability nonsense.
Both were fixed in Round 3.
Round 3: 50K steps (~$1.00)
Bumped NUM_STEPS = 15_000 to NUM_STEPS = 50_000. Also fixed the sampling: added repetition_penalty=1.3 and top_p=0.95 to work around the mode collapse.
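The decoding fix in context (a sketch; the tiny randomly-initialised stand-in model is only there so the snippet is self-contained; in the real run the trained checkpoint is loaded with from_pretrained):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random stand-in so the snippet runs anywhere; the real run loads
# the trained 28M checkpoint via from_pretrained(...).
config = GPT2Config(vocab_size=4096, n_positions=512,
                    n_embd=64, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config).eval()

input_ids = torch.tensor([[1, 2, 3, 4]])   # stand-in for a tokenized prompt
out = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.95,               # nucleus sampling: sample from the top-95% mass
    repetition_penalty=1.3,   # down-weight tokens that already appeared
    max_new_tokens=20,
    pad_token_id=0,
)
```

generate() returns the prompt followed by the newly sampled tokens, so decoding `out[0]` gives the full continuation.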
- Hardware: RunPod H100 SXM 80GB.
- Wall-clock: ~20 minutes.
- Cost: ~$1.00.
- Result: Perplexity 3.62.
The "lotion" loops were gone. Real names appeared. But the output was still largely incoherent: individual sentences were alright, but paragraphs were unrelated and lacked continuity.
A look at the training and validation loss curves told me this was the best I could do with a 28M model: both curves had flattened. I learnt the term for this is a "capacity ceiling": the model stops improving because it has run out of parameters to store patterns in, not because it has run out of data to learn from.
The solution was not to add more steps. More data alone wouldn't help either; the model needed more capacity to learn from larger amounts of data.
Round 4: 91M params (the final run)
Bumped the architecture: 12 layers (vs 8), 768 hidden dim (vs 512), 12 attention heads (vs 8). Total parameters: 28M → 91M. The idea behind keeping the same corpus but using fewer training steps (30K vs 50K) was that each step at 91M params costs ~3× the compute of a step at 28M, so 30K steps here is comparable total compute to 50K at the smaller size.
- Hardware: RunPod H100 SXM 80GB.
- Trained tokens: 30K steps × batch size 64 × 512-token context ≈ 983M
- Wall-clock: ~20 minutes.
- Cost: ~$1.50.
- Result: Perplexity ~2.7.
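Both the token count and the compute-parity reasoning check out with the common FLOPs ≈ 6 · params · tokens approximation (a quick sanity check; the ratio lands within a factor of two):

```python
# Token budget for the final run, plus a compute-parity sanity check
# using the rule-of-thumb training FLOPs ~ 6 * params * tokens.
steps, batch, ctx = 30_000, 64, 512
tokens = steps * batch * ctx
print(f"trained tokens: {tokens / 1e6:.0f}M")  # 983M

flops = lambda n_params, n_tokens: 6 * n_params * n_tokens
round3 = flops(28e6, 50_000 * batch * ctx)   # 28M model, 50K steps
round4 = flops(91e6, tokens)                 # 91M model, 30K steps
print(f"round 4 / round 3 compute: {round4 / round3:.1f}x")  # ~2x
```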
Output became a coherent narrative:
This model produced stories. Characters stayed continuous, and sentences related to each other in a coherent fashion. It even ends a story with "The end." and starts a new one.
The pronouns still drift (Cookie is sometimes "she" and sometimes "he"), but I believe this would be fixed by scale.
This run was a genuinely qualitative jump. Same architecture family, same corpus, just more capacity and more data.
3. The bugs
- The RoPE buffer NaN bug: I was excited to run inference only to realise that it crashed, every time.
File ".../torch/utils/_contextlib.py", line 124, in decorate_context
return func(*args, **kwargs)
File ".../transformers/generation/utils.py", line 2560, in generate
result = decoding_method(
File ".../transformers/generation/utils.py", line 2808, in _sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

A bit of research showed that the precomputed RoPE cos/sin buffers had NaN entries, and a NaN there propagates through every attention layer that touches that position. The root cause: I had registered the cos and sin buffers with persistent=False, which was the recommended approach for storing recomputable tables.
self.register_buffer('cos', None, persistent=False)
self.register_buffer('sin', None, persistent=False)

I learnt that Hugging Face's from_pretrained uses init_empty_weights() during loading, which allocates memory without initializing it. Since RoPE's cos/sin tables were registered as non-persistent buffers, HF didn't save them in the checkpoint (only weights and persistent buffers are stored). Since the cos/sin values are deterministic, the fix was to manually call model._init_rope(seq_len, head_dim) right after from_pretrained.
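The fix works because the tables are a pure function of the shapes. A standalone sketch of the kind of thing _init_rope recomputes (this helper is illustrative, not my exact code):

```python
import torch

def build_rope_cache(seq_len, head_dim, base=10000.0):
    # RoPE tables depend only on (seq_len, head_dim, base), so they can
    # always be recomputed after loading instead of being checkpointed.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    emb = torch.cat((angles, angles), dim=-1)   # (seq_len, head_dim)
    return emb.cos(), emb.sin()

# After from_pretrained, overwrite the uninitialized buffers with fresh
# tables; e.g. for a 512-token context and 64-dim heads:
cos, sin = build_rope_cache(512, 64)
```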
- MPS scaled_dot_product_attention: PyTorch's F.scaled_dot_product_attention(..., is_causal=True) on Apple's MPS produces NaN under certain conditions. This turned out to be a known PyTorch issue. The fix was to wrap inference in with sdpa_kernel(SDPBackend.MATH): to force the unoptimized math kernel. I didn't dig particularly deep into this patch.
I am truly amazed at how accessible training an LLM has become. Not so long ago, a project like this would have meant a research budget.
The full code is available at github.com/arunma/learn-you-an-hf-llm. Cookie has had her dinner and remains unimpressed.