Building a Small AI Model From Scratch: A Senior Engineer's Guide

PookieTech Team

You've used pre-trained models, fine-tuned them, and deployed them. Now, you want to understand the engine under the hood. Building a small AI model from scratch isn't just an academic exercise; it's a critical path to truly understanding the architectural choices, data dependencies, and training dynamics that make large language models tick. This guide walks through the essential components and processes, focusing on a decoder-only Transformer, using modern tools and PyTorch.

Data: The Foundation of Intelligence

The quality and relevance of your data directly dictate your model's capabilities. For a small model, curate a clean, focused dataset. Forget terabytes; think megabytes or a few gigabytes of high-quality, task-specific text. For instance, if you're building a code completion model, use a dataset of Python scripts. If it's a creative writing assistant, use fiction excerpts.

Start with raw text, then clean it. Remove boilerplate, HTML tags, excessive whitespace, and duplicate lines. Normalize Unicode characters. For a small model, consider a domain-specific dataset rather than a generic web crawl. My recommendation: for a first build, pick a readily available, clean dataset like a subset of Project Gutenberg or a specific GitHub repository's code.

import os
import re
from datasets import load_dataset # pip install datasets

def clean_text(text):
    text = re.sub(r'\s+', ' ', text).strip() # Normalize whitespace
    text = re.sub(r'[^\x00-\x7F]+', '', text) # Remove non-ASCII characters
    return text

# Example: Loading a small subset of WikiText-2
# For a real project, you'd download and process a custom corpus.
print("Loading dataset...")
try:
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Take a small sample for demonstration purposes
    sample_size = 10000 # Roughly 10,000 lines
    raw_text_data = [item['text'] for item in dataset.select(range(sample_size)) if item['text'].strip()]
    
    print(f"Initial raw text lines: {len(raw_text_data)}")
    
    # Clean and concatenate
    cleaned_corpus = "\n".join([clean_text(text) for text in raw_text_data])
    
    # Save to a file for tokenizer training
    output_file = "corpus.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(cleaned_corpus)
    
    print(f"Cleaned corpus saved to {output_file} (approx {len(cleaned_corpus) / 1024:.2f} KB)")

except Exception as e:
    print(f"Error loading dataset: {e}. Please ensure 'datasets' library is installed and you have an internet connection.")
    print("Falling back to a dummy corpus for demonstration.")
    cleaned_corpus = "This is a sample sentence for demonstrating the tokenizer. It contains various words and punctuation. We will use this text to train our subword tokenizer from scratch. The quick brown fox jumps over the lazy dog." * 100
    with open("corpus.txt", "w", encoding="utf-8") as f:
        f.write(cleaned_corpus)
    print("Dummy corpus saved to corpus.txt")

DATA_FILE = "corpus.txt"

Tokenization: Bridging Text and Tensors

Machines don't understand text; they understand numbers. Tokenization is the process of converting raw text into numerical representations (tokens) that a model can process. For generative models, subword tokenization is standard, balancing vocabulary size with the ability to represent unseen words.

Byte Pair Encoding (BPE)

BPE is a compression algorithm adapted for text. It iteratively merges the most frequent pairs of characters or character sequences into new, single tokens. This creates a vocabulary of common words, subwords, and characters. It's efficient and handles out-of-vocabulary words by breaking them down into smaller, known units.

The core idea: Start with individual characters. Find the most frequent adjacent pair of tokens and replace all occurrences of that pair with a new, merged token. Repeat until a desired vocabulary size is reached or no more merges are possible.

# Conceptual BPE (simplified, not production-ready)
from collections import defaultdict

def get_stats(vocab):
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split(' ')
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def train_bpe(corpus, num_merges):
    # Build the initial vocabulary: each word becomes a sequence of
    # space-separated characters, mapped to its frequency in the corpus.
    words = defaultdict(int)
    for word in corpus.split():
        words[' '.join(list(word)) + ' </w>'] += 1 # Add end-of-word marker

    vocab = words.copy()
    merges = {}

    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
        merges[best_pair] = ''.join(best_pair)
        # print(f"Merge {i+1}: {best_pair} -> {''.join(best_pair)}")
    
    # This simplified version doesn't extract the final tokens,
    # but demonstrates the merge process.
    # A real BPE implementation would build a mapping from subword to ID.
    print(f"Trained {len(merges)} BPE merges.")
    return merges

# We won't run this full BPE training here, it's illustrative.
# For production, use SentencePiece or Hugging Face tokenizers.
# bpe_merges = train_bpe(cleaned_corpus, num_merges=100)

SentencePiece for Production

SentencePiece, developed by Google, is a language-agnostic subword tokenizer. It treats the input as a raw stream of Unicode characters, including whitespace, which simplifies pre-processing and avoids issues with different language tokenization rules. It can train BPE or Unigram models.

import sentencepiece as spm # pip install sentencepiece

# Define SentencePiece model parameters
SPM_MODEL_PREFIX = "my_spm_model"
VOCAB_SIZE = 8000 # A reasonable size for a small model
CHARACTER_COVERAGE = 0.9995 # Cover almost all characters in the corpus

print(f"Training SentencePiece model with vocab size {VOCAB_SIZE}...")
try:
    spm.SentencePieceTrainer.train(
        input=DATA_FILE,
        model_prefix=SPM_MODEL_PREFIX,
        vocab_size=VOCAB_SIZE,
        character_coverage=CHARACTER_COVERAGE,
        model_type="bpe", # Can also be "unigram"
        num_threads=os.cpu_count(),
        # Additional options for better performance/control
        bos_id=-1, # No beginning-of-sentence token
        eos_id=1,  # End-of-sentence token (often used as padding/mask)
        pad_id=0,  # Padding token
        unk_id=2,  # Unknown token
        # Allow sentencepiece to learn a special token for newline if present
        # user_defined_symbols=['\n'] 
    )
    print(f"SentencePiece model trained and saved as {SPM_MODEL_PREFIX}.model and {SPM_MODEL_PREFIX}.vocab")

    # Load the trained tokenizer
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load(f"{SPM_MODEL_PREFIX}.model")

    # Test the tokenizer
    sample_text = "This is an example sentence for our new tokenizer. How does it handle punctuation and unknown words?"
    encoded_ids = tokenizer.encode_as_ids(sample_text)
    decoded_text = tokenizer.decode_ids(encoded_ids)

    print(f"\nOriginal: '{sample_text}'")
    print(f"Encoded IDs: {encoded_ids}")
    print(f"Decoded: '{decoded_text}'")
    print(f"Vocabulary size: {tokenizer.get_piece_size()}")

except Exception as e:
    print(f"Error training SentencePiece: {e}")
    print("Please ensure you have a valid 'corpus.txt' file.")
    # Fallback for demonstration if SentencePiece fails
    class DummyTokenizer:
        def __init__(self, vocab_size=8000):
            self.vocab_size = vocab_size
            self.word_to_id = {
                'this': 3, 'is': 4, 'an': 5, 'example': 6, 'sentence': 7,
                'for': 8, 'our': 9, 'new': 10, 'tokenizer': 11, '.': 12,
                'how': 13, 'does': 14, 'it': 15, 'handle': 16, 'punctuation': 17,
                'and': 18, 'unknown': 19, 'words': 20, '?': 21,
                '<pad>': 0, '<eos>': 1, '<unk>': 2 # Using 2 for unk as a common fallback
            }
            self.id_to_word = {v: k for k, v in self.word_to_id.items()}
            self.max_id = max(self.id_to_word.keys())
            # Add some more dummy tokens up to vocab_size
            for i in range(self.max_id + 1, vocab_size):
                self.id_to_word[i] = f"token_{i}"
                self.word_to_id[f"token_{i}"] = i

        def encode_as_ids(self, text):
            # Simple whitespace tokenization for dummy
            tokens = text.lower().replace('.', ' . ').replace('?', ' ? ').split()
            return [self.word_to_id.get(token, self.word_to_id['<unk>']) for token in tokens]

        def decode_ids(self, ids):
            return " ".join([self.id_to_word.get(id, '') for id in ids])
        
        def get_piece_size(self):
            return self.vocab_size

    tokenizer = DummyTokenizer(vocab_size=VOCAB_SIZE)
    print("Using dummy tokenizer for demonstration.")

VOCAB_SIZE = tokenizer.get_piece_size() # Update VOCAB_SIZE based on actual tokenizer
PAD_TOKEN_ID = tokenizer.pad_id() if hasattr(tokenizer, 'pad_id') else 0
EOS_TOKEN_ID = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 1

Tiktoken-Style Efficiency

Tiktoken, from OpenAI, is a highly optimized BPE implementation. It's not a training library but a fast inference engine for specific BPE models. Its key characteristic is speed, achieved through Rust implementations and efficient data structures. While you won't train a Tiktoken model from scratch directly, understanding its approach means prioritizing fast encoding/decoding and efficient vocabulary management. For your custom model, SentencePiece with BPE is a robust choice that provides both training and inference.
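
To get a feel for that style of tokenizer, the snippet below encodes and decodes text with one of OpenAI's published vocabularies via the tiktoken package. This is only an illustration of the inference-side API; the cl100k_base encoding name is an example choice and has nothing to do with the custom SentencePiece model trained above.

import tiktoken # pip install tiktoken

# Load one of OpenAI's pre-trained BPE vocabularies (inference only, no training).
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Tokenizers turn text into integers.")
print(ids)             # a short list of integer token IDs
print(enc.decode(ids)) # round-trips back to the original string
print(enc.n_vocab)     # size of this encoding's vocabulary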

Feature         | BPE (Conceptual)                                | SentencePiece (BPE)                                                       | Tiktoken-style
Training        | Requires a custom implementation or a library.  | Built-in trainer, language-agnostic.                                      | Pre-trained, not for custom training.
Input Handling  | Typically requires pre-tokenization into words. | Raw text stream (including whitespace); treats everything as characters.  | Optimized for the specific pre-tokenization rules used by OpenAI.
Speed           | Depends on implementation; Python can be slow.  | Fast C++ backend, good for production.                                    | Extremely fast (Rust), highly optimized for inference.
Vocabulary      | Subword units; handles OOV by breaking down.    | Subword units; handles OOV by breaking down.                              | Subword units, specific to OpenAI models.
Use Case        | Understanding the core algorithm.               | Custom model training, production deployment, multilingual.               | Using OpenAI's models, fast inference with their tokenizers.

The Decoder-Only Transformer Architecture

For generative text tasks (like next-word prediction), the decoder-only Transformer is the standard. At each position it attends only to past tokens and is trained to predict the next one. This architecture, popularized by models like GPT, is simpler than encoder-decoder models and highly effective for auto-regressive generation. We'll build ours using PyTorch (version 2.x recommended).

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Hyperparameters for our small model
# These are illustrative, adjust based on your dataset and compute.
N_EMBD = 256 # Embedding dimension
N_HEADS = 4  # Number of attention heads
N_LAYER = 4  # Number of Transformer blocks
BLOCK_SIZE = 128 # Maximum sequence length for context
DROPOUT = 0.1 # Dropout rate

Embedding Layer: Initial Representation

Each token ID needs to be converted into a dense vector representation. This is the token embedding. Additionally, since Transformers process sequences in parallel without inherent order, we need positional encoding to inject positional information.

class TokenAndPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, dropout):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.dropout = nn.Dropout(dropout)
        self.block_size = block_size

    def forward(self, idx):
        # idx is (B, T) tensor of integers
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"Input sequence length {T} exceeds block_size {self.block_size}")

        tok_emb = self.token_embedding_table(idx) # (B, T, N_EMBD)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, N_EMBD)
        x = tok_emb + pos_emb # (B, T, N_EMBD)
        return self.dropout(x)
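
A quick, illustrative sanity check (not part of the original pipeline): feed random token IDs through the module and confirm the output shape is (batch, time, embedding dimension).

# Illustrative shape check: (B, T) integer IDs -> (B, T, N_EMBD) float embeddings.
embedding = TokenAndPositionalEmbedding(VOCAB_SIZE, N_EMBD, BLOCK_SIZE, DROPOUT)
dummy_ids = torch.randint(0, VOCAB_SIZE, (2, 32)) # batch of 2 sequences, 32 tokens each
print(embedding(dummy_ids).shape) # torch.Size([2, 32, 256])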

Multi-Head Self-Attention: Contextual Understanding

Self-attention allows the model to weigh the importance of different tokens in the input sequence when processing each token. Multi-head attention performs this operation in parallel with multiple "heads," allowing the model to focus on different aspects of the input simultaneously. For a decoder, we use a causal mask to prevent attention to future tokens.

class Head(nn.Module):
    """ One head of self-attention """
    def __init__(self, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape # (Batch, Time, Channel/N_EMBD)
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)

        # Compute attention scores ("affinities")
        # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5) # Scale by 1/sqrt(head_size)
        
        # Causal mask: ensure attention only to preceding tokens
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) 
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)

        # Weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v     # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size, n_embd, block_size, dropout) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd) # Projection layer after concatenating heads
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate outputs from all heads (B, T, N_HEADS * head_size)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out) # Project back to N_EMBD
        out = self.dropout(out)
        return out
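
As an aside, PyTorch 2.x also ships a fused kernel, F.scaled_dot_product_attention, that performs the same masked, scaled attention in a single call. The sketch below (using illustrative random tensors) shows that alternative; it is not a change to the Head and MultiHeadAttention modules above.

# Optional PyTorch 2.x path: fused causal attention in one call.
# Assumes q, k, v are shaped (B, n_heads, T, head_size);
# is_causal=True applies the triangular mask internally.
B, T, head_size = 2, 16, N_EMBD // N_HEADS
q = torch.randn(B, N_HEADS, T, head_size)
k = torch.randn(B, N_HEADS, T, head_size)
v = torch.randn(B, N_HEADS, T, head_size)
fused_out = F.scaled_dot_product_attention(q, k, v, is_causal=True) # (B, N_HEADS, T, head_size)
print(fused_out.shape)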

Feed-Forward Network: Non-Linearity and Transformation

After attention, a simple point-wise feed-forward network is applied independently to each position. This network typically consists of two linear transformations with a non-linear activation (like GELU) in between. It allows the model to process the information aggregated by attention.

class FeedForward(nn.Module):
    """ A simple linear layer followed by a non-linearity and another linear layer """
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), # Expansion factor of 4 is common
            nn.GELU(), # Gaussian Error Linear Unit
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

Layer Normalization: Stability and Speed

Layer Normalization is crucial for stabilizing training and speeding up convergence in deep networks. It normalizes the inputs across the feature dimension for each sample independently. It's typically applied before the self-attention and feed-forward sub-layers (pre-norm configuration).

class LayerNorm(nn.Module):
    """ Hand-rolled LayerNorm for demonstration; prefer PyTorch's built-in nn.LayerNorm """
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))  # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))  # learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False) # population variance, as in nn.LayerNorm
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

# PyTorch's nn.LayerNorm is preferred for production:
# norm = nn.LayerNorm(N_EMBD)
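
To convince yourself the hand-rolled version behaves like the built-in, a small illustrative comparison on random input should agree within floating-point tolerance:

# Illustrative check: the custom LayerNorm should match nn.LayerNorm closely.
x = torch.randn(2, 8, N_EMBD)
custom_ln = LayerNorm(N_EMBD)
builtin_ln = nn.LayerNorm(N_EMBD)
print(torch.allclose(custom_ln(x), builtin_ln(x), atol=1e-5)) # Expect True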

Assembling the Transformer Block

A Transformer block combines Multi-Head Self-Attention and a Feed-Forward Network, typically with residual connections and Layer Normalization. This structure helps with gradient flow and allows for deeper networks.

class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_heads, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_heads
        self.sa = MultiHeadAttention(n_heads, head_size, n_embd, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd) # Pre-norm configuration
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Residual connections and layer normalization
        x = x + self.sa(self.ln1(x)) # Pre-norm: normalize, run self-attention, then add the residual
        x = x + self.ffwd(self.ln2(x)) # Pre-norm: normalize, run the feed-forward network, then add the residual
        return x

The Full Decoder-Only Transformer Model

Finally, we stack multiple Transformer blocks, add the embedding layer at the beginning, and a linear layer at the end to predict the logits for the next token.

class SmallGPT(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, n_heads, n_layer, dropout):
        super().__init__()
        self.block_size = block_size

        self.token_and_pos_embedding = TokenAndPositionalEmbedding(vocab_size, n_embd, block_size, dropout)
        self.blocks = nn.Sequential(*[Block(n_embd, n_heads, block_size, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # Final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size) # Linear layer to predict logits

        # Initialize weights for better training stability
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.ones_(module.weight)
            torch.nn.init.zeros_(module.bias)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensor of integers
        x = self.token_and_pos_embedding(idx) # (B, T, N_EMBD)
        x = self.blocks(x) # (B, T, N_EMBD)
        x = self.ln_f(x) # (B, T, N_EMBD)
        logits = self.lm_head(x) # (B, T, VOCAB_SIZE)

        loss = None
        if targets is not None:
            # Reshape logits and targets for F.cross_entropy
            # PyTorch expects (N, C, ...) for input, (N, ...) for target
            logits = logits.view(B*T, -1) # (B*T, VOCAB_SIZE)
            targets = targets.view(-1) # (B*T)
            loss = F.cross_entropy(logits, targets, ignore_index=PAD_TOKEN_ID)

        return logits, loss

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            # Get the predictions
            logits, _ = self(idx_cond)
            # Focus only on the last time step
            logits = logits[:, -1, :] / temperature # (B, VOCAB_SIZE)

            # Apply top-k sampling if specified
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, VOCAB_SIZE)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
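
Before training, it is worth instantiating the model once as a smoke test and counting parameters; with the hyperparameters above and an 8K vocabulary the total lands in the single-digit millions. The snippet below is an illustrative check, not part of the training script itself.

# Illustrative smoke test: build the model and count trainable parameters.
model = SmallGPT(VOCAB_SIZE, N_EMBD, BLOCK_SIZE, N_HEADS, N_LAYER, DROPOUT)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"SmallGPT parameters: {n_params / 1e6:.2f}M")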

Training Process: Bringing the Model to Life

Training involves feeding the model data, calculating the loss, and updating its weights using an optimizer. For generative models, we typically use Cross-Entropy Loss, aiming to maximize the likelihood of the next token given the preceding ones.
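
Concretely, "maximize the likelihood of the next token" means the targets are simply the inputs shifted one position to the left. A minimal sketch of that relationship (not the batching code used below):

# Minimal sketch of next-token targets: y is x shifted one position left.
token_ids = torch.tensor([5, 17, 42, 8, 91, 3])
x = token_ids[:-1] # model input: [ 5, 17, 42,  8, 91]
y = token_ids[1:]  # targets:     [17, 42,  8, 91,  3]
# At position t, the model sees x[:t+1] and is trained to predict y[t].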

# Training Hyperparameters
BATCH_SIZE = 16 # How many independent sequences will we process in parallel?
LEARNING_RATE = 3e-4
MAX_ITERS = 5000 # Number of training steps
EVAL_INTERVAL = 500
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
EVAL_ITERS = 200 # Number of batches to average for evaluation

print(f"Using device: {DEVICE}")

# Load the corpus and prepare data for training
with open(DATA_FILE, 'r', encoding='utf-8') as f:
    text = f.read()

# Encode the entire text with the tokenizer
# If tokenizer failed, this will use the dummy one.
try:
    data = torch.tensor(tokenizer.encode_