Building a Small AI Model From Scratch: A Senior Engineer's Guide
You've used pre-trained models, fine-tuned them, and deployed them. Now, you want to understand the engine under the hood. Building a small AI model from scratch isn't just an academic exercise; it's a critical path to truly understanding the architectural choices, data dependencies, and training dynamics that make large language models tick. This guide walks through the essential components and processes, focusing on a decoder-only Transformer, using modern tools and PyTorch.
Data: The Foundation of Intelligence
The quality and relevance of your data directly dictate your model's capabilities. For a small model, curate a clean, focused dataset. Forget terabytes; think megabytes or a few gigabytes of high-quality, task-specific text. For instance, if you're building a code completion model, use a dataset of Python scripts. If it's a creative writing assistant, use fiction excerpts.
Start with raw text, then clean it. Remove boilerplate, HTML tags, excessive whitespace, and duplicate lines. Normalize Unicode characters. For a small model, consider a domain-specific dataset rather than a generic web crawl. My recommendation: for a first build, pick a readily available, clean dataset like a subset of Project Gutenberg or a specific GitHub repository's code.
import os
import re
from datasets import load_dataset  # pip install datasets

def clean_text(text):
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII characters
    return text

# Example: Loading a small subset of WikiText-2
# For a real project, you'd download and process a custom corpus.
print("Loading dataset...")
try:
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Take a small sample for demonstration purposes
    sample_size = 10000  # Roughly 10,000 lines
    raw_text_data = [item['text'] for item in dataset.select(range(sample_size)) if item['text'].strip()]
    print(f"Initial raw text lines: {len(raw_text_data)}")
    # Clean and concatenate
    cleaned_corpus = "\n".join([clean_text(text) for text in raw_text_data])
    # Save to a file for tokenizer training
    output_file = "corpus.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(cleaned_corpus)
    print(f"Cleaned corpus saved to {output_file} (approx {len(cleaned_corpus) / 1024:.2f} KB)")
except Exception as e:
    print(f"Error loading dataset: {e}. Please ensure the 'datasets' library is installed and you have an internet connection.")
    print("Falling back to a dummy corpus for demonstration.")
    cleaned_corpus = "This is a sample sentence for demonstrating the tokenizer. It contains various words and punctuation. We will use this text to train our subword tokenizer from scratch. The quick brown fox jumps over the lazy dog. " * 100
    with open("corpus.txt", "w", encoding="utf-8") as f:
        f.write(cleaned_corpus)
    print("Dummy corpus saved to corpus.txt")

DATA_FILE = "corpus.txt"
Tokenization: Bridging Text and Tensors
Machines don't understand text; they understand numbers. Tokenization is the process of converting raw text into numerical representations (tokens) that a model can process. For generative models, subword tokenization is standard, balancing vocabulary size with the ability to represent unseen words.
Byte Pair Encoding (BPE)
BPE is a compression algorithm adapted for text. It iteratively merges the most frequent pairs of characters or character sequences into new, single tokens. This creates a vocabulary of common words, subwords, and characters. It's efficient and handles out-of-vocabulary words by breaking them down into smaller, known units.
The core idea: Start with individual characters. Find the most frequent adjacent pair of tokens and replace all occurrences of that pair with a new, merged token. Repeat until a desired vocabulary size is reached or no more merges are possible.
# Conceptual BPE (simplified, not production-ready)
from collections import defaultdict

def get_stats(vocab):
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split(' ')
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only when it appears as whole, space-separated symbols
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in v_in.items():
        v_out[p.sub(''.join(pair), word)] = freq
    return v_out

def train_bpe(corpus, num_merges):
    # Build the initial vocabulary: each word becomes space-separated characters
    words = defaultdict(int)
    for word in corpus.split():
        words[' '.join(word) + ' </w>'] += 1  # Add end-of-word marker
    vocab = dict(words)
    merges = {}
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
        merges[best_pair] = ''.join(best_pair)
        # print(f"Merge {i+1}: {best_pair} -> {''.join(best_pair)}")
    # This simplified version doesn't extract the final tokens,
    # but demonstrates the merge process.
    # A real BPE implementation would build a mapping from subword to ID.
    print(f"Trained {len(merges)} BPE merges.")
    return merges

# We won't run this full BPE training here; it's illustrative.
# For production, use SentencePiece or Hugging Face tokenizers.
# bpe_merges = train_bpe(cleaned_corpus, num_merges=100)
SentencePiece for Production
SentencePiece, developed by Google, is a language-agnostic subword tokenizer. It treats the input as a raw stream of Unicode characters, including whitespace, which simplifies pre-processing and avoids issues with different language tokenization rules. It can train BPE or Unigram models.
import sentencepiece as spm  # pip install sentencepiece

# Define SentencePiece model parameters
SPM_MODEL_PREFIX = "my_spm_model"
VOCAB_SIZE = 8000  # A reasonable size for a small model
CHARACTER_COVERAGE = 0.9995  # Cover almost all characters in the corpus

print(f"Training SentencePiece model with vocab size {VOCAB_SIZE}...")
try:
    spm.SentencePieceTrainer.train(
        input=DATA_FILE,
        model_prefix=SPM_MODEL_PREFIX,
        vocab_size=VOCAB_SIZE,
        character_coverage=CHARACTER_COVERAGE,
        model_type="bpe",  # Can also be "unigram"
        num_threads=os.cpu_count(),
        # Additional options for better performance/control
        bos_id=-1,  # No beginning-of-sentence token
        eos_id=1,   # End-of-sentence token (often used as padding/mask)
        pad_id=0,   # Padding token
        unk_id=2,   # Unknown token
        # Allow sentencepiece to learn a special token for newline if present
        # user_defined_symbols=['\n']
    )
    print(f"SentencePiece model trained and saved as {SPM_MODEL_PREFIX}.model and {SPM_MODEL_PREFIX}.vocab")

    # Load the trained tokenizer
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load(f"{SPM_MODEL_PREFIX}.model")

    # Test the tokenizer
    sample_text = "This is an example sentence for our new tokenizer. How does it handle punctuation and unknown words?"
    encoded_ids = tokenizer.encode_as_ids(sample_text)
    decoded_text = tokenizer.decode_ids(encoded_ids)
    print(f"\nOriginal: '{sample_text}'")
    print(f"Encoded IDs: {encoded_ids}")
    print(f"Decoded: '{decoded_text}'")
    print(f"Vocabulary size: {tokenizer.get_piece_size()}")
except Exception as e:
    print(f"Error training SentencePiece: {e}")
    print("Please ensure you have a valid 'corpus.txt' file.")
    # Fallback for demonstration if SentencePiece fails
    class DummyTokenizer:
        def __init__(self, vocab_size=8000):
            self.vocab_size = vocab_size
            self.word_to_id = {
                'this': 3, 'is': 4, 'an': 5, 'example': 6, 'sentence': 7,
                'for': 8, 'our': 9, 'new': 10, 'tokenizer': 11, '.': 12,
                'how': 13, 'does': 14, 'it': 15, 'handle': 16, 'punctuation': 17,
                'and': 18, 'unknown': 19, 'words': 20, '?': 21,
                '<pad>': 0, '<eos>': 1, '<unk>': 2,  # Using 2 for unk as a common fallback
            }
            self.id_to_word = {v: k for k, v in self.word_to_id.items()}
            self.max_id = max(self.id_to_word.keys())
            # Add some more dummy tokens up to vocab_size
            for i in range(self.max_id + 1, vocab_size):
                self.id_to_word[i] = f"token_{i}"
                self.word_to_id[f"token_{i}"] = i

        def encode_as_ids(self, text):
            # Simple whitespace tokenization for dummy
            tokens = text.lower().replace('.', ' . ').replace('?', ' ? ').split()
            return [self.word_to_id.get(token, self.word_to_id['<unk>']) for token in tokens]

        def decode_ids(self, ids):
            return " ".join([self.id_to_word.get(id, '<unk>') for id in ids])

        def get_piece_size(self):
            return self.vocab_size

    tokenizer = DummyTokenizer(vocab_size=VOCAB_SIZE)
    print("Using dummy tokenizer for demonstration.")

VOCAB_SIZE = tokenizer.get_piece_size()  # Update VOCAB_SIZE based on actual tokenizer
PAD_TOKEN_ID = tokenizer.pad_id() if hasattr(tokenizer, 'pad_id') else 0
EOS_TOKEN_ID = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 1
Tiktoken-Style Efficiency
Tiktoken, from OpenAI, is a highly optimized BPE implementation. It's not a training library but a fast inference engine for specific BPE models. Its key characteristic is speed, achieved through Rust implementations and efficient data structures. While you won't train a Tiktoken model from scratch directly, understanding its approach means prioritizing fast encoding/decoding and efficient vocabulary management. For your custom model, SentencePiece with BPE is a robust choice that provides both training and inference.
| Feature | BPE (Conceptual) | SentencePiece (BPE) | Tiktoken-style |
|---|---|---|---|
| Training | Algorithm, requires custom implementation or a library. | Built-in trainer, language-agnostic. | Pre-trained, not for custom training. |
| Input Handling | Typically requires pre-tokenization into words. | Raw text stream (including whitespace), treats everything as characters. | Optimized for specific pre-tokenization rules used by OpenAI. |
| Speed | Depends on implementation, Python can be slow. | Fast C++ backend, good for production. | Extremely fast (Rust), highly optimized for inference. |
| Vocabulary | Subword units, handles OOV by breaking down. | Subword units, handles OOV by breaking down. | Subword units, specific to OpenAI models. |
| Use Case | Understanding the core algorithm. | Custom model training, production deployment, multilingual. | Using OpenAI's models, fast inference with their tokenizers. |
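On the inference side, tiktoken-style encoding boils down to repeatedly applying the learned merge with the lowest rank until no learned merge applies. A toy, pure-Python sketch of that loop, using a hypothetical four-entry merge table (real tokenizers ship tens of thousands of ranks and do this in optimized native code):

```python
def bpe_encode(word, merge_ranks):
    """Greedily apply the lowest-rank (most frequent) merge until none apply."""
    parts = list(word)
    while len(parts) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        candidates = [(merge_ranks.get((parts[i], parts[i + 1]), float('inf')), i)
                      for i in range(len(parts) - 1)]
        best_rank, i = min(candidates)
        if best_rank == float('inf'):
            break  # No more learned merges apply
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts

# Hypothetical merge ranks learned during training (lower = merged earlier)
ranks = {('l', 'o'): 0, ('lo', 'w'): 1, ('e', 'r'): 2, ('low', 'er'): 3}
print(bpe_encode("lower", ranks))   # ['lower']
print(bpe_encode("lowest", ranks))  # ['low', 'e', 's', 't']
```

Note how "lowest" falls back to the known subword "low" plus single characters; this is how rank-based BPE handles words never seen during training.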
The Decoder-Only Transformer Architecture
For generative text tasks (like next-word prediction), the decoder-only Transformer is the standard. It processes input sequentially, attending only to past tokens, and predicts the next token. This architecture, popularized by models like GPT, is simpler than encoder-decoder models and highly effective for auto-regressive generation. We'll build ours using PyTorch (version 2.x recommended).
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# Hyperparameters for our small model
# These are illustrative, adjust based on your dataset and compute.
N_EMBD = 256 # Embedding dimension
N_HEADS = 4 # Number of attention heads
N_LAYER = 4 # Number of Transformer blocks
BLOCK_SIZE = 128 # Maximum sequence length for context
DROPOUT = 0.1 # Dropout rate
Embedding Layer: Initial Representation
Each token ID needs to be converted into a dense vector representation. This is the token embedding. Additionally, since Transformers process sequences in parallel without inherent order, we need positional encoding to inject positional information.
class TokenAndPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, dropout):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.dropout = nn.Dropout(dropout)
        self.block_size = block_size

    def forward(self, idx):
        # idx is (B, T) tensor of integers
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"Input sequence length {T} exceeds block_size {self.block_size}")
        tok_emb = self.token_embedding_table(idx)  # (B, T, N_EMBD)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, N_EMBD)
        x = tok_emb + pos_emb  # (B, T, N_EMBD)
        return self.dropout(x)
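The shape arithmetic here is worth seeing in isolation: the `(T, N_EMBD)` positional embeddings broadcast across the batch dimension when added to the `(B, T, N_EMBD)` token embeddings. A quick check with toy sizes (illustrative values, not the hyperparameters above):

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 100, 8, 16
tok_table = nn.Embedding(vocab_size, n_embd)
pos_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (2, 10))  # (B=2, T=10) token IDs
tok_emb = tok_table(idx)                     # (2, 10, 8)
pos_emb = pos_table(torch.arange(10))        # (10, 8)
x = tok_emb + pos_emb                        # (10, 8) broadcasts across the batch
print(x.shape)  # torch.Size([2, 10, 8])
```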
Multi-Head Self-Attention: Contextual Understanding
Self-attention allows the model to weigh the importance of different tokens in the input sequence when processing each token. Multi-head attention performs this operation in parallel with multiple "heads," allowing the model to focus on different aspects of the input simultaneously. For a decoder, we use a causal mask to prevent attention to future tokens.
class Head(nn.Module):
    """ One head of self-attention """
    def __init__(self, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape  # (Batch, Time, Channel/N_EMBD)
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)
        # Causal mask: ensure attention only to preceding tokens
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        # Weighted aggregation of the values
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v      # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size, n_embd, block_size, dropout) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # Projection layer after concatenating heads
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate outputs from all heads (B, T, N_HEADS * head_size)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)  # Project back to N_EMBD
        out = self.dropout(out)
        return out
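The effect of the causal mask can be checked in isolation, independent of any learned weights. With all affinities equal, masked softmax gives each position a uniform distribution over itself and the past, and exactly zero weight on the future:

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)           # Pretend all affinities are equal
tril = torch.tril(torch.ones(T, T))  # Lower-triangular causal mask
scores = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)
# Row 0 -> [1, 0, 0, 0]; row 3 -> [0.25, 0.25, 0.25, 0.25]
```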
Feed-Forward Network: Non-Linearity and Transformation
After attention, a simple point-wise feed-forward network is applied independently to each position. This network typically consists of two linear transformations with a non-linear activation (like GELU) in between. It allows the model to process the information aggregated by attention.
class FeedFoward(nn.Module):
    """ A simple linear layer followed by a non-linearity and another linear layer """
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # Expansion factor of 4 is common
            nn.GELU(),  # Gaussian Error Linear Unit
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
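The choice of GELU over ReLU matters mostly near zero: GELU is smooth and passes small negative activations through at reduced magnitude instead of clipping them to zero, which tends to help gradient flow. A quick comparison:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(nn.ReLU()(x))  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(nn.GELU()(x))  # small negative inputs map to small negative outputs, not zero
```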
Layer Normalization: Stability and Speed
Layer Normalization is crucial for stabilizing training and speeding up convergence in deep networks. It normalizes the inputs across the feature dimension for each sample independently. It's typically applied before the self-attention and feed-forward sub-layers (pre-norm configuration).
class LayerNorm(nn.Module):
    """ Simple LayerNorm for demonstration; prefer PyTorch's built-in """
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # Biased variance, like nn.LayerNorm
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

# PyTorch's nn.LayerNorm is preferred for production:
# norm = nn.LayerNorm(N_EMBD)
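As a reference point, `nn.LayerNorm` at initialization (gamma = 1, beta = 0) normalizes with the biased variance inside the square root, assuming the default eps of 1e-5. That computation can be reproduced by hand:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 16)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # Biased variance
manual = (x - mean) / torch.sqrt(var + 1e-5)
builtin = nn.LayerNorm(16)(x)  # Freshly initialized: gamma = 1, beta = 0
print(torch.allclose(manual, builtin, atol=1e-5))  # True
```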
Assembling the Transformer Block
A Transformer block combines Multi-Head Self-Attention and a Feed-Forward Network, typically with residual connections and Layer Normalization. This structure helps with gradient flow and allows for deeper networks.
class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_heads, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_heads
        self.sa = MultiHeadAttention(n_heads, head_size, n_embd, block_size, dropout)
        self.ffwd = FeedFoward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)  # Pre-norm configuration
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Pre-norm residual connections: normalize, transform, then add back
        x = x + self.sa(self.ln1(x))    # Normalize, apply attention, add residual
        x = x + self.ffwd(self.ln2(x))  # Normalize, apply FFN, add residual
        return x
The Full Decoder-Only Transformer Model
Finally, we stack multiple Transformer blocks, add the embedding layer at the beginning, and a linear layer at the end to predict the logits for the next token.
class SmallGPT(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, n_heads, n_layer, dropout):
        super().__init__()
        self.block_size = block_size
        self.token_and_pos_embedding = TokenAndPositionalEmbedding(vocab_size, n_embd, block_size, dropout)
        self.blocks = nn.Sequential(*[Block(n_embd, n_heads, block_size, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # Final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)  # Linear layer to predict logits
        # Initialize weights for better training stability
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.ones_(module.weight)
            torch.nn.init.zeros_(module.bias)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B, T) tensors of integers
        x = self.token_and_pos_embedding(idx)  # (B, T, N_EMBD)
        x = self.blocks(x)  # (B, T, N_EMBD)
        x = self.ln_f(x)    # (B, T, N_EMBD)
        logits = self.lm_head(x)  # (B, T, VOCAB_SIZE)
        loss = None
        if targets is not None:
            # Reshape logits and targets for F.cross_entropy:
            # PyTorch expects (N, C) for input, (N,) for target
            logits = logits.view(B * T, -1)  # (B*T, VOCAB_SIZE)
            targets = targets.view(-1)       # (B*T,)
            loss = F.cross_entropy(logits, targets, ignore_index=PAD_TOKEN_ID)
        return logits, loss

    @torch.no_grad()  # Generation needs no gradients
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            # Get the predictions
            logits, _ = self(idx_cond)
            # Focus only on the last time step
            logits = logits[:, -1, :] / temperature  # (B, VOCAB_SIZE)
            # Apply top-k sampling if specified
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, VOCAB_SIZE)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx
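Before training, a back-of-envelope parameter count is a useful sanity check. The arithmetic below follows the layer shapes defined above for the chosen hyperparameters (vocab 8000, embedding 256, 4 layers, block size 128):

```python
# Back-of-envelope parameter count (biases and layer norms included)
V, D, L, T = 8000, 256, 4, 128  # vocab, embed dim, layers, block size

emb = V * D + T * D                          # token + positional embeddings
attn = 3 * D * D + (D * D + D)               # K/Q/V (no bias) + output projection
ffn = (D * 4 * D + 4 * D) + (4 * D * D + D)  # expand + contract, with biases
norms = 2 * 2 * D                            # two LayerNorms per block (gamma, beta)
block = attn + ffn + norms
final_ln = 2 * D
lm_head = D * V + V

total = emb + L * block + final_ln + lm_head
print(f"{total:,} parameters")  # 7,293,248 -- about 7.3M, easily trainable on one GPU
```

Note that the embedding table and the `lm_head` together account for over half the parameters at this scale, which is why small models often tie those two weight matrices.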
Training Process: Bringing the Model to Life
Training involves feeding the model data, calculating the loss, and updating its weights using an optimizer. For generative models, we typically use Cross-Entropy Loss, aiming to maximize the likelihood of the next token given the preceding ones.
# Training Hyperparameters
BATCH_SIZE = 16 # How many independent sequences will we process in parallel?
LEARNING_RATE = 3e-4
MAX_ITERS = 5000 # Number of training steps
EVAL_INTERVAL = 500
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
EVAL_ITERS = 200 # Number of batches to average for evaluation
print(f"Using device: {DEVICE}")
# Load the corpus and prepare data for training
with open(DATA_FILE, 'r', encoding='utf-8') as f:
    text = f.read()
# Encode the entire text with the tokenizer
# If tokenizer failed, this will use the dummy one.
try:
data = torch.tensor(tokenizer.encode_