Project Overview

This project walks through the end-to-end process of implementing a transformer-based language model from scratch. It covers tokenization with GPT-2's BPE tokenizer, positional and token embeddings, masked multi-head attention, and training an autoregressive model on Harry Potter text data. The final model generates coherent text completions from input prompts.

Key Features

  • Custom GPT-style architecture with multi-head attention and residual connections
  • Uses tiktoken for efficient byte-pair encoding (BPE) tokenization
  • Implements causal masking for autoregressive text prediction
  • Trained for two weeks on Harry Potter book text, with validation-based early stopping and periodic sampling
  • Simple text generation function to query the model with prompts

Methodology

The development follows a bottom-up approach, moving from raw text preprocessing to transformer block definitions and training procedures. Token IDs are generated with tiktoken’s GPT-2 encoding, and training data is formed by sliding a fixed-size window over the token sequence. The model is implemented in PyTorch: positional and token embeddings are followed by multiple transformer layers and a linear output head.
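As a concrete illustration, here is a minimal sketch of the sliding-window data preparation as a PyTorch Dataset. The class name GPTDataset and the default window/stride values are illustrative, not taken from the project code:

```python
import tiktoken
import torch
from torch.utils.data import Dataset

class GPTDataset(Dataset):
    """Slides a fixed-size window over the token stream to build (input, target) pairs."""
    def __init__(self, text, max_length=128, stride=64):
        tokenizer = tiktoken.get_encoding("gpt2")      # GPT-2 BPE vocabulary
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            # Targets are the same window shifted one token to the right
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```

Each window is paired with the same window shifted one position to the right, which is exactly what the next-token prediction objective requires.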

Training

The model is trained with cross-entropy loss and the AdamW optimizer. The dataset is chunked with a sliding-window stride, and training proceeds in batched iterations over several epochs. Overfitting is mitigated with dropout and validation-based early stopping.
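A hedged sketch of what such a training loop could look like in PyTorch is shown below; the learning rate, weight-decay value, and patience are illustrative defaults, not the project's actual settings:

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, epochs=10, lr=3e-4, patience=2, device="cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                              # (batch, seq_len, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Validation-based early stopping
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                F.cross_entropy(model(x.to(device)).flatten(0, 1), y.to(device).flatten()).item()
                for x, y in val_loader
            ) / len(val_loader)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
```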

Generation

Once trained, the model can generate coherent continuations for prompts. Sampling defaults to greedy decoding and can be extended with top-k/top-p strategies for more diverse completions.
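The sketch below shows one way such a generation function might be written; the function name, parameters, and defaults are illustrative. Greedy decoding is the fallback, and top-k filtering is applied when top_k is set (top-p would follow a similar pattern):

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50, context_length=128, top_k=None, device="cuda"):
    model.eval()
    ids = torch.tensor(tokenizer.encode(prompt), device=device).unsqueeze(0)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])[:, -1, :]       # logits for the last position only
        if top_k is not None:
            # Keep only the k most likely tokens, then sample from them
            top_logits, _ = torch.topk(logits, top_k)
            logits[logits < top_logits[:, -1:]] = float("-inf")
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        else:
            next_id = torch.argmax(logits, dim=-1, keepdim=True)  # greedy decoding
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids.squeeze(0).tolist())

# Illustrative usage:
# generate(model, tiktoken.get_encoding("gpt2"), "Harry walked into the room and said", top_k=40)
```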

Technical Implementation

Core Components

The implementation uses PyTorch to build each part of the transformer model from scratch:

  1. Embedding Layer: Includes token and positional embeddings of size 768.
  2. Transformer Blocks: Each block combines LayerNorm, masked multi-head self-attention, a GELU feed-forward network, and residual connections (see the sketch after this list).
  3. Loss Function: Cross-entropy loss with targets shifted by one position, so each position predicts the next token.
  4. Optimizer: AdamW with decoupled weight decay, a standard choice for training transformer language models.
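To make the architecture concrete, here is a minimal PyTorch sketch of one block and the surrounding model. For brevity it uses the built-in nn.MultiheadAttention with a boolean causal mask rather than the from-scratch attention built in the project; the embedding size (768) and context length (128) come from this write-up, the vocabulary size follows the GPT-2 tokenizer, and the head count, layer count, dropout rate, and class names are assumptions:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One GPT-style block: LayerNorm, masked multi-head attention, GELU feed-forward, residuals."""
    def __init__(self, emb_dim=768, n_heads=12, context_length=128, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
            nn.Dropout(dropout),
        )
        # Upper-triangular mask blocks attention to future positions (causal masking)
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        seq_len = x.size(1)
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=self.mask[:seq_len, :seq_len])
        x = x + attn_out                      # residual connection around attention
        x = x + self.ff(self.norm2(x))        # residual connection around feed-forward
        return x

class GPTModel(nn.Module):
    """Token + positional embeddings, a stack of transformer blocks, and a linear output head."""
    def __init__(self, vocab_size=50257, emb_dim=768, context_length=128, n_layers=12, n_heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_length, emb_dim)
        self.blocks = nn.Sequential(
            *[TransformerBlock(emb_dim, n_heads, context_length) for _ in range(n_layers)]
        )
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, idx):                                  # idx: (batch, seq_len) token IDs
        positions = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(positions)      # positional embeddings broadcast over the batch
        x = self.blocks(x)
        return self.out_head(self.final_norm(x))             # logits over the vocabulary
```
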

Dataset and Tokenizer

The first four books of the Harry Potter series were used as training data. The text is tokenized using OpenAI's tiktoken with the GPT-2 vocabulary, and the data is split into training and validation sets with a 90/10 ratio.
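In code, the tokenization and split might look like the snippet below; the file name is a placeholder, and splitting the raw text before tokenization is one reasonable reading of the description:

```python
import tiktoken

with open("harry_potter_books_1_to_4.txt", "r", encoding="utf-8") as f:   # placeholder file name
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("gpt2")    # GPT-2 BPE vocabulary (50,257 tokens)

# 90/10 train/validation split on the raw text
split_idx = int(0.9 * len(raw_text))
train_text, val_text = raw_text[:split_idx], raw_text[split_idx:]
print(len(tokenizer.encode(train_text)), len(tokenizer.encode(val_text)))  # token counts per split
```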

Training Details

The model was trained over two weeks on a GPU setup with a batch size of 32, a context length of 128, and a sliding-window stride of 64. Training loss converged to below 1.5.
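Collected in one place, the stated hyperparameters would look roughly like this; the vocabulary size follows the GPT-2 tokenizer, the embedding size comes from the components list above, and the head and layer counts are assumptions not stated in the write-up:

```python
GPT_CONFIG = {
    "vocab_size": 50257,     # GPT-2 BPE vocabulary
    "context_length": 128,   # tokens per training window
    "stride": 64,            # step between consecutive windows
    "batch_size": 32,
    "emb_dim": 768,          # token/positional embedding size
    "n_heads": 12,           # assumed, not stated in the write-up
    "n_layers": 12,          # assumed, not stated in the write-up
}
```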

Results and Sample Outputs

  • Perplexity reduced to below 20 by epoch 5
  • Coherent generation for character-driven prompts from the Harry Potter universe
  • Stable convergence and robust text continuation ability
  • Example: the prompt “Harry walked into the room and said...” yields a full-paragraph continuation

Conclusion

This project demonstrates that with a fundamental understanding of transformers and proper training strategies, it is possible to build a capable LLM from scratch. The pipeline, while minimal, serves as an excellent educational foundation for building more advanced models like GPT-2 and beyond.

Project Information

  • Category: Natural Language Processing
  • Duration: 2 weeks
  • Completed: 2025
  • Institution: University at Buffalo

Technologies Used

Python
PyTorch
tiktoken
Jupyter Notebook
HTML
GitHub

Interested in this project?

If you're curious about how this LLM was built from the ground up, or if you want to collaborate on future deep learning and NLP research, feel free to reach out!

Contact Me