In December 2023, I started reading the insightful book 'Build a Large Language Model (From Scratch)' by Sebastian Raschka, a treasure trove of knowledge for anyone fascinated by the world of language models.
Raschka, an expert in the field, offers a deep dive into the intricacies of building large language models. His book, still in early access, is more than just theory; it's a hands-on experience. And the best part? The author has provided a wealth of code implementations, a goldmine for practical learners like me.
Diving into the code, I set up a playground in a Jupyter Notebook. It's here that the concepts leaped off the page and came to life, allowing me to grasp the nuances of tokenization, the process of transforming a sequence of characters (like a sentence) into a sequence of tokens. These tokens can be words, subwords, or even individual characters, depending on the algorithm used.
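For a feel of what this means, a naive word-level tokenizer fits in a few lines of Python, in the spirit of the book's early examples (the regex and sample sentence here are my own illustration, not the author's exact code):

```python
import re

text = "Hello, world. Is this-- a test?"

# Split on punctuation, double dashes, and whitespace; the capturing
# group keeps the delimiters so punctuation becomes its own token
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```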
Consider the alphabet (whether ASCII, Unicode, etc.) as a set of characters. Similarly, for a language model we define a vocabulary: a set of tokens. Tokenization can then be thought of as a mapping function that maps a string (a sequence of characters) to a sequence of tokens. This mapping is not always one-to-one: the same character can end up in different tokens depending on context, and a single token can represent multiple characters.

Once tokenized, each token is mapped to an integer ID from the vocabulary; inside the model, an embedding layer turns these IDs into vectors in a high-dimensional space. These vectors capture not just the identity of a token but also semantic and syntactic properties.

Different tokenization algorithms (like Byte-Pair Encoding, WordPiece, and SentencePiece) aim to balance the size of the vocabulary against the ability to represent a language accurately. In LLMs, language is often modelled as a probabilistic system, with probability distributions learned over sequences of tokens, not just individual tokens.
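A quick way to experiment with Byte-Pair Encoding is OpenAI's tiktoken library, which the book's notebooks also use. A minimal sketch (the example sentence is my own):

```python
# pip install tiktoken
import tiktoken

# GPT-2's Byte-Pair Encoding tokenizer (vocabulary of 50,257 tokens)
enc = tiktoken.get_encoding("gpt2")

text = "Tokenization maps characters to tokens."
ids = enc.encode(text)

print(ids)                             # integer token IDs from the vocabulary
print([enc.decode([i]) for i in ids])  # the subword piece behind each ID
assert enc.decode(ids) == text         # encoding/decoding round-trips losslessly
```

Printing the individual pieces makes the vocabulary trade-off visible: frequent words tend to get a single token, while rarer words are split into several subword pieces.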
🔗 Grab your copy here: https://www.manning.com/books/build-a-large-language-model-from-scratch
Explore the code implementation by the author: https://github.com/rasbt/LLMs-from-scratch