
The Anatomy of the Least Squares Method, Part Four


Hey! It’s Tivadar from The Palindrome.

The legendary Mike X Cohen, PhD, is back with the final part of our deep dive into the least squares method, the bread and butter of data science and machine learning.

Enjoy!

Cheers,
Tivadar


By the end of this post series, you will be confident in understanding, applying, and interpreting regression models (general linear models) that are solved using the famous least-squares algorithm. Here’s a breakdown of the post series:

Part 1: Theory and math. If you haven’t read this post yet, please do so!

Part 2: Explorations in simulations. You learned how to simulate and visualize data and regression results.

Part 3: Real-data examples. Here you learned how to import, inspect, clean, and analyze a real-world dataset using the statsmodels, pandas, and seaborn libraries.

Part 4 (this post): Modeling GPT activations. We’ll dissect OpenAI’s GPT2 model, the precursor to its state-of-the-art ChatGPT. You’ll learn more about least squares and also about LLM mechanisms.
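
If you'd like a quick refresher on the core math before we dig into GPT2, here is a minimal sketch (my own recap, not code from the earlier posts) of the closed-form least-squares solution beta = (X^T X)^(-1) X^T y on simulated data, checked against NumPy's built-in solver:

```python
import numpy as np

# Simulate data from a known model: y = 1 + 2x + noise
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(scale=0.5, size=100)

# Design matrix: a column of ones (intercept) plus the predictor
X = np.column_stack([np.ones_like(x), x])

# Closed-form solution of the normal equations: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same fit via NumPy's dedicated least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # both should be close to [1, 2]
print(beta_lstsq)
```

Both approaches recover coefficients close to the true intercept of 1 and slope of 2; the dedicated solver is simply more numerically robust when the design matrix is ill-conditioned.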

Following along with code

Seeing math come alive in code gives you a deeper understanding and intuition — and that warm fuzzy feeling of confidence in your newly honed coding and machine-learning skills. You can learn a lot of math with a bit of code.

Here is the link to the code for this post on my GitHub. I recommend following along with the code as you read.


📌 The Palindrome breaks down advanced math and machine learning concepts with visuals that make everything click.

Join the premium tier to get access to the upcoming live courses on Neural Networks from Scratch and Mathematics of Machine Learning.


Import and inspect the GPT2 model

A large language model (LLM) is a deep-learning model trained to take text as input and predict what text should come next. It’s a form of generative AI because it uses context and learned world knowledge to generate new text.
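
To make that concrete, here is a minimal sketch of what importing and inspecting GPT2 can look like. It uses the Hugging Face transformers library, which is an assumption on my part; the notebook linked above may load the model differently:

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Download the pretrained GPT2 weights and the matching tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Inspect the architecture: layers, attention heads, embedding width
cfg = model.config
print(f"layers: {cfg.n_layer}, heads: {cfg.n_head}, embedding dim: {cfg.n_embd}")
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")

# Run a short prompt through the model and collect the activations
inputs = tokenizer("Least squares is the bread and butter of", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One hidden-state tensor per layer (plus the initial embeddings),
# each shaped (batch, tokens, n_embd)
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```

Those per-layer hidden states are the “activations” we’ll be modeling with least squares in this post.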

If you think LLMs are so complicated that they are impossible to understand, then I have bad news for you… you’re wrong! LLMs are not so complicated, and you can learn all about them with just a high-school-level math background. If you’d like to use Python to learn how LLMs are designed and how they work, check out my 6-part series on using machine learning to understand LLM mechanisms here on Substack.

There are two goals of this post: ...
