Noteworthy AI Research Papers of 2024 (Part Two)
I hope your 2025 is off to a great start! To kick off the year, I've finally finished drafting the second part of this AI Research Highlights of 2024 article. It covers a variety of relevant topics, from mixture-of-experts models to new LLM scaling laws for precision.
Note that this article is Part Two of the series and focuses on the second half of 2024, from July through December. You can find Part One, covering January to June, here.
The selection criteria are admittedly subjective, based on what stood out to me this year. I've also aimed for some variety, so it's not all just about LLM model releases.
With that, happy reading!
7. July: The Llama 3 Herd of Models
Readers are probably already familiar with Meta AI's Llama 3 models and paper, but since these are such important and widely used models, I want to dedicate the July section to The Llama 3 Herd of Models (July 2024) paper by Grattafiori and colleagues.
What's notable about the Llama 3 model family is the increased sophistication of its pre-training and post-training pipelines compared to its Llama 2 predecessor. This is true not only for Llama 3 but also for other LLMs such as Gemma 2, Qwen 2, Apple's Foundation Models, and others, as I described a few months ago in my New LLM Pre-training and Post-training Paradigms article.
7.1 Llama 3 architecture summary
Llama 3 was first released in 8-billion and 70-billion parameter sizes, but the team kept iterating on the model, releasing Llama 3.1, 3.2, and 3.3. The sizes are summarized below, followed by a short example of loading one of these checkpoints.
- Llama 3 (April 2024)
  - 8B parameters
  - 70B parameters
- Llama 3.1 (July 2024, discussed in the paper)
  - 8B parameters
  - 70B parameters
  - 405B parameters
- Llama 3.2 (September 2024)
  - 1B parameters
  - 3B parameters
  - 11B parameters (vision-enabled)
  - 90B parameters (vision-enabled)
- Llama 3.3 (December 2024)
  - 70B parameters
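As a quick aside, if you want to experiment with one of these checkpoints yourself, the snippet below is a minimal sketch using the Hugging Face transformers library. The exact repo id is an assumption on my part, and the official Meta checkpoints are gated, so you need to accept the license terms on the Hugging Face Hub first.

```python
# Minimal sketch: loading and prompting a Llama 3.1 checkpoint via
# Hugging Face transformers. The repo id below is an assumption; the
# gated Meta models require accepting the license on the Hub first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Explain grouped-query attention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```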
Overall, the Llama 3 architecture closely resembles that of Llama 2. The key differences lie in its larger vocabulary and the introduction of grouped-query attention for the smaller model variant. A summary of the differences is shown in the figure below.

If you're curious about the architectural details, a great way to learn is by implementing the model from scratch.
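To make the grouped-query attention piece concrete, here is a minimal, self-contained PyTorch sketch of the idea: a small number of key/value heads is shared across groups of query heads, which shrinks the KV cache at inference time. This is an illustrative sketch rather than Meta's implementation; the hyperparameters are made up, and rotary position embeddings and KV caching are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    # Illustrative sketch of grouped-query attention (not Meta's code):
    # queries keep all heads, while keys/values use fewer "KV heads"
    # that are repeated so each group of query heads shares one K/V head.
    # RoPE and KV caching are omitted for brevity.
    def __init__(self, d_model=4096, num_heads=32, num_kv_heads=8):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.group_size = num_heads // num_kv_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads attends
        # to the same shared keys and values.
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)
```

One handy sanity check: with num_kv_heads set equal to num_heads, the group size becomes 1 and this reduces to standard multi-head attention.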