Understanding Multimodal LLMs
It has been a wild two months. Once again, there have been many developments in AI research, including two Nobel Prizes awarded for AI-related work and several interesting research papers.
Among others, Meta AI released their latest Llama 3.2 models, which include open-weight 1B and 3B large language models as well as two multimodal models.
In this article, I aim to explain how multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other multimodal papers and models published in recent weeks (including Llama 3.2) to compare their approaches.

But before we begin, I also have some exciting news to share on the personal front! My book, "Build A Large Language Model (From Scratch)", is now finally available on Amazon!

Writing this book was a tremendous effort, and I'm incredibly grateful for all the support and motivating feedback over the past two years, especially in these last couple of months as so many kind readers have shared their thoughts. Thank you all; as an author, there is nothing more motivating than hearing that the book makes a difference in your careers!
For those who have finished the book and are eager for more, stay tuned! I’ll be adding some bonus content to the GitHub repository in the coming months.
P.S. If you have read the book, I'd really appreciate it if you could leave a brief review; it truly helps us authors!
1. Use cases of multimodal LLMs
What are multimodal LLMs? As hinted at in the introduction, multimodal LLMs are large language models capable of processing multiple types of inputs, where each "modality" refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more. For simplicity, we will primarily focus on the image modality alongside text inputs.
A classic and intuitive application of multimodal LLMs is image captioning: you provide an input image, and the model generates a description of the image, as shown in the figure below.

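If you want to try image captioning yourself, the sketch below shows one possible way to do it with the Hugging Face transformers library and an off-the-shelf image-to-text model. The model ID and image path are placeholders for illustration, not something used in this article.

```python
# Minimal image-captioning sketch (illustrative, not the approach covered later
# in this article). Assumes `transformers` and `Pillow` are installed and that
# "example.jpg" is a placeholder path to any local image.
from transformers import pipeline

# "Salesforce/blip-image-captioning-base" is just one example of an
# image-to-text model; any compatible model ID could be swapped in.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

result = captioner("example.jpg")  # also accepts a PIL.Image or an image URL
print(result[0]["generated_text"])  # prints a short description of the image
```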
Of course, there are many other use cases for multimodal LLMs beyond image captioning.