Meta has taken a bold step in the world of artificial intelligence with the announcement of its upgraded large language model, Llama 3.2. Unveiled during the recent Meta Connect event, this new version goes beyond just text processing—it introduces visual capabilities that allow it to “see” as well. Remarkably, some iterations of this model can fit onto smartphones without compromising performance, opening up new possibilities for private, local AI interactions and custom applications that do not require data to be sent to third-party servers.
The Llama 3.2 model comes in four distinct versions, each designed for specific tasks. The heavyweight models—11 billion (11B) and 90 billion (90B) parameters—demonstrate impressive capabilities in both text and image processing. They can handle intricate tasks such as chart analysis, image captioning, and even object recognition based on natural language descriptions. This versatility makes Llama 3.2 a formidable contender in the competitive landscape of AI.
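To make that concrete, here is a minimal sketch of how the 11B vision model could be asked to describe a chart using the Hugging Face transformers library. The checkpoint name and classes follow the publicly documented Llama 3.2 Vision integration, but treat them as assumptions to verify against your own environment rather than official guidance from Meta.

```python
# Hedged sketch: asking the 11B vision model to summarise a chart screenshot.
# Assumes access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct checkpoint
# and a recent transformers release with Mllama support.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint name
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("quarterly_revenue_chart.png")  # any local chart screenshot
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarise the main trend shown in this chart."},
    ]}
]

# The processor renders the chat messages and image into model-ready tensors.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```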
Coinciding with the release of Meta’s Llama 3.2, the Allen Institute introduced its own multimodal vision model, Molmo. Initial tests suggest that Molmo competes favourably with leading models like GPT-4o, Claude 3.5 Sonnet, and Reka Core, setting a high bar for performance in the open-source AI sector.
Meta also presented two smaller models, the 1B and 3B parameter versions, targeting efficiency and speed for tasks that require less computational power. These compact models handle multilingual text well and support "tool-calling," meaning they can decide when to invoke external functions and APIs that a developer exposes to them. Despite their smaller size, they feature a 128K token context window, comparable to that of GPT-4o and other high-end models, making them well suited to summarisation, instruction following, and rewriting tasks.
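For readers who want to see what tool-calling looks like in practice, the sketch below hands the 3B instruct model a simple Python function as a callable tool via the transformers chat template. The checkpoint name and template behaviour are assumptions drawn from the published model cards, not a confirmed recipe.

```python
# Hedged sketch: exposing a stub function as a tool to the 3B instruct model.
# The checkpoint name and tool-template support are assumptions; adjust for
# whichever Llama 3.2 build and serving stack you actually run.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")


def get_weather(city: str) -> str:
    """Get the current weather for a city (stub for illustration).

    Args:
        city: Name of the city to look up.
    """
    return f"Sunny and 22 degrees in {city}"


messages = [{"role": "user", "content": "What's the weather like in Lisbon?"}]

# Recent transformers releases can render Python functions into the model's
# tool-calling prompt format directly from the chat template.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=200)
# The model is expected to reply with a structured call naming get_weather
# and its arguments, which the calling code then executes on its behalf.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```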
The engineering team at Meta took an inventive route to these smaller models. They used structured pruning to strip redundant parameters from larger models, followed by knowledge distillation, in which the pruned models are trained to reproduce the behaviour of their larger counterparts. The outcome is a series of compact models that outperform rivals such as Google's Gemma 2 2.6B and Microsoft's Phi-2 2.7B across a range of benchmarks.
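Meta has not released the full training recipe, but the core idea of knowledge distillation is straightforward: the small "student" model is trained to match the softened output distribution of a larger "teacher" alongside the usual next-token objective. The snippet below is purely illustrative of that loss, not Meta's actual code.

```python
# Illustrative distillation loss, not Meta's recipe: the student learns from
# both the ground-truth tokens and the teacher's softened token distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids,
                      temperature=2.0, alpha=0.5):
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)
    t = teacher_logits.view(-1, vocab)

    # Standard cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(s, target_ids.view(-1))

    # KL divergence between softened teacher and student distributions;
    # the temperature exposes the teacher's preferences over near-miss
    # tokens, and T^2 rescales the gradient magnitude.
    kl = F.kl_div(
        F.log_softmax(s / temperature, dim=-1),
        F.softmax(t / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ce + (1 - alpha) * kl
```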
In a bid to enhance on-device AI capabilities, Meta has partnered with hardware giants such as Qualcomm, MediaTek, and Arm, ensuring that Llama 3.2 is compatible with mobile chips from the outset. This collaboration extends to major cloud service providers, including AWS, Google Cloud, and Microsoft Azure, all of which offer immediate access to these new models.
Llama 3.2's vision capabilities come from a deliberate architectural choice rather than a full retrain. Meta bridges a pre-trained image encoder to the text-processing core through adapter weights, cross-attention layers that feed image representations into the language model, while the original language-model parameters are left frozen. Because the text weights are untouched, the visual abilities do not detract from text-processing performance, and users can expect output on par with, if not better than, its predecessor, Llama 3.1.
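A heavily simplified sketch of what such an adapter block could look like appears below; the class, gate, and dimensions are hypothetical illustrations of the cross-attention idea, not Meta's implementation.

```python
# Simplified, illustrative adapter block: the language model's hidden states
# cross-attend to image-encoder features, and a gate initialised at zero lets
# the block start as a no-op so text-only behaviour is preserved.
# Names and dimensions are hypothetical, not Meta's implementation.
import torch
import torch.nn as nn

class VisionAdapterBlock(nn.Module):
    def __init__(self, text_dim=4096, image_dim=1280, num_heads=32):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, text_dim)  # map image features into text space
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero gate keeps the frozen LM's output intact at init

    def forward(self, text_hidden, image_features):
        img = self.image_proj(image_features)
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        # Residual connection scaled by a learned gate: only the adapter weights
        # are trained, while the underlying language model stays frozen.
        return text_hidden + torch.tanh(self.gate) * attended
```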
Eager to assess Llama 3.2's capabilities, we ran a series of tests across various tasks. In text-based interactions, the model generally matched its predecessors, but coding produced mixed results. Testing on Groq's platform, Llama 3.2 generated working code for popular games and basic programs without trouble. The smaller 11B model, however, struggled to produce functional code for a custom game we designed, while the larger 90B model succeeded on the first attempt.
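Our coding prompts went through Groq's OpenAI-compatible chat API. The sketch below shows how such a test can be reproduced with the groq Python SDK; the model identifier is a placeholder, so check Groq's current model list before running it.

```python
# Hedged sketch of prompting Llama 3.2 for code via Groq's chat API.
# The model identifier below is a placeholder; look up the current
# Llama 3.2 names in Groq's model list before running.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.2-90b-text-preview",  # placeholder model ID
    messages=[
        {"role": "user",
         "content": "Write a complete, runnable Python implementation of Snake using pygame."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```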
One of Llama 3.2’s standout features is its ability to identify subjective elements within images. When presented with a cyberpunk-style image and asked about its alignment with the steampunk aesthetic, the model accurately assessed the style, pointing out that the image lacked key elements associated with steampunk. This demonstrates the model’s ability to interpret complex visual themes and provide insightful feedback.
Llama 3.2 also shows promise in chart analysis, although it requires high-resolution images to perform optimally. In our tests, when we provided a screenshot containing a chart—one that other models like Molmo and Reka handled with ease—Llama 3.2 struggled due to the lower image quality. The model apologised for its inability to read the text correctly, highlighting an area for improvement. However, when we presented a larger image containing text, such as a presentation slide, Llama 3.2 excelled, correctly identifying the context and distinguishing between names and job roles without errors.
The overall verdict on Llama 3.2 is that it represents a significant leap forward from its predecessor and contributes positively to the open-source AI landscape. Its strengths lie in image interpretation and handling large text, although there remain areas for potential enhancement, particularly regarding lower-quality image processing and complex coding tasks.
Looking ahead, the promise of on-device compatibility is a strong indicator of a shift towards more private and local AI applications, providing a viable alternative to proprietary offerings like Gemini Nano and Apple’s closed models. As Meta continues to innovate, Llama 3.2 positions itself as a formidable player in the open-source AI sector, showcasing the potential for enhanced user experiences through its advanced capabilities and accessibility. The future of AI appears to be bright, particularly with advancements like Llama 3.2 leading the charge in creating more intelligent, versatile, and user-friendly technologies.