Meta’s Llama 3.1: what’s new?

26th July 2024
Harry Fowle

Meta has recently introduced its newest range of AI models, Llama 3.1, so what’s new in this range and what might you expect to see?

Within AI, foundation models have become integral building blocks for applications ranging from natural language processing (NLP) to complex problem-solving tasks. Meta’s latest release, the Llama 3.1 series (or ‘herd’, as the company calls it), is a significant upgrade within this landscape, introducing a new range of advanced capabilities in language understanding, multilingual support, and integrated multimodal processing.

The Llama 3.1 herd is designed to push the boundaries of AI performance, offering new and improved solutions that rival some of the best models out there at the moment, such as OpenAI’s GPT-4.

These models are built upon a dense Transformer architecture, a choice that Meta believes underscores its commitment to stability and scalability in AI development. The flagship model, which has 405 billion parameters and can process up to 128,000 tokens in a single context window, is the company’s proof of these ideas, providing a strong foundation for handling a wide array of AI-based tasks.

A big part of the Llama project is to keep it open-sourced, giving back and contributing directly to the AI research community. These new models are offered under an open license to encourage further innovation and collaboration within the Llama herd.

Meta frames the release of Llama 3.1 not just as a milestone for the company, but as a step towards broader AI accessibility and usability, with the end goal of one day seeing AI integrated into everyday applications and research.

Development and design

As mentioned above, at the core of Llama 3.1 is a blend of scale, innovation, and careful engineering. To achieve this, Meta followed a set of key philosophies in the design process, aimed at optimising the models for performance, reliability, and versatility:

  • Data enrichment and quality: The extensive dataset of 15 trillion multilingual tokens reflects a significant upgrade from previous versions. This dataset was not only larger but also more diverse, incorporating a wide range of languages and content types. This diversity was achieved through advanced data curation techniques, including de-duplication, filtering, and quality scoring, ensuring that the models were trained on the most relevant and high-quality data.
  • Model scaling and architecture choices: The decision to scale up to 405 billion parameters was driven by the need to enhance the model's ability to understand and generate language with higher fidelity. This scale was complemented by the use of a dense Transformer architecture, chosen for its proven stability and efficiency in handling large-scale language models. The focus on a dense Transformer, rather than more experimental architectures like mixture-of-experts, highlights the priority Meta places on training stability and scalability.
  • Advanced training techniques: The Llama 3.1 models were trained using an enormous compute budget of 3.8 × 10²⁵ FLOPs (a rough sanity check of this figure appears just after this list). This significant investment in computational resources was necessary to fully exploit the potential of the large dataset and to refine the models' capabilities. The training process included innovative methods such as annealing and fine-tuning, which were critical in optimising the models' performance on both general and specific tasks.
  • Focus on multimodality: In addition to textual data, the Llama 3.1 series has begun to explore multimodal capabilities, integrating elements of image, video, and speech processing. This multimodal approach is designed to extend the utility of the models beyond traditional language tasks, positioning them as versatile tools for a broader range of AI applications.
  • Simplicity and scalability in design: The choice of maintaining a relatively simple model architecture, with specific adaptations for scalability, reflects a strategic approach to managing the complexities inherent in training and deploying large-scale models. This simplicity ensures that the models can be reliably scaled and adapted, making them suitable for a wide range of applications and further research.
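As promised above, the compute figure can be checked with a quick back-of-the-envelope calculation using the common scaling-law approximation of roughly 6 FLOPs per parameter per training token. The approximation and the 15.6 trillion token count come from the scaling-law literature and the Llama 3 paper respectively, not from this article, so treat the sketch as illustrative:

```python
# Back-of-the-envelope estimate of Llama 3.1 405B's training compute,
# using the common ~6 FLOPs per parameter per training token rule of thumb.
params = 405e9             # flagship model parameters
tokens = 15.6e12           # pre-training tokens (Llama 3 paper figure)
flops_per_param_token = 6  # standard dense-Transformer approximation

total_flops = flops_per_param_token * params * tokens
print(f"{total_flops:.2e} FLOPs")  # -> 3.79e+25, close to the quoted 3.8 x 10^25
```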

Specs and performance

Llama 3.1 models offer a range of specifications designed to meet diverse needs across different applications. The series includes models with 8 billion, 70 billion, and 405 billion parameters, catering to varying computational capacities and performance requirements. This range allows for flexibility in deployment, from smaller, more accessible models to the highly powerful 405B version, which stands as one of the largest and most sophisticated models currently available.

The flagship 405B model features a dense Transformer architecture with 126 layers, a model dimension of 16,384, and 128 attention heads. This configuration is designed to optimise the balance between computational efficiency and performance, enabling the model to handle complex tasks with high accuracy. The use of 128,000 tokens in the context window allows for processing extensive sequences of text, which is particularly valuable for tasks requiring deep contextual understanding.
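Those architectural numbers can be sanity-checked against the 405-billion-parameter headline. The sketch below uses standard formulas for a Llama-style Transformer (SwiGLU feed-forward and grouped-query attention); the feed-forward width, KV-head count, and vocabulary size are taken from the Llama 3 paper rather than this article, so treat the breakdown as an estimate:

```python
# Approximate parameter count for the 405B model from its reported shape.
# Figures not quoted in this article (FFN width, KV heads, vocab size) come
# from the Llama 3 paper; this is an estimate, not an official breakdown.
d_model    = 16_384   # model dimension (reported above)
n_layers   = 126      # transformer layers (reported above)
n_heads    = 128      # attention heads (reported above)
n_kv_heads = 8        # grouped-query attention KV heads (paper figure)
d_ffn      = 53_248   # SwiGLU feed-forward width (paper figure)
vocab      = 128_256  # tokenizer vocabulary (paper figure)

head_dim = d_model // n_heads
# Attention: Q and output projections are d x d; K and V are shrunk by GQA.
attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
# SwiGLU feed-forward uses three weight matrices (gate, up, down).
ffn = 3 * d_model * d_ffn
# Input embeddings plus an untied output projection.
embeddings = 2 * vocab * d_model

total = n_layers * (attn + ffn) + embeddings
print(f"~{total / 1e9:.0f}B parameters")  # -> ~406B, close to the 405B headline
```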

In empirical evaluations, the Llama 3.1 models have demonstrated performance on par with leading models like GPT-4 across various benchmarks. This includes strong results in multilingual tasks, coding, reasoning, and tool usage. Notably, the models excel in areas that require a balance between helpfulness and harmlessness, a critical aspect for ensuring safe and ethical AI interactions.

The Llama 3.1 models have been fine-tuned to enhance specific capabilities, such as coding and reasoning. This fine-tuning process involved extensive human feedback and preference optimisation, ensuring that the models align closely with human expectations and preferences. Additionally, the models are equipped to handle long-context tasks, making them suitable for applications in document analysis, legal text processing, and other domains requiring the processing of large text bodies.

While primarily focused on language, the Llama 3.1 series also explores the integration of multimodal capabilities, such as image and video recognition. This integration is achieved through a compositional approach, where separate encoders for different modalities are combined with the language model. Although these multimodal features are still under development, initial results suggest competitive performance in tasks involving multiple types of data.

Multimodal capabilities

Llama 3.1 also integrates multimodal functionality, extending its capabilities beyond the text-based applications the series is traditionally associated with. These developments align with an industry-wide trend towards models that can understand and generate content across multiple data types, such as text, images, video, and speech.

Llama 3.1 now incorporates image, video, and speech processing through the use of separate encoders for each modality. These encoders are trained specifically for different types of data and work alongside the core language model. This integration is achieved using a compositional approach, where adapters align the outputs of modality-specific encoders with the language model. For example, a vision adapter helps integrate image data, while a speech adapter converts speech signals into a format the language model can process.
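As a concrete illustration of the compositional approach, the PyTorch sketch below shows a minimal projection-style vision adapter mapping encoder features into a language model's embedding space. The Llama 3 paper actually describes cross-attention-based adapters, and all module names and sizes here are hypothetical, so this is a simplified stand-in rather than Meta's implementation:

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Illustrative adapter projecting image features into LM embedding space.

    A frozen vision encoder produces patch features; the adapter learns to
    map them to the language model's hidden size so they can be combined
    with the text token embeddings. Sizes are hypothetical, not Meta's.
    """
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(patch_features)

# Usage: projected image "tokens" are concatenated with text embeddings
# before being fed to the language model.
adapter = VisionAdapter()
image_feats = torch.randn(1, 256, 1024)  # stand-in vision encoder output
text_embeds = torch.randn(1, 32, 4096)   # stand-in token embeddings
lm_input = torch.cat([adapter(image_feats), text_embeds], dim=1)
print(lm_input.shape)  # torch.Size([1, 288, 4096])
```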

Initial experiments suggest that Llama 3.1 performs competitively in tasks involving multimodal data, although these features are still in development. The potential applications are broad, including multimedia content creation, enhanced digital assistants, and educational tools. Furthermore, Meta has highlighted that these capabilities open new avenues in fields like healthcare, where AI could assist in interpreting medical imaging, or in autonomous systems requiring real-time data interpretation.

Post-training enhancements

The Llama 3.1 models have undergone significant post-training enhancements to refine their capabilities and ensure alignment with human preferences. This process included several key stages aimed at improving the models' performance and usability.

Supervised Fine-Tuning (SFT) was a crucial component, where the models were further trained using a curated dataset, including both human-annotated and synthetic data. This fine-tuning helped enhance the models' ability to follow instructions and respond accurately to a wide range of prompts.
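Concretely, SFT on instruction data usually means the standard next-token cross-entropy loss computed only over the response tokens, with the prompt masked out. The PyTorch sketch below shows that masking step; it is a generic illustration, not Meta's training code:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor,
             prompt_len: int) -> torch.Tensor:
    """Cross-entropy on response tokens only (prompt tokens are masked out).

    logits: (batch, seq_len, vocab) model outputs
    labels: (batch, seq_len) target token ids
    prompt_len: number of leading prompt tokens to exclude from the loss
    """
    labels = labels.clone()
    labels[:, :prompt_len] = -100        # ignored by cross_entropy below
    # Shift so each position predicts the next token.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Toy usage with random tensors standing in for a real model and batch.
logits = torch.randn(2, 16, 32000)
labels = torch.randint(0, 32000, (2, 16))
print(sft_loss(logits, labels, prompt_len=8))
```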

Another important method used was Direct Preference Optimisation (DPO), which aligns the models more closely with human feedback. This technique involved adjusting the models based on a reward system that prioritised responses favoured by human evaluators, thereby improving the quality and relevance of the outputs.
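DPO has a compact closed-form objective: it rewards the trainable policy for widening the log-probability margin of the preferred response over the rejected one, measured relative to a frozen reference model. A minimal, generic implementation (not Meta's code) might look like this:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimisation loss (Rafailov et al., 2023).

    Inputs are summed log-probabilities of whole responses under the
    trainable policy and a frozen reference model; beta controls how far
    the policy may drift from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage: the preferred response is slightly more likely under the policy
# than under the reference, the rejected response slightly less likely.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                torch.tensor([-11.0]), torch.tensor([-13.0]))
print(loss)  # small positive scalar
```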

Data quality was a central focus during the post-training phase. Rigorous data cleaning processes were employed to remove low-quality or problematic content, ensuring that the models were trained on high-quality inputs. This included filtering out overused phrases and ensuring a balanced representation of different response types.
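The sketch below gives a deliberately simplified flavour of such cleaning: exact de-duplication by content hash plus a crude filter for overused stock phrases. Meta's actual pipeline is far more sophisticated (including model-based quality scoring), and the phrases shown are illustrative assumptions:

```python
import hashlib

# Hypothetical examples of boilerplate phrasing a cleaning pass might target.
OVERUSED = ("I'm sorry, but", "As an AI language model")

def clean(examples: list[str]) -> list[str]:
    """Exact de-duplication plus a crude overused-phrase filter."""
    seen, kept = set(), []
    for text in examples:
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                    # exact duplicate: drop
            continue
        if any(p in text for p in OVERUSED):  # boilerplate response: drop
            continue
        seen.add(digest)
        kept.append(text)
    return kept

print(clean([
    "Paris is the capital of France.",
    "Paris is the capital of France.",         # duplicate: dropped
    "As an AI language model, I cannot say.",  # overused phrase: dropped
]))
```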

Overall, these post-training enhancements significantly improved the Llama 3.1 models' performance, making them more reliable and effective for a variety of tasks. The careful refinement of data and alignment techniques helped ensure that the models deliver outputs that are both accurate and aligned with user expectations, thereby enhancing their practical applicability in real-world scenarios.

Summary

Meta's Llama 3.1 series showcases advanced capabilities in language understanding, multilingual support, and multimodal processing. Built on a dense Transformer architecture, these models range from 8 billion to 405 billion parameters, catering to diverse computational needs.

The models excel in various tasks, including coding and reasoning, thanks to their extensive training on a diverse dataset and the application of advanced techniques like annealing and fine-tuning. The inclusion of multimodal features, such as image and speech processing, expands their utility beyond traditional language applications, opening new possibilities in fields like multimedia and healthcare.

Post-training enhancements, including Supervised Fine-Tuning and Direct Preference Optimisation, have refined the models, ensuring alignment with human preferences and high-quality outputs. By open-sourcing the Llama 3.1 models, Meta fosters innovation and collaboration in the AI community, advancing the integration of AI into everyday applications.
