Advancing AI with faster memory systems
Steven Woo of Rambus explains some of the challenges of closing the AI memory gap.
Over the last 15 years the use of artificial intelligence (AI) techniques has boomed. We now carry AI assistants in our pockets, trust algorithms to provide image recognition in autonomous vehicles, and use machine learning to predict the spread of disease. The development of these modern AI applications has only come about because of advances in AI algorithms, the availability of larger data sets for training, and the development of faster compute, storage and memory.
The training process is critical for this generation of AI applications, because the neural networks they are built on must learn their behaviour rather than have it explicitly programmed. Unlike traditional applications, neural networks learn by examining training data, extracting information about key features in order to complete a task. The more training data used, the more accurate the neural network’s decisions become. Both the training process and the subsequent inference process are incredibly data intensive, placing huge demands on compute resources and memory bandwidth.
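As a minimal illustration of the difference, consider the framework-free toy sketch below (the single-weight "model", the data set and the learning rate are purely hypothetical, not drawn from any real AI system): training makes repeated passes over the whole data set and updates weights as it goes, while inference is a single forward pass per input.

```python
# Toy sketch (hypothetical, framework-free): why training touches far more
# data than inference. Every epoch re-reads the entire training set and
# updates the weights; inference is one forward pass per input.

def forward(w, x):
    return w * x  # toy "network": a single multiply


def train(samples, epochs=50, lr=0.001):
    w = 0.0
    for _ in range(epochs):              # repeated passes over the data set
        for x, target in samples:        # every sample is read each epoch
            error = forward(w, x) - target
            w -= lr * error * x          # weight update: extra reads/writes
    return w


def infer(w, x):
    return forward(w, x)                 # single pass, no updates


if __name__ == "__main__":
    data = [(x, 2.0 * x) for x in range(1, 10)]   # learn y = 2x
    w = train(data)
    print(f"learned weight: {w:.3f}")             # approaches 2.0
    print(f"inference for x=3: {infer(w, 3.0):.3f}")
```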
The AI memory gap
To help meet modern neural network demands, processor architectures have evolved and are capable of ingesting and processing more data than ever before. We’ve rapidly reached a stage where today’s AI processors, including purpose-built chips and other systems based on graphics processing units (GPUs), are outpacing the abilities of current memory technology. Looking forward, the industry is being challenged to support the memory demands of next-generation AI applications. What we’re seeing today is a growing gap between the processing capabilities of AI-specific processors and the memory available to support them. Essentially, AI chips are so fast that they’re forced to wait for data from memory.
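A back-of-envelope sketch makes the gap concrete. The figures below are illustrative assumptions rather than numbers from any particular product: a hypothetical accelerator with 100 TFLOPS of peak compute fed by a 1 TB/s memory system must perform around 100 operations for every byte fetched just to stay busy, while a bandwidth-bound layer such as a large matrix-vector multiply performs only about one.

```python
# Back-of-envelope sketch of the "memory gap" (illustrative numbers, not from
# the article): how many operations a chip must perform per byte fetched from
# memory to stay busy, and how little of its peak it reaches when a workload
# falls short of that ratio.

peak_compute = 100e12   # hypothetical accelerator: 100 TFLOPS
memory_bw = 1e12        # hypothetical memory system: 1 TB/s

# Operations the chip can perform per byte delivered from memory
balance_point = peak_compute / memory_bw            # 100 ops per byte

# A bandwidth-bound layer, e.g. a large matrix-vector multiply, performs
# roughly 2 operations (multiply + add) for every 2 bytes of FP16 weights read.
workload_intensity = 2 / 2                          # ~1 op per byte

# Fraction of peak compute actually achievable on such a layer
utilisation = min(1.0, workload_intensity / balance_point)

print(f"Balance point: {balance_point:.0f} ops per byte fetched")
print(f"Achievable utilisation: {utilisation:.1%}")  # ~1.0% of peak
```

At roughly one per cent utilisation in this illustrative case, the accelerator spends almost all of its time waiting for data, which is exactly the memory gap described above.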
As a result, the heavy reliance of modern neural network applications on memory bandwidth is sparking interest from the AI industry in the highest-performance memory systems that can be built, for two important reasons. The first is that AI is becoming more complex: to continue providing more advanced AI capabilities with each new chip, the industry must create silicon that significantly improves performance and can run more complicated algorithms at scale. The second is that the volume of digital information being created and processed is growing far faster than virtually any other technology curve can keep up with.
AI and the evolving internet
Analysts estimate that there are at least 50 different AI processors being developed, all of which are designed to meet the needs of different computing environments across the evolving internet. These environments can be broadly categorised as the data centre, connected endpoints, and the edge – each with different performance and power-efficiency needs.
Data centres afford the luxuries of wall power and dedicated cooling systems to keep temperatures under control. These benefits allow chip and memory speeds to be pushed to their limits, enabling the highest levels of performance. For these reasons, the compute- and data-intensive task of training neural networks is usually carried out in data centres.
At the opposite end of the spectrum are connected endpoints such as mobile phones and tablets, which typically run on batteries, making power efficiency critical. AI applications on mobile devices often perform only inference, because of the heavy compute and power demands of training neural networks.
The roll-out of 5G networks is extending the possibility of computing at the edge, in base stations and on-premise locations outside of traditional cloud data centres. This will bring significant compute power geographically closer to connected endpoints. Edge computing is expected to cover a wide range of needs, with some processing resembling what happens on connected endpoints and other tasks resembling those handled in cloud data centres. Memory bandwidth will always be a high priority, while the importance of power efficiency will vary with how close the processing sits to the connected endpoints and the nature of the installation location.
Across each of these environments, however, a lack of memory bandwidth remains a bottleneck, acting as a critical limiter on AI performance. That leads to the question: how can we close the AI memory gap?
Memory interfaces
With Rambus’ 30-year history in high-performance, power-efficient memory systems, there is significant interest in the company’s HBM2 and GDDR6 memory interfaces for AI applications. Both of these memory technologies deliver high bandwidth and are commercially available, but they also introduce a set of complex trade-offs that require detailed evaluation and careful engineering to optimise overall system designs.
HBM memory systems offer system architects two key advantages over GDDR: the highest bandwidth per device, and the best power efficiency for high-performance memory systems. While these benefits are highly desired across the industry, the critical challenges to consider are the cost of implementing HBM memory systems and the increased design complexity.
HBM uses die stacking, which is more difficult to implement than traditional non-stacked memories like GDDR6, adding cost. At the system level, a silicon interposer is also needed to provide electrical connectivity between the AI processor and the HBM stack. This improves the communication path between the SoC and the memory, but adds further cost as well as design and manufacturing complexity. Extra care is also needed to manage thermals and long-term reliability. With HBM still relatively new, broad industry knowledge of best practices is still being developed, so teams developing AI systems with HBM should weigh the experience available within their organisation.
By contrast, GDDR6 memory systems follow the more familiar design practices of traditional DRAMs, and are compatible with high-volume packaging, PCB and test methodologies. The primary challenge in designing GDDR6 memory systems is reaching the required high data rates with good signal quality: SoC and board designers must tackle more demanding signal integrity and cooling constraints to achieve the desired data rates and acceptable thermals.
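To put rough numbers on the bandwidth side of this trade-off, the sketch below uses representative figures assumed from commonly quoted specifications rather than taken from this article: roughly a 1024-bit interface at 2 Gbps per pin for an HBM2 stack, versus a 32-bit interface at 16 Gbps per pin for a GDDR6 device.

```python
# Rough per-device bandwidth comparison (representative figures, assumed from
# commonly quoted specifications, not from the article): HBM2 trades a very
# wide, relatively slow interface on a silicon interposer against GDDR6's
# narrow, very fast interface on a standard PCB.

def bandwidth_gb_per_s(interface_bits, pin_rate_gbps):
    """Peak bandwidth in GB/s for a memory device or stack."""
    return interface_bits * pin_rate_gbps / 8

hbm2_stack = bandwidth_gb_per_s(1024, 2.0)    # 1024-bit stack at ~2 Gbps/pin
gddr6_device = bandwidth_gb_per_s(32, 16.0)   # 32-bit device at ~16 Gbps/pin

print(f"HBM2 stack:   {hbm2_stack:.0f} GB/s")    # ~256 GB/s
print(f"GDDR6 device: {gddr6_device:.0f} GB/s")  # ~64 GB/s
print(f"GDDR6 devices to match one HBM2 stack: "
      f"{hbm2_stack / gddr6_device:.0f}")        # ~4
```

On these assumptions, one HBM2 stack delivers roughly the bandwidth of four GDDR6 devices, which is why HBM leads on bandwidth per device while GDDR6 leads on design familiarity and cost.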
Looking towards the future
Despite the rapid progress to date, experts agree that AI is still in its infancy. By all accounts, future algorithms and their AI processors will continue to need ever-growing memory bandwidth in order to accelerate training on the increasing volumes of data required to fully deliver on the promise of AI. The AI memory gap has already reached a point where, without significant improvements to memory speed, AI processors will be severely hampered because they will frequently be waiting for data. This critical issue is in turn driving new developments and advances in memory systems.
Rambus is working on a range of memory technologies to close the gap between AI processors and their memory bandwidth needs, and to help designers struggling with trade-offs between power and performance. The growing demand for faster memory is shrinking the cycle time between research and commercial implementation. In the interim, design engineers should familiarise themselves with the details of the latest memory systems, so that they can confidently assess the trade-offs needed to minimise the memory gap within the constraints of their system’s primary requirements and their organisation’s engineering capabilities.