Predictable performance key to successful OpenStack telco deployments
After a few years of waiting, the telecom industry seems to have finally accepted that OpenStack is viable for deployments in networks that require high levels of service availability, performance and security. The key is selecting the right OpenStack platform: you need one that’s been properly hardened, packaged and stress-tested to ensure it meets the stringent requirements of service providers.
Guest blog by Charlie Ashton.
Telco clouds based on Wind River’s Titanium Cloud architecture have been proven to address the critical challenges presented by deployments in a wide range of core, edge and access use cases.
If you’d like to learn more about how the Titanium Core and Titanium Edge virtualisation platforms solve some key problems for virtual CPE (vCPE), the white paper ‘Overcoming OpenStack Obstacles in vCPE’ provides an in-depth explanation.
For more on the vCPE topic, you might also be interested in watching the recording of the webcast with Peter Willis from BT, in which Peter discusses specific OpenStack problems identified by BT and reviews how Wind River’s products have solved them.
Now let’s take a look at another important OpenStack topic: how to guarantee predictable performance for applications such as telecom networks, whose performance requirements are far more challenging than those of the enterprise applications for which OpenStack was originally designed.
At the OpenStack Summit in Boston this month, two of Wind River’s leading OpenStack experts will be presenting a session titled ‘Behind the Curtains: Predictable Performance with Nova’. Ian Jolliffe and Chris Friesen will provide detailed insights not only into the techniques that enable predictable performance to be achieved, but also into how to troubleshoot problems when something goes wrong. If you’re going to be at the OpenStack Summit you’ll want to check out this session; if not, you can always watch the recording later.
There are five fundamental problem areas that must be addressed in order to develop an OpenStack implementation that delivers truly predictable performance: CPU, memory, PCI bus, networking and storage. (At this Summit they’ll only cover the first four; you’ll have to join us at a subsequent event to learn about storage.)
The primary CPU-related challenges are all about contention. Multiple guests contending for the same resources can result in unpredictable performance for all, so various techniques must be employed to prevent this situation. These include reducing or disabling CPU overcommit, setting the correct CPU thread policy and using dedicated CPUs for each guest.
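As a rough illustration of what these settings look like in practice, here is a minimal sketch assuming openstacksdk’s cloud-layer create_flavor and set_flavor_specs helpers; the cloud name, flavor name and sizes are placeholders, and the hw:* keys are the standard Nova extra specs for CPU pinning.

```python
# Illustrative sketch only: a flavor whose guests get dedicated host CPUs.
# Assumes openstacksdk and a clouds.yaml entry named "telco" (hypothetical).
import openstack

conn = openstack.connect(cloud="telco")

flavor = conn.create_flavor(name="vnf.pinned", ram=8192, vcpus=4, disk=20)

# Standard Nova extra specs for predictable CPU behaviour:
#   hw:cpu_policy=dedicated      -> each vCPU is pinned to its own host pCPU
#   hw:cpu_thread_policy=isolate -> the sibling hyperthread is left idle, so
#                                   no other guest shares the physical core
conn.set_flavor_specs(flavor.id, {
    "hw:cpu_policy": "dedicated",
    "hw:cpu_thread_policy": "isolate",
})
```

Overcommit itself is a host-side setting: lowering cpu_allocation_ratio in nova.conf (to 1.0 to disable overcommit entirely) keeps the scheduler from packing more vCPUs onto a host than it has physical cores.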
CPU cache contention can be avoided by the use of an appropriate CPU thread policy and also by leveraging Intel’s Cache Allocation Technology. CPU contention between host and guest(s) should be eliminated by avoiding System Management Interrupts (SMIs) as well as by controlling CPU ‘stealing’ by host processes and kernel threads.
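Cache Allocation Technology itself is configured on the host rather than through OpenStack. The hedged sketch below uses the Linux resctrl interface to reserve a slice of L3 cache for a single process; the group name, cache-way masks and PID are placeholders, real masks depend on the CPU’s cache geometry, and the code assumes a CAT-capable processor with resctrl already mounted.

```python
# Hypothetical host-side sketch: confine one process to a reserved slice of
# L3 cache via the kernel's resctrl filesystem (requires root, a CAT-capable
# CPU and resctrl mounted at /sys/fs/resctrl).
import os

GROUP = "/sys/fs/resctrl/vswitch"          # placeholder group name
os.makedirs(GROUP, exist_ok=True)

# Reserve cache ways 0-3 on both sockets for this group (masks are
# illustrative; valid values depend on how many ways the L3 cache has).
with open(os.path.join(GROUP, "schemata"), "w") as f:
    f.write("L3:0=000f;1=000f\n")

# Move the target process (placeholder PID) into the group so its cache
# allocations are confined to the reserved ways.
with open(os.path.join(GROUP, "tasks"), "w") as f:
    f.write("12345")
```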
There are three memory-related areas that need to be addressed to avoid performance problems. Memory contention can be avoided by reducing overcommit, by turning it off altogether or by enabling hugepages. Memory bandwidth contention can be more complicated: it’s important to ensure that each guest NUMA node is mapped to a unique host NUMA node, while also distributing Virtual Machines (VMs) across NUMA nodes to spread memory accesses. Finally, contention for the host Translation Lookaside Buffer (TLB), the cache of virtual-to-physical page mappings, can be minimised through the use of hugepages as well as dedicated CPUs.
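For example, the standard Nova hw:* extra specs below back guest memory with 1 GB hugepages and split the guest into two NUMA nodes that the scheduler maps onto distinct host nodes; the flavor name and sizes are illustrative, and the equivalent CLI is printed rather than executed.

```python
# Illustrative only: memory-related extra specs for a hypothetical flavor.
numa_specs = {
    "hw:mem_page_size": "1GB",   # back guest RAM with 1 GB hugepages
    "hw:numa_nodes": "2",        # two guest NUMA nodes, each mapped to its
                                 # own host NUMA node by the scheduler
    "hw:numa_cpus.0": "0,1",     # vCPUs 0-1 live on guest node 0
    "hw:numa_cpus.1": "2,3",     # vCPUs 2-3 live on guest node 1
    "hw:numa_mem.0": "4096",     # MB of guest RAM per NUMA node
    "hw:numa_mem.1": "4096",
}

# Print the equivalent CLI rather than calling the API directly.
props = " ".join(f"--property {k}={v}" for k, v in numa_specs.items())
print(f"openstack flavor set vnf.numa {props}")
```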
PCI bus contention is addressed by recognising that each PCI bus is attached to a single host NUMA node and ensuring that instances using PCI devices are affined to the same NUMA node as the device itself.
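Nova exposes this as a flavor-level policy; a minimal hedged example is shown below, with a placeholder flavor name (support for this extra spec depends on the OpenStack release in use).

```python
# Illustrative: keep a requested PCI device and the instance on the same host
# NUMA node. 'required' rejects placements that would cross nodes, while
# 'preferred' allows them only as a fallback.
print("openstack flavor set vnf.pci "
      "--property hw:pci_numa_affinity_policy=required")
```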
Networking must also be considered. It’s important to avoid cross-NUMA traffic, so VMs should be located on the same NUMA node as the physical NIC they’re connected to, while virtual switch (vSwitch) processes should be configured with awareness of both the NUMA node where their physical NICs terminate on the PCI bus and the NUMA placement of the instances they serve.
PCI passthrough and SR-IOV performance will also suffer when traffic crosses NUMA nodes. To avoid network bandwidth contention, all instances connecting to the same host network should use the same host NIC, which should be a 10G NIC or better.
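As a hedged sketch of the host-side plumbing for SR-IOV and PCI passthrough, the fragment below shows the [pci] whitelist and alias entries in nova.conf plus the flavor property that requests a device by alias; the vendor/product IDs, physical network name and alias are placeholders.

```python
# Illustrative nova.conf fragment for a compute node: expose a NIC's virtual
# functions to guests. All IDs and names below are placeholders.
NOVA_PCI_CONF = """
[pci]
# Which host devices may be passed through (example VF vendor/product IDs)
passthrough_whitelist = {"vendor_id": "8086", "product_id": "154c", "physical_network": "physnet0"}
# A name that flavors can use to request one of those devices
alias = {"vendor_id": "8086", "product_id": "154c", "device_type": "type-VF", "name": "sriov-nic"}
"""

# A flavor can then request one device by alias (printed CLI, not executed):
print('openstack flavor set vnf.sriov '
      '--property "pci_passthrough:alias"="sriov-nic:1"')
```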
There are several approaches that should be considered in order to address limitations in host-guest network bandwidth. Emulated NICs are slow and para-virtualised NICs are faster, as long as the guest supports them. PCI PassThrough and SR-IOV are faster still because they ensure that the PCI device is ‘passed through’ into the guest, though they require the presence of a suitable device driver in the guest, still have the overhead of virtual interrupts and can be challenging to configure initially.
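In OpenStack terms, the choice of guest NIC is largely driven by the Neutron port’s vnic type. The sketch below simply prints illustrative commands for the three common options; the network and port names are placeholders.

```python
# Illustrative mapping from Neutron vnic types to the NIC a guest sees.
vnic_types = {
    "normal":          "para-virtualised virtio NIC via the vSwitch",
    "direct":          "SR-IOV virtual function passed into the guest",
    "direct-physical": "entire physical function (PCI passthrough)",
}

for vnic, meaning in vnic_types.items():
    print(f"# {meaning}")
    print(f"openstack port create --network provider0 --vnic-type {vnic} port-{vnic}")
```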
The fastest guest will be one that’s based on DPDK and leverages a Poll Mode Driver (PMD), though this does consume more power because the guest is running a tight loop.
Lastly, Enhanced Platform Awareness (EPA) is critical in ensuring predictable performance for the applications themselves. EPA enables optimised VM placement by making the scheduler aware of NUMA topology, memory requirements, NIC capabilities, acceleration engines and hyperthreading.
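Much of this awareness comes from the scheduler filters enabled on the controller. The fragment below lists the standard Nova filters most relevant to the techniques described above; the exact option name and section vary by OpenStack release, so treat it as a sketch rather than a drop-in configuration.

```python
# Illustrative scheduler filter list for nova.conf; the filter names are
# standard, but the option that holds them varies by release.
SCHEDULER_FILTERS = [
    "ComputeFilter",
    "NUMATopologyFilter",                 # honours hw:numa_* and hugepage requests
    "PciPassthroughFilter",               # places instances near requested PCI devices
    "AggregateInstanceExtraSpecsFilter",  # steers flavors to capable host aggregates
]
print("enabled_filters = " + ",".join(SCHEDULER_FILTERS))
```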
By implementing all these techniques and more, the Titanium Cloud architecture has been proven to deliver the level of predictable performance required for telco networks, which is one of the reasons why it’s being widely adopted as service providers transition to network virtualisation.
As a leading contributor to the OpenStack community and with a major focus on solving telco-related problems in Nova, Wind River is pleased to have the opportunity to share more details of these performance techniques with the rest of the community.
Join Ian and Chris for their upcoming OpenStack Summit session; if you can’t make it, there is always the recording to watch afterwards.
Courtesy of Wind River.