Opening Up To Hardware Independence
OpenCL offers the capability to accelerate compute-intensive algorithms, completely independent of the hardware. This ES Design magazine article, by Wolfgang Eisenbarth, Vice President of Embedded Computer Technology, MSC Vertriebs GmbH, and Philipp Zieboll, Field Applications Engineer, Embedded Computer Technology, MSC Vertriebs GmbH, explores further.
OpenCL is an open and royalty-free programming standard for general-purpose computing on heterogeneous systems. The OpenCL standard was developed by software specialists from leading industrial concerns, who then submitted a draft to the Khronos Group for standardisation.
The Khronos Group, founded in January 2000, is a non-profit, member-funded consortium focused on the creation of royalty-free open standards for parallel computing, graphics and dynamic media for a wide variety of platforms and devices. AMD, Intel, NVIDIA, SGI, Google and Oracle are just a few of its more than 100 members. Today, OpenCL is maintained and further developed by Khronos. The OpenCL specification is now available in versions 1.1 and 1.2 (www.khronos.org/opencl/).
Standardisation
The goal of OpenCL is to provide a standardised programming interface for efficient and portable programs (Figure 1). Users thus get what they have long been asking for: a vendor-independent, non-proprietary solution for accelerating their applications on their chosen multi-core CPUs, APUs and GPUs.
Figure 1: OpenCL is an open, royalty-free standard for programming of heterogeneous systems (Source: Khronos Group)
The OpenCL specification consists of the language specification as well as Application Programming Interfaces (APIs) for the platform layer and the runtime. The language specification describes the syntax and the programming interface for writing compute kernels, which can be executed on multi-core CPUs or GPUs. A compute kernel is the basic unit of executable code. The kernel language is based on a subset of ISO C99, a language standard that is familiar to many developers.
OpenCL’s platform model consists of a host, which establishes the connection to one or more OpenCL devices. Host and device are logically separated from each other, which preserves portability. The platform layer API gives the host program access to routines that query the number and types of devices in the system. The developer can then select and initialise the desired compute devices to execute the tasks. Compute contexts, as well as queues for job submission and data transfer requests, are created in this layer. The runtime API makes it possible to queue up compute kernels for execution. It is also responsible for managing the computing and memory resources in the OpenCL system.
Compute Kernels
The execution model describes the types of compute kernels. Since OpenCL is designed for multi-core CPUs and GPUs, compute kernels can be written either as data-parallel, which suits the architecture of GPUs, or as task-parallel, which is a better match for the architecture of CPUs. When the host program submits a kernel for execution on an OpenCL device, an index space is defined. An instance of the kernel executes for each point in this index space. Each element in the execution domain is a work-item, and OpenCL allows work-items to be grouped into work-groups for synchronisation and communication purposes.
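The relationship between index space, work-items and work-groups can be illustrated with a small sketch. It is written in plain Python and only emulates a one-dimensional launch on the host; it does not use the OpenCL API, and the names run_kernel and vector_add are hypothetical. On a real device the work-items would run in parallel rather than in a loop.

```python
def run_kernel(kernel, global_size, local_size, *args):
    """Emulate a 1-D launch: call `kernel` once per point of the index space."""
    # OpenCL 1.x expects the global size to be a multiple of the work-group size.
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global_size must be a positive multiple of local_size")
    for global_id in range(global_size):
        group_id = global_id // local_size  # which work-group the item belongs to
        local_id = global_id % local_size   # position of the item inside its group
        kernel(global_id, group_id, local_id, *args)


def vector_add(global_id, group_id, local_id, a, b, out):
    """Kernel body (hypothetical example): one work-item handles one element."""
    out[global_id] = a[global_id] + b[global_id]


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
out = [0] * len(a)

# Eight work-items, organised as two work-groups of four.
run_kernel(vector_add, len(a), 4, a, b, out)
print(out)  # [11, 22, 33, 44, 55, 66, 77, 88]
```

The data-parallel style is visible in vector_add: the same code runs for every index, and only the index decides which data element is processed.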
OpenCL defines a multi-level memory model consisting of four memory spaces: Private Memory, which is visible only to an individual work-item; Local Memory, which is shared by the work-items of one work-group; Constant Memory, which is read-only while a kernel executes; and Global Memory, which can be used by all compute units in the device.
Depending on the actual memory subsystem, different memory spaces can be merged together. Figure 2 shows the memory hierarchy defined by OpenCL. The host processor is responsible for allocating and initialising the memory objects that reside in global and constant memory. The memory model is also based on the separation of host and device.
Figure 2: Overview of the memory hierarchy defined by OpenCL (Source: AMD)
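How the memory spaces interact can be sketched with a per-work-group sum. As before, this is a plain-Python emulation under simplifying assumptions, not OpenCL code: the function group_sums and its parameters are hypothetical, Python lists stand in for device buffers, and the work-items of a group run one after another instead of in parallel.

```python
def group_sums(global_in, local_size, scale=1):
    """Emulate a per-work-group sum over a buffer held in global memory.

    `scale` stands in for a value in constant memory: the emulated
    kernel only reads it and never writes to it.
    """
    if local_size <= 0 or len(global_in) % local_size != 0:
        raise ValueError("buffer length must be a positive multiple of local_size")
    num_groups = len(global_in) // local_size
    global_out = [0] * num_groups  # result buffer, also in global memory

    for group_id in range(num_groups):
        # Local memory: allocated per work-group, invisible to other groups.
        local_mem = [0] * local_size

        # Phase 1: each work-item loads one element into a private variable
        # and publishes it to the local memory of its work-group.
        for local_id in range(local_size):
            private_value = global_in[group_id * local_size + local_id] * scale
            local_mem[local_id] = private_value

        # A real kernel would synchronise the group at this point,
        # for example with barrier(CLK_LOCAL_MEM_FENCE).

        # Phase 2: one work-item combines the values and writes the
        # group result back to global memory.
        global_out[group_id] = sum(local_mem)

    return global_out


print(group_sums([1, 2, 3, 4, 5, 6, 7, 8], local_size=4))  # [10, 26]
```

Staging data in local memory pays off on devices where it is much faster than global memory; on devices that merge the memory spaces, the same code remains valid but the benefit may be smaller.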
Thanks to the hardware-independence and easy portability of OpenCL, companies can reuse their significant investment in source code, hence greatly reducing the development time for today’s complex image processing systems.
COM Support
Further optimisation of the design cycle is possible by making use of standard PC building blocks such as a high-performance processor module. Such a Computer-On-Module (COM) can be easily mounted onto a baseboard via a standardised connector, with the baseboard implementing the application-specific functions. Computer-On-Modules are available in a range of different versions offering scalable processor power and a choice of interfaces. This module-based technology thus provides a simple upgrade path for higher performance. Because the modules all meet defined standard specifications regarding form factor and connectivity, they are easily interchangeable with products from different vendors.
The MSC C6C-A7 module family supports OpenCL and is implemented using the well-established COM Express form factor. The new Type 6 pin-out brings two significant improvements over the predecessor Type 2 pin-out: it can support up to three independent Digital Display Interfaces (DDIs) and it adds support for USB 3.0. This embedded platform in a compact form factor (95x95mm) is based on AMD’s Embedded R-Series Accelerated Processing Units (APUs) and features very powerful graphics and excellent parallel computing performance with low power dissipation.
The quad-core module versions integrate the AMD R-460L 2.0GHz (2.8GHz Turbo) or the AMD R-452L 1.6GHz (2.4GHz Turbo) processor. The thermal design power (TDP) levels are 25W and 19W, respectively. The two dual-core module versions can be populated with the AMD R-260H 2.1GHz (2.6GHz Turbo) processor or the AMD R-252F 1.7GHz (2.3GHz Turbo) processor, each featuring 17W TDP. All processors support the AMD64 technology and the AMD-V virtualisation technology. The AMD Fusion Controller Hub (FCH) A75 chipset is used in combination with all CPU versions. The main memory can be expanded to 16Gbyte DDR3-1600 dual-channel SDRAM via two SO-DIMM sockets.
Figure 3: The MSC C6C-A7 module family is based on AMD’s Embedded R-Series Accelerated Processing Units (APUs) and supports OpenCL
The Radeon HD7000G-Series graphics engine integrated into the AMD R-Series APU, with its excellent graphics capabilities, offers support for OpenCL 1.1, OpenGL 4.2 and DirectX 11. The modules support up to four independent displays for imaging applications. HDMI, MPEG-2 decoding, H.264 and VCE (video compression engine) support is also included.
The MSC C6C-A7 COM Express module family offers six PCI Express x1 channels and a PCI Express graphics (PEG) x8 interface. In addition, all modules feature four USB 3.0 and four USB 2.0 ports, LPC, Gbit Ethernet, HD audio and four SATA interfaces at up to 300Mbyte/s. Featuring DisplayPort 1.2 and HDMI interfaces (3x digital display interface) supporting resolutions up to 4096 x 2160 (DP) and 1920 x 1200 (HDMI), along with LCD and VGA interfaces, the MSC C6C-A7 modules offer comprehensive display support.
The platform can run the Microsoft Windows Embedded Standard 7 operating system, as well as Linux. The AMI-based BIOS includes UEFI support. In addition to the Computer-On-Modules, MSC offers Starter Kits and suitable carrier boards, as well as cooling solutions and memory modules.
Thanks to its powerful computing and graphics capabilities, the platform is especially suited for demanding applications where 3D graphics, high-definition video or the control of large displays is required. Typically, such applications can be found in the fields of medical technology, infotainment, digital signage and gaming.