Coherent Accelerator Processor Interface

Year created	2014
Created by	IBM
Supersedes	PCI Express
Website	opencapi.org

Coherent Accelerator Processor Interface (CAPI) is a high-speed processor expansion bus standard for directly connecting CPUs to external accelerators such as GPUs, ASICs, FPGAs or fast storage; it was initially designed to be layered on top of PCI Express.[1][2] It offers low-latency, high-speed, direct memory access connectivity between devices of different instruction set architectures.

More details and documentation on CAPI can be found on the OpenCAPI Consortium website and IBM Portal for OpenPOWER.

History

The performance scaling traditionally associated with Moore's Law—dating back to 1965—began to taper off around 2004, as both Intel's Prescott architecture and IBM's Cell processor pushed toward a 4 GHz operating frequency. Here both projects ran into a thermal scaling wall, whereby heat extraction problems associated with further increases in operating frequency largely outweighed gains from shorter cycle times.

Over the decade that followed, few commercial CPU products exceeded 4 GHz. Most performance improvements instead came from incrementally improved microarchitectures, better systems integration, and higher compute density, largely in the form of packing larger numbers of independent cores onto the same die, often at the expense of peak operating frequency. For example, Intel's 24-core Xeon E7-8890 from June 2016 has a base operating frequency of just 2.2 GHz so that it can operate within a single-socket power consumption and cooling budget of 165 W.

Where large performance gains have been realized, they were often associated with increasingly specialized compute units, such as GPU units added to the processor die, or external GPU- or FPGA-based accelerators. In many applications, accelerators struggle with limitations of the interconnect's performance (bandwidth and latency) or with limitations due to the interconnect's architecture (such as lacking memory coherence). Especially in the datacenter, improving the interconnect became paramount in moving toward a heterogeneous architecture in which hardware becomes increasingly tailored to specific compute workloads.

CAPI was developed to enable computers to attach specialized accelerators more easily and efficiently. It was designed by IBM for use in its POWER8-based systems, which came to market in 2014. At the same time, IBM and several other companies founded the OpenPOWER Foundation to build an ecosystem around POWER-based technologies, including CAPI. In October 2016, several OpenPOWER partners formed the OpenCAPI Consortium together with GPU and CPU designer AMD and systems designers Dell EMC and Hewlett Packard Enterprise to spread the technology beyond the scope of OpenPOWER and IBM.[3]

Implementation

CAPI

CAPI is implemented as a functional unit inside the CPU, called the Coherent Accelerator Processor Proxy (CAPP), with a corresponding unit on the accelerator called the Power Service Layer (PSL). The CAPP and PSL units act like a cache directory so that the attached device and the CPU can share the same coherent memory space. The accelerator becomes an Accelerator Function Unit (AFU), a peer to other functional units integrated in the CPU.[4][5]

Because the CPU and AFU share the same memory space, the CPU does not have to translate addresses or shuffle data between its main memory and the accelerator's memory space, which allows low latency and high speeds. An application can use the accelerator without a device-specific driver, as everything is enabled by a general CAPI kernel extension in the host operating system. The CPU and PSL can read and write directly to each other's memories and registers, as demanded by the application.
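
The effect of a shared address space can be illustrated with a toy model. The following Python sketch is not CAPI code and does not touch real hardware: the function names `offload_with_copies` and `offload_shared` are hypothetical, and the byte-increment loop is only a stand-in for accelerator work. It merely counts how many bytes would have to be staged in each model.

```python
from typing import Tuple

# Toy model only: it counts bytes moved and does not talk to real hardware.


def offload_with_copies(data: bytearray) -> Tuple[bytearray, int]:
    """Conventional model: stage data into a separate device buffer and back."""
    bytes_copied = 0
    device_buffer = bytearray(data)           # host -> device copy
    bytes_copied += len(device_buffer)
    for i in range(len(device_buffer)):       # stand-in "accelerator" work
        device_buffer[i] = (device_buffer[i] + 1) % 256
    result = bytearray(device_buffer)         # device -> host copy
    bytes_copied += len(result)
    return result, bytes_copied


def offload_shared(data: bytearray) -> Tuple[bytearray, int]:
    """Coherent model: the accelerator works on the host buffer in place."""
    view = memoryview(data)                   # same memory, no copy made
    for i in range(len(view)):
        view[i] = (view[i] + 1) % 256
    return data, 0


if __name__ == "__main__":
    payload = bytearray(range(8))
    copied, n_copy = offload_with_copies(bytearray(payload))
    shared, n_shared = offload_shared(bytearray(payload))
    print(copied == shared, n_copy, n_shared)  # prints: True 16 0
```

Both functions produce the same result; the difference is that the copy-based variant moves every byte twice, while the shared variant moves none. Real systems add address translation and coherence traffic that this sketch deliberately leaves out.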

CAPI 1

CAPI is layered on top of PCIe Gen 3, using 16 PCIe lanes, and is an additional function of the PCIe slots on CAPI-enabled systems. Such machines usually have designated CAPI-enabled PCIe slots. Since there is only one CAPP per POWER8 processor, the number of possible CAPI units is determined by the number of POWER8 processors, regardless of how many PCIe slots there are. In certain POWER8 systems, IBM makes use of dual-chip modules, thus doubling the CAPI capacity per processor socket.

A traditional transaction between a PCIe device and a CPU can take around 20,000 operations, whereas a CAPI-attached device will use only around 500. This significantly reduces latency and effectively increases bandwidth because of the lower per-transaction overhead.[5]

The total bandwidth of a CAPI port is determined by the underlying PCIe 3.0 x16 technology, peaking at approximately 16 GB/s, bidirectional.[6]

CAPI 2

CAPI 2 is an incremental evolution of the technology, introduced with the IBM POWER9 processor.[6] It runs on top of PCIe Gen 4, which effectively doubles the bandwidth to 32 GB/s. It also introduces new features such as support for DMA and atomic operations from the accelerator.
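
The approximate 16 GB/s and 32 GB/s figures follow from the parameters of the underlying PCIe links. As a rough check, the Python sketch below computes the usable bandwidth of an x16 link from the per-lane transfer rate (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and the 128b/130b line encoding that both generations use. The function name `pcie_gbytes_per_s` is illustrative, and the result is the per-direction figure before protocol overhead, which is an assumption about how the quoted numbers are counted.

```python
def pcie_gbytes_per_s(transfer_rate_gt_s: float, lanes: int = 16) -> float:
    """Usable bandwidth per direction, in GB/s, of a 128b/130b-encoded PCIe link."""
    encoding_efficiency = 128 / 130  # PCIe 3.0 and 4.0 both use 128b/130b
    bits_per_second = transfer_rate_gt_s * 1e9 * lanes * encoding_efficiency
    return bits_per_second / 8 / 1e9


gen3 = pcie_gbytes_per_s(8.0)    # PCIe 3.0: 8 GT/s per lane
gen4 = pcie_gbytes_per_s(16.0)   # PCIe 4.0: 16 GT/s per lane

print(f"{gen3:.2f} {gen4:.2f}")  # prints: 15.75 31.51
```

The doubling from Gen 3 to Gen 4 comes entirely from the per-lane transfer rate; the lane count and encoding stay the same.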

OpenCAPI

The technology behind OpenCAPI is governed by the OpenCAPI Consortium, founded in October 2016 by AMD, Google, IBM, Mellanox and Micron together with partners Nvidia, Hewlett Packard Enterprise, Dell EMC and Xilinx.[7]

OpenCAPI 3

OpenCAPI, formerly New CAPI or CAPI 3.0, is not layered on top of PCIe and will therefore not use PCIe slots. In IBM's POWER9 CPU it will use the Bluelink 25G I/O facility that it shares with NVLink 2.0, peaking at 50 GB/s.[8] OpenCAPI does not need the PSL unit (required for CAPI 1 and 2) in the accelerator, because it uses its own transaction protocol rather than PCIe.[9]
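
One way to relate the "25G" lane rate to the 50 GB/s peak is shown in the Python sketch below. The lane count of 8 is an assumption that is not stated in this article, and line-encoding overhead is ignored; under those assumptions the 50 GB/s figure corresponds to both directions of the link combined. The names used are illustrative only.

```python
LANE_RATE_GBIT_S = 25   # "25G" signalling rate per lane, as named in the text
LANES = 8               # ASSUMPTION: link width, not stated in the article


def raw_gbytes_per_s(lane_rate_gbit_s: float, lanes: int) -> float:
    """Raw bandwidth in one direction, in GB/s, ignoring encoding overhead."""
    return lane_rate_gbit_s * lanes / 8


per_direction = raw_gbytes_per_s(LANE_RATE_GBIT_S, LANES)
both_directions = 2 * per_direction

print(per_direction, both_directions)  # prints: 25.0 50.0
```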

OpenCAPI 4

Planned for a future chip after the general availability of POWER9.[10]

Serial Attached Memory

Microchip Technology's SMC 1000 OpenCAPI memory is described as "the next progression in the market adopting serial attached memory."[11]

References

[1] Agam Shah (17 December 2014). "IBM's new Power8 doubles performance of Watson chip". PC World. Retrieved 17 December 2014.
[2] "IBM Power8 Processor Detailed - Features 22nm Design With 12 Cores, 96 MB eDRAM L3 Cache and 4 GHz Clock Speed". WCCFtech. Retrieved 17 December 2014.
[3] "OpenCAPI Unveiled: AMD, IBM, Google, Xilinx, Micron and Mellanox Join Forces in the Heterogenous Computing Era".
[4] "Coherent Accelerator Processor Interface (CAPI) for POWER8 Systems". White paper.
[5] "Reconfigurable Accelerators for Big Data and Cloud". RAW 2016.
[6] "Opening Up The Server Bus For Coherent Acceleration".
[7] "Tech Leaders Unite to Enable New Cloud Datacenter Server Designs for Big Data, Machine Learning, Analytics, and other Emerging Workloads".
[8] "Big Blue Aims For The Sky With Power9".
[9] "OpenCAPI Takes on PCIe, Vows 10X Improvement".
[10] Stuecheli, Jeff (26 January 2017). "Webinar POWER9" (video recording and slides). AIX Virtual User Group.
[11] Patrick Kennedy (5 August 2019). "Microchip SMC 1000 For The Serial Attached Memory Future". ServeTheHome.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.