Stephen Smith's Blog

Musings on Machine Learning…

Challenges for Many Core Processors

Introduction

CPU designers and manufacturers such as Intel, AMD and ARM are relying on adding more and more CPU cores to each chip they manufacture. Each CPU core is in itself a complete CPU that can execute programs independently. AMD has 64-core Threadripper CPUs, there are now 128-core ARM CPUs, and Intel goes as high as 18 cores. In this article we’ll look at some of the challenges of getting the full benefit from all these cores.

Memory Bandwidth

Good DDR4 memory runs at 3.6GHz these days (DDR4-3600, strictly 3.6 billion transfers per second). Let’s consider processor cores running at 2GHz. An ARM CPU core executes on average one instruction per clock cycle, so at 2GHz it can execute 2 billion instructions per second. On an ARM processor each instruction is 32 bits wide, so with a 64-bit memory bus, 3.6GHz memory can deliver 7.2 billion instructions per second. Note that some lower end systems have only a 32-bit memory bus and hence half this performance. This is plenty for one core, but it can keep only three such cores busy, and this counts only instruction fetches, not data.
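
As a sanity check, here is that arithmetic as a minimal C sketch (the 2GHz single-issue core and the single 64-bit DDR4-3600 channel are the assumptions used above):

    #include <stdio.h>

    int main(void)
    {
        /* Assumptions from the text: DDR4-3600 on a 64-bit bus,
           2GHz cores executing one 32-bit instruction per cycle. */
        double transfers_per_sec = 3.6e9; /* DDR4-3600: 3.6 GT/s       */
        double bus_bytes         = 8.0;   /* 64-bit memory bus         */
        double bandwidth = transfers_per_sec * bus_bytes; /* 28.8 GB/s */

        double instr_bytes = 4.0;   /* 32-bit ARM instruction          */
        double core_rate   = 2.0e9; /* 2GHz, one instruction per cycle */
        double core_demand = core_rate * instr_bytes; /* 8 GB/s/core   */

        printf("cores fed by one channel: %.1f\n", bandwidth / core_demand);
        return 0;
    }

It prints 3.6, which is where the three-core figure above comes from.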

All these processors are 64-bit, with 64-bit registers. Whenever they load or store a memory address, that requires moving a 64-bit quantity across the memory bus. The CPU can perform arithmetic on various sized quantities, whether 8, 16, 32 or 64 bits, and all of these must be moved to and from memory as well. Further, all these processors have floating point coprocessors and some sort of SIMD unit, whether Intel’s AVX or ARM’s NEON. These can operate in parallel with the integer processing unit, if you can keep everything busy. All of this makes loading and storing data a huge bottleneck.

If you have 128 cores, each running at 2GHz, you need your memory running at 128GHz to keep all the cores busy just fetching code. Obviously this is impossible, so what do system designers do to make more than two or three cores useful?

  1. Each CPU core has a local cache of a few megabytes of data. This data can be accessed immediately and once loaded doesn’t require much interaction with the memory controller. If the cores can keep their working set within the cache then they are very efficient.
  2. The manufacturer of the System on a Chip, or the PC motherboard designer, can include more than one memory bus; many server systems have five or six independent channels to main memory.
  3. Programming discipline. When writing C or Assembly Language code, it is tempting to use 64-bit quantities everywhere; after all, these are 64-bit processors that can perform 64-bit arithmetic in a single instruction cycle. The problem is memory bandwidth: if you keep your integers to smaller sizes, you reduce the contention on the memory bus. This is why both Intel and ARM keep instructions for smaller-width arithmetic in their instruction sets, as the sketch after this list illustrates. Good optimizing compilers are excellent at keeping data in registers and minimizing stores of intermediate results to memory, so make sure those optimizations are turned on, except while debugging.
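
Here is a minimal sketch of point 3 (hypothetical function names; both loops do the same logical work):

    #include <stdint.h>
    #include <stddef.h>

    uint64_t sum_wide(const uint64_t *a, size_t n)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += a[i];      /* 8 bytes loaded per element */
        return total;
    }

    uint64_t sum_narrow(const uint16_t *a, size_t n)
    {
        uint64_t total = 0;     /* arithmetic still uses a 64-bit register */
        for (size_t i = 0; i < n; i++)
            total += a[i];      /* only 2 bytes loaded per element */
        return total;
    }

If the values fit in 16 bits, the second version produces the same result while putting a quarter of the load on the memory bus.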

Cache Consistency Protocol

Having a large CPU cache is touted as the solution to memory bandwidth problems. However, caches introduce their own bottlenecks. When you write a value to memory and that value is in the cache, the individual CPU core updates its local cached value, but the cache isn’t necessarily written through to main memory right away. That is fine for an individual core, but the rule is that all cores have to see the same view of memory: you can’t have different CPU cores seeing different values at a given memory address.

There are quite a few different memory cache architectures as well as different protocols for maintaining cache consistency across all cores. A typical way of maintaining cache consistency is as follows:

  1. One core writes a new value to a memory address contained in its cache.
  2. The cache controller now checks to see if any other core has that memory address in its cache. If it isn’t in any other cache, then the write is complete and the core continues processing.
  3. If the value is in another core’s cache, then the CPU must first write the value to main memory and then notify the other affected cores to invalidate that memory address in their caches, so the next time it is read, it is read from main memory.
  4. Now the CPU core continues processing.
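
This invalidation traffic is easy to trigger by accident. Below is a minimal sketch of the classic “false sharing” problem (a hypothetical microbenchmark, assuming Linux, pthreads and 64-byte cache lines): with the alignas(64) removed, the two counters share one cache line and the two cores invalidate each other’s cached copy on every write, exactly as in the protocol above.

    #include <pthread.h>
    #include <stdalign.h>
    #include <stdint.h>
    #include <stdio.h>

    static struct {
        alignas(64) volatile uint64_t a; /* one counter per cache line;  */
        alignas(64) volatile uint64_t b; /* remove alignas for slow case */
    } counters;

    static void *bump_a(void *unused)
    {
        (void)unused;
        for (int i = 0; i < 100000000; i++) counters.a++;
        return NULL;
    }

    static void *bump_b(void *unused)
    {
        (void)unused;
        for (int i = 0; i < 100000000; i++) counters.b++;
        return NULL;
    }

    int main(void) /* build with: cc -pthread */
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%llu b=%llu\n", (unsigned long long)counters.a,
                                  (unsigned long long)counters.b);
        return 0;
    }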

The advent of security vulnerabilities like Meltdown and Spectre, which exploit caching and speculative execution to leak data across CPU cores, has greatly reduced the performance of some of these schemes. Sometimes a new security problem requires a lot of the cache mechanisms to be disabled, badly affecting performance.

At some point cache contention becomes a problem and the circuitry to handle this is expensive.

Controlling Heat

Packing 128 CPU cores, each with its floating point unit and SIMD processor, onto a single chip means a lot of circuitry, and every active element on the chip generates heat that has to be dissipated. Chips control heat by slowing down when they get too hot or by shutting down some of the CPU cores. Having 128 cores doesn’t help you if half of them are shut down to cool off, or if they are all running at quarter speed. One of the bottlenecks in the Raspberry Pi is that if you keep all four CPU cores busy, the system overheats and slows down. Heat is the big enemy of modern CPU design and an important reason why ARM has been so successful, but even though ARM does better than Intel or AMD, it still runs into heat dissipation problems. This is partly why server farms have huge air conditioning bills and why liquid cooling is often incorporated into the design.

ARM CPUs have the idea that a single chip can combine different kinds of ARM cores. In cell phones, half the cores are typically high performance models that generate more heat and use more battery power, and the other half are slower but more power efficient. It is then up to the operating system to manage process and thread scheduling to get a good balance of power and performance.
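
Applications can also give the scheduler hints about where a thread should run. A minimal Linux sketch using the pthreads affinity call (which core numbers map to the fast or the efficient cores is entirely system specific; core 0 here is just an illustration):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to a single core. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_to_core(0) != 0) {
            fprintf(stderr, "failed to set affinity\n");
            return 1;
        }
        printf("now running only on core 0\n");
        return 0;
    }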

Operating Systems

Even if you solve the memory bandwidth problem, if you are running general processes like databases or web servers on all the cores, you are going to hit other bottlenecks accessing operating system services. For instance, when you access Linux services, there is going to be contention on operating system memory and on resources such as SSD drives. Typically you have only one channel to these devices, so they become a bottleneck very quickly. Interrupts are another problem, since they may take locks and are typically tied to a single core. There are experimental extensions to the Linux kernel to dynamically allocate interrupt processing to less busy cores, but these are still a way off from being incorporated into the mainline.
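
That said, the kernel has long let an administrator statically steer a given interrupt to a chosen core through /proc. A minimal sketch (IRQ 42 is a made-up example; check /proc/interrupts for real numbers, and root is required):

    #include <stdio.h>

    int main(void)
    {
        /* The mask is hexadecimal: bit 2 set means CPU 2 only. */
        FILE *f = fopen("/proc/irq/42/smp_affinity", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "4\n");
        fclose(f);
        return 0;
    }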

The most efficient use of large core counts is via specialized programs, typically the modelling systems used in supercomputing. These are carefully crafted to avoid bottlenecks. Using a 128-core processor as a general purpose server may not give you the same boost.

Alternative Strategies

One alternative strategy, commonly seen in high end graphics cards, is much heavier use of SIMD processing. If each core is executing the same instruction, then they can share the instruction loading and reduce memory bandwidth. Including RAM as part of the CPU package is another approach to making memory access faster. There are a lot of innovative application specific solutions appearing in various AI and graphics chips. Slowly some of these will be incorporated into mainstream CPUs.
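
The same sharing of one instruction across several data elements is what the NEON unit mentioned earlier does at a small scale. A minimal sketch using ARM NEON intrinsics (compile for an ARM target; the function name is hypothetical and n is assumed to be a multiple of four):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdio.h>

    static void add_arrays(const float *a, const float *b,
                           float *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);     /* load 4 floats      */
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(out + i, vaddq_f32(va, vb)); /* 4 adds in 1 instr. */
        }
    }

    int main(void)
    {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];
        add_arrays(a, b, out, 4);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }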

Summary

The prospect of having a workstation with 128 cores is exciting, but to fully utilize this power you will need expensive liquid cooling and expensive memory with multiple buses, along with a number of other high performance components. This is why these systems are expensive and why discount systems, even with a lot of cores, typically don’t perform well in real benchmarks. AMD and ARM have processors on the market now with 64+ cores, and the challenge for system designers over the next year is to solve all these bottlenecks while maintaining the security of each core’s data.

To learn more about the internal architecture of the ARM Processor, consider my book: Programming with 64-Bit ARM Assembly Language.

Written by smist08

July 17, 2020 at 11:35 am

Posted in Business
