Lecture 8: Parallel Computers

So far, we have considered only single-processor systems. To the user, it appears that only one instruction is executed at a time, although we know that the microarchitecture can exploit some instruction-level parallelism to speed things up.

However, there is another model of computation that has been around for almost as long as computing itself. Parallel computers allow the user to explicitly manage data and instructions on multiple processing elements. Many workloads exhibit large amounts of parallelism at granularities coarser than the instruction level.

Today, we will see: why we should pursue parallel computing rather than just better single-processor systems; the two main parallel software models, message passing and shared memory; the cache coherence problem; the granularity and cost-effectiveness of parallel computers; and the future of multiprocessing.

Parallel Computing

Why should we consider parallel computing as opposed to building better single-processor systems?

Parallel Software Models

There are two main parallel software models, i.e., two ways the multiple processing elements (nodes) are presented to the programmer. The models differ in how nodes communicate data and how they synchronize. Synchronization is how nodes communicate control flow, for example ensuring that one node does not read a value before another node has produced it.
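
To make synchronization concrete, here is a minimal sketch, assuming C11 atomics and POSIX threads (the names and the flag-based scheme are illustrative choices, not from the lecture): one thread writes a value and then raises a flag, and another thread spins on the flag before reading the value.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int shared_data;       /* the data being communicated             */
    atomic_int ready = 0;  /* the flag that communicates control flow */

    void *producer(void *arg) {
        shared_data = 42;  /* produce the data first...               */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    void *consumer(void *arg) {
        /* spin until the producer signals: the control-flow handoff */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        printf("got %d\n", shared_data);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }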

Message Passing Multicomputers and Cluster Computers

This kind of architecture is a collection of computers (nodes) connected by a network. The nodes may all sit on the same motherboard, each on its own motherboard connected by some communication technology, or a mix of the two.
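
As a concrete sketch of this style, here is a small MPI program (MPI is an illustrative choice; the lecture does not prescribe a particular message-passing library, and the tag and payload are made up). Node 0 explicitly sends an integer to node 1; the matching receive both transfers the data and synchronizes the two nodes.

    /* run with, e.g., mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* node 0 explicitly ships the data to node 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* node 1 explicitly receives it; no shared memory involved */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }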

Shared-Memory Multiprocessors

These systems have many processors but present a single, coherent address space to the threads.
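
For contrast, a minimal shared-memory sketch, assuming POSIX threads (again an illustrative choice, not prescribed by the lecture): every thread reads and writes the same counter through the single address space, and a lock provides the synchronization.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    long counter = 0;  /* visible to all threads via the shared address space */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* synchronization            */
            counter++;                    /* communication via a shared */
            pthread_mutex_unlock(&lock);  /* memory location            */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* prints 400000 */
        return 0;
    }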

Cache Coherence

Cache coherence is an issue with shared-memory multiprocessors. Although the system conceptually uses a large shared memory, we know what is really going on behind the scenes: each processor has its own cache. Two processors can therefore hold copies of the same memory block, and when one of them writes it, the other's copy becomes stale; the hardware must keep the copies consistent or the single-address-space illusion breaks.
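
Coherence traffic is easy to provoke from software. The sketch below (the 64-byte line size, iteration count, and names are assumptions for illustration) demonstrates false sharing: two threads update different counters that happen to sit in the same cache line, so exclusive ownership of the line ping-pongs between the two caches. Uncommenting the padding puts each counter in its own line and makes the loops run markedly faster.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    struct {
        volatile long a;     /* volatile forces a memory access per    */
        /* char pad[56]; */  /* iteration; uncomment the padding to    */
        volatile long b;     /* give each counter its own cache line   */
    } counters;

    void *bump_a(void *arg) {
        for (long i = 0; i < ITERS; i++) counters.a++;
        return NULL;
    }

    void *bump_b(void *arg) {
        for (long i = 0; i < ITERS; i++) counters.b++;
        return NULL;
    }

    int main(void) {
        struct timespec t0, t1;
        pthread_t ta, tb;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&ta, NULL, bump_a, NULL);
        pthread_create(&tb, NULL, bump_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.2f s\n", (t1.tv_sec - t0.tv_sec)
                         + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }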

Granularity and Cost-Effectiveness of Parallel Computers

Future of Multiprocessing

Moore's Law gives us billions of transistors on a die, but with relatively slow wires. How can we build a computer out of these components?

Single-chip Multiprocessors

A quote from Andy Glew, a noted microarchitect:
"It seems to me that CPU groups fall back to explicit parallelism when they have run out of ideas for improving uniprocessor performance. If your workload has parallelism, great; even if it doesn't currently have parallelism, sometimes occasionally it is easy to write multithreaded code than single threaded code. But, if your workload doesn't have enough natural parallelism, it is far too easy to persuade yourself that software should be rewritten to expose more parallelism... because explicit parallelism is easy to microarchitect for."

Processors with DRAM (PIM)

Reconfigurable Processors