CS 5513 Fall 2008 Midterm Exam Solutions

  1. Answer 'T' for True, 'F' for False:
    1. _F_ On average, given a window of 64 instructions from the same thread of a typical desktop application, at least 32 of them could be executed in parallel.
    2. _F_ If we think of the virtual memory system as a DRAM cache for the disk, then the best placement policy is direct-mapped.
    3. _T_ A fully associative cache never has conflict misses.
    4. _T_ Using tags from virtual addresses constrains cache parameters.
    5. _F_ Moore's Law states that CPU clock rates double every 18 months.
    6. _T_ It is possible for a data dependence to exist without causing a data hazard.
    7. _T_ A data hazard caused by an anti-dependence can be avoided with register renaming.
    8. _F_ In Amdahl's Law, the Speedup_enhanced is the number of cycles taken by the new machine divided by the number of cycles taken by the old machine.
    9. _F_ LRU replacement is always better than random replacement.
    10. _T_ For the MIPS-like ISA in the second homework, copying a value from one register to another can be done using the ADD instruction.
    11. _F_ For the MIPS-like ISA in the second homework, integers are 8 bytes.
    12. _F_ In most ISAs we have seen, a branch target is encoded as the exclusive-OR of the branch PC and an immediate value in the branch instruction.
    13. _F_ Dhrystone is a benchmark that measures memory performance.
    14. _F_ The SPEC CPU benchmarks characterize the performance of a system using the arithmetic mean running time of a set of programs.
    15. _F_ A structural hazard between the ID stage and EX stage of the classic 5-stage pipeline can be solved by adding another memory port.
    16. _F_ Another term for a correlating branch predictor is a "Smith" predictor.
    17. _F_ All page table entries are kept in the TLB.
    18. _F_ Spatial locality is the tendency of the same data item to be accessed repeatedly.
    19. _F_ The three data hazards are: WAR, RAW, and RAR.
    20. _F_ Loop unrolling improves performance by decreasing the number of instructions in a basic block.
  2. Machine A has a data cache with a miss rate of 15% and a clock rate of 1.8 GHz. Machine B has a data cache with a miss rate of 5% and a clock rate of 1.0 GHz. A memory access instruction that misses in the cache takes 100 cycles. All other instructions take 1 cycle. Assume that 20% of instructions are memory access instructions.
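
    One natural comparison with these numbers (a sketch only; not necessarily
    the exact sub-question asked on the exam) is the average cycles per
    instruction (CPI) and the average time per instruction on each machine.
    The short C program below works through that arithmetic, assuming a
    memory access that hits behaves like any other 1-cycle instruction:

    #include <stdio.h>

    int main(void) {
        double mem_frac = 0.20;     /* fraction of memory access instructions  */
        double penalty  = 100.0;    /* total cycles for a memory access miss   */

        double miss_a = 0.15, clk_a = 1.8e9;    /* machine A                   */
        double miss_b = 0.05, clk_b = 1.0e9;    /* machine B                   */

        /* CPI = non-memory instructions at 1 cycle each, plus memory accesses */
        /* at 100 cycles on a miss and 1 cycle on a hit.                       */
        double cpi_a = (1 - mem_frac) + mem_frac * (miss_a * penalty + (1 - miss_a));
        double cpi_b = (1 - mem_frac) + mem_frac * (miss_b * penalty + (1 - miss_b));

        printf("A: CPI = %.2f, %.3f ns per instruction\n", cpi_a, cpi_a / clk_a * 1e9);
        printf("B: CPI = %.2f, %.3f ns per instruction\n", cpi_b, cpi_b / clk_b * 1e9);
        return 0;
    }

    This gives machine A a CPI of 3.97 at 1.8 GHz (about 2.21 ns per
    instruction) and machine B a CPI of 1.99 at 1.0 GHz (1.99 ns per
    instruction), so B comes out slightly ahead despite its slower clock.
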
  3. Consider the following MIPS-like assembly code:
    Loop:	L	R2, 0(R1)	; R2 = memory[R1]
    	L	R3, 4(R1)	; R3 = memory[R1+4]
    	DADD	R4, R2, R3	; R4 = R2 + R3
    	S	R4, 0(R1)	; memory[R1] = R4
    	DADDIU	R1, R1, #4	; R1 = R1 + 4
    	DADDIU	R2, R1, #-400	; compare R1 to 400
    	BEQZ	R2, Loop	; if R2 = 0 then goto Loop
    
    Assume that:
      • R1 initially contains 0.
      • Register writes happen in the first half of a clock cycle and register
        reads in the second half, so a value written in WB can be read by an
        instruction sitting in ID in the same cycle.
      • Branch outcomes and targets are resolved in the ID stage.
    1. In the classic 5-stage pipeline without forwarding (i.e. no bypass network), how many cycles will this loop take?
      Solution:
      Let's look at a pipeline diagram for this problem:
      
      
              cycle      01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
      L       R2, 0(R1)  IF ID EX MM WB
      L       R3, 4(R1)     IF ID EX MM WB
      DADD    R4, R2, R3       IF ID s  s  EX MM WB
      S       R4, 0(R1)           IF s  s  ID s  s  EX MM WB
      DADDIU  R1, R1, #4                   IF s  s  ID EX MM WB
      DADDIU  R2, R1, #-400                         IF ID s  s  EX MM WB
      BEQZ    R2, Loop                                 IF ID s  s  s  s  EX MM WB
      L       R2, 0(R1)                                                  IF ID EX MM WB
      
      From the first fetch of the first load to the second fetch of the first
      load, there are 16 cycles.  From the code and the assumption that R1
      starts at 0, it is clear that the loop iterates 100 times, so the total
      number of cycles is 1600 plus a few extra cycles to drain the pipeline.
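
      As a quick arithmetic check, here is a small C sketch (assuming, as
      above, that R1 starts at 0 and the loop exits when R1 reaches 400):

      #include <stdio.h>

      int main(void) {
          int iterations = 400 / 4;          /* R1 is 0, 4, ..., 396 at the top */
          int cycles_per_iteration = 16;     /* fetch-to-fetch distance above   */
          printf("%d iterations, about %d cycles plus pipeline drain\n",
                 iterations, iterations * cycles_per_iteration);
          return 0;
      }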
      
    Suppose we reorder the code like this:
    Loop:	L	R2, 0(R1)	; R2 = memory[R1]
    	L	R3, 4(R1)	; R3 = memory[R1+4]
    	DADDIU	R1, R1, #4	; R1 = R1 + 4
    	DADD	R4, R2, R3	; R4 = R2 + R3
    	DADDIU	R2, R1, #-400	; compare R1 to 400
    	S	R4, -4(R1)	; memory[R1-4] = R4
    	BEQZ	R2, Loop	; if R2 = 0 then goto Loop
    
    1. Keep all the previous assumptions, but this time allow forwarding via a bypass network.
      • How many cycles does this loop take?
        Solution:
        Let's look at the pipeline diagram for the new code:
        
                cycle      01 02 03 04 05 06 07 08 09 10 11 12 13
        L       R2, 0(R1)  IF ID EX MM WB
        L       R3, 4(R1)     IF ID EX MM WB
        DADDIU  R1, R1, #4       IF ID EX MM WB
        DADD    R4, R2, R3          IF ID EX MM WB
        DADDIU  R2, R1, #-400          IF ID EX MM WB
        S       R4, -4(R1)                IF ID EX MM WB
        BEQZ    R2, Loop                     IF ID EX MM WB
        L       R2, 0(R1)                       ss IF ID EX MM WB
        
        Now there are no stall cycles except for the one spent waiting for the
        branch to resolve.  The second load's MM stage can forward to the DADD's
        EX stage because the DADDIU moved in between gives the load one extra
        cycle, avoiding the usual load-use stall.  The branch no longer has to
        wait for R2 to be computed: forwarding supplies the value, and moving
        the store between the DADDIU and the branch gives it time to arrive.
        
        From the first fetch of the first load to the second fetch of the first
        load there are 8 cycles.  So this loop takes 100 * 8 = 800 cycles plus a
        few to drain the pipeline.
        
        
      • What is the speedup over the previous situation?
        Solution:
        
        The speedup is 1600 / 800 = 2, i.e. a 100% improvement.
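
        Since both versions execute the same 100 iterations, the speedup is
        just the ratio of the per-iteration cycle counts; a one-line check
        in C:

        #include <stdio.h>

        int main(void) {
            int no_forwarding = 16, forwarding = 8;    /* cycles per iteration */
            /* prints: speedup = 2.0 */
            printf("speedup = %.1f\n", (double)no_forwarding / forwarding);
            return 0;
        }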
        
        
  4. The following problems relate to caches:
    1. Consider a byte-addressable machine with 32-bit addresses. Cache blocks are 16 bytes. The cache is 4-way set-associative with an LRU replacement policy. Tags are 19 bits. What is the capacity of the cache in bytes?
      Solution:
      
      The block offset is log2(16) = 4 bits.
      
      32 address bits - 19 tag bits - 4 block offset bits = 9 set index bits.
      
      So there are 2^9 = 512 sets.  Each set has 4 blocks, so there are 512 *
      4 = 2048 blocks.  Each block has 16 bytes, so there are 2048 * 16 = 32768
      bytes, i.e. 32KB.
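
      The same arithmetic as a short C sketch, deriving the index width from
      the address, tag, and block-offset widths:

      #include <stdio.h>

      int main(void) {
          int address_bits = 32, tag_bits = 19;
          int block_bytes  = 16, ways = 4;
          int offset_bits  = 4;                                     /* log2(16) */
          int index_bits   = address_bits - tag_bits - offset_bits; /* 9 bits   */
          int sets         = 1 << index_bits;                       /* 512 sets */
          long capacity    = (long)sets * ways * block_bytes;       /* in bytes */
          printf("%ld bytes (%ld KB)\n", capacity, capacity / 1024); /* 32768, 32 */
          return 0;
      }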
      
      
    2. Consider a 4096-byte direct-mapped cache with 16-byte blocks and 256 blocks. Initially, all of the blocks are invalid. Suppose the following sequence of addresses is loaded using 4-byte load instructions:
      0x0567
      0x03c6
      0x18b7
      0x0873
      0x087c
      0x0872
      0x18ec
      0x1874
      0x18ba
      0x17ab
      0x0870
      0x056c
      0x18e1
      0x0146
      0x02c5
      0x02c2
      
      How many of these loads will miss in the cache? Hint: the block offset is the least significant hex digit, and the set index is the next two hex digits.
      Solution:
      
      The way to do this problem is to go through the addresses sequentially.
      The first time we see a set index, it is a miss.  On later accesses to the
      same set, it is a miss if the tag (i.e. the first hex digit) doesn't match
      the tag left by the previous access to that set.  We can safely ignore the
      block offset: it is never greater than 12, so a 4-byte load never spans
      two blocks.  The cache is direct-mapped, so we don't have to worry about
      replacement or about searching among several blocks in a set.
      
      Thus:
      0x0567	miss - invalid block
      0x03c6	miss - invalid block
      0x18b7	miss - invalid block
      0x0873	miss - invalid block
      0x087c  hit - set 87, tag 0
      0x0872  hit - set 87, tag 0
      0x18ec  miss - invalid block
      0x1874  miss - set 87, tag 1 (!= 0)
      0x18ba  hit - set 8b, tag 1
      0x17ab  miss - invalid block
      0x0870  miss - set 87, tag 0 (!= 1)
      0x056c  hit - set 56, tag 0
      0x18e1  hit - set 8e, tag 1
      0x0146  miss - invalid block
      0x02c5  miss - invalid block
      0x02c2  hit - set 2c, tag 0
      
      So there are 10 misses.
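
      A quick way to double-check this count is to simulate the cache.  The
      C sketch below keeps one valid bit and one tag per set, and extracts
      the set index and tag exactly as described in the hint above:

      #include <stdio.h>

      int main(void) {
          unsigned tags[256] = {0};
          int valid[256] = {0};                      /* all blocks start invalid */
          unsigned addrs[] = {
              0x0567, 0x03c6, 0x18b7, 0x0873, 0x087c, 0x0872, 0x18ec, 0x1874,
              0x18ba, 0x17ab, 0x0870, 0x056c, 0x18e1, 0x0146, 0x02c5, 0x02c2
          };
          int misses = 0;

          for (int i = 0; i < 16; i++) {
              unsigned set = (addrs[i] >> 4) & 0xff;   /* next two hex digits   */
              unsigned tag = addrs[i] >> 12;           /* remaining upper bits  */
              if (!valid[set] || tags[set] != tag) {   /* cold or conflict miss */
                  misses++;
                  valid[set] = 1;
                  tags[set]  = tag;
              }
          }
          printf("%d misses\n", misses);               /* prints: 10 misses */
          return 0;
      }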
      
  5. Consider the following code:
    int code (void) {
            int i, j;
            int c;
            i = 1;
    loop:   j = 1;
    loop2:  c = c + i + j;
            j++;
            if (j <= 3) goto loop2;
            i++;
            if (i <= 1000) goto loop;
    }
    
    The compiler will produce conditional branches for the two if statements.
    1. Using a branch history table with 2-bit saturating counters, exactly how many mispredictions will there be? Assume there are enough table entries to avoid conflicts. Assume the table entries are initially 0.
      Solution:
      
      For the outer loop branch, which is taken 999 times and then not taken
      once, there will be 2 initial mispredictions while the counter has
      values 0 and 1 and predicts not taken, and then 1 final misprediction
      when the loop exits while the counter is 3.  So the outer loop branch
      contributes 3 mispredictions.
      
      The first iteration of the inner loop branch will look like this:
      
      old counter value	prediction	outcome		new counter value
      0			not taken	taken		1
      1			not taken	taken		2
      2			taken		not taken	1
      
      So the first iteration of the outer loop will see 3 mispredictions from
      the inner loop.
      
      The second iteration of the inner loop branch will look like this:
      
      old counter value	prediction	outcome		new counter value
      1			not taken	taken		2
      2			taken		taken		3
      3			taken		not taken	2
      
      So the second iteration of the outer loop sees 2 mispredictions from
      the inner loop.
      
      Each subsequent iteration will look like this:
      
      old counter value	prediction	outcome		new counter value
      2			taken		taken		3
      3			taken		taken		3
      3			taken		not taken	2
      
      So each of the 998 subsequent iterations of the outer loop will see 1
      misprediction from the inner loop.  In total, there will be
      3 + 3 + 2 + 1 * 998 = 1006 mispredictions.
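
      This count is easy to confirm by simulating the two counters directly.
      The C sketch below uses one 2-bit counter per static branch (the
      no-conflict assumption), both starting at 0, with values 0-1 predicting
      not taken and 2-3 predicting taken:

      #include <stdio.h>

      static int predict(int c) { return c >= 2; }            /* 1 = predict taken */
      static int update(int c, int taken) {                   /* saturating 0..3   */
          return taken ? (c < 3 ? c + 1 : 3) : (c > 0 ? c - 1 : 0);
      }

      int main(void) {
          int inner = 0, outer = 0, mispredicts = 0;
          for (int i = 1; i <= 1000; i++) {
              for (int j = 1; j <= 3; j++) {
                  int taken = (j + 1 <= 3);           /* "if (j <= 3)" after j++ */
                  if (predict(inner) != taken) mispredicts++;
                  inner = update(inner, taken);
              }
              int taken = (i + 1 <= 1000);            /* "if (i <= 1000)" after i++ */
              if (predict(outer) != taken) mispredicts++;
              outer = update(outer, taken);
          }
          printf("%d mispredictions\n", mispredicts);  /* prints: 1006 mispredictions */
          return 0;
      }
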
    2. Consider a (4,2) correlating predictor that uses global history. Assume the history and table entries are initially 0. Assume there are enough table entries to avoid conflicts. How many mispredictions will there be?
      Solution:
      
      The global history looks like this:
      
      code				outcome	
      if (j <= 3) goto loop2;		taken
      if (j <= 3) goto loop2;		taken
      if (j <= 3) goto loop2;		not taken
      if (i <= 1000) goto loop;		taken
      
      resulting in a pattern like this (1 = taken, 0 = not taken):
      110111011101110111011101.....
      
      The following histories are seen only once, at the beginning:
      
      history		outcome of next branch	result due to table value being initially 0
      0000		taken			misprediction
      0001		taken			misprediction
      0011		not taken		correct prediction
      0110		taken			misprediction
      
      Now we get into a steady state with the following histories:
      
      history		outcome of next branch	mispredictions over the whole run
      1101		taken			two initial mispredictions
      1011		taken			two initial mispredictions
      0111		not taken		no mispredictions
      1110		taken			two initial and one final misprediction
      
      Thus, in the steady state, all of the inner and outer loop branches are
      predicted correctly.  The only mispredictions are a few initial ones and
      the final one that comes as the outer loop exits.  Counting from the
      tables above, there are a total of 10 mispredictions.
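
      As in the previous part, the count can be confirmed by simulation.  The
      C sketch below models the (4,2) predictor as a separate 2-bit counter
      for every (static branch, 4-bit global history) pair, which is what
      "enough table entries to avoid conflicts" amounts to here:

      #include <stdio.h>

      static int predict(int c) { return c >= 2; }            /* 1 = predict taken */
      static int update(int c, int taken) {                   /* saturating 0..3   */
          return taken ? (c < 3 ? c + 1 : 3) : (c > 0 ? c - 1 : 0);
      }

      int main(void) {
          int counters[2][16] = {{0}};    /* [branch][4-bit global history] */
          int history = 0;                /* newest outcome in the low bit  */
          int mispredicts = 0;

          for (int i = 1; i <= 1000; i++) {
              for (int j = 1; j <= 3; j++) {
                  int taken = (j + 1 <= 3);                /* inner branch */
                  int *c = &counters[0][history];
                  if (predict(*c) != taken) mispredicts++;
                  *c = update(*c, taken);
                  history = ((history << 1) | taken) & 0xf;
              }
              int taken = (i + 1 <= 1000);                 /* outer branch */
              int *c = &counters[1][history];
              if (predict(*c) != taken) mispredicts++;
              *c = update(*c, taken);
              history = ((history << 1) | taken) & 0xf;
          }
          printf("%d mispredictions\n", mispredicts);      /* prints: 10 mispredictions */
          return 0;
      }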