CS 5513 Fall 2008 Midterm Exam Solutions

  1. Answer 'T' for True, 'F' for False:
    1. _F_ On average, given a window of 64 instructions from the same thread of a typical desktop application, at least 32 of them could be executed in parallel.
    2. _F_ If we think of the virtual memory system as a DRAM cache for the disk, then the best placement policy is direct-mapped.
    3. _T_ A fully associative cache never has conflict misses.
    4. _T_ Using tags from virtual addresses constrains cache parameters.
    5. _F_ Moore's Law states that CPU clock rates double every 18 months.
    6. _T_ It is possible for a data dependence to exist without causing a data hazard.
    7. _T_ A data hazard caused by an anti-dependence can be avoided with register renaming.
    8. _F_ In Amdahl's Law, the Speedup_enhanced is the number of cycles taken by the new machine divided by the number of cycles taken by the old machine.
    9. _F_ LRU replacement is always better than random replacement.
    10. _T_ For the MIPS-like ISA in the second homework, copying a value from one register to another can be done using the ADD instruction.
    11. _F_ For the MIPS-like ISA in the second homework, integers are 8 bytes.
    12. _F_ In most ISAs we have seen, a branch target is encoded as the exclusive-OR of the branch PC and an immediate value in the branch instruction.
    13. _F_ Dhrystone is a benchmark that measures memory performance.
    14. _F_ The SPEC CPU benchmarks characterize the performance of a system using the arithmetic mean running time of a set of programs.
    15. _F_ A structural hazard between the ID stage and EX stage of the classic 5-stage pipeline can be solved by adding another memory port.
    16. _F_ Another term for a correlating branch predictor is a "Smith" predictor.
    17. _F_ All page table entries are kept in the TLB.
    18. _F_ Spatial locality is the tendency of the same data item to be accessed repeatedly.
    19. _F_ The three data hazards are: WAR, RAW, and RAR.
    20. _F_ Loop unrolling improves performance by decreasing the number of instructions in a basic block.
  2. Machine A has a data cache with a miss rate of 15% and a clock rate of 1.8 GHz. Machine B has a data cache with a miss rate of 5% and a clock rate of 1.0 GHz. A memory access instruction that misses in the cache takes 100 cycles. All other instructions take 1 cycle. Assume that 20% of instructions are memory access instructions.
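
    One natural comparison with these numbers (a sketch only; not necessarily
    the exact sub-question asked on the exam) is the average cycles per
    instruction (CPI) and the average time per instruction on each machine.
    The short C program below works through that arithmetic, assuming a
    memory access that hits behaves like any other 1-cycle instruction:

    #include <stdio.h>

    int main(void) {
        double mem_frac = 0.20;     /* fraction of memory access instructions  */
        double penalty  = 100.0;    /* total cycles for a memory access miss   */

        double miss_a = 0.15, clk_a = 1.8e9;    /* machine A                   */
        double miss_b = 0.05, clk_b = 1.0e9;    /* machine B                   */

        /* CPI = non-memory instructions at 1 cycle each, plus memory accesses */
        /* at 100 cycles on a miss and 1 cycle on a hit.                       */
        double cpi_a = (1 - mem_frac) + mem_frac * (miss_a * penalty + (1 - miss_a));
        double cpi_b = (1 - mem_frac) + mem_frac * (miss_b * penalty + (1 - miss_b));

        printf("A: CPI = %.2f, %.3f ns per instruction\n", cpi_a, cpi_a / clk_a * 1e9);
        printf("B: CPI = %.2f, %.3f ns per instruction\n", cpi_b, cpi_b / clk_b * 1e9);
        return 0;
    }

    This gives machine A a CPI of 3.97 at 1.8 GHz (about 2.21 ns per
    instruction) and machine B a CPI of 1.99 at 1.0 GHz (1.99 ns per
    instruction), so B comes out slightly ahead despite its slower clock.
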
  3. Consider the following MIPS-like assembly code:
    Loop:	L	R2, 0(R1)	; R2 = memory[R1]
    	L	R3, 4(R1)	; R3 = memory[R1+4]
    	DADD	R4, R2, R3	; R4 = R2 + R3
    	S	R4, 0(R1)	; memory[R1] = R4
    	DADDIU	R1, R1, #4	; R1 = R1 + 4
    	DADDIU	R2, R1, #-400	; compare R1 to 400
    	BEQZ	R2, Loop	; if R2 = 0 then goto Loop
    
    Assume that:
      • R1 initially contains 0.
      • Register writes happen in the first half of a clock cycle and register
        reads in the second half, so a value written in WB can be read by an
        instruction sitting in ID in the same cycle.
      • Branch outcomes and targets are resolved in the ID stage.
    1. In the classic 5-stage pipeline without forwarding (i.e. no bypass network), how many cycles will this loop take?
      Solution:
      Let's look at a pipeline diagram for this problem:
      
      
              cycle      01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
      L       R2, 0(R1)  IF ID EX MM WB
      L       R3, 4(R1)     IF ID EX MM WB
      DADD    R4, R2, R3       IF ID s  s  EX MM WB
      S       R4, 0(R1)           IF s  s  ID s  s  EX MM WB
      DADDIU  R1, R1, #4                   IF s  s  ID EX MM WB
      DADDIU  R2, R1, #-400                         IF ID s  s  EX MM WB
      BEQZ    R2, Loop                                 IF ID s  s  s  s  EX MM WB
      L       R2, 0(R1)                                                  IF ID EX MM WB
      
      From the first fetch of the first load to the second fetch of the first
      load, there are 16 cycles.  From the code and the assumption that R1
      starts at 0, it is clear that the loop iterates 100 times, so the total
      number of cycles is 1600 plus a few extra cycles to drain the pipeline.
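
      As a quick arithmetic check, here is a small C sketch (assuming, as
      above, that R1 starts at 0 and the loop exits when R1 reaches 400):

      #include <stdio.h>

      int main(void) {
          int iterations = 400 / 4;          /* R1 is 0, 4, ..., 396 at the top */
          int cycles_per_iteration = 16;     /* fetch-to-fetch distance above   */
          printf("%d iterations, about %d cycles plus pipeline drain\n",
                 iterations, iterations * cycles_per_iteration);
          return 0;
      }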
      
    Suppose we reorder the code like this:
    Loop:	L	R2, 0(R1)	; R2 = memory[R1]
    	L	R3, 4(R1)	; R3 = memory[R1+4]
    	DADDIU	R1, R1, #4	; R1 = R1 + 4
    	DADD	R4, R2, R3	; R4 = R2 + R3
    	DADDIU	R2, R1, #-400	; compare R1 to 400
    	S	R4, -4(R1)	; memory[R1-4] = R4
    	BEQZ	R2, Loop	; if R2 = 0 then goto Loop
    
    1. Keep all the previous assumptions, but this time allow forwarding via a bypass network.
      • How many cycles does this loop take?
        Solution:
        Let's look at the pipeline diagram for the new code:
        
                cycle      01 02 03 04 05 06 07 08 09 10 11 12 13
        L       R2, 0(R1)  IF ID EX MM WB
        L       R3, 4(R1)     IF ID EX MM WB
        DADDIU  R1, R1, #4       IF ID EX MM WB
        DADD    R4, R2, R3          IF ID EX MM WB
        DADDIU  R2, R1, #-400          IF ID EX MM WB
        S       R4, -4(R1)                IF ID EX MM WB
        BEQZ    R2, Loop                     IF ID EX MM WB
        L       R2, 0(R1)                       ss IF ID EX MM WB
        
        Now there are no stall cycles except for the one spent waiting for the
        branch to resolve.  The second load's MM stage can forward to the DADD's
        EX stage because the DADDIU moved in between gives the load one extra
        cycle, avoiding the usual load-use stall.  The branch no longer has to
        wait for R2 to be computed: forwarding supplies the value, and moving
        the store between the DADDIU and the branch gives it time to arrive.
        
        From the first fetch of the first load to the second fetch of the first
        load there are 8 cycles.  So this loop takes 100 * 8 = 800 cycles plus a
        few to drain the pipeline.
        
        
      • What is the speedup over the previous situation?
        Solution:
        
        The speedup is 1600 / 800 = 2, i.e. a 100% improvement.
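
        Since both versions execute the same 100 iterations, the speedup is
        just the ratio of the per-iteration cycle counts; a one-line check
        in C:

        #include <stdio.h>

        int main(void) {
            int no_forwarding = 16, forwarding = 8;    /* cycles per iteration */
            /* prints: speedup = 2.0 */
            printf("speedup = %.1f\n", (double)no_forwarding / forwarding);
            return 0;
        }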
        
        
  4. The following problems relate to caches:
    1. Consider a byte-addressable machine with 32-bit addresses. Cache blocks are 16 bytes. The cache is 4-way set-associative with an LRU replacement policy. Tags are 19 bits. What is the capacity of the cache in bytes?
      Solution:
      
      The block offset is log2(16) = 4 bits.
      
      32 address bits - 19 tag bits - 4 block offset bits = 9 set index bits.
      
      So there are 2^9 = 512 sets.  Each set has 4 blocks, so there are 512 *
      4 = 2048 blocks.  Each block has 16 bytes, so there are 2048 * 16 = 32768
      bytes, i.e. 32KB.
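
      The same arithmetic as a short C sketch, deriving the index width from
      the address, tag, and block-offset widths:

      #include <stdio.h>

      int main(void) {
          int address_bits = 32, tag_bits = 19;
          int block_bytes  = 16, ways = 4;
          int offset_bits  = 4;                                     /* log2(16) */
          int index_bits   = address_bits - tag_bits - offset_bits; /* 9 bits   */
          int sets         = 1 << index_bits;                       /* 512 sets */
          long capacity    = (long)sets * ways * block_bytes;       /* in bytes */
          printf("%ld bytes (%ld KB)\n", capacity, capacity / 1024); /* 32768, 32 */
          return 0;
      }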
      
      
    2. Consider a 4096-byte direct-mapped cache with 16-byte blocks and 256 blocks. Initially, all of the blocks are invalid. Suppose the following sequence of addresses is loaded using 4-byte load instructions:
      0x0567
      0x03c6
      0x18b7
      0x0873
      0x087c
      0x0872
      0x18ec
      0x1874
      0x18ba
      0x17ab
      0x0870
      0x056c
      0x18e1
      0x0146
      0x02c5
      0x02c2
      
      How many of these loads will miss in the cache? Hint: the block offset is the least significant hex digit, and the set index is the next two hex digits.
      Solution:
      
      The way to do this problem is to go through the addresses sequentially.
      The first time we see a set index, it is a miss.  On later accesses to the
      same set, it is a miss if the tag (i.e. the first hex digit) doesn't match
      the tag left by the previous access to that set.  We can safely ignore the
      block offset: it is never greater than 12, so a 4-byte load never spans
      two blocks.  The cache is direct-mapped, so we don't have to worry about
      replacement or about searching among several blocks in a set.
      
      Thus:
      0x0567	miss - invalid block
      0x03c6	miss - invalid block
      0x18b7	miss - invalid block
      0x0873	miss - invalid block
      0x087c  hit - set 87, tag 0
      0x0872  hit - set 87, tag 0
      0x18ec  miss - invalid block
      0x1874  miss - set 87, tag 1 (!= 0)
      0x18ba  hit - set 8b, tag 1
      0x17ab  miss - invalid block
      0x0870  miss - set 87, tag 0 (!= 1)
      0x056c  hit - set 56, tag 0
      0x18e1  hit - set 8e, tag 1
      0x0146  miss - invalid block
      0x02c5  miss - invalid block
      0x02c2  hit - set 2c, tag 0
      
      So there are 10 misses.
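
      A quick way to double-check this count is to simulate the cache.  The
      C sketch below keeps one valid bit and one tag per set, and extracts
      the set index and tag exactly as described in the hint above:

      #include <stdio.h>

      int main(void) {
          unsigned tags[256] = {0};
          int valid[256] = {0};                      /* all blocks start invalid */
          unsigned addrs[] = {
              0x0567, 0x03c6, 0x18b7, 0x0873, 0x087c, 0x0872, 0x18ec, 0x1874,
              0x18ba, 0x17ab, 0x0870, 0x056c, 0x18e1, 0x0146, 0x02c5, 0x02c2
          };
          int misses = 0;

          for (int i = 0; i < 16; i++) {
              unsigned set = (addrs[i] >> 4) & 0xff;   /* next two hex digits   */
              unsigned tag = addrs[i] >> 12;           /* remaining upper bits  */
              if (!valid[set] || tags[set] != tag) {   /* cold or conflict miss */
                  misses++;
                  valid[set] = 1;
                  tags[set]  = tag;
              }
          }
          printf("%d misses\n", misses);               /* prints: 10 misses */
          return 0;
      }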
      
  5. Consider the following code:
    int code (void) {
            int i, j;
            int c;
            i = 1;
    loop:   j = 1;
    loop2:  c = c + i + j;
            j++;
            if (j <= 3) goto loop2;
            i++;
            if (i <= 1000) goto loop;
    }
    
    The compiler will produce conditional branches for the two if statements.
    1. Using a branch history table with 2-bit saturating counters, exactly how many mispredictions will there be? Assume there are enough table entries to avoid conflicts. Assume the table entries are initially 0.
      Solution:
      
      For the outer loop branch, which is taken 999 times and then not taken
      once, there will be 2 initial mispredictions while the counter has
      values 0 and 1 and predicts not taken, and then 1 final misprediction
      when the loop exits while the counter is 3.  So the outer loop branch
      contributes 3 mispredictions.
      
      The first iteration of the inner loop branch will look like this:
      
      old counter value	prediction	outcome		new counter value
      0			not taken	taken		1
      1			not taken	taken		2
      2			taken		not taken	1
      
      So the first iteration of the outer loop will see 3 mispredictions from
      the inner loop.
      
      The second iteration of the inner loop branch will look like this:
      
      old counter value	prediction	outcome		new counter value
      1			not taken	taken		2
      2			taken		taken		3
      3			taken		not taken	2
      
      So the second iteration of the outer loop sees 2 mispredictions from
      the inner loop.
      
      Each subsequent iteration will look like this:
      
      old counter value	prediction	outcome		new counter value
      2			taken		taken		3
      3			taken		taken		3
      3			taken		not taken	2
      
      So each of the 998 subsequent iterations of the outer loop will see 1
      misprediction from the inner loop.  In total, there will be
      3 + 3 + 2 + 1 * 998 = 1006 mispredictions.
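
      This count is easy to confirm by simulating the two counters directly.
      The C sketch below uses one 2-bit counter per static branch (the
      no-conflict assumption), both starting at 0, with values 0-1 predicting
      not taken and 2-3 predicting taken:

      #include <stdio.h>

      static int predict(int c) { return c >= 2; }            /* 1 = predict taken */
      static int update(int c, int taken) {                   /* saturating 0..3   */
          return taken ? (c < 3 ? c + 1 : 3) : (c > 0 ? c - 1 : 0);
      }

      int main(void) {
          int inner = 0, outer = 0, mispredicts = 0;
          for (int i = 1; i <= 1000; i++) {
              for (int j = 1; j <= 3; j++) {
                  int taken = (j + 1 <= 3);           /* "if (j <= 3)" after j++ */
                  if (predict(inner) != taken) mispredicts++;
                  inner = update(inner, taken);
              }
              int taken = (i + 1 <= 1000);            /* "if (i <= 1000)" after i++ */
              if (predict(outer) != taken) mispredicts++;
              outer = update(outer, taken);
          }
          printf("%d mispredictions\n", mispredicts);  /* prints: 1006 mispredictions */
          return 0;
      }
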
    2. Consider a (4,2) correlating predictor that uses global history. Assume the history and table entries are initially 0. Assume there are enough table entries to avoid conflicts. How many mispredictions will there be?
      Solution:
      
      The global history looks like this:
      
      code				outcome	
      if (j <= 3) goto loop2;		taken
      if (j <= 3) goto loop2;		taken
      if (j <= 3) goto loop2;		not taken
      if (i <= 1000) goto loop;		taken
      
      resulting in a pattern like this (1 = taken, 0 = not taken):
      110111011101110111011101.....
      
      The following histories are seen only once, at the beginning:
      
      history		outcome of next branch	result due to table value being initially 0
      0000		taken			misprediction
      0001		taken			misprediction
      0011		not taken		correct prediction
      0110		taken			misprediction
      
      Now we get into a steady state with the following histories:
      
      history		outcome of next branch	mispredictions over the whole run
      1101		taken			two initial mispredictions
      1011		taken			two initial mispredictions
      0111		not taken		no mispredictions
      1110		taken			two initial and one final misprediction
      
      Thus, in the steady state, all of the inner and outer loop branches are
      predicted correctly.  The only mispredictions are a few initial ones and
      the final one that comes as the outer loop exits.  Counting from the
      tables above, there are a total of 10 mispredictions.
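
      As in the previous part, the count can be confirmed by simulation.  The
      C sketch below models the (4,2) predictor as a separate 2-bit counter
      for every (static branch, 4-bit global history) pair, which is what
      "enough table entries to avoid conflicts" amounts to here:

      #include <stdio.h>

      static int predict(int c) { return c >= 2; }            /* 1 = predict taken */
      static int update(int c, int taken) {                   /* saturating 0..3   */
          return taken ? (c < 3 ? c + 1 : 3) : (c > 0 ? c - 1 : 0);
      }

      int main(void) {
          int counters[2][16] = {{0}};    /* [branch][4-bit global history] */
          int history = 0;                /* newest outcome in the low bit  */
          int mispredicts = 0;

          for (int i = 1; i <= 1000; i++) {
              for (int j = 1; j <= 3; j++) {
                  int taken = (j + 1 <= 3);                /* inner branch */
                  int *c = &counters[0][history];
                  if (predict(*c) != taken) mispredicts++;
                  *c = update(*c, taken);
                  history = ((history << 1) | taken) & 0xf;
              }
              int taken = (i + 1 <= 1000);                 /* outer branch */
              int *c = &counters[1][history];
              if (predict(*c) != taken) mispredicts++;
              *c = update(*c, taken);
              history = ((history << 1) | taken) & 0xf;
          }
          printf("%d mispredictions\n", mispredicts);      /* prints: 10 mispredictions */
          return 0;
      }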