CS 5513 Fall 2008 Second Exam

Write your answers on this question paper. Your answers must be clear and legible. You may not use your book, your notes, or any other such material during this exam. You may use a calculator. Use the space provided to show how you arrived at your answer for partial credit. Only the answers you write in the spaces provided will be graded; you may use the back of the pages for your own private calculations. If a question seems ambiguous, state any assumptions you need to make to work the problem.
  1. True/False. For each of the following answer 'T' for True or 'F' for False.
    1. ___ On a system with 32-bit addresses, a 64KB 2-way set associative cache with 128-byte blocks has 16-bit tags.
    2. ___ A message passing style of parallel programming can be used on a centralized memory processor.
    3. ___ A shared memory style of parallel programming can be used on a centralized memory processor.
    4. ___ A shared memory style of parallel programming does not allow processes to have local variables.
    5. ___ Once the valid bit is set to "true" in a cache block, it never becomes "false" again until the process using the block terminates.
    6. ___ Making a one-processor system into a p-processor system automatically improves single-threaded performance by a factor of p.
    7. ___ Register renaming by the compiler can improve in-order pipeline performance.
    8. ___ The reorder buffer was invented to allow speculative execution.
    9. ___ Branch prediction is not useful without speculative execution.
    10. ___ In the classic 5-stage pipeline, by executing branches in the decode stage, it is possible for branches to incur only a single-cycle bubble in the pipeline.
    11. ___ It is possible for one instruction to be involved in both a data hazard and a control hazard.
    12. ___ Current performance can be improved by exploiting either TLP or ILP; however, it is more likely that doing it with TLP will be more efficient in terms of energy and design complexity.
    13. ___ It is possible to spread the work of one program across p processors and achieve a speedup greater than p.
    14. ___ In the speculative version of Tomasulo's algorithm, store instructions are done in reverse order to support precise interrupts.
    15. ___ VLIW processors perform better than speculative superscalar processors when it comes to irregular, pointer-chasing codes like the SPEC CPU integer benchmarks.
    16. ___ Superscalar processors can achieve an IPC greater than one.
    17. ___ Control hazards are less of an impediment to performance on superscalar processors than they are on single-issue processors.
    18. ___ A RAW hazard can be eliminated by register renaming.
    19. ___ Reservation stations take on the role of extra registers that the normal registers can be renamed into.
    20. ___ Reorder buffer entries take on the role of extra registers that the normal registers can be renamed into.

  2. Consider a program that goes through the following steps in sequence:
    1. Read data from a file
    2. Operate on the data
    3. Write results to a file
    Step 1 takes 1,000 cycles. Step 2 takes 10,000 cycles on a single-processor system. Step 3 takes 1,000 cycles. Now suppose that step 2 is perfectly linearly parallelizable. That is, by using p processors, we can perform step 2 p times faster than we can using one processor. Steps 1 and 3 cannot be parallelized.

    1. With 5 processors in the system, what is the speedup of the parallel system over the single-processor system?











    2. How many processors will the system need to achieve a speedup of 5 over the single-processor system?










  3. Consider a (4,2) correlating branch predictor, i.e., a two-level adaptive branch predictor with a history length of 4 and with 2-bit saturating counters. Assume no conflict between branches, i.e., each branch gets its own 16-entry table of counters. Given this assumption work the following two problems:

    1. Consider the following C code:
      int code (void) {
              int i, j;
              int c = 0;
              i = 1;
      loop:   j = 1;
      loop2:  c = c + i + j;
              j++;
              if (j <= 7) goto loop2;
              i++;
              if (i <= 1000) goto loop;
              return c;
      }
      
      Each of the two if statement corresponds to a conditional branch. How many mispredictions will there be? Your answer may be correct within a tolerance of 10 mispredictions, i.e., if the correct answer is n and your answer is n±10 then your answer is still counted as correct.





















    2. Now consider the following code that performs the same computation:
      int code2 (void) {
              int i, j;
              int c = 0;
              i = 1;
      loop:   j = 1;
      loop2:  c = c + i + j;
              j++;
              if (j <= 3) goto loop2;
      loop3:  c = c + i + j;
              j++;
              if (j <= 7) goto loop3;
              i++;
              if (i <= 1000) goto loop;
              return c;
      }
      
      Each of the three if statements corresponds to a conditional branch. How many mispredictions will there be? Again, your answer may be correct within a tolerance of 10 mispredictions.


























  4. Assume we have a MIPS machine that uses the speculative version of Tomasulo's algorithm. It can issue up to two instructions per clock cycle. Make the following assumptions about the machine:
    1. FP addition executes in 3 cycles
    2. FP multiplication executes in 5 cycles
    3. Integer operations and effective address addition execute in 1 cycle
    4. Data can be accessed in the data cache in 1 cycle
    5. The FPUs are fully pipelined and there are 2 FP adders (FA1,FA2), 1 FP multiplier (FM1)
    6. There are two ALUs (A1,A2), which handle effective address calculation along with all other integer operations.
    7. There are 8 entries in the reorder buffer
    8. The Issue and Commit columns of the table can handle up to two instructions (each) per cycle
    9. The CDB can handle up to two instructions per clock cycle
    10. At most one instruction may be accessing the data cache in each clock cycle
    11. The remaining FUs all have two reservation stations are, named: FMUL[1,2], FAD[1,2], ALU[1,2], ST[1,2], LD[1,2]
    Copy and fill in the following table indicating the cycles in which each instruction performs each step. Note that this table has been augmented to allow you to track the funtional unit, reservation station and ROB entry being used, which should ease your job of finding hazards. Notice that the FU/EX entry indicates what functional unit is performing computation for the instruction, and what cycles it begins on and ends on. Three copies of the Tomasulo worksheet are provided; use the first two for practice if you like, but write your final answer in the third one (we will grade only the third worksheet).

    Tomasulo Worksheet #1 (use this for practice)
    Instruction
    Issue
    ROB/ResStat
    FU/EX
    MEM
    CDB
    Commit
    L.D F1,0(R1)       
    MUL.D F1,F1,F6       
    L.D F2,0(R2)       
    ADD.D F2,F2,F1      
    S.D F2,0(R1)       
    DADDIU R1,R1,#8       
    DADDIU R2,R2,#8       

             

    Tomasulo Worksheet #2 (use this for practice)
    Instruction
    Issue
    ROB/ResStat
    FU/EX
    MEM
    CDB
    Commit
    L.D F1,0(R1)       
    MUL.D F1,F1,F6       
    L.D F2,0(R2)       
    ADD.D F2,F2,F1      
    S.D F2,0(R1)       
    DADDIU R1,R1,#8       
    DADDIU R2,R2,#8       

           
    Tomasulo Worksheet #3 (write your final answer here)
    Instruction
    Issue
    ROB/ResStat
    FU/EX
    MEM
    CDB
    Commit
    L.D F1,0(R1)       
    MUL.D F1,F1,F6       
    L.D F2,0(R2)       
    ADD.D F2,F2,F1      
    S.D F2,0(R1)       
    DADDIU R1,R1,#8       
    DADDIU R2,R2,#8