CMPEN 341 Second Midterm
Question-1 [Parallelism] [28 pts] This question has five parts [a through e]
Consider two threads, T1 and T2, with the following code sequences (‘$zero’ represents a register that is hardwired to the value 0):
T1:
my_again: lw r2, 0(r1)
lw r4, 4(r1)
add r2, r2, r8
add r4, r4, r8
sw r2, 4(r1)
sw r4, 0(r1)
sub r1, r1, -4
bne r1, $zero, my_again
T2:
your_again: lw r2, 0(r1)
add r2, r2, r9
sw r2, 0(r1)
lw r4, 4(r1)
add r4, r4, r9
sw r4, 4(r1)
sub r1, r1, -12
bne r1, $zero, your_again
Each of these threads is to be scheduled and executed separately on a two-slot VLIW machine, with the goal of achieving the minimum number of cycles. A ‘bundle’ in this architecture holds two instructions: the first slot can hold only an ALU or branch instruction, whereas the second slot can hold only a load or store instruction. You are allowed to reorder independent instructions and to change addressing offsets (if needed). You are not allowed to combine instructions.
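For intuition, the slot restriction above can be expressed as a small checker. This is only an illustrative sketch in Python; the opcode classification is an assumption based on the opcodes appearing in T1 and T2:

```python
# Sketch: validate a bundle against this VLIW machine's slot rules.
# Slot 0 may hold only an ALU or branch instruction; slot 1 only a load/store.
ALU_OR_BRANCH = {"add", "sub", "mul", "bne"}   # opcodes used in T1/T2
LOAD_STORE = {"lw", "sw"}

def bundle_ok(slot0, slot1):
    """slot0/slot1 are opcode strings; 'nop' (an empty slot) is always legal."""
    ok0 = slot0 == "nop" or slot0 in ALU_OR_BRANCH
    ok1 = slot1 == "nop" or slot1 in LOAD_STORE
    return ok0 and ok1
```

For example, ("add", "lw") is a legal bundle, while ("lw", "add") is not, because a load can never occupy the first slot.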
(a) Map T1 to this VLIW machine. Explain each instruction-to-execution-slot mapping decision you make in sufficient detail (i.e., why you decided so; couldn’t the instruction have been scheduled in an earlier slot (cycle)?). [5 pts]
(b) Map T2 to this VLIW machine. Explain each instruction-to-execution-slot mapping decision you make in sufficient detail (i.e., why you decided so; couldn’t the instruction have been scheduled in an earlier slot (cycle)?). [5 pts]
(c) Repeat (a), but this time assuming that any instruction can be mapped to any execution slot. [5 pts]
(d) Repeat (b), but this time assuming that any instruction can be mapped to any execution slot. [5 pts]
(e) Suppose you decided to move to a simultaneous multi-threading architecture (SMT). The SMT architecture you are considering can execute up to 4 instructions in parallel. Further, in a given cycle, any combination of independent instructions (from the same or different threads) can be executed in parallel. Show a scheduling of these two threads together on the SMT machine with the goal of improving throughput. [8 pts]
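The issue-bandwidth aspect of part (e) can be pictured with a toy scheduler. This sketch only models width and round-robin thread interleaving; it deliberately omits dependence checking, which your answer to (e) must still respect:

```python
# Sketch: issue up to `width` instructions per cycle, drawing round-robin
# from two per-thread instruction queues (dependence checks omitted).
from collections import deque

def smt_issue(t1, t2, width=4):
    q = [deque(t1), deque(t2)]
    cycles = []
    while q[0] or q[1]:
        issued, src = [], 0
        while len(issued) < width and (q[0] or q[1]):
            if q[src]:
                issued.append(q[src].popleft())
            src ^= 1                     # alternate between the two threads
        cycles.append(issued)
    return cycles
```

With 6 instructions in one thread and 3 in the other, a 4-wide machine drains all 9 in 3 cycles, illustrating why co-scheduling two threads can raise throughput over running them back to back.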
Question-2 [Branch Prediction and Stalls] [21 pts] This question has five parts [a through e]
(a) Explain the functionalities of the following hardware components: BTB (Branch Target Buffer) and BHT (Branch History Table). Discuss how these two components complement each other. [2 pts]
(b) Consider a branch that has the following outcome pattern (T for taken, N for not taken).
N T N N T T T N N T T T N T N N
How many branches are predicted correctly by a static (0-bit) always-taken branch predictor for this branch outcome pattern? [5 pts]
(c) Using the same sequence from part (b), how many branches are predicted correctly by a dynamic 1-bit predictor whose initial state is Taken (T)? [5 pts]
(d) How many branches are predicted correctly by a dynamic, saturating-counter 2-bit predictor for the branch outcome pattern in part (b)? Suppose the four states are strong not taken (SN), weak not taken (WN), weak taken (WT), and strong taken (ST). Assume that the initial prediction state is WT (weak taken). [5 pts]
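Counts like those asked for in (b) through (d) can be sanity-checked by simulating the predictor. A minimal sketch for the 2-bit saturating counter (the state encoding 0=SN, 1=WN, 2=WT, 3=ST is an assumption; any encoding with the same transitions works):

```python
def two_bit_correct(pattern, state=2):
    """Count correct predictions of a 2-bit saturating-counter predictor.
    pattern: string of 'T'/'N' outcomes; state: 0=SN, 1=WN, 2=WT, 3=ST."""
    correct = 0
    for outcome in pattern:
        predict_taken = state >= 2          # WT and ST predict taken
        taken = outcome == "T"
        correct += predict_taken == taken
        # saturating update: move one state toward the actual outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct
```

For example, starting from WT, the short pattern "TNT" yields 2 correct predictions. For the static always-taken predictor of part (b), the count is simply the number of T outcomes in the pattern.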
(e) What are the fundamental differences between a ‘pipeline stall’ and a ‘pipeline flush’? Provide examples to highlight which instructions cause a stall and which cause a flush, and why. [4 pts]
Question-3 [Load/Store Queues] [25 pts] This question has three parts [a through c]
Consider the following sequence of instructions:
add r1, r6, r1
sw r1, 0(r12)
lw r7, 8(r9)
lw r6, -4(r10)
lw r8, 4(r11)
add r4, r6, r7
add r8, r8, r4
sw r8, 8(r9)
lw r2, -8(r3)
(a) Explain how a ‘store queue’ can be used to improve/optimize the performance of this code sequence. [8 pts]
(b) Discuss the relationship of the optimization in part (a) to ‘data forwarding’ (bypassing) as used in pipelining. [7 pts]
(c) Explain how a ‘load queue’ can be used to improve/optimize the performance of this code sequence. [10 pts]
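To see the mechanism behind parts (a) and (c), here is a minimal store-to-load forwarding sketch. It is illustrative only: addresses are plain integers, and real hardware additionally checks access sizes, partial overlaps, and program order:

```python
class StoreQueue:
    """Toy store queue: buffer stores, and forward their values to younger
    loads with a matching address (store-to-load forwarding)."""
    def __init__(self, memory):
        self.memory = memory          # backing "memory": dict addr -> value
        self.pending = []             # buffered (addr, value), oldest first

    def store(self, addr, value):
        self.pending.append((addr, value))   # buffered, not yet in memory

    def load(self, addr):
        # search youngest-first: the latest matching store wins
        for a, v in reversed(self.pending):
            if a == addr:
                return v              # forwarded; no memory access needed
        return self.memory[addr]

    def drain(self):
        # commit buffered stores to memory in program order
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()
```

In the code sequence above, a load whose address matches an older buffered store (e.g., the same offset off the same base register) can receive its value directly from the queue instead of waiting for the store to reach memory.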
Question-4 [Hazards and Unrolling] [26 pts] This question has three parts [a through c]
(a) Identify all RAW, WAW, and WAR dependences in the loop shown below. Write down the dependences within a ‘single iteration’ only. Use the following notation to indicate, for example, a dependence between Ix and Iy through register r3: Ix-Iy (r3). (‘$zero’ represents a register that is hardwired to the value 0, and ‘mul’ is the opcode for multiplication.) [8 pts]
I1: lw s1, 0(r1)
I2: mul s2, s1, s0
I3: add s3, s3, s2
I4: mul s2, s1, s1
I5: add s2, s2, s3
I6: sw s2, 0(r1)
I7: sub r1, r1, 8
I8: bne r1, $zero, loop
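The three dependence kinds asked for in (a) follow mechanically from each instruction's destination and source registers. A small sketch of that classification (the register-list representation is an assumption made for illustration):

```python
# Sketch: classify RAW/WAW/WAR dependences of instruction 2 on instruction 1,
# given each instruction's destination register and source registers
# (program order is 1 -> 2).
def dependences(write1, reads1, write2, reads2):
    deps = []
    if write1 and write1 in reads2:
        deps.append("RAW")            # 2 reads what 1 wrote (true dependence)
    if write1 and write1 == write2:
        deps.append("WAW")            # both write the same register
    if write2 and write2 in reads1:
        deps.append("WAR")            # 2 overwrites a register 1 still reads
    return deps
```

For example, I2 (mul s2, s1, s0) after I1 (lw s1, 0(r1)) gives `dependences("s1", ["r1"], "s2", ["s1", "s0"])`, a RAW through s1.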
(b) Unroll the loop above once and eliminate as many dependences as you can via ‘register renaming’.
What dependences remain after renaming? Why can’t they be eliminated via renaming? [8 pts]
(c) Two important techniques have emerged recently for achieving high performance in processors. One is ‘speculative execution’, where instructions (or sequences of instructions) are executed before all the information needed to commit the instruction has been nailed down. The other is ‘simultaneous multithreading’ (SMT), where the processor can issue instructions from multiple threads (or processes), potentially in the same cycle. Describe the pros and cons of each approach with respect to the other. Do these approaches capture different opportunities for parallelism, or are they different ways to achieve the same result? If both were implemented in a single system, would you expect their projected improvements to be additive? Why or why not? [10 pts]