1. Our group tested the cache structure of the Intel Pentium III, the Pentium II, and the Ultra Sparc II. After some initial difficulties, we wrote four functions that generated good data about each system's caches. All three processors have 32 KB of cache divided into two 16 KB sections: one for instructions and one for data. All three processors also use 32-byte lines. As a result, we used the same test functions on every processor.
The first two functions we wrote were "nice" to all caches. The first function mallocs a cache-sized block of memory and accesses every byte of the block in sequence a constant number of times. The second function accesses only the first byte of each line. We ran this test 32 times the constant from the first function: because it touches one byte per 32-byte line, the factor of 32 ensured that both functions performed an equal number of accesses.
The third function was "mean" to all caches. It malloced four times as much memory as could fit in the cache, and we then accessed each byte of the malloced area in succession. We ran this test .25 times the constant used in the first function, again to ensure that the same number of accesses was performed. The fourth function repeatedly accesses two variables that hash to the same location in cache. Its inner loop iterates the same number of times as the loop in function three, and two accesses take place per iteration, so the outer loop constant is .125 times the constant in the first function.
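The four tests described above can be sketched in C as follows. This is a minimal reconstruction, not our original lab code: the names `sweep`, `nice_sequential`, `nice_line_stride`, `mean_sweep`, and `mean_conflict` are made up for illustration, the constant `k` stands for "the constant from the first function" and is assumed to be a multiple of 8, and the conflict test assumes a direct-mapped 16 KB data cache so that two addresses 16 KB apart land on the same line. On a set-associative cache, two addresses alone may not evict each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_BYTES (16 * 1024) /* L1 data cache size given in the write-up */
#define LINE_BYTES  32          /* cache line size given in the write-up */

/* Hypothetical helper: write one byte every `stride` bytes across `len`
   bytes, `passes` times. Returns the number of accesses performed. */
static unsigned long sweep(volatile unsigned char *buf, size_t len,
                           size_t stride, unsigned long passes)
{
    unsigned long accesses = 0;
    for (unsigned long p = 0; p < passes; p++)
        for (size_t i = 0; i < len; i += stride) {
            buf[i] = 1;
            accesses++;
        }
    return accesses;
}

/* Function 1 ("nice"): every byte of a cache-sized block, k passes. */
unsigned long nice_sequential(unsigned long k)
{
    unsigned char *buf = malloc(CACHE_BYTES);
    assert(buf != NULL);
    unsigned long n = sweep(buf, CACHE_BYTES, 1, k);
    free(buf);
    return n;
}

/* Function 2 ("nice"): first byte of each line, 32 * k passes. */
unsigned long nice_line_stride(unsigned long k)
{
    unsigned char *buf = malloc(CACHE_BYTES);
    assert(buf != NULL);
    unsigned long n = sweep(buf, CACHE_BYTES, LINE_BYTES, 32 * k);
    free(buf);
    return n;
}

/* Function 3 ("mean"): every byte of a block 4x the cache, k / 4 passes. */
unsigned long mean_sweep(unsigned long k)
{
    unsigned char *buf = malloc(4 * CACHE_BYTES);
    assert(buf != NULL);
    unsigned long n = sweep(buf, 4 * CACHE_BYTES, 1, k / 4);
    free(buf);
    return n;
}

/* Function 4 ("mean"): two bytes CACHE_BYTES apart, which share a line
   only under the direct-mapped assumption; k / 8 outer passes. */
unsigned long mean_conflict(unsigned long k)
{
    unsigned char *buf = malloc(2 * CACHE_BYTES);
    assert(buf != NULL);
    volatile unsigned char *a = buf;
    volatile unsigned char *b = buf + CACHE_BYTES;
    unsigned long accesses = 0;
    for (unsigned long p = 0; p < k / 8; p++)
        for (size_t i = 0; i < 4 * CACHE_BYTES; i++) {
            *a = 1;
            *b = 1;
            accesses += 2;
        }
    free(buf);
    return accesses;
}
```

With any `k` that is a multiple of 8, all four functions return the same access count, which is the point of the 32, .25, and .125 scaling factors. The access counter is there only to make that check possible; a timing run would drop it.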
The data we gathered suggests that, when code is written "nicely", level 1 cache speeds up execution time significantly. Across all of the platforms we tested, the average performance increase from running code that is nice to the cache is 24% ((19% + 20% + 33%)/3). The Pentium II runs the "nice" tests in 60s and 62s, for an average "nice" time of 61 seconds. 1 - (61s [average "nice" time] / 75s [cache stress]) = 19%, which is the performance increase of the "nice" code. Using the same equation, the performance increase is 20% for the Pentium III and 33% for the Ultra Sparc II. One could read these numbers as the difference between access time in L1 cache and access time in L2, because all of our cache stress tests allocate less memory than the L2 cache holds.
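Written out, the arithmetic in the paragraph above looks like this. The symbols t_nice and t_stress are labels introduced here for the average "nice" time and the cache stress time; only the Pentium II times are quoted in this section, so only that case is worked in full.

```latex
\[
\text{increase} = 1 - \frac{t_{\text{nice}}}{t_{\text{stress}}}
\]
\[
\text{Pentium II:}\quad
t_{\text{nice}} = \frac{60\,\text{s} + 62\,\text{s}}{2} = 61\,\text{s},
\qquad
1 - \frac{61\,\text{s}}{75\,\text{s}} \approx 0.19 = 19\%
\]
\[
\text{average increase} = \frac{19\% + 20\% + 33\%}{3} = 24\%
\]
```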
2. We implemented the superscalar tests as a series of four functions. Each function performs four calculations: an addition, a multiplication, a division followed by a multiplication, and a subtraction. Each statement also includes an assignment, and the same constants are used in each statement. The only difference between the four calculations in each function is the dependencies, which follow the scheme in the lab description. The first function has no dependencies. The second function's fourth statement depends on the value assigned in its first statement. The third function's third statement depends on the function's first statement. In the fourth function, each statement depends on the statement before it. Each series of four statements is performed a large number of times (i.e. 10e9).
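The dependency scheme above can be sketched in C as follows. This is an illustration rather than our original code: the `results` struct, the function names, and the operands `x` and `y` are hypothetical, and `y` is assumed to be nonzero. An optimizing compiler can hoist or fold these loop bodies, so a sketch like this is only meaningful for timing when built without optimization (for example, `gcc -O0`).

```c
#include <assert.h>

/* Hypothetical container for the four assigned values. */
typedef struct { long a, b, c, d; } results;

/* Function 1: no statement depends on another. */
results no_deps(long x, long y, unsigned long n)
{
    results r = {0, 0, 0, 0};
    assert(y != 0);
    for (unsigned long i = 0; i < n; i++) {
        r.a = x + y;
        r.b = x * y;
        r.c = x / y * y;
        r.d = x - y;
    }
    return r;
}

/* Function 2: the fourth statement uses the result of the first. */
results fourth_on_first(long x, long y, unsigned long n)
{
    results r = {0, 0, 0, 0};
    assert(y != 0);
    for (unsigned long i = 0; i < n; i++) {
        r.a = x + y;
        r.b = x * y;
        r.c = x / y * y;
        r.d = r.a - y;
    }
    return r;
}

/* Function 3: the third statement uses the result of the first. */
results third_on_first(long x, long y, unsigned long n)
{
    results r = {0, 0, 0, 0};
    assert(y != 0);
    for (unsigned long i = 0; i < n; i++) {
        r.a = x + y;
        r.b = x * y;
        r.c = r.a / y * y;
        r.d = x - y;
    }
    return r;
}

/* Function 4: every statement uses the result of the one before it. */
results full_chain(long x, long y, unsigned long n)
{
    results r = {0, 0, 0, 0};
    assert(y != 0);
    for (unsigned long i = 0; i < n; i++) {
        r.a = x + y;
        r.b = r.a * y;
        r.c = r.b / y * y;
        r.d = r.c - y;
    }
    return r;
}
```

Each function would be timed separately with the same `n`, so that any difference in run time comes from the dependencies rather than from the amount of work.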
The final two functions should have created situations that failed to take advantage of a processor's superscalar capabilities. However, the third function runs as fast as or faster than the baselines on the Pentium II and III. Perhaps this is a result of compiler tricks that we failed to catch. Another possibility is that the Pentium II and III can only perform two integer calculations in a cycle. Either way, on the Pentium II and III the third function runs within 3% of the first two functions, which is negligible and can be attributed to other processes taking up more or less of the processor.
For the Ultra Sparc, there was a noticeable slowdown from the baselines for function 3: the average of the baseline times is 9% better than the third function's time. Since we used gcc for both the Pentiums and the Ultra Sparc, this leads us to believe that the Pentium can only do two integer calculations concurrently. The data also suggests that the Ultra Sparc can do three integer operations concurrently. Function 2 is only 1% less effective than function 1. Because function 3 is 9% less effective, there may be an extra integer operation going on.
The biggest hit to superscalar processing's effectiveness came from function 4. The average improvement from having a superscalar architecture is 16% ((10% + 14% + 23%)/3). The performance increase was 10% on the Pentium II, 14% on the Pentium III, and 23% on the Ultra Sparc. These numbers were arrived at via the equation 1 - (average baseline time / function 4 time), where the average baseline time is (base1 + base2)/2.
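As a small sketch, the equation above can be written as a C helper. The function names `superscalar_gain` and `mean3` are hypothetical, and the times passed to `superscalar_gain` in the usage note are made-up values for illustration, not measurements from this lab.

```c
#include <assert.h>

/* 1 - (average baseline time / function 4 time), as in the write-up.
   base1 and base2 are the times of the two baseline functions. */
double superscalar_gain(double base1, double base2, double func4)
{
    assert(func4 > 0.0);
    double avg_base = (base1 + base2) / 2.0;
    return 1.0 - avg_base / func4;
}

/* Plain average of the three per-platform percentages. */
double mean3(double a, double b, double c)
{
    return (a + b + c) / 3.0;
}
```

For example, with illustrative times of 90s for each baseline and 100s for function 4, `superscalar_gain(90.0, 90.0, 100.0)` comes out at about 0.10, or 10%. `mean3(10.0, 14.0, 23.0)` gives roughly 15.7, which rounds to the 16% quoted above.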
3. This project will change our coding in the future. It has forced the realization that we must take the architecture and organization of the processor we are writing for into account.
Without knowledge of a processor's cache structure, we would be inclined to allocate blocks of memory larger than the cache. This creates a situation where the processor must load and re-load lines into cache, forcing the use of slower L2, L3, or main memory. As a result of this lab, we will also avoid directly accessing lines of memory that hash to the same location in cache.
Knowledge of data types will also help our coding. We discovered that floating-point operations are faster than integer operations. Previously we believed that floating-point operations were slower than integer operations, and as a result we desperately tried to avoid floating-point variables in our code. Now we will be more likely to use them. Additionally, because we know that data types smaller than the native word size of a processor tend to run slower than data types of the native size, we will be more likely to use longs instead of shorts or chars when optimizing for time. And because we are aware of the constraints of superscalar processors, we will be more likely to write code without complicated dependencies, since these dependencies prevent the processor from using its superscalar capabilities.
4. See the extensions section for the answer to this question.