E25 Computer Architecture: Lab 4
Real-time Video Processing Using MMX
in partnership with mustafa paksoy

labs home    |    lab1    |    lab2    |    lab3    |    lab4    |    final project

1. abstract.
For this lab, we wrote video processing routines that modify a 8-bit grayscale image feed on the fly. We implemented 5 different algorithms both in plain C and in in-line assembly using MMX SIMD capabilities. We compared the performances of using plain C and using MMX instructions. As expected MMX algorithms performed better, yet the performance difference was not reflected on frame rate. However, a significant performance improment was observable once we started clocking the filters themselves.

2. routines.
The difference routine was already provided to us in MMX assembly form. All other routines were implemented in both C and MMX.

2.1 Edge Detect

Finds edges by taking the gradient of the image in and x and y directions. A threshold is applied to the result and can be used to increase or decrease the sensitivity of the filter.

Basic algorithm:
Current pixel = (|left pixel-right pixel|+|pix above-pix below|)-threshold

MMX and C differences:
This routine was implemented with the same algorithm on both platforms. For the MMX version we had to create two different pointers for accessing right and left pixels since the double pointer was moving 8 pixels at a time. This was causing undesired overlapping images to appear.


2.2 Brightness

Changes the overall brightness of an image by adding a fixed signed value to all pixels twice. Since the signed char value had half the maximum absolute value unsigned char value had, we added it twice to allow the extreme brightness outcomes.

Basic algorithm:
Current pixel = current pixel + (brightness change*2)

MMX and C differences:
Both versions use the same basic algorithm. Since we could not combine signed and unsigned values in assembly we needed to use the absolute value of the brightness change variable and have two different algorithms, one adding and one subtracting pixels.


2.3 Contrast

Changes the deviation of pixels from the average brightness value (assumed 128). Increased contrast means increased deviation, so brighter pixels become brighter and darker pixels becoming darker and vice versa. A double precision floating point number signifies the change in contrast. The deviation of each pixel is augmented or diminished by that value.

Basic algorithm:
Current pixel = 128 + (|current pixel-128|*contrast change)

MMX and C differences:
The C algorithm was very straight forward, the deviation of each pixel was found, applied change, recombined with average brightness. The MMX algorithm was a lot more complex. We generated a "mask" for differentiating between pixels brighter than average and darker than average. We also had two different assembly instruction sets based on weather contrast was going to be increased or decreased. To increase contrast, brighter/darker pixels were made brighter/darker by a fixed value using the aforementioned mask, and vice versa. The mask was inverted by subtracting from an array of all 1s, between applying to brighter and darker pixels. The fixed number was generated based on the range the contrast change variable was in. The if statements required to make this determination turned out to be significantly faster than doing any math with double precision floats.


2.4 Difference

This algorithm was provided as a sample in the set of files provided for this lab. We also implemented it in plain C. It takes the absolute value of the differences between pixels of current frame and a reference frame. The reference frame in this case is the previous frame. A threshold value is subtracted from the result and can be used to modify the sensitivity of the filter.

Basic algorithm:
Current pixel = |current pixel - previous pixel|-threshold

MMX and C differences:
Both versions use the same algorithm. We had to implement the absolute value using the "?" operator in C. In MMX saturated math was relied on, the saturated differences x-y and y-x were added together.


2.5 Motion detect

Detects motion based on the difference algorithm. Pixels in motion are white and pixels not in motion are black. A pixel is taken to be in motion if its difference with previous pixel is greater than a given threshold.

Basic algorithm:
If difference(current pixel) > threshold
current pixel = 255
current pixel = 0

MMX and C differences:
The C algorithm determined if a given pixel was in motion by checking if difference - threshold is greater than zero. The MMX algorithm generated a mask using the packed byte greater than comparator. This comparator assumed signed bytes, but since out threshold value was small and its sign bit was zero there were no problems and the algorithm ran like the C algorithm. This mask was outputted to the screen.


2.6 Motion Overlay

Built on the motion detect filter. Mixes pixels in motion with pixels from previous frame, as a result objects in motion fade.

Basic algorithm:
If pixel is moving
current pixel = (current pixel+previous pixel)/2
current pixel = current pixel

MMX and C differences:
The C algorithm mixes the pixels by mathematically adding their values together and dividing by two. The MMX algorithm replaces moving pixels with pixels from previous frame.


3. about the application.
The application we generated was mostly based on the files provided for the lab. It can apply edge detect, brightness and contrast filters all at the same time. But only one of difference, motion detect, motion overlay filters can run at a time, since they all rely on the same reference previous image. It alternatingly uses registers B and C as source and target, which allows the cumulative filter behavior.

Modified files: mmx/xmain.c, mmx/mmxLib.c, include/mmxLib.h
Created files: mmx/cOperLib.c, include/cOperLib.h

3.1 Key bindings


Capture screen: c
Freeze screen: f
Start feed: s
End feed: e
End application: q

Brightness filter:
Toggle: b
Increase: .
Decrease: /

Contrast filter:
Toggle: o
Increase: [
Decrease: ]

Difference filter (was already implemented):
Toggle: d

Edge detect filter:
Toggle: g

Motion detect filter
Toggle: n

Motion overlay filter
Toggle: v

MMX/C Controls:
Switch between using plain C and MMX assembly instructions: m
Display time to apply filters: r

performance comparison.
We could observe no uniform change in frame rate when filters were on and off. So we measured the time time it took to apply all filters by putting a timer in the while loop that refreshes the image. The process of moving the output to the xim1 was left outside timers. It turned out that the process of updating the reference frame with the current frame at the end took a
significant portion of the total time, 1.4*10^-4s of a total of 1.6*10^-4 in some cases.

Theoretically, since 8 pixels were processed simultaneously with MMX instructions and two instructions can run simultaneosly, if our instuctions are paralellizable 50% of the time we should get an average speed up of around 12x. We did not expect this to be actual case since there usually were other significant differences between our plain C and MMX implementations. For
instance the contrast filter's implementaion the plain C had double precision floating point multiplication, which slowed it down significantly, and resulted in our greatest speedup value, ~13x. In all MMX cases we tried to minimize mathematical computations outside of the assembly block, and magnify the performance impact of using MMX instuctions itself. The plain C also used branching much more extensively than MMX code which is another reason which slowed it down. Still in all cases a significant speedup was observed with MMX instructions barely taking longer than the base case.

MMX (x 10-4 s)
C (x 10-4 s)
motion detect 2.0 6.2 3.1
motion overlay 2.1 6.1 2.9
difference 1.9 9.6 5.1
contrast 1.6 21.2 13.3
brightness 1.5 3.7 2.5
edge detect 1.5 8.8 5.9

Table 1.
Performance comparison and speedup for different filters. ( base case, time with all filters off: 1.4E-4 s )

Apart from implementing extra routines, we modified the application to allow for filters to have cumulative effect. We also optimized our assembly code to make most out of the two available pipelines. This had no recordable effect which was probably because PIV's instruction horizon saw far enough into our instructions to avoid dependency conflicts whenever possible. We also improved the application to allow for cumulative filter effect (ex. decreased contrast, increased brightness).

conclusions/what we learned.
Using MMX through in-line assembly is a powerful tool. Yet debugging assembly and coming up with methods of differentiation in absence of branching is a significant challenge. We estimate a ratio of 5 to 1 in time spent working on MMX to time spent working on plain C. With a 4x speed up on average, MMX does not have that good of effort-to-payoff ratio. Regardless, we learned a lot about using in-line assembly, SIMD operations on packed values and saturated math.