| 1. abstract. For this lab, we wrote video processing routines that modify an 8-bit grayscale image feed on the fly. We implemented 5 different algorithms both in plain C and in in-line assembly using MMX SIMD capabilities, and compared the performance of the two versions. As expected, the MMX algorithms performed better, yet the difference was not reflected in the frame rate. However, a significant performance improvement was observable once we started timing the filters themselves.
 2. routines. The difference routine was already provided to us in MMX assembly form. All other routines were implemented in both C and MMX.
 2.1 Edge Detect
 Outline:
 Finds edges by taking the gradient of the image in the x and y directions. A
 threshold is applied to the result and can be used to increase or decrease the
 sensitivity of the filter.
 Basic algorithm: Current pixel = (|left pixel - right pixel| + |pixel above - pixel below|) - threshold
 MMX and C differences: This routine was implemented with the same algorithm on both platforms. For
 the MMX version we had to create two separate pointers for accessing the right
 and left pixels, since the double pointer was moving 8 pixels at a time. Without
 them, undesired overlapping images appeared.
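The basic algorithm above can be sketched in plain C as follows. This is a minimal illustration rather than the lab code: the function name `edge_detect`, the row-major frame layout, the zeroed border pixels and the clamping to 0..255 are all assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: edge detect on a w x h 8-bit grayscale frame stored
   row by row. Border pixels are set to 0 (assumption) and the result is
   clamped to the 0..255 range. */
void edge_detect(const uint8_t *src, uint8_t *dst, int w, int h, int threshold)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                dst[y * w + x] = 0;
                continue;
            }
            /* horizontal and vertical gradient magnitudes */
            int gx = abs(src[y * w + x - 1] - src[y * w + x + 1]);
            int gy = abs(src[(y - 1) * w + x] - src[(y + 1) * w + x]);
            int v = gx + gy - threshold;
            dst[y * w + x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
    }
}
```

Raising `threshold` suppresses weak gradients, which is how the sensitivity of the filter is controlled.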
 Image:
 2.2 Brightness
 Outline:
 Changes the overall brightness of an image by adding a fixed signed value to
 all pixels twice. Since the maximum absolute value of a signed char is half
 that of an unsigned char, we added it twice to allow the extreme
 brightness outcomes.
 Basic algorithm: Current pixel = current pixel + (brightness change * 2)
 MMX and C differences: Both versions use the same basic algorithm. Since we could not combine signed
 and unsigned values in assembly, we needed to use the absolute value of the
 brightness change variable and have two different code paths, one adding to and
 one subtracting from the pixels.
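A scalar C sketch of the add-or-subtract approach described above is shown here. The helpers `adds_u8` and `subs_u8` are hypothetical names; they compute per byte what the MMX unsigned saturating instructions PADDUSB and PSUBUSB compute on eight bytes at once. Whether the lab code used exactly those instructions is not stated in this report, so treat that as an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Unsigned saturating add and subtract on one byte (hypothetical helpers). */
static uint8_t adds_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Applies the signed brightness change twice, using only its magnitude and
   choosing between an adding path and a subtracting path. */
void brightness(uint8_t *pix, size_t n, signed char change)
{
    uint8_t mag = (uint8_t)(change < 0 ? -(int)change : (int)change);
    for (size_t i = 0; i < n; i++) {
        if (change >= 0)
            pix[i] = adds_u8(adds_u8(pix[i], mag), mag);
        else
            pix[i] = subs_u8(subs_u8(pix[i], mag), mag);
    }
}
```

With a change of -128 a fully white pixel (255) ends at 0, and with +127 a black pixel ends at 254, which is why the value is applied twice.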
 Images:
 2.3 Contrast
 Outline:
 Changes the deviation of pixels from the average brightness value (assumed
 to be 128). Increased contrast means increased deviation, so brighter pixels
 become brighter and darker pixels become darker, and vice versa. A double
 precision floating point number signifies the change in contrast. The
 deviation of each pixel is augmented or diminished by that value.
 Basic algorithm: Current pixel = 128 ± (|current pixel - 128| * contrast change), keeping the sign of the original deviation
 MMX and C differences: The C algorithm was very straightforward: the deviation of each pixel was
 found, the change was applied, and the result was recombined with the average brightness. The MMX algorithm
 was a lot more complex. We generated a "mask" to differentiate between
 pixels brighter than average and pixels darker than average. We also had two
 different assembly instruction sequences, chosen based on whether contrast was going to be
 increased or decreased. To increase contrast, brighter pixels were made brighter and
 darker pixels darker by a fixed value using the aforementioned mask, and vice
 versa to decrease it. Between applying it to the brighter and the darker pixels, the mask
 was inverted by subtracting it from an array of all 1s. The fixed value was chosen
 based on the range the contrast change variable fell in. The if statements
 required to make this determination turned out to be significantly faster than
 doing any math with double precision floats.
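The two approaches can be contrasted with a short C sketch. The first function follows the plain C description (signed deviation from 128 scaled by a double). The second is a scalar emulation of the mask idea for the increase-contrast case only. All names, the clamping, and the treatment of a pixel equal to 128 as "darker" are assumptions made for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain C style: scale the signed deviation from 128, then clamp. */
void contrast(uint8_t *pix, size_t n, double change)
{
    for (size_t i = 0; i < n; i++) {
        double v = 128.0 + (pix[i] - 128) * change;
        if (v < 0.0) v = 0.0;
        if (v > 255.0) v = 255.0;
        pix[i] = (uint8_t)v;
    }
}

static uint8_t adds_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Mask style, increase only: brighter pixels gain a fixed step and darker
   pixels lose it. In MMX the mask would cover 8 pixels per register. */
void contrast_increase_masked(uint8_t *pix, size_t n, uint8_t step)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t mask = (uint8_t)(pix[i] > 128 ? 0xFF : 0x00);
        uint8_t up = (uint8_t)(step & mask);             /* step where brighter */
        uint8_t down = (uint8_t)(step & (uint8_t)~mask); /* step where darker */
        pix[i] = subs_u8(adds_u8(pix[i], up), down);
    }
}
```

The mask version trades the per-pixel multiplication for a fixed step, which matches the report's observation that avoiding double precision math was the main source of speedup for this filter.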
 Images:
 2.4 Difference
 Outline:
 This algorithm was provided as a sample in the set of files provided for this
 lab. We also implemented it in plain C. It takes the absolute value of the
 difference between the pixels of the current frame and those of a reference frame. The
 reference frame in this case is the previous frame. A threshold value is
 subtracted from the result and can be used to modify the sensitivity of the
 filter.
 Basic algorithm: Current pixel = |current pixel - previous pixel| - threshold
 MMX and C differences: Both versions use the same algorithm. We had to implement the absolute value
 using the "?" operator in C. The MMX version relied on saturated math: the
 saturated differences x-y and y-x were added together.
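The saturated-math trick works because one of the two unsigned saturated differences is always zero, so their sum equals the absolute difference. A scalar C sketch of both variants follows; the function names and the saturating subtraction of the threshold are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Unsigned saturating subtract on one byte (hypothetical helper). */
static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Absolute difference the way the C version is described, with "?". */
uint8_t abs_diff_ternary(uint8_t x, uint8_t y)
{
    return (uint8_t)(x > y ? x - y : y - x);
}

/* Absolute difference the way the MMX version is described:
   sat(x - y) + sat(y - x). One term is always 0. */
uint8_t abs_diff_saturated(uint8_t x, uint8_t y)
{
    return (uint8_t)(subs_u8(x, y) + subs_u8(y, x));
}

/* Difference filter: |current - previous| - threshold, floored at 0. */
void difference(const uint8_t *cur, const uint8_t *prev, uint8_t *dst,
                size_t n, uint8_t threshold)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = subs_u8(abs_diff_saturated(cur[i], prev[i]), threshold);
}
```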
 Image:
  2.5 Motion detect
 Outline:
 Detects motion based on the difference algorithm. Pixels in motion are white
 and pixels not in motion are black. A pixel is taken to be in motion if its
 difference from the corresponding pixel of the previous frame is greater than a given threshold.
 Basic algorithm:
 If difference(current pixel) > threshold
     current pixel = 255
 else
     current pixel = 0
 MMX and C differences: The C algorithm determined whether a given pixel was in motion by checking if
 difference - threshold is greater than zero. The MMX algorithm generated a mask
 using the packed byte greater-than comparator. This comparator assumes signed
 bytes, but since our threshold value was small and its sign bit was zero there
 were no problems, and the algorithm behaved like the C algorithm. This mask was
 output to the screen.
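The difference between an unsigned comparison and a signed byte comparison can be made concrete with a scalar C sketch. The function names are hypothetical, and `as_signed_byte` simply reinterprets a byte the way a signed packed comparison such as PCMPGTB would. Note that under the signed reading a difference of 128 or more counts as negative; the report observed no problems in practice, so this is shown only to illustrate the assumption being made.

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t abs_diff(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* Reinterpret a byte as a signed 8-bit value (two's complement). */
static int as_signed_byte(uint8_t v)
{
    return v >= 128 ? (int)v - 256 : (int)v;
}

/* C style: unsigned comparison of the difference against the threshold. */
void motion_detect(const uint8_t *cur, const uint8_t *prev, uint8_t *dst,
                   size_t n, uint8_t threshold)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(abs_diff(cur[i], prev[i]) > threshold ? 255 : 0);
}

/* MMX style: the mask comes from a signed byte greater-than comparison. */
void motion_detect_signed(const uint8_t *cur, const uint8_t *prev,
                          uint8_t *dst, size_t n, uint8_t threshold)
{
    for (size_t i = 0; i < n; i++) {
        int d = as_signed_byte(abs_diff(cur[i], prev[i]));
        dst[i] = (uint8_t)(d > as_signed_byte(threshold) ? 0xFF : 0x00);
    }
}
```

For small differences and a small threshold both functions produce the same mask, which is the situation the report describes.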
 Images:
                         2.6 Motion Overlay
 Outline:
 Built on the motion detect filter. Mixes pixels in motion with pixels from
 the previous frame. As a result, objects in motion fade.
 Basic algorithm:
 If pixel is moving
     current pixel = (current pixel + previous pixel) / 2
 else
     current pixel = current pixel
 MMX and C differences: The C algorithm mixes the pixels by adding their values
 together and dividing by two. The MMX algorithm instead replaces moving pixels with
 the corresponding pixels from the previous frame.
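The two behaviors described above can be sketched side by side in C. The averaging function follows the C description, while the replacing function emulates a mask-based select of the kind MMX code typically uses in place of a branch. The names and the exact select expression are assumptions, not the lab code.

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t abs_diff(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* C style: moving pixels become the average of current and previous. */
void overlay_average(const uint8_t *cur, const uint8_t *prev, uint8_t *dst,
                     size_t n, uint8_t threshold)
{
    for (size_t i = 0; i < n; i++) {
        if (abs_diff(cur[i], prev[i]) > threshold)
            dst[i] = (uint8_t)((cur[i] + prev[i]) / 2);
        else
            dst[i] = cur[i];
    }
}

/* MMX style: moving pixels are replaced by the previous frame's pixels.
   The select is (prev AND mask) OR (cur AND NOT mask). */
void overlay_replace(const uint8_t *cur, const uint8_t *prev, uint8_t *dst,
                     size_t n, uint8_t threshold)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t mask = (uint8_t)(abs_diff(cur[i], prev[i]) > threshold ? 0xFF : 0x00);
        dst[i] = (uint8_t)((prev[i] & mask) | (cur[i] & (uint8_t)~mask));
    }
}
```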
 Images:
           3. about the application.
 The application we built was mostly based on the files provided for the
 lab. It can apply the edge detect, brightness and contrast filters all at the same
 time. However, only one of the difference, motion detect and motion overlay filters can
 run at a time, since they all rely on the same reference (previous) image.
 It alternately uses registers B and C as source and target, which allows the
 cumulative filter behavior.
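The alternating source/target scheme can be illustrated with a small C sketch. Everything here is hypothetical: the `filter_fn` signature, the `apply_chain` function and the two demo filters are stand-ins chosen to show the swapping pattern, not the application's actual interfaces.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical filter signature: read n pixels from src, write to dst. */
typedef void (*filter_fn)(const uint8_t *src, uint8_t *dst, size_t n);

/* Two demo filters used only to exercise the chain. */
void invert(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(255 - src[i]);
}

void halve(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] / 2);
}

/* Applies the filters in order, alternating buffers b and c as source and
   target. Returns the buffer that holds the final result. */
uint8_t *apply_chain(uint8_t *b, uint8_t *c, size_t n,
                     const filter_fn *filters, size_t count)
{
    uint8_t *src = b;
    uint8_t *dst = c;
    for (size_t k = 0; k < count; k++) {
        filters[k](src, dst, n);
        uint8_t *tmp = src; /* the output becomes the next input */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

Because each filter reads the previous filter's output, the effects accumulate without any extra copies between stages.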
 Modified files: mmx/xmain.c, mmx/mmxLib.c, include/mmxLib.h
 Created files: mmx/cOperLib.c, include/cOperLib.h
 3.1 Key bindings
 Control:
 Capture screen: c
 Freeze screen: f
 Start feed: s
 End feed: e
 End application: q
 Filters:
 Brightness filter:
 Toggle: b
 Increase: .
 Decrease: /
 Contrast filter:
 Toggle: o
 Increase: [
 Decrease: ]
 Difference filter (was already implemented):
 Toggle: d
 Edge detect filter:
 Toggle: g
 Motion detect filter:
 Toggle: n
 Motion overlay filter:
 Toggle: v
 MMX/C Controls:
 Switch between using plain C and MMX assembly instructions: m
 Display time to apply filters: r
 4. performance comparison.
 We could observe no uniform change in frame rate when filters were on and off.
 So we measured the time it took to apply all filters by putting a timer
 in the while loop that refreshes the image. The process of moving the output
 to the xim1 was left outside the timers. It turned out that the process of
 updating the reference frame with the current frame at the end took a
 significant portion of the total time, 1.4*10^-4 s of a total of 1.6*10^-4 s in
 some cases.
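A timing wrapper of the kind described can be sketched with the standard C `clock()` function. The report does not say which timer the application used, so the use of `clock()`, the averaging over several repeats, and the names `frame_fn`, `invert_frame` and `seconds_per_call` are all assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for "apply the enabled filters to one frame". */
typedef void (*frame_fn)(uint8_t *frame, size_t n);

/* Demo workload standing in for the real filter code. */
void invert_frame(uint8_t *frame, size_t n)
{
    for (size_t i = 0; i < n; i++)
        frame[i] = (uint8_t)(255 - frame[i]);
}

/* Times only the filter call, averaged over `repeats` runs. Anything done
   outside this function (such as copying the output to the display image)
   is excluded from the measurement. */
double seconds_per_call(frame_fn fn, uint8_t *frame, size_t n, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        fn(frame, n);
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC / (double)repeats;
}
```

Averaging over many repeats helps when a single call takes on the order of 10^-4 s, which is close to the resolution of many timers.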
 Theoretically, since 8 pixels are processed simultaneously with MMX
 instructions and two instructions can run simultaneously, if our instructions
 are parallelizable 50% of the time we should get an average speedup of around
 12x. We did not expect this to be the actual case, since there usually were other
 significant differences between our plain C and MMX implementations. For instance, the plain C implementation of the contrast filter used double precision
 floating point multiplication, which slowed it down significantly and
 resulted in our greatest speedup value, ~13x. In all MMX cases we tried to
 minimize mathematical computations outside of the assembly block, to magnify
 the performance impact of the MMX instructions themselves. The plain C code also used
 branching much more extensively than the MMX code, which is another reason
 it was slower. Still, in all cases a significant speedup was observed, with the MMX
 versions taking barely longer than the base case.
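One way to read the 12x estimate, assuming that half of the instructions issue in pairs (two per cycle) and the other half issue alone, is the following arithmetic:

```latex
\[
S_{\text{est}} \approx 8 \times \left(0.5 \times 2 + 0.5 \times 1\right) = 8 \times 1.5 = 12
\]
```

Here 8 is the number of pixels handled per MMX instruction and 1.5 is the average number of instructions issued per cycle under that assumption.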
 
                    
 | FILTER | MMX (x 10^-4 s) | C (x 10^-4 s) | speedup |
 | motion detect | 2.0 | 6.2 | 3.1 |
 | motion overlay | 2.1 | 6.1 | 2.9 |
 | difference | 1.9 | 9.6 | 5.1 |
 | contrast | 1.6 | 21.2 | 13.3 |
 | brightness | 1.5 | 3.7 | 2.5 |
 | edge detect | 1.5 | 8.8 | 5.9 |
 Table 1. Performance comparison and speedup for different filters (base case, time with all filters off: 1.4*10^-4 s).
 5. extensions.
 Apart from implementing extra routines, we modified the application to allow
 filters to have a cumulative effect (e.g. decreased contrast combined with
 increased brightness). We also optimized our assembly code to
 make the most of the two available pipelines. This had no measurable effect,
 probably because the PIV's instruction horizon saw far enough into
 our instructions to avoid dependency conflicts whenever possible.
 6. conclusions/what we learned.
 Using MMX through in-line assembly is a powerful tool. Yet debugging assembly
 and coming up with ways to differentiate between cases in the absence of branching is a
 significant challenge. We estimate a ratio of 5 to 1 between time spent working on
 MMX and time spent working on plain C. With a typical speedup of around 4x, MMX
 does not have that good an effort-to-payoff ratio.
 Regardless, we learned a lot about using in-line assembly, SIMD operations on
 packed values and saturated math.
 |