Methodology
Initially, we calculated the cross-correlation of each of the unknown hemoglobin DNA sequences with the human globin sequence and plotted the results for each base (a, c, g, and t). These plots were not particularly insightful, since the peaks were lost in the noise. However, when we added the cross-correlations of all four bases together, distinct peaks appeared. These peaks corresponded to instances of the hemoglobin DNA sequences within the larger human genetic sequence. Figure 1 shows the resulting cross-correlations. Unknown A is shown in blue, while unknown B is shown in red. The source code to generate this figure can be found here. The associated data can be downloaded from the course website.
Figure 1: Cross-correlation of Each of Two Hemoglobin Sequences with the Human Sequence
Note that the units in Figure 1 are arbitrary on both axes. A large portion of the cross-correlation is zero because MATLAB's xcorr() function zero-pads the shorter input array to the length of the longer one. We cropped this portion from the figure above to focus on the informative part of the cross-correlation. The portion that appears in the graph is the second half of the output, rounded up.
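The summed per-base cross-correlation described above can be sketched in a few lines. This is an illustrative Python sketch, not the MATLAB code used to produce Figure 1; the function name `match_counts` and the toy sequences are assumptions made for the example. It uses a direct double loop rather than xcorr(), and it computes only the non-negative lags, which correspond to the cropped half shown in the figure.

```python
def match_counts(genome, probe):
    """Sum the four per-base cross-correlations at each non-negative lag.

    For each base, both sequences are turned into 0/1 indicator arrays.
    Their cross-correlation at a given lag counts the positions where
    both sequences hold that base, so the sum over a, c, g and t is the
    total number of matching bases at that lag.
    """
    n, m = len(genome), len(probe)
    totals = [0] * n  # one entry per lag 0..n-1
    for base in "acgt":
        g = [1 if ch == base else 0 for ch in genome]
        p = [1 if ch == base else 0 for ch in probe]
        for lag in range(n):
            # The probe is treated as zero past its end, so only the
            # overlapping positions contribute to the sum.
            overlap = min(m, n - lag)
            totals[lag] += sum(g[lag + k] * p[k] for k in range(overlap))
    return totals


# Toy data (hypothetical, not the course sequences).
genome = "ttacgtacgatt"
probe = "acga"
counts = match_counts(genome, probe)
best_lag = max(range(len(counts)), key=counts.__getitem__)
print(best_lag, counts[best_lag])  # 6 4
```

The direct loop costs on the order of len(genome) × len(probe) operations per base, which is fine for a toy example but slow for full-length sequences; an FFT-based correlation would be the usual choice there.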
As mentioned before, the peaks corresponding to locations of alignment are clearly visible in the graph. Each large peak from unknown A (blue) is aligned with a corresponding peak in unknown B (red). Two of the peaks, the first one and the largest one, have data tips marking the blue and red coordinates. Figure 2, below, shows the cross-correlations normalized by the number of bases in each unknown sequence. Thus, a value of 1.0 indicates that every base in the unknown matches the human sequence at that offset. We did this to compensate for the fact that the two sequences have different lengths. Note that, even after rescaling, unknown A still has higher peaks than unknown B.
Figure 2: Normalized Cross-correlation
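The normalization step amounts to dividing each raw match count by the length of the unknown sequence. The sketch below is a minimal Python illustration with made-up counts; `normalize` and the sample values are hypothetical and do not come from the course data.

```python
def normalize(counts, probe_length):
    """Divide raw match counts by the probe length so 1.0 is a perfect match."""
    if probe_length <= 0:
        raise ValueError("probe_length must be positive")
    return [count / probe_length for count in counts]


# Hypothetical raw counts for a 4-base probe and an 8-base probe.
short_probe = normalize([1, 3, 4], probe_length=4)
long_probe = normalize([2, 6, 8], probe_length=8)
print(short_probe)  # [0.25, 0.75, 1.0]
print(long_probe)   # [0.25, 0.75, 1.0]
```

After normalization the two probes are directly comparable even though the longer one produces larger raw counts.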
It is interesting to note that each peak of unknown A is followed by an echo peak, on average 868 bases after the original peak (standard deviation 20 bases). It is possible that unknown B also has echo peaks, but if present, they are lost in the noise.
We expected the noise to fall around 0.25 on the normalized scale: if the four bases are roughly equally likely and the sequences are unrelated at a given offset, approximately one quarter of the bases will match by chance. Figure 3 shows that the data reflect this quite well.
Figure 3: Normalized Cross-correlation without Stems
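The expected noise floor can be checked with a small simulation. The following Python sketch assumes uniformly random, independent bases, which real genomes only approximate; the sequence lengths, the seed, and the name `mean_match_fraction` are arbitrary choices for illustration.

```python
import random


def mean_match_fraction(genome, probe):
    """Average fraction of matching bases over all full-overlap lags."""
    n, m = len(genome), len(probe)
    fractions = []
    for lag in range(n - m + 1):  # only lags where the probe fits entirely
        matches = sum(1 for k in range(m) if genome[lag + k] == probe[k])
        fractions.append(matches / m)
    return sum(fractions) / len(fractions)


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("acgt") for _ in range(2000))
probe = "".join(rng.choice("acgt") for _ in range(200))
noise_floor = mean_match_fraction(genome, probe)
print(round(noise_floor, 2))
```

Under these assumptions the printed value should land close to 0.25, consistent with the baseline visible in Figure 3.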
Table 1, shown below, tracks the major peaks' magnitudes and locations from Figure 1. The difference in peak locations reflects a slight difference in width and starting point of each hemoglobin gene. The difference in magnitude reflects the greater similarity of unknown A to the human gene, as compared with unknown B. Since mice are more closely related genetically to humans than frogs are, we can conclude that unknown A is the gene from the mouse and that unknown B is the gene from the frog.
| Unknown A Location | Unknown A Magnitude | Unknown B Location | Unknown B Magnitude | Location Difference | Magnitude Difference |
| --- | --- | --- | --- | --- | --- |
| 19626 | 287 | 19631 | 224 | 5 | +28.1% |
| 34616 | 292 | 34621 | 222 | 5 | +31.5% |
| 39552 | 292 | 39557 | 222 | 5 | +31.5% |
| 45793 | 277 | 45798 | 211 | 5 | +31.3% |
| 54881 | 295 | 54886 | 218 | 5 | +35.3% |
| 62280 | 288 | 62285 | 229 | 5 | +25.8% |
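The difference columns can be reproduced from the location and magnitude values in the table. The short Python sketch below assumes the magnitude difference is expressed relative to unknown B, which matches the tabulated percentages; the variable names are illustrative.

```python
# (location A, magnitude A, location B, magnitude B), copied from Table 1.
peaks = [
    (19626, 287, 19631, 224),
    (34616, 292, 34621, 222),
    (39552, 292, 39557, 222),
    (45793, 277, 45798, 211),
    (54881, 295, 54886, 218),
    (62280, 288, 62285, 229),
]

location_diffs = [loc_b - loc_a for loc_a, _, loc_b, _ in peaks]
# Percentage by which A's peak exceeds B's, relative to B's magnitude.
magnitude_diffs = [
    f"{(mag_a - mag_b) / mag_b * 100:+.1f}%" for _, mag_a, _, mag_b in peaks
]

for loc_diff, mag_diff in zip(location_diffs, magnitude_diffs):
    print(loc_diff, mag_diff)
```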
We reused the same code from part 1 to calculate the cross-correlations, this time between the GenBank sequence and each of the unknown sequences. The only changes to the code were the new input dataset file names and the addition of the normalization of the cross-correlations. The code can be found here.
Both unknowns have several peaks near the center, with the most prominent of each data set labeled in Figure 4 below. Unknown B also has two additional peaks near each end. Since the magnitude of the peak in the center is much larger for unknown B than it is for unknown A, we conclude that unknown B corresponds to the HIV strain while unknown A corresponds to the SIV strain. The noise is again around 0.25 for the same reasons mentioned above.
Figure 4: Normalized Cross-correlation of Unknowns with HIV Sequence
One possible way to reduce the noise would be to discard all values below 0.3 after normalization, since the chance-match baseline sits near 0.25. There may be other, more sophisticated methods of removing noise, but they would go beyond simple cross-correlation.
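A threshold of this kind is straightforward to apply. The Python sketch below is one possible reading of the idea, setting sub-threshold values to zero rather than deleting them so that lag positions are preserved; `suppress_noise` and the sample values are hypothetical.

```python
def suppress_noise(normalized, threshold=0.3):
    """Zero out normalized cross-correlation values below the threshold."""
    return [value if value >= threshold else 0.0 for value in normalized]


# Made-up normalized values: a baseline near 0.25 with one clear peak.
signal = [0.24, 0.27, 0.81, 0.26, 0.31, 0.22]
print(suppress_noise(signal))  # [0.0, 0.0, 0.81, 0.0, 0.31, 0.0]
```

The 0.31 entry in the example survives the cut, which shows the limitation of a fixed threshold: chance fluctuations slightly above the cutoff are kept along with genuine peaks.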