Matt Fowles and Sven Olsen's AI Lab 3


In this lab, Sven and I worked together to design and analyze neural networks that identified digits between 0 and 9. In general our analysis found that for small networks, feature based network architectures outperform fully connected networks that have an equivalent number of weights. However for large networks, the different architectures yield indistinguishable results.


Terminology

Number shaped regions-44 Nodes
matt.png
Fully Connected-44 Nodes
f44.png
When we initially began comparing feature based networks to fully connected ones, we used networks with an equal number of nodes for each. It quickly became apparent to us that the fully connected networks were dominating their featured friends. As you can see here, from the steep initial slope and absence of outlying "noise".

After considering this phenomenon for a short while, we reasoned that, since a fully connected network has more weights than a feature based one, it simply has more learning potential. We felt that this made for an uneven playing field, especially when one considers that it takes the fully connected network much longer to do each trial.

Based on these ideas, we chose to define equivalence based not on the number of nodes, but on the number of weights a network has. (This includes connections to the bias node.)



Results

In the images below we compare several networks that have feature areas in them with fully connected networks that have roughly the same number of weights. The data indicates that as the number of weights grows, the actual configuration of the network become less important, as the network can adapt around whatever connections it has. In the smaller networks, the extra direction pushes it toward convergence, in what is essentially an unstable situation for the fully connected counterparts.

Number shaped regions
matt.png
Fully Connected 36 Nodes
f36.png
Our first idea:

We know about what form, more or less, each of the digits is going to take. So we use that knowledge to create a set of 10 features, each of which corresponds to the area that a particular digit fills. Each of these 10 feature areas is fully connected to a set of four hidden nodes. There are an additional four hidden nodes fully connected to all of the input nodes. The output nodes are completely connected to all of the hidden nodes.

As you can see, this produces faster convergence with a smaller final error then any of the smaller networks shown below. Unfortunately, it is hard to determine whether this is a side effect of our feature areas or simply the result of such a large number of weights in a relatively well connected network. As you can see from the second image, a fully connected network with a comparable number of weights has results that are almost indistinguishable from our feature rich case.

We found this disappointing. So we decided to see if smaller, less well connected networks would benefit more from feature use.


Overlapping Square Regions (click here for network architecture diagram)
matt++.png
Fully Connected 26 Nodes
f26.png
Our second brain child:

This network connects 3x3 areas in the image with 2x2 hidden nodes. These areas cover the entire image, overlapping heavily. This leads to 16 different feature areas, each of which has 4 associated hidden nodes. These hidden nodes are completely connected to the output nodes.

Unlike our first attempt, we restrict the size of the features and have a greater number of them. Thus in this network, each hidden node connected to a much smaller number of input nodes. Thus we have a greater number of nodes then a fully connected network with an equivalent number of weights. As seen by the fact that this network of 64 hidden nodes contains roughly the same number of weights as a fully connected network with a 26 node hidden layer.

Again, the results are nearly indistinguishable. One could argue that the feature based network provides a better match; however, that could just as easily be attributed to the random seed and the convergence that it happened to provide.


Each Side, Center, and Full Coverage (click here for network architecture diagram)
sven+-.png
Lucky Fully Connected 16 Nodes (click here for network architecture diagram)
f16good.png
Unlucky Fully Connected 16 Nodes
f16bad.png
Our final stroke of genius:

In this case we decided to reduce the number of weights in the network as much as possible. We divided the network into six groups. One group for each edge, one for the center, and one that connects fully with the input nodes. These groups each have five nodes. These hidden nodes are then fully connected to the output nodes, as always.

This network provides consistent, but slightly slower convergence to a slightly larger error range than the above networks. However, it does consistently converge, and the confidence values are almost always above 70% and correct. The equivalent fully connected network does not do so well.

As you can see, each run might or might not converge well, depending entirely on luck. Also worth note, is that the fully connected network always results in more noise, while in the feature based network that noise tends to disappear.

In this case, we have at last found a situation where a feature based network visibly outperforms its fully connected counterpart.




Cluster Analysis

Our analyses found little correlation between strong clustering and success at matching digits. Interestingly, some of our cluster charts showed a cluster of problem cases, while most simply clustered similarly shaped numbers, like 3, 5, 6, and 8. Even these blocks of similar numbers were not always perfect, with the odd 7 or 4 sneaking into such a branch.

Our last feature based network from above is a successful, minimalistic network that outperforms equivalently sized fully connected networks. However, all of our cluster analyses (for which the program didn't segfault) are essentially indistinguishable. This motivated us to investigate the theoretical underpinnings of cluster analysis.

Simply put, a cluster diagram graphically depicts which inputs have similar internal node values. However, the term "similar" has not been rigorously defined for us. One could conceive of many different ways of defining "similar" in this context, any number of which might arguably be the best definition.

Applying a little bit of mathematical abstraction, we see that checking each of the N internal node's values for each input provides an N-tuple to represent each element. These N-tuples can be consider points in a vector space. In this abstraction similarity can be defined as distance between these points. Of course, our choice of basis for this space was arbitrary and our choice of distance function must also be arbitrary unless it has been properly motivated. One might argue that a basis consisting of the eigenvectors of the matrix formed from these tuples, would be a more natural basis, as it emerges from the data. Furthermore, each eigenvector would have associated eigenvalues specifying their relative importance in the space. A distance function which relies on the eigenvalues for weighting might then provide better or more informative cluster diagrams. Whether this alternative is in fact better than the current definition of distance we are using or any other, we don't know. But it seems like a problem that could generate several months worth of research and a few interesting papers. Sadly the deadline is too mean for our purposes...

Thus, we do not feel that the lack of apparent clustering in our feature based networks necessarily indicates that the design of our networks could be improved. While the design of any network could almost certainly be improved, we believe that our cluster analyses cannot motivate reworking a design.


Further Comparison of Side, Center, Full Network vs Equivalent Fully Connected Network

We used a modified version of Nori's perl script, as well as a second script of our own, to collect our data.

tar of data % matched with .8 threshold number missed min threshold for perfection
feature 1 0.9917 2 0.55
feature 2 0.9791 5 0.6
feature 3 0.9833 4 0.65
feature 4 0.9917 2 0.7
feature 5 0.9875 3 0.45
fully 1 0.9541 11 none
fully 2 0.9458 13 none
fully 3 0.8708 31 none
fully 4 0.975 6 0.7
fully 5 0.9292 17 none

This data, clearly supports our above analysis. On occasion the equivalent fully connected network will converge, but it does not do so consistently. Even the exceptionally good results from the fully connected network fall short of the average feature based network. The feature based network, which converges well, has some of the worst clustering we have found, while some of the fully connected versions which yield miserable results have beautiful clusters.


Conclusions

From our experiments, we have come to the conclusion that feature detection is only beneficial when used with relatively small networks. Once networks reach a sufficient size, fully connected networks and feature based networks of equivalent size provide equivalent results. However, in smaller networks, appropriately selected features can greatly aid in convergence and the reduction of outlying data.

Last modified 2003-10-07 22:15 EST