In this lab, Sven and I worked together to design and
analyze neural networks that identified digits between 0 and 9. In
general our analysis found that for small networks, feature based
network architectures outperform fully connected networks that have an
equivalent number of weights. However for large networks, the different
architectures yield indistinguishable results.
Terminology
|
When we initially began comparing feature based
networks to fully connected ones, we used networks with an equal number
of nodes for each. It quickly became apparent to us that the fully
connected networks were dominating their featured friends. As you can
see here, from the steep initial slope and absence of outlying "noise". After considering this phenomenon for a short while, we reasoned that, since a fully connected network has more weights than a feature based one, it simply has more learning potential. We felt that this made for an uneven playing field, especially when one considers that it takes the fully connected network much longer to do each trial. Based on these ideas, we chose to define equivalence based not on the number of nodes, but on the number of weights a network has. (This includes connections to the bias node.) |
Results
In the images below we compare several networks that
have feature areas in them with fully connected networks that have
roughly the same number of weights. The data indicates that as the
number of weights grows, the actual configuration of the network become
less important, as the network can adapt around whatever connections it
has. In the smaller networks, the extra direction pushes it toward
convergence, in what is essentially an unstable situation for the fully
connected counterparts.
|
Our first idea: We know about what form, more or less, each of the digits is going to take. So we use that knowledge to create a set of 10 features, each of which corresponds to the area that a particular digit fills. Each of these 10 feature areas is fully connected to a set of four hidden nodes. There are an additional four hidden nodes fully connected to all of the input nodes. The output nodes are completely connected to all of the hidden nodes. As you can see, this produces faster convergence with a smaller final error then any of the smaller networks shown below. Unfortunately, it is hard to determine whether this is a side effect of our feature areas or simply the result of such a large number of weights in a relatively well connected network. As you can see from the second image, a fully connected network with a comparable number of weights has results that are almost indistinguishable from our feature rich case. We found this disappointing. So we decided to see if smaller, less well connected networks would benefit more from feature use. |
|||
|
Our second
brain child: This network connects 3x3 areas in the image with 2x2 hidden nodes. These areas cover the entire image, overlapping heavily. This leads to 16 different feature areas, each of which has 4 associated hidden nodes. These hidden nodes are completely connected to the output nodes. Unlike our first attempt, we restrict the size of the features and have a greater number of them. Thus in this network, each hidden node connected to a much smaller number of input nodes. Thus we have a greater number of nodes then a fully connected network with an equivalent number of weights. As seen by the fact that this network of 64 hidden nodes contains roughly the same number of weights as a fully connected network with a 26 node hidden layer. Again, the results are nearly indistinguishable. One could argue that the feature based network provides a better match; however, that could just as easily be attributed to the random seed and the convergence that it happened to provide. |
|||
|
Our final stroke of
genius: In this case we decided to reduce the number of weights in the network as much as possible. We divided the network into six groups. One group for each edge, one for the center, and one that connects fully with the input nodes. These groups each have five nodes. These hidden nodes are then fully connected to the output nodes, as always. This network provides consistent, but slightly slower convergence to a slightly larger error range than the above networks. However, it does consistently converge, and the confidence values are almost always above 70% and correct. The equivalent fully connected network does not do so well. As you can see, each run might or might not converge well, depending entirely on luck. Also worth note, is that the fully connected network always results in more noise, while in the feature based network that noise tends to disappear. In this case, we have at last found a situation where a feature based network visibly outperforms its fully connected counterpart. |
|||
| |
Cluster Analysis
Our analyses found little correlation between strong
clustering and success at matching digits. Interestingly, some of our
cluster charts showed a cluster of problem
cases, while most simply clustered similarly shaped numbers, like 3, 5,
6, and 8. Even these blocks of similar numbers
were
not always perfect, with the odd 7 or 4 sneaking into such a branch.
Our last feature based network from above is a successful, minimalistic
network that outperforms equivalently sized fully connected networks.
However, all of our cluster analyses (for which the program didn't
segfault) are essentially indistinguishable. This motivated us to
investigate the theoretical underpinnings of cluster analysis.
Simply put, a cluster diagram graphically depicts which inputs have
similar internal node values. However, the term "similar" has not been
rigorously defined for us. One could conceive of many different ways of
defining "similar" in this context, any number of which might arguably
be the best definition.
Applying a little bit of mathematical abstraction, we see that checking
each of the N internal node's values for each input provides an N-tuple
to represent each element. These N-tuples can be consider points in a
vector space. In this abstraction similarity can be defined as distance
between these points. Of course, our choice of basis for this space was
arbitrary and our choice of distance function must also be arbitrary
unless it has been properly motivated. One might argue that a basis
consisting of the eigenvectors of the matrix formed from these tuples,
would be a more natural basis, as it emerges from the data.
Furthermore, each eigenvector would have associated eigenvalues
specifying their relative importance in the space. A distance function
which relies on the eigenvalues for weighting might then provide better
or more informative cluster diagrams. Whether this alternative is in
fact better than the current definition of distance we are using or any
other, we don't know. But it seems like a problem that could generate
several months worth of research and a few interesting papers. Sadly
the deadline is too mean for our purposes...
Thus, we do not feel that the lack of apparent clustering in our
feature based networks necessarily indicates that the design of our
networks could be improved. While the design of any network could
almost certainly be improved, we believe that our cluster analyses
cannot motivate reworking a design.
Further Comparison of Side, Center, Full Network vs Equivalent Fully Connected Network
We used a modified version of Nori's perl script, as
well as a second script
of our own, to collect our data.
| tar of data | % matched with .8 threshold | number missed | min threshold for perfection |
| feature 1 | 0.9917 | 2 | 0.55 |
| feature 2 | 0.9791 | 5 | 0.6 |
| feature 3 | 0.9833 | 4 | 0.65 |
| feature 4 | 0.9917 | 2 | 0.7 |
| feature 5 | 0.9875 | 3 | 0.45 |
| fully 1 | 0.9541 | 11 | none |
| fully 2 | 0.9458 | 13 | none |
| fully 3 | 0.8708 | 31 | none |
| fully 4 | 0.975 | 6 | 0.7 |
| fully 5 | 0.9292 | 17 | none |
This data, clearly supports our above analysis. On occasion the equivalent fully connected network will converge, but it does not do so consistently. Even the exceptionally good results from the fully connected network fall short of the average feature based network. The feature based network, which converges well, has some of the worst clustering we have found, while some of the fully connected versions which yield miserable results have beautiful clusters.
Conclusions
From our experiments, we have come to the conclusion that feature detection is only beneficial when used with relatively small networks. Once networks reach a sufficient size, fully connected networks and feature based networks of equivalent size provide equivalent results. However, in smaller networks, appropriately selected features can greatly aid in convergence and the reduction of outlying data.Last modified 2003-10-07 22:15 EST