Research Diary, Summer 2003

Emily Thomforde '04
Swarthmore College
ethomfo1@swarthmore.edu

"'The language' is a statistical abstraction." (Steels 1999)



4 August 2003

I've fixed the Tatar and Turkmen numbers in the VHFC chart.

Runs 27-35 of the simulation are now available.

I got a 4,000-word start on the Hungarian corpus, from hirek.mti.hu. I think I've found 14 vowel symbols: 7 short and 7 long. a, e, i, o, u are standard; O and U are expressed as lower-case with an umlaut. Long vowels have an accent, except O and U, which have a ~. It may be the case that I'm mixing up the two front representations, but that is a problem for another day. The strategy for formatting the corpus will be first to reduce all capital letters to lower case. Then vowel symbols must be translated. All punctuation and numbers should be removed. I haven't decided yet whether to remove foreign words (such as New York Times) or include them in the 'language.' Finally, we need to get one word to a line. David mentioned that duplicates should not be removed.
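The formatting steps above could be sketched roughly as follows. This is only an illustration, not the script actually used: the function name `format_corpus` and the vowel mapping in `VOWEL_MAP` are hypothetical, and the target symbols (upper-case letters, which are free once everything has been lower-cased) are placeholders for whatever the real translation scheme is. The sketch also replaces punctuation and digits with a space rather than deleting them outright, so that "word,word" does not fuse into one token. That is an assumption too.

```python
import re
import unicodedata

# Hypothetical mapping from accented vowels to single ASCII symbols.
# The right-hand symbols are placeholders, not the diary's actual scheme.
VOWEL_MAP = {
    "á": "A", "é": "E", "í": "I", "ó": "O", "ú": "U",
    "ö": "Q", "ő": "W", "ü": "Y", "ű": "X",
    # Assumption: older web pages may show long ö/ü as these look-alikes.
    "õ": "W", "û": "X",
}

def format_corpus(text):
    """Return the corpus as one word per line, duplicates kept."""
    # Use precomposed characters so the mapping keys match.
    text = unicodedata.normalize("NFC", text)
    # Step 1: reduce all capital letters.
    text = text.lower()
    # Step 2: translate vowel symbols.
    text = text.translate(str.maketrans(VOWEL_MAP))
    # Step 3: drop punctuation, numbers, and any other unmapped character.
    text = re.sub(r"[^a-zA-Z\s]+", " ", text)
    # Step 4: one word to a line; duplicates are deliberately not removed.
    return "\n".join(text.split())

if __name__ == "__main__":
    print(format_corpus("Az Új hír: 14 gól, NEM több!"))
```

Foreign words are left in place here; filtering them would need a separate word list, which is still an open decision.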

Daily total hours: 5.5

____________________________________________________

5 August 2003

My Hungarian corpus is now almost 9,000 words. What is the target size?

Runs 36-47 test the boundaries of thresholdToZeroHarmony when MaxPromHarm is lowered to 8.0. I've found that any value over seven isn't sensitive enough to develop harmony, but five and under reaches harmony very quickly and the downward curve is not analogous. I wonder if we should have a MaxProbHomoph, but that would involve unfreezing the program. I don't know how long it would take to defrost.

Daily total hours: 6.5

____________________________________________________

6 August 2003

The Hungarian news corpus now consists of three consecutive days' worth of news, entertainment, and sports articles from the MTI (hirek.mti.hu). It is currently formatted as a Word document containing non-ASCII characters. The corpus contains over 12,000 words.

I've gotten up to 68 runs on the simulation. I'm now into testing the effects of probMisspeak on the downward curve. I'm not quite comfortable that it's the only downward parameter, but Mark assures me we're doing it the (a?) right way. I think I'm leaning towards very small values. probMisspeak really has nothing to do with deharmonisation, so it doesn't affect the shape of the curve too significantly. What it does affect is the lowest point of the graph. Because this parameter defines how often bugs will deharmonise words for no reason, higher values will force harmony down more quickly, and lower values will force the process to depend on homophony. Logically 0.05 is much too high.
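A back-of-the-envelope way to see why 0.05 looks too high, under strong simplifying assumptions that are not taken from the simulation itself: suppose each utterance of a harmonic word is deharmonised independently with probability probMisspeak, and nothing ever re-harmonises it. The function name `surviving_harmony` is hypothetical.

```python
def surviving_harmony(prob_misspeak, utterances):
    """Chance a word is still harmonic after a number of utterances.

    Toy model only: each utterance independently deharmonises the word
    with probability prob_misspeak, and there is no re-harmonisation.
    """
    return (1.0 - prob_misspeak) ** utterances

if __name__ == "__main__":
    for p in (0.05, 0.005):
        print(p, round(surviving_harmony(p, 100), 3))
```

In this toy model, after 100 utterances a word is still harmonic with probability of about 0.006 at 0.05, against about 0.61 at 0.005. The real simulation has other forces acting on harmony, so these figures only indicate the scale of the difference.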

Next I'd like to undertake adjusting the level of language interaction (lifespan, density) to maximise lexical homophony. The more words you are exposed to, the higher the chance that you will encounter homophony. So when probMisspeak is as low as 0.005, homophony avoidance will still affect enough lexemes to make a difference in the pattern.

Daily total hours: 6

_____________________________________________________

7 August 2003

This morning I'm undertaking David's Hungarian news corpus.

The results of VHFC runs on both corpora are now available on the VHFC page. Results were consistent with the hypothesis that harmony is more pervasive at the ends of words than at the beginnings. Now with larger corpora, we can look closely at where that occurs, starting with the list of disharmonic words, also available on the VHFC page.

By way of simulation, I'm now hypothesizing that probMisspeak should stay at zero. This way, harmony only decreases to the level at which homophony is eliminated. If deharmonising for clarity is motivated by an intolerance for homophony, then when the stimulus is eliminated, the behavior should cease. Unless I'm thinking about this the wrong way. So I predict that harmony and homophony will find a stable point dependent on the composition of the population lexicon. If agents have more words in their lexica, there will have to be more deharmonising to reach equilibrium, forcing the final harmony levels lower.

Daily total hours: 6

______________________________________________________

8 August 2003

VHFC problems.

More simulation runs; I'm formulating an idea about homophony tolerance. I added it as a parameter, so now I can adjust it at will. I had picked 0.2 in the first place because it was a figure reached relatively rarely. Now I'm trying to work out what the effect of raising or lowering agents' tolerance for lexical homophony is. I think this is a much better variable than probMisspeak, which should have minimal effect on population behaviour.
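One way a tolerance check of this kind might look, as a sketch only: the simulation's actual measure of homophony is not described here, so the definition below (the fraction of meanings whose form is shared with at least one other meaning) is an assumption, as are the names `homophony_rate` and `exceeds_tolerance` and the toy lexicon.

```python
from collections import Counter

def homophony_rate(lexicon):
    """Fraction of meanings whose form is shared with another meaning.

    lexicon maps meaning -> form. This measure is an assumed stand-in,
    not necessarily the one the simulation uses.
    """
    if not lexicon:
        return 0.0
    counts = Counter(lexicon.values())
    shared = sum(1 for form in lexicon.values() if counts[form] > 1)
    return shared / len(lexicon)

def exceeds_tolerance(lexicon, tolerance=0.2):
    """True if an agent with this lexicon would be pushed to deharmonise."""
    return homophony_rate(lexicon) > tolerance

if __name__ == "__main__":
    # Toy lexicon: two of five meanings collapse onto the same form.
    toy = {"m1": "kutu", "m2": "kutu", "m3": "fa", "m4": "haz", "m5": "viz"}
    print(homophony_rate(toy), exceeds_tolerance(toy))
```

With the tolerance exposed as an argument, raising it lets agents live with more collapsed forms before reacting, and lowering it triggers deharmonisation sooner.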

Daily total hours: 6

Weekly total hours: 30