Adventures in Vowel Harmony 2003

Research Diary, Summer 2003

Emily Thomforde '04
Swarthmore College
ethomfo1@swarthmore.edu

15 July 2003

This morning I started out trying to run the Swarm simulations on Penguin, but got a TkExtra error. I think this means I have to change some settings to let me open windows dynamically, but I'll have to find a computer person to help me with that. I'm also having problems compiling test files. I think I'll have to ask Sean where the local copy of Swarm is running, so I can modify my paths.

I started compiling a Spanish corpus. This involved some Google searching, resulting in downloading a Spanish Bible translation. A lot of this needs to be cleaned up, including: 1) Accents removed (accented letters replaced with unaccented versions); 2) Re-formatting to one word per line; 3) Duplicates eliminated; 4) Diphthongs handled. I think I'll try this last one by both suggested methods (delete words with diphthongs / reduce clusters to first vowel) and see how they come out.
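
The four cleanup steps above could be sketched in Python roughly as follows. This is only an illustration, not the script actually used: the names `strip_accents` and `clean_corpus` are hypothetical, any run of adjacent vowels is treated as a "diphthong" (which also catches hiatus, as in "creo"), and "y" is treated as a consonant.

```python
import re
import unicodedata

VOWELS = "aeiou"
# Assumption: any run of two or more vowel letters counts as a cluster.
CLUSTER = re.compile(f"[{VOWELS}]{{2,}}")

def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_corpus(raw_text, diphthongs="reduce"):
    """Return a sorted list of unique, lowercased, unaccented words.

    diphthongs="drop"   deletes any word containing a vowel cluster;
    diphthongs="reduce" keeps only the first vowel of each cluster.
    """
    words = set()  # a set eliminates duplicates
    for token in re.findall(r"[^\W\d_]+", strip_accents(raw_text).lower()):
        if CLUSTER.search(token):
            if diphthongs == "drop":
                continue
            token = CLUSTER.sub(lambda m: m.group(0)[0], token)
        words.add(token)
    return sorted(words)
```

Writing `"\n".join(clean_corpus(text))` to a file gives the one-word-per-line format. Note that stripping accents this way also turns "ñ" into "n", and that reducing clusters can merge distinct words into one entry.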

I figured out how to run my simulations remotely by using ssh -X, which enables X11 forwarding so the Swarm windows display on my local machine.

Added homophony*20 to Observer.m harmonyGraph. This should help illustrate causation, as harmony can now be seen to rise before homophony and fall after. This makes sense in my head and eventually will on paper.

Preliminary Spanish results:

front vowels: ie
back vowels: aou
Input file: Spanish.txt

Average number of syllables per word is 2.82.
Weighted average is 2.82.
Harmony threshold = 30.41%
New harmony threshold = 33.42%
Harmony threshold for first two syllables = 52.81%

Entire word harmony:
973 out of 2749 words were harmonic. (35.39%)
1776 out of 2749 words were disharmonic. (64.61%)

Harmony in the first two syllables:
2161 out of 2749 words were harmonic. (78.61%)
588 out of 2749 words were disharmonic. (21.39%)
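
For reference, counts like the ones above could be produced by a check along these lines. This is a hedged sketch, not the VHFC itself: it assumes each vowel letter is one syllable nucleus, that a word is harmonic when all its vowels fall in a single class, and that words with one vowel (or none) count as harmonic. The function names are hypothetical.

```python
FRONT = set("ie")
BACK = set("aou")

def vowel_classes(word):
    """Return 'F' or 'B' for each front/back vowel in the word, in order."""
    return ["F" if ch in FRONT else "B" for ch in word if ch in FRONT | BACK]

def is_harmonic(classes):
    """A sequence of vowel classes is harmonic if it uses at most one class."""
    return len(set(classes)) <= 1

def tally(words):
    """Count words harmonic over the whole word and over the first two syllables."""
    whole = sum(is_harmonic(vowel_classes(w)) for w in words)
    first_two = sum(is_harmonic(vowel_classes(w)[:2]) for w in words)
    return whole, first_two
```

Dividing each count by `len(words)` gives the percentages in the report format.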

Daily total hours: 6

_______________________________________________________

16 July 2003

This morning I went to an Emergence group at Bryn Mawr. This counts as work. There was a presentation on scale as it relates to multi-agent systems. There was much discussion of granularity of processes.

I'm trying to start on the Hungarian corpus, but am having some formatting errors. Right now Unix sees the file as one string, but emacs does not. I may have to reformat on a PC and port the data over later.

Preliminary results for REVERSED Spanish corpus:

front vowels: ie
back vowels: aou
Input file: reversed.txt

Average number of syllables per word is 2.82.
Weighted average is 2.82.
Harmony threshold = 29.60%
New harmony threshold = 32.93%
Harmony threshold for first two syllables = 53.86%

Entire word harmony:
973 out of 2749 words were harmonic. (35.39%)
1776 out of 2749 words were disharmonic. (64.61%)

Harmony in the first two syllables:
1941 out of 2749 words were harmonic. (70.61%)
808 out of 2749 words were disharmonic. (29.39%)
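
The diary does not record how reversed.txt was produced. Assuming each word was simply reversed letter by letter, the small sketch below shows why the entire-word figures match the forward run exactly while the first-two-syllable figures differ: whole-word harmony does not depend on vowel order, but the first two syllables of a reversed word are the last two of the original. The helper names are hypothetical.

```python
FRONT = set("ie")
BACK = set("aou")

def classes(word):
    """Class ('F' or 'B') of each vowel in the word, in order."""
    return ["F" if ch in FRONT else "B" for ch in word if ch in FRONT or ch in BACK]

def harmonic(cls):
    """Harmonic means at most one vowel class is present."""
    return len(set(cls)) <= 1

word = "zapatero"
reversed_word = word[::-1]  # assumption: reversal is letter by letter

same_whole = harmonic(classes(word)) == harmonic(classes(reversed_word))
first_two_forward = harmonic(classes(word)[:2])            # a, a
first_two_reversed = harmonic(classes(reversed_word)[:2])  # o, e
```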

Preliminary results for Hungarian corpus:

front vowels: ieOU
back vowels: Iaou
Input file: Hungarian_new_modified_2

Average number of syllables per word is 2.64.
Weighted average is 2.63.
Harmony threshold = 25.03%
New harmony threshold = 32.35%
Harmony threshold for first two syllables = 49.89%

Entire word harmony:
568 out of 951 words were harmonic. (59.73%)
383 out of 951 words were disharmonic. (40.27%)

Harmony in the first two syllables:
853 out of 951 words were harmonic. (89.70%)
98 out of 951 words were disharmonic. (10.30%)

Preliminary results for REVERSED Hungarian corpus:

front vowels: ieOU
back vowels: Iaou
Input file: reversed.txt

Average number of syllables per word is 2.64.
Weighted average is 2.63.
Harmony threshold = 25.02%
New harmony threshold = 32.39%
Harmony threshold for first two syllables = 50.02%

Entire word harmony:
568 out of 951 words were harmonic. (59.73%)
383 out of 951 words were disharmonic. (40.27%)

Harmony in the first two syllables:
805 out of 951 words were harmonic. (84.65%)
146 out of 951 words were disharmonic. (15.35%)

It bothers me that the Hungarian corpus has only 951 words. I feel that, in order to claim robustness of results, we would need to test a larger data set. The original Hungarian corpus was 995, and the Bible corpus 1066. I think I may combine the three corpora and rerun the analysis on the composite word list.

Preliminary results for COMBINED Hungarian corpus:

front vowels: ieOU
back vowels: Iaou
Input file: Hungarian_comp_modified

Average number of syllables per word is 2.64.
Weighted average is 2.63.
Harmony threshold = 26.03%
New harmony threshold = 33.15%
Harmony threshold for first two syllables = 50.63%

Entire word harmony:
973 out of 1602 words were harmonic. (60.74%)
629 out of 1602 words were disharmonic. (39.26%)

Harmony in the first two syllables:
1402 out of 1602 words were harmonic. (87.52%)
200 out of 1602 words were disharmonic. (12.48%)

Preliminary results for REVERSED COMBINED Hungarian corpus:

front vowels: ieOU
back vowels: Iaou
Input file: reversed.txt

Average number of syllables per word is 2.69.
Weighted average is 2.68.
Harmony threshold = 25.02%
New harmony threshold = 31.24%
Harmony threshold for first two syllables = 49.98%

Entire word harmony:
1813 out of 2772 words were harmonic. (65.40%)
959 out of 2772 words were disharmonic. (34.60%)

Harmony in the first two syllables:
2423 out of 2772 words were harmonic. (87.41%)
349 out of 2772 words were disharmonic. (12.59%)

Today I removed the homophony probe from the harmony graph because it was more confusing than helpful and altered the scale.

Daily total hours: 6

__________________________________________________________

17 July 2003

I think something may be wrong with my Hungarian data. The other Hungarian corpora I ran treated (i,e) as neutral vowels, placing them in both the 'front' and 'back' categories. The results I posted yesterday were run under a distribution of ieOU/Iaou. In order for me to get conclusive results, I have to decide which is right. I also need to confirm that the VHFC successfully handles neutral vowels.

I spent all day trying to get a good measure for harmony threshold with respect to neutral vowels:

P(front & 1)*P(front & 2)^len + P(back & 1)*P(back & 2)^len

with neutral vowels: P((front|n) & 1)*P((front|n) & 2)^len + P((back|n) & 1)*P((back|n) & 2)^len (can this exceed 100%?)

The inherent problem in this is that the added probabilities of harmonic words can exceed 1. I think this is because it counts twice those words with two neutral vowels.
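
The double counting can be seen in a toy case. The sketch below uses made-up class probabilities (not corpus data) for two-syllable words, assumes the two syllables are independent, and enumerates every class sequence. The front-or-neutral and back-or-neutral terms both include the all-neutral sequence, so their sum overshoots the true harmonic probability by exactly P(n)*P(n).

```python
from itertools import product

# Hypothetical per-syllable class probabilities (illustrative, not corpus data).
P = {"f": 0.2, "b": 0.2, "n": 0.6}

def prob(seq):
    """Probability of a class sequence, assuming independent syllables."""
    p = 1.0
    for c in seq:
        p *= P[c]
    return p

seqs = list(product("fbn", repeat=2))

front_or_neutral = sum(prob(s) for s in seqs if set(s) <= {"f", "n"})
back_or_neutral = sum(prob(s) for s in seqs if set(s) <= {"b", "n"})
naive = front_or_neutral + back_or_neutral  # counts ("n", "n") twice

all_neutral = prob(("n", "n"))
# Exact probability: every sequence that does not mix front and back.
exact = sum(prob(s) for s in seqs if not {"f", "b"} <= set(s))
```

With these numbers the naive sum exceeds 1, and `naive - all_neutral` equals `exact`.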

P(f&1)*P(f&2)^avg*P(n)+P(b&1)*P(b&2)^avg*P(n)?

Daily total hours: 5.5

_____________________________________________________

18 July 2003

My agenda today is to finalise the harmony threshold formula for neutral vowels. It must work for languages with no neutral vowels, as well. My last resort will be to add a field for neutral vowels. When I finish this, I will have to re-run all the harmony corpora.

My solution to the neutral vowel problem is now implemented. Not only did I have to change the formula for the harmony threshold, but I had to redesign the algorithm for calculating harmony in words. I added a new field to the program to account for languages like Hungarian, so now the user is prompted to enter neutral vowels before the other classes.
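
A rough sketch of what a neutral-aware harmony check might look like, with the three vowel classes passed in separately. This is not the actual redesigned algorithm; the function name and the sample strings are hypothetical, and the class strings simply follow the diary's ieOU/Iaou notation with (i, e) moved into the neutral field.

```python
def is_harmonic(word, front, back, neutral):
    """True if the word's non-neutral vowels all belong to one class.

    Neutral vowels are skipped, so a word containing only neutral
    vowels (or no vowels at all) counts as harmonic.
    """
    seen = set()
    for ch in word:
        if ch in neutral:
            continue
        if ch in front:
            seen.add("front")
        elif ch in back:
            seen.add("back")
    return len(seen) <= 1

# Illustrative class strings in the diary's notation (assumed split).
NEUTRAL, FRONT, BACK = "ie", "OU", "Iaou"
```

Passing an empty string for `neutral` reduces this to the plain two-class check, so the same function covers languages with no neutral vowels.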

The current incarnation of the threshold formula exploits the nature of neutral vowels. Because neutral vowels do not affect the harmony (or disharmony) of the word that contains them, it is like they are not vowels at all. At least, any information about their interaction with other vowels is irrelevant. Therefore, we can predict harmony levels by taking into account the average number of NON_NEUTRAL syllables in a word. The old formula stood at:

P(front & 1)*P(front & 2)^(len-1) + P(back & 1)*P(back & 2)^(len-1)

Ignoring neutral vowels, the formula is now:

P(front & 1)*P(front & 2)^(navg-1) + P(back & 1)*P(back & 2)^(navg-1)

where 'navg' is equal to the average number of syllables per word that do not contain neutral vowels.
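
As a hedged sketch, the quantity navg and the threshold that uses it could be computed as below. The function and parameter names are hypothetical; the four probabilities are taken as given inputs, read here as the chance that the first-syllable vowel (`_1`) or a later-syllable vowel (`_2`) is front or back.

```python
def average_non_neutral_syllables(words, vowels, neutral):
    """Average number of vowels per word, skipping neutral vowels."""
    counts = [sum(1 for ch in w if ch in vowels and ch not in neutral)
              for w in words]
    return sum(counts) / len(counts)

def threshold_ignoring_neutrals(p_front_1, p_front_2, p_back_1, p_back_2, navg):
    """Chance-level harmony: first syllable picks a class, the rest must match."""
    exponent = navg - 1
    return p_front_1 * p_front_2 ** exponent + p_back_1 * p_back_2 ** exponent
```

For example, with all four probabilities at 0.5 and navg = 2, the threshold comes out at 0.5, which is the same order as the roughly 50% first-two-syllable thresholds reported earlier.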

I think this works. I'm doing tests to determine whether or not this is a useful figure. I'm also wondering if we should use the average syllable length of polysyllabic words only, given that those are the only ones subject to harmony. However, we would have to be consistent and not take frequency data from monosyllables in order to justify this.

NO. The above measure for a harmony threshold is way too low, as it does not contain those words with only neutral vowels, of which there are many. Maybe we could add this figure to P(n&1)*P(n&2). !!Must also delete neutrals from count_first and count_rest!!

I should probably scrap the whole thing and rewrite it in Python...

Daily total hours: 6.5

Weekly total hours: 24

___________________________________________________

20 July 2003

P(harm) = P(f&1)*P((f|n)&2)^(avg-1) + P(b&1)*P((b|n)&2)^(avg-1) + P(n&1)*P((n|f)&2)^(avg-1) + P(n&1)*P((n|b)&2)^(avg-1) - P(n&1)*P(n&2)^(avg-1)

VICTORY!
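
The formula can be sanity-checked against brute-force enumeration. The sketch below is an illustration under stated assumptions, not the VHFC code: the probabilities are made up, the vowel classes are disjoint (so P((f|n)&2) = P(f&2) + P(n&2)), syllables are independent, and the word length is a whole number. With a fractional average such as 2.64 the formula acts as an interpolation that enumeration cannot reproduce.

```python
from itertools import product

def p_harm(first, rest, avg):
    """Harmony threshold with neutral vowels, following the P(harm) formula."""
    k = avg - 1
    f1, b1, n1 = first["f"], first["b"], first["n"]
    f2, b2, n2 = rest["f"], rest["b"], rest["n"]
    return (f1 * (f2 + n2) ** k
            + b1 * (b2 + n2) ** k
            + n1 * (n2 + f2) ** k
            + n1 * (n2 + b2) ** k
            - n1 * n2 ** k)  # all-neutral words were counted twice

def brute_force(first, rest, length):
    """Sum the probability of every class sequence that does not mix f and b."""
    total = 0.0
    for seq in product("fbn", repeat=length):
        if {"f", "b"} <= set(seq):
            continue  # disharmonic: contains both front and back
        p = first[seq[0]]
        for c in seq[1:]:
            p *= rest[c]
        total += p
    return total

# Hypothetical class probabilities for the first and later syllables.
first = {"f": 0.3, "b": 0.4, "n": 0.3}
rest = {"f": 0.25, "b": 0.35, "n": 0.4}
```

For whole-number lengths the two functions agree, which suggests the subtraction of the all-neutral term is what repairs the earlier overcounting.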

Daily total hours: 1

Weekly total hours: 25