19. Phonotactics I (with language identifier)


      Here are written samples of six languages, taken from the BBC World Service Web site. Do you recognize any of these languages? How can you identify a language just by looking at its written form? A writing system that is unusual from the point of view of a reader of English or Chinese would of course stand out right away, and you might or might not be able to identify the language from the particular shapes of the symbols used. Here, however, we are considering only languages that use a Latin alphabet, though in some cases the diacritical marks will offer good clues. (Hopefully the diacritics are displaying correctly in your browser. Charts for entering these symbols in ISO-10646 Unicode can be found here.)


1. Iraqi ta ce ta gayyaci shugaban sufetoci masu duba makamai na Majalissar dinkin duniya, Hans Blix, ya je Baghdaza domin shawarwari a cikin watan Janairu. Kotun soji a Russia ta yanke hukuncin cewa, babban jami'in sojin da ake zargi da aikata laifuka kan farar hulla a Chechnya bashi da laifi. A Najeriya Jam'iyar PDP mai mulkin kasar ta bayyana shirye-shiryenta na fitar da dan takarar shugaban kasa a babban taron jamiyyar da za'a fara ranar juma'a mai zuwa, a Abuja babban birnin kasar.

2. Udhëheqësi i qipriotëve turq Rauf Denktash ka thënë se do të japë dorëheqjen nëse Turqia do të përpiqet ta detyrojë për të nënshkruar një marrëveshje për ribashkimin e Qipros. Në një koment për gazetën turke Hurriyet, zoti Denktash tha se ai nuk po mendon për dorëheqjen dhe se deri tani nuk i është bërë presion për ta nënshkruar marrëveshjen, por do të ndryshojë mendim nëse kjo ndodh.

3. Tim hak asasi manusia PBB untuk Pantai Gading mengukuhkan bahwa telah terjadi pelanggaran serius oleh berbagai pihak dalam pemberontakan berkepanjangan di negara itu. Ketua tim, yang juga wakil komisaris tinggi Hak Asasi Manusia, Bertrand Ramcharan, mengatakan proses perdamaian perlu sekali dipercepat, guna menghindari ancaman terjadinya pelanggaran lebih jauh.

4. Almanya tarafından inşa edilen hızlı tren Çin'in liman kenti Şangay'da törenle işletmeye açıldı. Ancak, Şangay havaalanıyla iş merkezini birbirine başlayan hızlı trenin kamuya açılması, bir yılı bulacak. Saatte 400 kilometreye kadar hız yapan tren, havaalanıyla kent merkezi arasındaki süreyi 7 dakikada katediyor.

5. Waxana uu Kibaki ku ballan qaaday in maamulkiisu uu dib u habayn baahsan ku samayn doono waxyaabaha ay ka mid yihiin; waxbarashada dugsiga hoose oo lacag la'aan laga dhigi doono, dhismaha daryeel caafimaad, dhaqaale xooggan iyo in la dabar gooyo musuqmaasuqa. Waxa uu Kibaki ku yiri dadkii uu la hadlayey. Waxa aan idiin ballan qaaday in aanan idin qalbi jabin...anigoo idiin mahad haya ayaan noqon doonaa khaadim aad leedihiin oon kibir iyo isla weynin lahayn.

6. Podmetnuti požari uništili su do sada dijelove četiriju središta za azilante u Australiji i na australskom Božićnom Otoku u Indijskom Oceanu, pri čemu je počinjena šteta od više milijuna dolara. Posljednji slučaj podmetanja požara dogodio se upravo u izbjegličkom logoru na Božićnom Otoku; tamo su azilanti naoružani cijevima svladali stražare te zapalili veliku prostoriju za ručavanje. Nepunih 24 sata prije toga, sličan se prosvjed dogodio i u izbjegličkom logoru Woomera, smještenom duboko u pustinji na jugu kontinenta.


     Unless you know the language in question and recognize at least some of its words, or its use of diacritics if there are any, you have little choice but to observe how the letters are combined with each other, that is, which letters follow which other letters, and then determine which language typically has the patterns of letter sequences you notice. Word length may also give some clues.
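     The idea of observing which letters follow which can be made concrete by counting adjacent letter pairs (bigrams). The following is a minimal sketch in Python, not part of the original exercise; the function name letter_bigrams is made up for illustration, and the two strings are short excerpts from samples 5 and 6 above.

```python
from collections import Counter

def letter_bigrams(text):
    """Count adjacent letter pairs inside each word, ignoring case."""
    counts = Counter()
    for word in text.lower().split():
        # Keep only alphabetic characters; letters with diacritics are kept,
        # while apostrophes, digits and punctuation are dropped.
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts

# Short excerpts from samples 5 and 6 above.
sample_5 = "Waxa uu Kibaki ku yiri dadkii uu la hadlayey"
sample_6 = "Podmetnuti požari uništili su do sada dijelove"

for name, text in [("sample 5", sample_5), ("sample 6", sample_6)]:
    print(name, letter_bigrams(text).most_common(5))
```

     Even on such short excerpts, the most frequent pairs differ from sample to sample. Longer texts give more stable counts.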

     First, try to identify the languages above based only on your own familiarity with them, using either individual words or patterns of letter combinations that you recognize. When you have finished, you can check your answers by copying and pasting the above passages into this online language identifier, the TextCat Language Guesser:

http://odur.let.rug.nl/~vannoord/TextCat/Demo/textcat.html

or this language identifier, from Xerox:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser


     If you like, you can also download a free Language Identifier (2.01 MB) onto your own computer from a company called Lextek:

http://www.lextek.com/langid/

Download:
http://www.libertypages.com/langid/li/language_identifier_setup.zip

     After you have identified the languages, go to Ethnologue to learn more about where each one is spoken, how many speakers it has, what languages it is related to, and so forth. Note that both online identifiers 'guess' the wrong language for the first and fifth samples, since they do not support these languages; the Lextek identifier, which claims to be able to identify 260 different languages, gets both right. However, it gets the second and fourth samples wrong, apparently because the diacritics get lost when the text is pasted in. The TextCat and Xerox identifiers, on the other hand, preserve the diacritics and get these two right. You can look on the Internet for more samples of different languages to try out on the language identifiers. Be resourceful in your search methods!
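     To see how much information the diacritics carry, you can strip them from a sample yourself and compare. The sketch below is only an illustration of one way diacritics can disappear; it is an assumption, not a description of what actually happens inside the Lextek program. The helper names strip_diacritics and non_ascii_letters are made up for this example.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def non_ascii_letters(text):
    """Return the set of letters that fall outside plain ASCII."""
    return {ch for ch in text if ch.isalpha() and not ch.isascii()}

# Opening words of sample 2 above.
sample_2 = "Udhëheqësi i qipriotëve turq"
stripped = strip_diacritics(sample_2)

print(stripped)
print(non_ascii_letters(sample_2), non_ascii_letters(stripped))
```

     Note that letters which are not built from a base letter plus a mark, such as the dotless ı in sample 4, are left unchanged by this method.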

     If you had to design a method for identifying a language based on a written sample, how would you go about it? If you are interested in some ways to do this, a 6-page paper (in .pdf format) on this subject is linked from the Xerox site here (local copy here).
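     One well-known approach, sketched below, is to build a ranked profile of the most frequent character n-grams for each language and to pick the language whose profile is closest to that of the unknown text, using the "out-of-place" rank comparison described by Cavnar and Trenkle (1994). This is a toy illustration under stated assumptions: the identifiers mentioned above may work differently, the function names are made up, and the "training" texts are just single sentences from samples 3, 4 and 6, far too little for reliable results.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Map each of the most frequent character n-grams (1..n_max) to its rank."""
    counts = Counter()
    for word in text.lower().split():
        word = "".join(ch for ch in word if ch.isalpha())
        if not word:
            continue
        padded = f"_{word}_"  # underscores mark the word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, profiles):
    """Return the label of the profile with the smallest out-of-place distance."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))

# Tiny "training" texts: the first sentence of samples 3, 4 and 6 (shortened).
training = {
    "sample 3": "Tim hak asasi manusia PBB untuk Pantai Gading mengukuhkan bahwa "
                "telah terjadi pelanggaran serius oleh berbagai pihak",
    "sample 4": "Almanya tarafından inşa edilen hızlı tren Çin'in liman kenti "
                "Şangay'da törenle işletmeye açıldı",
    "sample 6": "Podmetnuti požari uništili su do sada dijelove četiriju "
                "središta za azilante u Australiji",
}
profiles = {name: ngram_profile(text) for name, text in training.items()}

# An unseen fragment taken from the second sentence of sample 3.
unseen = "mengatakan proses perdamaian perlu sekali dipercepat"
print(identify(unseen, profiles))
```

     The labels are kept as sample numbers so as not to give away the answers to the exercise. With realistic amounts of training text per language, the same procedure becomes much more dependable.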

      It should be clear by now that each language has its own patterns or rules of allowable letter sequences in its writing system, which reflect the permissible sound sequences in the spoken language. The rules governing permissible sound sequences are called phonotactics. Phonotactics is mainly a phonological rather than a phonetic issue, but it is something phoneticians are also interested in. Here is a short essay on the phonotactics of different languages. An understanding of the phonotactics of a language is important in such applications as synthesized speech (samples here) and automatic speech recognition (ASR).
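      As a rough, spelling-based illustration of such patterns, the sketch below lists the consonant-letter runs that begin the words of a text. It rests on loud assumptions: the five letters a, e, i, o, u stand in for vowels (so vowel letters with diacritics, such as ë or ü, would wrongly count as consonants), and letters are treated as if they were sounds, which ignores digraphs. The function name initial_clusters is made up for this example.

```python
from collections import Counter

# Assumption: a rough orthographic stand-in for vowels, adequate for sample 6.
VOWEL_LETTERS = set("aeiou")

def initial_clusters(text):
    """Count the runs of consonant letters that begin each word."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        cluster = ""
        for ch in letters:
            if ch in VOWEL_LETTERS:
                break
            cluster += ch
        if cluster:
            counts[cluster] += 1
    return counts

# Part of the last sentence of sample 6 above.
sample_6 = ("sličan se prosvjed dogodio i u izbjegličkom logoru "
            "smještenom duboko u pustinji")
print(initial_clusters(sample_6).most_common())
```

      Comparing the word-initial clusters found in different samples gives a first impression of how languages differ in the sequences they allow at the start of a word.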

      In the next two pages, we will look more closely at syllable structure, allowable sequences of phonemes in English, and ways to get help with exercises C, D and E of chapter 4 in Ladefoged.

Next: Phonotactics II: Syllable structure

