19. Phonotactics I (with language identifier)


      Here are written samples of six languages, taken from the BBC World Service Web site. Do you recognize any of these languages? How can you identify a language just by looking at its written form? A writing system that is unusual from the standpoint of English or Chinese would of course stand out right away, and you might or might not be able to identify the language from the particular shapes of the symbols used. Here, however, we are considering only languages that use a Latin alphabet, though in some cases the diacritical marks will offer good clues. (Hopefully the diacritics are displaying correctly in your browser. Charts for entering these symbols in ISO-10646 Unicode can be found here.)


1. Iraqi ta ce ta gayyaci shugaban sufetoci masu duba makamai na Majalissar dinkin duniya, Hans Blix, ya je Baghdaza domin shawarwari a cikin watan Janairu. Kotun soji a Russia ta yanke hukuncin cewa, babban jami'in sojin da ake zargi da aikata laifuka kan farar hulla a Chechnya bashi da laifi. A Najeriya Jam'iyar PDP mai mulkin kasar ta bayyana shirye-shiryenta na fitar da dan takarar shugaban kasa a babban taron jamiyyar da za'a fara ranar juma'a mai zuwa, a Abuja babban birnin kasar.

2. Udhëheqësi i qipriotëve turq Rauf Denktash ka thënë se do të japë dorëheqjen nëse Turqia do të përpiqet ta detyrojë për të nënshkruar një marrëveshje për ribashkimin e Qipros. Në një koment për gazetën turke Hurriyet, zoti Denktash tha se ai nuk po mendon për dorëheqjen dhe se deri tani nuk i është bërë presion për ta nënshkruar marrëveshjen, por do të ndryshojë mendim nëse kjo ndodh.

3. Tim hak asasi manusia PBB untuk Pantai Gading mengukuhkan bahwa telah terjadi pelanggaran serius oleh berbagai pihak dalam pemberontakan berkepanjangan di negara itu. Ketua tim, yang juga wakil komisaris tinggi Hak Asasi Manusia, Bertrand Ramcharan, mengatakan proses perdamaian perlu sekali dipercepat, guna menghindari ancaman terjadinya pelanggaran lebih jauh.

4. Almanya tarafından inşa edilen hızlı tren Çin'in liman kenti Şangay'da törenle işletmeye açıldı. Ancak, Şangay havaalanıyla iş merkezini birbirine başlayan hızlı trenin kamuya açılması, bir yılı bulacak. Saatte 400 kilometreye kadar hız yapan tren, havaalanıyla kent merkezi arasındaki suüreyi 7 dakikada katediyor.

5. Waxana uu Kibaki ku ballan qaaday in maamulkiisu uu dib u habayn baahsan ku samayn doono waxyaabaha ay ka mid yihiin; waxbarashada dugsiga hoose oo lacag la'aan laga dhigi doono, dhismaha daryeel caafimaad, dhaqaale xooggan iyo in la dabar gooyo musuqmaasuqa. Waxa uu Kibaki ku yiri dadkii uu la hadlayey. Waxa aan idiin ballan qaaday in aanan idin qalbi jabin...anigoo idiin mahad haya ayaan noqon doonaa khaadim aad leedihiin oon kibir iyo isla weynin lahayn.

6. Podmetnuti požari uništili su do sada dijelove četiriju središta za azilante u Australiji i na australskom Božićnom Otoku u Indijskom Oceanu, pri čemu je počinjena šteta od više milijuna dolara. Posljednji slučaj podmetanja požara dogodio se upravo u izbjegličkom logoru na Božićnom Otoku; tamo su azilanti naoružani cijevima svladali stražare te zapalili veliku prostoriju za ručavanje. Nepunih 24 sata prije toga, sličan se prosvjed dogodio i u izbjegličkom logoru Woomera, smještenom duboko u pustinji na jugu kontinenta.


     Unless you know the language in question and recognize at least some of its words, or its use of diacritics if there are any, you have little choice but to observe how the letters combine with each other, that is, which letters follow which other letters, and then work out which language typically shows the patterns of letter sequences you notice. Word length may also give some clues.
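     As a rough illustration of "which letters follow which other letters", here is a minimal Python sketch that counts adjacent letter pairs (bigrams) inside the words of a sample. The function name letter_bigrams is made up for this illustration, and the sketch assumes that accented letters are stored as single precomposed characters.

```python
from collections import Counter

def letter_bigrams(text):
    """Count adjacent letter pairs within words, ignoring case."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only (accented letters count as letters);
        # punctuation, digits and apostrophes are dropped.
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts

# A sentence from sample 5 above.
sample = "Waxa uu Kibaki ku yiri dadkii uu la hadlayey."
for pair, n in letter_bigrams(sample).most_common(5):
    print(pair, n)
```

     Running this on each of the six samples in turn and comparing the most frequent pairs gives a first, informal impression of how differently the languages combine their letters.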

     First, try to identify the languages above based only on your own familiarity with them, using either individual words or patterns of letter combinations that you recognize. When you have finished, you can check your answers by copying and pasting the above passages into this online language identifier from Xerox (it does, however, get some of the samples wrong):

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser


     Or use Google Translate's automatic language detector, though it has previously misidentified two of the samples (how can you tell which two?):

http://translate.google.com/

     Can you think of a more accurate way to identify the languages using just a plain Google search?

     Polyglot 3000 is a downloadable application that claims to be able to identify more than 400 languages. It also reports how confident it is, as a percentage, in each identification. It is considerably more accurate than the online identifiers; it gets all the above samples right:

http://www.polyglot3000.com/download.shtml
    

     After you have identified the languages, go to Ethnologue to learn more about where each one is spoken, how many speakers it has, what languages it is related to, and so forth. You can look on the Internet for more samples of different languages to try out on the language identifiers – be resourceful in your search methods!

     If you had to design a method for identifying a language based on a written sample, how would you go about it? If you are interested in some ways to do this, you can follow the link to a six-page paper (in .pdf format) on this subject from the Xerox site here (local copy here).
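     One possible answer, given here only as a minimal sketch and not necessarily the approach taken in the paper or in any of the tools mentioned above, is to compare character trigram counts: build a trigram profile from a reference text for each language, then rank the languages by how similar their profiles are to the profile of the unknown text. The function names and the choice of cosine similarity are assumptions made for this illustration. The reference texts are short excerpts from samples 3 and 6 above, labeled by sample number so as not to give away the answers.

```python
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Count character trigrams, with spaces marking word edges."""
    # Replace anything that is not a letter with a space, then normalize spacing.
    words = "".join(ch if ch.isalpha() else " " for ch in text.lower()).split()
    padded = " " + " ".join(words) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(p, q):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(n * q[g] for g, n in p.items())
    norm = sqrt(sum(n * n for n in p.values())) * sqrt(sum(n * n for n in q.values()))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Rank the labels in `profiles` by similarity to `text`, best first."""
    target = trigram_profile(text)
    scores = {label: cosine(target, prof) for label, prof in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Tiny reference texts for illustration only; a usable identifier
# would need far more text per language.
reference = {
    "sample 3": "Tim hak asasi manusia PBB untuk Pantai Gading mengukuhkan "
                "bahwa telah terjadi pelanggaran serius oleh berbagai pihak",
    "sample 6": "Podmetnuti požari uništili su do sada dijelove četiriju "
                "središta za azilante u Australiji",
}
profiles = {label: trigram_profile(text) for label, text in reference.items()}

# A later phrase from sample 3, not included in the reference text.
print(identify("mengatakan proses perdamaian perlu sekali dipercepat", profiles))
```

     The similarity score plays a role loosely comparable to the confidence figure that an identifier might report: the larger the gap between the best and second-best score, the more trust the top answer deserves. With reference texts this short the scores are only suggestive.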

      It should be clear by now that each language has its own patterns of allowable letter sequences in its writing system, which reflect the permissible sound sequences in the spoken language. The rules governing which sound sequences are permissible are called phonotactics. Phonotactics is mainly a phonological rather than a phonetic issue, but it is something phoneticians are also interested in. Here is the World Phonotactics Database, hosted by the Australian National University in Canberra, where you can compare and contrast phonotactic patterns in different languages, among many other things. An understanding of the phonotactics of a language is important in such applications as synthesized speech (samples here) and automatic speech recognition (ASR; demo here, or try MS Word's dictation function: click on Tools → Speech and start training the program).

     According to a New Scientist report in April 2012, it would seem that baboons have some ability to infer English phonotactic rules!

http://www.newscientist.com/article/dn21697-baboons-and-4letter-words-point-to-origins-of-reading.html


      In the next two pages, we will look more closely at syllable structure, allowable sequences of phonemes in English, and ways to get help with exercises C, D and E of chapter 4 in Ladefoged.

Next: Phonotactics II: Syllable structure

