19. Phonotactics I (with language identifier)
          Here are written samples of six languages, taken from the BBC 
    World Service Web site. Do you recognize any of these languages? How can 
    you identify a language just by looking at its written form? A writing 
    system that is unusual from the standpoint of English or Chinese would of 
    course stand out right away, and you might or might not be able to identify 
    the language from the particular shapes of the symbols used. Here, however, 
    we are considering only languages that use a Latin alphabet, though in some 
    cases the diacritical marks will offer good clues. (Hopefully the diacritics 
    are displaying correctly in your browser. Charts for entering these symbols 
    in ISO-10646 Unicode can be found here.)
    
    
    1. Iraqi ta ce ta gayyaci 
    shugaban sufetoci masu duba makamai na Majalissar dinkin duniya, Hans Blix, 
    ya je Baghdaza domin shawarwari a cikin watan Janairu. Kotun soji a Russia 
    ta yanke hukuncin cewa, babban jami'in sojin da ake zargi da aikata laifuka 
    kan farar hulla a Chechnya bashi da laifi. A Najeriya Jam'iyar PDP mai mulkin 
    kasar ta bayyana shirye-shiryenta na fitar da dan takarar shugaban kasa a 
    babban taron jamiyyar da za'a fara ranar juma'a mai zuwa, a Abuja babban birnin 
    kasar.
    
    2. Udhëheqësi 
    i qipriotëve turq Rauf Denktash ka thënë se do të japë 
    dorëheqjen nëse Turqia do të përpiqet ta detyrojë 
    për të nënshkruar një marrëveshje për ribashkimin 
    e Qipros. Në një koment për gazetën turke Hurriyet, zoti 
    Denktash tha se ai nuk po mendon për dorëheqjen dhe se deri tani 
    nuk i është bërë presion për ta nënshkruar marrëveshjen, 
    por do të ndryshojë mendim nëse kjo ndodh.
    
    3. Tim hak asasi manusia PBB untuk Pantai Gading mengukuhkan 
    bahwa telah terjadi pelanggaran serius oleh berbagai pihak dalam pemberontakan 
    berkepanjangan di negara itu. Ketua tim, yang juga wakil komisaris tinggi 
    Hak Asasi Manusia, Bertrand Ramcharan, mengatakan proses perdamaian perlu 
    sekali dipercepat, guna menghindari ancaman terjadinya pelanggaran lebih jauh.
    
    4. Almanya tarafından inşa edilen hızlı tren Çin'in 
    liman kenti Şangay'da törenle işletmeye açıldı. 
    Ancak, Şangay havaalanıyla iş merkezini birbirine başlayan 
    hızlı trenin kamuya açılması, bir yılı 
    bulacak. Saatte 400 kilometreye kadar hız yapan tren, havaalanıyla 
    kent merkezi arasındaki süreyi 7 dakikada katediyor.
    
    5. Waxana uu Kibaki ku ballan qaaday in maamulkiisu 
    uu dib u habayn baahsan ku samayn doono waxyaabaha ay ka mid yihiin; waxbarashada 
    dugsiga hoose oo lacag la'aan laga dhigi doono, dhismaha daryeel caafimaad, 
    dhaqaale xooggan iyo in la dabar gooyo musuqmaasuqa. Waxa uu Kibaki ku yiri 
    dadkii uu la hadlayey. Waxa aan idiin ballan qaaday in aanan idin qalbi jabin...anigoo 
    idiin mahad haya ayaan noqon doonaa khaadim aad leedihiin oon kibir iyo isla 
    weynin lahayn.
    
    6. Podmetnuti požari uništili su do sada dijelove četiriju središta 
    za azilante u Australiji i na australskom Božićnom Otoku u Indijskom Oceanu, 
    pri čemu je počinjena šteta od više milijuna dolara. Posljednji slučaj 
    podmetanja požara dogodio se upravo u izbjegličkom logoru na Božićnom 
    Otoku; tamo su azilanti naoružani cijevima svladali stražare te zapalili 
    veliku prostoriju za ručavanje. Nepunih 24 sata prije toga, sličan se 
    prosvjed dogodio i u izbjegličkom logoru Woomera, smještenom duboko u 
    pustinji na jugu kontinenta.
    
    
         Unless you know the language in question and recognize at least 
    some of its words, or its diacritics if it has any, you have little choice 
    but to observe how the letters combine, that is, which letters follow which 
    other letters, and then determine which language typically shows the 
    letter-sequence patterns you notice. Word length may also give some clues. 
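
     The kind of observation described above can be sketched in a few lines 
    of Python. This is only an illustrative sketch, not a method used by any of 
    the tools mentioned on this page; the function names letter_bigrams and 
    non_ascii_letters are made up for the example. It counts which letter 
    follows which inside words, and lists any letters outside plain a-z, such 
    as letters with diacritics.

```python
from collections import Counter


def letter_bigrams(text):
    """Count adjacent letter pairs inside words (case-insensitive)."""
    # Replace every non-letter with a space, so punctuation such as an
    # apostrophe splits a word instead of joining the letters around it.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    counts = Counter()
    for word in cleaned.split():
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


def non_ascii_letters(text):
    """Return the sorted letters that fall outside plain ASCII a-z."""
    return sorted({ch for ch in text.lower() if ch.isalpha() and not ch.isascii()})


# Opening words of sample 5 on this page.
sample = ("Waxana uu Kibaki ku ballan qaaday in maamulkiisu uu dib u "
          "habayn baahsan ku samayn doono")

# The five most frequent letter pairs in the fragment.
print(letter_bigrams(sample).most_common(5))

# Letters with diacritics (or otherwise non-ASCII) in a fragment of sample 4.
print(non_ascii_letters("hızlı tren Çin'in liman kenti Şangay'da"))
```

     Running the same two functions over each of the six samples and comparing 
    the results by eye is a quick way to see how differently the languages 
    combine their letters.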
    
         First, try to identify the languages above based 
    only on your familiarity with the languages, using either individual words 
    or patterns of letter combinations that you recognize. When you have finished, 
    you can check your answers by copying and pasting the above passages into 
    this online language identifier from Xerox (it does, however, get some of 
    the samples wrong): 
    
    http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser 
     
    
         Or use Google Translate's automatic language 
    detector, though it has previously misidentified two of the samples 
    (how can you tell which two?):
    
    http://translate.google.com/ 
    
    
         Can you think of a more accurate way to identify 
    the languages using just a plain Google search?
    
         The following is a downloadable application 
    called Polyglot 3000, which claims to identify more than 400 languages. 
    It also tells you how confident it is, in percent, of each language identification. 
    It is considerably more accurate than the online identifiers; it gets 
    all the above samples right:
    
    http://www.polyglot3000.com/download.shtml 
    
        
         After you have identified the languages, go 
    to Ethnologue 
    to learn more about where each one is spoken, how many speakers it has, what 
    languages it is related to, and so forth. You 
    can look on the Internet for more samples of different languages to try out 
    on the language identifiers; be resourceful in your search methods!

         If you had to design a method for identifying a language based on a 
    written sample, how would you go about it? If you are interested in some 
    ways to do this, you can link to a 6-page paper (in .pdf format) on this 
    subject from the Xerox site here 
    (local copy here). 
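
     As one possible answer to the design question, here is a minimal 
    sketch in Python that compares character trigram counts using cosine 
    similarity. It is an assumption for illustration only: it is not taken 
    from the Xerox paper and is not how any of the identifiers above are known 
    to work. The names trigram_profile, cosine, identify and REFERENCES are 
    made up for the example. The reference texts are the opening lines of 
    three samples on this page, labelled by sample number only so that the 
    exercise is not given away; a real identifier would use much larger 
    reference texts labelled with language names.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count character trigrams, with a space marking each word boundary."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    padded = " " + " ".join(cleaned.split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p, q):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, references):
    """Return the label of the reference profile most similar to the text."""
    target = trigram_profile(text)
    return max(references, key=lambda label: cosine(target, references[label]))


# Tiny reference profiles built from the opening lines of samples 1, 3 and 5.
REFERENCES = {
    "sample 1": trigram_profile(
        "Iraqi ta ce ta gayyaci shugaban sufetoci masu duba makamai na "
        "Majalissar dinkin duniya, Hans Blix, ya je Baghdaza domin "
        "shawarwari a cikin watan Janairu."),
    "sample 3": trigram_profile(
        "Tim hak asasi manusia PBB untuk Pantai Gading mengukuhkan bahwa "
        "telah terjadi pelanggaran serius oleh berbagai pihak dalam "
        "pemberontakan berkepanjangan di negara itu."),
    "sample 5": trigram_profile(
        "Waxana uu Kibaki ku ballan qaaday in maamulkiisu uu dib u habayn "
        "baahsan ku samayn doono waxyaabaha ay ka mid yihiin"),
}

# A later sentence from sample 5 that is not part of its reference text.
print(identify("Waxa uu Kibaki ku yiri dadkii uu la hadlayey.", REFERENCES))
```

     With these inputs the closest profile should be that of sample 5, the 
    sample the test sentence was taken from. With reference texts this short 
    the result is fragile, which may be one reason why real identifiers 
    sometimes get short samples wrong.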
    
          It should be clear by now that each language 
    has its own patterns of allowable letter sequences in its writing 
    system, which reflect the permissible sound sequences in the spoken language. 
    The set of rules governing permissible sound sequences is called phonotactics. 
    Phonotactics is mainly a phonological rather than a phonetic issue, but it is 
    something phoneticians are also interested in. Here is the World 
    Phonotactics Database, hosted by the Australian National University in 
    Canberra, where you can compare and contrast phonotactic patterns in different 
    languages, among many other things. An understanding of the phonotactics of 
    a language is important in such applications as synthesized 
    speech (samples here) 
    and automatic speech recognition (ASR; demo here, 
    or try MS Word's dictation function: click on Tools → Speech 
    and start training the program). 
    
         According to a New Scientist report in April 2012, 
    it would seem that baboons have some ability to infer English phonotactic 
    rules!
    
    http://www.newscientist.com/article/dn21697-baboons-and-4letter-words-point-to-origins-of-reading.html 
    
    
    
          In the next two pages, we will look more closely 
    at syllable structure, allowable sequences of phonemes in English, 
    and ways to get help with exercises C, D and E of chapter 4 in Ladefoged.
    
    Next: Phonotactics II: Syllable structure
 