19. Phonotactics I (with language identifier)
Here are written samples of six languages, taken from the BBC
World Service Web site. Do you recognize any of these languages? How can
you identify a language just by looking at its written form? A writing system
that is unusual (relative to English or Chinese) would of course stand out
right away, and you might or might not be able to identify the language based
on the particular shapes of the symbols used. Here, however, we are considering
only languages that use a Latin alphabet, though in some cases the diacritical
marks used will offer some good clues. (Hopefully the diacritics are displaying
correctly in your browser. Charts for entering these symbols in ISO-10646
Unicode can be found here.)
1. Iraqi ta ce ta gayyaci
shugaban sufetoci masu duba makamai na Majalissar dinkin duniya, Hans Blix,
ya je Baghdaza domin shawarwari a cikin watan Janairu. Kotun soji a Russia
ta yanke hukuncin cewa, babban jami'in sojin da ake zargi da aikata laifuka
kan farar hulla a Chechnya bashi da laifi. A Najeriya Jam'iyar PDP mai mulkin
kasar ta bayyana shirye-shiryenta na fitar da dan takarar shugaban kasa a
babban taron jamiyyar da za'a fara ranar juma'a mai zuwa, a Abuja babban birnin
kasar.
2. Udhëheqësi
i qipriotëve turq Rauf Denktash ka thënë se do të japë
dorëheqjen nëse Turqia do të përpiqet ta detyrojë
për të nënshkruar një marrëveshje për ribashkimin
e Qipros. Në një koment për gazetën turke Hurriyet, zoti
Denktash tha se ai nuk po mendon për dorëheqjen dhe se deri tani
nuk i është bërë presion për ta nënshkruar marrëveshjen,
por do të ndryshojë mendim nëse kjo ndodh.
3. Tim hak asasi manusia PBB untuk Pantai Gading mengukuhkan
bahwa telah terjadi pelanggaran serius oleh berbagai pihak dalam pemberontakan
berkepanjangan di negara itu. Ketua tim, yang juga wakil komisaris tinggi
Hak Asasi Manusia, Bertrand Ramcharan, mengatakan proses perdamaian perlu
sekali dipercepat, guna menghindari ancaman terjadinya pelanggaran lebih jauh.
4. Almanya tarafından inşa edilen hızlı tren Çin'in
liman kenti Şangay'da törenle işletmeye açıldı.
Ancak, Şangay havaalanıyla iş merkezini birbirine başlayan
hızlı trenin kamuya açılması, bir yılı
bulacak. Saatte 400 kilometreye kadar hız yapan tren, havaalanıyla
kent merkezi arasındaki suüreyi 7 dakikada katediyor.
5. Waxana uu Kibaki ku ballan qaaday in maamulkiisu
uu dib u habayn baahsan ku samayn doono waxyaabaha ay ka mid yihiin; waxbarashada
dugsiga hoose oo lacag la'aan laga dhigi doono, dhismaha daryeel caafimaad,
dhaqaale xooggan iyo in la dabar gooyo musuqmaasuqa. Waxa uu Kibaki ku yiri
dadkii uu la hadlayey. Waxa aan idiin ballan qaaday in aanan idin qalbi jabin...anigoo
idiin mahad haya ayaan noqon doonaa khaadim aad leedihiin oon kibir iyo isla
weynin lahayn.
6. Podmetnuti požari uništili su do sada dijelove četiriju središta
za azilante u Australiji i na australskom Božićnom Otoku u Indijskom Oceanu,
pri čemu je počinjena šteta od više milijuna dolara. Posljednji slučaj
podmetanja požara dogodio se upravo u izbjegličkom logoru na Božićnom
Otoku; tamo su azilanti naoružani cijevima svladali stražare te zapalili
veliku prostoriju za ručavanje. Nepunih 24 sata prije toga, sličan se
prosvjed dogodio i u izbjegličkom logoru Woomera, smještenom duboko u
pustinji na jugu kontinenta.
Unless you know the language in question and
recognize at least some of its words or its use of diacritics, if there are
any, you have little choice but to observe how the letters are
combined with each other, that is, which letters follow which other
letters, and then determine which language typically shows the patterns of
letter sequences you notice. Word length may also give some clues.
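To make "which letters follow which other letters" concrete, here is a minimal Python sketch that counts adjacent letter pairs and computes the average word length of a passage. The function names `letter_bigrams` and `mean_word_length` are made up for this illustration, and the word pattern is a simplifying assumption: it treats apostrophes and hyphens as word breaks, so a form such as za'a is counted as two words.

```python
import re
from collections import Counter

# Runs of Unicode letters only: excludes digits, underscores and punctuation.
# Assumption: apostrophes and hyphens split words (e.g. "za'a" -> "za", "a").
WORD = re.compile(r"[^\W\d_]+")

def letter_bigrams(text):
    """Count adjacent letter pairs inside words, ignoring case."""
    counts = Counter()
    for word in WORD.findall(text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts

def mean_word_length(text):
    """Average number of letters per word (0.0 for text with no words)."""
    words = WORD.findall(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0
```

Pasting one of the samples above into `letter_bigrams(...).most_common(10)` lists its ten most frequent letter pairs, which you can then compare across the six passages.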
First, try to identify the languages above based
only on your familiarity with them, using either individual words
or patterns of letter combinations that you recognize. When you have finished,
you can check your answers by copying and pasting the above passages into
this online language identifier, from Xerox (it does, however, get some of
the samples wrong):
http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser
Or use Google Translate's automatic language
detector, though it has previously misidentified two of the samples
(how can you tell which two?):
http://translate.google.com/
Can you think of a more accurate way to identify
the languages using just a plain Google search?
Polyglot 3000 is a downloadable application
that claims to be able to identify more than 400 languages.
It also tells you how confident it is, in percent, of each language identification.
It is considerably more accurate than the online identifiers, and it gets
all of the above samples right:
http://www.polyglot3000.com/download.shtml
After you have identified the languages, go
to Ethnologue
to learn more about where each one is spoken, how many speakers it has, what
languages it is related to, and so forth. You
can look on the Internet for more samples of different languages to try out
on the language identifiers. Be resourceful in your search methods!
If you had to design a method for identifying a language based on a written sample,
how would you go about it? If you are interested in some ways to do this,
you can link to a 6-page paper (in .pdf format) on this subject from the Xerox
site here
(local copy here).
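One widely used family of methods compares character n-gram frequency profiles. The Python sketch below illustrates that general idea under stated assumptions; it is not necessarily how the Xerox tool, Google Translate, or Polyglot 3000 work. The names `trigram_profile`, `cosine` and `identify` are invented for this example, and the choice of trigrams and cosine similarity is one option among several.

```python
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Count overlapping 3-character sequences, with spaces marking word edges."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(p, q):
    """Cosine similarity between two count profiles (0.0 if either is empty)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0

def identify(sample, training):
    """Return the label in `training` whose trigram profile best matches `sample`.

    `training` maps a language label to reference text in that language.
    """
    target = trigram_profile(sample)
    profiles = {label: trigram_profile(text) for label, text in training.items()}
    return max(profiles, key=lambda label: cosine(target, profiles[label]))
```

To try it, build `training` from text whose language you already know (for example, one paragraph per language from a news site), then call `identify` on a new passage. With only a sentence or two of reference text per language the result is unreliable; longer reference texts give steadier profiles.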
It should be clear by now that each language
has its own patterns or rules of allowable letter sequences in its writing
system, based on the permissible sound sequences in the spoken language. The
set of rules governing permissible sound sequences is called phonotactics.
Phonotactics is mainly a phonological rather than a phonetic issue, but it is
something phoneticians are also interested in. Here is the World
Phonotactics Database, hosted by the Australian National University in
Canberra, where you can compare and contrast phonotactic patterns in different
languages, among many other things. An understanding of the phonotactics of
a language is important in such applications as synthesized
speech (samples here)
and automatic speech recognition (ASR; demo here,
or try MS Word's dictation function: click on Tools → Speech
and start training the program).
According to a New Scientist report in April 2012,
it would seem that baboons have some ability to infer English phonotactic
rules!
http://www.newscientist.com/article/dn21697-baboons-and-4letter-words-point-to-origins-of-reading.html
In the next two pages, we will look more closely
at syllable structure, allowable sequences of phonemes in English,
and ways to get help with exercises C, D and E of chapter 4 in Ladefoged.
Next: Phonotactics II: Syllable structure