In the age of big data, we have all witnessed the mind-bogglingly beautiful phylogenetic trees presented by the equally big names in linguistics, and some of us got past the awe to wonder how they were made. Since I am fortunate enough to be standing shoulder to shoulder with these giants for a couple of months here at the Max-Planck-Institut für Menschheitsgeschichte, I wanted to act as a translator for those of you who may be interested in how to speak the languages of R and Python, and then how to feed that knowledge into tree-drawing programs like SplitsTree, DensiTree, and the (in)famous BEAUti and BEAST.
I have learned all this so quickly that I feel like Trinity downloading helicopter-flying instructions in The Matrix. A huge part of that was thanks to the Eurasia3angle Mini-Bayesian School for Transeurasian Linguists allowing me to sit in on their tutorials, but in truth it has been a collaborative effort: nearly every person in the Department of Linguistic and Cultural Evolution at the MPI here in Jena has contributed to making me feel, and be, part of their team through teaching and sharing. Articles written by current colleagues, such as The Potential of Automatic Word Comparison for Historical Linguistics, can teach you the why and the how behind the complex methods much better than I can, but maybe this brief tutorial can make the process seem a bit less intimidating!
One of the most obvious obstacles in comparative/historical/typological linguistics is consistency in data representation. Despite the existence of a perfectly reasonable International Phonetic Alphabet, linguists insist on using their own conventions to transcribe languages, often languages that no other person on earth has ever written, thus rendering the data more or less unusable for any sort of reconstruction. However, thanks to the authors of The Unicode Cookbook and the code therein, if we at least know what sound the linguist heard (as in the Africanist tradition of using y for [j]), then a simple script is all it takes to quickly and efficiently convert each of the characters in a file such as a lexical word-list into those with which linguists are more familiar. (As an aside, for those of you who store your data in FLEx, the simplest way to extract it into tabular format is to go to the lexicon view and literally copy and paste the entire sheet, filtered to the columns you want to show.)
Let's take just a few words as an example to start out. I have been doing some translation of Fula varieties in my spare time and have been curious to know whether they all have the same types of consonant mutation. Often, I review other speakers' transcriptions. As we see here, each of these related words for the lexeme 'SHEEP' is written with its own transcription conventions.
Adamawa – mbaala / baali
Pulaar – mba:lu / baali:
Although the form in Maasina (see the word-list below) is phonetically identical to that of Pulaar, the orthography might mask the similarity due to the transcription conventions. In fact, one of the distinctions here is subtle: even if your data-set is tiny, spotting the difference between the IPA length mark ː and the ordinary colon : is not easy. Thus, for the sake of time and accuracy, let's let the computer handle these issues.
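To see just how sneaky that last distinction is, you can ask Python itself. The two symbols are distinct Unicode code points, and the standard library's unicodedata module will tell them apart even when your eyes cannot:

```python
import unicodedata

colon = ":"    # U+003A, the ordinary punctuation colon
length = "ː"   # U+02D0, the IPA length mark

# They look nearly identical, but they are different characters:
print(colon == length)          # -> False
print(unicodedata.name(colon))  # -> COLON
print(unicodedata.name(length)) # -> MODIFIER LETTER TRIANGULAR COLON
```

This is exactly the kind of invisible inconsistency that an orthography profile (described below) is designed to catch.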
Now, everything I have to show from this point forward is other people’s work – I take no credit for designing or writing these amazing tools and scripts. My only goal here is to translate programmer-speak into that which can be understood by us mere mortals.
The first task is to create a folder in which you will house all your word-lists and any scripts that you need. For the sake of ease, it is best not to bury it too deeply on your computer, but that is up to you. In this folder, create a plain text file (ideally .tsv, e.g. with TextMate) and call it LANGUAGE.tsv, where LANGUAGE is the language on which you work. For example, I will put the words above into a file called Fula-wordlist.tsv (remember not to use spaces or special characters in file or folder names). The file must have three tab-separated columns in the following order: DOCULECT (the name of the source variety), CONCEPT (the gloss in English, or another target language), and IPA (the source-language transcription):
DOCULECT CONCEPT IPA
Maasina sheep (sg) mba:lu
Adamawa sheep (sg) mba:la
Pulaar sheep (sg) mba:lu
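If you would rather build this file programmatically than by hand, Python's standard csv module writes proper tab-separated output. This is only a sketch, using the file name and the three rows from the example above:

```python
import csv

# The three forms from the example word-list.
rows = [
    ("Maasina", "sheep (sg)", "mba:lu"),
    ("Adamawa", "sheep (sg)", "mba:la"),
    ("Pulaar",  "sheep (sg)", "mba:lu"),
]

# Write a tab-separated file with the required header.
with open("Fula-wordlist.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(("DOCULECT", "CONCEPT", "IPA"))
    writer.writerows(rows)
```

Using csv.writer with delimiter="\t" guarantees real tab characters between the columns, which is easy to get wrong in a word processor.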
The tools you will need to install are called LingPy, which you can retrieve, learn about, and install from this link: Lingpy, and Concepticon. If you are using a Mac, all is well and using Python will be straightforward, as it is already installed. Just fire up that ominous Terminal and type each of the following (without the $ – this just means you are in the shell rather than Python; basically different languages, like English and German):
$ pip install lingpy
$ pip install pyconcepticon
If you get any sort of error about user rights and restrictions, you may, as indicated, have to invoke something called sudo rights. This is not ideal, but if necessary, type:
$ sudo pip install lingpy
$ sudo pip install pyconcepticon
Assuming all goes well, you can begin to use LingPy and pyconcepticon through Python. By literally typing the word python into the command prompt you will start Python – you should see a message telling you about the version installed on your computer. You will know you are in Python because of the three greater-than symbols. Next, type the following (excluding the three >>>):
>>> from lingpy import *
You may get two warnings:
2017-11-16 18:51:36,746 [WARNING] Module 'sklearn' could not be loaded. Some methods may not work properly.
2017-11-16 18:51:36,747 [WARNING] Module 'igraph' could not be loaded. Some methods may not work properly.
That is OK; proceed with the next commands, which point Python at the folder where you saved your word-list file (be sure to write in your specific path – it can be located by right-clicking on the name of the file in Finder; it will begin with /Users, and the … stands for the part that is specific to your path):
>>> import os
>>> os.chdir('/Users/…')
To make sure that the working directory is set, type the following:
>>> os.getcwd()
If you set up your working directory correctly, Python will print it for you.
Now we are ready to tell LingPy to help convert the orthography of the word-list into IPA by generating an orthography profile:
(I just learned that this feature is not yet available in the currently stable version of LingPy, so wait a bit for the new version to be released before using it.)
$ lingpy profile -i Fula-wordlist.tsv -o Fula-profile.tsv --context --column=form
In any case, once you have a working orthography profile – essentially a sheet with two tab-separated columns, the first containing the symbols used in your language's orthography and the second the accepted IPA characters – you can type/copy the following into the terminal (still running Python) to convert the symbols of your language's orthography into IPA characters simply and effectively.
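For the three Fula forms above, a minimal profile might look like the following. To be clear, these mappings are my own assumptions for this toy example, not the output of the profile command:

```
Grapheme	IPA
m	m
b	b
l	l
u	u
a	a
aa	aː
a:	aː
:	ː
```

Each line simply says "whenever you see this orthographic symbol (or symbol sequence), write this IPA character instead" – so both the Adamawa doubling convention (aa) and the Pulaar colon convention (a:) end up as the same IPA long vowel (aː).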
>>> from segments.tokenizer import Tokenizer
>>> wl = Wordlist('Fula-wordlist.tsv')
>>> op = Tokenizer('Fula-profile.tsv')
>>> wl.add_entries('tokens', 'ipa', op, column='IPA')
>>> wl.output('tsv', filename='Fula_output', ignore='all', prettify=False)
The other wonderful thing this script does is assign each entry a numeric ID and put spaces between the IPA characters in a new column called TOKENS, so that you will have a list that is quite easy to align.
ID	DOCULECT	CONCEPT	IPA	TOKENS	COGID	ALIGNMENT
2	Maasina	sheep (sg)	mba:lu	m b aː l u	1	m b aː l u
3	Adamawa	sheep (sg)	mba:la	m b aː l a	1	m b aː l a
1	Pulaar	sheep (sg)	mba:lu	m b aː l u	1	m b aː l u
Now, just as we converted arbitrary symbols into the internationally accepted orthography of the IPA, we need to use Concepticon to map the definitions in our word-list file onto those recognized and accepted by the standards of the linguistic community. For instance, what does 'belly' mean? The stomach, the intestines, even the heart in many languages. Concepticon will suggest such meanings, and then the researcher must decide, from precise concepts with explicit definitions, which gloss is actually accurate.
We can take the orthography-processed file, Fula-wordlist.tsv, with its three columns as the input, and call the output file Fula-wordlist-output.tsv, with this simple command (in the terminal, not in Python – open a new window if Python is still running):
$ concepticon map_concepts Fula-wordlist.tsv > Fula-wordlist-output.tsv
The output file will have (suggested) accepted meanings with their ID numbers in Concepticon for that which we have provided in our word-list, so that our word sheep is recognized as:
1331 | SHEEP | A common, four-legged animal (Ovis) that is commonly kept by humans for its wool. | Animals | Person/Thing
Now, because we have those precious IDs assigned to our word-list items, we can easily merge the concept output file with the orthography output file into one fully comparable file (which we will save as Fula_output.tsv), which we will use for automatic cognate detection in our final two processing steps:
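Before running those steps, it may help to see what the merge itself amounts to: a simple join on the shared ID column. Here is a standard-library sketch (the file and column names are illustrative assumptions, not a fixed format):

```python
import csv

def read_tsv(path):
    """Read a tab-separated file into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def merge_on_id(orthography_rows, concept_rows):
    """Join two row lists on their shared ID column."""
    concepts = {row["ID"]: row for row in concept_rows}
    # For each orthography row, pull in the matching concept columns.
    return [{**row, **concepts.get(row["ID"], {})} for row in orthography_rows]
```

With the two output files in hand, merge_on_id(read_tsv('Fula_output.tsv'), read_tsv('Fula-wordlist-output.tsv')) would give you one combined table, which you could then write back out with csv.DictWriter.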
>>> lex = LexStat('Fula_output.tsv')
>>> lex.cluster(method='sca', threshold=0.45, ref='cogid')
>>> lex.output('tsv', filename='Cognates', subset=True, prettify=False, ignore='all')
>>> alm = Alignments(lex, ref='cogid')
>>> alm.align()
>>> alm.output('tsv', filename='Cognates_align', subset=True,
...     cols=['doculect', 'concept', 'ipa', 'tokens', 'cogid', 'alignment'], prettify=False, ignore='all')
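To build an intuition for what threshold=0.45 is doing, here is a toy flat-clustering sketch. It uses plain normalized edit distance in place of LingPy's much more sophisticated SCA sound-class distances, so it is an illustration of the thresholding idea only, not of the actual algorithm:

```python
from itertools import combinations

def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance(a, b):
    """Edit distance normalized by the longer word's length."""
    return edit_distance(a, b) / max(len(a), len(b))

words = {"Maasina": "mbaːlu", "Adamawa": "mbaːla", "Pulaar": "mbaːlu"}

# Single-link flat clustering: two forms join the same cognate set
# whenever their normalized distance falls below the threshold.
THRESHOLD = 0.45
parent = {name: name for name in words}

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

for (n1, w1), (n2, w2) in combinations(words.items(), 2):
    if distance(w1, w2) < THRESHOLD:
        parent[find(n1)] = find(n2)

cogids = {name: find(name) for name in words}
print(cogids)  # all three varieties end up in one cognate set
```

Here mbaːlu and mbaːla differ by a single segment (distance 1/6 ≈ 0.17, well under 0.45), so all three forms land in one cognate set – which matches what LexStat finds for 'SHEEP'.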
The result will be a file called Cognates_align, which contains a wealth of information that will be read and understood far better by the amazing program Edictor than by me.
ID DOCULECT CONCEPT IPA TOKENS SONARS PROSTRINGS CLASSES LANGID NUMBERS WEIGHTS DUPLICATES COGID
Maasina sheep (sg) mba:lu m b aː l u 4 1 7 5 7 ACYBZ MPALY 2 2.M.C 2.P.C 2.A.V 2.L.C 2.Y.V 2.0 1.5 1.3 1.75 0.8 0 12
Adamawa sheep (sg) mba:la m b aː l a 4 1 7 5 7 ACYBZ MPALA 1 1.M.C 1.P.C 1.A.V 1.L.C 1.A.V 2.0 1.5 1.3 1.75 0.8 0 13
Pulaar sheep (sg) mba:lu m b aː l u 4 1 7 5 7 ACYBZ MPALY 3 3.M.C 3.P.C 3.A.V 3.L.C 3.Y.V 2.0 1.5 1.3 1.75 0.8 0 1
# Created using the LexStat class of LingPy-2.5
# Cluster: sca_upgma_0.45
For Edictor there is no need to download anything: you can simply drag your newly created Cognates file into the window provided, select at minimum the ALIGNMENT and COGID columns to view, and right-click in the COGID column to see which words the automatic cognate detector chose as cognates for your language. The cognate-detecting algorithm of course uses sound correspondences, so the more words in your list, the better. Also, Edictor only takes one source file at a time (as does the cognate-mapping script), so it will be necessary to merge separate languages into one file if they are saved and processed separately.
Within Edictor, there may be many adjustments and alignments that must be corrected by the researcher, but once you have some cognates with which you are satisfied, you can download the results as a Nexus file. This format is required by the tree-drawing programs mentioned above. The easiest is SplitsTree, in which you can directly open the .nex file and the tree will appear! (For the more complex programs such as BEAST, you will need to create a .trees file in R; for this, you can use the packages ape and phangorn.) Our little Fula tree is essentially one straight line, as the cognates are all so obvious, but the tree featured above in this post was actually made with only three lexical items! It is not at all accurate in terms of linguistic relations, but it shows that I am now ready to input the *big data*!
I hope this helped; if not, feel free to post questions in the comment section. In the next post, I'll talk a bit about the non-computational side of our collective investigatory updates on Bangime!