Sufficiently Wrong: May 2015

Historical linguistics is a field that does not produce very "practically" useful results - in a way, I guess it can be compared to G.H. Hardy's characterization of pure maths. However, historically there has been a great interest in it - many of the greatest minds of the late 19th century invested huge amounts of effort in this area. In part this can be attributed to nationalism.

Since historical linguistics deals with long-term history in a way that naturally is associated with the history of ethnicities and tribes, it is natural that methods have been developed to differentiate claims that have been made just to appeal to nationalist sensibilities from accurate claims about linguistic relationship that tell us something confirmably real.

Of course, the fact that two languages are related does not really create any ethical obligations between the two groups of speakers, does not give any reason to conflate speakers of one language with speakers of the other, and does not justify assuming that speakers of the two languages think in similar fashions or have similar political desires. Nor should it be seen as a way for one group to claim a share in the more recent achievements of the other.

Of course, various idiots have failed to understand this, and have both used and abused proper methodology with an axe to grind for whatever misguided reason. Still, the ability to at least somewhat differentiate nationalist propaganda from stuff that is more likely to have some real historical core to it has probably helped rein in some of the most inane pseudolinguistic hypotheses. Such pseudolinguistic claims are still aired frequently in various venues for nationalistic, religious or possibly even other reasons. I will return to these claims later on, so as to tie back to the main topic of this blog.

Of course, another important part in trying to figure out the history of languages was probably general human curiosity - you get a very real puzzle with some very real information about prehistory.

The first observation we'll make is that some languages are similar to each other. These similarities come in many shapes: German and Dutch have similar phonologies, a lot of shared vocabulary, and a large amount of shared grammar. The same goes for, say, Italian and Spanish. Realizing that such pairs (or even larger groups) of languages are somehow related was a starting point, in some sense.

A natural follow-up question is how we rank the relatedness of several closely related languages: say German, Dutch and English, or Spanish, Romanian and Italian. And by 'how', I don't mean "what is the ranking", I mean "how do we go about ranking them".

Let's first define what we mean when we say two languages are "related":

Def. 1: Two languages are related if they have come about by distinct sets of historical changes that have happened to a shared ancestral language.

Graph 1: an example

The edges in the graph represent changes. We could also use an empty set of changes:

Graph 2

Latin and Italian are related, because both Italian and Latin can be derived from Latin by different sets of changes. Now we can start wondering about things like relative distances.
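As a minimal sketch of Def. 1 in Python: each language is stored as a pair of ancestor and set of changes, and two languages count as related if the ancestor matches and the sets differ. The change labels, "Language X" and "Proto-X" are made-up placeholders for illustration, not real sound laws or real languages.

```python
# Hypothetical data: language -> (ancestral language, set of changes).
# The change labels are placeholders, not actual historical changes.
DERIVATIONS = {
    "Latin": ("Latin", frozenset()),  # the empty set of changes, as in graph 2
    "Italian": ("Latin", frozenset({"change_1", "change_2"})),
    "Spanish": ("Latin", frozenset({"change_1", "change_3"})),
    "Language X": ("Proto-X", frozenset({"change_4"})),  # invented outsider
}


def related(a, b):
    """Def. 1: distinct sets of changes applied to a shared ancestor."""
    ancestor_a, changes_a = DERIVATIONS[a]
    ancestor_b, changes_b = DERIVATIONS[b]
    return ancestor_a == ancestor_b and changes_a != changes_b


print(related("Latin", "Italian"))       # Latin uses the empty set of changes
print(related("Italian", "Spanish"))
print(related("Italian", "Language X"))  # no shared ancestor in this data
```

Note that the sketch only looks one step up, at a single recorded ancestor; a fuller version would follow the chain of ancestors all the way to the root.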

It is clear that Latin is more closely related to Latin than Italian is to Latin. This is a trivial statement, but it can be developed a bit. The model I will present will have a slight flaw, but one that is rather 'acceptable' as far as practical consequences go: we will not be able to distinguish whether Spanish or Italian is closer to Latin. I will later give a more detailed justification for this particular gap in the system.

Graph 3: A small tree of related languages

Now, we can clearly form a bigger structure like graph 3. Since I have not yet described how we know this structure, we shall just see this as an example of how relations work, rather than as a statement of facts about these languages. With the assumptions given in the graph, we get a situation where Spanish, Italian and French form a rather tight group, and German, English and Dutch likewise. We can basically say that languages that share a node are more closely related to each other than they are to languages that do not share that node.
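The grouping idea can be sketched in a few lines of Python. The tree below is my reading of graph 3 as described in the text (Latin and Germanic directly under Indo-European), so treat its shape as an assumption; the function names are invented for the example.

```python
# Assumed shape of graph 3: node -> list of direct descendants.
CHILDREN = {
    "Indo-European": ["Latin", "Germanic"],
    "Latin": ["Spanish", "Italian", "French"],
    "Germanic": ["German", "English", "Dutch"],
}


def descendants(node):
    """Collect every language below `node` in the tree."""
    found = set()
    for child in CHILDREN.get(node, []):
        found.add(child)
        found |= descendants(child)
    return found


def share_node(node, *languages):
    """True if all the given languages descend from `node`."""
    return set(languages) <= descendants(node)


print(share_node("Latin", "Spanish", "French"))          # same tight group
print(share_node("Latin", "Spanish", "German"))          # German is not under Latin
print(share_node("Indo-European", "Spanish", "German"))  # only the root is shared
```

Spanish and French share the Latin node, whereas Spanish and German only share the root, which is all the tree says about their relative closeness.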

Graph 4: a tree with a complication

We might think that this gives us an opening for comparing how close a language is to the root: count the number of intervening nodes. Under that assumption, Latin is closer to Indo-European than Spanish, Italian, French, English, German or Dutch. However, since we really don't know the exact distance of Germanic or Latin to the root, this is putting a lot of stock into the intervening nodes. Also, we could have had a situation like graph 4, where a node we've got no record of ever existing actually makes the distances slightly off (if we just go by the number of nodes).

Of course, positing an intermediate node just in order to make an assertion regarding the distances of some set of languages violates Occam's razor - we can't just willy-nilly insert a Pre-Germanic node without evidence, and then assert that Spanish, Italian and French are closer to Indo-European than German, English and Dutch are. What we can do, however, is use this example to show why the number of nodes between our proto-language and the descendants isn't particularly informative: it only tells us how well we know how many languages have split from a branch and in what order - it does not tell us anything about how great the changes between the nodes are, and thus nothing about which particular branches' nodes are closer to the root.
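To make the point concrete, here is a small Python sketch that counts intervening nodes in two versions of the tree. Both tree shapes are assumptions based on the description in the text: the first is graph 3, the second adds a Pre-Germanic node between Indo-European and Germanic, as in graph 4.

```python
# Assumed shape of graph 3: child -> parent.
GRAPH_3 = {
    "Latin": "Indo-European",
    "Germanic": "Indo-European",
    "Spanish": "Latin", "Italian": "Latin", "French": "Latin",
    "German": "Germanic", "English": "Germanic", "Dutch": "Germanic",
}

# Assumed shape of graph 4: the same tree with a Pre-Germanic node inserted.
GRAPH_4 = {
    **GRAPH_3,
    "Germanic": "Pre-Germanic",
    "Pre-Germanic": "Indo-European",
}


def nodes_to_root(language, parent):
    """Count the nodes strictly between `language` and the root."""
    count = 0
    node = parent[language]
    while node in parent:  # the root is the only node without a parent
        count += 1
        node = parent[node]
    return count


for name, tree in (("graph 3", GRAPH_3), ("graph 4", GRAPH_4)):
    print(name, nodes_to_root("Spanish", tree), nodes_to_root("German", tree))
```

The count for German changes from one tree to the other while the count for Spanish does not, even though nothing about the languages themselves has changed; only our bookkeeping of nodes has.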

However, now we do have a hierarchical way of comparing whether (language a, language b) or (language a, language c) are the closer pair of relatives. This is only really meaningful as long as one of the languages is held constant - when we're comparing (language a, language b) with (language c, language d), our measurement gets somewhat less meaningful.
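A rough Python sketch of this comparison, again assuming the tree shape of graph 3 and using invented function names: with language a held constant, the lowest nodes it shares with b and with c both lie on a's own line of descent, so one of them must sit at or below the other, and no measuring of distances is needed.

```python
# Assumed shape of graph 3: child -> parent.
PARENT = {
    "Latin": "Indo-European",
    "Germanic": "Indo-European",
    "Spanish": "Latin", "Italian": "Latin", "French": "Latin",
    "German": "Germanic", "English": "Germanic", "Dutch": "Germanic",
}


def lineage(language):
    """Return the language followed by its ancestors, nearest first."""
    chain = [language]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_shared_node(a, b):
    """Return the first node on a's lineage that is also on b's lineage."""
    line_b = set(lineage(b))
    return next(node for node in lineage(a) if node in line_b)


def closer_relative(a, b, c):
    """Return b or c, whichever is closer to a, or None if the tree cannot tell."""
    node_ab = lowest_shared_node(a, b)
    node_ac = lowest_shared_node(a, c)
    if node_ab == node_ac:
        return None
    # Both nodes are on a's lineage, so their order along it decides.
    line_a = lineage(a)
    return b if line_a.index(node_ab) < line_a.index(node_ac) else c


print(closer_relative("Spanish", "Italian", "German"))  # Italian
print(closer_relative("Latin", "Spanish", "Italian"))   # None
```

The second call returns None because both pairs meet at the same node; that is the gap mentioned earlier, where the model cannot say whether Spanish or Italian is closer to Latin. For (a, b) versus (c, d) the two shared nodes need not lie on a common line of descent at all, which is why that comparison is less meaningful.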

In the next post, I will present what linguistic content is most often used for finding out things about relations between languages - but also point out how ignoring other content makes the idea of any kind of 'objective measure' of relatedness, beyond hierarchies along the lines presented above, problematic. (Although one could imagine improvements in method that would fix that problem as well.)

I will also present some considerations as to why Occam's razor makes the conclusions those methods reach rather likely, and why we can consider families such as Indo-European, Uralic, Turkic, Afro-Asiatic and a number of others overwhelmingly likely to represent accurately how these groups of languages are internally related.


Wednesday, May 27, 2015

On Historical Linguistics: Part 1