Notes on the Lexicostatistical Comparison
of the Main Indo-European Language Groups


The methodology and the preliminary outcome
of the manual lexicostatistical analysis
performed on the 46-wordlist
for the 30 main Indo-European language groups
(draft notes)
(Prepared 2007-2008, published 09/2008,
minor corrections 03/2012)


[In 04/2009, the ASJP group published the results of a theoretically similar but much better-resourced study, using a 40-item list that covered most languages of the world. They say they began to move along the same lines c. 01/2008, whereas this page was published online in 09/2008 as an independent effort. (A note added in 06/2009)]

[Essentially, this is an unfinished lexicostatistical study of the internal classification of the main Indo-European groups that was started in 2007 but was stopped due to problems with the program development. However, it still contains many interesting points concerning the taxonomic position of some of the IE groups, based on the lexical and phonological similarity in the basic vocabulary of the Indo-European languages. (A note added in 03/2012)]



How lexemes were counted


The internal classification of the Indo-European languages was conducted based on the lexicostatistical calculations and direct lexical comparison of the 46 basic words collected in this table of basic lexemes of the Indo-European languages.

It's not a matter of how the lexemes in the table were counted; it's rather a matter of maintaining a standardized, uniform, unbiased, objective method of computation throughout the dataset. Initially, it was planned to feed the table into a program to automate all the calculations, but the Python program, prepared by a friend of mine, was never finished, so as of 2008 I simply had to do all the calculations manually, using subjective analysis when comparing words in the table. Therefore, think of this article as a quick-and-dirty, preliminary analysis.

Again, it should be emphasized that the uniformity in calculations is what mattered here, because the lexicostatistical percentages have no meaning as absolute values; they only have a meaning when compared to each other within these particular 46-wordlists, so in case someone attempts to repeat some of the counting and gets a different value for a pair of languages, keep in mind that it's only the relation of any two values that actually counts.

Similar considerations would be true for possible statistical errors. As in all statistical studies, occasional errors do not matter and cannot affect the final outcome directly (see the law of large numbers). This is what makes statistical methods so robust. Even if one finds some false cognates or loses some of the right ones, that would not impact the final result, as long as one consistently maintains the same method of counting throughout the whole study. It may not even be relevant whether one uses too strict or too lax a method of counting, because all one gets in either case is an overall negative or positive offset that is effectively canceled out when the values are compared to each other globally.
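To make the offset argument concrete, here is a minimal Python sketch. It assumes that a stricter or laxer counting habit shifts every percentage by roughly the same amount, which is the strong assumption the whole argument rests on. The three figures are taken from later sections of these notes; the function name and the size of the offset (6 points) are purely illustrative.

```python
def ranking(percentages):
    """Return the pair labels ordered from most to least similar."""
    return sorted(percentages, key=percentages.get, reverse=True)

# Figures quoted later in these notes (percent of shared basic lexemes).
baseline = {
    "Italic/Hellenic": 69,
    "Balto-Slavic/Germanic": 50,
    "Balto-Slavic/Indo-Iranian": 35,
}

# Hypothetical uniform bias of 6 points from stricter or laxer counting.
strict = {pair: value - 6 for pair, value in baseline.items()}
lax = {pair: value + 6 for pair, value in baseline.items()}

# The ordering of the pairs, and the gaps between them, stay the same.
assert ranking(strict) == ranking(baseline) == ranking(lax)
print(ranking(baseline))
```

If the bias differed from pair to pair, the ordering would no longer be guaranteed, which is why the uniformity of the method matters more than its strictness.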

As a result, it is in fact irrelevant which method of counting was used. However, as a digression, I can say that I have not relied solely on the commonly accepted cognate method (as, for instance, Dyen et al. have). This classical method was enhanced by considering the actual phonological similarity of two sets of possible cognates.

The commonly accepted method of counting cognates in Swadesh lists is based on the presumption that the genetic relation of two sets of lexemes, and finally two languages, can be shown by demonstrating that just two members of each pair are related to each other by the method of regular correspondences, and then counting up the number of such matches in a sufficiently long list.

But what if we use a short list? That would make this method susceptible to the following error in reasoning. Suppose we compare English, German and Armenian:

English            German              Armenian
<two>, /tu:/       <zwei>, /tsvai/     yerku
<three>, /θri:/    <drei>, /drai/      yerek'
<four>, /fo:r/     <vier>, /fi:r/      chors

The relation of the Armenian numerals is thought to have been demonstrated by the method of regular correspondences; therefore, in the classical method, the genetic relation of these three columns is exactly the same, and we should count all three pairs as cognates. However, it is quite obvious that the English-to-German pairs stand much closer to each other, whereas the Armenian forms look like something completely different. This happens because we put the theory before the facts, overestimating the importance of theoretical analysis instead of looking at the facts of life and phonology as they are. The facts demand that we take notice of actual phonological similarity and compare words phoneme by phoneme rather than word by word. Consequently, while /tu:/ and /tsvai/ look related, /yerku/ does not, and should not be regarded as an exact match for the former two lexemes.

Curiously, the mathematicians Serva & Petroni already attempted a letter-by-letter comparison in 2007-2008 (no cognates at all!) using Dyen's database of Indo-European languages, and obtained some consistent (though not completely watertight) results, despite the obvious straightforwardness and simplicity of their method.
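For readers who want to see what a letter-by-letter comparison looks like, here is a minimal Python sketch of an edit distance normalized by the length of the longer word. This is my understanding of the general idea behind that kind of approach, not a reproduction of Serva & Petroni's code; the function names are hypothetical, and the forms are simplified ASCII versions of the numerals above (length marks dropped), which is my own simplification.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Edit distance divided by the length of the longer word (0 = identical)."""
    return levenshtein(a, b) / max(len(a), len(b))

# "four": English vs. German, and English vs. Armenian
print(round(normalized_distance("for", "fir"), 2))    # English : German
print(round(normalized_distance("for", "chors"), 2))  # English : Armenian
```

On these forms the English-German pair comes out at about 0.33 and the English-Armenian pair at 0.6, which matches the visual impression that the Armenian numeral stands further away.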

There's also a good survey of computational methods in historical linguistics by Agarwal & Adams (2007).
This point is so important, it deserves a separate article, but we'll now have to leave it aside.

Briefly speaking, the counting in the table was done according to the following manual procedure of cognate comparison. For each pair of words a number was assigned that reflected the level of similarity between these two words.

1: Phonologically similar cognates with at least one very similar stable phoneme (mostly an initial consonant), as in /t/ : /ts/ in <two> : <zwei>.
0.5: Obviously phonologically dissimilar cognates, with less than one stable phoneme in common, as in Eng. /θri:/ : Shughni /arái/; here the initial *th is lacking, and /r/ alone is not enough to make the two lexemes look "sufficiently similar".
0.5-0.2: Probable but unproved cognates with some phonological similarity, as in Pashto /xulá/ : Shughni /Gaiv/ 'mouth', or other uncertain cases. These cases were rather rare and could not statistically affect the results.
0.5: If one of the compared languages has a second synonym with a different root, which is not explained in quotes as having a different meaning or semantic connotation, e.g. Old Eng. <wind> : Welsh <gwynt>; <avel>.
0.3-0.2: Highly dissimilar cognates or complex systems of possible cognates (more than 2-3 lexemes) with several synonyms that had to be cross-compared, as in Sanskrit <putraH>; <sutà> : Latin <filius>; <puer>, unless these possible cognates seemed to clearly match phonologically or had previously established cognates in all of the cases under consideration.
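The scale above can be written down as a small Python sketch. This is only an illustration of how the manual scores could be turned into a percentage, not the unfinished program mentioned earlier. The category names are hypothetical, the score of 0 for "no match" is my assumption (the scale does not state it explicitly), and so is the choice to divide by the number of compared items.

```python
# Allowed score range for each (hypothetically named) category of the scale.
SCORE_RANGES = {
    "similar_cognate": (1.0, 1.0),
    "dissimilar_cognate": (0.5, 0.5),
    "probable_unproved": (0.2, 0.5),
    "extra_synonym": (0.5, 0.5),
    "complex_or_highly_dissimilar": (0.2, 0.3),
    "no_match": (0.0, 0.0),  # assumption: unrelated words score zero
}

def check_score(category, score):
    """Reject a score that falls outside the range allowed for its category."""
    low, high = SCORE_RANGES[category]
    if not low <= score <= high:
        raise ValueError(f"{score} is outside {low}-{high} for {category}")
    return score

def similarity_percent(judgements):
    """Average the per-item scores and express the result as a percentage."""
    scores = [check_score(category, score) for category, score in judgements]
    return 100 * sum(scores) / len(scores)

# Five made-up judgements for one language pair (a real run would have 46).
example = [
    ("similar_cognate", 1.0),
    ("similar_cognate", 1.0),
    ("dissimilar_cognate", 0.5),
    ("probable_unproved", 0.3),
    ("no_match", 0.0),
]
print(round(similarity_percent(example), 1))
```

Keeping the allowed ranges in one table is a simple way to enforce the uniformity of counting that the argument depends on.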

Consequently, a lexicostatistical distance matrix was obtained that will be used in the further discussion.

At this point, I can hear someone cry out, "Aha! Mass comparison!" You can call it what you please, although "mass comparison" should normally refer to churning out uncontrolled data, whereas I attempted to maintain strict control throughout the study. If you do not believe me, you can do your own calculations using your own set of rules; it doesn't actually take much time (a matter of days), because the list is so short. Basically, all I did here was use probabilistic logic, adding "maybe" to the much too inflexible true-or-false classical cognate approach, as well as developing a very simple system to count synonyms.

Moreover, the following comments should be taken into consideration:

(1) In case of doubt, calculations were done several times until stable results were obtained, or the results were statistically averaged over several calculations.

(2) If the calculation results seemed much too uncertain, the calculations were temporarily abandoned (as in the case of positioning Balto-Slavic and Armenian).

(3) The results are not just the outcome of dead reckoning; they are supported by additional considerations, such as showing which particular shared innovations were produced within a particular pair of language groups, or even adding some material from outside the table. So the percentages should rather be thought of as a preliminary indicator of where to look further, not as the final diagnosis.

(4) Averaging over many language groups makes a result particularly stable, as in the case of the 35-36% figure for the relatedness of the most distant main IE groups to each other (excluding Hittite). On the other hand, cases that contrast just two individual groups (e.g., just Irish and Welsh) are particularly prone to statistical errors, as long as no additional data are taken into consideration.

(5) The table was designed to be analyzed by a computer program using a special algorithm described elsewhere, so all these manual results should be seen as preliminary (=just re-stating the obvious).

From distance matrix to dendrogram

There are many mathematical methods of cluster analysis and computational phylogenetics, developed mostly for biological and financial applications. These methods were tested on an IE dataset by Ringe, Warnow, et al. in a series of articles (2002-2005), and... well, none of them seems to work properly in their papers. Actually, the problem is not in the cluster analysis; the problem is in how you count your words. If you have built a correct distance matrix, everything works almost automatically, but no clustering method can save a flawed analysis if the languages are compared in the wrong way. Besides, building a tree for a relatively small number of language groups is hardly a computational problem, and can be done manually in many cases.
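As an illustration of how little computation a small tree needs, here is a minimal Python sketch of average-linkage clustering (UPGMA) applied to a similarity table. UPGMA is my stand-in here; it is not necessarily the algorithm the table was designed for, and the function names are hypothetical. The six percentages are the 46-list figures for Lithuanian, Latvian, Old Prussian and Russian given further down in these notes.

```python
from itertools import combinations

def upgma(names, similarity):
    """Repeatedly merge the two clusters with the highest average similarity.
    Returns the merges in order, as (label, average similarity) pairs."""
    clusters = {name: (name,) for name in names}  # label -> member languages

    def average_similarity(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(similarity[frozenset(pair)] for pair in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(*pair))
        score = average_similarity(a, b)
        label = f"({a},{b})"
        clusters[label] = clusters.pop(a) + clusters.pop(b)
        merges.append((label, round(score, 1)))
    return merges

names = ["Lithuanian", "Latvian", "Old Prussian", "Russian"]
similarity = {
    frozenset(("Lithuanian", "Latvian")): 78,
    frozenset(("Lithuanian", "Old Prussian")): 76,
    frozenset(("Latvian", "Old Prussian")): 68,
    frozenset(("Lithuanian", "Russian")): 67,
    frozenset(("Latvian", "Russian")): 66,
    frozenset(("Old Prussian", "Russian")): 64,
}

merges = upgma(names, similarity)
for label, score in merges:
    print(f"{score:5.1f}%  {label}")
```

With these figures, Lithuanian and Latvian join first, Old Prussian joins them next, and Russian joins last, which is the same grouping one would draw by hand.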

The error margin

Again, since these are just manual calculations, the error margin for standalone pairs, not contrasted to other groups, can be rather high (+/-7%). This figure follows from statistical fluctuations in the distance matrix, which were obtained when the same calculations were repeated under different subjective biases or when different languages of the same group, which supposedly have the same phylogenetic depth, were compared to a particular language (for instance, Hindi or Persian to all of the Iranian languages).

But whenever a comparison value is averaged over a large number of languages with the same phylogenetic depth, we obtain much more statistically stable results. Hence, the value of 35-36% for all of the IE languages (outside Anatolian) may have a much lower statistical error margin (no more than +/- 1-2%).
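The reduction of the error margin by averaging can be sketched with the usual square-root rule. The Python snippet below assumes that the +/-7% on single pairs behaves like independent random error, which is an idealization: pairs in a distance matrix share languages, so the real gain is probably smaller. The function name is hypothetical.

```python
import math

def error_of_average(single_pair_error, n_pairs):
    """Error margin of a mean of n_pairs values, assuming independent errors."""
    return single_pair_error / math.sqrt(n_pairs)

for n_pairs in (1, 4, 16, 49):
    print(n_pairs, round(error_of_average(7.0, n_pairs), 2))
```

Under this assumption, averaging over a few dozen pairwise comparisons is enough to bring +/-7% down to the +/- 1-2% range.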

More rigorous computations using a computer algorithm and extended lists should reduce the error margin even further.



A lexicostatistical tree of the Indo-European languages


Hypothesis: Albanian may be related to Celtic

Evidently, there is a profound dissimilarity between the Goidelic and Brythonic languages, which makes "Celtic" a rather deep, archaic grouping, similar in this respect to the Balto-Slavic or Iranian branches. In fact, the lexicostatistical depth of 52% assumed herein seems to be greater than that of the Balto-Slavic group (65%). Consequently, the great depth of the Celtic branch may allow Albanian, with its similar lexicostatistical separation of about 50%, to be added to the Celtic languages.

Is Albanian really Celtic? The similarities between Albanian and other Celtic languages can be directly exemplified by the following lexemes, some of which may turn out to be unique shared innovations:

OIr. uisce, Alb. uje, ujt (water) (obvious phonological similarity), rather opposed to Latin unda 'wave' with a modified meaning;
OIr. ainm; Old Welsh anu; Gheg mn (name) (apparently a metathesis from *namen > *anmen, rather unique in Europe);
OIr. <súil> /su:il/; Alb. <sy> /s/ (eye) (the relatedness to IE "sun" seems doubtful because of semantic differences);
OIr. duille, Welsh deilen; Alb. <gjethe> /dJee/ (leaf), as opposed to Latin folium, Gr. fúllon;
OIr. carric 'rock'; Welsh carreg; Alb. gurë (stone) (probably akin to Eng. hard, and Toch. B kärweñe 'stone');
Welsh gwraig /gura'ig/, Alb. grua (wife, woman);
Welsh w^y; Br. vi; Alb. vezë; ve (egg) (phonologically similar);
OIr. gin; Welsh or Cornish genau, Br. quen, Alb. goje (mouth) as opposed to Latin gena 'cheek', Eng. chin with a different meaning;
OIr. athir, Alb. atë (father) (a similar development, which may not be coincidental);
Ir. féar, Welsh gwair; gwellt; Alb. bar (grass);
[Also cf. the regular correspondence of Irish /f/ : Welsh /g/ : Alb. /b/ in Ir. <fear>, OIr. <fer>; Old Welsh <gur>; Alb. burrë (man) (as well as Latin vir; Anglo-Saxon wer, Lith. viras), and OIr. <find>, Welsh <gwyn>, Alb. <bardhë> /barðê/ (white)]
OIr. tech; Welsh <ty^>; Br. ti; Alb. shtëpi (house) (also Gr. stegn 'house', but Eng. thatch; Sanskrit stagati 'to cover' mostly with different meanings) (?);
Manx shimmey, Alb. shumë (many);
Welsh <gwddf>, Alb. qafë (neck);
Ir. ur, Alb. ri (new);
Ir. maith, Welsh mad, Alb. mirë (good);

On the other hand, there are a few unique Goidelic/Brythonic innovations (as long as these matches do not result from subsequent mutual borrowings).

OIr. macc, Old Welsh map (son) (IE root, semantically innovative);
OIr. tene, Welsh tân (fire) (IE root, semantically innovative);
OIr. lám, Old Welsh lau (hand) (akin to Latin palma, Anglo-Saxon folm 'palm', typically Celtic, but not unique);
OIr. carric, Old Welsh carrec (stone) (IE root, rather semantically innovative, but also cf. Alb. gurë);

Note that the Irish-to-Welsh percentage may be a little lower than the actual value because of the greater than usual number of dialectal (?) synonyms within the Welsh dataset, which are herein counted as 0.5-0.3 per lexeme, so we might expect the corrected figure for the Irish/Welsh relatedness to be a little higher (about 57%?).

Accordingly, this predicts two waves of migration into the British Isles, with Proto-Goidelic being the first to enter, and the Brythonic subgroup being a result of relatively recent migration from the Continent. Proto-Brythonic and Proto-Goidelic must have separated a long time ago somewhere in northwestern or central continental Europe.

As to Italo-Celtic, the current study is not sufficiently detailed and elaborate either to completely exclude or to corroborate the possibility of an Italo-Celtic grouping; rather, we see it as a possible but unlikely, and in any case very short-lived, state within the European Centum branch. There seem to be no specific Italo-Celtic shared innovations in the 46-list, except for the typical Celt. ni : Latin nos (we), which is also attested in other Indo-European groups and is not unique.


Hypothesis: Italic may be related to Hellenic

These two groups seem to have very much in common (herein ~69%), which should not be surprising, since the close proximity of Attic Greek to Latin has been well known since antiquity. Consider the following phonological and semantic similarities from the 46-table:

(Latin and Greek are transcribed phonetically):

Latin duo; Gr. dú:o (but Welsh dau; Old English tva; Pruss. dwai; Lith. du);
Latin kwattuor; Myc. Gr. kwetoro (but OIr. cethir; Alb. katër; Old English feower; Lith. keturì);
Latin ego; Gr. egó: (but Old English ik; Toch. ñäs; Welsh i; Alb. unë);
Latin pes; Gr. pú:s (foot);
Latin noks; Gr. nü:ks (but Welsh nos; Alb. natë; Old English niht);
Latin humus (ground); Gr. *xamos, xamai 'on the ground';
Latin folium; Gr. fúllon (leaf);
Latin frater; Gr. phra:ter (brother) (but br- in most other IE languages, Sanskrit "bhra:taH");
Latin lupus; Gr. lükos (wolf) (a similar loss of initial v-, which was rather unique among other IE groups);
Latin petra; Gr. pétros (stone) (as opposed to Ir. cloch; Alb. gur; Toch. B kärweñe; Old English sta:n);
Latin domus; Gr. dómos (home);
Latin rivus; Gr. rheos (river);

In all of the above instances we observe a close phonological and semantic proximity that can be explained by assuming a genetic unity of the Italic and Hellenic languages. This is readily explained from a geographical perspective: one of the few feasible passages to the Italian peninsula goes through the southern Balkans and northern Greece, so the only geographically realistic way for Proto-Italic to form was by separating from Proto-Hellenic at some point in time.

However, the lexicostatistical proximity of only about 45% between Modern Greek and modern Romance languages (such as Spanish) as compared to an average of about 40% among other modern European Centum languages indicates that the Italo-Hellenic proto-state was rather short-lived and unstable.


Hypothesis: Germanic may be related to Tocharian

An even more interesting find may be a possible proximity of Proto-Germanic to Proto-Tocharian (Old Eng. : Toch. B ~ 65%; German : Toch. B ~ 59%). This observation deserves further investigation:

Old English wæter; Toch. A wär; Toch. B wer < *wat'er (?) (but Ir. uisce; Welsh dwr; Gr. hüdo:r; Lith. vanduõ);
Goth. swistar; Toch. A s'ar; Toch. B s'er <*set'er (sister) (the same loss of aspirated intervocalic -t'-);
Goth. weis; Toch. B wes (we) (also, at least Lith. vedu 'we two' and OCS ve 'we two', but not as 'we' in the phonological form of *weis, and not in the European Centum languages);
Goth. hairto; Toch. B arañce <*harnte (?) (heart);
Goth. waurts; German Wurzel; Toch. A witsako (root);
German Blatt; Toch. A pält; Toch. B pilta (leaf, blade) (this root is also persistent in the Indo-Iranian branch);
German Stamm 'stem'; Toch. A s.tám; Toch. B stám (tree);
Goth. waurms; Toch. A wal (worm) (but also Latin vermis (with a full ending); Gr. rhomos; Ir. cruimh, Alb. krimb, Pruss. <Girmis>);

Consider also the strong aspiration in t'-, which led to the transformation t' > ts (not necessarily due to palatalization, as normally explained):

Toch. B mácer (mother); Toch. B pacer (father); Toch. B tkácer (daughter); Toch. B kuce (who);

Tocharian k- finds an explanation as a strongly aspirated t' > tk' > k' > k (apparently, the digraph <tk>, as preserved in Toch. A tkam (earth), Toch. A ckácer and Toch. B tkácer, marks the result of this aspiration).

Here is a short lemma that attempts to prove a regular correspondence between Proto-Tocharian *ka- and *tV- in the European Centum languages:

Toch. A kam; Toch. B keme, hence Proto-Toch. *kam < *tham (tooth);
Toch. A kantu; Toch. B kantwo, hence Proto-Toch. *kantwo < *thank'wo (with a metathesis) (tongue);
Toch. A kom.; Toch. B kaum., hence Proto-Toch. *kaum < *thaum (day, sun);
Toch. A tkam.; Toch. B kem., hence Proto-Toch. *kam (tkam) < *tham (earth, cf. Latin tellus, OIr. ti:r);
Toch. A karke; Toch B kara:k, hence Proto-Toch. *karak < *tharak, *tharakh, *tharah (tree branch);
Toch. A kayurs'; Toch. B kaurs'e, hence Proto-Toch. *kaurs'e < *thaurse (bull, cf. Taurus);
Toch. A kälyt-är; Toch. B kalt-är, hence Proto-Toch. *kalt-ar < *thalt-ar (s-tand); (?)

[Yet, in some other cases we have k < *k:

Toch. A känt; Toch. B kante, hence Proto-Toch. *kente (hundred);
Toch. A kanwem.; Toch. B keni, hence Proto-Toch. *ken- (knees (du.));]

The former process is possible if Proto-Tocharian stops were heavily aspirated, hence *ta > *tha > *hha > *kha > *ka before an open /a/, when the dentals were undergoing an allophonic lenition. The metathesis in *tankwo occurred precisely under the impact of aspiration, because both *th and *kh were pronounced in a rather similar way at some point, more or less like *hhanhhwo.

The Tocharian aspiration is reminiscent of Grimm's law and the aspiration in the West Germanic languages.

Some of Grimm's law seems to be already in progress in early Proto-Tocharian, since we have *k > *h > 0 in:

Goth. dauhtar; Toch. B tkácer (but Gr. thügáte:r)
Goth. hairto; Toch. B arañce <*harnte (?)

Other examples of Germano-Tocharian analogies might include:

Toch. A kumn-äs'; Toch. B känm-as's'äm; German kommen (come) [cf. Skt. gamati "he goes", Avestan jamaiti "goes", Lith. gemu "to be born", Gk. bainein "to go, walk, step", Latin venire "to come", which do not have the same semantic and phonological form as in German and Tocharian]
Toch B s'ayye; German Schaf (sheep) [no known cognates outside Germanic. The more usual IE word for the animal was *ewe.]

It should not be particularly surprising that the Proto-Tocharians wandered as far as the Taklamakan desert: remember that there was a massive Gothic migration to the Crimean peninsula and the rest of Europe about two thousand years later. The Indo-Europeans used horses, and the vast Ponto-Caspian and Central Asian steppes allowed for distant migrations across Eurasia.

I do not insist on Proto-Tocharian / Proto-Germanic unity; at this level that's just a tentative hypothesis, which follows from the data under consideration, but which is rather poorly demonstrated herein.


The Balto-Slavic unity is well-proven

The close proximity of Baltic and Slavic (herein 65%) languages is well-supported by many other studies (Dyen (1991), Ringe (2005)), including some articles you can find at this site. You can also easily see a number of shared Balto-Slavic lexical innovations in the present 46-lexeme list:

Pruss. ranko; Lith ranká; Latv. roka; OCS ro~ka [nasal]; Russ. ruká (hand, arm)
Pruss. nage; OCS noga (foot, leg)
Pruss. zwaigstan (or rather: swaigstan) 'the shining'; Lith. zhvaigzhde; Latv. zvaigzne; OCS zvezdá (star)
Pruss. zirgis "stallion"; Lith. zhirgas "horse"; Latv. zirgs "horse"; Russ. zhere-béts "stallion"

The close genetic proximity of both groups is evident to anyone familiar with any two Baltic and Slavic languages. It doesn't really take any research. Some selected words and phrases may not even require translation, and some meanings can be figured out with some effort and a knowledge of the regular correspondences. [Cf. as anecdotal evidence (phonetic transcription): Kaip ash buváu ministru "How I was a Minister" (a book by Zinkevicius), but a possible Russian translation Kak ya (also OCS azê and Bulgarian az) byl (also: byvál) minístrom; or Lith. Líye litús "Rains (pours) the rain" vs. Russ. Lyót líven' "Pours the shower/rain"] However, this close relationship should not be oversimplified or overestimated, nor does it mean that Lithuanian or Latvian is directly readable to speakers of Slavic languages, or vice versa.

In my personal humble opinion, arguments against the Balto-Slavic genetic grouping can only come either from western researchers unfamiliar with any of these languages or from nationalistically minded Balts who view any relation to Slavic as insulting. This long-standing dispute should finally be closed.

On the other hand, the difference between modern Lithuanian and Latvian seems rather pronounced. According to a lexicostatistical study by Girdenis & Mazhiulis (1994), we have 68% for the Lithuanian-Latvian pair and 70% for the Russian-Macedonian pair, the two most lexicostatistically distant Slavic languages, whereas the average inter-Slavic lexicostatistical figure normally oscillates around 75%. We should also take into consideration possible historical contacts between Proto-Latgalian-Latvian and Proto-Lithuanian-Samogitian throughout their history, which would further decrease the figure for the Baltic languages to about 62% because of possible mutual borrowings. This leads us to the conclusion that the Baltic group has many internal differences and is generally a little older than the Slavic group. See [Girdenis, Mazhiulis (1994)].

As to the Balto-Slavic lexicostatistical relatedness in that research, we have an average of 46% for Lithuanian vs. Slavic and an average of 42% for Latvian vs. Slavic, or ~44% on average. This yields a difference of 62 - 44 = ~18% between the hypothetical lexicostatistical depths of the Baltic and Slavic groups.

Girdenis, Mazhiulis (Swadesh-200, cognates) (1994)
                Lithuanian   Latvian   Old Prussian*   Russian
Lithuanian                   68%       (49%)           47%
Latvian                                (44%)           45%
Old Prussian                                           (41%)
(*Their data for Prussian are probably unreliable, because there is not enough attested material to fill in a Swadesh-200 list.)

I have also conducted my own lexicostatistical study using an unconventional list of wild flora/fauna (81 lexemes) which is supposed to be much less affected by loanwords due to presumably high stability of this type of basic vocabulary (see Balto-Slavic Lexicostatistics (in Russian)). This flora/fauna list yielded the following percentages:

Wild fauna/flora, 81 lexemes, cognates (2008)
                Lithuanian   Latvian   Old Prussian    Russian
Lithuanian                   64%       67%             48%
Latvian                                58%             46%
Old Prussian                                           *51%

(*The Prussian percentage should be decreased by a small number, because of the 600-year difference between the attested Old Prussian and a hypothetical "Modern Prussian", but that wouldn't affect the final outcome to any sufficient extent.)

Incidentally, that nearly coincides with Mazhiulis' data (again, different lexical lists normally coincide in absolute figures only by accident); hence we have (64 + 67 + 58)/3 = 63% for the average relatedness among the Baltic languages, and (48 + 46 + 51)/3 ~ 48% for the average relatedness of Russian to Baltic. Again, we have a ~15% difference between the hypothetical glottochronological ages of the Baltic and Slavic groups in this study.

Finally, the above figures partly corroborate the calculations of the present preliminary study (the 46-list):

The 46-list; cognates with phonological similarity (2008)
                Lithuanian   Latvian   Old Prussian    Russian
Lithuanian                   ~78%      ~76%            ~67%
Latvian                                ~68%            ~66%
Old Prussian                                           ~64%

Herein, we have [(78 + 76 + 68)/3] - [(67 + 66 + 64)/3] ~ 8% difference between Baltic and Slavic (Russian). The smaller difference may be attributed to the much higher stability of the 46-list and a different method of counting.
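The averages quoted for the last two tables are easy to recheck. The short Python sketch below simply repeats the arithmetic with the percentages from the tables above; the dictionary layout and function names are my own.

```python
def mean(values):
    return sum(values) / len(values)

# Percentages copied from the two tables above.
tables = {
    "Wild fauna/flora (81 lexemes)": {
        "baltic": [64, 67, 58],   # Lith/Latv, Lith/Pruss, Latv/Pruss
        "russian": [48, 46, 51],  # Lith/Russ, Latv/Russ, Pruss/Russ
    },
    "46-list": {
        "baltic": [78, 76, 68],
        "russian": [67, 66, 64],
    },
}

def baltic_minus_russian(table):
    """Gap between the Baltic-internal average and the Baltic-Russian average."""
    return mean(table["baltic"]) - mean(table["russian"])

for name, table in tables.items():
    print(f"{name}: Baltic {mean(table['baltic']):.1f}%, "
          f"Baltic/Russian {mean(table['russian']):.1f}%, "
          f"difference {baltic_minus_russian(table):.1f}")
```

The unrounded gaps come out at about 14.7 for the fauna/flora list and about 8.3 for the 46-list, in line with the rounded ~15% and ~8% given in the text.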

Consequently, the difference within the Baltic languages is a little greater than normally assumed, whereas the difference between Slavic and Baltic is less than normally assumed, which makes Balto-Slavic a statistically reasonable grouping. It is true, however, that the Slavic languages cannot be directly included in the Baltic group as a subgroup; that would be going too far. Rather, they seem to have separated much earlier than most Baltic languages.

Hypothesis: Is Balto-Slavic related to Germanic?

The present lexicostatistical result, that modern Baltic and Slavic are related to modern Germanic at about 50%, contradicts the fact of pronounced satemization in Balto-Slavic. Herein, we have BS/Germanic ~ 50% and BS/Indo-Iranian ~ 35%, which could be due to a lexicostatistical error. The close match may also be attributed to the archaism of both groups. Nor are there any clear-cut innovations shared by Balto-Slavic and Proto-Germanic in the 46-list. More extensive research on the subject is needed to support or discard either of the hypotheses.


West Iranian

First of all, it should be noted that the traditional four-corner scheme of Iranian languages (Northwest, Southwest, Northeast, Southeast Iranian) hardly holds true in the perspective of contemporary accurate lexicostatistical studies. The Iranian languages are an extremely complex branch of IE languages with a glottochronological and historical depth of at least 3000 years, similar in this respect to the Balto-Slavic branch, but more numerous and extending over a highly differentiated geographic territory.

Most West Iranian subgroups are closely related (cf. Modern Persian/Kurdish ~ 80%). The close proximity of the West Iranian languages can easily be explained by recalling that they are in many ways similar to Romance: they result from the expansion of the Median and Persian Empires since c. 800-600 BC. Although the Median Empire and the unattested Median language are sometimes linked to Kurdish (without any clear arguments), the present research rather shows that Kurdish is much closer to Persian.

However, there is a greater lexicostatistical distance between Modern Persian and the "Northwest Iranian" languages, such as Zazaki (Dimli), the lesser Northwest Iranian languages (such as Harzani, Semnani, Gorani, Kermanshahi, Sangisari), and probably Parthian and Mazandarani (Zazaki/M. Persian ~70-75%). [The results for the lesser languages have been inferred from the phonological transitions in the numerals 1-10.]

Kurdish is closely related to Persian, Zazaki is not

This can be shown at least in the following way:

(1) wolf: Kurdish gur, Pahlavi gurg, Balochi gurkh, Persian gorg, but Avestan varkha, Old Persian varka, Zazaki verk. Herein, we have a very typical post-Old-Persian innovation with the word-initial g-;
(2) three: Avestan thri > se in most West Iranian, chi in Old Persian, but Zazaki hire;
(3) I: the loss of the historical pronoun azem in many West Iranian languages with its substitution by man, but the retention of ez in Zazaki;
(4) year: Kurdish sal, Balochi so:l, Persian sal, but Avestan sared, Old Persian θard, and Zazaki serre with -r-;
(5) heart: Kurdish dil, Balochi dil, Persian del, but Zazaki zerre.

It can be seen that Zazaki is phonologically very different from Persian. Consequently, the linguistic legend of Kurdish being related to the semi-legendary unattested language of Media cannot hold true, although this may be true of Zazaki, Mazandarani and some other Northwest Iranian languages, which evidently exhibit many differences from the languages that descend from the Persian Empire.



Avestan also demonstrates close proximity to Proto-West-Iranian. Just think: the Zoroastrian religion would never have gained much acceptance in Persia if it had been propagated in a language radically different from, or completely mutually unintelligible with, Old Persian. A more important argument for the close link between Avestan and Proto-West-Iranian is the lack of East Iranian lenition in Avestan: it does have some of it, but not enough.

Cf. Pr. bäradär, Av. brâta, but Pashto wror; Shughni verod (brother)
Pr. xahär; Av. xvaharha, but Pashto khor; Shughni yakh (sister)
Pr. doxtär; Av. duxdhar; but Pashto lûr; Bactrian logda (daughter)
Pr. atesh; Av. âtar-sh; Pashto ol; Shughni yâc (fire), etc.

We can see in these examples, that the East Iranian languages have undergone considerable changes and exhibit little phono- or lexicostatistical proximity to Avestan. Lexicostatistically, we have Avestan/Modern Persian ~ 78%, but Avestan/East Iranian ~60% on average [Avestan/Shughni ~59%, Avestan/Ossetic ~59%, Avestan/Pashto ~66%, Avestan/Wakhi ~56%].

Consequently, Avestan may be a good candidate for Proto-West-Iranian.

East Iranian

This is probably the most complex and most controversial group among the Indo-European languages. Having been studied only as late as the 19th-20th centuries, it remains largely unknown to many Indo-Europeanists in the west. For years, researchers have tacitly assumed that there should be nothing in Iranian that can't be found in Avestan, ignoring the many bizarre peculiarities of this family. It was, for instance, poorly represented in Dyen's lexicostatistical research. The group's textbook classification (Northeast to Southeast Iranian) is completely unacceptable and is hardly supported by any linguistic arguments at all. In fact, a closer look reveals a complicated branch with many different offshoots. The present lexicostatistical study, for instance, shows that the actual difference between Russian and Lithuanian might, in fact, be less than that between Wakhi and Shughni, both of which are believed to be "Pamir" languages, or are sometimes even called "Pamir dialects".

The group average lexicostatistical depth of about 60% indicates that the East Iranian languages have been hiding around the Pamir and Hindu-Kush Mountains probably since about 1000-1500 BC, branching off into several subgroups shortly after the period of separation of the whole Indo-Iranian supergroup. As a result, they can be regarded as complex and probably even a rather independent taxon of Indo-Iranian languages.

The most obvious feature of the East Iranian languages is a widespread lenition of consonants (d > ð > l; b > v; k > c, etc.) which, by the way, might be an early areal, rather than genetic feature. This makes East Iranian words look a far cry from the "normal" Indo-European languages:

Cf. Ossetic ærtæ, Shughni aráy (three)
Pashto lûr, Yidga lughdoh; Ishkashimi udoGd (daughter)
Yd. uxsho; Sanglechi khoar; Shughni xo:gh (six)
Pashto le:wê [metathesis]; Shughni urj (wolf)

The Pamir languages may form an internal genetic unity, apparently with the three following subbranches: (1) Yidgha-Munji; (2) Ishkashimi-Zebaki-Sanglechi; (3) Shughni-Rushan-Sarikoli-Yazgulami. The first two are rather closely related, (1)/(2) ~80%, while the third one is a little more differentiated, (1)/(3); (2)/(3) ~70% (on average), with Yazgulami being particularly different. There might also be some speculations on relating ancient Bactrian (the language of the Kushan Kingdom) to Yidgha-Munji, but a precise lexicostatistical study of Bactrian is absent due to lack of lexical material.

Wakhi, a language located in the Hindu-Kush mountains just across the ridge from Burushaski, is normally thought to be "Pamir", but differs from other Pamir languages in many respects. It exhibits less East Iranian lenition (cf. trui "three"; sha:d "six"; ðaGd "daughter"), and possesses certain archaic lexemes and innovative phonological formations (suk "we"; bu "two"; pazuv "heart"; naghd "night" cf. Av. xshap; naxtu), which demonstrate the archaism of Wakhi. It probably separated early on and has been isolated from the rest of the Iranian languages for a long time.

Yagnobi (Yaghnobi), or Neo-Sogdian, spoken only in a few villages in Tajikistan, is presently strongly contaminated by Tajik even in its basic vocabulary, which creates many difficulties in lexicostatistical studies. However, it should be noted that there is no evidence that it is particularly close to Ossetic, as is assumed in the Northeast-to-Southeast textbook classification.

Ormuri and Parachi have been excluded from the calculations due to insufficient material, yet there are reasons to believe their separation from other Iranian subgroups is quite ancient.

Ossetic is one of the most famous offshoots of Proto-East-Iranian that must have separated quite early on (not later than 700 BC judging from historical assumptions). Among other features, it is characterized by an extensive metathesis:

ærtæ < *tere (three),
ævzhag < *zevag (tongue)
ærvad < *verad (brother)
art <*at(e)r (fire),

and further lenitive changes (*p > f):

fêd < *ped (father)
fêrt < *putr (son)

The lexicostatistical relatedness of Modern Persian to East Iranian (62%) is nearly the same as, or just slightly greater than, that of the East Iranian languages to each other (58-59%), which means that all Iranian languages separated from the common Iranian stem almost simultaneously, and if the East Iranian languages constituted a genetic unity, it was only for a relatively short period of time.



Khotanese is a historically important and well-attested Iranian language of the Tarim Basin (Taklamakan Desert) known since c. 500-700 AD, but almost completely forgotten in most classical Indo-European studies. It probably has nothing to do with the ancient Sakas, but the name has stuck and is unlikely to change. For all practical purposes, we could think of Khotanese (in the south) and Tumshuquese (in the north) as the "Taklamakan" Iranian languages, not "Sakan"; at least this name would be more self-explanatory. There also existed several other languages of this branch, although they are poorly attested.

The Khotanese/Avestan lexicostatistical relatedness of ~73% corresponds to the glottochronological separation of about 2000 years prior to the mean dating of Khotanese (600 AD) and Avestan (600 BC), that is c. 2000 BC. This separation depth matches the average relatedness of Khotanese / Modern Iranian languages = (66 + 64 + 59 + 60 + 56 + 62 + 54) / 7 ~ 60%, which should be adjusted by a coefficient of about 0.9 to correct for the early dating of Khotanese (500-700 AD), thus yielding ~54%, or again 1800 BC.
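The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the seven percentages and the 0.9 coefficient are the ones quoted above, while the variable names are hypothetical.

```python
# Khotanese vs. the modern Iranian languages (percentages quoted in the text).
khotanese_vs_modern = [66, 64, 59, 60, 56, 62, 54]

# Plain arithmetic mean over all listed values.
mean_pct = sum(khotanese_vs_modern) / len(khotanese_vs_modern)

# Coefficient from the text, compensating for the early attestation
# of Khotanese (500-700 AD) relative to the modern languages.
ADJUSTMENT = 0.9
adjusted_pct = mean_pct * ADJUSTMENT

print(f"mean: {mean_pct:.1f}%  adjusted: {adjusted_pct:.1f}%")
# mean: 60.1%  adjusted: 54.1%
```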

This means that Proto-Saka could have separated from other Iranian languages at a very early stage, probably as early as the period of existence of the Oxus civilization; therefore it should be regarded as a separate Iranian group, which is also phonologically corroborated by the lack of East Iranian lenition, and geographically, by the great distance from the West Iranian languages.

Iranian in general

As to the Iranian languages in general, they do share many often unique lexical, semantic and phonological innovations which prove the existence of a rather long historical period of a common Proto-Iranian state.

chasman; Khotanese ceima; Persian cheshm; Wakhi cözm; Yidga. cam; Shugni cem; Ossetic cæsht (also Sanskrit chakshus.h) (eye)

xshap, Avestan xshap; Persian shab; Pashto shpa; Yaghnobi xishap; Ishkashimi sab, sxab; Shugni shab; Ossetian æxshæv <*xeshev (but also archaic Av. naxturu 'nocturnal'; Wakhi naGd) (night)

raocah 'daylight'; Persian ruz; Zazaki roje; Pashto wradz; Wakhi rwor; Sanglechi rusht (day)

asanga 'stone'; Khotanese samga; Persian sang; Yaghnobi sank; Sanglechi song (stone)

gaosha; Persian gush, Pashto Gvazh; Yaghnobi Gu:sh; Wakhi ghish; Shugni ghox; Ossetic x"ush (ear)

taoxma; Persian tokhm; Wakhi tuxm murG; Shugni tarmurx (egg)

These lexical items can be seen as typically Iranian, indicating that Proto-Iranian has existed as a single unity for a time long enough to produce local innovations even in a short 46-list.

Also see Starostin's similar dendrogram of Iranian languages, which tends to confirm the conclusions of the present study as far as the tree structure and lexicostatistical percentages are concerned. However, it should be noted that Starostin's "recalibrated" glottochronology often yields dates that are too recent and is probably incorrect. For instance, he provides (-620) for the separation of West Iranian, whereas we know well from history that the Median Kingdom was first mentioned in 836 BC, and its language is normally believed to be West Iranian; as a result we have an obvious contradiction.


The position of Armenian is seen herein as highly controversial, and its discussion has been excluded from the present notes. It may very well be related to Indo-Iranian languages, not Proto-Greek, as many people assume.


Nuristani, Dardic and Indo-Aryan

The Nuristani-Dardic branch (~65%) seems to be as internally close as Balto-Slavic [although Khowar shows many dissimilarities from other members of the Dardic group]. The same is true for the mainstream Indo-Aryan languages (~65%).

Kashmiri does not seem to be Dardic, and was herein included into the mainstream Indo-Aryan subgroup.

Note the considerable difference between Sinhalese and Hindi-Kashmiri (~55%), which indicates that the Sinhalese-Maldivian subgroup must have been a very early offshoot.

The separation of Proto-Nuristani-Dardic from the main Indo-Aryan branch seems to have occurred at the depth of 54% which must correspond to roughly 1800-1600 BC (the archaeological and historical date normally associated with the "Aryan invasion").

To appreciate the shared Nuristani-Dardo-Indo-Aryan (or "Indic", for short) phonological transformations and lexical innovations, consider the following examples from the 46-list:

dits; Kalasha Jhiph; Skr. jihvha:; Hindi jibh; Sinh. diva (tongue), as opposed to Avestan *hizva:s; Pers. zabân, Bactrian ezbago; Pashto zhêba; Yaghnobi zivok; Shugni zev. [jh >d : z]

Kati su; Kalasha súri; Skr. su:ryaH; Hindi su:rey (sun), as opposed to Avestan hvar-z; Pers. khurshed, Yaghnobi khur; Shugni xer; Wakhi yir [s : x]

Kalasha hía; Skr. h'Rdaya; Hindi hridey; Sinh. <hardaya>; Dhivehi hi-iy (heart), as opposed to Zazaki <zerrî>; Pashto zrrê; zaru; Shugni zrað; Ossetic zhærdæ; (but cf. Kati ziri < an Iranian loanword?) [h : z]

ango; Khowar angar; Kalasha angár; Skr. agniH, àNg'ara; Hindi a:g (fire), as opposed to Avestan âtar-sh; Pers. atesh, Ossetic art < *atr; Pashto or; Yaghnobi ol; Yidgha yur; Shugni yâc. [-ng- : -t-/0 ?]

Kati kor; Kalasha ka; Skr. karNa; Hindi kan; Sinh. kana (ear), probably akin to Avestan karana 'side, flank'), as opposed to Avestan gaosha; Persian gush, Pashto Gvazh; Yaghnobi Gu:sh; Shugni ghox (semantic innovation).

Kati radur; Kalasha rat; Skr. ra:tra; Hindi ra:t; Sinh. <raeya> (night), akin to Lith. /ri:ta/ 'morning'; Pers. ruz 'day') as opposed to Av. xshap; Pers. shab; Pashto shpa; Yaghnobi xishap; Wakhi naGd. (semantic innovation)

Again, as in the case with Proto-Iranian, from a great number of shared features within a short word list, we can deduce that Proto-Nuristani-Dardo-Indo-Aryan (Proto-Indic) has had a prolonged period of separate existence, at least 2000 years long.


Hypothesis: Nuristani are part of (or close to) Dardic

You can see from the examples above that the Nuristani languages (such as Kati (Kata-viri), Kami (Kam-viri), Wasi) are clearly related to Indic, since they inherit the same transformations and innovations, and thus cannot be seen as "intermediate" between Indic and Iranian, as sometimes claimed. They also seem to share some common phonological and semantic formations with the Dardic languages, and can hardly be viewed as radically separate:

(1) Kati g'u; Khowar g'oG; Kalasha goík (worm);

(2) Kati uts; Khowar awa; Kalasha a (I) (as opposed to Skr. asmad; Kashmiri bu, boh'; Hindi me; Lahnda mae; Bengali ami; Sinh. mama) (probably, an early loss of the second part of *as-mad);

(3) Kati nu; Khowar nan; Kalasha áya (mother) (curiously, probably akin to Eng. "nanny" as also to a similar word in Eastern Iranian languages) (as opposed to Skr. ma:tar, ma:ta:; Kashmiri moju; Hindi ma:, ma:ta:ji; Sinh. <mava>). The much too overused objection to children's words is seen as exaggerated herein: words like "mother, father, nanny" are quite normal words, and they are not easily re-created from scratch each time in each language;

(4) Kati sh'üt; Khowar chuti; Kalasha chom (earth) (as opposed to Skr. mahi:; Kashmiri metsu, boh'; Hindi mitti 'clay'; Bengali mati; Gujarati mati; Sinh. <pas>, <poloova> ) (apparently, akin to East Iranian: Ishkashimi shit; Sarikoli sit; Yazgulami shat; Ossetic sêdJêt)

However, the shared innovations in question are few and may have formed independently because of an Iranian adstratum, borrowings or by other means.


The Proto-Indo-Iranian language existed a long time ago and/or was rather short-lived. This conclusion may be drawn from the fact that, in the 46-list under consideration, relatively few traces remain in modern languages that could demonstrate the existence of Proto-Indo-Iranian. The uniquely similar words include:

(1) Av. âf-sh, ap; Pr. âb; Pashto obê; Yaghnobi op; Wakhi yupk; Kami oa, op; Skr. a:paH > paniya (?); (akin to Lith. /upe/ 'river') This is a semantic innovation, which was created probably because water was closely associated with rivers in desert Central Asian regions, hence the semantic transformation "river" > "water"; it's more likely, however, that this lexeme is only present in Iranian, whereas its appearance in Indic is recent, cf. Sinh. <vatura>.

(2) Av. bu:mä; Old Pr. bu:mis; Kurdish bin; Ormuri (Logar) bouma; Kami. b'üm; Khowar b'um; Skr. bhu:miH; Hindi bhu:mi; Sinh. bin

(3) Av. masya; Pr. mâhi; Skr. ma:tsya; Hindi machhi; Marathi masa; Gujarati macheli; Sinh. malu (fish); probably akin to Lith. mesa; Eng. meat (not in the 46-list); the introduction of this word may indicate that fishery was an important component of Indo-Iranian subsistence.

As we have seen, the independent changes in Proto-Indic and Proto-Iranian are very pronounced and they share few common innovations, which indicates that both languages have existed separately from each other for some considerable amount of time, and no longer have much in common (~40% in the present study). Glottochronologically, from the considerations of the present study, they could have separated c. 3000-3500 BC, which is about 1000-1500 years earlier than usually assumed. That would mean that the Proto-Indo-Iranians entered Central Asia soon after 4000 BC (see Map of Indo-Iranian Migrations), quickly migrated along the Oxus valley, reached the Hindu-Kush and Pamir mountains, where the early Proto-Indic language completely separated by 3000 BC, penetrating the mountain ridges, and staying there with some internal differentiation until about 1700 BC when the Indo-Aryan languages finally began to migrate into northern India. Although this is not reflected in this study, it is also plausible to assume that Indo-Aryan per se had initially been a subbranch of the Dardic languages that expanded into the Indian subcontinent.

But then, why do we often hear about the close proximity between Sanskrit and Avestan? The probable explanation is that the classical "dictionary" Sanskrit is not a real language; it is rather a quasi-etymological collection of lexemes which belong to different Indo-Aryan dialects from different periods, whereas the earliest Vedic Sanskrit, which has been passed down orally for many generations, is even more confusing and sometimes not even entirely decipherable. In any case, the classical Sanskrit cannot be seen as something of a Proto-Indo-Aryan, since it was basically an artificial conlang created by Panini, and then lexically expanded over the course of many centuries. Consequently, a casual comparison with Avestan may produce many synonyms and many obscure parts in Vedic Sanskrit texts which can be interpreted in different ways and thus provide a superficial impression of a close relationship between Sanskrit and Avestan.

On the other hand, in the present study, only real languages of the same period can be compared, which helps to uncover the lack of common lexical background between Iranian and Indic languages and pushes the Indo-Iranian separation further back in time. Similar difficulties in finding the common Indo-Iranian proto-state were also noted in other lexicostatistical studies of modern languages, first by Dyen, Kruskal, Black (1992), who complained about the "absence in the present classification of an Indoiranian group", and then by Ringe et al. (2005).

Nevertheless, this rather significant question stands to be further investigated in a more detailed research.



The current study confirms the early separation of Hittite. This is evident from the following considerations. Comparing Hittite to Latin (52%) (attested c. 100 BC), Attic Greek (51%) (400 BC), Avestan (50%) (600 BC), and Sanskrit (~50%) (400 BC) yields nearly equal figures, which means that Hittite seems to be equidistant from the other Indo-European groups.

Since Hittite is dated to c. 1600 BC, there would be even fewer matches if it had existed for another 1300 years to be contemporary with Latin and Avestan; therefore this average figure of ~50% should be further reduced to about 45% of relatedness to most Indo-European groups. This result shows more differentiation than in the normal Indo-European pairs: Latin/Greek (67%), Greek/Avestan (57%), Latin/Sanskrit (~57%). Glottochronologically, that figure would translate to about 5000 years before the Latin/Greek/Avestan/Sanskrit separation (c. 300 BC), or circa 5300 BC (see below).

Therefore, we repeat the conclusion that Anatolian group should be regarded separately from the mainstream Indo-European languages, which supports the hypothesis of Indo-Hittite (Indo-Anatolian).


Attempting to date Proto-Indo-European

One of the common reasons for the criticism of glottochronology is the alleged insufficient lexicostatistical distance between Modern Icelandic, Modern Armenian, and Modern Georgian and their respective old languages. However, the critics of glottochronology seem to ignore the law of large numbers, which states that even if some of the languages might deviate considerably from the mean in their phono- and lexicostatistical behavior, the arithmetic average over a large number of languages would be relatively stable and most likely correct. On the other hand, the probability of running into languages with considerable deviation from the mean would be rather low, while, in many cases, the abnormal behavior of such deviant languages may be explained and even consistently predicted using various ad-hoc assumptions, such as geographic and linguistic isolation on a distant island (as in the case with Icelandic) or in the mountains (as with Armenian, and Georgian).

However, we will try not to overuse any ad-hoc assumptions herein. The law of large numbers would be just enough to establish the temporal position of PIE. Here is what we can do. We can (1) calculate the mean percentage for all of the Indo-Aryan languages, (2) recalibrate the rest of the list using the obtained lexicostatistical depth set to 1600 BC, the archaeologically attested date of the Indo-Aryan invasion into India, (3) calculate the mean value for all of the Indo-European groups, and (4) finally convert that number into an approximate date in years using the aforementioned calibration date.

The mean percentage of separation among Nuristani-Dardic (excluding the unreasonably deviating Khowar), Sinhalese, and Hindi-Kashmiri seems to converge to an average depth of about 56% [(54 + 57 + 50 + 60 + 62 + 50 + 64 + 59 + 52) / 9 = 56], which should correspond to circa 1600 BC judging from the archaeological and historical record (also see The Map of Indo-European Migration).

Hence, from the logarithmic glottochronological formula, we have:
0.56 = x ^ 3.6
x = 0.56 ^ 0.28 = 0.85
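The calibration step above can be sketched in Python. It assumes, as the text does, that 56% corresponds to 3.6 millennia before 2000 AD (i.e. 1600 BC) and that the shared fraction decays as rate ** millennia. The function and variable names are hypothetical and are not taken from the unfinished program mentioned earlier.

```python
import math

# Calibration point from the text: ~56% shared vocabulary is taken to
# correspond to 1600 BC, i.e. 3.6 millennia before 2000 AD.
CALIBRATION_SHARE = 0.56
CALIBRATION_MILLENNIA = 3.6

# Solve CALIBRATION_SHARE = rate ** CALIBRATION_MILLENNIA for the rate.
rate = CALIBRATION_SHARE ** (1 / CALIBRATION_MILLENNIA)

def millennia_for_share(share, rate=rate):
    """Separation depth in millennia implied by a shared-vocabulary fraction."""
    return math.log(share) / math.log(rate)

print(round(rate, 2))                       # 0.85
print(round(millennia_for_share(0.56), 1))  # 3.6
```

Inverting the same formula gives the depth for any other percentage, which is how the dates quoted in the following paragraphs can be approximately reproduced.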

After some calculations, that would produce the following calibrated glottochronological row:

2000 AD    1000 AD    0    1000 BC    2000 BC    3000 BC    4000 BC    5000 BC
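Only the column dates of this row are given here. As a rough illustration, the percentage implied at each date can be recomputed from the rounded rate of 0.85 per millennium. These are recomputed values under that assumption, not figures recovered from the original table. The column "0" is assumed to be two millennia before 2000 AD, and the names `RATE`, `columns` and `row` are hypothetical.

```python
RATE = 0.85  # rounded per-millennium retention rate derived in the text

# Column dates of the row; each step is one millennium further back from 2000 AD.
columns = ["2000 AD", "1000 AD", "0", "1000 BC",
           "2000 BC", "3000 BC", "4000 BC", "5000 BC"]

# Shared-vocabulary percentage implied at each column date.
row = {label: round(100 * RATE ** k) for k, label in enumerate(columns)}

for label, pct in row.items():
    print(f"{label:>8}  {pct:3d}%")
```

Under these assumptions the row runs from 100% at 2000 AD down to about 32% at 5000 BC, passing through roughly 61% at 1000 BC.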

This table function seems to be more or less consistent with the following historically attested facts and plausible assumptions:

(1) (very approximately) with the attribution of Proto-Celtic (Irish/Welsh ~ 52% or 57%, as corrected for synonyms -- see above) to c. 1500 BC and the Late Bronze Urnfield culture (1300-750 BC);

(2) with the separation of Hellenic from Italic occurring before 1600-1900 BC when the Proto-Greek tribes must have entered Greece. Glottochronologically, the Attic Greek/Latin relatedness (69%) corresponds to about 2300 years before 200 BC (a mean value between the approximate dates of the Greek and Latin languages), thus yielding c. 2500 BC for the late Helleno-Italic proto-state.

(3) with the Baltic (~75%) expanding around 200 AD, just a little earlier than late Proto-Slavic (c. 450-500 AD), because the lexicostatistical distance between Lithuanian and Latvian in a more accurate lexicostatistical study by Mazhulis (1994) using Swadesh-200 is just slightly greater than among the Slavic languages to each other, whereas the period of Proto-Slavic split seems to be historically datable to 400-500 AD; therefore, 0-200 AD seems to be a plausible value for the separation date of Lithuanian and Latvian-Latgalian.

(4) with the likely separation of Proto-Zazaki from Persian (70-75%) soon after the end of the Old Persian period (300-500 BC);

(5) with Ossetic separating from other East Iranian at 60% (~1100 BC). This dating looks right, because the Scythian languages are attested in the Caucasus Mountains just circa 800 BC, and their migration from the Pamirs must have been relatively quick (because of the horse-drawn carts) and occurring at an early stage.

(6) with the existence of the BMAC civilization along the Oxus river during 1800-2300 BC, which should probably be attributed to the Proto-Iranian state, apparently located along the Oxus as well (see The map of Indo-Iranian Migration). According to the present calculations, the era of Proto-Iranian would roughly correspond to 60-40% thus embracing the period from 1100 to 3000 BC, which includes the period of the BMAC as a subset.

(7) with the diversification of West Iranian languages (70%, 200 BC) after the fall of the Persian Empire by 330 BC.

Calculating the approximate upper date for late PIE:

Now that we have obtained the glottochronological table function, we can use the figures for the Greek/Avestan (57%), and Latin/Sanskrit (~57%) relatedness to place the upper limit for Proto-Indo-European at the level of 3500 years before the average dating of Avestan (600 BC), Sanskrit (400 BC), Latin (100 BC), and Greek (400 BC), or circa 3900 BC.
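The estimate in this paragraph can be retraced as follows. This is a sketch under the text's own assumptions: a rounded rate of 0.85 per millennium, the four attestation dates quoted above, and a depth rounded to the nearest century. All names are hypothetical.

```python
import math

RATE = 0.85  # rounded per-millennium retention rate from the calibration in the text

# Approximate attestation dates used in the text (negative = BC).
attested = {"Avestan": -600, "Sanskrit": -400, "Latin": -100, "Greek": -400}
mean_attested = sum(attested.values()) / len(attested)

# Depth in years implied by 57% shared vocabulary, rounded to the nearest century.
depth_years = round(1000 * math.log(0.57) / math.log(RATE), -2)

# Upper limit for late PIE: the depth counted back from the mean attestation date.
pie_upper = mean_attested - depth_years

print(mean_attested, depth_years, pie_upper)  # -375.0 3500.0 -3875.0
```

The result, 3875 BC, is what the text rounds to circa 3900 BC.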

We can also obtain a similar number starting from modern languages:

Celtic / Indo-Aryan = (38 + 39 + 42) / 3 ~ 39%
Celtic / Balto-Slavic = (32 + 32 + 38 + 37 + 44 + 35) / 6 ~ 36%
Balto-Slavic / Indo-Aryan = (38 + 36 + 35 + 32) / 4 ~ 35%
Balto-Slavic / Iranian = (39 + 38 + 28 + 35) / 4 ~ 35%
European / Iranian = (39 + 37 + 30 + 34) / 4 ~ 35%
European / Indo-Aryan = (29 + 34 + 39) / 3 ~ 34%
PIE ~ 36-35% or circa 4400 BC
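The group averages and the resulting date can be recomputed with a short script. It is only a sketch: the percentages are those listed above, the rate of 0.85 per millennium is the rounded calibration value, 2000 AD is assumed as the reference year for the modern languages, and the names `comparisons` and `year_for_share` are hypothetical.

```python
import math

RATE = 0.85  # rounded per-millennium retention rate from the calibration in the text

# Pairwise percentages as listed in the text for each pair of groups.
comparisons = {
    "Celtic / Indo-Aryan": [38, 39, 42],
    "Celtic / Balto-Slavic": [32, 32, 38, 37, 44, 35],
    "Balto-Slavic / Indo-Aryan": [38, 36, 35, 32],
    "Balto-Slavic / Iranian": [39, 38, 28, 35],
    "European / Iranian": [39, 37, 30, 34],
    "European / Indo-Aryan": [29, 34, 39],
}

means = {pair: sum(v) / len(v) for pair, v in comparisons.items()}

def year_for_share(pct, reference_year=2000):
    """Calendar year (negative = BC) implied by a shared percentage."""
    millennia = math.log(pct / 100) / math.log(RATE)
    return reference_year - 1000 * millennia

for pair, mean in means.items():
    print(f"{pair:28s} {mean:5.1f}%")

print(round(year_for_share(35.5), -2))  # -4400.0, i.e. about 4400 BC
```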

The conclusion is that PIE must have separated into early Indo-European dialects by circa 4100 BC, which is in rather good correspondence with Gimbutas' theory.



Does any of this agree with other models?

Does any of this agree with other researchers' models? Sometimes, it does.

See Vaclav Blazhek, On the internal classification of Indo-European languages: survey (2005)

(1) Eric Hamp (1990)
We have some essential agreement with the non-lexicostatistical model of Eric Hamp (1990), who based his classification on specific isoglosses in phonology, morphology and lexicon. For instance, he also tends to place Balto-Slavic in the same group with Centum, which is also a possible conclusion of this work on purely lexicostatistical grounds. He also agrees that Thracian is an early Balto-Slavic offshoot. He seems to misplace Greek though, because of its alleged proximity to Armenian (a question I have not addressed herein). Otherwise, his conclusions are rather traditional.

(2) Starostin (2004)
There is some interesting agreement with Starostin's glottochronological study (2004). Note that the counting and calibration methods in this lexicostatistical study were completely different. Starostin has:
-4600 [my -5300] for the Anatolian separation;
-3800-3300 [my -4100] for the mainstream Indo-European languages separation;
-1200 [my -700] for Balto-Slavic;
-80 [my -200] for Latvian-Lithuanian;
-1000 [my -1900] for Brythonic-Goidelic;
-250 [my -700] for late Proto-Indo-Aryan;
-1200-700 [my -1100] for late Proto-Iranian;
+180 [my -200] for Shugni-Munji(Yidgha)-Ishkashimi (he also found the early separation of Wakhi (-500), which surprised me as well; and correctly identified the long separation of Ormuri-Parachi, etc, see The dendrogram of Iranian languages);
+300 [my -100] for late West Iranian;
-1100 [my -1700] for late Proto-Dardic (he also noticed the early separation of Khowar and Kalasha);
etc. None of these is too far from the figures in the present study.
At least, we have some basic, fundamental agreement here. You can also notice that Starostin has a smaller offset value, so it is basically a matter of calibration (whereas my calibration method was very rough and approximate in this work, so I don't even insist on it; it's not even the aim of this work to elaborate on a correct glottochronological calibration, because I was mostly interested in percentage values and internal relatedness of various subbranches).

The rest of Starostin's cladistics seems to be skewed, apparently because he relied too much on statistical calculations in short word lists, which are not always sufficiently accurate to produce an error margin small enough for building a correct dendrogram, when the separation times are much too close (a common problem in statistical phylogeny). To avoid this common error, I simply put an honest I-don't-know and relied on classical conclusions and rough approximations, whenever I felt there may be something wrong with the statistical side. On the other hand, some of his dates may in fact be more accurate, because I used a very small lexical base for just a small number of languages.