Demographic events in human history are expected to leave traces in both languages and genes, hence Darwin's intuition that the best possible description of linguistic relationships among populations would be their phylogenetic tree. In the 1990s, Sokal and Cavalli-Sforza tested Darwin's hypothesis empirically, concluding that linguistic and genetic distances are strongly correlated. However, many questions remained open, owing to the lack of suitable genetic data at the time and to the limitations of linguistic comparisons based on the lexicon. Lexical comparisons have proved reliable for closely related languages but are useless for large-scale comparisons across language families. Recent methodological developments focusing on syntax now enable more sophisticated quantitative studies across language families. Moreover, the advent of Next-Generation Sequencing technologies makes it possible to sequence entire genomes rapidly, increasing the quantity of genetic information available from populations worldwide. In this thesis, we took advantage of state-of-the-art methods for the comparison of linguistic and genomic data, in two ways: (1) We combined linguistic and genomic data to shed light on the origin and spread dynamics of the Indo-European (IE), Uralic (UR) and Altaic (AL) language families in Eurasia. We showed that the languages of these families form three well-separated clusters, but that UR linguistic outliers share evident similarities with their current geographical neighbours. Genetically, the pattern is parallel: Western UR speakers appear closer, to varying degrees, to their IE-speaking neighbours, and the Khanty show similarities with AL speakers. Finally, we tried to interpret some of the observed historical patterns through a comparison of ancient and modern DNA variation, which suggests a South-to-North spread of UR languages in present-day Finland. This study therefore points out, and is able to quantify, plausible secondary convergence in the syntax of languages of different families, providing evidence that such interference effects were accompanied, and possibly caused, by equally measurable demographic exchanges.
(2) We proposed a new Approximate Bayesian Computation (ABC) framework in which genomic and linguistic data are considered simultaneously in the analysis of demographic models, allowing inference about both biological and cultural evolution. We first assessed the power of the linguistic framework, to understand to what extent the proposed method can correctly identify the demographic history. After this validation, we applied the new ABC approach to the study of the Bantu expansion. The expansion of Bantu-speaking populations, starting 5,000 years before present (yBP), has been associated with the transition from hunter-gatherer societies to food-producing ones, resulting in population growth and dispersal. However, the dynamics of this process are far from established. Two main hypotheses have been proposed: an Early split and a Late split of Bantu populations. We applied our new ABC framework to compare the two models using, for the first time, whole-genome data together with linguistic data. Our results for the linguistic data seem to support the Late split hypothesis, although further analyses are needed to reach a solid conclusion. For the genetic data, on the contrary, our results were not satisfactory. One problem is that DNA carries the consequences of the accumulation of diversity over long periods of time, whereas languages do not preserve old signals well, so that all models failed to reproduce the observed genetic patterns. Consequently, we need to design new demographic models that take the ancestral genetic diversity into account. Ultimately, with this new ABC framework, we expect to reveal details of the past history of Bantu populations with unprecedented resolution.
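
As a rough illustration of how ABC can compare two demographic models, the sketch below runs a rejection-sampling model choice on a single toy summary statistic: the fraction of binary characters (for example, syntactic parameters) that differ between two populations. It is not the pipeline used in the thesis. The model names are borrowed from the abstract, while the split-time priors, the change rate, the number of characters and the function names are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter

# Hypothetical priors on the split time (arbitrary time units), one per toy model.
PRIORS = {"early_split": (0.6, 1.0), "late_split": (0.1, 0.5)}


def simulate_distance(split_time, rng, n_chars=200, rate=0.25):
    """Simulate the fraction of binary characters that differ between two
    populations separated `split_time` ago (assumed toy model)."""
    # Symmetric two-state Markov model along two lineages of length split_time:
    # P(differ) = 0.5 * (1 - exp(-2 * rate * 2 * split_time)).
    p_diff = 0.5 * (1.0 - math.exp(-4.0 * rate * split_time))
    n_diff = sum(rng.random() < p_diff for _ in range(n_chars))
    return n_diff / n_chars


def abc_model_choice(observed, n_sims=5000, keep=0.02, seed=0):
    """Rejection ABC: approximate posterior model probabilities as the share of
    each model among the simulations closest to the observed statistic."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        model = rng.choice(list(PRIORS))        # uniform prior over models
        split_time = rng.uniform(*PRIORS[model])  # draw parameter from its prior
        simulated = simulate_distance(split_time, rng)
        sims.append((abs(simulated - observed), model))
    sims.sort(key=lambda pair: pair[0])         # closest simulations first
    accepted = sims[: max(1, int(n_sims * keep))]
    counts = Counter(model for _, model in accepted)
    return {model: counts[model] / len(accepted) for model in PRIORS}


if __name__ == "__main__":
    # An observed distance of 0.10 is an invented value, used only to run the sketch.
    print(abc_model_choice(observed=0.10))
```

In a real analysis the single statistic would be replaced by a vector of genomic and linguistic summary statistics, and the power of the procedure would be checked first on pseudo-observed data simulated under each model, as the abstract describes for the linguistic framework.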

Demographic events in human history leave traces in languages and genes, in line with Darwin's intuition that the best possible description of the relationships among languages would be the phylogenetic tree of the populations that speak them. Sokal and Cavalli-Sforza (1990) tested Darwin's hypothesis empirically, concluding that linguistic and genetic distances are strongly correlated. However, many questions remained open, owing to the lack of adequate genetic data and to the limits of linguistic comparisons based on the lexicon. The lexicon allows comparisons between closely related languages but is useless for large-scale comparisons between languages of different families. Recent methodological developments have made analyses at the level of syntax possible, and hence comparisons across language families. In addition, the advent of Next-Generation Sequencing technologies has radically increased the amount of genetic information available on the world's populations. In this thesis we exploited cutting-edge methods for comparing linguistic and genomic data, in two ways: (1) We combined linguistic and genomic data to explain the origin and the dynamics of the spread across Eurasia of the Indo-European (IE), Uralic (UR) and Altaic (AL) language families. The languages of these families form three well-distinct groups, but the UR linguistic outliers are very similar to their geographical neighbours: western UR-speaking populations appear genetically more similar to their IE-speaking neighbours, and the Khanty show similarities with populations speaking AL languages. Finally, comparisons of ancient and modern DNA suggest a south-to-north spread of UR languages in present-day Finland. This study is therefore able to quantify a secondary convergence in the syntax of languages of different families, providing evidence that such interference effects were accompanied, and perhaps caused, by equally measurable demographic exchanges.
(2) We proposed a new Approximate Bayesian Computation (ABC) framework in which we tested several demographic models in the light of genomic and linguistic data, with the aim of reconstructing aspects of biological and cultural evolution. We first evaluated the power of the linguistic analysis, to understand whether the proposed method can actually identify the underlying demographic history correctly. After validation, we applied this framework to the study of the expansion of Bantu-speaking populations. This expansion, which began 5,000 years ago, has been associated with the transition from hunter-gatherer to farming societies, with consequent population growth and dispersal. However, the dynamics of this expansion are a matter of debate. Two hypotheses have been proposed: an Early split and a Late split of Bantu farmers. We applied the new ABC framework to compare the models using, for the first time, whole-genome data and linguistic data. The results for the linguistic data seem to support the Late split hypothesis; further investigation will, however, be needed before a clear position can be taken. For the genetic data, by contrast, our results were not entirely satisfactory. One problem is that DNA retains genetic diversity generated over long periods of time, whereas languages do not preserve signals of older events, which probably prevents the models studied from faithfully reproducing the observed genetic variability. Consequently, we need to design new models that take ancestral genetic diversity into account. With this new ABC framework, we expect to reveal details of the past history of the Bantu population with unprecedented resolution.

Inference of human migration from genomic and linguistic data

SILVA SANTOS, PATRÍCIA ALEXANDRA
2020

BARBUJANI, Guido
GHIROTTO, Silvia
GONZALEZ FORTES, Gloria Maria
BARBUJANI, Guido
Files in this item:
File: SilvaSantosPatriciaAlexandra_PhD_Thesis.pdf
Access: open access
Description: SilvaSantosPatriciaAlexandra_PhD_Thesis
Type: Doctoral thesis
Size: 7.45 MB
Format: Adobe PDF

Documents in SFERA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2478788
Note: the data displayed have not been validated by the university.
