Designing spoken corpora for cross-linguistic research

CorTypo is a pilot research project in linguistic typology, funded by the French Agence Nationale de la Recherche (ANR). It aims to test hypotheses about the similarities and differences between the languages of the world.

The project is coordinated by Amina Mettouchi at UMR 8135 of the CNRS (LLACAN). Its team comprises four computer scientists and twelve international researchers, including Zygmunt Frajzyngier, author of the theoretical approach the project implements.

The project started in March 2013 and lasted 48 months. It received €230,000 of ANR funding, for an overall cost of about €2,730,400.

Over the last twenty years, typology, the branch of linguistics that compares the languages of the world, has progressed considerably thanks to a growing number of descriptions of indigenous languages from various regions of the globe. However, the tendency has often been to compare languages using top-down linguistic categories. The CorTypo project differs from such approaches not only because it offers an online searchable comparative database, but also because that database is structured according to the unique system embodied by each language and supports a bottom-up, inductive, and empirical comparison.

The specificities of linguistic categories are thus preserved: categories are never exactly the same from one language to another, because they never belong to the same system of oppositions. It is these language-internal systems that CorTypo compares, by analyzing how constructions contrast with each other within each domain. This perspective makes it possible to renew typological approaches, and to consider in a radically different way the similarities and differences that characterize the languages of the world.

In order to test this inductive perspective, two domains present in all the languages of the project were explored: Predication and Reference. The corresponding constructions were first identified by specialists within annotated corpora in each language. Their functions were analyzed, and the organization of the system within each domain was discovered and then integrated into the database in a structured way.
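As a rough illustration of what "integrated into the database in a structured way" might look like, the following Python sketch models one language-internal system for one domain. It is not the project's actual data model: the class names, field names, and the placeholder values ("Language A", "C1", "function-1", and so on) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass
class Construction:
    """One construction identified by a specialist (hypothetical fields)."""
    name: str
    function: str                 # function analyzed within the domain
    encoding: str                 # formal means that mark the construction
    corpus_refs: List[str] = field(default_factory=list)  # annotated examples


@dataclass
class DomainSystem:
    """The constructions of one language within one functional domain."""
    language: str
    domain: str                   # e.g. "Predication" or "Reference"
    constructions: List[Construction] = field(default_factory=list)

    def contrasts(self) -> List[Tuple[str, str]]:
        """Every pair of constructions that can be contrasted in this domain."""
        return [(a.name, b.name) for a, b in combinations(self.constructions, 2)]

    def by_function(self) -> Dict[str, List[str]]:
        """Group construction names by the function they serve."""
        grouped: Dict[str, List[str]] = {}
        for c in self.constructions:
            grouped.setdefault(c.function, []).append(c.name)
        return grouped


# Placeholder data only: no real language, construction, or corpus is described.
system = DomainSystem(
    language="Language A",
    domain="Predication",
    constructions=[
        Construction("C1", "function-1", "encoding-1", ["A_corpus_01"]),
        Construction("C2", "function-2", "encoding-2"),
        Construction("C3", "function-1", "encoding-3"),
    ],
)
```

Keeping each system scoped to a single language and domain reflects the bottom-up idea described above: comparison happens between systems, not between categories assumed in advance to be shared.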

The results of any online search, which can be conducted by language, by domain, or by function, include a precise description of the construction, its function, and its encoding. Moreover, the query automatically generates a list of contextualized examples from annotated spoken corpora. Thus, any analysis can be checked and falsified.

The end-user can compare constructions and functions thought to be similar, and carry out their typological analysis with all the necessary information.

The challenges posed by the implementation and querying of the database were met by using a MySQL relational database management system and PHP programming, with modules in JavaScript.
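The search by language, domain, or function described earlier can be sketched with a small relational layout. The example below is only an assumption-laden illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL and PHP, and the table names, column names, and placeholder rows are invented here, not taken from the CorTypo schema.

```python
import sqlite3

# Hypothetical two-table layout: constructions and their corpus examples.
SCHEMA = """
CREATE TABLE construction (
    id          INTEGER PRIMARY KEY,
    language    TEXT NOT NULL,
    domain      TEXT NOT NULL,
    function    TEXT NOT NULL,
    description TEXT NOT NULL,
    encoding    TEXT NOT NULL
);
CREATE TABLE example (
    id              INTEGER PRIMARY KEY,
    construction_id INTEGER NOT NULL REFERENCES construction(id),
    corpus_ref      TEXT NOT NULL,
    utterance       TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO construction VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "Language A", "Predication", "function-1", "description-1", "encoding-1"),
            (2, "Language A", "Reference", "function-2", "description-2", "encoding-2"),
            (3, "Language B", "Predication", "function-1", "description-3", "encoding-3"),
        ],
    )
    conn.executemany(
        "INSERT INTO example VALUES (?, ?, ?, ?)",
        [
            (1, 1, "A_corpus_01", "placeholder utterance 1"),
            (2, 1, "A_corpus_02", "placeholder utterance 2"),
            (3, 3, "B_corpus_01", "placeholder utterance 3"),
        ],
    )
    return conn


def search(conn, language=None, domain=None, function=None):
    """Return constructions matching the given filters, with their examples."""
    clauses, params = [], []
    # Column names are fixed here; only the values come from the caller.
    for column, value in (("language", language), ("domain", domain), ("function", function)):
        if value is not None:
            clauses.append(f"c.{column} = ?")
            params.append(value)
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = f"""
        SELECT c.language, c.domain, c.function, c.description, c.encoding,
               e.corpus_ref, e.utterance
        FROM construction AS c
        LEFT JOIN example AS e ON e.construction_id = c.id
        {where}
        ORDER BY c.id, e.id
    """
    return conn.execute(sql, params).fetchall()


conn = build_demo()
predication_rows = search(conn, domain="Predication")
```

The LEFT JOIN keeps constructions that have no example yet, so a search still returns their description, function, and encoding. Each returned row pairs an analysis with a corpus reference, which is what allows a reader to check the analysis against the data.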

Currently, CorTypo compares twelve languages belonging to ten families and five super-families, across two functional domains. With the necessary computing power, the system could eventually be extended to compare a much larger number of languages.