Designing spoken corpora for cross-linguistic research

Funded by the Agence Nationale de la Recherche (ANR) for 36 months (March 2013-March 2016, extended to March 2017). Prepared in 2011 and submitted in January 2012.

Principal Investigator: Amina Mettouchi
Directeur d’Etudes at EPHE (Ecole Pratique des Hautes Etudes), member of the CNRS laboratory LLACAN
Professional webpage (CV, publications):

Aim of the project

The aim of the CorTypo project is to develop an innovative system for the linguistic annotation of natural language corpora in lesser-described spoken languages, with a view to testing linguistic hypotheses on spontaneous discourse data from a typological perspective.

To achieve this goal, a number of fundamental theoretical questions about language form and language function need to be resolved. Crucially, the project addresses the question of what kind of theoretical apparatus is required to compare languages that display different formal means and different functions. The approach chosen within the project is the Systems Interactions framework developed by Zygmunt Frajzyngier.

By implementing theoretical solutions in corpus design and database design, the project provides a basis for the empirical testing and falsification of hypotheses, and allows new hypotheses on language structure and cross-linguistic comparison to be developed. By proposing solutions to the problem of linguistic interoperability, it paves the way for large-scale typological work based on first-hand natural language data.

Innovative nature of the project

1. An annotation of sound-indexed texts based on the formal means that exist in a given language, including prosodic means, linear order, and phonological and morphological marking. These formal means make it possible to determine the syntactic and functional units of the language.

2. The creation of a functional database linked to the corpus. The database contains complex information about the functions grammaticalized in each language and the forms that code those functions. It is linked to the corpus through a query engine, so that forms, and ultimately contextualized examples, can be retrieved.

The data set, composed of the corpus and the database, is complemented by a Category table that provides terminological information and definitions. This table ensures the transparency and replicability of analyses, and provides input for the ISOcat registry.
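
The relationship between the corpus, the functional database and the Category table can be illustrated with a minimal sketch in Python. It is not the project's actual query engine or data model: every name, field and label below (Unit, FUNCTIONS, CATEGORIES, FORM_A, FUNCTION_X and so on) is a hypothetical placeholder, and the sketch only assumes that each corpus unit carries labels for the formal means annotated on it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One annotated, sound-indexed corpus unit (hypothetical layout)."""
    unit_id: str
    start_ms: int            # offset of the unit in the sound file
    end_ms: int
    forms: tuple[str, ...]   # labels of the formal means annotated on the unit


# Placeholder corpus: the form labels are invented, not real annotations.
CORPUS = [
    Unit("u1", 0, 1200, ("FORM_A", "FORM_B")),
    Unit("u2", 1200, 2500, ("FORM_C",)),
    Unit("u3", 2500, 4000, ("FORM_B",)),
]

# Placeholder functional database: each function maps to the forms coding it.
FUNCTIONS = {
    "FUNCTION_X": {"FORM_A", "FORM_C"},
    "FUNCTION_Y": {"FORM_B"},
}

# Placeholder Category table: each label maps to its definition.
CATEGORIES = {
    "FORM_A": "definition of FORM_A (placeholder)",
    "FORM_B": "definition of FORM_B (placeholder)",
    "FORM_C": "definition of FORM_C (placeholder)",
}


def forms_for(function: str) -> list[str]:
    """Return the forms recorded as coding a function, in sorted order."""
    return sorted(FUNCTIONS.get(function, set()))


def examples_for(function: str) -> list[Unit]:
    """Return the corpus units that contain at least one coding form."""
    wanted = FUNCTIONS.get(function, set())
    return [unit for unit in CORPUS if wanted.intersection(unit.forms)]


if __name__ == "__main__":
    for form in forms_for("FUNCTION_X"):
        print(form, "=", CATEGORIES[form])
    for unit in examples_for("FUNCTION_X"):
        print(unit.unit_id, unit.start_ms, unit.end_ms, unit.forms)
```

In this sketch a query starts from a function, looks up the forms that code it, and then retrieves the corpus units in which those forms occur, together with their time offsets. The Category table is consulted only to display the definition attached to each label.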