Context, position and objectives of the proposal

Although there are a number of individual corpus projects in various languages, including lesser-described ones, there have been very few attempts to make those corpora amenable to cross-linguistic investigation. This project aims to bridge that gap by testing innovative annotation schemes on languages belonging to various families, and by creating the scientific and technological conditions for testing linguistic hypotheses on natural language data.
In order to achieve this goal, a number of fundamental theoretical questions need to be resolved with respect to language form and language functions. The project addresses the question of what kind of theoretical apparatus is required for the comparison of languages with different formal means and different functions.

By implementing theoretical solutions into corpus-design and database-design, the project provides the basis for the empirical testing and falsification of hypotheses, and allows the elaboration of new hypotheses on language structure and cross-linguistic comparison. By proposing solutions to the problem of linguistic interoperability, it also, crucially, paves the way for large-scale typological work based on first-hand data.
This project belongs to fundamental research, and more precisely to general linguistics and typology.

Scientific, economic and social background

Typology is a fundamental domain of linguistics whose ultimate aim is to explain why languages are similar and why they are different. As such, it must deal with very diverse samples of languages, most of them poorly described and many of them endangered. Access to those languages is currently achieved mostly through the analysis of grammatical descriptions. However, corpora in lesser-described languages are increasingly needed, and thanks to improvements in the portability and ergonomics of recorders and of various software tools, they are starting to be compiled.

We are therefore now in a situation where typology has potential access to those corpora, and can develop more fine-grained analyses of the types of formal means and functional categories existing in languages.
However, the usefulness of existing corpora is very limited for typologists. Most of the currently available databases are limited to well-described languages (a few dozen, out of the 5000 languages currently spoken in the world). Typology, like general linguistics as a whole, cannot build sustainable hypotheses on the structure of language by working only on those well-known languages, because they represent only a small sample of the potential structures and systems that can be realized by language as a faculty of the mind. The rich variety of formal means, functional domains, and systemic organizations has to be taken into account. This can only be done if analyzable corpora in very diverse languages are made available and, more importantly, are made comparable.

Position of the Project, State of the Art

Indeed, while annotated spoken texts in lesser-described languages are slowly starting to be released, it is virtually impossible to conduct a comparative search across two or more languages/corpora: formats and annotation schemes are different. Although the issue of technical interoperability between software is being addressed, and converters are being designed under the auspices of CLARIN-ERIC among others, thus making it possible to technically compare corpora, the question of explicitness and systematicity in the linguistic annotations is only starting to be addressed, and is viewed as a challenge by corpus specialists. This gap between the advancement of computer technology, and the slow progress of linguistic convergence, is an obstacle that our project aims to remove.

In order to achieve that goal, the first question that has to be addressed is that of the principles of cross-linguistic comparison. There are essentially three choices in contemporary linguistic studies. The first is to compare linguistic categories that actually occur in particular languages. This type of comparison runs into the difficulty, well noted by scholars, of being confronted with categories that are not comparable, viz. categories coding different functions across languages. The second approach relies on form (Newmeyer 2007). The third, advocated by the St Petersburg tradition of linguistic typology and by Seiler 1995, justified theoretically by Lazard 2004, and later taken up by Haspelmath 2010 (see also Dixon 2010 and others), consists in selecting certain semantic notions and seeing if and how those notions are encoded across languages.

Form-based comparisons are relatively non-controversial. They often target phonological or morphological units (consonantal systems, inflectional vs derivational morphology, etc). Comparing other formal characteristics, such as word-order, head- or dependent-marking, or serial verb constructions, requires many presuppositions regarding the characteristics of the constructions involved. Thus, any study of the order of verb, subject, and object assumes the universal existence of the categories subject, verb, object, and of their universal characteristics. Similarly, the typology of head- or dependent-marking assumes universality of the notions head and dependent. Yet we know that it is not trivial to determine the universal properties of the category head (see papers in Corbett et al. 1993).

When it comes to the cross-linguistic study of functional-semantic categories, such as ‘passive’, ‘definite’, ‘causative’, ‘agent’, ‘patient’, ‘possession’, ‘location’, etc., the difficulties are even greater. Before one proceeds with the typology of semantic categories and functions, one has to define what constitutes the same function and what constitutes different functions within a language and across languages. One of the solutions proposed in current typological theories is the use of ‘comparative concepts’ (Haspelmath 2007), which are not the categories of individual languages but bear the same name, and correspond roughly to the common denominator that a category has across N languages.

Typological models other than those discussed above are available, and provide bottom-up rather than top-down categorization and analyses. One of them is the Systems Interactions framework developed in Frajzyngier & Shay (2003). This model relies on functional domains to describe and analyse languages, as well as to conduct cross-linguistic comparisons. It is the theoretical starting point in our formulation of the annotation system of CORTYPO.
One of the elements in this framework is the notion of functional domain (Frajzyngier and Mycielski 1998). To use non-mathematical language, a functional domain is composed of a system of forms that have at least one functional/semantic feature in common and that are in complementary distribution. A functional domain may be composed of several subdomains, and each subdomain must have one or more grammaticalized functions.
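
The definition above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each form is described by a set of functional/semantic features and by the set of contexts in which it occurs, and it approximates complementary distribution as pairwise disjoint context sets. The names `Form` and `is_candidate_domain` are hypothetical and are not part of CORTYPO or ELAN.

```python
from itertools import combinations
from typing import Dict, FrozenSet, NamedTuple


class Form(NamedTuple):
    """A coding means, described by hypothetical feature and context labels."""
    features: FrozenSet[str]   # functional/semantic features of the form
    contexts: FrozenSet[str]   # environments in which the form is attested


def is_candidate_domain(forms: Dict[str, Form]) -> bool:
    """Check the two conditions of the definition on a set of forms.

    Assumption: complementary distribution is approximated as
    "no two forms share an attested context".
    """
    if len(forms) < 2:
        return False  # a system needs at least two contrasting forms
    # Condition 1: at least one feature common to all forms
    shared = frozenset.intersection(*(f.features for f in forms.values()))
    if not shared:
        return False
    # Condition 2: pairwise disjoint contexts
    return all(
        a.contexts.isdisjoint(b.contexts)
        for a, b in combinations(forms.values(), 2)
    )
```

A real analysis would of course rest on distributional evidence from the corpus rather than on hand-listed contexts; the sketch only makes the two conditions of the definition explicit.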

Consider the domain of reference in a language. In studies that take into consideration only articles, reference has been confined to the distinction between definite and indefinite functions. Even for languages that have articles this distinction is not sufficient, as some nouns can occur without articles, and articles can co-occur with other determiners (Hebrew), while other languages may have several morphemes, some preceding and others following the noun at the same time (Hdi, Frajzyngier with Shay 2003).
One way of dealing with the complexity of coding means is to analyze the functions and interrelationships of all the coding means participating in the given semantic area, in order to discover the functional domains coded and the structure of their subdomains.
Languages are therefore compared on the basis of the structure of the domain under investigation: type and number of subdomains, specific predications and forms used to code the functions. The outcome of such a comparison for the domain of reference is a taxonomy of systems of reference, and a taxonomy of languages with respect to the system of reference.

This empirical approach to typology is particularly well adapted to the construction of corpora, and accompanying databases, that constitute rich and structured empirical resources for innovative cross-linguistic research, allowing for the discovery of new functions, and the analysis of the diversity, and sometimes unity, of languages.

The theoretical background described above is the scientific basis for the development of new functionalities of the software (ELAN) in which the corpus will be annotated, namely a more sophisticated query engine; and it allows for the development of a comparative database linked to the various languages composing the subcorpora of CORTYPO.

Objectives, originality and novelty of the project

The innovative nature of CORTYPO lies in the fact that it proposes to build resources linked with a query engine grounded in the empirical perspective that informs the Systems Interactions model, and interfaced with a database of functional domains.
The corpus is annotated on the basis of forms, which are considered as coding means for the functions that compose functional domains. The functional domains themselves are the modules composing the database.
So far, very few corpora have been linked to databases, and none for lesser-described languages. Moreover, no typological model has been implemented in software, even though statistical software such as R has been used for typological research based on corpora (Bickel & Stoll 2008, on DoBeS corpora).

The result of the 36 months of collaborative work described in the next section will be the implementation of a pilot set of resources and tools specially designed for typological investigations. This set will be composed of pilot corpora in several languages, annotated with a view to cross-linguistic investigations, and associated with a functional database which will form the interface between the formulation of typological queries and the retrieval of relevant data in several single-language corpora. The transparency of the categories under investigation is guaranteed by the interfacing of a Category table that will ultimately enrich the ISOcat registry. Full documentation of the resources and the software will also be provided.

The Corpus part of the set of resources and tools will be coded according to forms for each language. Thus in Kabyle, only bound pronouns are unequivocally marked for subject function. Those affixes will be coded with the label SBJ, followed by person, number and gender information. Nouns can be computed as subjects, but they are not coded as such. Their identification is based on the interaction of several coding means: a nominal subject is a noun (coded N) that belongs to the same prosodic unit (coded REF) as the verb (coded V), and either immediately precedes V and is marked by absolute state (coded ABS), or follows V and is marked by annexed state (coded ANN). Therefore, nouns will not be coded for grammatical role in Kabyle. Their grammatical role will be retrievable using ELAN query commands (regular expressions).
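
The retrieval rule for Kabyle nominal subjects can be sketched as follows. This is only an illustration in Python, not the ELAN query syntax: the `Token` structure and the function `nominal_subjects` are hypothetical, a prosodic unit is represented as a plain list of tokens, and "follows V" is read as "occurs anywhere after a verb in the same unit", which is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str        # placeholder surface form, not real Kabyle data
    pos: str         # part-of-speech code, e.g. "N" or "V"
    state: str = ""  # "ABS" (absolute), "ANN" (annexed), or "" if not applicable


def nominal_subjects(unit: List[Token]) -> List[int]:
    """Return the indices of nouns computed as subjects in one prosodic unit."""
    hits = []
    for i, tok in enumerate(unit):
        if tok.pos != "N":
            continue
        # Case 1: the noun immediately precedes V and is in the absolute state
        before_v = (
            tok.state == "ABS"
            and i + 1 < len(unit)
            and unit[i + 1].pos == "V"
        )
        # Case 2: the noun follows V (assumed: anywhere later in the unit)
        # and is in the annexed state
        after_v = tok.state == "ANN" and any(t.pos == "V" for t in unit[:i])
        if before_v or after_v:
            hits.append(i)
    return hits
```

For instance, for a unit made of a noun in the absolute state followed by a verb, the function returns the index of the noun; a noun in the absolute state that follows the verb is not returned. The subject role is thus derived from the interaction of the coded forms and is never stored in the annotation itself.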

This way of coding corpora does not presuppose derived functions, but provides the means to compute them. The methodology implies that only direct form-function mappings should be coded. Moreover, since the delimitation of functional domains is based on the co-occurrence or mutual exclusiveness of several types of linguistic forms, the annotation of the corpus must not be limited to morpheme-by-morpheme glossing, as is usually the case for lesser-described languages when the corpus has more levels of annotation than part-of-speech tagging. Non-local phenomena are also important formal means that have functional values. Linear orders and anaphoric chains are good examples. Taking into account the various formal coding means in the annotation schema, and in the development of the query engine, is therefore essential, and constitutes another innovative aspect of CORTYPO.

The Functional Comparative Database will be elaborated following linguistic principles too. Those are the principles informing, among other approaches, the Systems Interactions Model. The model in itself is innovative, as it does not presuppose universal or comparative categories, but relies primarily on the discovery of functions through analysis of complementary or exclusive distributions, within a structure grammaticalized in the language. The innovative aspects of the theory are expanded below.

For each language it is theoretically possible to discover an exhaustive list of the functional domains and subdomains actually grammaticalized. The comparison of the lists across languages will show which formal means occur in which languages; which functional domains are coded in which languages; what the internal structures of functional domains are; and finally which formal means are used to code which functional domains. The outcome of such a comparison will be information about how similar and how different languages are. The differences and similarities will be of two types. Languages will differ in the functional domains coded; even if the domains are the same their internal structure may be different; the subdomains may be coded by a variety of formal means. Every outcome of such a comparison is of considerable interest and is a potential object for further research. Possible questions include: why are certain domains coded more often than others? Why are the internal structures of domains similar or different? Why are certain formal means used more often to code some domains and less often for others (if this turns out to be the case)?
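
The comparison of per-language lists described above can be pictured with a small sketch. It assumes, for illustration only, that each language inventory is a mapping from functional domain to the set of formal means coding it; the type `Inventory`, the function `compare_inventories`, and any labels used with them are hypothetical placeholders, not results of the project.

```python
from typing import Dict, Set

# Hypothetical representation: functional domain -> formal means coding it
Inventory = Dict[str, Set[str]]


def compare_inventories(a: Inventory, b: Inventory) -> dict:
    """Compare the functional-domain inventories of two languages."""
    shared = set(a) & set(b)
    return {
        # Domains coded in both languages
        "shared_domains": shared,
        # Domains coded in only one of the two languages
        "only_in_a": set(a) - set(b),
        "only_in_b": set(b) - set(a),
        # For shared domains, the formal means used by both languages
        "shared_means": {d: a[d] & b[d] for d in shared},
    }
```

Comparing internal structures (type and number of subdomains) would require a richer representation than a flat set of formal means; the sketch covers only the first level of comparison.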

Such a typology will be of great importance for the construction of linguistic theory; for computational linguists working on information extraction; for cognitive linguists wishing to extend their investigations to lesser-described languages; and for philosophers working on meaning in natural language.
Given that each language potentially codes dozens of functional domains, and given the potential complexity of each functional domain, the present project is confined to a limited number of languages and, within these languages, to selected functional domains or areas that we know to exist in all of them, viz. tense-aspect-mood, the relationships between the predicate and noun phrases, and the system of reference.

In summary, the project addresses one of the most intractable issues in linguistic typology, viz. what should be the object of cross-linguistic comparison. The project goes well beyond theoretical argumentation, however. It proposes and tests a methodology for such research based on the empirical data represented by corpora in undescribed or lesser-described languages, and associated databases.

In that respect, it is an innovative endeavor, and one that allows scientifically grounded investigations. Indeed, the results will be verifiable, since the whole set of resources and tools, including its documentation, will be available to researchers, allowing them to conduct cross-linguistic queries and to check the relevance of the retrieved data against the various corpora. The set of resources and tools will be completely transparent and documented, so that replicability, falsifiability, and access to the sources and data are ensured.


