Keywords: full-text information retrieval, multilingual access, computational linguistics, lexical semantics, documentary systems, machine translation

Start Date: 01-NOV-90 / Duration: 36 months

Objectives and Approach

The aim of the EMIR project is to validate, by means of a prototype, a linguistic and statistical approach to the indexing of free text and the multilingual querying of textual databases. The final goal is to let users query, in their own language, text databases written in various languages. It will also be possible to query simultaneously, in a single language, databases containing texts in several different languages.

The system developed within the project must be domain independent. When a new domain is processed, little work is needed to adapt the dictionaries, and tools developed within the project help the user to perform this adaptation. More specifically, a semi-automatic method has been developed to extract compounds and their translations from texts that have already been translated.

In order to demonstrate the generality of the approach, experiments are carried out on three languages: English, French and German. Work is currently under way on the English-French and French-German pairs. The German parser has been developed within the framework of the project. This parser specifically handles the splitting of compounds, which is crucial for information retrieval systems.
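As an illustration of why compound splitting matters for retrieval, the following minimal sketch breaks a German compound into dictionary words so that each part can be indexed and matched separately. It is not the EMIR parser: the toy lexicon, the list of linking elements and the function name `split_compound` are all assumptions made for this example.

```python
from functools import lru_cache

# Hypothetical toy lexicon; a real system would rely on a full German dictionary.
LEXICON = {"daten", "bank", "abfrage", "kern", "kraft", "werk",
           "information", "system"}

# Assumed linking elements that may appear between two parts of a compound.
LINKERS = ("", "s", "es", "n")


def split_compound(word, lexicon=LEXICON, min_len=3):
    """Split `word` into lexicon entries; return [word] if no split is found."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(rest):
        # A string that is itself a lexicon entry needs no further splitting.
        if rest in lexicon:
            return (rest,)
        best = None
        # Try every head that is a lexicon entry, then solve the remainder.
        for i in range(min_len, len(rest) - min_len + 1):
            head = rest[:i]
            if head not in lexicon:
                continue
            for linker in LINKERS:
                if not rest[i:].startswith(linker):
                    continue
                tail = solve(rest[i + len(linker):])
                # Prefer the analysis with the fewest parts.
                if tail and (best is None or len(tail) + 1 < len(best)):
                    best = (head,) + tail
        return best

    parts = solve(word)
    return list(parts) if parts else [word]
```

With this toy lexicon, `split_compound("Kernkraftwerk")` yields the three parts `kern`, `kraft` and `werk`, so a query on any one of them could retrieve a document containing the compound.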

Progress and Results

A first prototype has been developed using word-for-word translation. It has been tested on various databases (Decisions of the Court of Justice of the European Communities, Patents from the European Patent Office, nuclear power plants and image transmission, literary and ethnographic databases, and a US test database on Aeronautics and Space). The prototype has been presented at several conferences.

Experiments with this prototype have led to the specifications of the final prototype, which will be validated from December 1993 to February 1994. This new prototype includes a full treatment of idioms, even if they are non-contiguous or derivative, of compounds, and of monolingual reformulation. After validation, a commercial product will be developed.

Apart from the development of a prototype, more fundamental results have been obtained from the experiments carried out in this project. In particular, a large text database in a given domain can be used as a source of semantic knowledge to help resolve word translation ambiguities in the same domain.
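One way to picture this result is the sketch below, which chooses among candidate word-for-word translations by counting how often the candidates occur together in target-language documents of the domain. The scoring rule, the bilingual dictionary, the corpus and all names are assumptions made for illustration; the source does not describe how EMIR itself performs the disambiguation.

```python
from itertools import product

# Hypothetical bilingual dictionary: French word -> candidate English words.
BILINGUAL = {
    "centrale": ["plant", "central", "exchange"],
    "nucléaire": ["nuclear"],
}

# Hypothetical target-language documents taken from the domain database.
CORPUS = [
    "the nuclear plant was shut down for inspection",
    "safety rules for each nuclear power plant",
    "the telephone exchange handles calls",
]


def cooccurrence(a, b, corpus):
    """Count the documents in which both words appear."""
    return sum(1 for doc in corpus
               if a in doc.split() and b in doc.split())


def translate_query(words, bilingual=BILINGUAL, corpus=CORPUS):
    """Pick the combination of translations best supported by the corpus."""
    # Words missing from the dictionary are passed through unchanged.
    candidates = [bilingual.get(w, [w]) for w in words]

    def score(combo):
        # Sum document co-occurrence counts over all pairs of chosen words.
        return sum(cooccurrence(a, b, corpus)
                   for i, a in enumerate(combo)
                   for b in combo[i + 1:])

    return list(max(product(*candidates), key=score))
```

Here the ambiguous word "centrale" is rendered as "plant" rather than "central" or "exchange", because only "plant" appears alongside "nuclear" in the sample documents. Exhaustive enumeration of combinations is only practical for short queries; it is used here for clarity.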

A study of the generalisation of EMIR to other European languages is under way.


At present, EMIR is being tested on multilingual databases from the European Patent Office, the Court of Justice of the European Communities and the Schweizerische Kreditanstalt (SKA). The French Atomic Energy Commissariat (CEA) will use the system to access bilingual full text databases of scientific papers written in French or in English.

The TECHNOLOGIES company, which replaced SYSTEX as partner, intends to introduce the results of EMIR into the next release of its SPIRIT commercial information retrieval system in the middle of 1994. TRANSMODUL intends to introduce some aspects of the German parser into commercial information retrieval products marketed in the German-speaking countries.


Mr Christian Fluhr
F - 91191 GIF/YVETTE
tel: + 33/ 1-6908-6311
fax: + 33/ 1-6908-7993
telex: 604641 F ENERG


CEA - F - C

EMIR - 5312, December 1993

