Cooperative Research in Information Technology CRIT - 2
 
A. 0   PROPOSAL INFORMATION

  KBN project (KBN grant number) on which the proposal to CRIT2 is based: 8 T11C 011 10

PROJECT TITLE: (English): Application of language engineering methods to automatic analysis and generation of Polish texts

PROJECT TITLE: (Polish): Zastosowanie metod inzynierii lingwistycznej do automatycznej analizy i syntezy tekstów jezyka polskiego

Commencement date: 1.01.1996, Duration: 3 years

Title of CRIT2 research proposal: An HPSG treebank for Polish
 

A. 1 ADMINISTRATIVE INFORMATION FOR PROPOSER

              Institute of Computer Science, Polish Academy of Sciences (IPI PAN)

Address: Ordona 21, 01-237 Warsaw, POLAND

Telephone number : +48 22 36 28 41

Telefax : +48 22 37 65 64

e-mail : bolc@ipipan.waw.pl, agn@ipipan.waw.pl

 
              Name and title of the scientific coordinator responsible for the proposed research: Surname : Bolc

First Name : Leonard Title : prof.

Office address (the same as above)

 
Administrative coordinator responsible for the signature of a contract within the body carrying out the research:

Name and title : mgr Boguslaw Martyniak

Office address the same as above, e-mail : martyn@ipipan.waw.pl

 

B. SCIENTIFIC INFORMATION ABOUT THE PROPOSER

              Proposer's organisation

   The Institute of Computer Science of the Polish Academy of Sciences (IPI PAN) was founded in 1977 on the basis of the Computation Centre of the Polish Academy of Sciences. IPI PAN is a leading national centre of research in computer science. Its main areas of interest are software engineering and its mathematical foundations, concurrency theory, computational statistics, foundations of artificial intelligence, natural language processing and computer graphics. IPI PAN has a research staff of about 60. The Institute has organised a number of international conferences and publishes the international journal Machine Graphics and Vision as well as scientific books and dissertations.

Curriculum vitae of lead researchers

Prof. Leonard Bolc.

Birth: June 18th, 1934.

Degrees: 1993 nominated professor from Institute of Computer Science, Warsaw

1969 D.Sc., University of Poznan.

1964 Ph.D. in applied linguistics, University of Poznan.

1958 M.Sc., Faculty of Philology, University of Poznan (Poland).

Work:

1988-present: Institute of Computer Science, Polish Academy of Sciences

1958-1988: University of Warsaw, Institute of Informatics

Research projects:

1996-1998 the project Application of language engineering methods to automatic analysis and generation of Polish texts.

1992-1994 the project Teoretyczne i metodologiczne podstawy budowy inteligentnych systemow z dostepem w jezyku naturalnym (English title: Theoretical and methodological foundations of intelligent systems with natural language interface), financed by KBN (State Committee for Scientific Research).

Scholarship and contracts:
Membership:

Awards: Ministry of National Education and the Rector of the University of Warsaw

 

MSc. Adam Przepiórkowski

Birth: 24 June, 1968.

Education

February 1996 - present: Ph.D. student at the Seminar für Sprachwissenschaft, University of Tuebingen.

1995: graduated with honours from the Faculty of Mathematics, Informatics and Mechanics, University of Warsaw.

October 1993 - May 1994: non-graduating MSc student at the Centre for Cognitive Science, University of Edinburgh, within the framework of the TEMPUS Joint European Project.

October 1991 - June 1992: guest student at Heriot-Watt University, Edinburgh, within the framework of an individual TEMPUS programme.

 

Research projects:

1996-1998: participation in the project Application of language engineering methods to automatic analysis and generation of Polish texts, financed by KBN (State Committee for Scientific Research).

April 1995 - January 1996: participation in the Verbmobil Machine Translation Project, Semantic Construction subproject, at the University of Stuttgart.

June 1994 - present: collaboration with Prof. L. Bolc's Man-Machine Communication Group at the Institute of Computer Science, Polish Academy of Sciences, Warsaw.

Various academic projects at the Centre for Cognitive Science, Edinburgh (1993-94), Heriot-Watt University (1991-92) and the University of Warsaw (1989-91).

 

Publications:

Bolc, L. & Mykowiecka, A. (1992) Podstawy przetwarzania jezyka naturalnego. Wybrane metody formalnego zapisu skladni. Akademicka Oficyna Wydawnicza RM, Warsaw. (English title: Foundations of natural language processing. Formal methods of syntax description)

Bolc, L., Mykowiecka, A., Marciniak, M., Kupsc, A., Przepiórkowski, A. & Czuba, K. (1996) Wykorzystanie gramatyki HPSG do opisu jezyka polskiego. In Z. Vetulani, W. Abramowicz & G. Vetulani (eds.) Jezyk i technologia 1995, Akademicka Oficyna Wydawnicza PLJ, Warsaw. (English title: Formal description of Polish within the framework of HPSG)

Bolc, L., Czuba, K., Kupsc, A., Marciniak, M., Mykowiecka, A. & Przepiórkowski, A. (1996) A survey of systems for implementing HPSG grammars, IPI PAN Report 814, Warsaw.

Czuba, K. & Przepiórkowski, A. (1995) Agreement and case assignment in Polish: An attempt at a unified account, IPI PAN Report 783, Warsaw.

Kupsc, A., Marciniak, M. & Bolc, L. (1997) Anaphor binding in Polish: An attempt at an HPSG analysis, IPI PAN Report, Warsaw.

Przepiórkowski, A. (1994) Critical review of approaches to multiple wh-movement. Research paper EUCCS/RP-62, Centre for Cognitive Science, University of Edinburgh.

Przepiórkowski, A. (1996) Case assignment in Polish: Towards an HPSG analysis. In C. Grover & E. Vallduví (eds.) Edinburgh Working Papers in Cognitive Science, Vol. 12: Studies in HPSG, pp. 191-228, Centre for Cognitive Science, University of Edinburgh.

Przepiórkowski, A. & Kupsc, A. (1997) Negative concord in Polish, IPI PAN Report 828, Warsaw.

Przepiórkowski, A. & Swidzinski, M. (1997) Polish verbal negation revisited: A metamorphosis vs. HPSG account, IPI PAN Report 829, Warsaw.

Przepiórkowski, A. & Kupsc, A. (1997) Verbal negation and complex predicate formation in Polish. In Proceedings of the 1997 Conference of the Texas Linguistic Society on the Syntax and Semantics of Predication, Austin, to appear.
 

 D. SUMMARY DESCRIPTION OF RESEARCH PROPOSAL (max. ONE page)
Title of research proposal: An HPSG treebank for Polish

D.1 Summary of the ongoing research project on which the proposal is based

The main task of computational linguistics is to produce formal, computer-tractable descriptions of natural language. In Poland this field is less developed than in many other countries (USA, EU), as a comparison of work on implementing large-scale grammars of various languages shows: for Polish, such efforts have been made only in the metamorphosis grammar framework, whereas for numerous European languages (English, German, French, etc.) several competing grammars have been implemented. The most important objective of the current research project is to create a relatively large HPSG-based grammar of Polish. This task requires analysing Polish constructions, formulating appropriate generalisations and, finally, expressing the relevant rules in the HPSG formalism. Careful modification and elaboration of the current version of the theory will be indispensable to cover linguistic phenomena specific to Polish.

The resulting grammar will provide a good basis for various applications, including commercial ones, e.g. grammar checkers and machine-aided translation systems. It can also be used as a testing tool for further research.

D.2 Summary of the proposed CRIT2 extension

The objective of our proposal is to create a treebank of syntactic structures of Polish, using HPSG (Head-driven Phrase Structure Grammar) to encode the parse trees. The formal HPSG grammar of Polish developed in our ongoing KBN project will be used for this purpose. Such an HPSG-encoded treebank will provide sound linguistic grounds for evaluating and improving the KBN grammar and its implementation. The treebank can also be used to evaluate other grammars and to write more effective parsers, e.g. ones that capture free word order phenomena or incorporate probabilistic data.

The HPSG framework we have chosen is currently one of the leading linguistic formalisms, used in both theoretical and application-oriented research programmes all over the world.

Recently, interest in modern language technologies has also extended to Slavic languages (Czech and Bulgarian so far), and HPSG-based grammars have been used in LaTeSlav (Language Processing Technologies for Slavic Languages), a European Union joint research project. Using a uniform linguistic platform for diverse languages simplifies potential integration with grammars of other languages. Building a treebank of linguistic constructions is another major step towards future practical developments. Although in the KBN project we concentrate mostly on the syntactic description of Polish, both semantics and morphology will be taken into account in our grammar.

The work in this project will be divided into two tasks: preparing the test data from a corpus of Polish texts, and manually annotating the selected texts to produce a linguistically motivated set of syntactic parses. This will also be the first test of the adequacy and coverage of our HPSG grammar. Once the treebank is prepared, it will be used to improve the implementation of the grammar. The organisation and management of the proposal are closely tied to our ongoing project. Work on the proposal can start no sooner than the end of the second phase of the KBN project.

 

E. DETAILED DESCRIPTION OF THE RESEARCH PROPOSAL

E.1. Description of the ongoing project the proposal is based on

E.1.1. Present state of knowledge in the proposed research field

In the last fifteen years many methods of describing natural language utterances have been developed, but few of them are widely used. In computational linguistics only a very small number of formalisms have been used to formulate real grammars. One of them is undoubtedly Head-driven Phrase Structure Grammar (HPSG), which has been used both as a basis for theoretical research and as a tool for building a number of applications. The success of HPSG rests on the fact that it allows a relatively easy description of various linguistic phenomena and that these descriptions are relatively easy to implement. The main reasons for this popularity are the strong logical background of the formalism, its generative power and its declarativity, which enables relatively rapid development of large grammars.

HPSG grammars have been developed for a number of languages. Most of the research has concerned English (mainly in the USA, but also in many other countries), German (Saarbruecken, Tuebingen, Stuttgart, Berlin, etc., and also at places such as Stanford and Ohio) and French (Paris, Lille, Geneva, Stanford). There is also systematic work on Italian, Spanish, Japanese, Korean, Turkish and other languages. Slavic languages have also been treated within HPSG (e.g. the EU-funded LaTeSlav project on Bulgarian and Czech), but very little work has been done so far on Polish.

HPSG is by no means a complete theory; it continues to develop, driven by the specific properties of each new linguistic phenomenon described within the framework. Any work undertaken in this area therefore needs continuous collaboration with other researchers involved in similar tasks, and such contacts can be of real value to all sides.

  E.1.2. Objectives of the proposed research

The most important objective of the current research project is to create a relatively large HPSG-based grammar of Polish. To achieve this goal it is necessary to analyse Polish constructions, to formulate appropriate generalisations and, finally, to express the relevant rules in the HPSG formalism. It is already clear that this task requires careful modification and elaboration of the current version of the theory to make it capable of covering linguistic phenomena specific to Polish.

 
E.1.3. Significance of the proposed research and its expected achievements

The most significant result of the research will be the HPSG grammar of Polish. It will be the first computational grammar of Polish to combine lexical, syntactic and semantic information, and only the second of that scale (after the one written in the metamorphosis grammar style). The grammar will provide a good basis for further applications, including commercial ones, e.g. grammar checkers and machine-aided translation systems. Suggested extensions to the HPSG formalism (e.g. concerning word order) will be our contribution to the research conducted in this field in other countries (EU, USA).

E.1.4. Technical description

The workplan consists of three tasks, which are being carried out in successive years.

Task 1: Analysis of the features of the HPSG grammar and its appropriateness for the description of Polish texts.

This task comprises familiarisation with the HPSG theory, to be achieved by describing some exemplary issues of Polish grammar: case assignment, verbal negation and anaphor binding. The second element of this task is to explore the existing grammar development systems and to carefully evaluate those which could be used to implement HPSG-style grammars. This should lead to the selection of two (possibly identical) computational environments: one for testing linguistic hypotheses and one for the implementation of the final grammar.

Task 2: Modification and extension of the HPSG theory for the needs of the description of Polish.

The aim of the second task is to modify and extend the HPSG theory to cover linguistic phenomena specific to Polish. The main problem to be worked out is the treatment of the relatively free word order typical of Polish sentences. The other main problem, so far treated rather marginally, is coordination, which is crucial to the expressiveness of natural languages.

Task 3: Applying the HPSG grammar to the analysis of selected syntactic and semantic phenomena of Polish constructions (with implementation).

The third task is the implementation of the first unification-based grammar of Polish. At this stage all previous experience will be gathered in one joint implementation of a relatively efficient parser of Polish sentences. The main goal is to create a product which can be both used in practice and developed further.
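Task 3 describes the grammar as unification-based. As a minimal sketch of what that operation involves, the Python fragment below unifies two feature structures represented as nested dictionaries. It is an illustration only, not the project's implementation: it omits typed feature structures and structure sharing (reentrancy), both of which HPSG relies on, and the feature names HEAD, POS, CASE, AGR, NUM and GEND are assumptions chosen for the example.

```python
from typing import Optional, Union

# A feature structure is either an atomic value or a nested
# attribute-value mapping (hypothetical, untyped representation).
FS = Union[str, dict]


def unify(a: FS, b: FS) -> Optional[FS]:
    """Return the unification of two feature structures, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # incompatible values under the same feature
                result[key] = merged
            else:
                result[key] = b_val  # information present only in b is added
        return result
    # Atomic values (or an atom against a mapping) unify only if equal.
    return a if a == b else None


# Hypothetical lexical entry for an accusative feminine singular noun.
noun = {
    "HEAD": {"POS": "noun", "CASE": "acc"},
    "AGR": {"NUM": "sg", "GEND": "fem"},
}

# Hypothetical requirements a verb might place on its object.
wants_accusative = {"HEAD": {"POS": "noun", "CASE": "acc"}}
wants_genitive = {"HEAD": {"POS": "noun", "CASE": "gen"}}

compatible = unify(noun, wants_accusative)
clash = unify(noun, wants_genitive)
```

Here `compatible` carries all the information of both structures, while `clash` is `None` because the two CASE values conflict.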

E.1.5. Organisation and Management

The project is led by Prof. Bolc and is carried out by the members of the Man-Machine Communication Group, IPI PAN, in collaboration with two Ph.D. students (University of Tuebingen and Carnegie Mellon University) and a professor of formal linguistics (University of Warsaw). Partial results are published as IPI PAN reports and presented at national and international conferences. The first two years of the project are devoted to solving separate problems connected with representing Polish sentences in the HPSG formalism. The last phase of the project will be devoted to combining all the results in a joint implementation.

 

E.2. Description of the proposed CRIT2 extension

  E.2.1. Objectives of the proposed research extension
 

The objective of our proposal is to create a treebank of syntactic structures of Polish, using HPSG (Head-driven Phrase Structure Grammar) to encode the parse trees. Treebank corpora are large grammatical databases which aim to provide an accurate, linguistically adequate basis for computational applications. The syntactic analysis of the test items (sentences in a selected text corpus) is derived manually according to the principles of some (formal) linguistic grammar. The formal HPSG grammar of Polish developed in our ongoing KBN project will be used for this purpose. Such an HPSG-encoded treebank will provide sound linguistic grounds for evaluating and improving the KBN grammar and its implementation. Apart from this straightforward application, the treebank can be used to evaluate other grammars and to write more effective parsers, e.g. ones that capture free word order phenomena or incorporate probabilistic data.
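The use of a treebank to evaluate a grammar implementation can be sketched as follows. The proposal does not specify evaluation metrics, so the two measures below are assumptions: coverage (the share of treebank sentences for which the parser returns any analysis) and accuracy (the share for which at least one returned analysis matches a manually derived one). Analyses are represented as plain strings, and all names are hypothetical.

```python
def evaluate(gold: dict[str, set[str]],
             parsed: dict[str, set[str]]) -> dict[str, float]:
    """Compare parser output with manually annotated treebank analyses.

    gold   maps a sentence id to the set of analyses accepted by annotators.
    parsed maps a sentence id to the set of analyses produced by the parser.
    """
    total = len(gold)
    if total == 0:
        raise ValueError("the treebank is empty")
    # A sentence is covered if the parser produced at least one analysis.
    covered = sum(1 for sid in gold if parsed.get(sid))
    # A sentence is correct if some parser analysis is also a gold analysis.
    correct = sum(
        1 for sid, analyses in gold.items()
        if analyses & parsed.get(sid, set())
    )
    return {"coverage": covered / total, "accuracy": correct / total}


# Toy data: placeholder strings stand in for full HPSG analyses.
gold = {"s1": {"A"}, "s2": {"B"}, "s3": {"C"}, "s4": {"D", "E"}}
parsed = {"s1": {"A", "X"}, "s2": {"Y"}, "s3": set(), "s4": {"E"}}

scores = evaluate(gold, parsed)
```

In this toy run three of the four sentences receive some analysis and two receive a correct one. Sentences such as `s2` (covered but wrong) and `s3` (no parse) are the cases that would point to gaps in the grammar or its implementation.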
 
E.2.2. Significance of the proposed research extension and its expected achievements
  The HPSG framework we have chosen is currently one of the leading linguistic formalisms, used in both theoretical and application-oriented research programmes all over the world, e.g.:

1) Verbmobil: a large automatic speech-to-speech translation project with partners in Germany, the USA and Japan;

2) Head-driven Phrase Structure Grammar (theoretical foundations of the formalism based on cross-linguistic generalisations) and English Resource Grammar (computational grammar of English) projects at CSLI, Stanford, USA;

3) Phrase Structure Grammar for French (HPSG grammar of French), sponsored by a CNRS-NSF grant.

Recently, interest in modern language technologies has also extended to Slavic languages (Czech and Bulgarian so far), and HPSG-based grammars have been used in LaTeSlav (Language Processing Technologies for Slavic Languages), a European Union joint research project. The contribution of Polish is thus a natural extension of this stream of research. Using a uniform linguistic platform for diverse languages simplifies potential integration with grammars of other languages. Building a treebank of linguistic constructions is another major step towards future practical developments. Although in the KBN project we concentrate mostly on the syntactic description of Polish, both semantics and morphology will be taken into account. The treebank we propose to create will therefore contain these components, which will be particularly useful for advanced multi-language applications such as machine translation. In Verbmobil, for example, transfer, i.e., the core engine of machine translation, is based on semantics.
 
E.2.3. Technical description

The work in this project will be divided into two tasks:

First, we will prepare the test data, i.e., the set of constructions (sentences) representing various linguistic phenomena. We will make a selection from a corpus of Polish texts, trying to cover as many different constructions as possible. The aim of this task is to obtain a representative sample of the language.

The second task is the core of our proposal. We will manually annotate the selected texts, i.e., using only our formal linguistic description we will prepare a linguistically motivated set of syntactic parses. This will also be the first test of the adequacy and coverage of our HPSG grammar. All ambiguities and idiomatic expressions will also have to be marked, so that possible alternatives can be anticipated during automatic parsing. The treebank will serve to verify and improve the grammar devised in the KBN project.
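The annotation described above, in which each sentence receives one or more manually derived parses and ambiguities and idiomatic expressions are marked, could be recorded along the following lines. This is a sketch under stated assumptions: the record layout, the field names and the simplified bracketings are hypothetical and do not represent the project's HPSG analyses or annotation format.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """One manually derived parse, stored here as a bracketed string."""
    bracketing: str
    preferred: bool = False


@dataclass
class TreebankEntry:
    """A single annotated sentence (hypothetical record layout)."""
    sentence_id: str
    text: str
    analyses: list[Analysis] = field(default_factory=list)
    # Token spans (start, end), end exclusive, marked as idiomatic.
    idiom_spans: list[tuple[int, int]] = field(default_factory=list)

    @property
    def is_ambiguous(self) -> bool:
        # More than one recorded parse marks the sentence as ambiguous.
        return len(self.analyses) > 1

    def idioms(self) -> list[str]:
        # Recover the marked idiomatic expressions from the token spans.
        tokens = self.text.split()
        return [" ".join(tokens[start:end]) for start, end in self.idiom_spans]


# Illustrative entry: the sentence contains the Polish idiom
# "miec muchy w nosie"; both bracketings are simplified placeholders.
entry = TreebankEntry(
    sentence_id="demo-001",
    text="Jan ma muchy w nosie",
    analyses=[
        Analysis("[S [NP Jan] [VP ma [NP muchy] [PP w nosie]]]", preferred=True),
        Analysis("[S [NP Jan] [VP ma [NP muchy [PP w nosie]]]]"),
    ],
    idiom_spans=[(1, 5)],
)
```

Keeping every alternative parse in the record, rather than only a preferred one, is what allows the treebank to anticipate the alternatives an automatic parser may produce.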

 

E.2.4. Organisation and Management

The organisation and management of the proposal are closely tied to our ongoing project. Since the preparation of the treebank requires a formal grammar that is as fully developed as possible, work on the proposal can start no sooner than the end of the second phase of the KBN project. Once the treebank is prepared, it will be used to improve the implementation of the grammar, i.e., starting from the middle of the third phase of the KBN project.