KBN project (KBN grant number) on which the proposal to CRIT2 is based: 8 T11C 011 10
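PROJECT TITLE: (Polish): Zastosowanie metod inzynierii lingwistycznej do automatycznej analizy i syntezy tekstów jezyka polskiego (English: Application of linguistic engineering methods to the automatic analysis and synthesis of Polish texts)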
Commencement date: 1.01.1996, Duration: 3 years
Title of CRIT2 research proposal: An HPSG tree bank for Polish
Institute of Computer Science, Polish Academy of Sciences (IPI PAN)
Telephone number: +48 22 36 28 41
Telefax: +48 22 37 65 64, e-mail: email@example.com, firstname.lastname@example.org
First Name : Leonard Title : prof.
Office address (the same as above)
Administrative coordinator responsible for the signature of a contract within the body carrying out the research:
Name and title : mgr Boguslaw Martyniak
Office address the same as above, e-mail : email@example.com
The Institute of Computer Science, Polish Academy of Sciences (IPI PAN) was founded in 1977 on the basis of the Computation Centre of the Polish Academy of Sciences. IPI PAN is a leading national centre of research in computer science. Its most important areas of interest are software engineering and its mathematical foundations, concurrency theory, computational statistics, foundations of artificial intelligence, natural language processing and computer graphics. IPI PAN has a research staff of about 60 persons. The Institute has organised a number of international conferences and publishes the international journal Machine Graphics and Vision as well as scientific books and dissertations.
Prof. Leonard Bolc.
Birth: June 18th, 1934.
Degrees: 1993 nominated professor from Institute of Computer Science, Warsaw
1969 D. Sc. University of Poznan,
1964 Ph.D. degree in the field of applied linguistics at the University of Poznan.
1958 M. Sc. University of Poznan (Poland) at the faculty of philology.
1988 to present, Institute of Computer Science, Polish Academy of Sciences
1958-1988, University of Warsaw, Institute of Informatics
1992-1994 the project Teoretyczne i metodologiczne podstawy budowy inteligentnych systemow z dostepem w jezyku naturalnym (English title: Theoretical and methodological foundations of intelligent systems with natural language interface) financed by KBN (State Committee for the Scientific Research).
Birth: 24 June, 1968.
1995 graduation with honours from the University of Warsaw at the faculty of Mathematics, Informatics and Mechanics.
October 1993 to May 1994 a non-graduating MSc student at the Centre for Cognitive Science, University of Edinburgh, within the framework of the TEMPUS Joint European Project.
October 1991 to June 1992: a guest student at Heriot-Watt University, Edinburgh, within the framework of an individual TEMPUS programme.
April 1995 - January 1996: participation in the Verbmobil Machine Translation Project, Semantic Construction subproject, at the University of Stuttgart.
June 1994 - present: collaboration with Prof. L. Bolc's Man-Machine Communication Group at the Institute of Computer Science, Polish Academy of Sciences, Warsaw.
Bolc, L., Mykowiecka, A., Marciniak, M., Kupsc, A., Przepiórkowski, A. & Czuba, K. (1996) Wykorzystanie gramatyki HPSG do opisu jezyka polskiego. In Z. Vetulani, W. Abramowicz & G. Vetulani (eds.) Jezyk i technologia 1995, Akademicka Oficyna Wydawnicza PLJ, Warsaw (English title: Formal description of Polish within the framework of HPSG).
Bolc, L., Czuba, K., Kupsc, A., Marciniak, M., Mykowiecka, A. & Przepiórkowski, A. (1996) A survey of systems for implementing HPSG grammars, IPI PAN Report 814, Warsaw.
Czuba, K. & Przepiórkowski, A. (1995) Agreement and case assignment in Polish: An Attempt at a Unified Account, IPI PAN Report 783, Warsaw.
Kupsc, A., Marciniak, M. & Bolc, L. (1997) Anaphor binding in Polish: an attempt at an HPSG analysis, IPI PAN Report, Warsaw.
Przepiórkowski, A. (1994) Critical review of approaches to multiple wh-movement. Research paper EUCCS/RP-62, Centre for Cognitive Science, University of Edinburgh.
Przepiórkowski, A. (1996) Case assignment in Polish: Towards an HPSG analysis. In C. Grover & E. Vallduví (eds.) Edinburgh Working Papers in Cognitive Science, Vol. 12: Studies in HPSG, pp. 191-228, Centre for Cognitive Science, University of Edinburgh.
Przepiórkowski, A. & Kupsc, A. (1997) Negative Concord in Polish, IPI PAN Report 828.
Przepiórkowski, A. & Swidzinski, M. (1997) Polish verbal negation revisited: A metamorphosis vs. HPSG account, IPI PAN Report 829, Warsaw.
Przepiórkowski, A. & Kupsc, A. (1997) Verbal negation and complex predicate formation in Polish. In Proceedings of the 1997 Conference of the Texas Linguistic Society on the Syntax and Semantics of Predication, Austin, to appear.
A formal, computer-tractable description of natural language is the main task of computational linguistics. Unfortunately, in Poland this field of research is less developed than in many other countries (USA, EU). This is readily apparent when comparing the work done towards implementing large-scale grammars of various languages. For Polish, such efforts have been made only in the metamorphosis grammar framework, while for numerous European languages (English, German, French, etc.) many competing grammars have been implemented. The most important objective of the current research project is the creation of a relatively large HPSG-based grammar for Polish. This task requires analysing constructions of Polish, formulating appropriate generalisations and finally expressing the relevant rules in the HPSG formalism. Some careful modifications and elaboration of the current version of the theory will be indispensable to cover the specific linguistic phenomena of Polish.
The resulting grammar will provide a good basis for various applications, including commercial ones, e.g. grammar checkers and machine-aided translation systems. The grammar can also be used as a testing tool for further research.
D.2 Summary of the proposed CRIT2 extension
The objective of our proposal is to create a treebank of syntactic structures of Polish, using HPSG (Head-driven Phrase Structure Grammar) to encode the parse trees. The formal HPSG grammar of Polish developed in our ongoing KBN project will be used for this purpose. Such an HPSG-encoded treebank will give sound linguistic grounds for evaluating and improving the KBN grammar and its implementation. The treebank can also be used for evaluating other grammars and for writing more effective parsers, e.g. parsers that capture free word order phenomena or incorporate probabilistic data.
The HPSG framework we have chosen is currently one of the leading linguistic formalisms, used in both theoretical and application-oriented research programmes all over the world.
Recently, interest in modern language technologies has also extended to Slavic languages (Czech and Bulgarian so far), and HPSG-based grammars have been used in LaTeSlav (Language Processing Technologies for Slavic Languages), a European Union joint research project. The use of a uniform linguistic platform for diverse languages has the advantage of simplifying potential integration with grammars of other languages. Another major step towards future practical developments is building a treebank of linguistic constructions. Although in the KBN project we concentrate mostly on the syntactic description of Polish, both semantics and morphology will be taken into account in our grammar.
The work in this project will be divided into two tasks: preparing the test data from a corpus of Polish texts, and manually annotating the selected texts to produce a linguistically motivated set of syntactic parses. This will also be the first test of the adequacy and coverage of our HPSG grammar. Once such a treebank is prepared, it will be used to improve the implementation of the grammar. The organisation and management of the proposal are closely tied to our ongoing project. Work on the proposal can start no sooner than the end of the second phase of the KBN project.
In the last fifteen years many methods of describing natural language utterances have been developed, but few of them are widely used. In computational linguistics one can point to only a small number of formalisms in which real grammars are formulated. One of them is without doubt Head-driven Phrase Structure Grammar (HPSG), which has been used both as a basis for theoretical research and as a tool for creating a number of applications. The success of HPSG rests on the fact that it allows a relatively easy description of various linguistic phenomena and that these descriptions are relatively easy to implement. The main reasons for this popularity are the strong logical background of the formalism, its generative power and its declarativity, which enables relatively speedy development of large grammars.
HPSG grammars have been developed for a number of languages. The main research has been carried out for English (mainly in the USA but also in many other countries), German (Saarbruecken, Tuebingen, Stuttgart, Berlin etc., and also in places like Stanford or Ohio) and French (Paris, Lille, Geneva, Stanford). There is also systematic work on Italian, Spanish, Japanese, Korean, Turkish and other languages. Slavic languages have also been dealt with within the HPSG framework (e.g. the EU-funded project LaTeSlav concerning Bulgarian and Czech), but very little work has been done so far for Polish.
HPSG is by no means a complete theory; it undergoes further development driven by the specific properties of the new linguistic phenomena one wants to describe within this framework. This means that any work undertaken in this area needs continuous collaboration with other researchers involved in similar tasks, and that these contacts may have real value for all sides.
E.1.2. Objectives of the proposed research
The most important objective of the current research project is the creation of a relatively large HPSG-based grammar for Polish. To achieve this goal it is necessary to analyse Polish language constructions, to formulate appropriate generalisations and finally to express the relevant rules in the HPSG formalism. It is already clear that this task requires careful modification and elaboration of the current version of the theory to make it capable of covering the specific linguistic phenomena of Polish.
E.1.3. Significance of the proposed research and its expected achievements
The most significant result of the research will be the HPSG grammar of Polish. It will be the first computational grammar of Polish combining lexical, syntactic and semantic information, and only the second one of that scale (following the one written in the metamorphosis grammar style). The grammar will provide a good basis for further applications, including commercial ones, e.g. grammar checkers and machine-aided translation systems. Suggested extensions to the HPSG formalism (e.g. concerning word order problems) will be our contribution to the research conducted in this field in other countries (EU, USA).
E.1.4. Technical description
The workplan consists of three tasks, which are being carried out in successive years.
Task 1: The analysis of the features of the HPSG grammar and its appropriateness for the description of Polish texts.
E.1.5. Organisation and Management
The project is led by Prof. Bolc and is carried out by the members of the Man-Machine Communication Group, IPI PAN, in collaboration with two Ph.D. students (University of Tuebingen and Carnegie Mellon University) and a professor of formal linguistics (University of Warsaw). The partial results are published as IPI PAN reports and presented at national and international conferences. The first two years of the project are devoted to solving separate problems connected with representing Polish sentences in the HPSG formalism. The last phase of the project will be devoted to combining all the results and implementing them jointly.
E.2.1. Objectives of the proposed research extension
2) Head-driven Phrase Structure Grammar (theoretical foundations of the formalism based on cross-linguistic generalisations) and English Resource Grammar (computational grammar of English) projects in Stanford, CSLI, USA;
3) Phrase Structure Grammar for French (HPSG grammar of French) sponsored by CNRS-NSF grant.
First, we will prepare the test data, i.e., the set of constructions (sentences) representing various linguistic phenomena. We will make a selection from a corpus of Polish texts, trying to cover as many different constructions as possible. The aim of this task is to obtain a representative sample of the language.
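One way to picture the selection step described above is as a coverage problem. The following is only an illustrative sketch, not the procedure of the project: it assumes that each candidate sentence has already been tagged by hand with the constructions it contains, and it picks sentences greedily so that each one adds as many not-yet-covered constructions as possible. The function name `select_sample`, the sentence identifiers and the construction labels are all hypothetical.

```python
from typing import Dict, List, Set


def select_sample(tagged: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily choose up to `budget` sentence ids covering many construction labels.

    `tagged` maps a (hypothetical) sentence id to the set of construction
    labels assigned to that sentence by a human annotator.
    """
    remaining = dict(tagged)
    covered: Set[str] = set()
    chosen: List[str] = []
    while remaining and len(chosen) < budget:
        # Pick the sentence that adds the most uncovered labels;
        # iterating in sorted order makes ties resolve deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # nothing new can be covered any more
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical hand-tagged candidates (labels are placeholders).
corpus = {
    "s1": {"negation", "wh-question"},
    "s2": {"negation"},
    "s3": {"anaphor-binding", "free-word-order"},
    "s4": {"wh-question", "anaphor-binding"},
}
print(select_sample(corpus, 2))  # ['s1', 's3']
```

A greedy choice does not guarantee the smallest possible sample, but it is simple to inspect by hand, which matters when the selection is meant to be linguistically motivated rather than purely statistical.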
The second task is the core of our proposal. We will manually annotate the selected texts, i.e., using only our formal linguistic description we will prepare a linguistically motivated set of syntactic parses. This will also be the first test of the adequacy and coverage of our HPSG grammar. All ambiguities and idiomatic expressions will have to be marked as well, so that possible alternatives can be anticipated during automatic parsing. The treebank will serve for the verification and improvement of the grammar devised in the KBN project.
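As a rough illustration of what a manually annotated entry with marked ambiguities and idioms might look like, the sketch below stores every alternative parse of a sentence side by side. It is an assumption made for illustration only: the class names `Node` and `Entry`, the field names and the plain category labels (S, NP, VP, V, N) are hypothetical simplifications, whereas a real HPSG analysis would use typed feature structures rather than atomic labels.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A parse-tree node: a phrase with daughters, or a leaf carrying a word."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""

    def leaves(self) -> List[str]:
        """Return the words dominated by this node, left to right."""
        if not self.children:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]


@dataclass
class Entry:
    """One treebank entry: a sentence with all of its annotated parses."""
    sentence: str
    parses: List[Node]
    # (start, end) token spans marked as idiomatic by the annotator.
    idiom_spans: List[Tuple[int, int]] = field(default_factory=list)

    @property
    def ambiguous(self) -> bool:
        """True when the annotator recorded more than one parse."""
        return len(self.parses) > 1

    def consistent(self) -> bool:
        """Check that every parse yields exactly the tokens of the sentence."""
        tokens = self.sentence.split()
        return all(p.leaves() == tokens for p in self.parses)


# "Jan czyta list" ("Jan is reading a letter"), with a single simplified parse.
tree = Node("S", [
    Node("NP", [Node("N", word="Jan")]),
    Node("VP", [Node("V", word="czyta"), Node("NP", [Node("N", word="list")])]),
])
entry = Entry("Jan czyta list", [tree])
print(entry.ambiguous, entry.consistent())  # False True
```

Keeping all alternatives in one entry makes it straightforward to compare the output of an automatic parser against the full set of analyses the annotators considered acceptable.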