Two scripts are provided freely in this section: corpus_to_matrix.pl and matrix_Nx2_to_gnuplot.pl. Instructions for each are given below.
I wrote these scripts to play with GradeStat and Gnuplot.
My scripts have toy status, since Aleksander Buczyński (also of the Institute of Computer Science) has recently programmed a more professional tool: Poliqarp 1.1, a new beta version of the Poliqarp KWIC browser used for the IPI PAN Corpus of Polish. Poliqarp 1.1 covers the functionality of both a KWIC browser and corpus_to_matrix.pl. It has a different query syntax, however.
DOWNLOAD OF THE SCRIPTS:
The downloadable ZIP archive (184K) contains:
DOWNLOAD OF AN EXAMPLE CORPUS:
To test my scripts, you may use the following zipped version of the immortal SFPW Corpus (a.k.a. the corpus of the Frequency Dictionary of Contemporary Polish) in the appropriate format:
The SFPW Corpus is a POS-tagged corpus of ca. 500,000 words of 1960s Polish, compiled for a frequency dictionary (Kurcz, Lewicki, Sambor, Szafran and Woronczak, Słownik frekwencyjny polszczyzny współczesnej, Kraków: Instytut Języka Polskiego PAN, 1990). It is GNU-licensed and also available in a few other formats:
My ZIP archive contains the file "ksfpw", which looks approximately like this:
form lemma POS number case gender person degree aspect negation
Sztuka sztuka subst sg nom f - - - -
utraciła utracić praet sg - f - - perf -
swoją swój adj sg acc f - pos - -
moc moc subst sg acc f - - - -
pobudzającą pobudzający adj sg acc f - pos - -
...
The first row of the file specifies the attribute names (according to the IPI PAN Corpus tagset). The next rows are records of the attribute values, relating to the consecutive text positions. You can use my scripts with a different moderately-sized corpus of the same format as "ksfpw", with the first row specifying names of some other attributes.
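The file layout described above can be read with a few lines of code. The following is a minimal Python sketch, not the code of my Perl scripts; the function name read_corpus is hypothetical, and it assumes that the columns are separated by whitespace and that no attribute value contains a space.

```python
import io

# The first rows of "ksfpw" as shown above (whitespace-separated columns are an assumption).
SAMPLE = """\
form lemma POS number case gender person degree aspect negation
Sztuka sztuka subst sg nom f - - - -
utraciła utracić praet sg - f - - perf -
swoją swój adj sg acc f - pos - -
moc moc subst sg acc f - - - -
pobudzającą pobudzający adj sg acc f - pos - -
"""

def read_corpus(stream):
    """Return one dict per text position, keyed by the attribute names of the first row."""
    names = stream.readline().split()
    records = []
    for line in stream:
        values = line.split()
        if not values:
            continue  # skip blank lines
        records.append(dict(zip(names, values)))
    return records

records = read_corpus(io.StringIO(SAMPLE))
print(records[0]["lemma"], records[0]["case"])
```

Because the attribute names come from the first row, the same reader would work for a corpus with a different set of attributes.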
FYI: The first sentence of the SFPW Corpus "Sztuka utraciła swoją moc pobudzającą." reads "Art has lost its invigorating power."
EFFECTIVE USE:
If you use my scripts for research, let me know and I may mention your publications/slides here.
The toy Perl script "corpus_to_matrix.pl" can compute an astonishingly wide range of contingency tables for a small annotated corpus. You may use it, for instance, in linguistic research. The output of the script can be easily imported into GradeStat.
USAGE:
corpus_to_matrix.pl [WINDOW WIDTH]
[MATCHING CONDITION]
[ROW VARIABLE]
[COLUMN VARIABLE]
[MINIMAL NUMBER OF ROW VALUE OCCURENCES]
< [CORPUS FILE]
> [CONTINGENCY TABLE FILE]
corpus_to_matrix.pl {to produce a help info}
EXAMPLES:
Let the corpus file be ksfpw. The task is to compute a contingency table reporting the counts (the frequencies) of nouns (POS=subst) occurring in the respective genders and cases and to write the table to the file "m_SUBST_gender_case":
acc dat gen inst loc nom voc
f 10835 922 22900 4497 8291 12764 686
m1 2386 1186 6978 1689 275 12737 1633
m2 189 28 403 94 57 442 25
m3 10632 421 21313 4222 9634 10017 32
n 7328 444 11097 2524 4303 7742 74
In order to compute this table, simply call
./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "1.case" 1 <ksfpw >m_SUBST_gender_case
in the Bash terminal (or in another Linux shell terminal).
The syntax of the call is quite intuitive. The less obvious points are the arguments [WINDOW WIDTH] and [MINIMAL NUMBER OF ROW VALUE OCCURENCES], both equal to "1", and the "1." strings preceding the attribute names in the arguments [MATCHING CONDITION], [ROW VARIABLE], and [COLUMN VARIABLE]. The following call illustrates them:
./corpus_to_matrix.pl 2 "" "1.POS" "2.POS" 1 <ksfpw >m_ALL_POS_POS+1
The meaning is: Scan the corpus line by line. At every line, look at a window of 2 lines: this line and the next one. Compute the value of the row variable as the POS in the first line (1.POS) and the value of the column variable as the POS in the second line (2.POS). Add 1 to the cell of the contingency table identified by the computed pair of values.
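The window scan described above can be sketched in Python as follows. This is only an illustration of the counting idea, not the code of corpus_to_matrix.pl. The function window_table and its (position, attribute) arguments are hypothetical, and how the real script treats incomplete windows at the end of the file is an assumption (here they are skipped).

```python
from collections import Counter

def window_table(records, width, row_var, col_var):
    """Count (row value, column value) pairs over every window of `width` consecutive positions.

    row_var and col_var are (position, attribute) pairs with 1-based positions,
    mirroring specifications such as "1.POS" and "2.POS".
    """
    table = Counter()
    for start in range(len(records) - width + 1):
        window = records[start:start + width]
        row = window[row_var[0] - 1][row_var[1]]
        col = window[col_var[0] - 1][col_var[1]]
        table[(row, col)] += 1  # add 1 to the cell identified by the pair
    return table

# POS values of the first five positions of the corpus, as listed in "ksfpw"
records = [{"POS": p} for p in ["subst", "praet", "adj", "subst", "adj"]]
table = window_table(records, 2, (1, "POS"), (2, "POS"))
for (row, col), count in sorted(table.items()):
    print(row, col, count)
```

With a window width of 1, both variables refer to the same line, which is why the attribute names in the first example are all prefixed with "1.".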
The specifications of [MATCHING CONDITION], [ROW VARIABLE], and [COLUMN VARIABLE] can be more complex:
./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "(1.case=loc,1.case=voc)" 1 <ksfpw >m_SUBST_gender_loc_voc
yields a subtable of the table displayed above, namely:
1.case=loc 1.case=voc
f 8291 686
m1 275 1633
m2 57 25
m3 9634 32
n 4303 74
./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "(1.case=loc|inst,1.case=voc|dat)" 1 <ksfpw >m_SUBST_gender_loc+inst_voc+dat
yields a table with the summed frequencies of the locative and instrumental cases ("loc" and "inst") vs. the summed frequencies of the vocative and dative ("voc" and "dat"). Namely:
1.case=loc|inst 1.case=voc|dat
f 12788 1608
m1 1964 2819
m2 151 53
m3 13856 453
n 6827 518
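The merged columns can be checked by hand against the full gender-by-case table given earlier. The short Python sketch below does this arithmetic; the counts are copied from that table, and the helper merged is a hypothetical name used only for illustration.

```python
# Counts copied from the gender-by-case table above
CASES = ["acc", "dat", "gen", "inst", "loc", "nom", "voc"]
COUNTS = {
    "f":  [10835, 922, 22900, 4497, 8291, 12764, 686],
    "m1": [2386, 1186, 6978, 1689, 275, 12737, 1633],
    "m2": [189, 28, 403, 94, 57, 442, 25],
    "m3": [10632, 421, 21313, 4222, 9634, 10017, 32],
    "n":  [7328, 444, 11097, 2524, 4303, 7742, 74],
}

def merged(gender, alternatives):
    """Sum the counts of all cases listed in an "a|b" alternative."""
    wanted = alternatives.split("|")
    return sum(COUNTS[gender][CASES.index(case)] for case in wanted)

for gender in COUNTS:
    print(gender, merged(gender, "loc|inst"), merged(gender, "voc|dat"))
```

The printed sums agree with the table above, which confirms that "|" simply pools the matching positions into one column.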
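The row and column specifications may also be left empty, in which case the output is a single count:
./corpus_to_matrix.pl 1 "" "" "" 1 <ksfpw
661838
./corpus_to_matrix.pl 1 "1.POS=interp" "" "" 1 <ksfpw
106772
Such single counts are of limited interest, but the option of displaying the total frequencies of rows may be quite useful. Observe that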
./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "(,1.case=voc)" 1 <ksfpw
produces the following output, in which the first, unnamed column holds the row totals:
1.case=voc
f 60895 686
m1 26884 1633
m2 1238 25
m3 56271 32
n 33512 74
./corpus_to_matrix.pl 1 "1.POS=subst" "1.number:1.case" "1.gender" 1 <ksfpw >m_SUBST_number\&case_gender
yields a transposed table with the concatenated values of number and case:
f m1 m2 m3 n
pl:acc 2705 797 69 2913 1536
pl:dat 194 393 6 153 118
pl:gen 5736 3078 224 7878 3340
pl:inst 756 428 18 911 461
pl:loc 1434 86 14 1910 933
pl:nom 2587 2297 155 2537 1351
pl:voc 34 119 3 4 10
sg:acc 8130 1589 120 7719 5792
sg:dat 728 793 22 268 326
sg:gen 17164 3900 179 13435 7757
sg:inst 3741 1261 76 3311 2063
sg:loc 6857 189 43 7724 3370
sg:nom 10177 10440 287 7480 6391
sg:voc 652 1514 22 28 64
To sum up, consider the last but one example:
./corpus_to_matrix.pl 3 "2.POS=prep|conj" "2.lemma" "(,1.case=-,3.case=-,1.POS=interp)" 300 <ksfpw >m_NDM_lemma_subst-1_subst+1_interp-1
This query selects all triplets in the corpus whose middle token is a preposition or a conjunction ("prep" or "conj") and computes a contingency table in which the rows are the lemmas of these noninflected words. The argument 300 sets the minimal number of occurrences of a row value. The first column of numbers gives the row totals, and the remaining columns count the following contexts: i) the previous token is a punctuation mark (1.POS=interp); ii) the previous token has no case attribute (1.case=-); iii) the next token has no case attribute (3.case=-). The table looks like this:
1.POS=interp 1.case=- 3.case=-
a 3258 2860 2929 1513
aby 413 359 382 263
ale 2219 2124 2166 1292
ani 354 190 271 78
bez 504 128 328 1
bo 894 826 880 497
czy 765 350 435 308
dla 1459 206 653 23
do 6330 536 3331 121
gdy 438 344 419 197
i 13304 1418 3396 4151
jak 2568 1481 2103 1291
jako 488 137 275 34
jeśli 358 259 348 229
kiedy 313 268 298 174
lub 454 50 96 113
między 573 83 243 5
na 9729 1456 5492 452
nad 544 40 249 5
niż 384 44 184 102
o 3446 521 2015 94
od 2071 330 1139 92
oraz 786 75 91 91
po 2146 660 1563 243
pod 713 92 387 7
przed 934 144 600 13
przez 1557 138 595 39
przy 1063 254 625 28
to 985 815 927 617
u 453 101 336 1
w 17501 3502 9812 250
więc 747 223 710 346
z 9106 1142 4359 211
za 1489 275 943 50
że 4743 4424 4691 2084
żeby 641 580 632 428
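Since the row totals differ by orders of magnitude, the raw counts are easier to compare once they are divided by the total of their row. The Python sketch below does this for a few rows copied from the table above. It assumes that the first number in each row is the row total, and the helper shares is a hypothetical name.

```python
# A few rows copied from the table above: (total, 1.POS=interp, 1.case=-, 3.case=-)
ROWS = {
    "bez": (504, 128, 328, 1),
    "i": (13304, 1418, 3396, 4151),
    "w": (17501, 3502, 9812, 250),
    "że": (4743, 4424, 4691, 2084),
}

def shares(lemma):
    """Divide every context count of a lemma by its row total."""
    total, *contexts = ROWS[lemma]
    return [count / total for count in contexts]

for lemma in ROWS:
    after_punct, prev_no_case, next_no_case = shares(lemma)
    print(f"{lemma}: {after_punct:.2f} {prev_no_case:.2f} {next_no_case:.2f}")
```

For instance, "że" follows a punctuation mark in the great majority of its occurrences, whereas "bez" is almost never followed by a token without the case attribute.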
Potentially, the script can be used in lexical semantic research. For this usage, however, the KSFPW Corpus itself may be too small. The last example shows the distribution of color adjectives and the nouns which follow them:
./corpus_to_matrix.pl 2 "2.POS=subst:1.lemma=^(biały|czarny|zielony|czerwony)$" "2.lemma" "1.lemma" 3 <ksfpw
biały czarny czerwony niebieski zielony żółty
beret - - 3 - - -
człowiek 3 - - - - -
dom 5 - - - - -
flaga - - 3 - - -
głowa - 2 1 - - -
góra - - - - 13 -
kolor - 1 - 1 1 1
krzyż - 1 6 - - -
linia - 1 - - 2 -
magia - 3 - - - -
najemnik 3 - - - - -
oko - - 1 2 - -
sukienka 1 1 1 - - -
światło - 1 2 - 1 -
The toy Perl script "matrix_Nx2_to_gnuplot.pl" calls Gnuplot, ps2pdf, and Bash to make a graph of an N x 2 contingency matrix. The row attributes may be plotted as (nonoverlapping) text labels or as data points (colored according to the attribute values). A contingency matrix readable by "matrix_Nx2_to_gnuplot.pl" can be generated, e.g., by corpus_to_matrix.pl. The resulting graph is written out as a PDF file.
(Script "matrix_Nx2_to_gnuplot.pl" replaces the less convenient script "prune.labels.pl" previously available from this page.)
USAGE:
matrix_Nx2_to_gnuplot.pl [MINIMAL X]
[MAXIMAL X]
0|1 {is X in logscale?}
[MINIMAL Y]
[MAXIMAL Y]
0|1 {is Y in logscale?}
[FONTSIZE]
[ENCODING]
P|Q|R|L {plot type}
< [CONTINGENCY MATRIX FILE]
> [PDF FILE WITH THE GRAPH]
matrix_Nx2_to_gnuplot.pl {to produce this help info}
EXAMPLES:
Consider a contingency table summarizing the frequencies of nouns in two groups of cases: i) accusative, instrumental, and locative ("acc|inst|loc") vs. ii) nominative, dative, and vocative ("nom|dat|voc"). Using the previously discussed script corpus_to_matrix.pl, run on the SFPW Corpus, we can compute it as follows:
./corpus_to_matrix.pl 1 "1.POS=subst" "1.lemma" "(1.case=acc|inst|loc,1.case=nom|dat|voc)" 3 <ksfpw >m_SUBST_lemma_AIL_NDV
1.case=acc|inst|loc 1.case=nom|dat|voc
Adam 2 4
Aldona - 1
Andrzej - 5
Anglik - 2
Anna - 5
Antoni 1 5
Artur 1 4
Bonn 1 3
Chiny - 1
Czarny - 3
Czesiek 1 3
Dębski 1 4
Ewa 1 1
Francja - 4
Francuz - 3
Grzegorz - 6
It may be more insightful to plot this table as a planar plot, with the text labels taken from column #1 and placed at the coordinates given by columns #2 and #3. This can be achieved by calling:
cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 R >m_SUBST_lemma_AIL_NDV.pdf
Unfortunately, Gnuplot cannot use the CP1250 encoding of Central European fonts in its graphs, so all diacritics in the input file were transcoded into the ISO-8859-2 encoding (via the script "winiso") before the file was sent to "matrix_Nx2_to_gnuplot.pl". Additionally, to increase readability, "matrix_Nx2_to_gnuplot.pl" was run with the plot type option "R", which removes some overlapping labels. To obtain a plot that displays all the labels, call
cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 L >m_SUBST_lemma_AIL_NDV.pdf
where the plot type option "L" is applied instead of "R". It is hardly useful for this dataset.
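The transcoding step performed by "winiso" in the calls above can be sketched in Python. This is not the code of "winiso" itself, only an illustration of a CP1250 to ISO-8859-2 conversion using Python's standard codecs; the function name is hypothetical, and it assumes the input bytes really are CP1250.

```python
def cp1250_to_iso_8859_2(data: bytes) -> bytes:
    """Re-encode CP1250 bytes as ISO-8859-2.

    Characters that exist in CP1250 but not in ISO-8859-2 raise UnicodeEncodeError.
    """
    return data.decode("cp1250").encode("iso8859_2")

word = "ściana"
converted = cp1250_to_iso_8859_2(word.encode("cp1250"))
# The letter "ś" has a different byte value in the two encodings,
# so the raw bytes change while the decoded text stays the same.
print(converted.decode("iso8859_2"))
```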
Furthermore, it is possible to draw points (dots) instead of labels. Replace the plot type option "L" with option "P" to do so:
cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 P >m_SUBST_lemma_AIL_NDV.pdf
Option "P" results in drawing all nouns as dots of the same shape and color. You can differentiate these points by shapes and colors, using option "Q".
The plot type option "Q" requires each point in the graph to be plotted in a style that is distinct for the text attribute of the point (given in column #1 of the input file). Recall that each text attribute is unique to its row in the file "m_SUBST_lemma_AIL_NDV". Therefore calling
cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 Q >m_SUBST_lemma_AIL_NDV.pdf
results in an error message, since neither "matrix_Nx2_to_gnuplot.pl" nor Gnuplot can handle so large a number of different point styles.
We can, however, group the nouns by gender. First we have to add explicit gender information to the plotted matrix:
./corpus_to_matrix.pl 1 "1.POS=subst" "1.lemma:1.gender" "(1.case=acc|inst|loc,1.case=nom|dat|voc)" 3 <ksfpw >m_SUBST_lemma\&gender_AIL_NDV
1.case=acc|inst|loc 1.case=nom|dat|voc
Adam:m1 2 4
Aldona:f - 1
Andrzej:m1 - 5
Anglik:m1 - 2
Anna:f - 5
Antoni:m1 1 5
Artur:m1 1 4
Bonn:n 1 3
Chiny:n - 1
Czarny:m1 - 3
Czesiek:m1 1 3
Dębski:m1 1 4
Ewa:f 1 1
Francja:f - 4
Francuz:m1 - 3
Grzegorz:m1 - 6
The nouns in the SFPW Corpus have one of five genders: masculine personal (m1), masculine animal (m2), masculine inanimate (m3), feminine (f), and neuter merged with pluralia tantum (n). For the subsequent processing, we will drop the lemmas from the input file and omit the rows containing nouns of genders "m2", "f", and "n":
cat m_SUBST_lemma\&gender_AIL_NDV | perl -pe "s/.*:(f|n|m2).*\n//; s/.*://;" |./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 Q >m_SUBST_lemma\&gender_AIL_NDV.pdf
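For readers less familiar with Perl one-liners, the two substitutions in the call above can be restated in Python. This is a sketch of the same filtering logic, with a hypothetical function name; the sample rows are copied from the matrix shown earlier.

```python
import re

def keep_m1_m3(lines):
    """Drop rows of genders f, n and m2, then strip the "lemma:" prefix from the rest."""
    kept = []
    for line in lines:
        if re.search(r":(f|n|m2)", line):
            continue  # corresponds to s/.*:(f|n|m2).*\n//
        kept.append(re.sub(r"^.*:", "", line))  # corresponds to s/.*://
    return kept

rows = ["Adam:m1 2 4", "Aldona:f - 1", "Bonn:n 1 3", "Czarny:m1 - 3"]
for row in keep_m1_m3(rows):
    print(row)
```

After this step, column #1 holds only the gender, so option "Q" has to distinguish just two point styles ("m1" and "m3").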
The resulting plot clearly shows that nouns of gender "m1" tend to occur mostly in the nominative, dative, and vocative ("nom|dat|voc"), whereas nouns of gender "m3" prefer the accusative, instrumental, and locative ("acc|inst|loc"). This has to do with semantics rather than with formal syntactic rules. The "m1" nouns denote persons, while the "m3" nouns correspond mostly to things, places, or time units. As can be checked in the previous plot, nouns denoting persons of any sex prefer the syntactic positions expressed via "nom|dat|voc": they are usually the agents and beneficiaries of actions or the addressees of utterances. For instance, feminines such as "pani" (lady, Mrs., Ms.) and "mama" (mother) occupy the "m1" region of the graph. On the other hand, nouns denoting non-persons tend to occur in the accusative, instrumental, and locative: they are the objects, instruments, or locations of actions. Hence, the "m3" area also covers feminine and neuter nouns like "godzina" (hour), "miejsce" (place), and "ściana" (wall). (If you have trouble spotting them in m_SUBST_lemma_AIL_NDV.pdf, simply use the Acrobat Reader text search command!)
Notice that all titles in the graphs were added automatically by the script, which used the information available in the input file.
The last example of a plot produced by "matrix_Nx2_to_gnuplot.pl" was prepared using different corpus data, provided by Dorota Lewandowska. These data are not fully publicly available and consist of two parts: (i) the 100,000-word newspaper subcorpus of the SFPW Corpus, excerpted in the 1960s, and (ii) a new 100,000-word newspaper corpus excerpted in the 1990s. Both subcorpora were POS-tagged. In the graph below, you can see the difference in word frequencies between the newspeak of the People's Republic of Poland and the press language of the Third Republic of Poland:
Łukasz Dębowski, 21.06.2007