
Visualization and extraction of frequency matrices from corpora

In this section, two scripts are provided freely: corpus_to_matrix.pl and matrix_Nx2_to_gnuplot.pl. Specific instructions for each of them are given below.

I wrote these scripts to play with GradeStat and Gnuplot.

My scripts are mere toys, since Aleksander Buczyński (also of the Institute of Computer Science) has recently written a more professional tool: Poliqarp 1.1, a new beta version of the Poliqarp KWIC browser used for the IPI PAN Corpus of Polish. Poliqarp 1.1 covers the functionality of both a KWIC browser and corpus_to_matrix.pl. It has a different query syntax, however.

DOWNLOAD OF THE SCRIPTS:

The downloadable ZIP archive (184K) contains:

DOWNLOAD OF AN EXAMPLE CORPUS:

To test my scripts, you may use the following zipped version of the immortal SFPW Corpus (a.k.a. the corpus of the Frequency dictionary of contemporary Polish) in the appropriate format:

The SFPW Corpus is a POS-tagged corpus of ca. 500,000 words of 1960s Polish, compiled for a frequency dictionary (Kurcz, Lewicki, Sambor, Szafran and Woronczak, Słownik frekwencyjny polszczyzny współczesnej, Kraków: Instytut Języka Polskiego PAN, 1990). It is GNU-licensed and also available in a few other formats:

My ZIP archive contains the file "ksfpw", which looks approximately like this:


form lemma POS number case gender person degree aspect negation
Sztuka sztuka subst sg nom f - - - -
utraciła utracić praet sg - f - - perf -
swoją swój adj sg acc f - pos - -
moc moc subst sg acc f - - - -
pobudzającą pobudzający adj sg acc f - pos - -
...

The first row of the file gives the attribute names (following the IPI PAN Corpus tagset). Each subsequent row is a record of attribute values for one consecutive text position. You can use my scripts with a different, moderately sized corpus in the same format as "ksfpw", whose first row may name some other attributes.
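
The file format described above can be read with a few lines of code. The following is a minimal Python sketch, not part of the Perl scripts; the function name read_corpus is hypothetical, and it assumes that the fields are separated by whitespace (the actual separator in "ksfpw" may be a tab).

```python
import io

def read_corpus(lines):
    """Parse a ksfpw-style corpus: a header row of attribute names,
    then one record of attribute values per text position.
    Assumption: fields are whitespace-separated."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    return [dict(zip(header, row)) for row in rows]

sample = """\
form lemma POS number case gender person degree aspect negation
Sztuka sztuka subst sg nom f - - - -
utraciła utracić praet sg - f - - perf -
swoją swój adj sg acc f - pos - -
"""

corpus = read_corpus(io.StringIO(sample))
print(corpus[0]["lemma"], corpus[1]["aspect"], len(corpus))
```

Each text position becomes a dictionary keyed by the attribute names from the first row, so a corpus with other attributes is handled in the same way.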

FYI: The first sentence of the SFPW Corpus "Sztuka utraciła swoją moc pobudzającą." reads "Art has lost its invigorating power."

EFFECTIVE USE:

If you use my scripts for research, let me know and I may mention your publications/slides here.


corpus_to_matrix.pl

This toy Perl script can compute an astonishingly wide range of contingency tables for a small annotated corpus. You may use it for instance in linguistic research. The output of the script can be imported to GradeStat easily.

USAGE:


corpus_to_matrix.pl [WINDOW WIDTH]
                    [MATCHING CONDITION]
                    [ROW VARIABLE]
                    [COLUMN VARIABLE]
                    [MINIMAL NUMBER OF ROW VALUE OCCURENCES]
                    < [CORPUS FILE]
                    > [CONTINGENCY TABLE FILE]

corpus_to_matrix.pl    {to produce a help info}

EXAMPLES:

Let the corpus file be ksfpw. The task is to compute a contingency table reporting the counts (frequencies) of nouns (POS=subst) occurring in the respective genders and cases, and to write the table to the file "m_SUBST_gender_case":


        acc     dat     gen     inst    loc     nom     voc
f       10835   922     22900   4497    8291    12764   686
m1      2386    1186    6978    1689    275     12737   1633
m2      189     28      403     94      57      442     25
m3      10632   421     21313   4222    9634    10017   32
n       7328    444     11097   2524    4303    7742    74

In order to compute this table, call simply


./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "1.case" 1 <ksfpw >m_SUBST_gender_case
in the Bash terminal (or in another Linux shell terminal).
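
To make clear what such a call computes, here is a minimal Python sketch of the counting step for a window of width 1. It is an illustration only and does not reproduce the Perl implementation. The five records are a hypothetical miniature corpus, and the function name contingency is an assumption.

```python
from collections import Counter

# Hypothetical miniature corpus; the real input would be the "ksfpw" file.
corpus = [
    {"lemma": "sztuka", "POS": "subst", "case": "nom", "gender": "f"},
    {"lemma": "utracić", "POS": "praet", "case": "-", "gender": "f"},
    {"lemma": "swój", "POS": "adj", "case": "acc", "gender": "f"},
    {"lemma": "moc", "POS": "subst", "case": "acc", "gender": "f"},
    {"lemma": "dom", "POS": "subst", "case": "acc", "gender": "m3"},
]

def contingency(records, condition, row_var, col_var):
    """Count (row value, column value) pairs over the records that match."""
    counts = Counter()
    for rec in records:
        if condition(rec):
            counts[(rec[row_var], rec[col_var])] += 1
    return counts

# Analogue of: "1.POS=subst" "1.gender" "1.case"
table = contingency(corpus, lambda r: r["POS"] == "subst", "gender", "case")

rows = sorted({r for r, _ in table})
cols = sorted({c for _, c in table})
print("\t" + "\t".join(cols))
for r in rows:
    # Empty cells are printed as "-", as in the tables on this page.
    print(r + "\t" + "\t".join(str(table.get((r, c), "-")) for c in cols))
```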

The syntax of the call is quite intuitive. The less obvious points are the arguments [WINDOW WIDTH] and [MINIMAL NUMBER OF ROW VALUE OCCURENCES], both equal to "1", as well as the "1." strings preceding the attribute names in the arguments [MATCHING CONDITION], [ROW VARIABLE], and [COLUMN VARIABLE].

  1. The first argument, [WINDOW WIDTH], says how many consecutive lines of the corpus are needed to compute the single values of [MATCHING CONDITION], [ROW VARIABLE], and [COLUMN VARIABLE] for a scanned position in the corpus. You have to refer to the relative positions of these lines in the specifications [MATCHING CONDITION], [ROW VARIABLE], and [COLUMN VARIABLE]. For instance, if you wish to compute the frequency of part-of-speech (POS) bigrams then call:
    
    ./corpus_to_matrix.pl 2 "" "1.POS" "2.POS" 1 <ksfpw >m_ALL_POS_POS+1
    
    The meaning is: Scan the corpus line by line. At every line, look at a window of 2 lines: this line and the next one. Compute the value of the row variable as the POS in the first line (1.POS) and the value of the column variable as the POS in the second line (2.POS). Add 1 to the cell of the contingency table identified by the computed pair of values.
  2. If [MINIMAL NUMBER OF ROW VALUE OCCURENCES] is set to N, then all rows whose total number of occurrences is less than N are deleted from the final contingency table. In our example, the argument is "1", so the output table is not pruned.
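
The two arguments explained above can be illustrated by a minimal Python sketch of bigram counting with row pruning. This is not the Perl implementation: the function name bigram_table and the tag sequence are hypothetical, and the sketch assumes that incomplete windows at the end of the corpus are simply skipped.

```python
from collections import Counter

def bigram_table(tags, min_row_total=1):
    """Count (tag at position i, tag at position i+1) pairs with a window of
    width 2, then drop the rows whose total is below min_row_total.
    Assumption: the last position, which has no successor, is skipped."""
    counts = Counter(zip(tags, tags[1:]))
    row_totals = Counter()
    for (row, _), n in counts.items():
        row_totals[row] += n
    return {pair: n for pair, n in counts.items()
            if row_totals[pair[0]] >= min_row_total}

# Hypothetical POS sequence standing in for the POS column of a corpus.
tags = ["subst", "praet", "adj", "subst", "adj", "subst", "interp"]

full = bigram_table(tags)                     # threshold 1: nothing is pruned
pruned = bigram_table(tags, min_row_total=2)  # the "praet" row (total 1) is dropped
print(full)
print(pruned)
```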

The specifications of [MATCHING CONDITION], [ROW VARIABLE], and [COLUMN VARIABLE] can be more complex:

  1. [COLUMN VARIABLE] may be specified as a comma-separated list of matching conditions in parentheses. For example,
    
    ./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "(1.case=loc,1.case=voc)" 1 <ksfpw >m_SUBST_gender_loc_voc
    
    yields a subtable of the table displayed above, namely:
    
            1.case=loc      1.case=voc
    f       8291            686
    m1      275             1633
    m2      57              25
    m3      9634            32
    n       4303            74
    
  2. Matching conditions on the right hand side of "=" may be arbitrary Perl regular expressions. For instance, [MATCHING CONDITION] equal "1.POS=qub|prep|conj" means matching the lines with part-of-speech equal to "qub", "prep", or "conj". On the other hand,
    
    ./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "(1.case=loc|inst,1.case=voc|dat)" 1 <ksfpw >m_SUBST_gender_loc+inst_voc+dat
    
    yields a table with the summed frequencies of the locative and instrumental cases ("loc" and "inst") vs. the summed frequencies of the vocative and dative ("voc" and "dat"). Namely:
    
            1.case=loc|inst 1.case=voc|dat
    f       12788           1608
    m1      1964            2819
    m2      151             53
    m3      13856           453
    n       6827            518
    
  3. Any of the arguments may be the empty string. [MATCHING CONDITION] equal to "" matches all lines of the corpus. Row and column variables defined as "" have empty values. Perversely, you may use this to compute the corpus size (the number of words and punctuation marks in the corpus):
    
    ./corpus_to_matrix.pl 1 "" "" "" 1 <ksfpw
    
            661838
    
    ./corpus_to_matrix.pl 1 "1.POS=interp" "" "" 1 <ksfpw
    
            106772
    
    But the option of displaying the total frequencies of rows may be quite useful. Observe that
    
    ./corpus_to_matrix.pl 1 "1.POS=subst" "1.gender" "(,1.case=voc)" 1 <ksfpw
    
    produces the following output:
    
                    1.case=voc
    f       60895   686
    m1      26884   1633
    m2      1238    25
    m3      56271   32
    n       33512   74
    
  4. You can use the conjunction/concatenation operator ":". Thus [MATCHING CONDITION] equal to "1.POS=adj:2.POS=subst" matches the lines containing adjectives (POS=adj) followed by lines containing nouns (POS=subst). Row and column variables defined as "1.POS:2.POS" take the value of "1.POS" concatenated with the value of "2.POS" via the separator ":". The query
    
    ./corpus_to_matrix.pl 1 "1.POS=subst" "1.number:1.case" "1.gender" 1 <ksfpw >m_SUBST_number\&case_gender
    
    yields a transposed table with the concatenated values of number and case:
    
            f       m1      m2      m3      n
    pl:acc  2705    797     69      2913    1536
    pl:dat  194     393     6       153     118
    pl:gen  5736    3078    224     7878    3340
    pl:inst 756     428     18      911     461
    pl:loc  1434    86      14      1910    933
    pl:nom  2587    2297    155     2537    1351
    pl:voc  34      119     3       4       10
    sg:acc  8130    1589    120     7719    5792
    sg:dat  728     793     22      268     326
    sg:gen  17164   3900    179     13435   7757
    sg:inst 3741    1261    76      3311    2063
    sg:loc  6857    189     43      7724    3370
    sg:nom  10177   10440   287     7480    6391
    sg:voc  652     1514    22      28      64
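
The query syntax described in points 1 to 4 can be mimicked in a few dozen lines. The following Python sketch is an independent illustration and not the code of corpus_to_matrix.pl. The function names are hypothetical, and several simplifications are assumptions: regular expressions are applied unanchored (as a plain Perl match would be), and the specifications are split naively on ":" and ",", so patterns containing these characters are not supported.

```python
import re
from collections import Counter

def matches(window, condition):
    """True if every 'N.attr=regex' part of a ':'-joined condition holds.
    The empty condition matches every window."""
    for part in filter(None, condition.split(":")):
        ref, pattern = part.split("=", 1)
        pos, attr = ref.split(".", 1)
        if not re.search(pattern, window[int(pos) - 1][attr]):
            return False
    return True

def value(window, variable):
    """Value of a variable such as '1.number:1.case' (parts joined by ':')."""
    parts = []
    for ref in filter(None, variable.split(":")):
        pos, attr = ref.split(".", 1)
        parts.append(window[int(pos) - 1][attr])
    return ":".join(parts)

def table(records, width, condition, row_var, col_spec):
    counts = Counter()
    for i in range(len(records) - width + 1):
        window = records[i:i + width]
        if not matches(window, condition):
            continue
        row = value(window, row_var)
        if col_spec.startswith("("):      # list of conditions used as columns
            for cond in col_spec[1:-1].split(","):
                if matches(window, cond):
                    counts[(row, cond)] += 1
        else:                             # plain column variable
            counts[(row, value(window, col_spec))] += 1
    return counts

# Hypothetical miniature corpus.
corpus = [
    {"POS": "subst", "number": "sg", "case": "loc", "gender": "f"},
    {"POS": "subst", "number": "sg", "case": "inst", "gender": "f"},
    {"POS": "subst", "number": "pl", "case": "voc", "gender": "m1"},
    {"POS": "adj", "number": "sg", "case": "loc", "gender": "f"},
]

t1 = table(corpus, 1, "1.POS=subst", "1.gender",
           "(,1.case=loc|inst,1.case=voc|dat)")
t2 = table(corpus, 1, "1.POS=subst", "1.number:1.case", "1.gender")
print(dict(t1))
print(dict(t2))
```

In t1 the empty first condition matches every selected line, so its column holds the row totals.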
    

To sum up, consider the last but one example:


./corpus_to_matrix.pl 3 "2.POS=prep|conj" "2.lemma" "(,1.case=-,3.case=-,1.POS=interp)" 300 <ksfpw >m_NDM_lemma_subst-1_subst+1_interp-1

This query selects all triplets in the corpus whose middle token is a preposition or a conjunction ("prep" or "conj") and computes a contingency table in which the rows are the lemmas of these noninflected words. The first, unlabeled column gives the total count of each lemma, and the remaining columns count the following contexts: i) the previous token has no case attribute (1.case=-), ii) the next token has no case attribute (3.case=-), iii) the previous token is a punctuation mark (1.POS=interp). The table looks like this:


                                1.POS=interp    1.case=-        3.case=-
a               3258            2860            2929            1513
aby             413             359             382             263
ale             2219            2124            2166            1292
ani             354             190             271             78
bez             504             128             328             1
bo              894             826             880             497
czy             765             350             435             308
dla             1459            206             653             23
do              6330            536             3331            121
gdy             438             344             419             197
i               13304           1418            3396            4151
jak             2568            1481            2103            1291
jako            488             137             275             34
jeśli           358             259             348             229
kiedy           313             268             298             174
lub             454             50              96              113
między          573             83              243             5
na              9729            1456            5492            452
nad             544             40              249             5
niż             384             44              184             102
o               3446            521             2015            94
od              2071            330             1139            92
oraz            786             75              91              91
po              2146            660             1563            243
pod             713             92              387             7
przed           934             144             600             13
przez           1557            138             595             39
przy            1063            254             625             28
to              985             815             927             617
u               453             101             336             1
w               17501           3502            9812            250
więc            747             223             710             346
z               9106            1142            4359            211
za              1489            275             943             50
że              4743            4424            4691            2084
żeby            641             580             632             428

Potentially, the script can be used in lexical semantic research. For this purpose, however, the SFPW Corpus itself may be too small. The last example shows the distribution of color adjectives and the nouns which follow them:


./corpus_to_matrix.pl 2 "2.POS=subst:1.lemma=^(biały|czarny|czerwony|niebieski|zielony|żółty)$" "2.lemma" "1.lemma" 3 <ksfpw

         biały   czarny  czerwony niebieski zielony żółty
beret    -       -       3        -         -       -
człowiek 3       -       -        -         -       -
dom      5       -       -        -         -       -
flaga    -       -       3        -         -       -
głowa    -       2       1        -         -       -
góra     -       -       -        -         13      -
kolor    -       1       -        1         1       1
krzyż    -       1       6        -         -       -
linia    -       1       -        -         2       -
magia    -       3       -        -         -       -
najemnik 3       -       -        -         -       -
oko      -       -       1        2         -       -
sukienka 1       1       1        -         -       -
światło  -       1       2        -         1       -


matrix_Nx2_to_gnuplot.pl

This toy Perl script calls Gnuplot, ps2pdf, and Bash to make a graph of an N x 2 contingency matrix. The row attributes may be plotted as (nonoverlapping) text labels or as data points (colored according to the attribute values). A contingency matrix readable by "matrix_Nx2_to_gnuplot.pl" can be generated e.g. by corpus_to_matrix.pl. The resulting graph is written out as a PDF file.

(Script "matrix_Nx2_to_gnuplot.pl" replaces the less convenient script "prune.labels.pl" previously available from this page.)

USAGE:


matrix_Nx2_to_gnuplot.pl [MINIMAL X] 
                         [MAXIMAL X] 
                         0|1 {is X in logscale?}
                         [MINIMAL Y]
                         [MAXIMAL Y] 
                         0|1 {is Y in logscale?}
                         [FONTSIZE]
                         [ENCODING]
                         P|Q|R|L {plot type}
                         < [CONTINGENCY MATRIX FILE]
                         > [PDF FILE WITH THE GRAPH]

matrix_Nx2_to_gnuplot.pl    {to produce this help info}

EXAMPLES:

Consider a contingency table summarizing the frequencies of nouns in two groups of cases: i) accusative, instrumental, and locative ("acc|inst|loc") vs. ii) nominative, dative and vocative ("nom|dat|voc"). Using the previously discussed script corpus_to_matrix.pl, run on the SFPW Corpus, we can compute it as follows:


./corpus_to_matrix.pl 1 "1.POS=subst" "1.lemma" "(1.case=acc|inst|loc,1.case=nom|dat|voc)" 3 <ksfpw >m_SUBST_lemma_AIL_NDV

                1.case=acc|inst|loc     1.case=nom|dat|voc
Adam            2                       4
Aldona          -                       1
Andrzej         -                       5
Anglik          -                       2
Anna            -                       5
Antoni          1                       5
Artur           1                       4
Bonn            1                       3
Chiny           -                       1
Czarny          -                       3
Czesiek         1                       3
Dębski          1                       4
Ewa             1                       1
Francja         -                       4
Francuz         -                       3
Grzegorz        -                       6

It may be more insightful to display this table as a planar plot, with the text labels taken from column #1 and placed at the coordinates given by columns #2 and #3. This can be achieved by calling:


cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 R >m_SUBST_lemma_AIL_NDV.pdf

Unfortunately, Gnuplot cannot use the CP1250 encoding of Central European fonts in graphs, so all diacritics in the input file were transcoded into the ISO-8859-2 encoding (via the script "winiso") before it was sent to "matrix_Nx2_to_gnuplot.pl". Additionally, to increase readability, "matrix_Nx2_to_gnuplot.pl" was run with the plot type option "R", which removes some overlapping labels. To obtain a plot that displays all the labels, call


cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 L >m_SUBST_lemma_AIL_NDV.pdf
where the plot type option "L" is applied instead of "R". Such a plot is hardly useful for this dataset.
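
For readers curious how an N x 2 matrix can be turned into a labeled log-log plot, here is a minimal Python sketch that writes a Gnuplot script using the standard "with labels" plotting style and inline data. It is only an illustration under stated assumptions and does not show what "matrix_Nx2_to_gnuplot.pl" does internally: the function name is hypothetical, rows with an empty cell ("-") are simply skipped because they cannot be placed on a logarithmic axis, labels are assumed to contain no spaces, and no label pruning, terminal, or encoding setting is attempted.

```python
def gnuplot_label_script(matrix_lines, log_x=True, log_y=True):
    """Build a Gnuplot script that draws column #1 as text labels
    at the coordinates given by columns #2 and #3."""
    rows = [line.split() for line in matrix_lines[1:] if line.strip()]
    # Assumption: "-" marks an empty cell; such rows are left out.
    points = [(label, x, y) for label, x, y in rows if "-" not in (x, y)]
    script = []
    if log_x:
        script.append("set logscale x")
    if log_y:
        script.append("set logscale y")
    # Inline data follows the plot command and is terminated by "e".
    script.append("plot '-' using 1:2:3 with labels notitle")
    script += [f'{x} {y} "{label}"' for label, x, y in points]
    script.append("e")
    return "\n".join(script)

matrix = [
    "\t1.case=acc|inst|loc\t1.case=nom|dat|voc",
    "Adam\t2\t4",
    "Aldona\t-\t1",
    "Antoni\t1\t5",
]
script = gnuplot_label_script(matrix)
print(script)
```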

Furthermore, it is possible to draw points (dots) instead of labels. Replace the plot type option "L" with option "P" to do so:


cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 P >m_SUBST_lemma_AIL_NDV.pdf
Option "P" results in drawing all nouns as dots of the same shape and color. You can differentiate these points by shapes and colors, using option "Q".

The plot type option "Q" requires that each point in the graph be plotted in a style that is distinct for the text attribute of the point (given in column #1 of the input file). Recall that each text attribute is unique to its row in the file "m_SUBST_lemma_AIL_NDV". Therefore calling


cat m_SUBST_lemma_AIL_NDV |./winiso| ./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 Q >m_SUBST_lemma_AIL_NDV.pdf
results in an error message, since neither "matrix_Nx2_to_gnuplot.pl" nor Gnuplot can handle so large a number of different point styles.

We can group, however, nouns by genders. First we have to add the explicit gender information to the plotted matrix:


./corpus_to_matrix.pl 1 "1.POS=subst" "1.lemma:1.gender" "(1.case=acc|inst|loc,1.case=nom|dat|voc)" 3 <ksfpw >m_SUBST_lemma\&gender_AIL_NDV

                1.case=acc|inst|loc     1.case=nom|dat|voc
Adam:m1         2                       4
Aldona:f        -                       1
Andrzej:m1      -                       5
Anglik:m1       -                       2
Anna:f          -                       5
Antoni:m1       1                       5
Artur:m1        1                       4
Bonn:n          1                       3
Chiny:n         -                       1
Czarny:m1       -                       3
Czesiek:m1      1                       3
Dębski:m1       1                       4
Ewa:f           1                       1
Francja:f       -                       4
Francuz:m1      -                       3
Grzegorz:m1     -                       6

The nouns in the SFPW Corpus have one of five genders: masculine personal (m1), masculine animal (m2), masculine inanimate (m3), feminine (f), and merged neuter and pluralia tantum (n). For the following processing, we will drop the lemmas in the input file and omit the rows containing nouns of genders "m2", "f" and "n":


cat m_SUBST_lemma\&gender_AIL_NDV | perl -pe "s/.*:(f|n|m2).*\n//; s/.*://;" |./matrix_Nx2_to_gnuplot.pl - - 1 - - 1 12 iso_8859_2 Q >m_SUBST_lemma\&gender_AIL_NDV.pdf
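
The effect of the Perl one-liner in this pipeline may be easier to see in a step-by-step form. Below is a minimal Python sketch of the same two substitutions. The function name keep_m1_m3 is hypothetical, and the last sample row ("stół:m3") is made up for illustration.

```python
import re

def keep_m1_m3(lines):
    """Mimic: perl -pe 's/.*:(f|n|m2).*\\n//; s/.*://;'
    1) drop every line in which ':' is followed by f, n, or m2;
    2) on the remaining lines, delete everything up to the last ':'."""
    out = []
    for line in lines:
        if re.match(r".*:(f|n|m2)", line):
            continue
        out.append(re.sub(r".*:", "", line, count=1))
    return out

sample = [
    "\t1.case=acc|inst|loc\t1.case=nom|dat|voc",  # header: no ':' so it is kept
    "Adam:m1\t2\t4",
    "Aldona:f\t-\t1",
    "Bonn:n\t1\t3",
    "stół:m3\t5\t2",   # hypothetical row, not taken from the corpus
]
filtered = keep_m1_m3(sample)
print("\n".join(filtered))
```

After this step, column #1 holds only the gender label ("m1" or "m3"), which is what the plot type option "Q" uses to choose the point style.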

The resulting plot shows clearly that the nouns of gender "m1" tend to occur mostly in nominative, dative and vocative ("nom|dat|voc"), whereas the nouns of gender "m3" prefer to appear in accusative, instrumental, and locative ("acc|inst|loc"). This has to do with semantics rather than with formal syntactic rules. The "m1" nouns denote persons, while the "m3" nouns correspond mostly to things, places, or time units. As can be checked in the previous plot, the nouns denoting persons of any sex prefer the syntactic positions expressed via "nom|dat|voc": they are usually the agents and beneficiaries of actions or the addressees of utterances. For instance, feminines such as "pani" (lady, Mrs., Ms.) and "mama" (mother) occupy the "m1" region of the graph. On the other hand, the nouns denoting non-persons tend to occur in accusative, instrumental, and locative: they are the objects, instruments, or locations of actions. Hence, the "m3" area also covers feminine and neuter nouns like "godzina" (hour), "miejsce" (place), and "ściana" (wall). (If you have trouble spotting them in m_SUBST_lemma_AIL_NDV.pdf, simply use the Acrobat Reader text search command!)

Notice that all titles in the graphs were added automatically by the script, which used the information available in the input file.

The last example of a plot produced by "matrix_Nx2_to_gnuplot.pl" was prepared using different corpus data provided by Dorota Lewandowska. These data are not fully publicly available and consist of two parts: (i) the 100,000-word newspaper subcorpus of the SFPW Corpus, excerpted in the 1960s, and (ii) a new 100,000-word newspaper corpus excerpted in the 1990s. Both subcorpora were POS-tagged. In the graph below, you can see the difference in word frequencies between the newspeak of the People's Republic of Poland and the press language of the Third Republic of Poland:


Łukasz Dębowski, 21.06.2007