Gorazd Generator

The application was created within the project GORAZD: The Old Church Slavonic Digital Hub (implemented thanks to the NAKI II programme of the Ministry of Culture of the Czech Republic, DG16P02H024). The application automatically recognizes the structure of entries on a page or in an image and saves this structure in the Gorazd XML format, which was developed within the project (XSD schema of the Gorazd XML format: http://gorazd.org/sites/default/files/software/gXML.zip).

Gorazd Generator accepts two kinds of input: files in the ALTO XML format (https://www.loc.gov/standards/alto), produced by optical character recognition of a printed dictionary in ABBYY Recognition Server (https://www.abbyy.com/cs-cz/recognition-server), or regular HTML files.
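To give an idea of what ALTO XML input looks like to a program, the following minimal Python 3 sketch collects the text of each line from an ALTO document. It is not taken from the Gorazd Generator source code: the sample document, the function names local_name and alto_lines, and the choice to ignore namespaces are all assumptions made for illustration. It relies only on the ALTO convention that each recognized word is a String element with a CONTENT attribute inside a TextLine element.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO-like sample (hypothetical content, not project data).
SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="Page1">
      <PrintSpace>
        <TextBlock ID="Block1">
          <TextLine>
            <String CONTENT="first"/>
            <SP/>
            <String CONTENT="line"/>
          </TextLine>
          <TextLine>
            <String CONTENT="second"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine element as a list of strings."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each recognized word is stored in the CONTENT attribute of a String.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))  # ['first line', 'second']
```

Matching on local element names rather than a fixed namespace is a convenience in this sketch, because the namespace URI differs between ALTO versions.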

The recognition of the structure of entries is an original innovation of the software part of the project. The structure of entries is described by a formal grammar written in the grammar language of the ANTLR library, which generates a parser from this grammar. The structure of entries is thus described in a language whose concepts are familiar to both programmers and linguists.

For each recognized entry, the application generates a separate XML file. This file is intended for further processing in Gorazd Editor and for display or pre-print preparation by Gorazd Export. In addition to the recognized entries, the application produces a file in the MARC XML format, which is used to import the entries into Invenio (https://invenio-software.org), the data management system in which the dictionary entries are managed and edited.
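As a rough illustration of the MARC XML output mentioned above, the following Python 3 sketch builds a small MARC XML collection with the standard library. It is only a sketch: the function marc_record, the sample values and the mapping of data to fields 245 and 500 are hypothetical, and the fields that Gorazd Generator actually writes are not described here. Only the MARC 21 "slim" namespace and the record/datafield/subfield layout are standard.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize the MARC namespace as the default namespace (no prefix).
ET.register_namespace("", MARC_NS)


def marc_record(fields):
    """Build a MARC XML <record> from (tag, subfield code, value) triples."""
    record = ET.Element("{%s}record" % MARC_NS)
    for tag, code, value in fields:
        datafield = ET.SubElement(
            record,
            "{%s}datafield" % MARC_NS,
            {"tag": tag, "ind1": " ", "ind2": " "},
        )
        subfield = ET.SubElement(
            datafield, "{%s}subfield" % MARC_NS, {"code": code}
        )
        subfield.text = value
    return record


# Hypothetical mapping: a headword in 245$a and a note in 500$a.
record = marc_record([
    ("245", "a", "example headword"),
    ("500", "a", "example note"),
])
collection = ET.Element("{%s}collection" % MARC_NS)
collection.append(record)
print(ET.tostring(collection, encoding="unicode"))
```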

Installation

Gorazd Generator is distributed as a Python package and can be installed in the standard way with the pip utility. It needs several libraries to run properly; any that are not already present in the system are installed automatically. For that reason, it may be necessary to install Gorazd Generator with root rights or into a Python virtual environment (virtualenv).

  • $ pip install [-t target_directory] gorazd_generator-1.zip

Launch

The application is launched from the command line directly on the server where Invenio runs.

  • $ python generator.py
    usage: generator.py [-h] [-g GENERATOR] [-p] [-l LOG_DIR] [-r]
    input_dir output_dir
    generator.py: error: too few arguments

Parameters are passed to the application on the command line in the standard way; the parameter -h displays help.

  • $ python generator.py -h

    usage: generator.py [-h] [-g GENERATOR] [-p] [-l LOG_DIR] [-r]
    input_dir output_dir

    SJSGenerator of vocabulary records from plain text / ALTO XML.

    positional arguments:
      input_dir             Directory containing ALTO XML files for input.
      output_dir            Directory containing generated files.

    optional arguments:
      -h, --help            show this help message and exit
      -g GENERATOR, --generator GENERATOR
                            Type of generator: SJS, RSI, SNSP
      -p, --preprocess      If preprocessing is turned on we fix slavonic and
                            greek characters and join words that were split at
                            line breaks.
      -l LOG_DIR, --log_dir LOG_DIR
                            Directory for log files. Default is the current
                            working directory.
      -r, --run_bibupload   Run bibupload.

The parameters input_dir and output_dir are required. Subfolders of the folder gorazd-generator/in are usually used as input_dir. We recommend creating a new subfolder in gorazd-generator/out/ for each output_dir. If the given subfolder of “out” does not exist, the application creates it; if it exists, the data in it are overwritten.

The type of processed data is chosen by the parameter -g. The default is SJS.
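The command-line interface described above can be approximated with the standard argparse module. The following sketch is a reconstruction from the help text, not the authors' source code: the function name build_parser is an assumption, and so is the default value of the log directory (the text of this page gives “log”, while the help text mentions the current working directory).

```python
import argparse


def build_parser():
    """Recreate the documented command-line interface (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="generator.py",
        description="SJSGenerator of vocabulary records from plain text / ALTO XML.",
    )
    parser.add_argument(
        "input_dir", help="Directory containing ALTO XML files for input."
    )
    parser.add_argument(
        "output_dir", help="Directory containing generated files."
    )
    parser.add_argument(
        "-g", "--generator", default="SJS",
        help="Type of generator: SJS, RSI, SNSP",
    )
    parser.add_argument(
        "-p", "--preprocess", action="store_true",
        help="Preprocessing switch, as listed in the help text.",
    )
    # Assumption: "log" follows the prose; the help text names the
    # current working directory as the default instead.
    parser.add_argument(
        "-l", "--log_dir", default="log", help="Directory for log files."
    )
    parser.add_argument(
        "-r", "--run_bibupload", action="store_true", help="Run bibupload."
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-g", "SJS", "in/sjs_ii_1-200", "out/sjs_ii_1-200", "-l", "log/sjs_ii"]
)
print(args.generator, args.input_dir, args.output_dir, args.log_dir)
```

Calling parse_args with only one positional argument, or with none, makes argparse print the usage line and exit with an error, which corresponds to the behaviour of launching the application without parameters.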

In the folder input_dir, the application expects ALTO XML files output by ABBYY Recognition Server, or HTML files. The names of these files form the first part of the PAGE ID of entries, so the file names must be in the correct format.

If the folder input_dir also contains text files with the same names as the ALTO XML files (differing only in the extension txt), the generator uses them as a source for correcting Old Greek diacritics. These files can be obtained from the application ABBYY FineReader.

Any other files cause errors that are written to the log files, and the program will probably terminate early with an unidentified error. Errors are always written into the folder log, even if it is not specified in the launch command.
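Because unexpected files can end a run early, it may help to check the contents of input_dir before launching. The following sketch sorts file names into input files, companion text files and everything else. It is not part of Gorazd Generator: the function classify_input_files and the sample file names are hypothetical, and the required file name format itself is not checked, since it is not specified here.

```python
import os


def classify_input_files(file_names):
    """Sort file names into sources, companion text files and unexpected files."""
    sources, companions, unexpected = {}, {}, []
    for name in sorted(file_names):
        stem, extension = os.path.splitext(name)
        extension = extension.lower()
        if extension in (".xml", ".html"):
            sources[stem] = name
        elif extension == ".txt":
            companions[stem] = name
        else:
            unexpected.append(name)
    # A text file is only useful next to an XML file with the same name.
    for stem in sorted(companions):
        if not sources.get(stem, "").lower().endswith(".xml"):
            unexpected.append(companions.pop(stem))
    return sources, companions, unexpected


# Hypothetical directory listing; in practice use os.listdir(input_dir).
names = ["sjs_0001.xml", "sjs_0001.txt", "sjs_0002.xml", "notes.doc", "orphan.txt"]
sources, companions, unexpected = classify_input_files(names)
print(sorted(sources))     # ['sjs_0001', 'sjs_0002']
print(sorted(companions))  # ['sjs_0001']
print(unexpected)          # ['notes.doc', 'orphan.txt']
```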

While the program is running, information about its progress is written to the screen and also saved into a log file (see below for its configuration). When the run finishes, the application prints basic statistics of the results (the number of entries, the number of translations found, etc.).

  • 2016-12-01 21:57:09,146 - INFO - -------------- STATISTIKY --------------
    2016-12-01 21:57:09,536 - INFO - Pocet hesel: 932
    2016-12-01 21:57:09,740 - INFO - Pocet nerozpoznanych zahlavi: 1 + 131
    2016-12-01 21:57:09,775 - INFO - Pocet elementu vyskyt: 765
    2016-12-01 21:57:09,807 - INFO - Pocet prekladu do modernich jazyku: 561
    2016-12-01 21:57:09,838 - INFO - Pocet prekladu do starych jazyku: 548

    2016-12-01 21:57:09,146 - INFO - -------------- STATISTICS --------------
    2016-12-01 21:57:09,536 - INFO - Number of entries: 932
    2016-12-01 21:57:09,740 - INFO - Number of unrecognized headwords: 1 + 131
    2016-12-01 21:57:09,775 - INFO - Number of occurrence elements: 765
    2016-12-01 21:57:09,807 - INFO - Number of translations into modern languages: 561
    2016-12-01 21:57:09,838 - INFO - Number of translations into ancient languages: 548
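If the statistics need to be collected from several runs, the log lines can be read back with a regular expression. The sketch below assumes that the log file uses exactly the "timestamp - LEVEL - label: value" layout of the sample above; the names LOG_LINE and parse_statistics are hypothetical and not part of the application.

```python
import re

# timestamp - LEVEL - label: value
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>[A-Z]+) - (?P<label>[^:]+): (?P<value>.+)$"
)

SAMPLE_LOG = """\
2016-12-01 21:57:09,146 - INFO - -------------- STATISTIKY --------------
2016-12-01 21:57:09,536 - INFO - Pocet hesel: 932
2016-12-01 21:57:09,740 - INFO - Pocet nerozpoznanych zahlavi: 1 + 131
2016-12-01 21:57:09,775 - INFO - Pocet elementu vyskyt: 765
2016-12-01 21:57:09,807 - INFO - Pocet prekladu do modernich jazyku: 561
2016-12-01 21:57:09,838 - INFO - Pocet prekladu do starych jazyku: 548
"""


def parse_statistics(log_text):
    """Return a dict mapping each statistic label to its value as a string."""
    stats = {}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:  # the header line has no "label: value" part and is skipped
            stats[match.group("label")] = match.group("value")
    return stats


stats = parse_statistics(SAMPLE_LOG)
print(stats["Pocet hesel"])  # 932
```

Values are kept as strings because some of them, such as "1 + 131", are not plain numbers.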

The generator saves the output into the folder output_dir. The most important file is final-marc.xml, which is created only when the run completes successfully. The file is uploaded into Invenio by running

  • $ sudo -u www-data /opt/invenio/bin/bibupload -i output_dir/final-marc.xml
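Since final-marc.xml exists only after a successful run, a wrapper script can check for the file before starting the upload. The following sketch wraps the command shown above; the function names bibupload_command and upload_if_ready are hypothetical, and the sketch assumes that sudo and the Invenio installation path are set up as in that command.

```python
import os
import subprocess


def bibupload_command(output_dir):
    """Build the upload command shown above for a given output directory."""
    marc_file = os.path.join(output_dir, "final-marc.xml")
    return [
        "sudo", "-u", "www-data",
        "/opt/invenio/bin/bibupload", "-i", marc_file,
    ]


def upload_if_ready(output_dir):
    """Run bibupload only if final-marc.xml exists; return True if it ran."""
    command = bibupload_command(output_dir)
    if not os.path.isfile(command[-1]):
        return False
    # check_call raises CalledProcessError if bibupload exits with an error.
    subprocess.check_call(command)
    return True


print(" ".join(bibupload_command("out/sjs_ii_1-200")))
```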

All generated entries can be found in the file result-parsed-postprocessed.xml. Files with “marc” in their names contain metadata generated from these entries for Invenio.

Another useful file is errors.xml, which contains copies of the entries whose structure the generator did not recognize.

These entries are also included in final-marc.xml, so they can be uploaded into Invenio as well. The file errors.xml serves only for checking the grammar.

We recommend keeping the files final-marc.xml, errors.xml and result-parsed-postprocessed.xml for archiving.
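One possible way to archive these three files is to copy them out of output_dir before the folder is reused, because existing data in an output folder are overwritten by the next run. The sketch below is only a suggestion: the function archive_results and the archive location are hypothetical and not part of Gorazd Generator.

```python
import os
import shutil

ARCHIVE_FILES = (
    "final-marc.xml",
    "errors.xml",
    "result-parsed-postprocessed.xml",
)


def archive_results(output_dir, archive_dir):
    """Copy the recommended files; return the names that were copied."""
    if not os.path.isdir(archive_dir):
        os.makedirs(archive_dir)
    copied = []
    for name in ARCHIVE_FILES:
        source = os.path.join(output_dir, name)
        # Files missing after an unsuccessful run are skipped silently.
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(archive_dir, name))
            copied.append(name)
    return copied
```

For example, archive_results("out/sjs_ii_1-200", "archive/sjs_ii_1-200") would copy whichever of the three files exist into the (hypothetical) folder archive/sjs_ii_1-200.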

The parameter -p can be used to skip the preprocessing phase, during which characters are corrected and Old Greek is checked. Skipping it can speed up testing while debugging a grammar; it has no use during normal operation of the application.

The parameter -l LOG_DIR sets the folder where log records of the generating process are saved. The default is “log”.

The parameter -r automatically starts the upload into Invenio after all entries have been generated.

An example of launching:

  • $ python generator.py -g SJS in/sjs_ii_1-200 out/sjs_ii_1-200 -l log/sjs_ii

    $ python generator.py in/3strany out/ -l log/ -g SJS

Licence

The application is distributed under the open licence GNU GPL v3. It can be adapted to generate entries of other dictionaries by altering the source code. The authors would be very grateful if you informed them about any use of the source code of this application, or of its parts, in other projects. You can contact them by e-mail: gorazd@slu.cas.cz.

Installation package:

Gorazd Generator 1.0: http://gorazd.org/sites/default/files/software/gorazd_generator-1.zip

Authors:

  • Mgr. Vít Tuček, Ph.D. (programmer)
  • Mgr. Olga Čiperová (development analyst)
  • Bc. Martin Majer (development manager)
  • PhDr. Štefan Pilát, Ph.D. (expert development consultant)

© 2018, Institute of Slavonic Studies of the Czech Academy of Sciences, v. v. i.