Numra for developers

This documentation describes how to make Numra work for new languages.

Numeral system

The task is to make transducers for the number system. Our transducers are built with both the Xerox Finite State Transducer (xfst) compiler and the HFST Finite State Transducer compiler. The former is better documented (see reference above) and easier to set up and use, but the latter is open source and also more stable. The current Oahpa setup uses the xfst compilers.

In our current implementation of Numra, we offer running numbers (ordinal and cardinal), clock, and date expressions.

Numra itself could also be run without automata, using just a list of number - word pairs, but we find it easier to make automata.
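As an illustration of the list-based alternative mentioned above, here is a minimal Python sketch. The pairs, the table names and the two function names are hypothetical and only echo the number2text / text2number naming of the transducer files; this is not how Numra is implemented.

```python
# Hypothetical number - word pairs; a real table would need one entry
# for every number the game should cover.
PAIRS = [(5, "five"), (15, "fifteen"), (50, "fifty")]

# Build one lookup table per direction.
NUMBER_TO_WORD = {str(number): word for number, word in PAIRS}
WORD_TO_NUMBER = {word: str(number) for number, word in PAIRS}


def number2text(digits):
    """Return the word for a digit string, or None if it is not listed."""
    return NUMBER_TO_WORD.get(digits)


def text2number(word):
    """Return the digit string for a word, or None if it is not listed."""
    return WORD_TO_NUMBER.get(word)
```

The drawback is visible at once: every number needs its own entry, whereas an automaton can build "fifteen" from smaller parts.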

Running numbers

The task is to match number to word. The order of the parts may be the same as in the digits (as in Mari: вит = 5, латвит = 15), or it may be different (as in English: five = 5, fifteen = 15). You may make the system from scratch, or use one of ours as a starting point.

At Giellatekno, we have number automata for various languages: Kildin, Pite, South, North, Lule, Inari, Skolt Saami, and Finnish, Kven, Komi, Meadow Mari, Hill Mari, Erzya, Moksha, Livonian, Nenets, Norwegian and Russian.

Note that some of them are at an experimental stage and may contain errors.

Pick one, or start from scratch, and make your transducer. Include both ordinals and cardinals. In Oahpa, we stop at 1000, so in this context there is no need to continue beyond that.

Clock

The task is to match a numerically written time expression (11:45) to its wording (quarter to twelve).

At Giellatekno, we have clock automata for various languages: South, North, Lule Saami, Finnish, Kven and Russian.

We use flag diacritics. The reason for this is that the clock is written in the order hour - minute, but spoken in the order minute - hour (seventeen past ten = 10:17). To handle this, we put a mark on the hour number. When the hour word is added later, we check it against the mark. We thus generate twelve versions of the same expression, one for each hour word, and remove the eleven whose hour word does not match the mark. For documentation, see the Flag Diacritics chapter in Beesley and Karttunen's book on finite-state transducers.
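The mark-and-check idea described above can be imitated in plain Python. This is only a sketch of the principle, not the transducer source: the word tables and the function name clock2text are hypothetical, only "past" expressions are covered, and the comments compare each step to the @P.FEATURE.VALUE@ (set) and @R.FEATURE.VALUE@ (require) flag types documented by Beesley and Karttunen.

```python
# Hypothetical word tables for a 12-hour clock.
HOURS = {
    1: "one", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six",
    7: "seven", 8: "eight", 9: "nine", 10: "ten", 11: "eleven", 12: "twelve",
}
MINUTES = {5: "five past", 10: "ten past", 17: "seventeen past"}


def clock2text(clock):
    """Return the surviving wordings for a string such as '10:17'."""
    hour_digits, minute_digits = clock.split(":")

    # Reading the hour first: remember it as a mark (compare a P flag).
    hour_mark = int(hour_digits)

    # The minute phrase is spoken first, so it is produced first.
    minute_phrase = MINUTES[int(minute_digits)]

    # Overgenerate: attach every one of the twelve hour words.
    candidates = [(hour, f"{minute_phrase} {word}") for hour, word in HOURS.items()]

    # Check each hour word against the mark (compare an R flag);
    # the eleven non-matching candidates are discarded.
    return [text for hour, text in candidates if hour == hour_mark]
```

With these tables, clock2text("10:17") keeps only "seventeen past ten". An expression such as "quarter to twelve" for 11:45 would additionally need the mark to point at the following hour, which this sketch leaves out.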

Date

The task is to match a numerically written date expression (18/9) to its wording (September the eighteenth).
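The mapping in the example above can be sketched in Python as follows. The day/month order is taken from the example (18/9 = September the eighteenth); the tables and the function name date2text are hypothetical, the wording is English only, and the sketch does not check how many days each month has.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

UNITS = ["first", "second", "third", "fourth", "fifth",
         "sixth", "seventh", "eighth", "ninth"]
TEENS = ["tenth", "eleventh", "twelfth", "thirteenth", "fourteenth",
         "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth"]

# Ordinals for day 1 to day 31, stored at index day - 1.
ORDINALS = (
    UNITS
    + TEENS
    + ["twentieth"]
    + ["twenty-" + unit for unit in UNITS]
    + ["thirtieth", "thirty-first"]
)


def date2text(date):
    """Turn a 'day/month' string such as '18/9' into words."""
    day_digits, month_digits = date.split("/")
    day, month = int(day_digits), int(month_digits)
    if not (1 <= day <= 31 and 1 <= month <= 12):
        raise ValueError(f"not a valid day/month pair: {date}")
    return f"{MONTHS[month - 1]} the {ORDINALS[day - 1]}"
```

The date transducer reuses the ordinal part of the number system, which is one reason to include ordinals when building the number transducer.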

At Giellatekno, we have date automata for at least these languages: South, North and Inari Saami, Kven and Russian.

Compiling the files

The source files can be made in any text editor. Compile in the language directory (e.g. $GTHOME/langs/fkv for Kven) by running make. The compiled files can be tested with the following commands:

  • lookup src/transcriptor-clock2text.xfst
  • lookup src/transcriptor-date2text.xfst
  • lookup src/transcriptor-number2text.xfst
  • lookup src/transcriptor-text2clock.xfst
  • lookup src/transcriptor-text2date.xfst
  • lookup src/transcriptor-text2number.xfst
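For scripted testing, the commands above can be wrapped in Python. This is a sketch under two assumptions: that the lookup program is on the PATH, and that it reads one input per line from standard input and prints its analyses to standard output. The function names lookup_command and run_lookup are hypothetical.

```python
import subprocess


def lookup_command(fst_path):
    """Build the argument list for one of the lookup commands above."""
    return ["lookup", fst_path]


def run_lookup(fst_path, inputs):
    """Send each input on its own line to lookup and return its raw output.

    Assumes lookup is installed and reads its input from standard input.
    """
    result = subprocess.run(
        lookup_command(fst_path),
        input="\n".join(inputs) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A call such as run_lookup("src/transcriptor-clock2text.xfst", ["11:45"]) would then return whatever lookup prints for that input; the exact output format depends on the lookup version.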