Leksa for developers

This page documents how to make Leksa work for a new language. The goal is to make a demo with at least 48 words. The same lexicon will also be used for Morfa-C. Here you can see how the North Saami Leksa works. Choose the instruction language in the right margin before you choose Leksa.

Lemmas

You need lemmas for the lexicon from two semantic categories: HUMAN and FOOD/DRINK. Each lemma is given one or more semantic classes, which makes it easier to combine the lemmas into semantic categories. Some of the classes can also be used in Morfa-C:

  • the verb for "to live (in a place)", semantic class "HUMAN_V"
  • the verb "to eat", semantic classes "FOODDRINK_V" and "HUMAN_V"
  • the verb "to drink", semantic classes "FOODDRINK_V" and "HUMAN_V"
  • at least 5 verbs for making food (e.g. fry, cook, bake), semantic class "FOODDRINK_V"
  • at least 5 nouns for beverages (e.g. water, milk, beer), semantic class "DRINK"
  • 5 nouns for food you will have for dinner (e.g. porridge, soup, steak), semantic class "FOOD_DISH"
  • 5 nouns for food you will buy in a shop (e.g. flour, sugar, fruit), semantic class "FOOD_GROCERY"
  • 5 nouns for places where you can live (e.g. city, town, village), semantic class "PLACE"
  • at least 5 nouns for family members, semantic class "FAMILY"
  • at least 5 nouns for other human beings (e.g. boy, woman, teacher), semantic class "PEOPLE"
  • at least 5 adjectives about human beings (e.g. strong, old, young, clever), semantic class "HUMAN_A"
  • at least 5 adjectives about food (e.g. warm, cold, sour, sweet), semantic class "FOOD_A"

In addition, give the semantic class "mACTIVITY" to every verb that can be an answer to the question "What did the girl do yesterday?" (e.g. bake). The initial m indicates that this is a morfaset, which lies outside the system of semantic categories.

Give every entry a book or level name, e.g. K1 for half of them and K2 for the other half.
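The requirements above can be checked mechanically. The following is a minimal sketch in Python, not part of the Leksa tools: the function name shortfalls and the in-memory form of the entries (a list of lemma and semantic class pairs) are assumptions made for illustration only.

```python
from collections import Counter

# Minimum lemma count per semantic class, read off the list above.
# FOODDRINK_V: "eat" + "drink" + 5 cooking verbs; HUMAN_V: "live", "eat", "drink".
MINIMUM_PER_CLASS = {
    "HUMAN_V": 3, "FOODDRINK_V": 7,
    "DRINK": 5, "FOOD_DISH": 5, "FOOD_GROCERY": 5,
    "PLACE": 5, "FAMILY": 5, "PEOPLE": 5,
    "HUMAN_A": 5, "FOOD_A": 5,
}


def shortfalls(entries):
    """Return {semantic class: number of lemmas still missing}.

    `entries` is a list of (lemma, list of semantic classes) pairs,
    a hypothetical in-memory form of the lexicon.
    """
    counts = Counter(cls for _lemma, classes in entries for cls in set(classes))
    return {
        cls: needed - counts[cls]
        for cls, needed in MINIMUM_PER_CLASS.items()
        if counts[cls] < needed
    }


# The class minimums add up to 50, but "eat" and "drink" are counted in
# both verb classes, so the list above asks for 50 - 2 = 48 lemmas.
entries = [("mielki", ["DRINK", "FOOD_GROCERY"])]
missing = shortfalls(entries)
```

With only mielki in the lexicon, missing reports that every class is still short, which makes it easy to see what remains to be collected.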

XML-format

Put the lemmas into XML format, with one file for each part of speech. Name the files with the language codes, source language first, as we have done for North Saami with translation to Norwegian: V_smenob.xml, N_smenob.xml, A_smenob.xml. The "best" translation, the one we want as the key answer, gets the attribute stat="pref".

<?xml version="1.0" encoding="utf-8"?>
<r xml:lang="sme">
  <e>
    <lg>
      <l pos="n">mielki</l>
      <sources>
        <book name="K1"/>
      </sources>
    </lg>
    <mg>
      <semantics>
        <sem class="DRINK"/>
        <sem class="FOOD_GROCERY"/>
      </semantics>
      <tg xml:lang="nob">
        <t stat="pref" pos="n">melk</t>
        <t pos="n">mjølk</t>
      </tg>
    </mg>
  </e>
</r>

One way of making the lexicon is to start with a csv file that uses a double underscore as delimiter:

mielki __ N __ melk, mjølk __ DRINK, FOOD_GROCERY

and use the script main/ped/script/uusv2oahpa_xml.xsl

If one word has several meanings that belong to different semantic classes, give each meaning its own line:

beaivi __ N __ day __ TIME
beaivi __ N __ sun __ NATURE
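To make the mapping from a csv line to an XML entry concrete, here is a minimal sketch in Python. It is not the uusv2oahpa_xml.xsl script and may differ from what that script produces. The function name line_to_entry is hypothetical, and the sketch assumes that the first translation in the list is the preferred one and that the part of speech is written in lower case in the XML. The csv line carries no book name, so no sources element is created.

```python
import xml.etree.ElementTree as ET

# ElementTree stores the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def line_to_entry(line, target_lang="nob"):
    """Build one <e> element from 'lemma __ POS __ translations __ classes'."""
    lemma, pos, translations, classes = [f.strip() for f in line.split("__")]
    pos = pos.lower()

    e = ET.Element("e")
    lg = ET.SubElement(e, "lg")
    l = ET.SubElement(lg, "l", pos=pos)
    l.text = lemma

    mg = ET.SubElement(e, "mg")
    semantics = ET.SubElement(mg, "semantics")
    for cls in classes.split(","):
        ET.SubElement(semantics, "sem", {"class": cls.strip()})

    tg = ET.SubElement(mg, "tg", {XML_LANG: target_lang})
    for i, word in enumerate(translations.split(",")):
        # Assumption: the first translation listed is the key answer.
        attrs = {"stat": "pref", "pos": pos} if i == 0 else {"pos": pos}
        t = ET.SubElement(tg, "t", attrs)
        t.text = word.strip()
    return e


entry = line_to_entry("mielki __ N __ melk, mjølk __ DRINK, FOOD_GROCERY")
xml_text = ET.tostring(entry, encoding="unicode")
```

Because each meaning of a word has its own csv line, the two beaivi lines would simply yield two separate entries.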

Lexicon files in the other direction

The translations with the attribute stat="pref" become the lemmas of the lexicon for the opposite language direction. The file is named e.g. N_tlangslang.xml. Synonyms may appear in the translation field here as well. You only need the semantic classes of the Leksa categories (HUMAN and FOOD/DRINK), but it does no harm if you keep all the semantic classes from the original files.

<?xml version="1.0" encoding="utf-8"?>
<r xml:lang="nob">
  <e>
    <lg>
      <l pos="n">melk</l>
      <sources>
        <book name="K1"/>
      </sources>
    </lg>
    <mg>
      <semantics>
        <sem class="DRINK"/>
      </semantics>
      <tg xml:lang="sme">
        <t stat="pref" pos="n">mielki</t>
      </tg>
    </mg>
  </e>
</r>

There are scripts for this in main/ped/script/, documented in 00_README.txt.
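The idea of reversing a lexicon can be sketched in a few lines of Python. This is only an illustration of the principle and is not one of the scripts in main/ped/script/. The function name reverse_lexicon is hypothetical, and the sketch keeps all semantic classes and the book name of the original entry, which is an assumption about what you want in the reversed file.

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<r xml:lang="sme">
  <e>
    <lg><l pos="n">mielki</l><sources><book name="K1"/></sources></lg>
    <mg>
      <semantics><sem class="DRINK"/><sem class="FOOD_GROCERY"/></semantics>
      <tg xml:lang="nob">
        <t stat="pref" pos="n">melk</t>
        <t pos="n">mjølk</t>
      </tg>
    </mg>
  </e>
</r>"""


def reverse_lexicon(root):
    """Turn every stat="pref" translation into a lemma of a new lexicon."""
    source_lang = root.get(XML_LANG)
    target_lang = root.find("e/mg/tg").get(XML_LANG)
    reversed_root = ET.Element("r", {XML_LANG: target_lang})
    for e in root.findall("e"):
        lemma = e.find("lg/l")
        sources = e.find("lg/sources")
        for mg in e.findall("mg"):
            for t in mg.findall("tg/t"):
                if t.get("stat") != "pref":
                    continue  # synonyms do not become lemmas
                new_e = ET.SubElement(reversed_root, "e")
                new_lg = ET.SubElement(new_e, "lg")
                new_l = ET.SubElement(new_lg, "l", pos=t.get("pos", "n"))
                new_l.text = t.text
                if sources is not None:
                    new_lg.append(copy.deepcopy(sources))
                new_mg = ET.SubElement(new_e, "mg")
                semantics = mg.find("semantics")
                if semantics is not None:
                    new_mg.append(copy.deepcopy(semantics))
                new_tg = ET.SubElement(new_mg, "tg", {XML_LANG: source_lang})
                new_t = ET.SubElement(new_tg, "t", stat="pref",
                                      pos=lemma.get("pos", "n"))
                new_t.text = lemma.text
    return reversed_root


reversed_root = reverse_lexicon(ET.fromstring(SAMPLE))
```

Only melk becomes a lemma here; the synonym mjølk is skipped because it lacks stat="pref".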

Handling of semantic sets in Leksa

We group semantic classes into semantic categories in the file semanticsets.xml:

<?xml version="1.0" encoding="utf-8"?>
<lexicon>
  <subclasses class="HUMAN">
    <sem class="PEOPLE"/>
    <sem class="FAMILY"/>
    <sem class="HUMAN_A"/>
    <sem class="HUMAN_V"/>
    <sem class="PLACE"/>
  </subclasses>
  <subclasses class="FOOD/DRINK">
    <sem class="DRINK"/>
    <sem class="FOODDRINK_V"/>
    <sem class="FOOD_A"/>
    <sem class="FOOD_DISH"/>
    <sem class="FOOD_GROCERY"/>
  </subclasses>
</lexicon>
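The relation between semantic classes and semantic categories can be illustrated with a short Python sketch. It is not how Leksa itself reads the file; the function names load_categories and categories_for are hypothetical, and the file content is embedded as a string so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

SEMANTICSETS = """<lexicon>
  <subclasses class="HUMAN">
    <sem class="PEOPLE"/>
    <sem class="FAMILY"/>
    <sem class="HUMAN_A"/>
    <sem class="HUMAN_V"/>
    <sem class="PLACE"/>
  </subclasses>
  <subclasses class="FOOD/DRINK">
    <sem class="DRINK"/>
    <sem class="FOODDRINK_V"/>
    <sem class="FOOD_A"/>
    <sem class="FOOD_DISH"/>
    <sem class="FOOD_GROCERY"/>
  </subclasses>
</lexicon>"""


def load_categories(xml_text):
    """Map each semantic category to the set of semantic classes in it."""
    root = ET.fromstring(xml_text)
    return {
        sub.get("class"): {sem.get("class") for sem in sub.findall("sem")}
        for sub in root.findall("subclasses")
    }


def categories_for(entry_classes, categories):
    """Return the categories in which an entry with these classes appears."""
    wanted = set(entry_classes)
    return sorted(name for name, members in categories.items() if members & wanted)


categories = load_categories(SEMANTICSETS)
milk = categories_for(["DRINK", "FOOD_GROCERY"], categories)
eat = categories_for(["FOODDRINK_V", "HUMAN_V"], categories)
activity_only = categories_for(["mACTIVITY"], categories)
```

In this sketch a verb such as "to eat", which has both FOODDRINK_V and HUMAN_V, ends up in both categories, while a class like mACTIVITY belongs to no category.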

Your Leksa demo

You have now made the XML files for a Leksa demo with at least 48 words. The demo has these options:

  • language direction
  • two semantic categories containing nouns, verbs and adjectives: HUMAN and FOOD/DRINK. The other semantic classes are for Morfa-C.
  • two book or level choices

Further work

You can add more lemmas to each semantic class, and you can make more semantic classes, e.g. NATURE. Here is a list of the semantic classes used in North Saami and South Saami Leksa, offered as a suggestion, not a limitation: semclasslist

Sometimes the translation will be an explanation, and the word can then be difficult to use in Leksa from one language to the other. You can make an exception for such a word in the xml-file by marking its entry as follows. The word will still be used in Morfa-C.

		<e exclude="leksa">
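As a rough illustration of what the exclude attribute means for a program that reads the lexicon, here is a minimal Python sketch. How Leksa itself handles the attribute is not shown in this documentation, so the function name leksa_entries and the filtering logic are assumptions, and beaivi is marked as excluded only for the sake of the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<r xml:lang="sme">
  <e><lg><l pos="n">mielki</l></lg></e>
  <e exclude="leksa"><lg><l pos="n">beaivi</l></lg></e>
</r>"""


def leksa_entries(root):
    """Keep the entries that are not marked exclude="leksa"."""
    return [e for e in root.findall("e") if e.get("exclude") != "leksa"]


root = ET.fromstring(SAMPLE)
lemmas = [e.find("lg/l").text for e in leksa_entries(root)]
```

The excluded entry stays in the file, so a program that ignores the attribute still sees both words.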