This Web service annotates Slovene text with JOS morphosyntactic descriptions and lemmas and returns the result. You can type or paste the input into the text window, or upload a plain text file, which should be encoded in UTF-8.
The output file is in a format appropriate for CWB or SketchEngine, i.e. a tabular file, where each line is either an XML tag (<div>, <p>, <s> or </div>, </p>, </s>) or an annotated token. Each line with a token has three fields, separated by the TAB character. The first is the token as it appears in the text, the second the JOS morphosyntactic description (MSD), and the third the word lemma. For punctuation, the MSD and lemma fields are identical to the token.
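As a rough illustration of the format described above, the following Python sketch reads such a tabular file line by line. The names `Token` and `parse_vertical` are hypothetical and not part of the service, and the sample input is made up for the example. The sketch assumes that any line without exactly three TAB-separated fields is a structural tag.

```python
from typing import Iterable, Iterator, NamedTuple, Union


class Token(NamedTuple):
    form: str   # token as it appears in the text
    msd: str    # JOS morphosyntactic description
    lemma: str  # word lemma


def parse_vertical(lines: Iterable[str]) -> Iterator[Union[str, Token]]:
    """Yield structural tags as plain strings and annotated tokens as Token tuples."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) == 3:
            yield Token(*fields)
        else:
            # Assumption: anything else is a tag such as <s> or </p>.
            yield line


# Made-up sample in the described layout (not real service output).
sample = "<s>\nfilm\tSometn\tfilm\n.\t.\t.\n</s>\n"
items = list(parse_vertical(sample.splitlines()))
```

Deciding by field count rather than by a leading `<` means a token that itself starts with `<` is still treated as a token.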
Note that the MSDs are currently available only in the Slovene localisation, e.g. Sometn, meaning samostalnik, vrsta = občno_ime, spol = moški, število = ednina, sklon = tožilnik, živost = ne (in English: noun, type = common, gender = masculine, number = singular, case = accusative, animate = no). The MSDs can be converted into various other formats with the help of the JOS MSD conversion tables.
If you press "show", the output file is returned as a plain text file; to display it properly, the character encoding in the browser should be set to UTF-8. If you press the "download" button, the output is returned compressed in ZIP format. You should use the latter option for large files.
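A downloaded ZIP file can be unpacked with any standard tool. As a minimal sketch in Python, assuming the archive holds a single UTF-8 text file (the name of that file is not documented here, so the code simply takes the first entry), the helper `read_zipped_output` below is a hypothetical name, not part of the service:

```python
import io
import zipfile


def read_zipped_output(data: bytes) -> str:
    """Return the text of the first file in a ZIP archive, decoded as UTF-8."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Assumption: the archive contains exactly one annotated text file.
        first = archive.namelist()[0]
        return archive.read(first).decode("utf-8")
```

For a file saved to disk, the bytes can be obtained with `open(path, "rb").read()` before calling the helper.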
There is no pre-set limit on the size of the text; however, the practical limit (server timeout) is about 1 million words.
Page last updated 2009-07-08,
et