JOS ToTaLe text analyser

This Web service annotates Slovene text with JOS morphosyntactic descriptions and lemmas and returns the result. You can type or paste the input into the text window, or upload a plain text file, which should use UTF-8 for character encoding.

The output file is in a format suitable for CWB or SketchEngine, i.e. a tabular file in which each line is either an XML tag (<div>, <p>, <s> or </div>, </p>, </s>) or an annotated token. Each token line has three fields, separated by the TAB character: the token as it appears in the text, the JOS morphosyntactic description (MSD), and the word lemma. For punctuation, the MSD and lemma fields are identical to the token. Note that the MSDs are currently available only in the Slovene localisation, e.g. Sometn, meaning samostalnik, vrsta = občno_ime, spol = moški, število = ednina, sklon = tožilnik, živost = ne (noun, type = common, gender = masculine, number = singular, case = accusative, animate = no). The MSDs can be converted into various other formats with the help of the JOS MSD conversion tables.
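As an illustration of the format described above, the following Python sketch reads such a tabular file and separates XML tag lines from token lines. The names Token and parse_vertical are hypothetical and not part of the service, and the sample word "stol" is made up for the example; only the MSD Sometn and the punctuation convention come from the description above. The sketch assumes that every token line has exactly three TAB-separated fields and that tag lines contain no TAB.

```python
from typing import Iterable, Iterator, NamedTuple, Union


class Token(NamedTuple):
    """One annotated token line: form, MSD and lemma (hypothetical helper type)."""
    form: str
    msd: str
    lemma: str


def parse_vertical(lines: Iterable[str]) -> Iterator[Union[Token, str]]:
    """Yield a Token for each three-field line and the raw string for tag lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) == 3:
            yield Token(*fields)
        else:
            # Assumption: anything that is not a three-field line is an XML tag
            # such as <s> or </p>.
            yield line


# Illustrative input only; the word "stol" is invented for this example.
sample = "<s>\nstol\tSometn\tstol\n.\t.\t.\n</s>\n"
parsed = list(parse_vertical(sample.splitlines()))
```

Keeping the tag lines in the output stream, rather than discarding them, lets a caller rebuild the sentence and paragraph boundaries if needed.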

If you press "show", the output is returned as a plain text file; to display it properly, the character encoding in the browser should be set to UTF-8. If you press "download", the output is returned compressed in ZIP format. Use the latter option for large files.
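A downloaded result can be unpacked with any ZIP tool. As a minimal Python sketch, the function below reads the text from ZIP data held in memory. The name read_zipped_text is hypothetical, and the sketch assumes that the archive holds a single member encoded in UTF-8; the actual layout of the archive returned by the service is not specified here.

```python
import io
import zipfile


def read_zipped_text(data: bytes) -> str:
    """Return the decoded text of the first member of a ZIP archive.

    Assumption: the archive contains one UTF-8 encoded text file.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        first_member = archive.namelist()[0]
        return archive.read(first_member).decode("utf-8")
```

For a file saved to disk, the bytes can be obtained with open(path, "rb").read() before calling the function.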

There is no pre-set limit on the size of the text; however, the practical limit (server timeout) is about 1 million words.


Type or paste in text:

Plain text file in UTF-8:

Analyse the text and show or download the results.



Note: uploaded files are archived and may be used as a basis for further research.
Report problems to tomaz.erjavec@ijs.si

Page last updated 2009-07-08, et