Date Mici: POS Romanian

Romanian POS Tagger Project
At the moment there is no easily accessible tool for POS tagging of Romanian, although both OpenNLP and TreeTagger would allow this.

In theory, TreeTagger can be used to obtain a tagger for Romanian.
Three files are needed:
1. The lexicon

Training
--------

Training is done with the *train-tree-tagger* program. If the program is 
called without arguments, the following output is printed:

USAGE: train-tree-tagger <lexicon> <open class file> <infile> <outfile> 
       {-cl <context length>} {-dtg <min. decision tree gain>}
       {-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}

Description of the command line arguments:
* <lexicon>: name of a file which contains the fullform lexicon. Each line 
  of the lexicon corresponds to one word form and contains the word form 
  itself followed by a Tab character and a sequence of tag-lemma pairs.
  The tags and lemmata are separated by whitespace.

Example:
aback RB aback
abacuses NNS abacus
abandon VB abandon VBP abandon
abandoned JJ abandoned VBD abandon VBN abandon
abandoning VBG abandon

  Important: Ordinal and cardinal numbers which consist of digits
  should not be included in the lexicon. Otherwise, the tagger will
  not be able to learn how to tag numbers which are not listed in the
  lexicon. Numbers with unusual tags should be added to the lexicon,
  however.

  Remark: The tagger doesn't need the lemmata for tagging. If
  you do not have the lemma information or if you do not plan to
  annotate corpora with lemmas, you can replace the lemma with a dummy
  value, e.g. "-".

* <open class file>: name of a file which contains a list of open class tags
  i.e. possible tags of unknown word forms. This information is needed to
  estimate likely tags of unknown words. This file would typically contain
  adverb, adjective, noun, proper name and perhaps verb tags, but not
  prepositions, determiners, pronouns or numbers.
* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains 
  one token and one tag in that order separated by a tabulator. 
  Punctuation marks are considered as tokens and must have been tagged as well.

Since, for various reasons, an annotated corpus could not be used, one option is to use the annotated novel 1984.

Example:
Pierre NP
Vinken NP
, ,
61 CD
years NNS

* <output file>: name of the file in which the resulting tagger parameters are stored.
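As a rough illustration of the usage line quoted above, the sketch below assembles a train-tree-tagger command from Python. The file names (ro-lexicon.txt, ro-openclass.txt, ro-train.txt, romanian.par) and the sentence tag SENT are hypothetical placeholders, not files or values from this project; only the -cl and -st options shown in the usage text are used.

```python
import shlex


def build_train_command(lexicon, open_class_file, infile, outfile,
                        context_length=None, sentence_tag=None):
    """Assemble the argument list for train-tree-tagger.

    The positional order follows the quoted usage line:
    <lexicon> <open class file> <infile> <outfile>.
    """
    cmd = ["train-tree-tagger", lexicon, open_class_file, infile, outfile]
    if context_length is not None:
        cmd += ["-cl", str(context_length)]   # -cl <context length>
    if sentence_tag is not None:
        cmd += ["-st", sentence_tag]          # -st <sent. tag>
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_train_command("ro-lexicon.txt", "ro-openclass.txt",
                          "ro-train.txt", "romanian.par",
                          context_length=2, sentence_tag="SENT")
print(shlex.join(cmd))

# To actually run the training (requires TreeTagger on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a file path contains spaces.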

This assumes that an already tagged corpus is available.
Several resources can be used for this:
1.1 Lexicon
Free
https://www.clarin.si/repository/xmlui/handle/11356/1041

MULTEXT-East free lexicons 4.0
Credit:
Erjavec, Tomaž; Bruda, Stefan; Derzhanski, Ivan; Dimitrova, Ludmila; Garabík, Radovan; Holozan, Peter; Ide, Nancy; Kaalep, Heiki-Jaan; Kotsyba, Natalia; Oravecz, Csaba; Petkevič, Vladimír; Priest-Dorman, Greg; Shevchenko, Igor; Simov, Kiril; Sinapova, Lydia; Steenwijk, Han; Tihanyi, Laszlo; Tufiş, Dan and Véronis, Jean, 2010, MULTEXT-East free lexicons 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1041.

Format of the free lexicon files:

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications.

Example:

ţurţuri ţurţur Ncmp-n
ţurţurii ţurţur Ncmpry
ţurţurilor ţurţur Ncmpoy
ţurţurul ţurţur Ncmsry
ţurţurului ţurţur Ncmsoy
ţuşti ţuşti I
ţânc ţânc Ncms-n
ţânci ţânc Ncmp-n
ţâncii ţânc Ncmpry
ţâncilor ţânc Ncmpoy
ţâncul ţânc Ncmsry
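
The two lexicon formats differ: MULTEXT-East has one (word form, lemma, MSD) triple per line, while TreeTagger expects one line per word form, followed by a Tab and all its tag-lemma pairs. A minimal conversion sketch is shown below. It assumes the full MSD is used directly as the TreeTagger tag (a reduced tagset may be preferable in practice) and that lemmas contain no whitespace. The function name is made up for this example.

```python
def multext_to_treetagger(lines):
    """Convert MULTEXT-East lexicon lines (form TAB lemma TAB MSD)
    into TreeTagger lexicon lines (form TAB tag lemma [tag lemma ...])."""
    entries = {}  # word form -> list of (tag, lemma) pairs, in first-seen order
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip malformed lines
        form, lemma, msd = fields
        if form.isdigit():
            # Numbers written with digits should not be listed in the lexicon.
            continue
        pairs = entries.setdefault(form, [])
        if (msd, lemma) not in pairs:
            pairs.append((msd, lemma))
    return [
        form + "\t" + " ".join(f"{tag} {lemma}" for tag, lemma in pairs)
        for form, pairs in entries.items()
    ]


# Sample entries taken from the excerpt above.
sample = [
    "ţânc\tţânc\tNcms-n",
    "ţânci\tţânc\tNcmp-n",
    "ţuşti\tţuşti\tI",
]
for entry in multext_to_treetagger(sample):
    print(entry)
```

Word forms with several readings end up on a single line, as in the "abandoned JJ abandoned VBD abandon VBN abandon" example from the TreeTagger documentation.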

The authors of the Romanian file are:
Romanian:
S.Bruda, C.Diaconu, L.Diaconu, and D.Tufis
Center for Research in Machine Learning,
Natural Language Processing and Conceptual Modelling
Romanian Academy of Sciences

1.2 Annotated document: 1984 by George Orwell
https://www.clarin.si/repository/xmlui/browse?value=Romanian&type=language
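
Once token/tag pairs have been extracted from an annotated text such as this one, they need to be written in the one-word-per-line format that train-tree-tagger expects (token, Tab, tag). The sketch below covers only that last step. Extracting the pairs depends on the corpus file format and is not shown, and the function name is made up for this example.

```python
def to_one_word_per_line(sentences):
    """Serialize tagged sentences as one 'token<TAB>tag' line per token.

    `sentences` is a list of sentences, each a list of (token, tag) pairs.
    Punctuation marks are ordinary tokens and must carry a tag as well.
    """
    lines = []
    for sentence in sentences:
        for token, tag in sentence:
            if not token or not tag:
                raise ValueError("every token needs a non-empty tag")
            if "\t" in token or "\t" in tag:
                raise ValueError("tokens and tags must not contain tabs")
            lines.append(f"{token}\t{tag}")
    return "".join(line + "\n" for line in lines)


# The example sentence fragment from the TreeTagger documentation.
tagged = [[("Pierre", "NP"), ("Vinken", "NP"), (",", ","),
           ("61", "CD"), ("years", "NNS")]]
print(to_one_word_per_line(tagged), end="")
```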

2. TreeTagger http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Once a tagger exists (either via OpenNLP Maxent or TreeTagger), the usual analysis tools can be used.

TreeTagger is free for non-commercial use.
In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!

3. koRpus https://cran.r-project.org/web/packages/koRpus/index.html
More details here: http://reaktanz.de/?c=hacking&s=koRpus
There are also commercial options for the lexicon.

4. Romanian Treebank http://universal.elra.info/search.php
A tutorial for training a TreeBank
4.1 Examples:
Example of a Russian corpus using MULTEXT-East:
Designing and Evaluating a Russian Tagset https://msuweb.montclair.edu/~feldmana/publications/2008-lrec-mocky.pdf
4.2 Academic and commercial efforts:

RACAI-RoTb: a core Romanian corpus syntactically annotated with dependency relations
http://rochi.utcluj.ro/rrioc/articole/RRIOC-8-2-Irimia.pdf

A dependency parser for Romanian built on parsers for other Romance languages
http://rochi.utcluj.ro/rrioc/articole/RRIOC-7-1-Florea.pdf

A Windows interface for TreeBank
5. Train the Romanian TreeBank http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm
5.1 Trained corpus
5.2

1. Lexicon. The file is useful.
2. Corpus
3. Parsing file. http://www.racai.ro/external/static/awde/tufiscor11.html
For the parsing file to be suitable, it must be adapted to the description used in the lexicon. For Romanian, the description is here:
http://nl.ijs.si/ME/Vault/V3/msd/html/msd.html#SECTION05200000000000000000
The file with the MSD description is available here:

MULTEXT-East
                    Morphosyntactic Specifications
                              Version 4

http://nl.ijs.si/ME/V4/msd/tables/

Table for Romanian:
http://nl.ijs.si/ME/V4/msd/tables/msd-human-ro.tbl
In XML format:
http://nl.ijs.si/ME/V4/msd/tables/msd-canon-ro.tbl

Full documentation:
http://nl.ijs.si/ME/V4/msd/html/msd-ro.html
Ontology:
http://nl.ijs.si/ME/owl/msd-ro.owl
Alternatives:
https://www.sketchengine.co.uk/romanian-tagset/
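
Because the first letter of a MULTEXT-East MSD encodes the part-of-speech category (N noun, V verb, A adjective, R adverb, and so on), the list of open class tags needed by train-tree-tagger can be derived from the tags that occur in the lexicon. The sketch below assumes a lexicon already in TreeTagger format with MSDs as tags. The choice of noun, adjective, adverb and verb as open classes is an assumption that should be checked against the data, and the sample entries are illustrative.

```python
# First letters of MSDs treated as open classes here (an assumption):
# N = noun (including proper nouns), A = adjective, R = adverb, V = verb.
OPEN_CLASS_PREFIXES = ("N", "A", "R", "V")


def open_class_tags(treetagger_lexicon_lines):
    """Collect the open class tags found in a TreeTagger-format lexicon.

    Each input line is: form TAB tag lemma [tag lemma ...].
    """
    tags = set()
    for line in treetagger_lexicon_lines:
        _form, _tab, rest = line.rstrip("\n").partition("\t")
        fields = rest.split()
        for tag in fields[0::2]:  # tags sit at even positions, lemmas at odd ones
            if tag.startswith(OPEN_CLASS_PREFIXES):
                tags.add(tag)
    return sorted(tags)


# Illustrative entries only.
lexicon = ["ţânc\tNcms-n ţânc", "ţuşti\tI ţuşti", "pe\tSpsa pe"]
print("\n".join(open_class_tags(lexicon)))
```

The resulting list can then be saved as the open class file. The exact layout expected by TreeTagger should be checked against its documentation.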

Result:
TreeTagger has a parameter file for Romanian.


Wednesday, March 16, 2016
