Tools for POS tagging.
http://www.martinschweinberger.de/blog/part-of-speech-tagging-with-r/
Monday, March 28, 2016
Wednesday, March 16, 2016
POS Romanian
Romanian POS Tagger project
At the moment there is no easily accessible POS tagging tool for Romanian, although both OpenNLP and TreeTagger would allow one to be built.
In theory, TreeTagger can be used to obtain a tagger for Romanian.
Three files are needed:
1. The lexicon
Training
--------
Training is done with the *train-tree-tagger* program. If the program is
called without arguments, the following output is printed:
USAGE: train-tree-tagger <lexicon> <open class file> <infile> <outfile>
{-cl <context length>} {-dtg <min. decision tree gain>}
{-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}
Description of the command line arguments:
* <lexicon>: name of a file which contains the fullform lexicon. Each line
of the lexicon corresponds to one word form and contains the word form
itself followed by a Tab character and a sequence of tag-lemma pairs.
The tags and lemmata are separated by whitespace.
Example:
aback RB aback
abacuses NNS abacus
abandon VB abandon VBP abandon
abandoned JJ abandoned VBD abandon VBN abandon
abandoning VBG abandon
Important: Ordinal and cardinal numbers which consist of digits
should not be included in the lexicon. Otherwise, the tagger will
not be able to learn how to tag numbers which are not listed in the
lexicon. Numbers with unusual tags should be added to the lexicon,
however.
Remark: The tagger doesn't need the lemmata for tagging. If
you do not have the lemma information or if you do not plan to
annotate corpora with lemmas, you can replace the lemma with a dummy
value, e.g. "-".
* <open class file>: name of a file which contains a list of open class tags
i.e. possible tags of unknown word forms. This information is needed to
estimate likely tags of unknown words. This file would typically contain
adverb, adjective, noun, proper name and perhaps verb tags, but not
prepositions, determiners, pronouns or numbers.
* <input file>: name of a file which contains tagged training data. The data
must be in one-word-per-line format. This means that each line contains
one token and one tag in that order separated by a tabulator.
Punctuation marks are considered as tokens and must have been tagged as well.
Since, for various reasons, an annotated corpus could not be used, one option is to use the annotated novel 1984.
Example:
Pierre NP
Vinken NP
, ,
61 CD
years NNS
* <output file>: name of the file in which the resulting tagger parameters
are stored.
Training presupposes a corpus that is already tagged.
Several resources are available that can be used for this:
1.1 Lexicon
Free
https://www.clarin.si/repository/xmlui/handle/11356/1041
MULTEXT-East free lexicons 4.0
Credit:
Erjavec, Tomaž; Bruda, Stefan; Derzhanski, Ivan; Dimitrova, Ludmila; Garabík, Radovan; Holozan, Peter; Ide, Nancy; Kaalep, Heiki-Jaan; Kotsyba, Natalia; Oravecz, Csaba; Petkevič, Vladimír; Priest-Dorman, Greg; Shevchenko, Igor; Simov, Kiril; Sinapova, Lydia; Steenwijk, Han; Tihanyi, Laszlo; Tufiş, Dan and Véronis, Jean, 2010, MULTEXT-East free lexicons 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1041.
Format of the free lexicon files:
Example:
ţurţuri ţurţur Ncmp-n
ţurţurii ţurţur Ncmpry
ţurţurilor ţurţur Ncmpoy
ţurţurul ţurţur Ncmsry
ţurţurului ţurţur Ncmsoy
ţuşti ţuşti I
ţânc ţânc Ncms-n
ţânci ţânc Ncmp-n
ţâncii ţânc Ncmpry
ţâncilor ţânc Ncmpoy
ţâncul ţânc Ncmsry
The authors of the Romanian file are:
Romanian:
S.Bruda, C.Diaconu, L.Diaconu, and D.Tufis
Center for Research in Machine Learning,
Natural Language Processing and Conceptual Modelling
Romanian Academy of Sciences
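The lexicon lines shown above put the word form first, then the lemma, then the MSD tag, while the TreeTagger README asks for the word form, a Tab, and then tag-lemma pairs. A minimal sketch of the regrouping is given here, assuming whitespace-separated fields with exactly three columns per line; the function name `convert_lexicon` and the sample entry `1984 1984 Mc` are illustrative and not taken from the MULTEXT-East files. Word forms made only of digits are skipped, following the README's advice.

```python
import re

# Word forms made only of digits (optionally with "." or "," separators).
# The TreeTagger README advises keeping such numbers out of the lexicon.
DIGITS_ONLY = re.compile(r"^\d+(?:[.,]\d+)*$")


def convert_lexicon(lines):
    """Regroup 'form lemma tag' lines into 'form<TAB>tag lemma tag lemma ...' lines."""
    entries = {}  # word form -> list of (tag, lemma) pairs, in first-seen order
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines (assumes three columns)
        form, lemma, tag = fields
        if DIGITS_ONLY.match(form):
            continue  # digit-only numbers are left for the tagger to learn
        pairs = entries.setdefault(form, [])
        if (tag, lemma) not in pairs:
            pairs.append((tag, lemma))  # drop exact duplicates
    return [
        form + "\t" + " ".join(f"{tag} {lemma}" for tag, lemma in pairs)
        for form, pairs in entries.items()
    ]


sample = [
    "ţânc ţânc Ncms-n",
    "ţânci ţânc Ncmp-n",
    "ţânci ţânc Ncmp-n",  # duplicate line, kept only once
    "1984 1984 Mc",       # hypothetical digit-only entry, dropped
]
converted = convert_lexicon(sample)
for entry in converted:
    print(entry)
```

Grouping by word form matters because TreeTagger expects one line per word form, with all of its possible tags on that line.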
1.2 Annotated document: George Orwell's 1984
https://www.clarin.si/repository/xmlui/browse?value=Romanian&type=language
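Whichever tagged corpus is used, the README requires the training file to hold one token and one tag per line, separated by a Tab. A small check of that layout could look like the sketch below; the function name `check_training_lines` is an assumption, and the sample reuses the README's example with one deliberately broken line.

```python
def check_training_lines(lines):
    """Return (line_number, text) for every line that is not 'token<TAB>tag'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        fields = text.split("\t")
        # Exactly two non-empty fields are expected: the token and its tag.
        if len(fields) != 2 or not fields[0] or not fields[1]:
            problems.append((number, text))
    return problems


sample = [
    "Pierre\tNP\n",
    "Vinken\tNP\n",
    ",\t,\n",
    "61 CD\n",      # broken on purpose: a space instead of a Tab
    "years\tNNS\n",
]
bad_lines = check_training_lines(sample)
print(bad_lines)
```

Running such a check before training makes it easier to spot conversion mistakes, for example spaces left where Tabs were expected.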
2. TreeTagger http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Once a tagger exists (whether via OpenNLP's MaxEnt models or TreeTagger), the usual analysis tools can be used.
TreeTagger is free for non-commercial use.
In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!
3. koRpus https://cran.r-project.org/web/packages/koRpus/index.html
More details here: http://reaktanz.de/?c=hacking&s=koRpus
Commercial lexicon options also exist.
4. Romanian Treebank http://universal.elra.info/search.php
A tutorial for training a TreeBank
4.1 Examples:
Example of a Russian corpus using MULTEXT-East:
Designing and Evaluating a Russian Tagset
https://msuweb.montclair.edu/~feldmana/publications/2008-lrec-mocky.pdf
4.2 Academic and commercial efforts:
RACAI-RoTb: nucleu de corpus de limbă română adnotat sintactic cu relaţii de dependenţă (a core Romanian corpus syntactically annotated with dependency relations)
http://rochi.utcluj.ro/rrioc/articole/RRIOC-8-2-Irimia.pdf
Parser de dependenţe pentru limba română realizat pe baza parserelor pentru alte limbi romanice (a dependency parser for Romanian built on parsers for other Romance languages)
http://rochi.utcluj.ro/rrioc/articole/RRIOC-7-1-Florea.pdf
A Windows interface for TreeBank
5. Train the Romanian TreeBank http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm
5.1 Trained corpus
5.2
1. Lexicon. Useful: the file
2. Corpus
3. Parsing file. http://www.racai.ro/external/static/awde/tufiscor11.html
To have a suitable parsing file, it must be adapted to the tag description used in the lexicon. For Romanian, the description is here:
http://nl.ijs.si/ME/Vault/V3/msd/html/msd.html#SECTION05200000000000000000
The file with the MSD description is available here:
MULTEXT-East
Morphosyntactic Specifications
Version 4
http://nl.ijs.si/ME/V4/msd/tables/
Table in Romanian:
http://nl.ijs.si/ME/V4/msd/tables/msd-human-ro.tbl
In XML format:
http://nl.ijs.si/ME/V4/msd/tables/msd-canon-ro.tbl
Full documentation:
http://nl.ijs.si/ME/V4/msd/html/msd-ro.html
Ontology:
http://nl.ijs.si/ME/owl/msd-ro.owl
Alternative:
https://www.sketchengine.co.uk/romanian-tagset/
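The open class file described in the TreeTagger README can be derived from the tags themselves once the tagset is fixed. The sketch below assumes MSD tags whose first character encodes the category (N noun, A adjective, R adverb, V verb in the MULTEXT-East specifications) and treats exactly those four categories as open; including verbs is a choice the README leaves open. The tags `Afpms-n` and `Spsa` are illustrative values, and `open_class_tags` is a hypothetical helper name.

```python
# Assumption: the first letter of an MSD tag is its category, and nouns (N),
# adjectives (A), adverbs (R) and verbs (V) are treated as open classes.
OPEN_CLASS_CATEGORIES = {"N", "A", "R", "V"}


def open_class_tags(tags):
    """Return the sorted, de-duplicated tags that belong to an open class."""
    return sorted({tag for tag in tags if tag and tag[0] in OPEN_CLASS_CATEGORIES})


# Tags as they might be collected from the lexicon or the training corpus.
sample_tags = ["Ncms-n", "Ncmpry", "I", "Ncms-n", "Afpms-n", "Spsa"]
selected = open_class_tags(sample_tags)
print(selected)
```

Interjections, adpositions, pronouns and the other closed classes are left out, in line with the README's remark that the file should list only the possible tags of unknown word forms.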
Result:
TreeTagger has a parameter file for Romanian.
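A training run of the kind that produces such a parameter file can be sketched as follows. The argument order follows the USAGE line quoted from the README; the file names and the sentence tag `SENT` are placeholders, not values from this project, and the command is only executed if `train-tree-tagger` is on the PATH and the three input files exist.

```python
import os
import shutil
import subprocess


def build_train_command(lexicon, open_class, infile, outfile, sent_tag=None):
    """Assemble the train-tree-tagger call in the order given by its USAGE line."""
    command = ["train-tree-tagger", lexicon, open_class, infile, outfile]
    if sent_tag is not None:
        command += ["-st", sent_tag]  # optional sentence-end tag
    return command


# Hypothetical file names; replace them with the real lexicon, open class
# list, tagged corpus and desired parameter file.
command = build_train_command(
    "lexicon-ro.txt", "openclass-ro.txt", "train-ro.txt", "romanian.par",
    sent_tag="SENT",
)
print(" ".join(command))

# Run only when the program and all three input files are actually present.
inputs_present = all(os.path.exists(path) for path in command[1:4])
if shutil.which(command[0]) is not None and inputs_present:
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.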

