Damir Ćavar
University of Zadar, Linguistics Department
Slides 1
Slides 2
Slides 3
Slides 4
Code (download complete ZIP file)
- tokenizer.py (Command line: python tokenizer.py *.txt > tokens.lst)
- fp.py (frequency profile; import the resulting tokens-fp.lst into a spreadsheet program and sort it; Command line: python fp.py tokens.lst > tokens-fp.lst)
- rfp.py (relative frequency profile; Command line as for fp.py)
- ttr.py (type/token ratio; Command line: python ttr.py tokens.lst)
- unique.py (unique tokens per class; each token list represents a class, and for each one the script generates a list of the tokens unique to it; Command line: python unique.py sports.lst tech.lst ...)
- rfp-unique.py (unique tokens per class with relative frequencies; as for unique.py)
- unknown-class1.py (calculates simple frequency distances between the language models and an unknown text; requires language models, see code; Command line: python unknown-class1.py myUnknownText.txt)
- entropy1.py
- pentropy1.py
- BM1.py (Bayesian learner and classifier for text; further instructions in class and in person)
- make-docmodel.py and make-docmodels.sh (for BM1.py, see instructions for it)
- nyt-books.dat, sports.dat and tech.dat (models for BM1.py, see instructions for it)
- tfidf.py and make-tfidf.py (build a document frequency (df) model for terms and compare relative frequencies with tf-idf scores; see instructions in class)
- kld.py (classify with Kullback-Leibler Divergence scores)
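As a rough illustration of the idea behind classifying with Kullback-Leibler Divergence scores, here is a minimal sketch. It is not the contents of kld.py: the function names, the toy class models, and the small constant `epsilon` used for tokens missing from a model are all assumptions made for this example. The text is assigned to the class whose relative frequency model gives the lowest divergence.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each token type to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}


def kl_divergence(p, q, epsilon=1e-6):
    """D(P || Q) in bits, summed over the token types of P.

    Tokens missing from Q get the small probability epsilon. This is a
    crude stand-in for real smoothing (an assumption of this sketch), so
    Q is no longer a strictly normalized distribution.
    """
    return sum(
        px * math.log2(px / q.get(tok, epsilon))
        for tok, px in p.items()
        if px > 0
    )


def classify(tokens, models):
    """Return the label of the model with the lowest divergence from the text."""
    p = relative_frequencies(tokens)
    return min(models, key=lambda label: kl_divergence(p, models[label]))


# Toy class models (hypothetical data, not the course's .dat files).
models = {
    "sports": relative_frequencies("goal team match goal score team".split()),
    "tech": relative_frequencies("chip code software chip code data".split()),
}

if __name__ == "__main__":
    unknown = "team goal score".split()
    print(classify(unknown, models))
```

Because the divergence is asymmetric, the sketch measures the unknown text against each model and not the other way round; other choices are possible.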
Some more:
- rfp1.py
- fp1.py
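For orientation, the frequency profile, relative frequency profile, and type/token ratio named in the list above can be sketched in a few lines. This is a minimal sketch and not the code of fp.py, rfp.py, or ttr.py; the function names are hypothetical, and the sample text is split on whitespace only, which is an assumption about tokenization made for this example.

```python
from collections import Counter


def frequency_profile(tokens):
    """Count how often each token type occurs."""
    return Counter(tokens)


def relative_frequency_profile(tokens):
    """Divide each count by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}


def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy input; whitespace splitting is an assumption of this sketch.
tokens = "the cat saw the dog and the bird".split()

if __name__ == "__main__":
    # Print the profile sorted by descending frequency, one token per line.
    for tok, n in frequency_profile(tokens).most_common():
        print(tok, n, sep="\t")
    print("type/token ratio:", type_token_ratio(tokens))
```

The tab-separated output is chosen so that it can be imported into a spreadsheet and sorted there, in the spirit of the fp.py description.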