CLS2010 - Course - Ćavar "Quantitative and qualitative computational analysis of language and text similarities"

Quantitative and qualitative computational analysis of language and text similarities, clustering and classification
Damir Ćavar
University of Zadar, Linguistics Department

Slides 1
Slides 2
Slides 3
Slides 4

Code (download complete ZIP file)

tokenizer.py (Command line: python tokenizer.py *.txt > tokens.lst)
fp.py (frequency profile; Command line: python fp.py tokens.lst > tokens-fp.lst; import the resulting tokens-fp.lst into a spreadsheet program and sort it)
rfp.py (relative frequency profile; Command line as for fp.py)
ttr.py (type/token ratio; Command line: python ttr.py tokens.lst)
unique.py (unique tokens per class; each token list represents one class, and for each list the script generates a list of its unique tokens; Command line: python unique.py sports.lst tech.lst ...)
rfp-unique.py (unique tokens per class with relative frequencies; Command line as for unique.py)
unknown-class1.py (calculates simple frequency distances between the language models and an unknown text; requires language models, see code; Command line: python unknown-class1.py myUnknownText.txt)
entropy1.py
pentropy1.py
BM1.py (Bayesian learner and classifier for text; further instructions in class and in person)
make-docmodel.py and make-docmodels.sh (for BM1.py, see instructions for it)
nyt-books.dat, sports.dat and tech.dat (models for BM1.py, see instructions for it)
tfidf.py and make-tfidf.py (make a df model for terms, and compare relative frequencies with a tf-idf score, see instructions during class)
kld.py (classify with Kullback-Leibler Divergence scores)
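
Sketch of the measures named above. This is not the code from the ZIP file; it is a minimal illustration of a frequency profile, a relative frequency profile, the type/token ratio, unique tokens per class, and classification with Kullback-Leibler Divergence. All function names are hypothetical. Two points are assumptions that the listing above does not confirm: "unique" is read as "occurs in no other class", and the class with the smallest divergence from the unknown text is taken as the best match. Add-one smoothing over the joint vocabulary is also an assumption, chosen here only to avoid division by zero.

```python
import math
from collections import Counter


def frequency_profile(tokens):
    """Absolute count of every token type."""
    return Counter(tokens)


def relative_frequency_profile(tokens):
    """Count of every token type divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def unique_tokens(classes):
    """For each class, the tokens that occur in no other class (assumed reading)."""
    result = {}
    for name, tokens in classes.items():
        others = set()
        for other_name, other_tokens in classes.items():
            if other_name != name:
                others.update(other_tokens)
        result[name] = set(tokens) - others
    return result


def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed probabilities over a shared vocabulary (an assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {tok: (counts[tok] + alpha) / total for tok in vocabulary}


def kl_divergence(p, q):
    """D(P || Q) in bits; p and q must be defined over the same vocabulary."""
    return sum(p[tok] * math.log2(p[tok] / q[tok]) for tok in p)


def classify(unknown_tokens, classes):
    """Return the class whose smoothed model diverges least from the unknown text."""
    vocabulary = set(unknown_tokens)
    for tokens in classes.values():
        vocabulary.update(tokens)
    p = smoothed_distribution(unknown_tokens, vocabulary)
    scores = {
        name: kl_divergence(p, smoothed_distribution(tokens, vocabulary))
        for name, tokens in classes.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data for illustration only; real token lists come from tokenizer.py.
    classes = {
        "sports": "goal match team goal score".split(),
        "tech": "chip code software chip code".split(),
    }
    print(frequency_profile(classes["sports"]))
    print(type_token_ratio(classes["sports"]))
    print(classify("team goal goal".split(), classes))
```

The Kullback-Leibler Divergence is not symmetric, so the direction matters: here the unknown text is P and each class model is Q.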

Some more:
rfp1.py
fp1.py

CLS2010 - Computational Linguistics Summer Events