Quantitative and qualitative computational analysis of language and text similarities, clustering and classification
Damir Ćavar
University of Zadar, Linguistics Department

Slides 1
Slides 2
Slides 3
Slides 4

Code (download complete ZIP file)
  • tokenizer.py (Command line: python tokenizer.py *.txt > tokens.lst)
  • fp.py (frequency profile; import the resulting tokens-fp.lst into a spreadsheet program and sort it; Command line: python fp.py tokens.lst > tokens-fp.lst)
  • rfp.py (relative frequency profile; Command line as for fp.py)
  • ttr.py (type/token ratio; Command line: python ttr.py tokens.lst)
  • unique.py (unique tokens per class; each token list represents one class, and for each list the script generates the tokens unique to it; Command line: python unique.py sports.lst tech.lst ...)
  • rfp-unique.py (unique tokens per class with relative frequencies; usage as for unique.py)
  • unknown-class1.py (calculates simple frequency distances between the models and an unknown text; requires language models, see code; Command line: python unknown-class1.py myUnknownText.txt)
  • entropy1.py
  • pentropy1.py
  • BM1.py (Bayesian learner and classifier for text; further instructions in class and in person)
  • make-docmodel.py and make-docmodels.sh (for BM1.py, see instructions for it)
  • nyt-books.dat, sports.dat and tech.dat (models for BM1.py, see instructions for it)
  • tfidf.py and make-tfidf.py (build a document frequency (df) model for terms and compare relative frequencies with tf-idf scores; instructions given in class)
  • kld.py (classify with Kullback-Leibler Divergence scores)
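
The measures named in the list above (frequency profile, relative frequency profile, type/token ratio, and classification by Kullback-Leibler divergence) can be sketched in a few lines of Python. This is a minimal illustration, not the code in the ZIP file: the function names, the epsilon floor for tokens missing from a model, and the toy token lists are all assumptions made for this example, and the token lists are assumed to be non-empty.

```python
import math
from collections import Counter


def frequency_profile(tokens):
    """Absolute count of each token type."""
    return Counter(tokens)


def relative_frequency_profile(tokens):
    """Count of each type divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def kl_divergence(p, q, epsilon=1e-9):
    """D(P || Q) in bits.

    epsilon is an assumed floor probability for tokens that occur in p
    but not in q; other smoothing schemes are equally possible.
    """
    return sum(
        pv * math.log2(pv / q.get(tok, epsilon))
        for tok, pv in p.items()
        if pv > 0
    )


def classify(unknown_tokens, models):
    """Return the name of the model with the smallest divergence
    from the relative frequency profile of the unknown text."""
    p = relative_frequency_profile(unknown_tokens)
    return min(models, key=lambda name: kl_divergence(p, models[name]))


# Toy class models (hypothetical data, one token list per class).
models = {
    "sports": relative_frequency_profile("goal match team goal win".split()),
    "tech": relative_frequency_profile("chip code chip data code".split()),
}
```

With these toy models, classify("team goal goal".split(), models) selects "sports", because every token of the unknown text is missing from the "tech" model and is therefore penalized by the epsilon floor.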

Some more: