I developed Python, NLP, CL, and ML teaching material as IPython notebooks for Jupyter. All of the notebooks will eventually be linked here; some examples:
- Intro to Part-of-Speech Tagging (zip, jupyter nbviewer, Anaconda Cloud Notebook, GitHub repo)
- Intro to Hidden Markov Models (zip, jupyter nbviewer, Anaconda Cloud Notebook, GitHub repo)
- Intro to WordNet and NLTK (zip, jupyter nbviewer, GitHub repo)
- Topic Modeling with MALLET (zip, jupyter nbviewer, GitHub repo)
- Intro to the Forward Algorithm (zip, jupyter nbviewer)
- Intro to the Backward Algorithm (zip, jupyter nbviewer)
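For readers who want a quick impression of what the Forward Algorithm notebook covers, here is a minimal sketch. It is not taken from the notebook: the function name `forward`, the dictionary-based model layout, and the toy weather model are all assumptions made for illustration. The function computes the total probability of an observation sequence under an HMM by summing over all state paths, one time step at a time.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under an HMM, summed over all state paths.

    Hypothetical layout: start_p[s], trans_p[prev][s] and emit_p[s][obs]
    are plain dictionaries of probabilities.
    """
    if not observations:
        return 1.0  # the empty sequence has probability 1
    # alpha[s]: probability of the observations seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model (illustrative numbers only, not from the teaching material).
STATES = ("Rainy", "Sunny")
START_P = {"Rainy": 0.6, "Sunny": 0.4}
TRANS_P = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT_P = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

likelihood = forward(["walk", "shop", "clean"], STATES, START_P, TRANS_P, EMIT_P)
```

The loop keeps only the current column of forward probabilities, so memory use is proportional to the number of states rather than to the sequence length. For long sequences a practical version would work with log probabilities or rescale `alpha` at each step to avoid underflow.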
I was porting some finite-state algorithms to Python 3 to build a more or less functional library (FST-lib) for weighted finite-state transducers in native Python, with code generation to C, for example. I will place the code on GitHub: Project PyFST
Here is some of the material from my Python classes and development projects. Some of it dates from the late 1990s, so it may be outdated and may not run under Python 3.x. Some of the Python examples and tutorials (slides and instruction handouts) for corpus, data, and language processing have been adapted to Python 3.
- course material for JSSECL 2006
- course material for the DGfS/CL Fall School 2005
- Corpus processing tools (TEI XML from HTML, XML filtering, quantitative analysis)
- Language identification (LID) with n-gram models
- Orthography to IPA conversion for Croatian (with Malgorzata E. Cavar): see phonemic
- TextStat.py: a lightweight module with functions for creating and using n-gram models for statistical analyses, including various statistical functions, a chi-squared test, vector space conversion of n-gram models, and entropy and other information-theoretic measures. There are examples for document classification, measures of text or model similarity, and various other useful functions.
- Finite State Automata (FSA) scripts: FSA class, automaton from word list, DOT (Graphviz) from automaton, etc.
- Mutual Information and Relative Entropy syntactic parsing (Python code base)
- Text 2 TEI XML with linguistic annotation
- Lithuanian, Croatian, … finite state morphology (transducer, lemmatizer, feature annotation) (mostly in C++ now, see the FLE Project)
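To illustrate the kind of task the FSA scripts in the list above address (automaton from word list, DOT from automaton), here is a minimal sketch. It is not the code from those scripts: the function names `build_trie_fsa`, `accepts`, and `to_dot` and the representation of the automaton as a transition dictionary are assumptions made for this example. It builds a deterministic, trie-shaped acceptor with start state 0 and writes it out in Graphviz DOT syntax.

```python
def build_trie_fsa(words):
    """Build a trie-shaped deterministic acceptor from a word list.

    Returns (transitions, finals): transitions maps (state, symbol) to the
    next state, finals is the set of accepting states; state 0 is the start.
    """
    transitions = {}
    finals = set()
    next_state = 1
    for word in words:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)
    return transitions, finals


def accepts(transitions, finals, word):
    """Return True if the automaton accepts the word."""
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in finals


def to_dot(transitions, finals):
    """Render the automaton as Graphviz DOT text.

    Assumes symbols contain no double quotes or backslashes (no escaping).
    """
    lines = ["digraph fsa {", "  rankdir=LR;", "  node [shape=circle];"]
    for state in sorted(finals):
        lines.append(f"  {state} [shape=doublecircle];")
    for (src, symbol), dst in sorted(transitions.items()):
        lines.append(f'  {src} -> {dst} [label="{symbol}"];')
    lines.append("}")
    return "\n".join(lines)


fsa_transitions, fsa_finals = build_trie_fsa(["car", "cat", "dog"])
dot_text = to_dot(fsa_transitions, fsa_finals)
```

The resulting `dot_text` can be saved to a file and rendered with the Graphviz `dot` tool. A trie shares only prefixes; a minimized automaton would also merge common suffixes, which this sketch does not attempt.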