The code for the ongoing projects can be found on my GitHub and Bitbucket repos.

Some tools coded in C++:

  • ELAN2split splits ELAN annotation files into the time sequences annotated in a specific tier. The tool generates a corpus of file pairs, i.e. audio chunks from a time-aligned speech corpus, each with its corresponding transcription, for use by HTK-based speech tools to build forced aligners or to train other types of speech recognition models. The C++11 code is available at the Bitbucket Git repo.
  • TreeBankParserSA is a tool written in C++11 that extracts Context-free Grammar rules from treebanks in the Penn Treebank format. It can generate Probabilistic Context-free Grammar (PCFG) formats for the Free Linguistic Environment (FLE) with absolute counts and relative frequencies. The frequencies can refer to the left-hand-side symbol or to the particular extracted rule. One output format will also be compact, using finite-state representations or the FLE-based Weighted Finite State Transducer (WFST) representation.
  • Free Linguistic Environment (FLE) is a parser environment implemented in C++11/C++14. It focuses mainly on compatibility with XLE and XFST for parsing based on the LFG formalism (using the existing XFST morphologies and XLE grammars). It can also parse with CFGs, PCFGs, etc. The implementation provides an environment for working with Probabilistic LFG in the backbone (using PCFGs or higher-level probabilistic grammars), and it allows for modeling probabilistic relations between inputs and their parse trees and f-representations. The morphological analyzer uses Foma and OpenFst. I run a list for the development group, a closed Bitbucket Git repo (by invitation), and a free and open repo with the released code.