Funded projects or grant-related activities:

Other projects:

Some other ongoing activities:

  • The Language Technology Lab (LTL) at EMU closed in Spring 2014; a comparable environment is expected to be re-established at IU.
  • Multi-tier linguistic annotation of language resources for the study of qualitative and quantitative aspects of language change and dependencies across linguistic levels.
  • Discontinuous constituents: syntax, information theory, prosody, and parsing with LFG and related frameworks.
  • Croatian morphology with XFST (porting CroMo to it).
  • Croatian and Lithuanian LFG grammars with XLE, and Lithuanian morphology with XFST.
  • TEI XML and corpus processing tools (backend evaluation of XQuery and different XML DBs).
  • Corpus annotation of the CLC using the linguistic components above, and others.
  • The Scheme Natural Language Toolkit (SNLTK) and a potential accompanying textbook.
  • OWL and SPARQL, RDF in general, and Linked Linguistic Open Data.
  • C++11, C++14, and NLP/HLT tools: keeping up with new C++ language standards and implementing NLP tools that are efficient in processing time and memory.
  • Probabilistic LFG and FLE.


Former projects:

MultiTree: Completing the Library of Language Relationships
Award Number: 1227106; Principal Investigator: Damir Cavar; former PIs: Helen Aristar-Dry, Anthony Aristar; Organization: Indiana University; NSF Organization: BCS; Award Date: 07/12/2012.
Project page:

Automatically Annotated Repository of Digital Video and Audio Resources Community (AARDVARC)
Award Number: 1244713; Principal Investigator: Damir Cavar; former PI: Helen Aristar-Dry; former Co-PIs: Anthony Aristar, Damir Cavar; Organization: Indiana University; NSF Organization: BCS; Award Date: 09/15/2012. Collaborative grant: PI Douglas H. Whalen, CUNY.
Project page:

The Scheme Natural Language Toolkit (SNLTK)
This is a no-budget project to develop a Scheme and Racket implementation of libraries and functionality for the analysis and processing of text, language, and linguistic data. It is open to all interested parties and is mainly supported by the Schemers and Racketeers in Zadar and by me.

The Croatian Language Corpus (CLC) (Hrvatski jezični korpus)
This is a joint project with the Institute for Croatian Language and Linguistics, as part of the program “Croatian Online Language Repository”, in cooperation with Dunja Brozović-Rončević, Małgorzata E. Ćavar, and Tomislav Stojanov. The CLC is a text corpus of Croatian literature, newspapers, and other genres, encoded in XML on the basis of the TEI P5 standard and made available online through the Philologic interface. Additional interfaces are currently being developed and tested to improve online usability and the user experience of working with the corpus. The corpus is being annotated phonemically and morphologically, and parsed syntactically. We ported an initial hand-crafted morphological analyzer to XFST, and we are working on a Croatian LFG grammar for XLE for syntactic parsing and functional markup. An extended search interface that allows online retrieval of linguistic annotations and structures at these linguistic levels will be provided in the near future.

Semantic Nets and Computational Lexicology
This project is part of a research program at the Institute of Croatian Language and Linguistics (IHJJ) that has been funded by the Ministry of Research, Education, and Sports of the Republic of Croatia since November 2006. Several researchers at the University of Zadar and the Institute of Croatian Language and Linguistics are affiliated with it.

Applied Technology for Language-Aided CMS (ATLAS) till September 2010
Work package leader: Damir Cavar (Croatian Language Processing Chain and Multilingual Document Classification)
EC web site
Funded under: The Information and Communication Technologies Policy Support Programme
Area: CIP-ICT-PSP.2009.5.3 – Multilingual Web: Multilingual Web content management: methods, tools and processes
Project reference: 250467; Execution: From 01/03/2010 to 28/02/2013; Project status: Finalized
In cooperation with: Pavle Valerijev (University of Zadar), Franjo Pehar (University of Zadar), Damir Kero (University of Zadar), Drahomira Gavranović (University of Zadar), Malgorzata E. Cavar (University of Zadar)
Tetracom Interactive Solutions (Tetracom) – Coordinator; Deutsches Forschungszentrum Fuer Kuenstliche Intelligenz GmbH (DFKI); Instytut Podstaw Informatyki Polskiej Akademii Nauk (ICS PAS); Atlantis Consulting SA (Atlantis); University Alexandru Ioan Cuza (UAIC); Institute for Bulgarian Language (IBL DCL); Institute of Technologies and Development Foundation (ITD); University of Hamburg (UHH); University of Zadar (UniZD)

ABUGI (Alignment-Based Grammar Induction)
Unsupervised Grammar Induction; with Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, Giancarlo Schrementi, Linguistics Dept., Indiana University.

A Quantitative Model of Contact-Induced Language Change…
with a Focus on Pidginization and Creolization
This project was funded in 2005 and 2006 by the FRSP Award program, Linguistics Dept., Indiana University.

Northern Caddoan Languages Documentation Project
Award Number: 0421838; Principal Investigator: Douglas Parks; Co-Principal Investigators: Wallace Hooper, Damir Cavar; Organization: Indiana University; NSF Organization: BCS; Award Date: 07/15/2004.