15 Sep

ELAN2split

I published a new version of ELAN2split on Bitbucket.

ELAN2split is a tool that creates pairs of audio and transcription files from the time-aligned segments in an ELAN file. Each time-aligned segment is saved as two files: the WAVE file trimmed from the original recording, and the transcription or annotation text from the corresponding tier, which can be selected on the command line. The resulting corpus is suited to building and training a forced aligner, creating initial speech corpora, and subsequently training a speech recognizer. I built this tool to work with the Prosodylab-Aligner.
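The splitting idea described above can be sketched in a few lines of Python. This is not the ELAN2split code, only a minimal illustration under some assumptions: the EAF uses the usual TIME_SLOT and ALIGNABLE_ANNOTATION elements, the recording is an uncompressed WAVE file, and the function names read_segments and cut_wav are made up for this example.

```python
import wave
import xml.etree.ElementTree as ET


def read_segments(eaf_source, tier_id):
    """Return (start_ms, end_ms, text) for each time-aligned annotation in a tier."""
    root = ET.parse(eaf_source).getroot()
    # Map time slot IDs to millisecond values; unaligned slots have no TIME_VALUE.
    slots = {ts.get("TIME_SLOT_ID"): ts.get("TIME_VALUE")
             for ts in root.iter("TIME_SLOT")}
    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without both time values
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            segments.append((int(start), int(end), text))
    return segments


def cut_wav(src_path, dst_path, start_ms, end_ms):
    """Copy the frames between start_ms and end_ms into a new WAVE file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(start_ms * rate // 1000)
        frames = src.readframes((end_ms - start_ms) * rate // 1000)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)   # the frame count in the header is corrected on close
        dst.writeframes(frames)
```

Looping over the result of read_segments and calling cut_wav for each entry, while writing the text to a file with the same base name, gives the kind of audio/transcription pairs described above.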

This is a command line tool. It does not come with a graphical interface. Binary versions for Ubuntu 16.04 64-bit and Mac OSX are available in the Downloads section of the Bitbucket repository.

 

15 Sep

TreebankParser SA

I posted the code of the standalone version of the TreebankParser on Bitbucket. A binary for Mac OSX can be found in the Download section. Binaries for the common operating systems will follow.

The TreebankParser is part of the Free Linguistic Environment (FLE) project. It converts some types of treebanks to a set of rules with frequencies or probabilities. Some extensions that are part of FLE will be added soon, e.g. converting the Context-free Grammars (CFGs) into a Weighted Finite State Transducer (WFST) representation.
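To illustrate what converting a treebank to rules with frequencies or probabilities means, here is a minimal Python sketch. It is not the TreebankParser code. It assumes Penn-style bracketed trees such as (S (NP I) (VP run)), and the function names are invented for this example.

```python
from collections import Counter


def parse_tree(text):
    """Parse one bracketed tree, e.g. "(S (NP I) (VP run))", into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())   # nested constituent
            else:
                children.append(tokens[pos])  # terminal (word)
                pos += 1
        pos += 1                          # skip ")"
        return (label, children)

    return node()


def count_rules(tree, counts):
    """Add one count per rule LHS -> RHS found in the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def rule_probabilities(counts):
    """Relative frequency of each rule among the rules with the same left-hand side."""
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}
```

Calling count_rules on every parsed tree with one shared Counter yields the rule frequencies, and rule_probabilities turns them into the relative-frequency estimates of a probabilistic CFG.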

 

14 Aug

On Ubuntu/Debian/… tools for linguists

DRAFT – Work in Progress

The standard Ubuntu distribution comes with various linguistic tools. I am linking here to 16.04(.1).

The package list includes various finite-state transducer toolkits that are used for developing morphological analyzers, tokenizers, and other NLP tools:

There are also ready-to-use NLP tools for various languages in the standard package list:

 

Other Repositories

Some repositories provide more packages that might be interesting or useful for linguistic work, be it language documentation or corpus linguistics:

I set up the SIL repository by creating, as root or using sudo, the file:

/etc/apt/sources.list.d/sil.list

with this content for Ubuntu xenial (16.04):

deb http://packages.sil.org/ubuntu xenial main

 

08 Aug

Repairing ELAN annotation files from before 2005 for use with ELAN 4.9.4 or newer

ELAN 4.9.4 has issues with older versions of ELAN Annotation Files (EAFs).

We noticed that ELAN annotation files (EAFs) from 2004 and earlier would not load properly in ELAN 4.9.4, the most recent version. The problem is somewhat serious because ELAN neither shows an alert nor gives a reason for the missing tiers and annotations: it opens the files without any notification and just shows an empty tiers and media section.

Han Sloetjes informed me that ELAN has a log dialog under *View > View Log…*. We identified the issue to be a missing attribute in the older EAF XML. The old XML root tag looks like this:

<ANNOTATION_DOCUMENT AUTHOR="Jarrod Slocum" DATE="2004.03.02 13:56 CST"
 FORMAT="2.0" VERSION="2.0"
 xsi:noNamespaceSchemaLocation="http://www.mpi.nl/tools/elan/EAFv2.0.xsd">

This root tag is missing one important attribute that the newer ELAN XML-parser requires. Han sent me this change suggestion:

<ANNOTATION_DOCUMENT AUTHOR="Jarrod Slocum" DATE="2004.03.02 13:56 CST"
 FORMAT="2.0" VERSION="2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="http://www.mpi.nl/tools/elan/EAFv2.0.xsd">

This adds the attribute:

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
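The change above amounts to a small text substitution on the root tag. The following Python sketch shows one way to do it. It is not the elanrepair.py script mentioned in this post. It assumes the EAF is UTF-8 encoded, and the function names and the .bak backup suffix are made up for this example.

```python
import re
import shutil

XSI_DECL = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'


def add_xsi_namespace(xml_text):
    """Return xml_text with xmlns:xsi declared on the ANNOTATION_DOCUMENT root tag."""
    match = re.search(r"<ANNOTATION_DOCUMENT\b[^>]*>", xml_text)
    if match is None or "xmlns:xsi=" in match.group(0):
        return xml_text  # no EAF root tag found, or the attribute is already there
    insert_at = match.start() + len("<ANNOTATION_DOCUMENT")
    return xml_text[:insert_at] + " " + XSI_DECL + xml_text[insert_at:]


def repair_file(path):
    """Repair one EAF in place, keeping a backup copy; return True if it was changed."""
    # newline="" keeps the original line endings untouched (assumes UTF-8 content).
    with open(path, encoding="utf-8", newline="") as f:
        original = f.read()
    repaired = add_xsi_namespace(original)
    if repaired == original:
        return False
    shutil.copy2(path, path + ".bak")  # hypothetical backup name
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(repaired)
    return True
```

Files that already declare the attribute are left alone, so running the sketch twice on the same file does no harm.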

The danger is that linguists who open older EAFs in ELAN may get confused, assume that the data is lost, start writing over the old EAFs, and then really lose their previous data and work. Please be aware of this issue: if you have archived EAFs from 2004 or earlier, you might have to correct the EAF XML to be able to view and edit them in the most recent ELAN.

I told Han that it might be a good idea to include the code in new ELAN releases to correct this error automatically, since many archived files might be affected by this issue.

If you have a lot of older ELAN files with this issue, here is a script or set of commands that will add this attribute to the XML. You will need some version of Python 3.x to run this script. If you run it in a Unix environment (Mac OS X, Linux, or Windows 10 with the Linux Subsystem and Bash) and want to process sub-folders, you will also need *find*. The script creates a backup copy of every EAF that it repairs. I still strongly recommend that you create your own backup copy and verify that the repaired EAFs can be opened with the newer versions of ELAN.

To repair all EAFs in the local folder, assuming that the script elanrepair.py is executable:

./elanrepair.py *.eaf

Download the script elanrepair.py in the ZIP-file elanrepair.zip.

If you cannot run the script directly, use the Python 3 interpreter to run it:

python3 ./elanrepair.py *.eaf

If you want to recurse through a larger set of sub-folders on a Bash command line, try:

find . -name "*.eaf" -type f -exec ./elanrepair.py {} \;
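Where *find* is not available, the same recursion can be sketched in Python itself. The helper names find_eafs and repair_all are made up for this example, and the sketch assumes that elanrepair.py accepts a single EAF path as its argument, as in the commands above.

```python
import subprocess
import sys
from pathlib import Path


def find_eafs(root="."):
    """Return all regular files ending in .eaf below root, in sorted order."""
    return sorted(p for p in Path(root).rglob("*.eaf") if p.is_file())


def repair_all(root=".", script="elanrepair.py"):
    """Run the repair script once per EAF, like find ... -exec does."""
    for eaf in find_eafs(root):
        # sys.executable is the Python interpreter that runs this sketch.
        subprocess.run([sys.executable, script, str(eaf)], check=True)
```

Calling repair_all(".") from the top folder then visits every EAF in all sub-folders.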

Let me know if there are any issues with that.

Damir

17 Jul

Compiling Thrax on Linux, Mac OSX, and Windows with Cygwin

Thrax is a grammar compiler developed by a team of researchers at Google Research. It depends on OpenFST. See the post on compiling OpenFST for more details on how to configure and compile it.

Download and unpack the source of Thrax version 1.2.2 or newer.

See what options configure provides:

   ./configure --help

Since I need the static libraries, I added this option:

--enable-static

I also enabled the command line binaries:

--enable-bin

and the readline library in the rewrite tester:

--enable-readline

The last option presupposes that you have installed the development package for libreadline on your Mac OSX, Fedora, Ubuntu, or Windows with Cygwin system.

You have to have the complete OpenFST library compiled and installed on your system. Thrax depends on some extensions that are not compiled with the default configuration of OpenFST. Follow the instructions in the blog post on how to compile OpenFST.

I also set the environment variable for Windows with Cygwin. In particular the library path information seems to be necessary since the make process might stop with the message that it cannot find libfst and libfstfar:

   export CXXFLAGS="-O -D_POSIX_SOURCE -L/usr/local/lib -I/usr/local/include"

My complete configure command in the thrax-1.2.2 folder:

   ./configure --enable-static --enable-bin --enable-readline

To avoid issues with undefined ACCESSPERMS in the file thrax-1.2.2/src/lib/util/utils.cc you can add these three lines of code below the include statements (for example below line 31):

#ifndef ACCESSPERMS
#define ACCESSPERMS (S_IRWXU|S_IRWXG|S_IRWXO)
#endif

The maintainers of the Thrax code base might want to add this anyway, since some operating systems do not define ACCESSPERMS by default. Cygwin on Windows does not.

To compile Thrax run:

   make

Then follow up with a:

   make install

On Mac OSX or Linux you might want to prepend a sudo to the command above.

That was it.

17 Jul

Compiling Foma on Windows with Cygwin

Foma comes as a package in the standard distribution of Ubuntu. However, it might be necessary to recompile it if you intend to link the static library to your software, since the static library was missing in the 16.04 LTS distro that I tested. You can uninstall the distribution's Foma package and use the same instructions below to compile your own version with all components (i.e. binaries, dynamic and static libs, and include files) for Ubuntu 16.04 or Fedora 24. In fact, the same is true for compilation on the most recent Mac OSX with Xcode.

I got the Foma source from the Bitbucket site using the foma-0.9.18.tar.gz file.

In the unpacked folder foma-0.9.18 I edited the Makefile. In line 16 I changed the line:

LDFLAGS = -lreadline -lz -ltermcap

by removing the -ltermcap parameter to get:

LDFLAGS = -lreadline -lz

I saved the changed Makefile.

In Cygwin I made sure that all the development tools are installed, and in particular the package libreadline-devel.

To compile Foma I ran:

make

To install the libraries and include-files I ran:

make install

You could prepend the command above with sudo on Mac OSX or Linux. The components can be found in the subfolders bin, lib, include of /usr/local.

That is all.

 

17 Jul

Compiling OpenFST on Windows 10 using Cygwin

What compiles out of the box on Fedora or Ubuntu can be somewhat more complicated on Windows. Assuming that you are able to set up Cygwin on your Windows system (Windows 10 in my case) and to install all the development tools (e.g. GCC, G++, bison, flex), here is how I compiled the newest OpenFST library and tools (v. 1.5.3 in this case) for Cygwin. A native Windows compilation to DLLs based on Visual Studio 2015 or so might follow soon.

To avoid the "file too big" error during compilation, switch on optimization in C++ (capital O like in Omega):

-O

To avoid errors like "'fileno' was not declared in this scope", one needs to compile the code with _POSIX_SOURCE defined.

In the Cygwin bash set the environment variables (first one is not necessary):

export CFLAGS=-D_POSIX_SOURCE

export CXXFLAGS="-O -D_POSIX_SOURCE"

In the source folder openfst-1.5.3 run configure. Check for the modules and extensions that you want to have compiled (use for example ./configure --help). I activated most of the extensions for use in or required by other tools and libraries:

./configure --enable-static --enable-bin --enable-compact-fsts --enable-compress --enable-const-fsts --enable-far --enable-linear-fsts --enable-lookahead-fsts --enable-mpdt --enable-ngram-fsts --enable-pdt

If you want the Python extensions, add this parameter and make sure that your Python-dev package is installed:

--enable-python

Compile the code using:

make

Install it in the Cygwin file-structure:

make install

That is it.