brill-tagger.lsp


Module: brill-tagger

Brill part of speech (POS) tagger interface

Version: 0.3a
Author: Lutz Mueller 2008




This module interfaces to a modified version of the Brill part of speech (POS) Tagger from: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
The original source code was modified to also compile on Mac OS X and to suppress some information sent to standard output. This module uses a simple newLISP exec interface to call the tagger. The modified package is available here: BRILL_TAGGER_NEWLISP_V1.14 .

Requirements

Build the Brill Tagger utilities from the modified Brill tagger distribution: RULE_BASED_TAGGER_V1.14-MAC-OSX.tgz. On Mac OS X, build with makefile_osx; on other UNIX systems, use the normal Makefile. See the file README.MAC_OSX for details.

By default, program and data files should be in the following places:
 /usr/bin/tagger
 /usr/bin/start-state-tagger
 /usr/bin/final-state-tagger
 /usr/share/RULE_BASED_TAGGER_V1.14/LEXICON
 /usr/share/RULE_BASED_TAGGER_V1.14/BIGRAMS
 /usr/share/RULE_BASED_TAGGER_V1.14/LEXICALRULEFILE
 /usr/share/RULE_BASED_TAGGER_V1.14/CONTEXTUALRULEFILE
 


When using different locations for the data files, the constants of the same name in the header of brill-tagger.lsp have to be changed.
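
As a rough sketch, the header definitions might look like the following once the data files have been moved. Both the use of constant and the /usr/local/share prefix are assumptions made for illustration; the actual header of the module should be checked for the exact form.

```newlisp
; Assumed form of the header constants; only the paths would change.
; /usr/local/share is a hypothetical alternative location.
(constant 'LEXICON "/usr/local/share/RULE_BASED_TAGGER_V1.14/LEXICON")
(constant 'BIGRAMS "/usr/local/share/RULE_BASED_TAGGER_V1.14/BIGRAMS")
(constant 'LEXICALRULEFILE "/usr/local/share/RULE_BASED_TAGGER_V1.14/LEXICALRULEFILE")
(constant 'CONTEXTUALRULEFILE "/usr/local/share/RULE_BASED_TAGGER_V1.14/CONTEXTUALRULEFILE")
```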

§

brt:tag

syntax: (brt:tag str-corpus [boolean-flag])
parameter: str-corpus - The sentences to be tagged separated by a line feed character.
parameter: boolean-flag - Set true to see raw output from Brill Tagger.

return: An association list of words and their tags in the order they occur in the sentence.



The corpus to be tagged should be one sentence per line, with punctuation tokenized. As much text as possible should be tagged at once to minimize overhead.
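
A minimal sketch of preparing such a corpus is shown below. The helper tokenize-punctuation and the sample sentences are hypothetical and not part of the module. The pattern only separates a few common punctuation marks and would also split abbreviations such as "Mr." or decimal numbers, so real text may need a more careful tokenizer.

```newlisp
; Put a space in front of common punctuation marks so that each mark
; becomes its own token. The brace-quoted pattern reaches the regex
; engine unchanged; option 0 selects plain regular-expression matching.
(define (tokenize-punctuation str)
  (replace {\s*([.,;:!?])} str (string " " $1) 0))

; Hypothetical input: one string per sentence.
(set 'sentences '("The cat eats the mouse." "Then it sleeps, of course."))

; One sentence per line, joined by line feeds, so the tagger is called once.
(set 'corpus (join (map tokenize-punctuation sentences) "\n"))

; With the module loaded, the whole corpus can be tagged in a single call:
; (brt:tag corpus)
```

Joining all sentences before calling the tagger follows the advice above: the external program is started once per call, so one large call costs less than many small ones.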

Example:
 (brt:tag "the cat eats the mouse")
 => (("the" brt:DT) ("cat" brt:NN) ("eats" brt:VBZ) ("the" brt:DT) ("mouse" brt:NN))

 (brt:tag "the cat eats the mouse" true)
 => "the/DT cat/NN eats/VBZ the/DT mouse/NN"
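
If only the raw output is available, it can be split back into word/tag pairs with a few lines of newLISP. The sketch below works on the literal string from the example above, so it runs without the tagger installed. It assumes a single line of output and that no word contains a "/" character; the names raw, pairs and nouns are illustrative.

```newlisp
; Raw output copied from the example above.
(set 'raw "the/DT cat/NN eats/VBZ the/DT mouse/NN")

; Split on spaces, then split each token into a (word tag) pair of strings.
(set 'pairs (map (fn (token) (parse token "/")) (parse raw " ")))

; Keep the words whose tag starts with NN (the noun tags of the Penn tagset).
(set 'nouns (map first
  (filter (fn (pair) (starts-with (last pair) "NN")) pairs)))

(println nouns)  ; prints ("cat" "mouse")
```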


- ∂ -

generated with newLISP  and newLISPdoc