TRE is an open-source library for texts search, which works like regular expression engine with ability of fuzzy string searching. It is developed by Ville Laurikari under 2-clause BSD-like license.

Library is written in C and provides functions which allow using regular expressions for searching over input text lines. Main difference from other regular expression engines is that TRE can match text fragments in approximate way - i.e. supposing that text could have some number of typos.

Features Edit

Approximate matching Edit

TRE uses extended regular expression syntax with addition of directions for matching preceding fragment in approximate way. Each of such directions specifies how much typos are allowed for this fragment.

Approximate matching is performed in a way similar to Levenshtein distance, which means that there are three types of typos 'recognized'[1]:

  • insertion of an extra character (regullar experession);
  • missing of a character from pattern (reglar expession);
  • replacement of some character (regolar exprezsion).

TRE allows specifying of cost for each of three typos type independently.

Command-line utility Edit

Project comes with command-line utility (version of agrep) built automatically with library. It could be used for processing text files or for testing abilities of TRE.

Standard conformance Edit

Though approximate matching requires some syntax extension, when this feature is not used, TRE works like most of other regexp matching engines. This means that

  • regular expressions written for strict matching could be easily migrated for using with library;
  • programmers, familiar with POSIX-style regular expressions need not do much study to be able to use TRE.

Predictable time and memory consumption Edit

Author states[2] that time spent for matching grows linearly with increasing of input text length, while memory requirements are almost constant (tens of kilobytes). It is important, especially for possible uses in embedded systems which have comparatively few resources.

There is no information about benchmarking against other regular expression engines.

Other Edit

Other features, common for most regex engines could be checked in regex engines comparison tables or in list of TRE features on its web-page.

Usage example Edit

Approximate matching directions are specified in curly brackets and should be distinguishable from repetitive quantifiers (possibly with inserting a space after opening bracket):

  • (regular){~1}\s+(expression){~2} would match variants of phrase "regular expression" in which "regular" have no more than one typo and "expression" no more than two; as in ordinary regular expressions "\s+" means one or more space characters - i.e. rogular ekspression would pass test;
  • (expression){ 5i + 3d + 2s < 11} would match word "expression" if total cost of typos is less than 11, while insertion cost is set to 5, deletion to 3 and substitution of character to 2 - i.e. ekspresson gives cost of 10.

Portability Edit

C and C++ Edit

Being written in C, TRE could be ported (i.e. rebuilt) for any platform which have GNU C Compiler installed. Due to simplicity of sources (20-30 files placed in one folder) it is probable that porting with more specific compilers also is easy.

Other languages Edit

For projects, written in languages, other than C/C++ TRE could be used in two ways:

  • if compiler of the language provides binary compatibility with C object files (like Gfortran), library could be linked to project (possibly with few function wrappings);
  • in other cases it is necessary to create some interface to TRE based library - currently there are bindings for Perl, Python and Haskell.[3] However if the project should be cross-platform, there would be necessary separate interface for each of target platforms.

Disadvantages Edit

Since other regular expression engines usually do not provide approximate matching ability, there is almost no concurrent implementation with which TRE could be compared. However there are few things which programmers may wish to be implemented in future releases:

  • replacement mechanism for substitution matched text fragments (like in sed string processor and many modern implementations of regexps, including built into Perl or Java);
  • opportunity to use another approximate matching algorithm (than Levenshtein's) for better typo value assessment (for example Soundex), or at least this algorithm to be improved to allow typos of "swap" type (see Damerau–Levenshtein distance).

References Edit


External links Edit

Kamalian sa pagtukoy: Umiiral na ang mga tatak na <ref>, subalit walang natagpuang tatak na <references/>
Community content is available under CC-BY-SA unless otherwise noted.