Regular Expressions

Perl's regular expression engine sets the standard by which others are measured. Regular expressions (regexps) are written in a compact but very powerful pattern-matching language, and Perl's concise matching syntax lets you do a great deal with very little code.

Here's an example of regexps at work. The job is to extract name/value pairs from a Fortran namelist that contains many lines like these:

!----    JETtransp Update Mon Jun  7 09:33:24 BST 2004    --------------------
!HOST = pppl
FTIME = 0.343
KMDSPLUS = 1
MDS_PATH = '\top.INPUTS'
NLEBAL     =.true.     ! .true. to perform e- energy balance calculation  [ - ]
E0IN       = 20.0, 3.0, 0.55, 20.0, 3.0, 0.55 !                           [eV ]
 NLCO(1) = .T
 TBONA(1) =  8.899999E-02

Note the variable syntax for names and boolean values, variable spacing, mixture of single values and comma-separated lists, single-quoted strings, and comments. Each name/value pair occupies one line, so line-oriented processing is appropriate.

This short program does the job:

#! /usr/bin/perl -w

use strict;

{
   my ($file, $line, $name, $key, $value, $boolean, $integer, $float, 
       $string, $n, %nameValues);

   $name = '[A-Z][A-Z0-9_]+(?:\(\d\))?';
   $boolean = '\.true\.|\.TRUE\.|\.false\.|\.FALSE\.|\.T|\.F';
   $integer = '-?\d+';
   $float = '-?\d+\.\d*E?-?\+?\d*|-?\d*\.\d+E?-?\+?\d*';
   $string = '\'[^\']+\'';

   $file = 'namelist.txt';
   open(FILE, '<', $file) or die("can't open $file: $!");
   while ($line = <FILE>) {
      if (($line =~ m/^\s*($name)\s*=\s*($boolean)/) ||
          # (?:$float) confines the | inside $float to a single number
          ($line =~ m/^\s*($name)\s*=\s*((?:(?:$float)\s*,?\s*)+)/) ||
          ($line =~ m/^\s*($name)\s*=\s*((?:$integer\s*,?\s*)+)/) ||
          ($line =~ m/^\s*($name)\s*=\s*($string)/)) {
         $nameValues{$1} = $2;
      }
   }
   close(FILE) or die("can't close $file: $!");

   foreach $n (sort(keys(%nameValues))) {
      print("$n = $nameValues{$n}\n");
   }
}

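After the declarations, which include the hash %nameValues to store the results, match patterns are defined for names and values. The patterns must be written in single quotes so that Perl leaves backslash sequences such as \d alone and does not try to interpolate variables; the text then reaches the regexp engine exactly as written.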
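
The effect of the quoting can be checked directly. This is a minimal sketch; the variable names $single and $double are illustrative and do not appear in the program above.

```perl
use strict;
use warnings;

# In single quotes the backslash survives, so the regexp engine sees \d.
my $single = '-?\d+';

# In double quotes the backslash has to be doubled to get the same text.
my $double = "-?\\d+";

print("same pattern text\n") if $single eq $double;
print("-42 matches\n") if '-42' =~ m/^$single\z/;
```

Both lines are printed: the two strings hold the same five characters, and the pattern they contain matches -42.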

$name = '[A-Z][A-Z0-9_]+(?:\(\d\))?';
A name starts with a capital letter, followed by one or more capitals, digits or underscores, so it is at least two characters long. This is optionally followed by a single digit in parentheses, as in NLCO(1). (?: ... ) defines a non-capturing group, which the final ? makes optional. Parentheses are special characters in regexps, so literal ones must be escaped with '\'. \d is an escape sequence that matches any decimal digit.
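
To see what the name pattern accepts and rejects, it can be tried against a few candidates. This is a small sketch: is_name is a hypothetical helper, and TBONA(12) and X are made-up test inputs rather than names from the namelist.

```perl
use strict;
use warnings;

my $name = '[A-Z][A-Z0-9_]+(?:\(\d\))?';

# Hypothetical helper: true if the whole candidate is a valid name.
# ^ and \z anchor the pattern to both ends of the candidate.
sub is_name {
   my ($candidate) = @_;
   return ($candidate =~ m/^$name\z/) ? 1 : 0;
}

foreach my $candidate ('FTIME', 'MDS_PATH', 'NLCO(1)', 'TBONA(12)', 'X') {
   my $verdict = is_name($candidate) ? 'name' : 'not a name';
   print("$candidate: $verdict\n");
}
```

The first three candidates are reported as names. TBONA(12) is rejected because the pattern allows only one digit in the parentheses, and X is rejected because at least two characters are required.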

$boolean = '\.true\.|\.TRUE\.|\.false\.|\.FALSE\.|\.T|\.F';
A boolean value is one of .true., .TRUE., .false., .FALSE., .T or .F. Here | separates the alternatives and \. stands for a literal '.'. Without the escape, '.' normally means any character except a newline.
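
Once a boolean has been recognised, it is often convenient to turn it into a Perl true/false value. The sketch below shows one way to do that, assuming it is enough to look at the letter after the dot; to_flag is a hypothetical helper, not part of the program above.

```perl
use strict;
use warnings;

my $boolean = '\.true\.|\.TRUE\.|\.false\.|\.FALSE\.|\.T|\.F';

# Hypothetical helper: 1 for true, 0 for false, undef if not a boolean.
sub to_flag {
   my ($text) = @_;
   # The parentheses capture the value and keep the | alternatives together.
   return undef unless $text =~ m/^($boolean)/;
   my $value = $1;
   return ($value =~ m/^\.[tT]/) ? 1 : 0;
}

foreach my $text ('.true.', '.FALSE.', '.T', '.maybe.') {
   my $flag = to_flag($text);
   print("$text -> ", defined($flag) ? $flag : 'not a boolean', "\n");
}
```

This prints 1 for .true. and .T, 0 for .FALSE., and reports the made-up value .maybe. as not a boolean.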

$float = '-?\d+\.\d*E?-?\+?\d*|-?\d*\.\d+E?-?\+?\d*';
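Float syntax is a bit messy. The only required feature is a digit before or after a decimal point, and the two alternatives cover these two cases in turn. This pattern is sloppy because the exponent digits are optional, as is the E itself, and both a minus and a plus sign are allowed in the exponent, but it works with sensible data.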
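
The sloppiness can be made visible by comparing the pattern with a tighter one. In this sketch $strict_float and the helper verdict are hypothetical names, and the stricter pattern is only one possible tightening. It assumes that an exponent, when present, needs a letter (E or D, in either case) followed by an optionally signed run of digits.

```perl
use strict;
use warnings;

my $float = '-?\d+\.\d*E?-?\+?\d*|-?\d*\.\d+E?-?\+?\d*';

# Hypothetical stricter alternative, not part of the program above.
my $strict_float = '[-+]?(?:\d+\.\d*|\.\d+)(?:[EeDd][-+]?\d+)?';

# Hypothetical helper: does the whole text match the pattern?
sub verdict {
   my ($pattern, $text) = @_;
   # The (?: ) wrapper keeps any alternation inside the anchors.
   return ($text =~ m/^(?:$pattern)\z/) ? 'yes' : 'no';
}

foreach my $text ('8.899999E-02', '.5', '20.', '1.5E', '1.5-+3') {
   printf("%-14s original: %-3s  stricter: %s\n", $text,
          verdict($float, $text), verdict($strict_float, $text));
}
```

Both patterns accept the first three values. The original also accepts the malformed 1.5E and 1.5-+3, which the stricter pattern rejects.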

$integer = '-?\d+';
An integer is one or more digits with an optional minus sign. Because floats can start with an integer, you have to test the line for a float match before testing it for an integer.
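
The ordering problem is easy to demonstrate on the FTIME line from the sample data. In this short sketch the variables $as_integer and $as_float are illustrative names only.

```perl
use strict;
use warnings;

my $integer = '-?\d+';
my $float   = '-?\d+\.\d*E?-?\+?\d*|-?\d*\.\d+E?-?\+?\d*';

my $line = 'FTIME = 0.343';

# In list context a match returns its captured substrings.
my ($as_integer) = $line =~ m/=\s*($integer)/;
my ($as_float)   = $line =~ m/=\s*($float)/;

print("integer test captures: $as_integer\n");
print("float test captures:   $as_float\n");
```

The integer test succeeds but captures only 0, the digits before the decimal point, whereas the float test captures 0.343.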

$string = '\'[^\']+\'';
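A string is a sequence of one or more non-single-quote characters enclosed in single quotes. Because the pattern is itself written in a single-quoted Perl string, the literal single quotes must be escaped here. ^ as the first character in square brackets means 'any character except the following ones'.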
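
The string pattern can be exercised in the same way. This is a sketch: string_value is a hypothetical helper, and the TITLE line is a made-up input, included to show that an empty string is not matched.

```perl
use strict;
use warnings;

my $string = '\'[^\']+\'';

# Hypothetical helper: the quoted value after '=', or undef if none.
sub string_value {
   my ($line) = @_;
   return ($line =~ m/=\s*($string)/) ? $1 : undef;
}

# q{...} quotes like '...' but lets the sample lines contain single quotes.
foreach my $line (q{MDS_PATH = '\top.INPUTS'}, q{TITLE = ''}) {
   my $quoted = string_value($line);
   if (defined($quoted)) {
      # Drop the first and last character to get the bare contents.
      my $bare = substr($quoted, 1, -1);
      print("quoted: $quoted  bare: $bare\n");
   } else {
      print("no string value in: $line\n");
   }
}
```

The MDS_PATH line yields '\top.INPUTS' with its quotes and \top.INPUTS without them. The TITLE line produces no match, because [^\']+ needs at least one character between the quotes.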

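Moving on, the file is opened and read a line at a time. Each line is tested against a compound pattern of the form name = value, into which the patterns discussed above are interpolated. This gets a bit complicated when the value is a list: there the value pattern sits inside a (?: ... ) group together with an optional comma and white space, and a + after the group lets it repeat once for each list element. Here is the simple case: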
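
The list case can be tried on the E0IN line from the sample data. This is a sketch only: float_list is a hypothetical helper, splitting the captured list into separate values goes beyond what the program does, and $float is wrapped in its own (?: ) group so that its | applies to a single number rather than to the whole list element.

```perl
use strict;
use warnings;

my $name  = '[A-Z][A-Z0-9_]+(?:\(\d\))?';
my $float = '-?\d+\.\d*E?-?\+?\d*|-?\d*\.\d+E?-?\+?\d*';

# Hypothetical helper: returns the name followed by the list values,
# or an empty list if the line is not a float assignment.
sub float_list {
   my ($line) = @_;
   return unless $line =~ m/^\s*($name)\s*=\s*((?:(?:$float)\s*,?\s*)+)/;
   my ($key, $list) = ($1, $2);
   $list =~ s/\s+\z//;   # the pattern also swallows trailing white space
   return ($key, split(/\s*,\s*/, $list));
}

my ($key, @values) =
   float_list('E0IN       = 20.0, 3.0, 0.55, 20.0, 3.0, 0.55 !  [eV ]');
print("$key has ", scalar(@values), " values: @values\n");
```

This reports that E0IN has 6 values and lists them. The trailing comment is ignored because the repeated group stops at the first character that cannot start another number.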

if (($line =~ m/^\s*($name)\s*=\s*($boolean)/) ||
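=~ is the binding operator. It applies the match operator m/.../ to the line, testing whether the line matches the pattern between the outer / delimiters. ^ anchors the match to the start of the line and \s* is optional white space. Each pair of capturing parentheses in the pattern stores the text matched by the sub-expression it contains in a variable called $1, $2 and so on; (?: ... ) groups are not counted. The tests are joined with ||, so the first one that succeeds sets $1 and $2, and the captured name/value pair is written to the %nameValues hash: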
$nameValues{$1} = $2;
Finally, the name/value pairs are printed in sorted order to prove the pudding.
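
The capture-and-store step can also be tried without a file. The sketch below runs a cut-down version of the loop over three lines from the sample data held in an array. Copying $1 and $2 into $key and $value and trimming trailing white space are additions made for illustration, not part of the program above.

```perl
use strict;
use warnings;

my $name    = '[A-Z][A-Z0-9_]+(?:\(\d\))?';
my $boolean = '\.true\.|\.TRUE\.|\.false\.|\.FALSE\.|\.T|\.F';
my $integer = '-?\d+';

my @lines = ("!HOST = pppl\n", "KMDSPLUS = 1\n", " NLCO(1) = .T\n");
my %nameValues;

foreach my $line (@lines) {
   if (($line =~ m/^\s*($name)\s*=\s*($boolean)/) ||
       ($line =~ m/^\s*($name)\s*=\s*((?:$integer\s*,?\s*)+)/)) {
      # $1 and $2 belong to whichever test succeeded; copy them at once.
      my ($key, $value) = ($1, $2);
      $value =~ s/\s+\z//;   # the list pattern also captures the newline
      $nameValues{$key} = $value;
   }
}

foreach my $n (sort(keys(%nameValues))) {
   print("$n = $nameValues{$n}\n");
}
```

The output is KMDSPLUS = 1 followed by NLCO(1) = .T. The commented-out !HOST line is skipped because it does not start with a name.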