org.apache.regexp
public class: RE
java.lang.Object
   org.apache.regexp.RE

All Implemented Interfaces:
    Serializable

RE is an efficient, lightweight regular expression evaluator/matcher class. Regular expressions are pattern descriptions which enable sophisticated matching of strings. In addition to being able to match a string against a pattern, you can also extract parts of the match. This is especially useful in text parsing! Details on the syntax of regular expression patterns are given below.

To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:

 RE r = new RE("a*b");

Once you have done this, you can call either of the RE.match methods to perform matching on a String. For example:

 boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".

If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:

 RE r = new RE("(a*)b");                  // Compile expression
 boolean matched = r.match("xaaaab");     // Match against "xaaaab"

 String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
 String insideParens = r.getParen(1);     // insideParens will be 'aaaa'

 int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
 int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
 int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5

 int startInside = r.getParenStart(1);    // startInside will be index 1
 int endInside = r.getParenEnd(1);        // endInside will be index 5
 int lenInside = r.getParenLength(1);     // lenInside will be 4
You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:
 ([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).
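
The backreference behaviour described above can be tried without the Jakarta Regexp jar. The following is a minimal sketch that uses the JDK's java.util.regex as a stand-in, on the assumption that \1 behaves the same way in both libraries for this pattern. The class name BackrefSketch and the helper containsRepeat are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class BackrefSketch {
    // Same pattern text as above; the backslash is doubled only because
    // this is a Java string literal. java.util.regex is a stand-in for RE.
    private static final Pattern SAME_NUMBER = Pattern.compile("([0-9]+)=\\1");

    // True if some substring of input has the form n=n.
    static boolean containsRepeat(String input) {
        return SAME_NUMBER.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsRepeat("0=0"));   // true
        System.out.println(containsRepeat("42=42")); // true
        System.out.println(containsRepeat("1=2"));   // false
    }
}
```

Because the search is unanchored, an input such as "12=2" also reports true: the substring "2=2" satisfies the pattern.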

The full regular expression syntax accepted by RE is described here:


 Characters

   unicodeChar          Matches any identical unicode character
   \                    Used to quote a meta-character (like '*')
   \\                   Matches a single '\' character
   \0nnn                Matches a given octal character
   \xhh                 Matches a given 8-bit hexadecimal character
   \uhhhh               Matches a given 16-bit hexadecimal character
   \t                   Matches an ASCII tab character
   \n                   Matches an ASCII newline character
   \r                   Matches an ASCII return character
   \f                   Matches an ASCII form feed character


 Character Classes

   [abc]                Simple character class
   [a-zA-Z]             Character class with ranges
   [^abc]               Negated character class
NOTE: Incomplete ranges are interpreted as "starts from zero" or "ends with last character".
That is, [-a] is the same as [\u0000-a], [a-] is the same as [a-\uFFFF], and [-] means "all characters".

 Standard POSIX Character Classes

   [:alnum:]            Alphanumeric characters.
   [:alpha:]            Alphabetic characters.
   [:blank:]            Space and tab characters.
   [:cntrl:]            Control characters.
   [:digit:]            Numeric characters.
   [:graph:]            Characters that are printable and are also visible.
                        (A space is printable, but not visible, while an
                        `a' is both.)
   [:lower:]            Lower-case alphabetic characters.
   [:print:]            Printable characters (characters that are not
                        control characters.)
   [:punct:]            Punctuation characters (characters that are not letters,
                        digits, control characters, or space characters).
   [:space:]            Space characters (such as space, tab, and formfeed,
                        to name a few).
   [:upper:]            Upper-case alphabetic characters.
   [:xdigit:]           Characters that are hexadecimal digits.


 Non-standard POSIX-style Character Classes

   [:javastart:]        Start of a Java identifier
   [:javapart:]         Part of a Java identifier


 Predefined Classes

   .         Matches any character other than newline
   \w        Matches a "word" character (alphanumeric plus "_")
   \W        Matches a non-word character
   \s        Matches a whitespace character
   \S        Matches a non-whitespace character
   \d        Matches a digit character
   \D        Matches a non-digit character


 Boundary Matchers

   ^         Matches only at the beginning of a line
   $         Matches only at the end of a line
   \b        Matches only at a word boundary
   \B        Matches only at a non-word boundary


 Greedy Closures

   A*        Matches A 0 or more times (greedy)
   A+        Matches A 1 or more times (greedy)
   A?        Matches A 1 or 0 times (greedy)
   A{n}      Matches A exactly n times (greedy)
   A{n,}     Matches A at least n times (greedy)
   A{n,m}    Matches A at least n but not more than m times (greedy)


 Reluctant Closures

   A*?       Matches A 0 or more times (reluctant)
   A+?       Matches A 1 or more times (reluctant)
   A??       Matches A 0 or 1 times (reluctant)


 Logical Operators

   AB        Matches A followed by B
   A|B       Matches either A or B
   (A)       Used for subexpression grouping
   (?:A)     Used for subexpression clustering (just like grouping but
             no backrefs)


 Backreferences

   \1    Backreference to 1st parenthesized subexpression
   \2    Backreference to 2nd parenthesized subexpression
   \3    Backreference to 3rd parenthesized subexpression
   \4    Backreference to 4th parenthesized subexpression
   \5    Backreference to 5th parenthesized subexpression
   \6    Backreference to 6th parenthesized subexpression
   \7    Backreference to 7th parenthesized subexpression
   \8    Backreference to 8th parenthesized subexpression
   \9    Backreference to 9th parenthesized subexpression

All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they match as many elements of the string as possible without causing the overall match to fail. If you want a closure to be reluctant (non-greedy), you can simply follow it with a '?'. A reluctant closure will match as few elements of the string as possible when finding matches. {m,n} closures don't currently support reluctancy.
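
The difference between greedy and reluctant closures is easiest to see side by side. The sketch below uses java.util.regex from the JDK as a stand-in for RE, assuming the two libraries agree on the semantics of * and *? described above. ClosureSketch and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClosureSketch {
    // Returns the text of the first match of regex in input, or null if none.
    // java.util.regex is used here as a stand-in for RE.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .* takes as much as it can while still allowing a final 'b'.
        System.out.println(firstMatch("a.*b", "aXbYb"));  // aXbYb
        // Reluctant: .*? stops at the first 'b' that completes the match.
        System.out.println(firstMatch("a.*?b", "aXbYb")); // aXb
    }
}
```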

Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

RE runs programs compiled by the RECompiler class. For reasons of efficiency, however, the RE matcher class does not include the actual regular expression compiler. If you want to pre-compile one or more regular expressions, the 'recompile' class can be invoked from the command line to produce compiled output like this:

   // Pre-compiled regular expression "a*b"
   char[] re1Instructions =
   {
       0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
       0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
       0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
       0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
       0x0000,
   };


   REProgram re1 = new REProgram(re1Instructions);
You can then construct a regular expression matcher (RE) object from the pre-compiled expression re1 and thus avoid the overhead of compiling the expression at runtime. If you require more dynamic regular expressions, you can construct a single RECompiler object and re-use it to compile each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and RECompiler are not threadsafe (for efficiency reasons, and because thread safety is deemed a rare requirement for these classes), so you will need to construct a separate compiler or matcher object for each thread (unless you do thread synchronization yourself). Once an expression has been compiled into an REProgram object, that REProgram can be safely shared across multiple threads and RE objects.
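
The sharing rule above (one compiled program, one matcher per thread) has a close analogue in the JDK. The sketch below illustrates it with java.util.regex rather than RE itself: Pattern plays the role of the shareable REProgram and each Matcher plays the role of a per-thread RE object. This is an analogy under that assumption, not the Jakarta Regexp API; SharedProgramSketch and matches are hypothetical names.

```java
import java.util.regex.Pattern;

public class SharedProgramSketch {
    // Compiled once and shared; stands in for a shareable REProgram.
    private static final Pattern PROGRAM = Pattern.compile("a*b");

    // Each call creates its own matcher, the analogue of giving every
    // thread its own RE object instead of sharing one.
    static boolean matches(String input) {
        return PROGRAM.matcher(input).find();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + ": " + matches("aaaab"));
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```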


ISSUES:

Field Summary
public static final  int MATCH_NORMAL    Specifies normal, case-sensitive matching behaviour. 
public static final  int MATCH_CASEINDEPENDENT    Flag to indicate that matching should be case-independent (folded) 
public static final  int MATCH_MULTILINE    Newlines should match as BOL/EOL (^ and $) 
public static final  int MATCH_SINGLELINE    Consider all input a single body of text - newlines are matched by . 
static final  char OP_END    The format of a node in a program is: [ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ], where char OPCODE is the instruction, char OPDATA is modifying data, and char OPNEXT is the next node (relative offset).
static final  char OP_BOL     
static final  char OP_EOL     
static final  char OP_ANY     
static final  char OP_ANYOF     
static final  char OP_BRANCH     
static final  char OP_ATOM     
static final  char OP_STAR     
static final  char OP_PLUS     
static final  char OP_MAYBE     
static final  char OP_ESCAPE     
static final  char OP_OPEN     
static final  char OP_OPEN_CLUSTER     
static final  char OP_CLOSE     
static final  char OP_CLOSE_CLUSTER     
static final  char OP_BACKREF     
static final  char OP_GOTO     
static final  char OP_NOTHING     
static final  char OP_CONTINUE     
static final  char OP_RELUCTANTSTAR     
static final  char OP_RELUCTANTPLUS     
static final  char OP_RELUCTANTMAYBE     
static final  char OP_POSIXCLASS     
static final  char E_ALNUM     
static final  char E_NALNUM     
static final  char E_BOUND     
static final  char E_NBOUND     
static final  char E_SPACE     
static final  char E_NSPACE     
static final  char E_DIGIT     
static final  char E_NDIGIT     
static final  char POSIX_CLASS_ALNUM     
static final  char POSIX_CLASS_ALPHA     
static final  char POSIX_CLASS_BLANK     
static final  char POSIX_CLASS_CNTRL     
static final  char POSIX_CLASS_DIGIT     
static final  char POSIX_CLASS_GRAPH     
static final  char POSIX_CLASS_LOWER     
static final  char POSIX_CLASS_PRINT     
static final  char POSIX_CLASS_PUNCT     
static final  char POSIX_CLASS_SPACE     
static final  char POSIX_CLASS_UPPER     
static final  char POSIX_CLASS_XDIGIT     
static final  char POSIX_CLASS_JSTART     
static final  char POSIX_CLASS_JPART     
static final  int maxNode     
static final  int MAX_PAREN     
static final  int offsetOpcode     
static final  int offsetOpdata     
static final  int offsetNext     
static final  int nodeSize     
 REProgram program     
transient  CharacterIterator search     
 int matchFlags     
 int maxParen     
transient  int parenCount     
transient  int start0     
transient  int end0     
transient  int start1     
transient  int end1     
transient  int start2     
transient  int end2     
transient  int[] startn     
transient  int[] endn     
transient  int[] startBackref     
transient  int[] endBackref     
public static final  int REPLACE_ALL    Flag bit that indicates that subst should replace all occurrences of this regular expression. 
public static final  int REPLACE_FIRSTONLY    Flag bit that indicates that subst should only replace the first occurrence of this regular expression. 
public static final  int REPLACE_BACKREFERENCES    Flag bit that indicates that subst should replace backreferences 
Constructor:
 public RE() 
 public RE(String pattern) throws RESyntaxException 
    Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.
    Parameters:
    pattern - The regular expression pattern to compile.
    Throws:
    RESyntaxException - Thrown if the regular expression has invalid syntax.
    Also see:
    RECompiler
    recompile
 public RE(REProgram program) 
    Construct a matcher for a pre-compiled regular expression from program (bytecode) data.
 public RE(String pattern,
    int matchFlags) throws RESyntaxException 
    Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.
    Parameters:
    pattern - The regular expression pattern to compile.
    matchFlags - The matching style
    Throws:
    RESyntaxException - Thrown if the regular expression has invalid syntax.
    Also see:
    RECompiler
    recompile
 public RE(REProgram program,
    int matchFlags) 
    Construct a matcher for a pre-compiled regular expression from program (bytecode) data. Permits special flags to be passed in to modify matching behaviour.
    Parameters:
    program - Compiled regular expression program (see RECompiler and/or recompile)
    matchFlags - One or more of the RE match behaviour flags (RE.MATCH_*):
      MATCH_NORMAL              // Normal (case-sensitive) matching
      MATCH_CASEINDEPENDENT     // Case folded comparisons
      MATCH_MULTILINE           // Newline matches as BOL/EOL
    

    Also see:
    RECompiler
    REProgram
    recompile
Method from org.apache.regexp.RE Summary:
getMatchFlags,   getParen,   getParenCount,   getParenEnd,   getParenLength,   getParenStart,   getProgram,   grep,   internalError,   match,   match,   match,   matchAt,   matchNodes,   setMatchFlags,   setParenEnd,   setParenStart,   setProgram,   simplePatternToFullRegularExpression,   split,   subst,   subst
Methods from java.lang.Object:
clone,   equals,   finalize,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.regexp.RE Detail:
 public int getMatchFlags() 
    Returns the current match behaviour flags.
 public String getParen(int which) 
    Gets the contents of a parenthesized subexpression after a successful match.
 public int getParenCount() 
    Returns the number of parenthesized subexpressions available after a successful match.
 public final int getParenEnd(int which) 
    Returns the end index of a given paren level.
 public final int getParenLength(int which) 
    Returns the length of a given paren level.
 public final int getParenStart(int which) 
    Returns the start index of a given paren level.
 public REProgram getProgram() 
    Returns the current regular expression program in use by this matcher object.
 public String[] grep(Object[] search) 
    Returns, as an array of Strings, the elements of search whose toString representation matches the regular expression. This method works like the Perl function of the same name. Given a regular expression of "a*b" and an array of String objects of [foo, aab, zzz, aaaab], the array of Strings returned by grep would be [aab, aaaab].
 protected  void internalError(String s) throws Error 
    Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption). In practice, this should be very rare.
 public boolean match(String search) 
    Matches the current regular expression program against a String.
 public boolean match(String search,
    int i) 
    Matches the current regular expression program against a String, starting at a given index.
 public boolean match(CharacterIterator search,
    int i) 
    Matches the current regular expression program against the characters supplied by a CharacterIterator, starting at a given index.
 protected boolean matchAt(int i) 
    Match the current regular expression program against the current input string, starting at index i of the input string. This method is only meant for internal use.
 protected int matchNodes(int firstNode,
    int lastNode,
    int idxStart) 
    Try to match a string against a subset of nodes in the program
 public  void setMatchFlags(int matchFlags) 
    Sets match behaviour flags which alter the way RE does matching.
 protected final  void setParenEnd(int which,
    int i) 
    Sets the end of a paren level
 protected final  void setParenStart(int which,
    int i) 
    Sets the start of a paren level
 public  void setProgram(REProgram program) 
    Sets the current regular expression program used by this matcher object.
 public static String simplePatternToFullRegularExpression(String pattern) 
    Converts a 'simplified' regular expression to a full regular expression
 public String[] split(String s) 
    Splits a string into an array of strings on regular expression boundaries. This function works the same way as the Perl function of the same name. Given a regular expression of "[ab]+" and a string to split of "xyzzyababbayyzabbbab123", the result would be the array of Strings "[xyzzy, yyz, 123]".

    Please note that the first string in the resulting array may be an empty string. This happens when the very first character of the input string is matched by the pattern.
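
The two split cases described above (the documented example and the leading empty string) can be reproduced with a short sketch. It uses java.util.regex.Pattern.split from the JDK (Java 8 or later) as a stand-in, on the assumption that RE.split behaves the same way for these inputs. SplitSketch and its split helper are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitSketch {
    // Separator pattern from the example above; java.util.regex stands in for RE.
    private static final Pattern SEPARATOR = Pattern.compile("[ab]+");

    static String[] split(String input) {
        return SEPARATOR.split(input);
    }

    public static void main(String[] args) {
        // Separators in the middle of the input.
        System.out.println(Arrays.toString(split("xyzzyababbayyzabbbab123"))); // [xyzzy, yyz, 123]
        // Input starts with a separator, so the first element is empty.
        System.out.println(Arrays.toString(split("abxyz")));                   // [, xyz]
    }
}
```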

 public String subst(String substituteIn,
    String substitution) 
    Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".
 public String subst(String substituteIn,
    String substitution,
    int flags) 
    Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".

    It is also possible to reference the contents of a parenthesized expression with $0, $1, ... $9. Given a regular expression of "http://[\\.\\w\\-\\?/~_@&=%]+", a String to substituteIn of "visit us: http://www.apache.org!" and the substitution String "<a href=\"$0\">$0</a>", the resulting String returned by subst would be "visit us: <a href=\"http://www.apache.org\">http://www.apache.org</a>!".

    Note: $0 represents the whole match.
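
Both substitution examples above can be checked with a small sketch. It uses Matcher.replaceAll from java.util.regex as a stand-in for subst, assuming the same results for these inputs. One difference to keep in mind: java.util.regex always expands $0 in the replacement, whereas with RE this presumably depends on passing the REPLACE_BACKREFERENCES flag listed in the field summary. SubstSketch and substitute are hypothetical names.

```java
import java.util.regex.Pattern;

public class SubstSketch {
    // Replaces every match of regex in input with replacement.
    // java.util.regex is used as a stand-in for RE.subst.
    static String substitute(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Plain substitution.
        System.out.println(substitute("a*b", "aaaabfooaaabgarplyaaabwackyb", "-"));
        // prints: -foo-garply-wacky-

        // $0 in the replacement stands for the whole match.
        System.out.println(substitute(
                "http://[\\.\\w\\-\\?/~_@&=%]+",
                "visit us: http://www.apache.org!",
                "<a href=\"$0\">$0</a>"));
        // prints: visit us: <a href="http://www.apache.org">http://www.apache.org</a>!
    }
}
```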