org.apache.xerces.utils.regex
Class RegularExpression

java.lang.Object
  extended by org.apache.xerces.utils.regex.RegularExpression
All Implemented Interfaces:
java.io.Serializable

public class RegularExpression
extends java.lang.Object
implements java.io.Serializable

A regular expression matching engine based on a nondeterministic finite automaton (NFA). This engine does not conform to the POSIX regular expression specification.


How to use

A. Standard way
 RegularExpression re = new RegularExpression(regex);
 if (re.matches(text)) { ... }
 
B. Capturing groups
 RegularExpression re = new RegularExpression(regex);
 Match match = new Match();
 if (re.matches(text, match)) {
     ... // You can refer to the captured text through the methods of the Match class.
 }
 

Case-insensitive matching

 RegularExpression re = new RegularExpression(regex, "i");
 if (re.matches(text)) { ... }
 

Options

You can pass options to RegularExpression(regex, options) or setPattern(regex, options). The options parameter is a string made up of the following characters.

"i"
This option indicates case-insensitive matching.
"m"
^ and $ take the EOL characters within the text into account, so they also match at line boundaries.
"s"
. matches any one character, including EOL characters.
"u"
Redefines \d \D \w \W \s \S \b \B \< \> according to Unicode.
"w"
With this option, \b \B \< \> are processed using the method of 'Unicode Regular Expression Guidelines' Revision 4. When "w" and "u" are specified together, \b \B \< \> follow the "w" option.
","
The parser treats a comma in a character class as a range separator. [a,b] matches a or , or b without this option. [a,b] matches a or b with this option.
"X"
With this option, the engine conforms to XML Schema: Regular Expression. The matches() method does not do substring matching but entire-string matching.
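The option letters above can be pictured as a bit mask. The following is only a sketch of that idea: the class name OptionFlags and the bit values are assumptions made for illustration, since this page does not give the values of the package-private constants. It covers only the seven letters listed in this section.

```java
/** Sketch: translate an option string such as "im" into a bit mask and back. */
final class OptionFlags {
    // Hypothetical bit values; the real constants are package-private and
    // their values are not shown on this page.
    static final int IGNORE_CASE = 1 << 0;            // "i"
    static final int MULTIPLE_LINES = 1 << 1;         // "m"
    static final int SINGLE_LINE = 1 << 2;            // "s"
    static final int USE_UNICODE_CATEGORY = 1 << 3;   // "u"
    static final int UNICODE_WORD_BOUNDARY = 1 << 4;  // "w"
    static final int SPECIAL_COMMA = 1 << 5;          // ","
    static final int XMLSCHEMA_MODE = 1 << 6;         // "X"

    // Option letters in a fixed order; the letter at index i maps to bit i.
    private static final String LETTERS = "imsuw,X";

    /** Converts an option string to a bit mask; rejects unknown letters. */
    static int parse(String options) {
        int flags = 0;
        for (int i = 0; i < options.length(); i++) {
            int bit = LETTERS.indexOf(options.charAt(i));
            if (bit < 0) {
                throw new IllegalArgumentException("Unknown option: " + options.charAt(i));
            }
            flags |= 1 << bit;
        }
        return flags;
    }

    /** Converts a bit mask back to letters, always in the fixed LETTERS order. */
    static String format(int flags) {
        StringBuilder sb = new StringBuilder();
        for (int bit = 0; bit < LETTERS.length(); bit++) {
            if ((flags & (1 << bit)) != 0) {
                sb.append(LETTERS.charAt(bit));
            }
        }
        return sb.toString();
    }
}
```

Because a bit mask does not remember the order in which letters were given, format(parse("si")) yields "is" in this sketch. A design like this would explain why getOptions() may return the letters in a different order from the string that was passed in.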

Syntax

Differences from the Perl 5 regular expression

  • There is a 6-digit hexadecimal character representation (\vHHHHHH).
  • Supports subtraction, union, and intersection operations for character classes.
  • Not supported: \ooo (Octal character representations), \G, \C, \lc, \uc, \L, \U, \E, \Q, \N{name}, (?{code}), (??{code})
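To make the 6-digit \vHHHHHH notation concrete, here is a minimal sketch that decodes it together with the two \x forms that the code-point rule of the grammar lists. The class and method names are hypothetical, and this is not the parser used by the engine. It only shows how each escape maps to a code point.

```java
/** Sketch: decode the three code-point escape forms (\xHH, \x{H+}, \vHHHHHH). */
final class CodePointEscape {
    private static final String HEX_DIGITS = "0123456789abcdefABCDEF";

    /** Returns the code point denoted by one complete escape sequence. */
    static int decode(String esc) {
        String hex;
        if (esc.startsWith("\\x{") && esc.endsWith("}")) {
            hex = esc.substring(3, esc.length() - 1);   // \x{H+}
        } else if (esc.startsWith("\\x") && esc.length() == 4) {
            hex = esc.substring(2);                     // \xHH
        } else if (esc.startsWith("\\v") && esc.length() == 8) {
            hex = esc.substring(2);                     // \vHHHHHH
        } else {
            throw new IllegalArgumentException("Unsupported escape: " + esc);
        }
        if (hex.isEmpty()) {
            throw new IllegalArgumentException("No hex digits in: " + esc);
        }
        // Check digits by hand: Integer.parseInt would also accept a sign.
        for (int i = 0; i < hex.length(); i++) {
            if (HEX_DIGITS.indexOf(hex.charAt(i)) < 0) {
                throw new IllegalArgumentException("Bad hex digit in: " + esc);
            }
        }
        int codePoint = Integer.parseInt(hex, 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Out of Unicode range: " + esc);
        }
        return codePoint;
    }
}
```

Six hexadecimal digits are enough to cover every Unicode code point up to 10FFFF, which is presumably why the \v form has that fixed width.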

Meta characters are `. * + ? { [ ( ) | \ ^ $'.


BNF for the regular expression

 regex ::= ('(?' options ')')? term ('|' term)*
 term ::= factor+
 factor ::= anchors | atom (('*' | '+' | '?' | minmax ) '?'? )?
            | '(?#' [^)]* ')'
 minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
 atom ::= char | '.' | char-class | '(' regex ')' | '(?:' regex ')' | '\' [0-9]
          | '\w' | '\W' | '\d' | '\D' | '\s' | '\S' | category-block | '\X'
          | '(?>' regex ')' | '(?' options ':' regex ')'
          | '(?' ('(' [0-9] ')' | '(' anchors ')' | looks) term ('|' term)? ')'
 options ::= [imsw]* ('-' [imsw]+)?
 anchors ::= '^' | '$' | '\A' | '\Z' | '\z' | '\b' | '\B' | '\<' | '\>'
 looks ::= '(?=' regex ')'  | '(?!' regex ')'
           | '(?<=' regex ')' | '(?<!' regex ')'
 char ::= '\\' | '\' [efnrtv] | '\c' [@-_] | code-point | character-1
 category-block ::= '\' [pP] category-symbol-1
                    | ('\p{' | '\P{') (category-symbol | block-name
                                       | other-properties) '}'
 category-symbol-1 ::= 'L' | 'M' | 'N' | 'Z' | 'C' | 'P' | 'S'
 category-symbol ::= category-symbol-1 | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo'
                     | 'Mn' | 'Me' | 'Mc' | 'Nd' | 'Nl' | 'No'
                     | 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cn' | 'Co' | 'Cs'
                     | 'Pd' | 'Ps' | 'Pe' | 'Pc' | 'Po'
                     | 'Sm' | 'Sc' | 'Sk' | 'So'
 block-name ::= (See above)
 other-properties ::= 'ALL' | 'ASSIGNED' | 'UNASSIGNED'
 character-1 ::= (any character except meta-characters)

 char-class ::= '[' ranges ']'
                | '(?[' ranges ']' ([-+&] '[' ranges ']')? ')'
 ranges ::= '^'? (range ','?)+
 range ::= '\d' | '\w' | '\s' | '\D' | '\W' | '\S' | category-block
           | range-char | range-char '-' range-char
 range-char ::= '\[' | '\]' | '\\' | '\' [,-efnrtv] | code-point | character-2
 code-point ::= '\x' hex-char hex-char
                | '\x{' hex-char+ '}'
                | '\v' hex-char hex-char hex-char hex-char hex-char hex-char
 hex-char ::= [0-9a-fA-F]
 character-2 ::= (any character except \[]-,)
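As an illustration of how one rule of this grammar can be read, the sketch below parses the four minmax forms ({n}, {n,}, {,m}, {n,m}) into a pair of bounds. The class MinMax is hypothetical, and the interpretation of the bounds (a missing minimum means 0, a missing maximum means unbounded) is the usual convention rather than something this page states.

```java
/** Sketch: parse the minmax quantifier forms {n}, {n,}, {,m} and {n,m}. */
final class MinMax {
    static final int UNBOUNDED = -1;   // marker for "no upper limit"

    final int min;
    final int max;

    private MinMax(int min, int max) {
        this.min = min;
        this.max = max;
    }

    static MinMax parse(String s) {
        if (s.length() < 3 || s.charAt(0) != '{' || s.charAt(s.length() - 1) != '}') {
            throw new IllegalArgumentException("Not a minmax: " + s);
        }
        String body = s.substring(1, s.length() - 1);
        int comma = body.indexOf(',');
        if (comma < 0) {                       // {n}: exactly n times
            int n = number(body);
            return new MinMax(n, n);
        }
        String low = body.substring(0, comma);
        String high = body.substring(comma + 1);
        if (low.isEmpty() && high.isEmpty()) { // "{,}" is not in the grammar
            throw new IllegalArgumentException("Not a minmax: " + s);
        }
        int min = low.isEmpty() ? 0 : number(low);            // {,m}
        int max = high.isEmpty() ? UNBOUNDED : number(high);  // {n,}
        return new MinMax(min, max);
    }

    // Accepts only [0-9]+, matching the grammar's digit runs.
    private static int number(String digits) {
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("Missing number");
        }
        for (int i = 0; i < digits.length(); i++) {
            char c = digits.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("Not a number: " + digits);
            }
        }
        return Integer.parseInt(digits);
    }
}
```

The sketch does not check that min is no greater than max, because the grammar itself places no such restriction.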
 

TODO



Nested Class Summary
(package private) static class RegularExpression.Context
           
 
Field Summary
(package private) static int CARRIAGE_RETURN
           
(package private)  RegularExpression.Context context
           
(package private) static boolean DEBUG
           
(package private) static int EXTENDED_COMMENT
          "x"
(package private)  RangeToken firstChar
           
(package private)  java.lang.String fixedString
           
(package private)  boolean fixedStringOnly
           
(package private)  int fixedStringOptions
           
(package private)  BMPattern fixedStringTable
           
(package private)  boolean hasBackReferences
           
(package private) static int IGNORE_CASE
          "i"
(package private) static int LINE_FEED
           
(package private) static int LINE_SEPARATOR
           
(package private)  int minlength
           
(package private) static int MULTIPLE_LINES
          "m"
(package private)  int nofparen
          The number of parentheses in the regular expression.
(package private)  int numberOfClosures
           
(package private)  Op operations
           
(package private)  int options
           
(package private) static int PARAGRAPH_SEPARATOR
           
(package private) static int PROHIBIT_FIXED_STRING_OPTIMIZATION
          "F"
(package private) static int PROHIBIT_HEAD_CHARACTER_OPTIMIZATION
          "H"
(package private)  java.lang.String regex
          A regular expression.
(package private) static int SINGLE_LINE
          "s"
(package private) static int SPECIAL_COMMA
          ",".
(package private)  Token tokentree
          Internal representation of the regular expression.
(package private) static int UNICODE_WORD_BOUNDARY
          An option.
(package private) static int USE_UNICODE_CATEGORY
          This option redefines \d \D \w \W \s \S.
(package private) static Token wordchar
           
private static int WT_IGNORE
           
private static int WT_LETTER
           
private static int WT_OTHER
           
(package private) static int XMLSCHEMA_MODE
          "X".
 
Constructor Summary
  RegularExpression(java.lang.String regex)
          Creates a new RegularExpression instance.
  RegularExpression(java.lang.String regex, java.lang.String options)
          Creates a new RegularExpression instance with options.
(package private) RegularExpression(java.lang.String regex, Token tok, int parens, boolean hasBackReferences, int options)
           
 
Method Summary
private  void compile(Token tok)
          Compiles a token tree into an operation flow.
private  Op compile(Token tok, Op next, boolean reverse)
          Converts a token to an operation.
 boolean equals(java.lang.Object obj)
          Return true if patterns are the same and the options are equivalent.
(package private)  boolean equals(java.lang.String pattern, int options)
           
 int getNumberOfGroups()
          Return the number of regular expression groups.
 java.lang.String getOptions()
          Returns an option string.
 java.lang.String getPattern()
           
private static int getPreviousWordType(char[] target, int begin, int end, int offset, int opts)
           
private static int getPreviousWordType(java.text.CharacterIterator target, int begin, int end, int offset, int opts)
           
private static int getPreviousWordType(java.lang.String target, int begin, int end, int offset, int opts)
           
private static int getWordType(char[] target, int begin, int end, int offset, int opts)
           
private static int getWordType(java.text.CharacterIterator target, int begin, int end, int offset, int opts)
           
private static int getWordType(java.lang.String target, int begin, int end, int offset, int opts)
           
private static int getWordType0(char ch, int opts)
           
 int hashCode()
          Get a value that represents this Object, as uniquely as possible within the confines of an int.
private static boolean isEOLChar(int ch)
           
private static boolean isSet(int options, int flag)
           
private static boolean isWordChar(int ch)
           
private  int matchCharacterIterator(RegularExpression.Context con, Op op, int offset, int dx, int opts)
           
private  int matchCharArray(RegularExpression.Context con, Op op, int offset, int dx, int opts)
           
 boolean matches(char[] target)
          Checks whether the target text contains this pattern or not.
 boolean matches(char[] target, int start, int end)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(char[] target, int start, int end, Match match)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(char[] target, Match match)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.text.CharacterIterator target)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.text.CharacterIterator target, Match match)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.lang.String target)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.lang.String target, int start, int end)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(java.lang.String target, int start, int end, Match match)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(java.lang.String target, Match match)
          Checks whether the target text contains this pattern or not.
private static boolean matchIgnoreCase(int chardata, int ch)
           
private  int matchString(RegularExpression.Context con, Op op, int offset, int dx, int opts)
           
(package private)  void prepare()
          Prepares for matching.
private static boolean regionMatches(char[] target, int offset, int limit, int offset2, int partlen)
           
private static boolean regionMatches(char[] target, int offset, int limit, java.lang.String part, int partlen)
           
private static boolean regionMatches(java.text.CharacterIterator target, int offset, int limit, int offset2, int partlen)
           
private static boolean regionMatches(java.text.CharacterIterator target, int offset, int limit, java.lang.String part, int partlen)
           
private static boolean regionMatches(java.lang.String text, int offset, int limit, int offset2, int partlen)
           
private static boolean regionMatches(java.lang.String text, int offset, int limit, java.lang.String part, int partlen)
           
private static boolean regionMatchesIgnoreCase(char[] target, int offset, int limit, int offset2, int partlen)
           
private static boolean regionMatchesIgnoreCase(char[] target, int offset, int limit, java.lang.String part, int partlen)
           
private static boolean regionMatchesIgnoreCase(java.text.CharacterIterator target, int offset, int limit, int offset2, int partlen)
           
private static boolean regionMatchesIgnoreCase(java.text.CharacterIterator target, int offset, int limit, java.lang.String part, int partlen)
           
private static boolean regionMatchesIgnoreCase(java.lang.String text, int offset, int limit, int offset2, int partlen)
           
private static boolean regionMatchesIgnoreCase(java.lang.String text, int offset, int limit, java.lang.String part, int partlen)
           
 void setPattern(java.lang.String newPattern)
           
private  void setPattern(java.lang.String newPattern, int options)
           
 void setPattern(java.lang.String newPattern, java.lang.String options)
           
 java.lang.String toString()
          Returns a string representation of this instance.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

DEBUG

static final boolean DEBUG
See Also:
Constant Field Values

regex

java.lang.String regex
A regular expression.


options

int options

nofparen

int nofparen
The number of parentheses in the regular expression.


tokentree

Token tokentree
Internal representation of the regular expression.


hasBackReferences

boolean hasBackReferences

minlength

transient int minlength

operations

transient Op operations

numberOfClosures

transient int numberOfClosures

context

transient RegularExpression.Context context

firstChar

transient RangeToken firstChar

fixedString

transient java.lang.String fixedString

fixedStringOptions

transient int fixedStringOptions

fixedStringTable

transient BMPattern fixedStringTable

fixedStringOnly

transient boolean fixedStringOnly

IGNORE_CASE

static final int IGNORE_CASE
"i"

See Also:
Constant Field Values

SINGLE_LINE

static final int SINGLE_LINE
"s"

See Also:
Constant Field Values

MULTIPLE_LINES

static final int MULTIPLE_LINES
"m"

See Also:
Constant Field Values

EXTENDED_COMMENT

static final int EXTENDED_COMMENT
"x"

See Also:
Constant Field Values

USE_UNICODE_CATEGORY

static final int USE_UNICODE_CATEGORY
This option redefines \d \D \w \W \s \S.

See Also:
RegularExpression(java.lang.String,int), setPattern(java.lang.String,int), UNICODE_WORD_BOUNDARY, Constant Field Values

UNICODE_WORD_BOUNDARY

static final int UNICODE_WORD_BOUNDARY
An option. This enables locale-independent word-boundary processing for \b \B \< \>.

By default, the engine treats a position between a word character (\w) and a non-word character as a word boundary.

With this option, the engine checks word boundaries using the method of 'Unicode Regular Expression Guidelines' Revision 4.

See Also:
RegularExpression(java.lang.String,int), setPattern(java.lang.String,int), Constant Field Values
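The default rule described for UNICODE_WORD_BOUNDARY, a boundary between a word character and a non-word character, can be sketched as follows. Two points are assumptions of this sketch and are not taken from this page: \w is taken to mean ASCII letters, digits, and underscore, and the two ends of the text are treated as non-word positions. The class name is hypothetical.

```java
/** Sketch: the default (non-Unicode) word-boundary test between two characters. */
final class WordBoundary {
    // Assumption: \w covers ASCII letters, digits and underscore only.
    static boolean isWordChar(char ch) {
        return (ch >= 'a' && ch <= 'z')
            || (ch >= 'A' && ch <= 'Z')
            || (ch >= '0' && ch <= '9')
            || ch == '_';
    }

    /**
     * Returns true if position pos (0 to text.length(), inclusive) lies between
     * a word character and a non-word character. The text edges count as
     * non-word positions, which is an assumption of this sketch.
     */
    static boolean isBoundary(String text, int pos) {
        boolean wordBefore = pos > 0 && isWordChar(text.charAt(pos - 1));
        boolean wordAfter = pos < text.length() && isWordChar(text.charAt(pos));
        return wordBefore != wordAfter;
    }
}
```

For the text "ab cd", this test reports boundaries at positions 0, 2, 3 and 5, and none at positions 1 and 4, which fall inside a word.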

PROHIBIT_HEAD_CHARACTER_OPTIMIZATION

static final int PROHIBIT_HEAD_CHARACTER_OPTIMIZATION
"H"

See Also:
Constant Field Values

PROHIBIT_FIXED_STRING_OPTIMIZATION

static final int PROHIBIT_FIXED_STRING_OPTIMIZATION
"F"

See Also:
Constant Field Values

XMLSCHEMA_MODE

static final int XMLSCHEMA_MODE
"X". XML Schema mode.

See Also:
Constant Field Values

SPECIAL_COMMA

static final int SPECIAL_COMMA
",".

See Also:
Constant Field Values

WT_IGNORE

private static final int WT_IGNORE
See Also:
Constant Field Values

WT_LETTER

private static final int WT_LETTER
See Also:
Constant Field Values

WT_OTHER

private static final int WT_OTHER
See Also:
Constant Field Values

wordchar

static transient Token wordchar

LINE_FEED

static final int LINE_FEED
See Also:
Constant Field Values

CARRIAGE_RETURN

static final int CARRIAGE_RETURN
See Also:
Constant Field Values

LINE_SEPARATOR

static final int LINE_SEPARATOR
See Also:
Constant Field Values

PARAGRAPH_SEPARATOR

static final int PARAGRAPH_SEPARATOR
See Also:
Constant Field Values
Constructor Detail

RegularExpression

public RegularExpression(java.lang.String regex)
                  throws ParseException
Creates a new RegularExpression instance.


RegularExpression

public RegularExpression(java.lang.String regex,
                         java.lang.String options)
                  throws ParseException
Creates a new RegularExpression instance with options.


RegularExpression

RegularExpression(java.lang.String regex,
                  Token tok,
                  int parens,
                  boolean hasBackReferences,
                  int options)
Method Detail

compile

private void compile(Token tok)
Compiles a token tree into an operation flow.


compile

private Op compile(Token tok,
                   Op next,
                   boolean reverse)
Converts a token to an operation.


matches

public boolean matches(char[] target)
Checks whether the target text contains this pattern or not.


matches

public boolean matches(char[] target,
                       int start,
                       int end)
Checks whether the target text contains this pattern in specified range or not.


matches

public boolean matches(char[] target,
                       Match match)
Checks whether the target text contains this pattern or not.


matches

public boolean matches(char[] target,
                       int start,
                       int end,
                       Match match)
Checks whether the target text contains this pattern in specified range or not.


matchCharArray

private int matchCharArray(RegularExpression.Context con,
                           Op op,
                           int offset,
                           int dx,
                           int opts)

getPreviousWordType

private static final int getPreviousWordType(char[] target,
                                             int begin,
                                             int end,
                                             int offset,
                                             int opts)

getWordType

private static final int getWordType(char[] target,
                                     int begin,
                                     int end,
                                     int offset,
                                     int opts)

regionMatches

private static final boolean regionMatches(char[] target,
                                           int offset,
                                           int limit,
                                           java.lang.String part,
                                           int partlen)

regionMatches

private static final boolean regionMatches(char[] target,
                                           int offset,
                                           int limit,
                                           int offset2,
                                           int partlen)

regionMatchesIgnoreCase

private static final boolean regionMatchesIgnoreCase(char[] target,
                                                     int offset,
                                                     int limit,
                                                     java.lang.String part,
                                                     int partlen)

regionMatchesIgnoreCase

private static final boolean regionMatchesIgnoreCase(char[] target,
                                                     int offset,
                                                     int limit,
                                                     int offset2,
                                                     int partlen)

matches

public boolean matches(java.lang.String target)
Checks whether the target text contains this pattern or not.


matches

public boolean matches(java.lang.String target,
                       int start,
                       int end)
Checks whether the target text contains this pattern in specified range or not.


matches

public boolean matches(java.lang.String target,
                       Match match)
Checks whether the target text contains this pattern or not.


matches

public boolean matches(java.lang.String target,
                       int start,
                       int end,
                       Match match)
Checks whether the target text contains this pattern in specified range or not.
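For readers without this package on the classpath, the idea of a "contains within a range" search with captured text can be tried with the JDK's own java.util.regex classes. This is a stand-in and not the Xerces API: it uses Matcher.region and Matcher.find, where the end index is exclusive. How this class interprets start and end is not stated on this page. The helper name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Stand-in sketch using java.util.regex, not the Xerces classes. */
final class RangeSearchDemo {
    /**
     * Searches target[start, end) for the pattern and returns the text of
     * capture group 1 from the first match, or null if nothing matches.
     * The regex must contain at least one capturing group.
     */
    static String firstGroupInRange(String regex, String target, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(target);
        m.region(start, end);              // restrict the search window
        return m.find() ? m.group(1) : null;
    }
}
```

With the pattern (b+) and the target "aabbbcc", a search over the whole string captures "bbb", while a search restricted to the first two characters finds nothing.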


matchString

private int matchString(RegularExpression.Context con,
                        Op op,
                        int offset,
                        int dx,
                        int opts)

getPreviousWordType

private static final int getPreviousWordType(java.lang.String target,
                                             int begin,
                                             int end,
                                             int offset,
                                             int opts)

getWordType

private static final int getWordType(java.lang.String target,
                                     int begin,
                                     int end,
                                     int offset,
                                     int opts)

regionMatches

private static final boolean regionMatches(java.lang.String text,
                                           int offset,
                                           int limit,
                                           java.lang.String part,
                                           int partlen)

regionMatches

private static final boolean regionMatches(java.lang.String text,
                                           int offset,
                                           int limit,
                                           int offset2,
                                           int partlen)

regionMatchesIgnoreCase

private static final boolean regionMatchesIgnoreCase(java.lang.String text,
                                                     int offset,
                                                     int limit,
                                                     java.lang.String part,
                                                     int partlen)

regionMatchesIgnoreCase

private static final boolean regionMatchesIgnoreCase(java.lang.String text,
                                                     int offset,
                                                     int limit,
                                                     int offset2,
                                                     int partlen)

matches

public boolean matches(java.text.CharacterIterator target)
Checks whether the target text contains this pattern or not.


matches

public boolean matches(java.text.CharacterIterator target,
                       Match match)
Checks whether the target text contains this pattern or not.


matchCharacterIterator

private int matchCharacterIterator(RegularExpression.Context con,
                                   Op op,
                                   int offset,
                                   int dx,
                                   int opts)

getPreviousWordType

private static final int getPreviousWordType(java.text.CharacterIterator target,
                                             int begin,
                                             int end,
                                             int offset,
                                             int opts)

getWordType

private static final int getWordType(java.text.CharacterIterator target,
                                     int begin,
                                     int end,
                                     int offset,
                                     int opts)

regionMatches

private static final boolean regionMatches(java.text.CharacterIterator target,
                                           int offset,
                                           int limit,
                                           java.lang.String part,
                                           int partlen)

regionMatches

private static final boolean regionMatches(java.text.CharacterIterator target,
                                           int offset,
                                           int limit,
                                           int offset2,
                                           int partlen)

regionMatchesIgnoreCase

private static final boolean regionMatchesIgnoreCase(java.text.CharacterIterator target,
                                                     int offset,
                                                     int limit,
                                                     java.lang.String part,
                                                     int partlen)

regionMatchesIgnoreCase

private static final boolean regionMatchesIgnoreCase(java.text.CharacterIterator target,
                                                     int offset,
                                                     int limit,
                                                     int offset2,
                                                     int partlen)

prepare

void prepare()
Prepares for matching. This method is called just before starting matching.


isSet

private static final boolean isSet(int options,
                                   int flag)

setPattern

public void setPattern(java.lang.String newPattern)
                throws ParseException

setPattern

private void setPattern(java.lang.String newPattern,
                        int options)
                 throws ParseException

setPattern

public void setPattern(java.lang.String newPattern,
                       java.lang.String options)
                throws ParseException

getPattern

public java.lang.String getPattern()

toString

public java.lang.String toString()
Returns a string representation of this instance.


getOptions

public java.lang.String getOptions()
Returns an option string. The order of letters in it may differ from the string passed to a constructor or to setPattern().


equals

public boolean equals(java.lang.Object obj)
Return true if patterns are the same and the options are equivalent.


equals

boolean equals(java.lang.String pattern,
               int options)

hashCode

public int hashCode()
Description copied from class: java.lang.Object
Get a value that represents this Object, as uniquely as possible within the confines of an int.

There are some requirements on this method which subclasses must follow:

  • Semantic equality implies identical hashcodes. In other words, if a.equals(b) is true, then a.hashCode() == b.hashCode() must be as well. However, the reverse is not necessarily true, and two objects may have the same hashcode without being equal.
  • It must be consistent. Whichever value o.hashCode() returns on the first invocation must be the value returned on all later invocations as long as the object exists. Notice, however, that the result of hashCode may change between separate executions of a Virtual Machine, because it is not invoked on the same object.

Notice that since hashCode is used in java.util.Hashtable and other hashing classes, a poor implementation will degrade the performance of hashing (so don't blindly implement it as returning a constant!). Also, if calculating the hash is time-consuming, a class may consider caching the results.

The default implementation returns System.identityHashCode(this).
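The contract above pairs naturally with the equals() rule of this class (same pattern, equivalent options). The following sketch shows one way to satisfy both for a value made of a pattern string and an option mask. PatternKey is a hypothetical class written for illustration, not the implementation used by RegularExpression.

```java
import java.util.Objects;

/** Sketch: equals() and hashCode() kept consistent for a pattern/options pair. */
final class PatternKey {
    private final String pattern;
    private final int options;

    PatternKey(String pattern, int options) {
        this.pattern = Objects.requireNonNull(pattern, "pattern");
        this.options = options;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof PatternKey)) {
            return false;                  // also covers obj == null
        }
        PatternKey other = (PatternKey) obj;
        return pattern.equals(other.pattern) && options == other.options;
    }

    @Override
    public int hashCode() {
        // Built from exactly the fields equals() compares, so equal
        // objects always produce equal hash codes.
        return Objects.hash(pattern, options);
    }
}
```

Both fields are final, so the hash code stays the same for the lifetime of the object, which meets the consistency requirement.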


getNumberOfGroups

public int getNumberOfGroups()
Returns the number of regular expression groups. This method returns 1 when the regular expression has no capturing parentheses.
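A result of 1 for a pattern without capturing parentheses suggests that the whole match is counted as a group of its own. That reading is an assumption, not something this page states. The stand-in below reproduces the same counting with java.util.regex, whose Matcher.groupCount() excludes the whole-match group 0, so 1 is added. The helper name is hypothetical.

```java
import java.util.regex.Pattern;

/** Stand-in sketch using java.util.regex, not the Xerces classes. */
final class GroupCountDemo {
    /** Counts capturing groups plus one for the whole match (group 0). */
    static int numberOfGroups(String regex) {
        // groupCount() depends only on the pattern, so an empty input is enough.
        return Pattern.compile(regex).matcher("").groupCount() + 1;
    }
}
```

Under this counting, "abc" has 1 group, "(a)(b)" has 3, and "(?:a)(b)" has 2, because the non-capturing (?:...) form adds nothing.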


getWordType0

private static final int getWordType0(char ch,
                                      int opts)

isEOLChar

private static final boolean isEOLChar(int ch)

isWordChar

private static final boolean isWordChar(int ch)

matchIgnoreCase

private static final boolean matchIgnoreCase(int chardata,
                                             int ch)