org.apache.xerces.impl.xpath.regex
public class: RegularExpression
java.lang.Object
   org.apache.xerces.impl.xpath.regex.RegularExpression

All Implemented Interfaces:
    Serializable

A regular expression matching engine based on a nondeterministic finite automaton (NFA). This engine does not conform to POSIX regular expressions.

How to use

A. Standard way
RegularExpression re = new RegularExpression(regex);
if (re.matches(text)) { ... }
B. Capturing groups
RegularExpression re = new RegularExpression(regex);
Match match = new Match();
if (re.matches(text, match)) {
... // You can refer to the captured text through the methods of the Match class.
}

Case-insensitive matching

RegularExpression re = new RegularExpression(regex, "i");
if (re.matches(text)) { ... }

Options

You can pass options to RegularExpression(regex, options) or setPattern(regex, options). The options parameter is a string made up of the following characters.

"i"
This option indicates case-insensitive matching.
"m"
^ and $ consider the EOL characters within the text.
"s"
. matches any one character, including EOL characters.
"u"
Redefines \d \D \w \W \s \S \b \B \< \> to follow their Unicode definitions.
"w"
With this option, \b \B \< \> are processed using the method of 'Unicode Regular Expression Guidelines' Revision 4. When "w" and "u" are specified at the same time, \b \B \< \> follow the "w" option.
","
With this option, the parser treats a comma in a character class as a range separator. Without it, [a,b] matches a, a comma, or b. With it, [a,b] matches a or b.
"X"
With this option, the engine conforms to XML Schema: Regular Expression. The matches() method does not do substring matching but entire-string matching.
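
To get a feel for what the "i", "m", "s", and "X" options change, the sketch below reproduces similar behavior with the JDK's java.util.regex package rather than with this class. The mapping to Pattern.CASE_INSENSITIVE, Pattern.MULTILINE, Pattern.DOTALL, and Matcher.matches() is an assumption made for illustration only; the two engines are not guaranteed to agree in every detail. The class name OptionAnalogies and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

public class OptionAnalogies {
    // Substring search with the given java.util.regex flags (0 means no flags).
    static boolean contains(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    // Entire-string matching, comparable to what the "X" option is described as doing.
    static boolean matchesEntirely(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        // Analogue of "i": letter case is ignored.
        System.out.println(contains("xerces", Pattern.CASE_INSENSITIVE, "Apache XERCES")); // true

        // Analogue of "m": ^ and $ also match at line terminators inside the text.
        System.out.println(contains("^b$", 0, "a\nb\nc"));                 // false
        System.out.println(contains("^b$", Pattern.MULTILINE, "a\nb\nc")); // true

        // Analogue of "s": . also matches a line terminator.
        System.out.println(contains("a.b", 0, "a\nb"));              // false
        System.out.println(contains("a.b", Pattern.DOTALL, "a\nb")); // true

        // Substring matching versus entire-string matching.
        System.out.println(contains("b+", 0, "abbbc"));     // true
        System.out.println(matchesEntirely("b+", "abbbc")); // false
    }
}
```

The last two calls show the distinction that matters most in practice: without "X", a pattern only has to occur somewhere in the text, whereas with "X" it has to account for the whole text.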

Syntax

Differences from the Perl 5 regular expression

  • There is a 6-digit hexadecimal character representation (\vHHHHHH).
  • Supports subtraction, union, and intersection operations for character classes.
  • Not supported: \ooo (Octal character representations), \G, \C, \lc, \uc, \L, \U, \E, \Q, \N{name}, (?{code}), (??{code})
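
As an illustration of what subtraction, union, and intersection of character classes mean, the sketch below expresses the three operations in java.util.regex notation. This is not the notation of this class: according to the BNF on this page, the engine writes such operations as '(?[' ranges ']' op '[' ranges ']' ')' with op being one of - + &. The class name CharClassOps and the sample classes are hypothetical.

```java
import java.util.regex.Pattern;

public class CharClassOps {
    // java.util.regex notation, used here only to show the set semantics.
    static final Pattern SUBTRACTION  = Pattern.compile("[a-z&&[^aeiou]]"); // a-z minus the vowels
    static final Pattern UNION        = Pattern.compile("[a-c[x-z]]");      // a-c together with x-z
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[def]]");    // a-z restricted to d, e, f

    // True if the whole input (expected to be one character) belongs to the class.
    static boolean accepts(Pattern charClass, String input) {
        return charClass.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(SUBTRACTION, "b"));  // true: consonant
        System.out.println(accepts(SUBTRACTION, "a"));  // false: vowel was subtracted
        System.out.println(accepts(UNION, "y"));        // true: in x-z
        System.out.println(accepts(UNION, "m"));        // false: in neither range
        System.out.println(accepts(INTERSECTION, "e")); // true: in both classes
        System.out.println(accepts(INTERSECTION, "a")); // false: not in [def]
    }
}
```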

Meta characters are `. * + ? { [ ( ) | \ ^ $'.


BNF for the regular expression

regex ::= ('(?' options ')')? term ('|' term)*
term ::= factor+
factor ::= anchors | atom (('*' | '+' | '?' | minmax ) '?'? )?
| '(?#' [^)]* ')'
minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
atom ::= char | '.' | char-class | '(' regex ')' | '(?:' regex ')' | '\' [0-9]
| '\w' | '\W' | '\d' | '\D' | '\s' | '\S' | category-block | '\X'
| '(?>' regex ')' | '(?' options ':' regex ')'
| '(?' ('(' [0-9] ')' | '(' anchors ')' | looks) term ('|' term)? ')'
options ::= [imsw]* ('-' [imsw]+)?
anchors ::= '^' | '$' | '\A' | '\Z' | '\z' | '\b' | '\B' | '\<' | '\>'
looks ::= '(?=' regex ')' | '(?!' regex ')'
| '(?<=' regex ')' | '(?<!' regex ')'
char ::= '\\' | '\' [efnrtv] | '\c' [@-_] | code-point | character-1
category-block ::= '\' [pP] category-symbol-1
| ('\p{' | '\P{') (category-symbol | block-name
| other-properties) '}'
category-symbol-1 ::= 'L' | 'M' | 'N' | 'Z' | 'C' | 'P' | 'S'
category-symbol ::= category-symbol-1 | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo'
| 'Mn' | 'Me' | 'Mc' | 'Nd' | 'Nl' | 'No'
| 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cn' | 'Co' | 'Cs'
| 'Pd' | 'Ps' | 'Pe' | 'Pc' | 'Po'
| 'Sm' | 'Sc' | 'Sk' | 'So'
block-name ::= (See above)
other-properties ::= 'ALL' | 'ASSIGNED' | 'UNASSIGNED'
character-1 ::= (any character except meta-characters)

char-class ::= '[' ranges ']'
| '(?[' ranges ']' ([-+&] '[' ranges ']')? ')'
ranges ::= '^'? (range ','?)+
range ::= '\d' | '\w' | '\s' | '\D' | '\W' | '\S' | category-block
| range-char | range-char '-' range-char
range-char ::= '\[' | '\]' | '\\' | '\' [,-efnrtv] | code-point | character-2
code-point ::= '\x' hex-char hex-char
| '\x{' hex-char+ '}'
| '\v' hex-char hex-char hex-char hex-char hex-char hex-char
hex-char ::= [0-9a-fA-F]
character-2 ::= (any character except \[]-,)
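
The code-point production above can be read as a small parser. The following is a minimal sketch that takes one escape as a string and returns the code point it denotes; it is not taken from the Xerces source. The class name CodePointSketch, the method parseCodePoint, the -1 failure value, and the rejection of values above U+10FFFF are all assumptions made for illustration.

```java
public class CodePointSketch {
    // Returns the code point denoted by a code-point escape, or -1 if the
    // input does not match the code-point production.
    static int parseCodePoint(String s) {
        String hex;
        if (s.startsWith("\\x{") && s.endsWith("}")) {
            hex = s.substring(3, s.length() - 1); // '\x{' hex-char+ '}'
        } else if (s.startsWith("\\x") && s.length() == 4) {
            hex = s.substring(2);                 // '\x' hex-char hex-char
        } else if (s.startsWith("\\v") && s.length() == 8) {
            hex = s.substring(2);                 // '\v' followed by six hex-chars
        } else {
            return -1;
        }
        // hex-char ::= [0-9a-fA-F]; cap the length so parsing cannot overflow.
        if (!hex.matches("[0-9a-fA-F]+") || hex.length() > 8) {
            return -1;
        }
        long value = Long.parseLong(hex, 16);
        // Assumption: values beyond the Unicode range are rejected.
        return value <= Character.MAX_CODE_POINT ? (int) value : -1;
    }

    public static void main(String[] args) {
        System.out.println(parseCodePoint("\\x41"));      // 65
        System.out.println(parseCodePoint("\\x{1F600}")); // 128512
        System.out.println(parseCodePoint("\\v01F600"));  // 128512
        System.out.println(parseCodePoint("\\xZZ"));      // -1
    }
}
```

The three branches correspond one-to-one to the three alternatives of the production, which is why the two-digit and six-digit forms are checked by exact length while the braced form accepts any number of hex digits.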

TODO


Nested Class Summary:
static final class  RegularExpression.Context   
Field Summary
static final  boolean DEBUG     
 String regex    A regular expression.
    serial:
 
 int options   
    serial:
 
 int nofparen    The number of parentheses in the regular expression.
    serial:
 
 Token tokentree    Internal representation of the regular expression.
    serial:
 
 boolean hasBackReferences     
transient  int minlength     
transient  Op operations     
transient  int numberOfClosures     
transient  RegularExpression.Context context     
transient  RangeToken firstChar     
transient  String fixedString     
transient  int fixedStringOptions     
transient  BMPattern fixedStringTable     
transient  boolean fixedStringOnly     
static final  int IGNORE_CASE    "i" 
static final  int SINGLE_LINE    "s" 
static final  int MULTIPLE_LINES    "m" 
static final  int EXTENDED_COMMENT    "x" 
static final  int USE_UNICODE_CATEGORY    This option redefines \d \D \w \W \s \S. 
static final  int UNICODE_WORD_BOUNDARY    An option that enables locale-independent word-boundary processing for \b \B \< \>.

By default, the engine treats a position between a word character (\w) and a non-word character as a word boundary.

With this option, the engine checks word boundaries using the method of 'Unicode Regular Expression Guidelines' Revision 4.

 
static final  int PROHIBIT_HEAD_CHARACTER_OPTIMIZATION    "H" 
static final  int PROHIBIT_FIXED_STRING_OPTIMIZATION    "F" 
static final  int XMLSCHEMA_MODE    "X". XML Schema mode. 
static final  int SPECIAL_COMMA    ",". 
static final  int LINE_FEED     
static final  int CARRIAGE_RETURN     
static final  int LINE_SEPARATOR     
static final  int PARAGRAPH_SEPARATOR     
Constructor:
 public RegularExpression(String regex) throws ParseException 
    Creates a new RegularExpression instance.
    Parameters:
    regex - A regular expression
    Throws:
    org.apache.xerces.impl.xpath.regex.ParseException - if regex does not conform to the syntax.
 public RegularExpression(String regex,
    String options) throws ParseException 
    Creates a new RegularExpression instance with options.
    Parameters:
    regex - A regular expression
    options - A String consisting of "i" "m" "s" "u" "w" "," "X"
    Throws:
    org.apache.xerces.impl.xpath.regex.ParseException - if regex does not conform to the syntax.
 RegularExpression(String regex,
    Token tok,
    int parens,
    boolean hasBackReferences,
    int options) 
Method from org.apache.xerces.impl.xpath.regex.RegularExpression Summary:
equals,   equals,   getNumberOfGroups,   getOptions,   getPattern,   hashCode,   matches,   matches,   matches,   matches,   matches,   matches,   matches,   matches,   matches,   matches,   prepare,   setPattern,   setPattern,   toString
Methods from java.lang.Object:
equals,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.xerces.impl.xpath.regex.RegularExpression Detail:
 public boolean equals(Object obj) 
    Return true if patterns are the same and the options are equivalent.
 boolean equals(String pattern,
    int options) 
 public int getNumberOfGroups() 
    Returns the number of regular expression groups. This method returns 1 when the regular expression has no capturing parentheses.
 public String getOptions() 
    Returns an option string. The order of the letters in it may differ from the string specified in a constructor or setPattern().
 public String getPattern() 
 public int hashCode() 
 public boolean matches(char[] target) 
    Checks whether the target text contains this pattern or not.
 public boolean matches(String target) 
    Checks whether the target text contains this pattern or not.
 public boolean matches(CharacterIterator target) 
    Checks whether the target text contains this pattern or not.
 public boolean matches(char[] target,
    Match match) 
    Checks whether the target text contains this pattern or not.
 public boolean matches(String target,
    Match match) 
    Checks whether the target text contains this pattern or not.
 public boolean matches(CharacterIterator target,
    Match match) 
    Checks whether the target text contains this pattern or not.
 public boolean matches(char[] target,
    int start,
    int end) 
    Checks whether the target text contains this pattern in specified range or not.
 public boolean matches(String target,
    int start,
    int end) 
    Checks whether the target text contains this pattern in specified range or not.
 public boolean matches(char[] target,
    int start,
    int end,
    Match match) 
    Checks whether the target text contains this pattern in specified range or not.
 public boolean matches(String target,
    int start,
    int end,
    Match match) 
    Checks whether the target text contains this pattern in specified range or not.
  void prepare() 
    Prepares for matching. This method is called just before starting matching.
 public  void setPattern(String newPattern) throws ParseException 
 public  void setPattern(String newPattern,
    String options) throws ParseException 
 public String toString() 
    Represents this instance as a String.