Tuesday, October 14, 2008

REGEX

Regex stands for Regular Expression, it also called re, I remember got it in automata or discrete math when bachelor. Basically, a regular expression is a pattern describing a certain amount of text. Their name comes from the mathematical theory on which they are based. Most of this post content is taken from http://www.regular-expressions.info, and I just copy and paste it.

Regex is a powerful tool when we already learn it, in my project, I used regex as field validator of my pojos and as validator of the uploaded file name, fyi, in my application there is a part that user upload files, and that file is filtered first by application, because my appliaction is monthly bank reporting, so it needs to know that the file is report in what month, I mean the file must named year and month in the last of its name.

11 Special Charactes
(metacharacters)
- the opening square bracket [
- the backslash \
- the caret ^
- the dollar sign $
- the period or dot .
- the vertical bar or pipe symbol |
- the question mark ?
- the asterisk or star *
- the plus sign +
- the opening round bracket (
- the closing round bracket )

If we want to use one of these metacharacters, we need to escape them with a backslash, or double backslash in java.

Non-Printable Characters
- \t to match a tab character (ASCII 0x09),
- \r
for carriage return (0x0D) , and
- \n for line feed (0x0A).
- More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B).

And once again, we need double backslash in java, one backslash is for the regex commonly.

Regex Engine
A regular expression "engine" is a piece of software that can process regular expressions, trying to match the pattern to the given string.

There are two kinds of regex engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively.
http://www.regular-expressions.info is based on regex-directed engines, and that sites is based on pearl language. First, I thought java is also regex-directed, but surprisingly is not, I create mini program as told by the site to find out is java text-directed engines or regex-directed, and surprisingly it's text-directed. The regex is: regex|regex not and the string is regex not, Below is the code, If the resulting match is only regex, the engine is regex-directed. If the result is regex not, then it is text-directed. Below is the code, and the output is "Found 'regex not' at position 0":

public static void main(String[] args) {
Pattern p = Pattern.compile ( "regex|regex not", Pattern.CASE_INSENSITIVE ) ;
String text = "regex not";
Matcher m = p.matcher ( text ) ;
if ( m.matches() ) {
System.out.println ( "Found '" + m.group ( 0 ) + "' at position " + m.start ( 0 ) ) ;
}
else {
System.err.println("No found match");
}
}


Regex-directed always returns the leftmost match, if we applying regex cat to He captured a catfish for his cat, the engine will try to match the first token in the regex c to the first character in the match H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the c with the e. This fails too, as does matching the c with the space. Arriving at the 4th character in the match, c matches c. The engine will then try to match the second token a to the 5th character, a. This succeeds too. But then, t fails to match p. At that point, the engine knows the regex cannot be matched starting at the 4th character in the match. So it will continue with the 5th: a. Again, c fails to match here and the engine carries on. At the 15th character in the match, c again matches c. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that a matches a and t matches t.

The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any "better" matches.


But in java which is text
-directed, he report no match for that situation, I already try regex cat to He captured a catfish for his cat, and it report no match found, it only receive String cat.

Character Set or Character Class "[ ]"
With character set we can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. We can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9.


- \d is short for [0-9]
- \w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
- \s stands for "whitespace character".
- The negated versions, \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s].

Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter will match any character that is not a digit or whitespace. So it will match x, but not 8. The former, however, will match any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, [\D\S] will match any character, digit, whitespace or otherwise.

If we repeat a character class by using the ?, * or + operators, we will repeat the entire character class, and not just the character that it matched. The regex [0-9]+ can match 837 as well as 222. But If we want to repeat the matched character, rather than the class, you will need to use backreferences. ([0-9])\1+ will match 222 but not 837.


Negated Character Classes
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. It is important to remember that a negated character class still must match a character. q[^u] does not mean: "a q not followed by a u". It means: "a q followed by a character that is not a u". It will not match the q in the string Iraq. It will match the q and the space after the q in Iraq is a country. Indeed: the space will be part of the overall match, because it is the "character that is not a u" that is matched by the negated character class in the above regexp.

Metacharacters Inside Character Classes
Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. I recommend the latter method, since it improves readability. To include a caret, place it anywhere except right after the opening bracket. [x^] matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. []x][^]x] matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen. matches a closing bracket or an x.

The Dot Matches (Almost) Any Character
The dot matches a single character, without caring what that character is. The only exception are newline characters. So by default, the dot is short for
the [^\n].

Anchors
Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. And they are Zero-Length Matches.
- The caret ^ matches the position before the first character in the string.
- $ matches right after the last character in the string.
- \b matches at a position that is called a "word boundary"


Applying ^a to abc matches a. ^b will not match abc at all, because the b cannot be matched right after the start of the string, matched by ^. While c$ matches c in abc, while a$ does not match at all.

There are four different positions that qualify as word boundaries:
* Before the first character in the string, if the first character is a \w.
* After the last character in the string, if the last character is a
\w.
* Between a
\w and a non-word character following right after \w.
* Between a \W and a
\w following right after \W character.

In the regex-engine,
\bis\b will match This island is beautiful. Space is non word character, and the others are. The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the is in This.

Alternation "|"
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If we want to limit the reach of the alternation, we will need to use round brackets for grouping.

Optional "?"
It is greedy. The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine will always try to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

The effect is that if you apply the regex Feb 23(rd)? to the string Feb 23rd, the match will always be Feb 23rd and not Feb 23. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first.

"?,*, +,{}"
- ? is optional, 1 or 0
- * is 0 or more repetition
- + is 1 or more repetition
- {} is limiting repetition, The syntax is {min,max}
, if we want exactly n reeptition so the syntax is {n}

Again, they are greedy, suppose we want to receive html tag with this regex:
<.+>. when we test it on a string like This is a first test. we might expect the regex to match and when continuing after that match, . But it does not. The regex will match first. Obviously not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex.

To fix the problem:
- laziness, by adding "?", so the regex will becomes <.+?>.

- better solution without backtracking: We can use a greedy plus and a negated character class: <[^>]+>.

Grouping and Backreference "()"
Besides grouping part of a regular expression together, round brackets also create a "backreference". A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.

([a-c])x\1x\1 will match axaxa, bxbxb and cxcxc. \1 is calling backreference ([a-c]). and in java we need double backslash rather than single backslash.

The other example is <([A-Z][A-Z0-9]*)\b[^>]*>.*? that want to match a pair of opening and closing HTML tags, and the text in between, such text bold italic.


Java also support forward reference, that is: we can use a backreference to a group that appears later in the regex. Obviously only useful if they're inside a repeated group. The example is (\2two|(one))+ will match oneonetwo.

No comments: