Namespace: System.Text.RegularExpressions
A regular expression is a pattern of characters that can be
compared to a string to determine whether the string meets a specified format
requirement.
The static method System.Text.RegularExpressions.Regex.IsMatch is used to test a string against a
specified regular expression.
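As a minimal sketch of Regex.IsMatch, the helper below checks whether a string consists of exactly five digits. The class name IsMatchSketch, the method name IsFiveDigitCode, and the five-digit format are assumptions made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsMatchSketch
{
    // Hypothetical helper: returns true when the input is five digit characters
    // from start (^) to end ($).
    public static bool IsFiveDigitCode(string input)
    {
        return Regex.IsMatch(input, @"^\d{5}$");
    }
}
```

For example, IsMatchSketch.IsFiveDigitCode("12345") returns true, while "1234" and "12345a" return false.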
Regular expressions can be difficult to create,
and confusing to read and modify.
Creating Regular expressions:
A simple regular expression can be “abc”, which matches abc,
abcdef, 123abcd, and any other string that contains abc.
To specify that the string must start
with the expression, use ‘^’ as in “^abc”.
To specify that the string must end
with the expression, use ‘$’ as in “abc$”. Using both, as in “^abc$”, requires the string to be exactly abc.
Other characters that can be used in regular expressions are:
^ matches the beginning of the input.
In multiline mode it also matches the beginning of each line.
$ matches the end of the input. In multiline mode
it also matches the end of each line.
\A specifies that the match must begin at the
first character of the input, ignoring multiline mode.
\Z specifies that the match must end at the
last character, or at the character before a final \n, ignoring multiline mode.
\z specifies that the match must end at the
last character of the input.
\G specifies that the match must occur
where the previous match ended. It is used with Match.NextMatch.
\b specifies that the match must occur at a boundary
between a \w (alphanumeric) and a \W (non-alphanumeric) character. For example, “cat\b” matches “cat”,
“cat ”, and “cat.” but does not
match “cats”.
\B specifies that the match must not
occur at a \b boundary. For example, “cat\B” matches the “cat” in “cats” but does not match “cat ” or “cat.”.
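The anchors above can be tried out with a few one-line helpers. This is only a sketch: the class and method names are hypothetical, and the patterns are the ones used in the examples above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorSketch
{
    // ^ anchors the match to the start of the input.
    public static bool StartsWithAbc(string s) => Regex.IsMatch(s, "^abc");

    // $ anchors the match to the end of the input.
    public static bool EndsWithAbc(string s) => Regex.IsMatch(s, "abc$");

    // \b requires a word boundary right after "cat".
    public static bool CatBeforeBoundary(string s) => Regex.IsMatch(s, @"cat\b");

    // \B requires that there is no word boundary right after "cat".
    public static bool CatBeforeNonBoundary(string s) => Regex.IsMatch(s, @"cat\B");

    // With RegexOptions.Multiline, ^ also matches at the start of each line.
    public static int CountLinesStartingWithAbc(string s) =>
        Regex.Matches(s, "^abc", RegexOptions.Multiline).Count;
}
```

With these helpers, CatBeforeBoundary("cat.") is true and CatBeforeBoundary("cats") is false, while CatBeforeNonBoundary gives the opposite answers. CountLinesStartingWithAbc("abc\nabc\nxabc") returns 2.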
Important Wildcards:
* Matches the preceding element zero or more times.
+ Matches the preceding element one or more times.
? Matches the preceding element zero or one time (the element is
optional).
{n} Matches the preceding element exactly n times. For example,
“fo{2}d” matches “food”.
{n,} Matches the preceding element n or more
times. For example, “fo{2,}d” matches “food” and “foood”. O* is equivalent to O{0,} and O+
is equivalent to O{1,}.
{n,m} Matches the preceding element at least n
times and at most m times; n must be <= m. O? is equivalent to O{0,1}.
<wildcard>? Matches as few times as
possible. For example, O+? matches only one O in “OOO”.
X|Y Matches either X or Y.
[ABC] Matches any one of the listed characters.
For example, [abc] matches the “a” in “plain”.
[a-c] Matches any one character in the
range from a to c.
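The difference between greedy and lazy quantifiers is easiest to see by looking at the text a pattern actually captures. The sketch below assumes a hypothetical helper, FirstMatch, that returns the first matched substring (or an empty string when nothing matches).

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierSketch
{
    // Hypothetical helper: returns the text of the first match,
    // or an empty string if the pattern does not match.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

FirstMatch("OOO", "O+") returns "OOO" because + is greedy, while FirstMatch("OOO", "O+?") returns "O". FirstMatch("foood", "fo{2,}d") returns "foood", and FirstMatch("fod", "fo{2}d") returns an empty string.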
Other character-class wildcards:
\d for a digit
\D for a non-digit character
\s for any white-space character,
including: [ \f\n\r\t\v]
\S for any non-white-space character
\w for any word character, such as [A-Za-z0-9_]; by default .NET also matches Unicode letters and digits
\W for any character not matched by \w
Groups can also be named, as in (?<PatternName>Pattern).
The text matched by Pattern can later be retrieved by referencing PatternName.
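A named group can be read back through the Groups collection of a Match. In this sketch the group names year and month, the date-like format, and the helper name ExtractYear are all assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSketch
{
    // Hypothetical helper: returns the text captured by the group named "year",
    // or an empty string when the pattern does not match.
    public static string ExtractYear(string input)
    {
        Match m = Regex.Match(input, @"(?<year>\d{4})-(?<month>\d{2})");
        return m.Success ? m.Groups["year"].Value : string.Empty;
    }
}
```

ExtractYear("Released 2021-07") returns "2021", and an input without a match returns an empty string.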
Matching data using Back references:
By using a back reference one can force the engine to
remember what a group matched and reuse it later in the pattern, as in (?<char>\w)\k<char>.
\k<char> forces the engine to compare the current character with the character
last captured by the group <char>. So, the expression looks for two identical
adjacent characters, and returns the matches “ll”, “ll”, “ff”, and “ee” (from I’ll, small, and coffee) in the
sentence “I’ll have a small coffee”.
Back referencing Constructs:
\number Numbered back reference. For example, (\w)\1
finds doubled word characters.
\k<name> Named back reference. For example,
(?<char>\w)\k<char> finds doubled word characters.
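Both back-reference forms can be checked against the sample sentence. The following is a sketch only; the class name and the helper FindDoubledCharacters are hypothetical, and the helper simply joins every match with commas so the result is easy to inspect.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BackReferenceSketch
{
    // Hypothetical helper: collects every match of the pattern
    // and joins the matched substrings with commas.
    public static string FindDoubledCharacters(string input, string pattern)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(m.Value);
        }
        return string.Join(",", found);
    }
}
```

For the input "I'll have a small coffee", both the named pattern @"(?<char>\w)\k<char>" and the numbered pattern @"(\w)\1" give "ll,ll,ff,ee".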
How to specify regular expression Options:
Regular expression options modify the matching
behavior. They can be specified by passing options to Regex(pattern,
options), or inline in the pattern: (?imnsx-imnsx:) for grouping constructs
and (?imnsx-imnsx) for other constructs. The minus sign (-) turns an option off.
Regular Expression Options:
Class | Inline char | Description
Compiled | | Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time.
CultureInvariant | | Specifies that cultural differences in language are ignored. See Performing Culture-Insensitive Operations in the RegularExpressions Namespace for more information.
ECMAScript | | Enables ECMAScript-compliant behavior for the expression. This value can be used only in conjunction with the IgnoreCase, Multiline, and Compiled values. The use of this value with any other values results in an exception.
ExplicitCapture | n | Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>…). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…).
IgnoreCase | i | Specifies case-insensitive matching.
IgnorePatternWhitespace | x | Eliminates unescaped white space from the pattern and enables comments marked with #. However, the IgnorePatternWhitespace value does not affect or eliminate white space in character classes.
Multiline | m | Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
None | | Specifies that no options are set.
RightToLeft | | Specifies that the search will be from right to left instead of from left to right.
Singleline | s | Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
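The two ways of setting an option, through RegexOptions or inline in the pattern, can be compared side by side. This is a sketch under assumed names: the class OptionsSketch, its methods, and the sample patterns "hello" and "a.b" are all made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsSketch
{
    // IgnoreCase passed as a RegexOptions value.
    public static bool MatchWithOption(string input) =>
        Regex.IsMatch(input, "hello", RegexOptions.IgnoreCase);

    // Inline (?i): case-insensitive from this point to the end of the pattern.
    public static bool MatchInlineWhole(string input) =>
        Regex.IsMatch(input, "(?i)hello");

    // Inline (?i:...): only the "h" inside the group is case-insensitive.
    public static bool MatchInlineGroup(string input) =>
        Regex.IsMatch(input, "(?i:h)ello");

    // Singleline lets the dot match \n as well.
    public static bool DotMatches(string input, bool singleline) =>
        Regex.IsMatch(input, "a.b",
            singleline ? RegexOptions.Singleline : RegexOptions.None);
}
```

MatchWithOption("HELLO") and MatchInlineWhole("HELLO") are both true. MatchInlineGroup("Hello") is true but MatchInlineGroup("HELLO") is false, because only the first letter is case-insensitive. DotMatches("a\nb", false) is false and DotMatches("a\nb", true) is true.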
How to extract matched data:
One can extract information from a
flat document as well.
1. Create a regular expression.
2. Obtain a Match object by calling Regex.Match.
3. Retrieve the match data by accessing the Group objects in the
Groups collection of the match.
Simple way to retrieve the matched data:
using System;
using System.Text.RegularExpressions;

string text = "One car red car blue car";
string pat = @"(\w+)\s+(car)";
// Create the regular expression (case-insensitive).
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
// Match the regular expression pattern against a text string.
Match m = r.Match(text);
int matchCount = 0;
while (m.Success)
{
    Console.WriteLine("Match" + (++matchCount));
    // Groups[0] is the whole match; groups 1 and 2 are the parenthesized groups.
    for (int i = 1; i <= 2; i++)
    {
        Group g = m.Groups[i];
        Console.WriteLine("Group" + i + "='" + g + "'");
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("Capture" + j + "='" + c + "', Position=" + c.Index);
        }
    }
    // Advance to the next match in the text.
    m = m.NextMatch();
}
// This code produces the following output:
//
//Match1
//Group1='One'
//Capture0='One', Position=0
//Group2='car'
//Capture0='car', Position=4
//Match2
//Group1='red'
//Capture0='red', Position=8
//Group2='car'
//Capture0='car', Position=12
//Match3
//Group1='blue'
//Capture0='blue', Position=16
//Group2='car'
//Capture0='car', Position=21
Extracting information from flat documents:
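String input = "Company name: abc def.";
// The literal text in the pattern must use the same casing as the input ("name", not "Name").
Match m = Regex.Match(input, @"Company name: (.*$)");
Console.WriteLine(m.Groups[1]); // prints: abc def.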
How to replace substring using Regular Expressions:
One can do complex replacements by using named or numbered back
references, for example:
String MDYtoDMY(String input)
{
    // Reorder month/day/year into day-month-year using the named groups.
    return Regex.Replace(input,
        @"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b",
        @"${day}-${month}-${year}");
}
Character Escapes used in substitutions:
Character | Description
$number | Substitutes the last substring matched by group number number (decimal).
${name} | Substitutes the last substring matched by a (?<name> ) group.
$$ | Substitutes a single "$" literal.
$& | Substitutes a copy of the entire match itself.
$` | Substitutes all the text of the input string before the match.
$' | Substitutes all the text of the input string after the match.
$+ | Substitutes the last group captured.
$_ | Substitutes the entire input string.
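A few of these substitutions can be seen in action with Regex.Replace. The sketch below uses made-up helper names (SwapNames, AddDollarSign) and sample inputs that are not from the original text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionSketch
{
    // $1 and $2 refer to the first and second numbered groups.
    public static string SwapNames(string input)
    {
        return Regex.Replace(input, @"(\w+) (\w+)", "$2, $1");
    }

    // $$ inserts a literal dollar sign and $& inserts the whole match.
    public static string AddDollarSign(string input)
    {
        return Regex.Replace(input, @"\d+", "$$$&");
    }
}
```

SwapNames("John Smith") returns "Smith, John", and AddDollarSign("price 42") returns "price $42".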
NOTE: The characters !, @, #, $, %, ^, *, (, ), <, and >
can be used to mount an attack, so user input should be validated against, or cleaned of, these
characters.