Regular Expression in .Net

Namespace: System.Text.RegularExpressions

A regular expression is a pattern of characters that can be compared against a string to determine whether the string meets a specified format.

The static method System.Text.RegularExpressions.Regex.IsMatch is used to test a string against a specified regular expression.
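As a minimal sketch of Regex.IsMatch, the helper below checks whether an input consists of exactly five digits. The class name IsMatchSketch and the method IsFiveDigitCode are made up for illustration; only Regex.IsMatch comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsMatchSketch
{
    // Hypothetical helper: true when the input is five ASCII digits.
    public static bool IsFiveDigitCode(string input)
    {
        // ^ and $ tie the pattern to the start and end of the input
        // ($ also matches just before one trailing \n).
        return Regex.IsMatch(input, @"^[0-9]{5}$");
    }
}
```

IsFiveDigitCode("12345") returns true, while "1234" and "12345a" return false.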

Regular expressions can be difficult to create.

They can also be confusing to read and modify.

Creating Regular expressions:

A simple regular expression is “abc”, which matches abc, abcdef, 123abcd, and any other string that contains abc.

To specify that the string must start with the expression, use ‘^’, as in “^abc”.

To specify that the string must end with the expression, use ‘$’, as in “abc$”. Combining both, “^abc$” matches only the string abc itself.
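The difference between an unanchored pattern and the two anchors can be sketched as follows. The class and method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorSketch
{
    // No anchor: "abc" may appear anywhere in the string.
    public static bool ContainsAbc(string s) => Regex.IsMatch(s, "abc");

    // ^ requires the match to begin at the start of the string.
    public static bool StartsWithAbc(string s) => Regex.IsMatch(s, "^abc");

    // $ requires the match to finish at the end of the string.
    public static bool EndsWithAbc(string s) => Regex.IsMatch(s, "abc$");

    // Both anchors together leave room for nothing but "abc".
    public static bool IsExactlyAbc(string s) => Regex.IsMatch(s, "^abc$");
}
```

For “123abcd”, ContainsAbc is true but StartsWithAbc and EndsWithAbc are both false; for “abcdef”, StartsWithAbc is true and IsExactlyAbc is false.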

Other characters that can be used in regular expressions are:

^ matches the beginning of the input. In multiline mode it also matches the beginning of each line.

$ matches the end of the input, or just before a final \n. In multiline mode it also matches the end of each line.

\A specifies that the match must begin at the first character of the input; multiline mode is ignored.

\Z specifies that the match must end at the last character of the input, or just before a final \n; multiline mode is ignored.

\z specifies that the match must end at the very last character of the input.

\G specifies that the match must occur at the point where the previous match ended. It is used with Match.NextMatch.

\b specifies that the match must occur on a boundary between a \w (alphanumeric) and a \W (non-alphanumeric) character. For example, “cat\b” matches the cat in “cat”, “cat ” and “cat.” but does not match in “cats”.

\B specifies that the match must not occur on a \b boundary. For example, “cat\B” matches the cat in “cats” but does not match in “cat ” or “cat.”.
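A short sketch of how \b, \B, ^ and \A behave in practice. The helper names are invented for this example; the counts come from Regex.Matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundarySketch
{
    // "cat" followed by a word boundary (end of a word).
    public static int CatAtWordEnd(string s) => Regex.Matches(s, @"cat\b").Count;

    // "cat" followed by another word character (inside a longer word).
    public static int CatInsideWord(string s) => Regex.Matches(s, @"cat\B").Count;

    // With Multiline, ^ matches at the start of every line.
    public static int WordsAtLineStart(string s) =>
        Regex.Matches(s, @"^\w+", RegexOptions.Multiline).Count;

    // \A ignores Multiline and matches only at the start of the input.
    public static int WordsAtInputStart(string s) =>
        Regex.Matches(s, @"\A\w+", RegexOptions.Multiline).Count;
}
```

For the input “cat cats cat.”, CatAtWordEnd returns 2 and CatInsideWord returns 1. For the three-line input "one\ntwo\nthree", WordsAtLineStart returns 3 and WordsAtInputStart returns 1.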

Important quantifiers and wildcards:

* matches the preceding element zero or more times.

+ matches the preceding element one or more times.

? matches the preceding element zero or one time (making it optional).

{n} matches the preceding element exactly n times. For example, “fo{2}d” matches “food”.

{n,} matches the preceding element n or more times. For example, “fo{2,}d” matches “food” and “foood”. O* is equivalent to O{0,} and O+ is equivalent to O{1,}.

{n,m} matches the preceding element at least n and at most m times; n must be less than or equal to m. O? is equivalent to O{0,1}.

<quantifier>? makes the quantifier lazy, so it matches as little as possible. For example, O+? matches only a single O in “OOO”.

X|Y matches either X or Y.

[ABC] matches any single character in the set. For example, [abc] matches the “a” in “plain”.

[a-c] matches any single character in the range from a to c.
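The quantifiers above can be tried out with a tiny helper that returns the first match of a pattern. FirstMatch is a hypothetical name; it simply wraps Regex.Match and returns an empty string when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierSketch
{
    // Returns the text of the first match, or "" if the pattern does not match.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

FirstMatch("OOO", "O+") returns “OOO” (greedy), whereas FirstMatch("OOO", "O+?") returns “O” (lazy). FirstMatch("foood", "fo{2,}d") returns “foood”, FirstMatch("fod", "fo{2}d") returns an empty string, and FirstMatch("plain", "[abc]") returns “a”.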

Other character class shorthands:

\d matches a digit.

\D matches a non-digit character.

\s matches any white-space character, including [ \f\n\r\t\v].

\S matches any non-white-space character.

\w matches any word character, such as [A-Za-z0-9_] (by default .NET also includes Unicode letters and digits).

\W matches any character other than \w.
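As a rough illustration of \d and \s, the sketch below pulls the digit runs out of a string and splits a string on white space. Both helper names are assumptions made for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ShorthandSketch
{
    // Collects every run of one or more digits.
    public static string[] DigitRuns(string input)
    {
        return Regex.Matches(input, @"\d+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Splits on runs of white space after trimming the ends.
    public static string[] Words(string input)
    {
        return Regex.Split(input.Trim(), @"\s+");
    }
}
```

DigitRuns("Order 66, aisle 7") yields “66” and “7”; Words("  one\ttwo  three ") yields “one”, “two” and “three”.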

One can also name groups, as in (?<PatternName>Pattern). The text captured by Pattern can later be retrieved by referencing PatternName.
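A minimal sketch of reading a named group back through Match.Groups. The date layout (yyyy-mm-dd) and the helper name YearOf are assumptions chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSketch
{
    // Returns the year part of the first yyyy-mm-dd date found, or "" if none.
    public static string YearOf(string text)
    {
        Match m = Regex.Match(text, @"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})");
        // Groups can be indexed by the name given in (?<name>...).
        return m.Success ? m.Groups["year"].Value : string.Empty;
    }
}
```

YearOf("Released 2009-04-15") returns “2009”; YearOf("no date here") returns an empty string.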

Matching data using Back references:

Back references make the engine remember a captured substring for later use, as in (?<char>\w)\k<char>. \k<char> forces the engine to compare the current character with the character last captured in <char>. The expression therefore finds two identical adjacent characters: in the sentence “I’ll have a small coffee” it matches the ll in I’ll, the ll in small, and the ff and ee in coffee.

Back referencing Constructs:

\number Numbered back reference. For example, (\w)\1 finds doubled word characters.

\k<name> Named back reference. For example, (?<char>\w)\k<char> finds doubled word characters.
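Both back-reference forms can be compared side by side. The sketch below is illustrative only; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BackReferenceSketch
{
    // Named form: \k<char> must repeat whatever (?<char>\w) captured.
    public static string[] DoubledByName(string input)
    {
        return Collect(Regex.Matches(input, @"(?<char>\w)\k<char>"));
    }

    // Numbered form: \1 must repeat whatever group 1 captured.
    public static string[] DoubledByNumber(string input)
    {
        return Collect(Regex.Matches(input, @"(\w)\1"));
    }

    private static string[] Collect(MatchCollection matches)
    {
        return matches.Cast<Match>().Select(m => m.Value).ToArray();
    }
}
```

For the sentence "I'll have a small coffee", both methods return the four matches “ll”, “ll”, “ff” and “ee”.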

How to specify regular expression Options:

Regular expression options modify the matching behavior. They can be specified through the options argument of Regex(pattern, options), or inline: (?imnsx-imnsx:subexpression) applies the options to a single group, and (?imnsx-imnsx) applies them from that point in the pattern onward. A minus sign (-) removes an option.

Regular Expression Options:

RegexOptions value	Inline character	Description
Compiled		Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time.
CultureInvariant		Specifies that cultural differences in language are ignored. See Performing Culture-Insensitive Operations in the RegularExpressions Namespace for more information.
ECMAScript		Enables ECMAScript-compliant behavior for the expression. This value can be used only in conjunction with the IgnoreCase, Multiline, and Compiled values. The use of this value with any other values results in an exception.
ExplicitCapture	n	Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>…). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…).
IgnoreCase	i	Specifies case-insensitive matching.
IgnorePatternWhitespace	x	Eliminates unescaped white space from the pattern and enables comments marked with #. However, the IgnorePatternWhitespace value does not affect or eliminate white space in character classes.
Multiline	m	Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
None		Specifies that no options are set.
RightToLeft		Specifies that the search will be from right to left instead of from left to right.
Singleline	s	Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
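The three ways of applying an option (the RegexOptions argument, an inline switch, and an inline switch scoped to one group) can be sketched like this. The patterns and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsSketch
{
    // Option passed through the RegexOptions argument.
    public static bool ViaArgument(string s) =>
        Regex.IsMatch(s, "^hello world$", RegexOptions.IgnoreCase);

    // Same option switched on inline for the rest of the pattern.
    public static bool ViaInline(string s) =>
        Regex.IsMatch(s, "(?i)^hello world$");

    // Option scoped to one group: only "hello" ignores case.
    public static bool ViaScopedGroup(string s) =>
        Regex.IsMatch(s, "^(?i:hello) world$");

    // With Singleline the dot also matches \n.
    public static bool DotMatches(string s, RegexOptions options) =>
        Regex.IsMatch(s, "^a.b$", options);
}
```

ViaArgument("HELLO WORLD") and ViaInline("Hello World") are both true. ViaScopedGroup("HELLO world") is true but ViaScopedGroup("HELLO WORLD") is false, because the option ends with the group. DotMatches("a\nb", RegexOptions.None) is false and becomes true with RegexOptions.Singleline.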

How to extract matched data:

One can extract information from a flat document as well.

1. Create a regular expression.

2. Call Regex.Match to obtain a Match object.

3. Retrieve the matched data through the Match object's Groups collection.

Simple way to retrieve the matched data:

using System;
using System.Text.RegularExpressions;

string text = "One car red car blue car";
string pat = @"(\w+)\s+(car)";

// Create the Regex object with case-insensitive matching.
Regex r = new Regex(pat, RegexOptions.IgnoreCase);

// Match the regular expression pattern against a text string.
Match m = r.Match(text);
int matchCount = 0;
while (m.Success)
{
    Console.WriteLine("Match" + (++matchCount));
    // Groups[0] is the whole match; groups 1 and 2 are the parenthesized captures.
    for (int i = 1; i <= 2; i++)
    {
        Group g = m.Groups[i];
        Console.WriteLine("Group" + i + "='" + g + "'");
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("Capture" + j + "='" + c + "', Position=" + c.Index);
        }
    }
    // Advance to the next match in the input.
    m = m.NextMatch();
}

// This code produces the following output:
// Match1
// Group1='One'
// Capture0='One', Position=0
// Group2='car'
// Capture0='car', Position=4
// Match2
// Group1='red'
// Capture0='red', Position=8
// Group2='car'
// Capture0='car', Position=12
// Match3
// Group1='blue'
// Capture0='blue', Position=16
// Group2='car'
// Capture0='car', Position=21

Extracting information from flat documents:

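string input = "Company name: abc def.";
// The literal text in the pattern must match the input's casing unless RegexOptions.IgnoreCase is used.
Match m = Regex.Match(input, @"Company name: (.*$)");
// Group 1 holds everything after the label.
Console.WriteLine(m.Groups[1]);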

How to replace substring using Regular Expressions:

One can do complex replacements by using named or numbered back references, for example:

string MDYtoDMY(string input)
{
    // Capture month, day and year by name, then emit them in day-month-year order.
    return Regex.Replace(input,
        @"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b",
        "${day}-${month}-${year}");
}

Character Escapes used in substitutions:

Character	Description
$number	Substitutes the last substring matched by group number number (decimal).
${name}	Substitutes the last substring matched by a (?<name> ) group.
$$	Substitutes a single "$" literal.
$&	Substitutes a copy of the entire match itself.
$`	Substitutes all the text of the input string before the match.
$'	Substitutes all the text of the input string after the match.
$+	Substitutes the last group captured.
$_	Substitutes the entire input string.
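A few of the substitutions in the table, sketched with Regex.Replace. The helper names and sample patterns are assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionSketch
{
    // $& inserts the whole match: wrap every number in brackets.
    public static string BracketNumbers(string s) =>
        Regex.Replace(s, @"\d+", "[$&]");

    // $1 and $2 insert numbered groups: swap the two sides of key=value.
    public static string SwapPair(string s) =>
        Regex.Replace(s, @"(\w+)=(\w+)", "$2=$1");

    // $$ inserts a literal dollar sign in front of the match ($&).
    public static string AddDollarSign(string s) =>
        Regex.Replace(s, @"\d+", "$$$&");
}
```

BracketNumbers("a1b22") returns “a[1]b[22]”, SwapPair("key=value") returns “value=key”, and AddDollarSign("cost 15") returns “cost $15”.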

NOTE: The characters !, @, #, $, %, ^, *, (, ), < and > can be used to mount injection attacks, so untrusted input should be validated or stripped of these characters before it is used.
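One possible way to strip the characters listed in the note is a character class passed to Regex.Replace. This is only a sketch: the helper name StripRiskyCharacters is hypothetical, and removing characters this way is not a complete defense by itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputCleaningSketch
{
    // Removes the characters listed in the note above.
    // Inside [...] these characters are literals; ^ is literal because it is not first.
    public static string StripRiskyCharacters(string input)
    {
        return Regex.Replace(input, @"[!@#$%^*()<>]", "");
    }
}
```

StripRiskyCharacters("<script>alert(1)</script>") returns “scriptalert1/script”.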


Mubbasher Mukhtar | Sharing Development Experience
