C#Today - Your Just-In-Time Resource for C# Code and Techniques
The C#Today Article February 1, 2002
Introducing .NET Regular Expressions with C#
by Tony Loton
CATEGORY: Other Technologies
ARTICLE TYPE: Cutting Edge
ABSTRACT If you’re building any kind of application that involves looking for patterns in text, picking out segments of text according to certain criteria, or transforming the text itself, you can save a lot of time and energy by becoming familiar with regular expressions. In this article, Tony Loton demonstrates the Regular Expression language itself, and also introduces the main classes of the .NET System.Text.RegularExpressions namespace that allow you to harness the power of regular expressions from within your C#, Visual Basic or C++ programs.
ARTICLE

Editor's Note: This article's code has been updated to work with the final release of the .NET Framework.

If you're building any kind of application that involves looking for patterns in text, picking out segments of text according to certain criteria, or transforming the text itself, you can save a lot of time and energy by becoming familiar with regular expressions. In this article, I'll demonstrate the Regular Expression language itself, which is not unique to the .NET Framework. That discussion may well be of interest if you're thinking of using the RE engine of the JDK 1.4, or if you're working with one of the older languages - like AWK or Perl - that support regular expressions. Then I'll introduce the main classes of the .NET System.Text.RegularExpressions namespace that allow you to harness the power of regular expressions from within your C#, Visual Basic or C++ programs.
Introduction

Consider the following problem: how many lines of code do you think you would need in order to transform all the $ (dollar) amounts in the following sentence from the form "$10" to the form "10 US dollars"?
    Each item is priced at $10 but you can purchase ten for only $80 which is much cheaper than 10 * $10 = $100.

One solution would be to loop through the string using strtok (for C) or StringTokenizer (for Java), read ahead from each $ symbol that you found up to the next word boundary, and meanwhile build a second version of the string using concatenation. Or, you could use the .NET Regular Expression engine and this single line of code:
http://www.csharptoday.com/content/articles/20020201.asp?WROXE...
2002-07-10
    String newText = Regex.Replace(sourceText, @"\$(?<amount>\d*)", "${amount} US dollars");

Which would give:
    Each item is priced at 10 US dollars but you can purchase ten for only 80 US dollars which is much cheaper than 10 * 10 US dollars = 100 US dollars.

Other practical uses include parsing (of HTML / XML content, or even natural language), feature extraction (to pick out names and addresses from documents), and even the humble search-and-replace operation of a text editor.

On the subject of HTML and XML, in a previous article entitled "Working with Web Data in C#" (http://www.csharptoday.com/content/articles/20020128.asp) I hinted at a practical use for regular expressions to extract data from web pages. In the following screenshot, the SELECT and MATCHES clauses both contain regular expressions.
The text " .html#0.table#1.tr#0.td#1.table#1.tr#1.td#0.table#0.tr#\d+.td#0$ " is a regular expression that matches certain HTML elements - in this case book titles from the Wrox web site - according to their positions on the page. The text " \w*.NET\w* " is a regular expression that picks out only those titles that mention .NET.

Later you'll see that the RE language elements used in those examples are:

• \d (as in .tr#\d+ above) - a character class that matches any decimal digit.
• + (as in .tr#\d+ above) - a quantifier that matches one or more of the preceding character.
• $ (at the end of the SELECT string) - an atomic zero-width assertion that ensures a match up to the end of the string.
• \w (as in \w*.NET\w*) - a character class that matches any word character.
• * (as in \w*.NET\w*) - a quantifier that matches zero or more of the preceding character.
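As a rough illustration of the title filter described above, the sketch below runs the \w*.NET\w* pattern over a handful of titles. The class name TitleFilter and the three titles are invented for this example; they are not taken from the Wrox site. The sketch also shows a subtlety worth knowing: the unescaped dot matches any character, so a stricter filter would escape it as \.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TitleFilter
{
    // Returns the titles in which the pattern matches somewhere.
    public static List<string> Select(IEnumerable<string> titles, string pattern)
    {
        Regex regex = new Regex(pattern);
        List<string> selected = new List<string>();
        foreach (string title in titles)
        {
            if (regex.IsMatch(title))
            {
                selected.Add(title);
            }
        }
        return selected;
    }

    public static void Main()
    {
        // Hypothetical titles, made up for this sketch.
        string[] titles = { "Professional ASP.NET", "Beginning Java", "The PLANET Guide" };

        // Unescaped dot: "any character followed by NET".
        foreach (string title in Select(titles, @"\w*.NET\w*"))
        {
            Console.WriteLine("loose:  " + title);
        }

        // Escaped dot: a literal ".NET" is required.
        foreach (string title in Select(titles, @"\w*\.NET\w*"))
        {
            Console.WriteLine("strict: " + title);
        }
    }
}
```

With these made-up titles the loose pattern also accepts "The PLANET Guide", because the dot is free to match the letter A.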
Regular Expressions Language Elements

We'll now look a little deeper at the elements that comprise the regular expressions language, which the .NET documentation splits up into these categories:

• Character Classes and Character Escapes
• Substitutions
• Atomic Zero Width Assertions
• Quantifiers
• Grouping Constructs, Backreference Constructs and Alternation Constructs
The aim is not to provide an exhaustive account of each language element, but to demonstrate the use of each kind of element via a simple concrete example. For consistency, I'll make use of a single test sentence throughout, which I introduced earlier as:
    Each item is priced at $10 but you can purchase ten for only $80 which is much cheaper than 10 * $10 = $100.

After explaining the regular expression syntax, I'll show you the C# code that I used to drive the examples.

Character Escapes

The characters . $ ^ { [ ( | ) * + ? \ have special meanings as operators in regular expressions. You will see later how $ is an Atomic Zero-Width Assertion and how * is a Quantifier. This leaves us with a problem if we want to use a regular expression to discover, for example, the occurrences of the literal $ character in a given text. We solve the problem by prepending a backslash to any operator that we wish to match literally. Consider our test sentence:
    Each item is priced at $10 but you can purchase ten for only $80 which is much cheaper than 10 * $10 = $100.

A regular expression "\$" (ignore the quotes, and notice the backslash) will match four times, corresponding with the four occurrences of the $ sign, whereas a regular expression of "$" (no backslash) will match only once as the special end-of-string assertion. More about end-of-string later.
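The counts quoted above can be checked with a few lines of C#. This is only a sketch: the class name EscapeDemo and its helper methods are invented here, and it uses the static Regex.Matches method rather than the message-box program listed later in the article. It also covers the bare "*" case that the next paragraph describes, by catching the ArgumentException that the Regex constructor throws for an invalid pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // The article's test sentence.
    public const string Sentence =
        "Each item is priced at $10 but you can purchase ten for only $80 " +
        "which is much cheaper than 10 * $10 = $100.";

    // Counts the matches of a pattern in the test sentence.
    public static int Count(string pattern)
    {
        return Regex.Matches(Sentence, pattern).Count;
    }

    // Reports whether the regex engine accepts the pattern at all.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Count(@"\$"));        // literal dollar signs
        Console.WriteLine(Count("$"));          // end-of-string assertion
        Console.WriteLine(Count(@"\*"));        // literal asterisk
        Console.WriteLine(IsValidPattern("*")); // a quantifier with nothing to quantify
    }
}
```

When the text to be matched literally comes from user input, the static Regex.Escape method in the same namespace applies this backslash escaping for you.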
Similarly, the regular expression "\*" will match once, whereas "*" will trigger an exception in C# / .NET with the following description (because the "*" has been treated as a quantifier rather than a literal character).
    An unhandled exception of type 'System.ArgumentException' occurred in system.dll. Additional information: Parsing "*" - Quantifier {x,y} following nothing.

So, any operator character will be treated literally when escaped, that is, preceded by a backslash. For a complete list of Character Escapes, look in the "Character Escapes" section of the ".NET Framework General Reference".

Character Classes

We may wish to treat certain groups of characters as being of the same type, or class, of characters, and thus form a regular expression that matches any character of a particular class. For example, we might wish to treat characters 0 to 9 as belonging to the "decimal digit" class.
The general syntax for character classes uses the square brackets [ and ] to show that certain characters belong to the same class. For decimal digits we could therefore use the regular expression "[0123456789]" which would match 11 times in our test sentence for each of the 11 decimal digits.
For convenience we could shorten this to the range "[0-9]" (or a more limited range of say "[0-1]" for binary numbers) and we could reverse the sense in each case - that is, find non-digits - by adding the "^" character to give [^0123456789] or [^0-9].
You might be interested to know that we could instead take advantage of the special escape sequences "\d" (for decimal digits) and "\D" (for non-digits).
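To see that the three spellings of the digit class agree, a sketch along the following lines could be used. The class name DigitClassDemo is an invention for this example, and the counting is done with Regex.Matches rather than with the article's own test harness.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClassDemo
{
    // The article's test sentence.
    public const string Sentence =
        "Each item is priced at $10 but you can purchase ten for only $80 " +
        "which is much cheaper than 10 * $10 = $100.";

    // Counts the matches of a pattern in the test sentence.
    public static int Count(string pattern)
    {
        return Regex.Matches(Sentence, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("[0123456789]")); // every digit listed explicitly
        Console.WriteLine(Count("[0-9]"));        // the same class written as a range
        Console.WriteLine(Count(@"\d"));          // the shorthand escape
        Console.WriteLine(Count("[^0-9]"));       // the negated class: everything else
        Console.WriteLine(Count(@"\D"));          // shorthand for the negated class
    }
}
```

One caveat: in .NET, \d matches any Unicode decimal digit unless RegexOptions.ECMAScript is specified, so it is only equivalent to [0-9] for text such as our test sentence that contains nothing but ASCII digits.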
For a complete list of Character Classes, look in the "Character Classes" section of the ".NET Framework General Reference".

Atomic Zero-Width Assertions

Atomic Zero-Width Assertions cause a match to succeed or fail depending on the current position in the string. We've already met one of these, the $ character that matches the end-of-line or end-of-string position providing it is not preceded by a backslash. Do you remember that the unescaped regular expression "$" matched only once?
That result was technically correct, because there is exactly one end-of-string position in our sample text. Now suppose we wanted to count up the number of words in the input text. We can use the assertion \b to ensure that a regular expression matches at a word boundary, like this:

    \b[A-z]+\b
We're saying that each collection of one or more characters A-z - i.e. [A-z]+ - that has a word boundary \b before, and after, counts as a word. Note that there are only 17 such words (comprising only letters) in the input text. (Strictly speaking, the range A-z also takes in the six punctuation characters that sit between Z and a in the ASCII table, so [A-Za-z] is the safer way to say "letters only"; the count for our sentence is the same either way.) Note also the use of the + symbol to specify "one or more", which will lead us neatly onto quantifiers in the next section.

For a complete list of Atomic Zero-Width Assertions, look in the "Atomic Zero-Width Assertions" section of the ".NET Framework General Reference".

Quantifiers

Quantifiers act like multiplicities in database or object modeling, allowing you to specify "one or more", "zero or more", "exactly 2", and so on. To find any dollar amounts in our input text that are multiples of $100 (including $200, and up to $900) we could use the regular expression \$\d0{2}\b that matches a literal $ sign followed by a single digit, followed by exactly 2 zeros and a word boundary.
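Both the word-boundary pattern and the quantifier pattern can be tried out with a short sketch such as the one below. The class and method names (BoundaryDemo, CountWords, CountHundreds) are invented for illustration, and the "$1001 and $200" string is a made-up extra input that shows what the trailing \b is for.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // The article's test sentence.
    public const string Sentence =
        "Each item is priced at $10 but you can purchase ten for only $80 " +
        "which is much cheaper than 10 * $10 = $100.";

    // Counts runs of A-z characters that have a word boundary on both sides.
    public static int CountWords(string text)
    {
        return Regex.Matches(text, @"\b[A-z]+\b").Count;
    }

    // Counts dollar amounts made of one digit and exactly two zeros.
    public static int CountHundreds(string text)
    {
        return Regex.Matches(text, @"\$\d0{2}\b").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWords(Sentence));
        Console.WriteLine(CountHundreds(Sentence));

        // Hypothetical input: $1001 is rejected by the trailing \b, $200 is accepted.
        Console.WriteLine(CountHundreds("$1001 and $200"));
    }
}
```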
So we've matched $100, but not $10 or $80, and the word boundary is a necessary inclusion to avoid matching values of the form $1001.

For a complete list of Quantifiers, look in the "Quantifiers" section of the ".NET Framework General Reference".

Grouping Constructs and Substitutions

Suppose we wanted to reformat the dollar amounts in our test sentence to replace the prefixed symbol "$" with postfixed text "US dollars". The .NET API allows us to do that with a piece of code like this:
    String newText = Regex.Replace(sourceText, @"\$(?<amount>\d*)", "${amount} US dollars");

That single line of code transforms our test sentence as follows.

    Each item is priced at $10 but you can purchase ten for only $80 which is much cheaper than 10 * $10 = $100.

The sentence now becomes:

    Each item is priced at 10 US dollars but you can purchase ten for only 80 US dollars which is much cheaper than 10 * 10 US dollars = 100 US dollars.

The regular expression we're searching for is "\$(?<amount>\d*)". In words it means "match a literal dollar sign followed by a group of characters (named "amount") comprising zero or more decimal digits". (Because \d* allows zero digits, a lone $ would also match; \d+ would insist on at least one digit.) Think of amount as a variable that will contain the decimal amount (without the $ sign) for each match. We could write some .NET code to step through the matches one-by-one, and for each one we could pick out the value of the amount grouping. Or, as we have done in the code above, we could simply specify a replacement string to substitute the matched text with our new format. That simple substitution should be quite self-explanatory.

As an aside, you might be wondering why the regular expression string was preceded by the @ symbol in our method call. That's how we tell the C# compiler not to interpret backslashes as its own escape characters, and to preserve them to be interpreted by the Regex engine. This is not necessary if you're using Visual Basic.

For a complete list of Grouping Constructs and Substitutions, look in the "Grouping Constructs" and "Substitutions" sections of the ".NET Framework General Reference".
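The two options mentioned above, stepping through the matches to read the amount group or letting a replacement string do the work, might look like this in code. It is a sketch rather than the article's own listing: the class name DollarRewriter and its two methods are invented here, although the pattern and the replacement string are the ones shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DollarRewriter
{
    // The article's test sentence.
    public const string Sentence =
        "Each item is priced at $10 but you can purchase ten for only $80 " +
        "which is much cheaper than 10 * $10 = $100.";

    // A literal $ followed by a named group of digits.
    private static readonly Regex DollarAmount = new Regex(@"\$(?<amount>\d*)");

    // Option 1: walk the matches and read the named group from each one.
    public static List<string> Amounts(string text)
    {
        List<string> amounts = new List<string>();
        for (Match m = DollarAmount.Match(text); m.Success; m = m.NextMatch())
        {
            amounts.Add(m.Groups["amount"].Value);
        }
        return amounts;
    }

    // Option 2: let the ${amount} substitution rebuild each match.
    public static string Rewrite(string text)
    {
        return DollarAmount.Replace(text, "${amount} US dollars");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Amounts(Sentence).ToArray()));
        Console.WriteLine(Rewrite(Sentence));
    }
}
```

Holding the Regex in a static field is a design choice of this sketch; the one-line static Regex.Replace call in the article does the same job.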
Backreference Constructs

The grouping constructs mentioned above allow us to mark out sections of text with variable names, so that we can refer to them in our code or for use in substitutions as we've seen. We can also make use of groups by referring back to them within the same regular expression. A classic example is to find instances of double letters in words with a regular expression like (?<letter>[A-z])\k<letter>. When run against our test sentence we get no matches because there are no words with double letters.

But, if we run it against the substituted version (having $ symbols replaced by "US dollars" text, see above) we get four matches, corresponding with the four occurrences of the word "dollars".

In our regular expression we defined a group named "letter" comprising any character A-z, and we referred back to it with the backreference \k<letter> as the next matching character. The \k backreference succeeds if the subsequent character is the same as that matched by the prior grouping construct of the same name, in this case <letter>. For your information, we could have used a simpler version ([A-z])\1 instead, which backreferences the 1st unnamed group.
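A minimal sketch of the double-letter search follows, run against both versions of the sentence with the named and the numbered form of the backreference. The class name DoubleLetterDemo and its constants are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoubleLetterDemo
{
    // The article's test sentence, before and after the "US dollars" substitution.
    public const string Original =
        "Each item is priced at $10 but you can purchase ten for only $80 " +
        "which is much cheaper than 10 * $10 = $100.";
    public const string Substituted =
        "Each item is priced at 10 US dollars but you can purchase ten for only " +
        "80 US dollars which is much cheaper than 10 * 10 US dollars = 100 US dollars.";

    // Named group with a \k<name> backreference.
    public const string Named = @"(?<letter>[A-z])\k<letter>";

    // Unnamed group with a numbered backreference.
    public const string Numbered = @"([A-z])\1";

    // Counts the matches of a pattern in the given text.
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count(Original, Named));       // no doubled letters
        Console.WriteLine(Count(Substituted, Named));    // the "ll" in each "dollars"
        Console.WriteLine(Count(Substituted, Numbered)); // same search, numbered group
    }
}
```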
For a more practical purpose, backreferences could be used to pick out content from between matching tags in an HTML or XML document using a regular expression like this one.

    "<(?<tag>\w+)>(?<content>(.|\n)*?)</\k<tag>>"

For a complete list of Backreference Constructs, look in the "Backreference Constructs" section of the ".NET Framework General Reference".

Alternation Constructs

To understand the regular expression that I've just shown you, you really need to understand the meaning of the pipe "|" symbol. It's an alternation construct meaning "OR". As a simpler example, consider the number of times the number 10 appears in our test sentence. I count four, if we include the spelled-out word "ten" in the calculation, using the regular expression (ten|10)\b.
That regular expression looks for occurrences of the text "ten" or "10" followed by a word boundary. For a complete list of Alternation Constructs, look in the "Alternation Constructs" section of the ".NET Framework General Reference".
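The count of four can be reproduced with the sketch below. The pattern (ten|10)\b is inferred from the description above rather than copied from the original screenshot, and the class name AlternationDemo is invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // The article's test sentence.
    public const string Sentence =
        "Each item is priced at $10 but you can purchase ten for only $80 " +
        "which is much cheaper than 10 * $10 = $100.";

    // Counts "ten" or "10" when followed by a word boundary.
    public static int CountTens(string text)
    {
        return Regex.Matches(text, @"(ten|10)\b").Count;
    }

    public static void Main()
    {
        // "$10", "ten", "10" and "$10" count; the "10" inside "$100" does not.
        Console.WriteLine(CountTens(Sentence));
    }
}
```

The word boundary is what keeps the "10" at the start of "$100" out of the count, because the character that follows it is another digit.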
.NET Regular Expression Classes

Once you understand how to build regular expressions, you'll need a mechanism for driving them from your .NET program, so we'll look briefly at these .NET classes included in the System.Text.RegularExpressions namespace:

• System.Text.RegularExpressions.Regex
• System.Text.RegularExpressions.Match
• System.Text.RegularExpressions.Group
• System.Text.RegularExpressions.RegexOptions
The Regex class encapsulates a regular expression string and a set of options. You can look for occurrences (returned as matches) of the regular expression pattern within an input text using instance or static methods of the class.

RegexOptions provides enumeration values that may be combined to affect the matching operations, for example to run in case sensitive or case insensitive mode.

Match class instances represent occurrences of the regular expression pattern within the input text and provide access to text segments via group names or numbers (demonstrated in the Review and Further Work section later).

Group class instances represent text segments from the input text that were mapped to group names or numbers in the regular expression. Each match may have several groups associated with it as demonstrated in the Review and Further Work section later.

The first sample application, which I'll list next and which was used to drive the previous examples, makes use of these classes.
Sample Application #1 - RegexCounter

To demonstrate the various regular expressions language elements, I used a simple C# / .NET program that took a regular expression as input, and which produced as output a message box like this:
You might like to try out the examples for yourself and experiment with some of your own, for which you'll need the following program listing:
    using System;
    using System.Text.RegularExpressions;
    using System.Windows.Forms;

    namespace RegularExpressions
    {
        public class RegexCounter
        {
            public static void Main(String[] args)
            {
                String sourceText = "Each item is priced at $10 but you can purchase ten for only $80 which is much cheaper than 10 * $10 = $100.";
                RegexCounter.countRegex(sourceText, @"\$\d0{2}\b", true);
            }

That Main() method should be easy enough to understand. All the interesting work is done in the next method:

            public static void countRegex(String sourceText, String regexText, bool ignoreCase)
            {

First of all we can set some RegexOptions, for example to make the matches case-sensitive or case-insensitive, before creating the Regex instance:

                RegexOptions options = RegexOptions.None;
                if (ignoreCase) options = RegexOptions.IgnoreCase;
                Regex countRegex = new Regex(regexText, options);

Next we create a Match instance to match occurrences of the regular expression in our source text:

                Match countMatch = countRegex.Match(sourceText);

We step through each successful match, incrementing the count as we go:

                int count = 0;
                for ( ; countMatch.Success; countMatch = countMatch.NextMatch())
                    count++;

Finally we display the result.

                MessageBox.Show("The regular expression '" + regexText + "' appears " + count + " times.");
            }
        }
    }

That code may be found in the RegexCounter.cs source file provided, and you can drive it using the RegexTester.cs file also provided. You can find out more about the classes used by looking in the "System.Text.RegularExpressions Namespace" section of the ".NET Framework Class Library".
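For readers who would rather not pull in Windows Forms, a console variant of the same counter is sketched below. It is not part of the article's download: the class name ConsoleRegexCounter and both method names are invented, and the second method swaps the Match/NextMatch loop for the static Regex.Matches call, which returns all matches as a collection.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConsoleRegexCounter
{
    // Same approach as the article's countRegex: step through with NextMatch.
    public static int CountWithNextMatch(string sourceText, string regexText, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        Regex regex = new Regex(regexText, options);
        int count = 0;
        for (Match m = regex.Match(sourceText); m.Success; m = m.NextMatch())
        {
            count++;
        }
        return count;
    }

    // Alternative: ask for the whole MatchCollection and read its Count.
    public static int CountWithMatches(string sourceText, string regexText, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.Matches(sourceText, regexText, options).Count;
    }

    public static void Main()
    {
        string sourceText =
            "Each item is priced at $10 but you can purchase ten for only $80 " +
            "which is much cheaper than 10 * $10 = $100.";
        string regexText = @"\$\d0{2}\b";

        int count = CountWithNextMatch(sourceText, regexText, true);
        Console.WriteLine("The regular expression '" + regexText + "' appears " + count + " times.");
    }
}
```

The two methods should always agree; the loop is the better fit when you need to inspect each Match as you go, and the collection is shorter when only the total matters.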
Sample Application #2 - TagParser

The previous sample application that counts up the number of occurrences of a given regular expression was sufficient for the basic demonstration. In any practical application we'd want to know a little more about each matching instance of the regular expression, and in particular we might want to gain access to the text segments enclosed within named groups. Remember this regular expression that I showed earlier?

    "<(?<tag>\w+)>(?<content>(.|\n)*?)</\k<tag>>"

It's used by the next class to pick out matching tag pairs from an input string and to print out the tags with appropriate indents to show the levels of embedding. A method call as follows:
RegexSamples.showTags("<TITLE>Sample HTML Tablerow0,col0 | row0,col1 |
row1,col0 | row1,col1 |
/",""); Will result in console output of:
    <TITLE>

Hopefully that provides a more realistic example of something you might do with regular expressions, and the code is as follows:
    using System;
    using System.Text.RegularExpressions;

    namespace RegularExpressions
    {
        public class TagParser
        {
            public static void showTags(String text, String indent)
            {

There we defined a TagParser class within a RegularExpressions namespace, having a single showTags(.) method. Next we create an instance of our regular expression:

                Regex tagRegex = new Regex(@"<(?<tag>\w+)>(?<content>(.|\n)*?)</\k<tag>>");

We're looking for:

• A group (called <tag>) of one or more (+) word characters (\w) within angle brackets < and >, followed by
• A group (called <content>) of zero or more (*?) of any character or newline (.|\n), followed by
• A / character and the backreferenced (\k) content of the <tag> group within angle brackets < and >.

In a nutshell, we want to find matching <tag> and </tag> combinations with enclosed content. If you're wondering about the question mark (?) character that follows the asterisk (*) in our regular expression, I'll explain that shortly. Now we match the regular expression against our input text and loop through the matches:

                Match tagMatch = tagRegex.Match(text);
                for ( ; tagMatch.Success; tagMatch = tagMatch.NextMatch())
                {
In the following line, take note of the tagMatch.Groups["tag"] lookup that extracts the text matching our regular expression's <tag> group:

                    System.Console.WriteLine(indent+"<"+tagMatch.Groups["tag"]+">");

Because the content enclosed between two matching tags may itself contain other pairs of matching tags, we're now doing some clever recursion by re-entering the showTags(.) method with the text of the <content> group.

                    showTags(""+tagMatch.Groups["content"], indent+"  ");

And when we wind back out of the recursion we print the closing tag at this level.

                    System.Console.WriteLine(indent+"</"+tagMatch.Groups["tag"]+">");
                }

Now close off the method, class, and namespace:
            }
        }
    }

That recursive approach is not the most performant solution, and for large, deeply nested input files the call stack will grow and grow, but it's the neatest way to demonstrate the principle. The code may be found in the TagParser.cs source file provided, and you can drive it using the RegexTester.cs file also provided.

Lazy Quantifiers and the Unexplained Question Mark

That second example illustrates the importance of lazy quantifiers, like *?, that match the minimum number of repetitions. Assuming TR as a value matched by the <tag> group in our regular expression, the <content> group would first match only the cells of the first table row (row0,col0 and row0,col1), because that text is enclosed within a <TR> and </TR> combination.

With a greedy quantifier (* rather than *?), the <content> group would instead be the whole of the text between the first <TR> and the last </TR>, taking in both rows (row0,col0, row0,col1, row1,col0 and row1,col1). That text is also enclosed within <TR> and </TR> tags, but not at all what we wanted, and the final output in that case would be as follows. Look carefully at how the nested tags are indented.

    <TITLE>
That was a practical demonstration of the subtle difference between a lazy quantifier (*?) and a greedy quantifier (*). The former matches the shortest text segment that meets the criteria, and the latter matches the longest text segment that matches the criteria.
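To make the difference concrete, the sketch below runs the tag-listing recursion twice over the same input, once with the lazy pattern and once with a greedy variant. It is an adaptation rather than the article's TagParser: the class name TagLister is invented, the lines are collected in a list instead of being written straight to the console, and the small HTML fragment is a made-up stand-in for the article's sample table.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagLister
{
    // The article's pattern, with the lazy *? quantifier.
    public const string LazyPattern = @"<(?<tag>\w+)>(?<content>(.|\n)*?)</\k<tag>>";

    // The same pattern with a greedy * quantifier, for comparison.
    public const string GreedyPattern = @"<(?<tag>\w+)>(?<content>(.|\n)*)</\k<tag>>";

    // Returns one indented line per opening and closing tag found.
    public static List<string> ListTags(string text, string pattern)
    {
        List<string> lines = new List<string>();
        Collect(new Regex(pattern), text, "", lines);
        return lines;
    }

    // Recursive step: record each tag pair, then descend into its content.
    private static void Collect(Regex tagRegex, string text, string indent, List<string> lines)
    {
        for (Match m = tagRegex.Match(text); m.Success; m = m.NextMatch())
        {
            string tag = m.Groups["tag"].Value;
            lines.Add(indent + "<" + tag + ">");
            Collect(tagRegex, m.Groups["content"].Value, indent + "  ", lines);
            lines.Add(indent + "</" + tag + ">");
        }
    }

    public static void Main()
    {
        // Hypothetical input: a table with two single-cell rows.
        string html = "<TABLE><TR><TD>a</TD></TR><TR><TD>b</TD></TR></TABLE>";

        Console.WriteLine("Lazy:");
        foreach (string line in ListTags(html, LazyPattern)) Console.WriteLine(line);

        Console.WriteLine("Greedy:");
        foreach (string line in ListTags(html, GreedyPattern)) Console.WriteLine(line);
    }
}
```

With this input the lazy pattern reports both rows, each with its own cell, whereas the greedy pattern swallows everything from the first <TR> to the last </TR> as a single row and so reports only one TR and one TD pair.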
Conclusion

Over the years, many programming and scripting languages - like AWK, Perl and Python - have included regular expressions as part of their fabric. Many other languages and runtime environments - like the Java SDK 1.4 and C#/VB/C++/.NET - have been furnished with a regular expressions capability, and with good reason. Regular expressions provide a very powerful mechanism for parsing, transforming, and otherwise manipulating text with little coding effort.

For those who are new to regular expressions, I've tried to demystify them a little via simple, concrete examples of the various RE language elements. For those familiar with regular expressions, but unfamiliar with the .NET supporting classes, I've introduced the main classes from the System.Text.RegularExpressions namespace.

If this article has whetted your appetite you will find plenty of documentation and some more examples in the official .NET documentation, and you might like to revisit my previous "Working with Web Data in C#" (http://www.csharptoday.com/content/articles/20020128.asp) article to try out a few more regular expressions in a real-life scenario.

Article Information

Author: Tony Loton
Technical Editor: Adam Ryland
Author Agent: Charlotte Smith
Project Manager: Helen Cuthill
Reviewers: John Boyd Nolan, Phil Sidari
If you have any questions or comments about this article, please contact the technical editor.
C#Today is brought to you by Wrox Press (www.wrox.com). Copyright © 2002 Wrox Press.