Monday, 14 September 2009

Regular Expressions in C#

This post records my notes on regular expressions. I don't use them very often, so every time I come back to the topic I have to remind myself of some of the details, and that is why I decided to publish this post.

A library with a lot of examples is here: http://regexlib.com/
There is also a good freeware program for testing your expressions: http://www.radsoftware.com.au/regexdesigner/

The Regex class lives in the System.Text.RegularExpressions namespace:
using System.Text.RegularExpressions;

Our input string looks like this:

Jacek Skowron ****
Mielec 1984
Address line
description description description description description description description description description description description description description description description description description description description description
Krystian Kapel ***
Szczucin 1983
Address line
description description description description description description description description description description description description description description description description description description description description

This string contains a set of people. Each person is described on four lines: the first line holds the name and rating, the second line the birth place and year, the third line the address, and the last line the description. The regular expression I use to parse these details from the input string is:
   1:  (?<name>.+?)\s(?<rating>\*+).*?\n
   2:  (?<birthPlace>.+?)(?<birthYear>\d{4}?).*?\n
   3:  (?<address>.+?)\n
   4:  (?<description>.+?)\n
Let's examine this expression step by step, starting from a simplified version and building it up.
First line: a simplified version could look like this:
   1:  .+?\s\*+.*?\n
Let's assume for now that there are no question marks at all. The explanation is then simple:
'.' The dot stands for any character (except the newline character \n).
'+' The plus sign says that we want one or more characters (here any characters, because of the dot).
'\s' stands for a single white-space character.
'\*+' Next comes an escaped star sign combined with the + sign, which again means one or more occurrences, this time of stars.
'.*\n' All that is left on this line is the newline character '\n', but I inserted '.*' before it so that it consumes anything (typically trailing white-space) between the last star and the newline.

Adding the question marks stops the quantifiers from being greedy, that is, from matching as much as they can. Because the dot never matches \n unless RegexOptions.Singleline is set, a greedy '.+' cannot run past the end of the line, but within the line it would extend the name up to the last white-space that is followed by stars. With RegexOptions.Singleline it could even find the name on the first line and the stars (rating) much further down the input string. That said, the first question mark ensures that the regex takes all text up to the first occurrence of white-space followed by stars, and the second ? tells the regex to stop at the first newline. We're almost done with the first line, except for the brackets, which allow us to divide the expression into subexpressions (groups).
   1:  (?<name>.+?)
Going back to our first line and finding the person's name: I surrounded the expression responsible for finding the name with brackets, i.e. (expression), and then added the group name in angle brackets after a question mark, i.e. (?<name>expression). The same was done for the rating expression:
(?<rating>\*+)
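To make the greedy versus lazy difference visible, here is a minimal sketch. The input line with a second run of stars is a made-up example (it is not part of the post's data), and FirstLineDemo and CaptureName are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstLineDemo
{
    // Returns the text captured by the "name" group, or an empty string when nothing matches.
    public static string CaptureName(string pattern, string line)
    {
        Match m = Regex.Match(line, pattern);
        return m.Success ? m.Groups["name"].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical line with two runs of stars, so the two variants give different results.
        string line = "Jacek Skowron **** extra ***\n";

        // Lazy quantifiers: the name stops at the first white-space followed by stars.
        string lazy = CaptureName(@"(?<name>.+?)\s(?<rating>\*+).*?\n", line);

        // Greedy quantifiers: the name runs up to the last white-space followed by stars.
        string greedy = CaptureName(@"(?<name>.+)\s(?<rating>\*+).*\n", line);

        Console.WriteLine(lazy);    // prints "Jacek Skowron"
        Console.WriteLine(greedy);  // prints "Jacek Skowron **** extra"
    }
}
```

On the post's real data each line has only one run of stars, so both variants would capture the same name there; the lazy form simply makes the intent explicit.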

Second line:
   2:  (?<birthPlace>.+?)(?<birthYear>\d{4}?).*?\n
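It introduces only one new element: matching digits, written as '\d'. Here we want the year, which is always 4 digits, hence '\d{4}'. The extra '?' after '{4}' in the pattern has no effect, because an exact count leaves nothing to be lazy about.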
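As a small illustration of the second line on its own, the sketch below applies that pattern to a single line of the sample data. SecondLineDemo and TryParse are hypothetical names, the group name birthYear follows the code listing, and converting the year to an int is my own assumption about how it might be used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SecondLineDemo
{
    // Pattern for the second line of a record (birth place followed by a 4-digit year).
    private const string Pattern = @"(?<birthPlace>.+?)(?<birthYear>\d{4}?).*?\n";

    // Tries to read the birth place and year from one line that ends with '\n'.
    public static bool TryParse(string line, out string birthPlace, out int birthYear)
    {
        birthPlace = string.Empty;
        birthYear = 0;

        Match m = Regex.Match(line, Pattern);
        if (!m.Success) return false;

        // The lazy group also captures the space before the year, so trim it.
        birthPlace = m.Groups["birthPlace"].Value.Trim();
        birthYear = int.Parse(m.Groups["birthYear"].Value);
        return true;
    }

    public static void Main()
    {
        if (TryParse("Mielec 1984\n", out string place, out int year))
        {
            Console.WriteLine(place + " / " + year);  // prints "Mielec / 1984"
        }
    }
}
```

Note that the birthPlace group captures "Mielec " including the trailing space, which is why the value is trimmed before use.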
The third and fourth lines are just a variation of the part of the first line where we were looking for the person's name. The regular expression is done. Let's use it in code:
Listing - using our expression.
// The four pattern lines joined into a single string.
Regex rgx = new Regex(@"(?<name>.+?)\s(?<rating>\*+).*?\n(?<birthPlace>.+?)(?<birthYear>\d{4}?).*?\n(?<address>.+?)\n(?<description>.+?)\n",
    RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
MatchCollection m = rgx.Matches(inputstring);

// Each match corresponds to one person (four input lines).
foreach (Match item in m)
{
    Person p = new Person();
    p.Name = item.Groups["name"].Value;
    // The rating is the number of stars captured.
    p.Rating = item.Groups["rating"].Value.Length;
    // Trim() removes white-space (including '\r') left at the edges of a capture.
    p.BirthPlace = item.Groups["birthPlace"].Value.Trim();
    p.BirthYear = item.Groups["birthYear"].Value.Trim();
    p.Address = item.Groups["address"].Value.Trim();
    p.Description = item.Groups["description"].Value.Trim();
    // Process our person object
}
The variable inputstring holds the text we want to parse, and the MatchCollection holds one match per person found by the regex. Because we named the groups earlier, looping over the regex results and reading each person's properties is easy and straightforward.
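For anyone who wants to run the listing end to end, here is a self-contained sketch. The Person class is hypothetical (its definition is not shown in the post), as are the PeopleParser and Parse names, and the sample descriptions are shortened. Two details are my own assumptions rather than part of the original listing: Windows line endings are normalised first, and a trailing '\n' is appended when missing, since the pattern only matches a description that is followed by a newline.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical Person type; property names follow the listing.
public class Person
{
    public string Name { get; set; } = "";
    public int Rating { get; set; }
    public string BirthPlace { get; set; } = "";
    public string BirthYear { get; set; } = "";
    public string Address { get; set; } = "";
    public string Description { get; set; } = "";
}

public static class PeopleParser
{
    // Same pattern as in the listing, split per input line for readability.
    private static readonly Regex Rgx = new Regex(
        @"(?<name>.+?)\s(?<rating>\*+).*?\n" +
        @"(?<birthPlace>.+?)(?<birthYear>\d{4}?).*?\n" +
        @"(?<address>.+?)\n" +
        @"(?<description>.+?)\n");

    public static List<Person> Parse(string input)
    {
        // Assumption: normalise "\r\n" and make sure the last record ends with '\n'.
        string text = input.Replace("\r\n", "\n");
        if (!text.EndsWith("\n", StringComparison.Ordinal)) text += "\n";

        var people = new List<Person>();
        foreach (Match item in Rgx.Matches(text))
        {
            people.Add(new Person
            {
                Name = item.Groups["name"].Value,
                Rating = item.Groups["rating"].Value.Length,
                BirthPlace = item.Groups["birthPlace"].Value.Trim(),
                BirthYear = item.Groups["birthYear"].Value.Trim(),
                Address = item.Groups["address"].Value.Trim(),
                Description = item.Groups["description"].Value.Trim()
            });
        }
        return people;
    }

    public static void Main()
    {
        string input =
            "Jacek Skowron ****\nMielec 1984\nAddress line\ndescription description\n" +
            "Krystian Kapel ***\nSzczucin 1983\nAddress line\ndescription description";

        foreach (Person p in Parse(input))
        {
            Console.WriteLine(p.Name + " (" + p.Rating + "): " + p.BirthPlace + ", " + p.BirthYear);
        }
        // prints "Jacek Skowron (4): Mielec, 1984"
        // prints "Krystian Kapel (3): Szczucin, 1983"
    }
}
```

Without the appended newline, the second record in this sample would be skipped, because its description line would not be followed by '\n'.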
