The main class that provides regular expression functionality in .NET is the Regex class, which lives in the System.Text.RegularExpressions namespace. To access it, add the following using directive:
using System.Text.RegularExpressions;
The Regex class contains several overloaded Replace methods. All of them are public, and several are static. With a static method you don’t need to create an instance of the class; you call the method directly on the class name:
title = Regex.Replace(title, @"[^\w\s]", "");
In the line above, assume ’title’ holds the title of an article or a blog post. The regular expression removes every character that is not a word character (a letter, a digit, or the underscore) or whitespace. That means punctuation such as the period (.), comma (,), and quotes is removed.
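To illustrate the static-versus-instance distinction mentioned above, here is a minimal sketch. The class and method names (ReplaceCallStyles, StripWithStatic, StripWithInstance) are made up for this example; only Regex.Replace and the Regex constructor come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class ReplaceCallStyles
{
    // Static call: the pattern is passed to Regex.Replace on every call.
    public static string StripWithStatic(string input)
    {
        return Regex.Replace(input, @"[^\w\s]", "");
    }

    // Instance call: the pattern is fixed once, when the Regex object is built.
    private static readonly Regex NonWordOrSpace = new Regex(@"[^\w\s]");

    public static string StripWithInstance(string input)
    {
        return NonWordOrSpace.Replace(input, "");
    }

    public static void Main()
    {
        string title = "Hello, world!";
        Console.WriteLine(StripWithStatic(title));   // Hello world
        Console.WriteLine(StripWithInstance(title)); // Hello world
    }
}
```

Both calls give the same result. The instance form can be handy when the same pattern is reused many times, while the static form is shorter for one-off calls.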
Removing Non-Alphanumeric Characters
Let’s look at the regular expression string: [^\w\s]:
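- \w matches any word character: letters, digits, and the underscore
- \s matches any whitespace character: spaces, tabs, and line breaks
- ^ as the first character inside the brackets negates the set, so the class matches anything that is NOT a word character or whitespace
- [] defines a character class, which matches exactly one character at a time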
So, with the above regular expression, we are telling the Replace method to match every character that is NOT a word character or whitespace.
The Replace overload used here takes three parameters:
- title - the input string to which we apply the regular expression pattern
- [^\w\s] - the regular expression pattern to look for in the string
- "" - the replacement string. In this case it is an empty string, so the matched characters are deleted.
So, if the title string is: "Why use @blah?"
It becomes: "Why use blah"
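The example above can be checked with a short program. This is only a sketch: the TitleCleaner class and its method name are hypothetical, and the second input is an extra case added here to show that \w keeps the underscore.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class TitleCleaner
{
    // Deletes every character that is neither a word character (\w) nor whitespace (\s).
    public static string RemoveNonWordCharacters(string title)
    {
        return Regex.Replace(title, @"[^\w\s]", "");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveNonWordCharacters("Why use @blah?"));      // Why use blah
        // The underscore counts as a word character, so it survives.
        Console.WriteLine(RemoveNonWordCharacters("snake_case, really?")); // snake_case really
    }
}
```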
Removing Special Characters Selectively
Now suppose we don’t want to remove every ’special’ (non-alphanumeric) character, and want to keep the periods (.) and dashes (-). In this case, we can say explicitly what we want to keep:
title = Regex.Replace(title, @"[^a-zA-Z0-9\.\-\s]", "");
In this case, the pattern we are looking for is: [^a-zA-Z0-9\.\-\s]
- a-z : All the lower-case ASCII letters
- A-Z : All the upper-case ASCII letters
- 0-9 : All the digits
- \. : The period (the \ escape is optional inside a character class, but harmless)
- \- : The dash, escaped so that it is not read as a range
- \s : Whitespace
By listing the lower-case letters, upper-case letters, and digits explicitly, we keep only a subset of what \w matches: accented and other non-ASCII letters, as well as the underscore, are now removed.
With the above pattern, the following happens:
Title: "Isn’t asp.net cool?"
Becomes: "Isnt asp.net cool"
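As a rough sketch of the difference between the explicit ASCII ranges and \w, the program below runs both variants. The class and method names are hypothetical, and the \w-based variant and the "Café" input are additions made here for comparison, not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class SelectiveCleaner
{
    // Keeps ASCII letters, digits, periods, dashes, and whitespace.
    public static string KeepAsciiDotsAndDashes(string title)
    {
        return Regex.Replace(title, @"[^a-zA-Z0-9\.\-\s]", "");
    }

    // Variant for comparison: keeps every \w character (including non-ASCII
    // letters and the underscore), plus periods, dashes, and whitespace.
    public static string KeepWordDotsAndDashes(string title)
    {
        return Regex.Replace(title, @"[^\w\.\-\s]", "");
    }

    public static void Main()
    {
        Console.WriteLine(KeepAsciiDotsAndDashes("Isn't asp.net cool?")); // Isnt asp.net cool

        string accented = "Caf\u00e9-style v2.0!"; // \u00e9 is the letter e with an acute accent
        Console.WriteLine(KeepAsciiDotsAndDashes(accented)); // Caf-style v2.0
        Console.WriteLine(KeepWordDotsAndDashes(accented));  // accented letter is kept
    }
}
```

Which variant fits depends on whether titles may contain non-English letters that should be preserved.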
Replacing Spaces and Periods with Dashes
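On the above string (the string with some of the special characters already removed), apply the following Replace:
title = Regex.Replace(title, @"[\s\.]+", "-");
Note that the + sits outside the brackets: inside a character class, + is a literal plus sign rather than a quantifier. Written this way, each run of spaces and periods becomes a single dash, and you get a nice title that’s search engine friendly.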
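Title: "How to match and not match period and other non-alphanumeric characters?"
Becomes: "how-to-match-and-not-match-period-and-other-non-alphanumeric-characters"
Note that the lower-casing is a separate step, for example title.ToLower(); Regex.Replace itself does not change case.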
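Putting the steps together, a slug helper might look like the sketch below. The SlugMaker class and ToSlug method are hypothetical names. The Trim calls and the use of ToLowerInvariant are assumptions added here to handle leading or trailing separators and casing, and the second pattern puts the + outside the brackets so that a run of spaces or periods collapses into one dash.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class SlugMaker
{
    public static string ToSlug(string title)
    {
        // Step 1: drop everything except ASCII letters, digits, periods, dashes, and whitespace.
        string cleaned = Regex.Replace(title, @"[^a-zA-Z0-9\.\-\s]", "");

        // Step 2: turn each run of whitespace and periods into a single dash.
        // Trim() first (an assumption) so outer whitespace does not become dashes.
        string dashed = Regex.Replace(cleaned.Trim(), @"[\s\.]+", "-");

        // Step 3 (an assumption): strip any leading or trailing dash and lower-case the result.
        return dashed.Trim('-').ToLowerInvariant();
    }

    public static void Main()
    {
        Console.WriteLine(ToSlug("How to match and not match period and other non-alphanumeric characters?"));
        // how-to-match-and-not-match-period-and-other-non-alphanumeric-characters
        Console.WriteLine(ToSlug("Isn't asp.net cool?"));
        // isnt-asp-net-cool
    }
}
```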
This is further discussed in the following article:
Some Quick and Easy Ways to Rewrite URLs.