How do you read HTML? Parser or RegEx?

When reading HTML with C#, do you use a parser or do you use Regular Expressions? This post will discuss how parsing is better than regular expressions.

Written by Jonathan "JD" Danylko • Last Updated: • Develop •
Which do you prefer? Parser or RegEx?

Over the past weekend, I've been working on my CMS (Content Management System) and came across some old code that uses regular expressions to read HTML.

WHAT!?!?? OMG...what have I done!? I'm succumbing to the dark side.

My fears were laid to rest when I saw some Html Agility Pack code sitting right next to my regex code. Phew! I must have simply missed converting some old RegEx code over to the parser.

I immediately replaced the regular expressions with parser code. Then, I had to take a shower because I felt so dirty for what I had done with regular expressions.

Once I saw the final code, then, and only then, did unicorns and Obama play together as one.

[Image: Obama riding a unicorn]

Now, I don't know where I got this picture, but we all know it's not real, right? (The tip-off is that there's no way Obama can shoot rainbows from his hands!)

Kind of like parsing HTML with Regular Expressions...it's not real!

What is your reasoning?

Still think regular expressions are better for reading HTML, huh?

Ok, let's go through a couple of common questions as to why parsing is better!

Question: Why do I need to use a parser? I can write regular expression code in fewer lines.

I'm sure you could, but honestly, it's almost the same number of lines.

Let's compare the two ways of reading an HTML title tag: 

Html Agility Pack

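using HtmlAgilityPack;
var doc = new HtmlDocument { OptionOutputAsXml = true };
doc.LoadHtml(htmlContent);
// SelectSingleNode returns null when the page has no <title>, hence the ?. below.
var titleNode = doc.DocumentNode.SelectSingleNode("//head/title");
Console.Write("Title: {0}", titleNode?.InnerText);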

Regular Expression

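using System.Text.RegularExpressions;
// Match() never returns null; when nothing matches, Groups[1].Value is an empty string.
var ex = new Regex(@"(?<=<title.*>)([\s\S]*)(?=</title>)", RegexOptions.IgnoreCase);
return ex.Match(htmlPage).Groups[1].Value.Trim();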

Yes, the parser may be a couple of lines longer, but the next time you need to examine another part of the HTML, you only need one line of XPath to reach the node and one line to display or assign its value or attribute.

With the regular expression, you have to create a new RegEx instance to start the whole process over again to find another tag.
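To make that concrete, here is a minimal sketch of parsing a page once and then asking it several questions. The `PageSummary` class and its `Read` method are hypothetical names made up for illustration, and the code assumes the HtmlAgilityPack NuGet package is referenced in your project.

```csharp
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

public static class PageSummary
{
    // Hypothetical helper: parses the HTML once, then runs three independent XPath queries.
    public static (string Title, string Description, int LinkCount) Read(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        // Each extra piece of data costs one XPath line; the document is not re-parsed.
        var titleNode = doc.DocumentNode.SelectSingleNode("//head/title");
        var descriptionNode = doc.DocumentNode.SelectSingleNode("//meta[@name='description']");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");

        return (
            titleNode?.InnerText.Trim() ?? "",
            descriptionNode?.GetAttributeValue("content", "") ?? "",
            links?.Count ?? 0);
    }
}
```

Calling `PageSummary.Read(htmlContent)` gives you the title, the meta description, and a count of links that have an `href`. The regex route would need a separate pattern for each of those.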

Winner: Html Parser

Question: Why can't I use Regular Expressions? Everyone is using it to parse HTML!

I remember using regular expressions a long time ago. I kept saying there had to be a better way to read the HTML.

I'm a lazy programmer. Most programmers are lazy folks by nature. We find better ways of doing things and automate the crap out of it. Heck, that's our job.

Html Parsers provide better interfaces and perform all of the heavy lifting, so we don't have to write code to find specific tags. It makes things easier on us.

If you are using the Html Agility Pack, you see the same parsing code all over the Internet. Maybe it's not exactly the same code, but it's consistent. It's usually just a slight variation on the original parsing code.

Now...regular expressions...Yes, everyone is using regular expressions to parse HTML, but do you know how many versions of regular expressions I find on web sites that read a particular tag?

At least 5. It's different for every developer who wants to "read HTML," and each of them has come up with the most awesome way of doing it...until it doesn't work with a specific web page.

I would rather stick to the tried-and-true, one piece of parser code to read HTML instead of 5 ways of reading it and waiting for it to fail on "Bob's Fabulous Bank" web site that is improperly formatted.
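Here's a small sketch of how that failure tends to happen. The pattern below is a deliberately naive one I made up for illustration (it's not taken from any particular site), and `TitleRegexDemo` is a hypothetical class name. It only needs the standard `System.Text.RegularExpressions` namespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleRegexDemo
{
    // Deliberately naive: assumes lowercase tags, no attributes, and a title on one line.
    private static readonly Regex NaiveTitle = new Regex(@"<title>(.*)</title>");

    // Returns the captured title, or an empty string when the pattern does not match.
    public static string ExtractTitle(string html)
    {
        Match match = NaiveTitle.Match(html);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string tidy = "<html><head><title>Home</title></head></html>";
        string messy = "<HTML><HEAD><TITLE lang=\"en\">\nBob's Fabulous Bank\n</TITLE></HEAD></HTML>";

        Console.WriteLine("[{0}]", ExtractTitle(tidy));  // prints "[Home]"
        Console.WriteLine("[{0}]", ExtractTitle(messy)); // prints "[]"
    }
}
```

The second page is perfectly readable to a browser, but the uppercase tags, the attribute, and the line breaks each defeat the pattern. Every "fix" for one of those is how you end up with yet another version of the regex.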

Also, this one StackOverflow question (at last check) has 4,428 developers who agree that you should NOT use Regular Expressions to parse your HTML.

NOT everyone is doing it!

Winner: Html Parser

Conclusion

While I can appreciate developers "giving their all" at trying to find a tag and extract the inner text from it, I feel that there are companies that sometimes encourage this type of thinking (umm...Regular Expression Examples at Microsoft.com).

Regular expressions do have their place (matching on string patterns, splitting text, string validation, etc.), but using them on HTML is not a good idea. Too many unknowns.

Keep the Html Parsers parsing data, let the regular expressions perform their string duties, and keep the RegEx out of the HTML arena.
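For contrast, here's a quick sketch of the kind of string duties where regular expressions are right at home. The class and method names are hypothetical, and the ZIP-style format is just an assumed example of a validation rule.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStringDuties
{
    // Validation: five digits, optionally followed by a dash and four more (US ZIP style).
    public static bool IsZipCode(string input) =>
        Regex.IsMatch(input, @"^\d{5}(-\d{4})?$");

    // Splitting: break a list on commas or semicolons, swallowing the whitespace around them.
    public static string[] SplitTags(string input) =>
        Regex.Split(input.Trim(), @"\s*[,;]\s*");
}
```

Both inputs are flat strings with a known shape, which is exactly what a regular expression describes well. Nested, sloppy, real-world HTML is neither flat nor predictable.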

Do you have any reasons why parsing (or regular expressions) is better? What are your thoughts on this topic? Post your comments below!

 


Jonathan Danylko is a web architect and entrepreneur who's been programming for over 25 years. He's developed websites for small, medium, and Fortune 500 companies since 1996.

He currently works at Insight Enterprises as a Principal Software Engineer Architect.

When asked what he likes to do in his spare time, he replies, "I like to write and I like to code. I also like to write about code."
