November 30, 2009

Simple HTML Parser in Objective-C

For my current project I needed a way to fetch remote HTML and then parse it into a more accessible data structure. So I took my Java XML parser work, ported it over to Objective-C, and extended it to work with HTML, which tends to be far messier and more broken... grr. To combat this, unlike a full HTML parser, this one converts the page to a pseudo-XML form, where all character data between < and > (or />) is stored as the tag string, attributes and all. The downside is that you need to parse out any needed tag attributes separately, but that is a price I am willing to pay in this case.
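Since attributes stay embedded in the tag string, pulling one out is a separate step. Here is a minimal sketch of that step in plain C (which also compiles inside an Objective-C file). The function name `tag_attribute` and the sample tag strings are assumptions made for illustration; they are not part of HTMLNode. It only handles `name=value` with no spaces around the `=`, matches names case-sensitively, and can be fooled by a match inside another attribute's quoted value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of HTMLNode.
   Copies the value of attribute `name` from a raw tag string such as
   img src="logo.png" alt='Logo' /
   into `out` (truncated to fit, always NUL-terminated).
   Returns 1 if the attribute was found, 0 otherwise. */
int tag_attribute(const char *tag, const char *name, char *out, size_t out_size)
{
    assert(tag != NULL && name != NULL && out != NULL);

    size_t name_len = strlen(name);
    const char *p = tag;

    if (name_len == 0 || out_size == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        const char *after = p + name_len;
        /* Only accept a match that starts a word and is followed by '='. */
        int starts_word = (p > tag) && isspace((unsigned char)p[-1]);

        if (starts_word && *after == '=') {
            const char *start = after + 1;
            const char *end;

            if (*start == '"' || *start == '\'') {
                /* Quoted value: runs up to the matching quote. */
                char quote = *start++;
                end = strchr(start, quote);
                if (end == NULL)
                    return 0; /* unterminated quote */
            } else {
                /* Unquoted value: runs up to whitespace or '>'. */
                end = start;
                while (*end && !isspace((unsigned char)*end) && *end != '>')
                    end++;
            }

            size_t len = (size_t)(end - start);
            if (len >= out_size)
                len = out_size - 1;
            memcpy(out, start, len);
            out[len] = '\0';
            return 1;
        }
        p = after; /* keep scanning past this non-matching occurrence */
    }
    return 0;
}
```

Given a tag string like `img src="logo.png" alt='Logo' /`, asking for `src` fills the buffer with `logo.png`, while asking for a missing attribute returns 0 and leaves the buffer untouched.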

Check out the files below for the code...
HTMLNode.h
HTMLNode.m


Using the HTMLNode class should be simple enough: just import the HTMLNode.h file and then use the example below to get started. Note that this parser expects clean and valid HTML/XHTML, but most sites have some issue or mistake in their markup. This may cause you a few headaches; it did for me. Still, the parser should get most if not all of the tags, so in that case use the search function "-(HTMLNode*) search:(HTMLNode*) root: (NSString*) term" to find a containing div tag, and then use getChildN to traverse the rest.

// Setup and build html node tree in root...
NSString *url = @"http://www.google.com";
HTMLNode *root = [[HTMLNode alloc] init];
[root buildFromURL: url: root];

// Get the head tag which should be root child 0...
HTMLNode *headnode = [root getChildN:0];

// The tag of the head node should be "head"...
NSLog(@"%@", [headnode getTag]);
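
The search-then-traverse approach mentioned earlier can be pictured as a depth-first walk over the node tree. The sketch below is in plain C and is only an illustration: `struct node` and `find_node` are hypothetical stand-ins, not the HTMLNode class or its search method, and matching by substring of the tag text is an assumption about how such a search might behave.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a parsed node: raw tag text plus children. */
struct node {
    const char *tag;         /* e.g. "div id=\"content\"" */
    struct node **children;  /* array of child pointers, may be NULL */
    size_t child_count;
};

/* Depth-first, pre-order search.
   Returns the first node whose tag text contains `term`, or NULL. */
struct node *find_node(struct node *root, const char *term)
{
    if (root == NULL || term == NULL)
        return NULL;

    /* Check the current node before descending into its children. */
    if (root->tag != NULL && strstr(root->tag, term) != NULL)
        return root;

    for (size_t i = 0; i < root->child_count; i++) {
        struct node *hit = find_node(root->children[i], term);
        if (hit != NULL)
            return hit;
    }
    return NULL;
}
```

Once a containing node has been found this way, its children can be walked by index from that point, which sidesteps any odd nesting higher up in a badly formed page.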


As usual the code is free to use, but please give me some credit if it is used in a large project, or at least leave a comment about what it was used in.

2 comments:

ApproachesZero said...

Saw this post in passing, and was just curious, would it be possible to simply parse the html as xml, and then work with it accordingly?

Efeion said...

Yep, and that is actually what it is doing for the most part. In the example of an img tag, which does not have a closing tag, it sees the / at the end and treats it as one. Of course, if the img tag does not have a /, then the tree hierarchy will be strange (which is why the search is handy). Also, since it is really only focused on making a tree out of the HTML data, it just dumps the tag data, attributes and all, which must be dealt with later if needed. I have updated this post with this important point, which I completely forgot to mention.

Oh and shortly I will be posting the finished application that I am using this in, which you may find handy, so be on the lookout.