For my current project I needed a way to fetch remote HTML and then parse it into a more accessible data form. So I took my Java XML Parser work, ported it over to Objective-C, and extended it to work with HTML, which tends to be far more messy and broken... grr. To combat this, unlike a full HTML parser, this one converts the page to a pseudo-XML form, where all character data between < and > (or />) is appended to the tag string. The downside is that you need to parse out any needed tag attributes separately, but that is a price I am willing to pay in this case.
Check out the files below for the code...
HTMLNode.h
HTMLNode.m
Using the HTMLNode class should be simple enough: just import the HTMLNode.h file and then use the example below to get started. Note that this parser expects clean, valid HTML/XHTML, but most sites have some issue or mistake in their markup. This may cause you a few headaches; it did for me. Still, the parser should get most if not all of the tags, so in this case use the search function "-(HTMLNode*) search:(HTMLNode*) root: (NSString*) term" to find a containing div tag and then use getChildN to traverse the rest.
// Set up and build the HTML node tree in root...
NSString *url = @"http://www.google.com";
HTMLNode *root = [[HTMLNode alloc] init];
[root buildFromURL: url: root];

// Get the head tag, which should be child 0 of root...
HTMLNode *headnode = [root getChildN:0];

// The tag of the head node should be "head"...
// (Pass the tag as an argument, never as the format string itself.)
NSLog(@"%@", [headnode getTag]);
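Since attributes stay inside the tag string, here is a rough sketch of how you might pull a single attribute out of the string that getTag returns. AttributeValue is a hypothetical helper name, not something in the HTMLNode files. It assumes the tag string looks something like a href="..." class='...' with quoted values, and that NSRegularExpression is available in your SDK.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of HTMLNode): returns the value of the
// attribute `name` found in a raw tag string, or nil if it is not present.
// Assumes attribute values are wrapped in double or single quotes.
static NSString *AttributeValue(NSString *tag, NSString *name) {
    // Build a pattern matching name="value" or name='value'. The name must
    // start the string or follow whitespace, so "href" will not match
    // inside "data-href".
    NSString *pattern = [NSString stringWithFormat:
        @"(?:^|\\s)%@\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
        [NSRegularExpression escapedPatternForString:name]];

    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];

    NSTextCheckingResult *match =
        [regex firstMatchInString:tag
                          options:0
                            range:NSMakeRange(0, [tag length])];
    if (match == nil) {
        return nil;
    }

    // Only one of the two capture groups takes part in a match:
    // group 1 for double quotes, group 2 for single quotes.
    NSRange range = [match rangeAtIndex:1];
    if (range.location == NSNotFound) {
        range = [match rangeAtIndex:2];
    }
    return [tag substringWithRange:range];
}
```

For example, AttributeValue(@"a href=\"http://example.com\"", @"href") gives back http://example.com, and asking for an attribute that is not in the string gives nil. Unquoted values are not handled, so treat this as a starting point rather than a complete attribute parser.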
As usual the code is free to use, but please give me some credit if it is used in a large project, or at least leave a comment about what it was used in.