John Cowan has just published TagSoup 1.0 Release Candidate 1. What in the world is TagSoup, you say? It's a SAX-compliant parser written in Java that parses the stew of HTML out there. Not a simple broth, mind you, but a complex amalgamation of tags, often not well-formed, that makes up the web today.
How did we get into this mess? All those sloppy browsers, namely Internet Exploder, er, Explorer, and Netscape, displayed markup that wasn't well-formed when they should have returned error messages. That encouraged sloppy HTML coding by the masses, who are difficult to re-train. Even the tool developers got sloppy. Most HTML editors are just as bad as sloppy humans. Macromedia's Dreamweaver is generally the exception to that rule and puts out decent HTML, but even it slips at times.
What's the big deal with sloppy HTML? It DRIVES US XML GUYS CRAZY! STOP IT! Actually, it is painful and tedious to have to clean up someone else's HTML so you can reuse it, harvest it, or do other interesting things with XSL. Thankfully, XHTML is at least well-formed, but it hasn't gained quite the traction that plain, sloppy HTML has.
You can find out more about the TagSoup parser at http://mercury.ccil.org/~cowan/XML/tagsoup/. It's Open Source software, available under both GPL and AFL licenses.
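Since TagSoup is SAX-compliant, any code written against the standard SAX interfaces should be able to consume it. Here is a minimal sketch of such a consumer that counts the elements a parser reports. The class name TagCounter and its helper methods are made up for illustration. To keep the example runnable without extra jars, it uses the JDK's built-in XML parser on well-formed input; the assumption is that, with TagSoup on the classpath, you would pass in an instance of its parser class (org.ccil.cowan.tagsoup.Parser, which implements XMLReader) and feed it messy HTML instead.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of TagSoup.
public class TagCounter {

    // Counts start tags reported by whichever SAX XMLReader is passed in.
    static Map<String, Integer> countElements(XMLReader reader, String markup)
            throws Exception {
        final Map<String, Integer> counts = new TreeMap<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                counts.merge(qName, 1, Integer::sum);
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return counts;
    }

    // Stand-in reader from the JDK; it only accepts well-formed markup.
    // Assumption: with TagSoup available, this could instead return
    // new org.ccil.cowan.tagsoup.Parser().
    static XMLReader jdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><p>one</p><p>two</p></body></html>";
        // prints {body=1, html=1, p=2}
        System.out.println(countElements(jdkReader(), markup));
    }
}
```

The handler never knows which parser is behind it, which is the whole point of SAX compliance: the cleanup happens in the parser, and downstream code sees a tidy stream of events.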
John has also re-packaged Saxon 6.5.3 as TSaxon to include the TagSoup parser! Hopefully this can be upgraded to use Saxon 8.1.1 in the near future...
On an unrelated note, John has also done a very interesting presentation on RELAX NG called "RELAX NG: DTDs on Warp Drive", available at http://mercury.ccil.org/~cowan/relaxng.pdf
Good work, John!