HTML & XML Parsers
Any web browser must be able to parse HTML, and XML is becoming
more and more important. This component would provide a pair of
parsers (or a generic SGML parser that can deal with both
subsets) that output a consistent parse tree. This would be the
foundation for the DOM work (see below).
-
How should we deal with old-style HTML, or HTML that does not
conform to the DTD? Should we key off the presence of a valid
DOCTYPE and use a strict parser, falling back to something based
on the current w3-parse.el code for DOCTYPE-less documents? Or
should we always use the same heuristics to guess what the
author really meant?
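For illustration, the DOCTYPE-keyed policy could look something like
the sketch below (Python's xml.etree stands in for whatever strict
parser we end up with; `choose_parser` and the "heuristic" label are
hypothetical names, not existing code, and the fallback would really
be a w3-parse.el-style tag-soup parser):

```python
import re
import xml.etree.ElementTree as ET

def choose_parser(document):
    """Pick a parsing strategy: strict if a DOCTYPE is declared and
    the document is actually well-formed, heuristic tag-soup parsing
    otherwise."""
    if re.search(r'<!DOCTYPE\s', document, re.IGNORECASE) is None:
        return "heuristic"
    try:
        ET.fromstring(document)   # demands well-formedness
        return "strict"
    except ET.ParseError:
        # Declared a DOCTYPE but is not well-formed: fall back anyway.
        return "heuristic"
```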
-
Do we really need two separate parsers for HTML and XML?
PSGML can parse well-formed HTML or any XML document (an XML
document is by definition well-formed; if it is not, the parser
may gleefully choke it to death).
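As a rough analogy (in Python rather than Emacs Lisp), a single
well-formedness-checking parser handles both well-formed HTML and
arbitrary XML alike, and rejects tag soup outright; `try_strict_parse`
is a hypothetical name for illustration only:

```python
import xml.etree.ElementTree as ET

def try_strict_parse(document):
    """Return the root tag if DOCUMENT is well-formed, else None."""
    try:
        return ET.fromstring(document).tag
    except ET.ParseError:
        return None
```

So one strict parser could cover both vocabularies; only legacy tag
soup forces a second, lenient code path.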
-
Can PSGML be persuaded to do what we want? It seems that
using the existing API (sgml-top-element, sgml-element-next,
sgml-element-content) would be feasible. On the plus side,
this would allow the DOM to work on arbitrary SGML documents
(LinuxDoc or DocBook, anyone?).
-
Should the parsers be able to deal with streaming data? It
would be theoretically possible to parse the document as it
comes in off the network. Do we really care?
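Incremental parsing of this kind is certainly feasible. As an
illustration (again in Python, using the standard library's
event-driven HTML parser, not any existing w3 code), a parser can be
fed arbitrary chunks as they arrive off the network:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect start tags as they arrive, to show incremental parsing."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
# Feed the document in chunks, as if it were arriving off the network;
# note that a tag split across chunk boundaries is handled correctly.
for chunk in ('<html><bo', 'dy><p>Hello', '</p></body></html>'):
    parser.feed(chunk)
parser.close()
```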