HTML & XML Parsers
Any web browser must be able to parse HTML, and XML is becoming
more and more important. This component would provide a pair of
parsers (or a generic SGML parser that can deal with both
subsets) that output a consistent parse tree. This would be the
foundation for the DOM work (see below).
-
How should we deal with old-style HTML, or HTML that does not
conform to the DTD? Should we key off the presence of a valid
DOCTYPE and use a strict parser, falling back to something based
on the current w3-parse.el code for DOCTYPE-less documents? Or
should we always use the same heuristics to guess what the
author really meant?
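For illustration, the DOCTYPE-keyed policy could look something like
the sketch below (Python's xml.etree stands in for whatever strict
parser we end up with; `choose_parser` and the "heuristic" label are
hypothetical names, not existing code, and the fallback would really
be a w3-parse.el-style tag-soup parser):

```python
import re
import xml.etree.ElementTree as ET

def choose_parser(document):
    """Pick a parsing strategy: strict if a DOCTYPE is declared and
    the document is actually well-formed, heuristic tag-soup parsing
    otherwise."""
    if re.search(r'<!DOCTYPE\s', document, re.IGNORECASE) is None:
        return "heuristic"
    try:
        ET.fromstring(document)   # demands well-formedness
        return "strict"
    except ET.ParseError:
        # Declared a DOCTYPE but is not well-formed: fall back anyway.
        return "heuristic"
```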
-
Do we really need two separate parsers for HTML and XML?
PSGML can parse well-formed HTML or any XML document (an XML
document is by definition well-formed; if it is not, the parser
may gleefully choke it to death).
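As a rough analogy (in Python rather than Emacs Lisp), a single
well-formedness-checking parser handles both well-formed HTML and
arbitrary XML alike, and rejects tag soup outright; `try_strict_parse`
is a hypothetical name for illustration only:

```python
import xml.etree.ElementTree as ET

def try_strict_parse(document):
    """Return the root tag if DOCUMENT is well-formed, else None."""
    try:
        return ET.fromstring(document).tag
    except ET.ParseError:
        return None
```

So one strict parser could cover both vocabularies; only legacy tag
soup forces a second, lenient code path.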
-
Can PSGML be persuaded to do what we want? It seems that
using the existing API (sgml-top-element, sgml-element-next,
sgml-element-content) would be feasible. On the plus side,
this would allow the DOM to work on arbitrary SGML documents
(LinuxDoc or DocBook, anyone?).
-
Should the parsers be able to deal with streaming data? It
would be theoretically possible to parse the document as it
comes in off the network. Do we really care?
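Incremental parsing of this kind is certainly feasible. As an
illustration (again in Python, using the standard library's
event-driven HTML parser, not any existing w3 code), a parser can be
fed arbitrary chunks as they arrive off the network:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect start tags as they arrive, to show incremental parsing."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
# Feed the document in chunks, as if it were arriving off the network;
# note that a tag split across chunk boundaries is handled correctly.
for chunk in ('<html><bo', 'dy><p>Hello', '</p></body></html>'):
    parser.feed(chunk)
parser.close()
```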