Package gnu.xml.pipeline
This package exposes a kind of XML processing pipeline, based on sending
SAX events, which can be used as components of application architectures.
EventConsumer | Collects the event consumption apparatus of a SAX pipeline stage.
CallFilter | Input is sent as an XML request to a given URI, and the output of this filter is the parsed response to that request.
DomConsumer | This consumer builds a DOM Document from its input, acting either as a pipeline terminus or as an intermediate buffer.
DomConsumer.Handler | Class used to intercept various parsing events and use them to populate a DOM document.
EventFilter | A customizable event consumer, used to assemble various kinds of filters using SAX handlers and an optional second consumer.
LinkFilter | Pipeline filter to remember XHTML links found in a document, so they can later be crawled.
NSFilter | This filter ensures that element and attribute names are properly prefixed, and that such prefixes are declared.
PipelineFactory | This provides static factory methods for creating simple event pipelines.
TeeConsumer | Fans its events out to two other consumers; a "tee" filter stage in an event pipeline.
TextConsumer | Terminates a pipeline, consuming events to print them as well-formed XML (or XHTML) text.
ValidationConsumer | This class checks SAX2 events to report validity errors; it works as both a filter and a terminus on an event pipeline.
WellFormednessFilter | This filter reports fatal exceptions in the case of event streams that are not well formed.
XIncludeFilter | Filter to process an XPointer-free subset of XInclude, supporting its use as a kind of replacement for parsed general entities.
XsltFilter | Packages an XSLT transform as a pipeline component.
Pipelines are used to convey streams of processing events from a producer
to one or more consumers, and to let each consumer control the data seen by
later consumers.
There is a
PipelineFactory class which
accepts a syntax describing how to construct some simple pipelines. Strings
describing such pipelines can be used in command line tools (see the
DoParse class)
and in other places where it is
useful to let processing be easily reconfigured. Pipelines can of course
also be constructed programmatically, providing access to options that the
factory does not expose.
Web applications are supported by making it easy for servlets (or
non-Java web application components) to be part of a pipeline. They can
originate XML (or XHTML) data through an
InputSource or in
response to XML messages sent from clients using
CallFilter
pipeline stages. Such facilities are available using the simple syntax
for pipeline construction.
Programming Models
Pipelines should be simple to understand.
- XML content, typically entire documents,
is pushed through consumers by producers.
- Pipelines are basically about consuming SAX2 callback events,
where the events encapsulate XML infoset-level data.
- Pipelines are constructed by taking one or more consumer
stages and combining them to produce a composite consumer.
- A pipeline is presumed to have pending tasks and state from
the beginning of its ContentHandler.startDocument() callback until
it returns from its ContentHandler.endDocument() callback.
- Pipelines may have multiple output stages ("fan-out")
or multiple input stages ("fan-in") when appropriate.
- Pipelines may be long-lived, but need not be.
There is flexibility about event production.
- SAX2 XMLReader objects are producers, which
provide a high level "pull" model: documents (text or DOM) are parsed,
and the parser pushes individual events through the pipeline.
- Events can be pushed directly to event consumer components
by application modules, if they invoke SAX2 callbacks directly.
That is, application modules use the XML Infoset as exposed
through SAX2 event callbacks.
Multiple producer threads may concurrently access a pipeline,
if they coordinate appropriately.
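The direct-call production model above can be sketched with the JDK's stock SAX classes alone; here a DefaultHandler subclass stands in for a pipeline stage, and an application module plays producer by invoking the SAX2 callbacks itself. The class names are invented for illustration and are not part of this package.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// A trivial consumer stage: records the elements it sees.
class RecordingHandler extends DefaultHandler {
    final StringBuilder log = new StringBuilder();
    @Override
    public void startElement(String uri, String local, String qName,
            org.xml.sax.Attributes atts) {
        log.append('<').append(qName).append('>');
    }
    @Override
    public void endElement(String uri, String local, String qName) {
        log.append("</").append(qName).append('>');
    }
}

public class DirectProducer {
    // An application module acting as producer: it pushes SAX2
    // callbacks directly, with no XML text or parser involved.
    public static String produce() throws SAXException {
        RecordingHandler stage = new RecordingHandler();
        AttributesImpl noAtts = new AttributesImpl();
        stage.startDocument();
        stage.startElement("", "doc", "doc", noAtts);
        stage.startElement("", "item", "item", noAtts);
        stage.endElement("", "item", "item");
        stage.endElement("", "doc", "doc");
        stage.endDocument();
        return stage.log.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(produce());
    }
}
```

The producer is responsible for matching its own start/end calls; a WellFormednessFilter (described later) can check that at runtime.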
Pipeline processing is not the only framework applications
will use.
Producers: XMLReader or Custom
Many producers will be SAX2 XMLReader objects, and
will read (pull) data which is then written (pushed) as events.
Typically these will parse XML text (acquired from
org.xml.sax.helpers.XMLReaderFactory) or a DOM tree
(using a DomParser).
These may be bound to an event consumer using a convenience routine,
EventFilter.bind().
Once bound, such a producer may be given additional documents to
send through its pipeline.
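As a minimal sketch of the pull model, the JDK's stock SAX parser can serve as the producer; binding here is done with the plain XMLReader.setContentHandler call rather than this package's EventFilter.bind convenience routine, and the class names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class PullProducer {
    // Stands in for any downstream consumer stage: counts elements
    // pushed at it by the parser.
    static class Counter extends DefaultHandler {
        int elements;
        @Override
        public void startElement(String uri, String local, String qName,
                org.xml.sax.Attributes atts) {
            elements++;
        }
    }

    public static int countElements(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Counter consumer = new Counter();
        producer.setContentHandler(consumer);  // bind producer to consumer
        // The same bound producer can be handed additional documents:
        producer.parse(new InputSource(new StringReader(xml)));
        return consumer.elements;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><b/></a>"));
    }
}
```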
In other cases, you will write producers yourself. For example, some
data structures might know how to write themselves out using one or
more XML models, expressed as sequences of SAX2 event callbacks.
An application module might
itself be a producer, issuing startDocument and endDocument events
and then asking those data structures to write themselves out to a
given EventConsumer, or walking data structures (such as JDBC query
results) and applying its own conversion rules. WAP format XML
(WBXML) can be directly converted to producer output.
SAX2 introduced an "XMLFilter" interface, which is a kind of XMLReader.
It is most useful in conjunction with its XMLFilterImpl helper class;
see the EventFilter javadoc
for information contrasting that XMLFilterImpl approach with the
relevant parts of this pipeline framework. Briefly, such XMLFilterImpl
subclasses can be either producers or consumers, and are more limited in
configuration flexibility. In this framework, the focus of filters is
on the EventConsumer side; see the section on
pipe fitting below.
Consume to Standard or Custom Data Representations
Many consumers will be used to create standard representations of XML
data. The TextConsumer takes its events
and writes them as text for a single XML document,
using an internal XMLWriter.
The DomConsumer takes its events and uses
them to create and populate a DOM Document.
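DomConsumer itself belongs to this package; as a self-contained stand-in, the JDK's javax.xml.transform machinery can terminate a SAX event stream in a DOM Document the same way. The class and method names below are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class DomTerminus {
    // Terminates a SAX pipeline by populating a DOM Document,
    // much as DomConsumer does in this package.
    public static Document toDom(String xml) throws Exception {
        SAXTransformerFactory stf =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler terminus = stf.newTransformerHandler();
        DOMResult result = new DOMResult();
        terminus.setResult(result);

        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        producer.setContentHandler(terminus);
        producer.parse(new InputSource(new StringReader(xml)));
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom("<root><child/></root>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```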
In other cases, you will write consumers yourself. For example,
you might use a particular unmarshaling filter to produce objects
that fit your application's requirements, instead of using DOM.
Such consumers work at the level of XML data models, rather than with
specific representations such as XML text or a DOM tree. You could
convert your output directly to WAP format data (WBXML).
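A hypothetical unmarshaling consumer might look like the following sketch, which builds plain Java objects (here, just a list of strings) directly from the event stream instead of a DOM tree; the element names and class names are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Unmarshaler extends DefaultHandler {
    // Application-level data built directly from the event stream.
    final List<String> names = new ArrayList<>();
    private StringBuilder text;

    @Override
    public void startElement(String uri, String local, String qName,
            org.xml.sax.Attributes atts) {
        if ("name".equals(qName))
            text = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null)
            text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("name".equals(qName)) {
            names.add(text.toString());
            text = null;
        }
    }

    public static List<String> unmarshal(String xml) throws Exception {
        Unmarshaler consumer = new Unmarshaler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), consumer);
        return consumer.names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            unmarshal("<people><name>Ada</name><name>Alan</name></people>"));
    }
}
```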
Pipelines are composite event consumers, with each stage having
the opportunity to transform the data before delivering it to any
subsequent stages.
The PipelineFactory class
provides access to much of this functionality through a simple syntax.
See the table in that class's javadoc describing a number of standard
components. Direct API calls are still needed for many of the most
interesting pipeline configurations, including ones leveraging actual
or logical concurrency.
Four basic types of pipe fitting are directly supported. These may
be used to construct complex pipeline networks.
- TeeConsumer objects split event
flow so it goes to two different consumers, one before the other.
This is a basic form of event fan-out; you can use this class to
copy events to any number of output pipelines.
- Clients can call remote components through HTTP or HTTPS using
the CallFilter component, and Servlets
can implement such components by extending the
XmlServlet component. Java is not
required on either end, and transport protocols other than HTTP may
also be used.
- EventFilter objects selectively
provide handling for callbacks, and can pass unhandled ones to a
subsequent stage. They are often subclassed, since much of the
basic filtering machinery is already in place in the base class.
- Applications can merge two event flows by just using the same
consumer in each one. If multiple threads are in use, synchronization
needs to be addressed by the appropriate application level policy.
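The "tee" fitting above can be sketched as a ContentHandler that forwards each callback to two delegates, first one branch and then the other, much as TeeConsumer does. Only two callbacks are shown; the rest work the same way. This is a self-contained stand-in, not the package's own class.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class TeeHandler extends DefaultHandler {
    private final ContentHandler first, rest;

    public TeeHandler(ContentHandler first, ContentHandler rest) {
        this.first = first;
        this.rest = rest;
    }

    // Each event is delivered to the "first" branch, then "rest".
    @Override
    public void startElement(String uri, String local, String qName,
            Attributes atts) throws SAXException {
        first.startElement(uri, local, qName, atts);
        rest.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        first.endElement(uri, local, qName);
        rest.endElement(uri, local, qName);
    }

    // Demonstration: two independent counters see the same event flow.
    public static int[] fanOutCounts() throws SAXException {
        final int[] counts = new int[2];
        class Counter extends DefaultHandler {
            final int slot;
            Counter(int slot) { this.slot = slot; }
            @Override
            public void startElement(String u, String l, String q,
                    Attributes a) { counts[slot]++; }
        }
        TeeHandler tee = new TeeHandler(new Counter(0), new Counter(1));
        tee.startElement("", "doc", "doc", new AttributesImpl());
        tee.endElement("", "doc", "doc");
        return counts;
    }

    public static void main(String[] args) throws SAXException {
        int[] c = fanOutCounts();
        System.out.println(c[0] + " " + c[1]);
    }
}
```

Chaining such tees is how a single event flow can be copied to any number of output pipelines.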
Note that filters can be as complex as
XSLT transforms (where an XSLT implementation is
available) applied to input data, or as simple as removing simple syntax data
such as ignorable whitespace, comments, and CDATA delimiters.
Some simple "built-in" filters are part of this package.
Coding Conventions: Filter and Terminus Stages
If you follow these coding conventions, your classes may be used
directly (give the full class name) in pipeline descriptions as understood
by the PipelineFactory. There are four constructors the factory may
try to use; in order of decreasing numbers of parameters, these are:
- Filters that need a single String setup parameter should have
a public constructor with two parameters: that string, then the
EventConsumer holding the "next" consumer to get events.
- Filters that don't need setup parameters should have a public
constructor that accepts a single EventConsumer holding the "next"
consumer to get events when they are done.
- Terminus stages may have a public constructor taking a single
parameter: the string value of that parameter.
- Terminus stages may have a public no-parameters constructor.
Of course, classes may support more than one such usage convention;
if they do, they can automatically be used in multiple modes. If you
try to use a terminus class as a filter, and that terminus has a constructor
with the appropriate number of arguments, it is automatically wrapped in
a "tee" filter.
Debugging Tip: "Tee" Joints can Snapshot Data
It can sometimes be hard to see what's happening, when something
goes wrong. Easily fixed: just snapshot the data. Then you can find
out where things start to go wrong.
If you're using pipeline descriptors so that they're easily
administered, just stick a write ( filename )
filter into the pipeline at an appropriate point.
Inside your programs, you can do the same thing directly: perhaps
by saving a Writer (perhaps a StringWriter) in a variable, using that
to create a TextConsumer, and making that the first part of a tee --
splicing that into your pipeline at a convenient location.
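The StringWriter trick can be sketched with JDK classes alone, where a TransformerHandler stands in for this package's TextConsumer as the text-producing stage; the class names are illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class Snapshot {
    // Echoes a SAX event stream into a StringWriter, so the data
    // can be inspected where things start to go wrong.
    public static String snapshot(String xml) throws Exception {
        StringWriter buffer = new StringWriter();
        SAXTransformerFactory stf =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler writer = stf.newTransformerHandler();
        writer.getTransformer()
              .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        writer.setResult(new StreamResult(buffer));

        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        producer.setContentHandler(writer);
        producer.parse(new InputSource(new StringReader(xml)));
        return buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(snapshot("<a><b>text</b></a>"));
    }
}
```

In a real pipeline the writer would be one branch of a tee, with the other branch feeding the rest of the stages.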
You can also use a DomConsumer to buffer the data, but remember
that DOM doesn't save all the information that XML provides, so that DOM
snapshots are relatively low fidelity. They also are substantially more
expensive in terms of memory than a StringWriter holding similar data.
Debugging Tip: Non-XML Producers
Producers in pipelines don't need to start from XML
data structures, such as text in XML syntax (likely coming
from some XMLReader that parses XML) or a
DOM representation (perhaps with a
DomParser).
One common type of event producer will instead make
direct calls to the SAX event handlers returned from an
EventConsumer: for example, making ContentHandler.startElement
calls and matching ContentHandler.endElement calls.
Applications making such calls can catch certain
common "syntax errors" by using a
WellFormednessFilter.
That filter will detect (and report) erroneous input data
such as mismatched document, element, or CDATA start/end calls.
Use such a filter near the head of the pipeline that your
producer feeds, at least while debugging, to help ensure that
you're providing legal XML Infoset data.
You can also arrange to validate data on the fly.
For DTD validation, you can configure a
ValidationConsumer
to work as a filter, using any DTD you choose.
Other validation schemes can be handled with other
validation filters.
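With the stock JDK parser, DTD validation on the fly can be sketched as follows; this uses a validating SAX parse with an ErrorHandler rather than this package's ValidationConsumer, and the DTD and class names are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DtdCheck {
    // Parses with validation on, collecting validity errors instead
    // of letting them abort the parse.
    public static List<String> validityErrors(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        final List<String> errors = new ArrayList<>();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // The internal DTD allows only <b> children, but a <c> appears.
        String xml = "<!DOCTYPE a [<!ELEMENT a (b*)> <!ELEMENT b EMPTY>]>"
                   + "<a><c/></a>";
        System.out.println(validityErrors(xml).isEmpty());
    }
}
```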