SAX Input into Jena and ARP

Legacy Documentation : not up-to-date

The original ARP parser will be removed from Jena.


Normally, both ARP and Jena are used to read files either from the local machine or from the Web. A different use case, addressed here, is when the XML source is available in-memory in some way. In these cases, ARP and Jena can be used as a SAX event handler, turning SAX events into triples, or a DOM tree can be parsed into a Jena Model.

1. Overview

To read an arbitrary SAX source as triples to be added into a Jena model, it is not possible to use a Model.read() operation. Instead, you construct a SAX event handler of class SAX2Model using its create method, install it as the handler on your SAX event source, and then stream the SAX events. It is possible to have fine-grained control over the SAX events, for instance by inserting or deleting events before passing them to the SAX2Model handler.

Sample Code

This code uses the Xerces parser as a SAX event stream, and adds the triples to a Model using default options.

// Use your own SAX source.
XMLReader saxParser = new SAXParser();

// base URI, declared before its first use below
String base = "http://example.org/";
Model m = ModelFactory.createDefaultModel();

// set up SAX input
InputStream in = new FileInputStream("kb.rdf");
InputSource ins = new InputSource(in);
ins.setSystemId(base);

// create handler, linked to Model
SAX2Model handler = SAX2Model.create(base, m);

// install handler on SAX event stream
SAX2RDF.installHandlers(saxParser, handler);

try {
    try {
        saxParser.parse(ins);
    } finally {
        // MUST ensure handler is closed.
        handler.close();
    }
} catch (SAXParseException e) {
    // Fatal parsing errors end here,
    // but they will already have been reported.
}
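
The fine-grained control mentioned in the overview (inserting or deleting events before they reach the handler) can be sketched with the standard org.xml.sax.helpers.XMLFilterImpl class. This is only an illustration using JDK classes: the class names SkipNamespaceFilter and FilterDemo and the namespace http://example.org/skip# are hypothetical, and a recording DefaultHandler stands in for the SAX2Model handler, which would be installed on the filter in its place.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Deletes every element in one namespace, with its subtree, from the event stream. */
class SkipNamespaceFilter extends XMLFilterImpl {
    private final String skippedNamespace;
    private int skipDepth = 0; // greater than 0 while inside a skipped subtree

    SkipNamespaceFilter(XMLReader parent, String skippedNamespace) {
        super(parent);
        this.skippedNamespace = skippedNamespace;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || skippedNamespace.equals(uri)) {
            skipDepth++;
            return; // event deleted
        }
        super.startElement(uri, local, qName, atts); // event passed downstream
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }
}

class FilterDemo {
    /** Returns the local names of the elements that reach the downstream handler. */
    static List<String> elementsSeen(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            SkipNamespaceFilter filter =
                    new SkipNamespaceFilter(reader, "http://example.org/skip#");

            List<String> seen = new ArrayList<>();
            // Stand-in for the SAX2Model handler.
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    seen.add(local);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return seen;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("filtering failed", e);
        }
    }
}
```

Because XMLFilterImpl itself implements XMLReader, the filter sits between the real parser and the handler, and the handler never sees the deleted events.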

Initializing SAX event source

If your SAX event source implements the XMLReader interface, then the installHandlers static method can be used as shown in the sample. Otherwise, you have to do it yourself. The installHandlers code is like this:

static public void installHandlers(XMLReader rdr, XMLHandler sax2rdf)
throws SAXException
{
    rdr.setEntityResolver(sax2rdf);
    rdr.setDTDHandler(sax2rdf);
    rdr.setContentHandler(sax2rdf);
    rdr.setErrorHandler(sax2rdf);
    rdr.setFeature("http://xml.org/sax/features/namespaces", true);
    rdr.setFeature(
            "http://xml.org/sax/features/namespace-prefixes",
            true);
    rdr.setProperty(
            "http://xml.org/sax/properties/lexical-handler",
            sax2rdf);
}

For some other SAX source, the exact code will differ, but the required operations are as above.
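
As one illustration of a source that is not itself an XMLReader, the JAXP javax.xml.parsers.SAXParser class exposes its underlying reader through getXMLReader(), after which the same operations apply. The sketch below uses only JDK classes; JaxpSource is a hypothetical name, and org.xml.sax.ext.DefaultHandler2 stands in for the SAX2RDF or SAX2Model handler because it implements the same SAX interfaces.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class JaxpSource {
    static final String NAMESPACES =
            "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
            "http://xml.org/sax/features/namespace-prefixes";
    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    /** Obtains an XMLReader from JAXP and applies the operations listed above. */
    static XMLReader configuredReader(DefaultHandler2 handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            XMLReader rdr = parser.getXMLReader();

            // The handler plays all four core SAX roles.
            rdr.setEntityResolver(handler);
            rdr.setDTDHandler(handler);
            rdr.setContentHandler(handler);
            rdr.setErrorHandler(handler);

            // Namespace processing on, and xmlns attributes reported as well.
            rdr.setFeature(NAMESPACES, true);
            rdr.setFeature(NAMESPACE_PREFIXES, true);

            // Lexical events (comments, CDATA boundaries, DTD start and end).
            rdr.setProperty(LEXICAL_HANDLER, handler);
            return rdr;
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("SAX source could not be configured", e);
        }
    }
}
```

A source with no XMLReader at all, such as hand-written code that generates events, has nothing to configure; it only needs to call the corresponding handler methods directly and report namespace information in the same way.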

Error Handler

The SAX2Model handler supports the setErrorHandler method, from the Jena RDFReader interface. This is used in the same way as that method to control error reporting.

A specific fatal error, new in Jena 2.3, is ERR_INTERRUPTED, which indicates that the current Thread received an interrupt. This allows long jobs to be aborted on user request.

Options

The SAX2Model handler supports the setProperty method, from the Jena RDFReader interface. This is used in nearly the same way to give fine-grained control over ARP's behaviour, particularly over error reporting; see the I/O howto. Setting SAX or Xerces properties cannot be done using this method.

XML Lang and Namespaces

If you are only treating some document subset as RDF/XML then it is necessary to ensure that ARP knows the correct value for xml:lang and desirable that it knows the correct mappings of namespace prefixes.

There is a second version of the create method, which allows specification of the xml:lang value from the outer context. If this is inappropriate it is possible, but hard work, to synthesize an appropriate SAX event.

For the namespace prefixes, it is possible to call the handler's startPrefixMapping method once for each namespace, before passing the other SAX events. Failure to do this is permitted, but, for instance, a Jena Model will then not know the (advisory) namespace prefix bindings. These calls should be paired with endPrefixMapping events, but nothing untoward is likely if such code is omitted.
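
A minimal sketch of declaring the outer namespace context follows. It uses only the standard org.xml.sax.ContentHandler interface; PrefixContext and PrefixContextDemo are hypothetical helper names, the ex prefix and its URI are made up for illustration, and the recording handler stands in for the SAX2Model or SAX2RDF handler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Prefix bindings in force around an embedded RDF/XML fragment. */
class PrefixContext {
    private final Map<String, String> prefixes = new LinkedHashMap<>();

    PrefixContext declare(String prefix, String uri) {
        prefixes.put(prefix, uri);
        return this;
    }

    /** Call before forwarding the first event of the fragment. */
    void open(ContentHandler handler) throws SAXException {
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            handler.startPrefixMapping(e.getKey(), e.getValue());
        }
    }

    /** Call after forwarding the last event of the fragment. */
    void close(ContentHandler handler) throws SAXException {
        for (String prefix : prefixes.keySet()) {
            handler.endPrefixMapping(prefix);
        }
    }
}

class PrefixContextDemo {
    static List<String> recordedEvents() {
        List<String> events = new ArrayList<>();
        // Stand-in for the real handler: records the prefix events it receives.
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                events.add("start " + prefix + "=" + uri);
            }

            @Override
            public void endPrefixMapping(String prefix) {
                events.add("end " + prefix);
            }
        };
        PrefixContext context = new PrefixContext()
                .declare("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
                .declare("ex", "http://example.org/ns#");
        try {
            context.open(recorder);
            // ... the SAX events of the fragment would be forwarded here ...
            context.close(recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }
}
```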

Using your own triple handler

As with ARP, it is possible to use this functionality without using other Jena features, in particular without using a Jena Model. Instead of using the class SAX2Model, you use its superclass SAX2RDF. The create method on this class does not provide any means of specifying what to do with the triples. Instead, the class implements the ARPConfig interface, which permits the setting of handlers and parser options, as described in the documentation for using ARP without Jena.

Thus you need to:

  1. Create a SAX2RDF using SAX2RDF.create()
  2. Attach your StatementHandler and SAX ErrorHandler, and optionally your NamespaceHandler and ExtendedHandler, to the SAX2RDF instance.
  3. Install the SAX2RDF instance as the SAX handler on your SAX source.
  4. Follow the remainder of the code sample above.

Using a DOM as Input

None of the approaches listed here work with Java 1.4.1_04. We suggest using Java 1.4.2_04 or greater for this functionality. This issue has no impact on any other Jena functionality.

Using a DOM as Input to Jena

The DOM2Model subclass of SAX2Model allows the parsing of a DOM using ARP. The procedure to follow is:

  • Construct a DOM2Model, using a factory method such as createD2M, specifying the xml:base of the document to be loaded, the Model to load into, and optionally the xml:lang value (particularly useful if using a DOM Node from within a Document).
  • Set any properties, error handlers etc. on the DOM2Model object.
  • The DOM is parsed simply by calling the load(Node) method.
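
When the RDF/XML is a Node inside a larger Document, the node and the xml:lang value inherited from its ancestors have to be found before the steps above can be followed. The sketch below shows one way to do that with the standard DOM API only. EmbeddedRdf and its methods are hypothetical helpers, not part of Jena, and the calls to the DOM2Model factory and to load(Node) are deliberately left out.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EmbeddedRdf {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Parses XML text into a namespace-aware DOM. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace lookups below
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("DOM could not be built", e);
        }
    }

    /** Returns the first rdf:RDF element in document order, or null if there is none. */
    static Element firstRdfElement(Document doc) {
        return (Element) doc.getElementsByTagNameNS(RDF_NS, "RDF").item(0);
    }

    /** Returns the xml:lang value of the nearest ancestor, or "" if none declares one. */
    static String inheritedLang(Node node) {
        for (Node n = node.getParentNode(); n instanceof Element; n = n.getParentNode()) {
            Element ancestor = (Element) n;
            if (ancestor.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                return ancestor.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
            }
        }
        return "";
    }
}
```

The element returned by firstRdfElement is the Node to load, and the result of inheritedLang is a candidate for the optional xml:lang argument of the factory method.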

Using a DOM as Input to ARP

DOM2Model is a subclass of SAX2RDF, and handlers etc. can be set on the DOM2Model as for SAX2RDF. Using a null model as the argument to the factory indicates this usage.