Jena Full Text Search

This extension to ARQ combines SPARQL and full text search via Lucene 6.4.1 or Elasticsearch 5.2.1 (which is built on Lucene). It gives applications the ability to perform indexed full text searches within SPARQL queries.

SPARQL allows the use of regex in FILTERs, but a FILTER is a test on a value retrieved earlier in the query, so its use is not indexed. For example, if you are searching for occurrences of "printer" in the rdfs:label of a set of products:

PREFIX   ex: <http://www.example.org/resources#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?lbl
WHERE { 
    ?s a ex:Product ;
       rdfs:label ?lbl
    FILTER regex(?lbl, "printer", "i")
}

then the search will need to examine all selected rdfs:label statements and apply the regular expression to each label in turn. If there are many such statements and many such uses of regex, then it may be appropriate to consider using this extension to take advantage of the performance potential of full text indexing.

Text indexes provide additional information for accessing the RDF graph by allowing the application to have indexed access to the internal structure of string literals rather than treating such literals as opaque items. Unlike FILTER, an index can set the values of variables. Assuming appropriate configuration, the above query can use full text search via the ARQ property function extension, text:query:

PREFIX   ex: <http://www.example.org/resources#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>

SELECT ?s ?lbl
WHERE { 
    ?s a ex:Product ;
       text:query (rdfs:label 'printer') ;
       rdfs:label ?lbl
}

This query makes a text query for 'printer' on the rdfs:label property; and then looks in the RDF data and retrieves the complete label for each match.

The full text engine can be either Apache Lucene hosted with Jena on a single machine, or Elasticsearch for a large scale enterprise search application where the full text engine is potentially distributed across separate machines.

This example code illustrates creating an in-memory dataset with a Lucene index.

Architecture

In general, a text index engine (Lucene or Elasticsearch) indexes documents where each document is a collection of fields, the values of which are indexed so that searches matching contents of specified fields can return a reference to the document containing the fields with matching values.

The basic idea of the Jena text extension is to associate a triple with a document, the property of the triple with a field of the document, and the object of the triple (which must be a literal) with the value of that field. The subject of the triple then becomes another field of the document, which is returned as the result of a search match to identify what was matched. (Note that the particular triple that matched is not identified, only its subject.)

In this manner, the text index provides an inverted index that maps query string matches to subject URIs.
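The mapping from matched terms to subject URIs can be pictured with a small, self-contained sketch. It is only a toy model under stated assumptions: ToyTextIndex is a hypothetical class, it does not use Lucene or jena-text, and it tokenizes by lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: ToyTextIndex is a hypothetical class and does not use Lucene.
class ToyTextIndex {
    // token -> subject URIs whose indexed literal contains that token
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the literal object of one triple under the triple's subject.
    void addTriple(String subjectUri, String literal) {
        for (String token : literal.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(subjectUri);
            }
        }
    }

    // A single-term lookup yields subject URIs, not the triples that matched.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
    }
}
```

Indexing "Laser printer" and "Inkjet printer" under two subjects and then searching for "printer" would return both subject URIs, which SPARQL can then join against the RDF data.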

A text-indexed dataset is configured with a description of which properties are to be indexed. When triples are added, any properties matching the description cause a document to be added to the index by analyzing the literal value of the triple object and mapping to the subject URI. On the other hand, it is necessary to specifically configure the text-indexed dataset to delete index entries when the corresponding triples are dropped from the RDF store.

The text index uses the native query language of the index: Lucene query language (with restrictions) or Elasticsearch query language.

External content

It is also possible that the indexed text is content external to the RDF store with only additional triples (about the indexed text) in the RDF store. The subject URI returned as a search result may then be considered to refer via the indexed property to the external content.

There is no requirement that the text data indexed is present in the RDF data. As long as the index contains the index text documents to match the index description, then text search can be performed.

For example, if the content of a collection of documents is indexed and the URI naming the document is the result of the text search, then an RDF dataset with the document metadata can be combined with accessing the content by URI.

The maintenance of the index is external to the RDF data store.

External applications

By using Elasticsearch, other applications can share the text index with SPARQL search.

Document structure

As mentioned above, text indexing of a triple involves associating a Lucene document with the triple. How is this done?

Lucene documents are composed of Fields. Indexing and searching are performed over the contents of these Fields. For an RDF triple to be indexed in Lucene the property of the triple must be configured in the entity map of a TextIndex. This associates a Lucene analyzer with the property which will be used for indexing and search. The property becomes the searchable Lucene Field in the resulting document.

A Lucene index includes a default Field, which is specified in the configuration, that is the field to search if not otherwise named in the query. In jena-text this field is configured via the text:defaultField property which is then mapped to a specific RDF property via text:predicate (see entity map below).

There are several additional Fields that will be included in the document that is passed to the Lucene IndexWriter depending on the configuration options that are used. These additional fields are used to manage the interface between Jena and Lucene and are not generally searchable per se.

The most important of these additional Fields is the text:entityField. This configuration property defines the name of the Field that will contain the URI or blank node id of the subject of the triple being indexed. This property does not have a default and must be specified for most uses of jena-text. This Field is often given the name, uri, in examples. It is via this Field that ?s is bound in a typical use such as:

select ?s
where {
    ?s text:query "some text"
}

Other Fields that may be configured: text:uidField, text:graphField, and so on are discussed below.

Given the triple:

ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr .

The following is an abbreviated illustration of a Lucene document that Jena will create and request Lucene to index:

Document<
    <uri:http://example.org/SomeOne> 
    <graph:urn:x-arq:DefaultGraphNode> 
    <label:zorn protégé a prés> 
    <lang:fr> 
    <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00> 
    >

It may be instructive to refer back to this example when considering the various points below.
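As a rough illustration of the document layout above, the following sketch assembles the same fields into an ordered map. It assumes the field names used in this page's examples ("uri", "graph", "lang"); TripleDocumentSketch is a hypothetical helper, the uid field is left out, and omitting the lang field for untagged literals is an assumption of the sketch rather than documented jena-text behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it mirrors the illustration above and is not jena-text code.
class TripleDocumentSketch {
    // Builds the field/value pairs for one triple. The uid field is not modelled here.
    static Map<String, String> toDocument(String graph, String subject, String field,
                                          String lexicalForm, String langTag) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("uri", subject);        // entity field: what a match returns
        doc.put("graph", graph);        // graph field: where the triple lives
        doc.put(field, lexicalForm);    // searchable field named after the mapped property
        if (langTag != null && !langTag.isEmpty()) {
            doc.put("lang", langTag);   // ASSUMPTION: untagged literals get no lang field
        }
        return doc;
    }
}
```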

Query with SPARQL

The URI of the text extension property function is http://jena.apache.org/text#query, more conveniently written:

PREFIX text: <http://jena.apache.org/text#>

...   text:query ...

Syntax

The following forms are all legal:

?s text:query 'word'                              # query
?s text:query ('word' 10)                         # with limit on results
?s text:query (rdfs:label 'word')                 # query specific property if multiple
?s text:query (rdfs:label 'protégé' 'lang:fr')    # restrict search to French
(?s ?score) text:query 'word'                     # query capturing also the score
(?s ?score ?literal) text:query 'word'            # ... and original literal value

The most general form is:

 (?s ?score ?literal) text:query (property 'query string' limit 'lang:xx')

Input arguments:

Argument        Definition
property        (optional) URI (including prefix name form)
query string    Lucene query string fragment
limit           (optional) int limit on the number of results
lang:xx         (optional) language tag spec

The property URI is only necessary if multiple properties have been indexed and the property being searched over is not the default field of the index.

The query string syntax conforms to that of the underlying index, Lucene or Elasticsearch. In the case of Lucene the syntax is restricted to Terms, Term modifiers, Boolean Operators applied to Terms, and Grouping of terms. No use of Fields within the query string is supported.

The optional limit indicates the maximum hits to be returned by Lucene.

The lang:xx specification is an optional string, where xx is a BCP-47 language tag. This restricts searches to field values that were originally indexed with the tag xx. Searches may be restricted to field values with no language tag via "lang:none".

If both limit and lang:xx are present, then limit must precede lang:xx.

If only the query string is required, the surrounding ( ) may be omitted.

Output arguments:

Argument        Definition
subject URI     The subject of the indexed RDF triple.
score           (optional) The score for the match.
literal         (optional) The matched object literal.

The results include the subject URI; the score assigned by the text search engine; and the entire matched literal (if the index has been configured to store literal values). The subject URI may be a variable, e.g., ?s, or a URI. In the latter case the search is restricted to triples with the specified subject. The score and the literal must be variables.

If only the subject variable, ?s is needed then it must be written without surrounding ( ); otherwise, an error is signalled.

Query strings

There are several points that need to be considered when formulating SPARQL queries using the Lucene interface. As mentioned above, in the case of Lucene the query string syntax is restricted to Terms, Term modifiers, Boolean Operators applied to Terms, and Grouping of terms.

No explicit use of Fields within the query string is supported.

Simple queries

The simplest use of the jena-text Lucene integration is:

?s text:query "some phrase"

This will bind ?s to each entity URI that is the subject of a triple that has the default property and an object literal that matches the argument string, e.g.:

ex:AnEntity skos:prefLabel "this is some phrase to match"

This query form will indicate the subjects that have literals that match for the default property, which is determined via the configuration of the text:predicate of the text:defaultField (in the above this has been assumed to be skos:prefLabel).

For a non-default property it is necessary to specify the property as an input argument to the text:query:

?s text:query (rdfs:label "protégé")

(see below for how RDF property names are mapped to Lucene Field names).

If this use case is sufficient for your needs you can skip on to the sections on configuration.

Queries with language tags

When working with rdf:langStrings it is necessary that the text:langField has been configured. Then it is as simple as writing queries such as:

?s text:query "protégé"@fr

to return results where the given term or phrase has been indexed under French in the text:defaultField.

It is also possible to use the optional lang:xx argument, for example:

?s text:query ("protégé" 'lang:fr') .

In general, the presence of a language tag, xx, on the query string or lang:xx in the text:query adds AND lang:xx to the query sent to Lucene, so the above example becomes the following Lucene query:

"label:protégé AND lang:fr"

For non-default properties the general form is used:

?s text:query (skos:altLabel "protégé" 'lang:fr')

Note that an explicit language tag on the query string takes precedence over the lang:xx, so the following

?s text:query ("protégé"@fr 'lang:none')

will find French matches rather than matches indexed without a language tag.
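The two rules just described (a language restriction adds AND lang:xx, and a tag on the query string wins over the lang:xx argument) can be sketched as plain string handling. LangQuerySketch and its method names are hypothetical, and how lang:none is translated for the index is not modelled here.

```java
// Hypothetical sketch of the composition rules described above; not jena-text code.
class LangQuerySketch {
    // An explicit tag on the query string takes precedence over the 'lang:xx' argument.
    static String effectiveLang(String tagOnQueryString, String langArgument) {
        if (tagOnQueryString != null && !tagOnQueryString.isEmpty()) {
            return tagOnQueryString;
        }
        if (langArgument != null && langArgument.startsWith("lang:")) {
            // 'none' is passed through unchanged; its real translation is not modelled.
            return langArgument.substring("lang:".length());
        }
        return null;
    }

    // Restricting by language appends "AND lang:xx" to the field query.
    static String compose(String field, String queryString, String lang) {
        String base = field + ":" + queryString;
        return lang == null ? base : base + " AND lang:" + lang;
    }
}
```

For instance, compose("label", "printer", effectiveLang(null, "lang:fr")) yields label:printer AND lang:fr, matching the shape of the Lucene query shown above.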

Queries that retrieve literals

It is possible to retrieve the literals that Lucene finds matches for assuming that

<#TextIndex#> text:storeValues true ;

has been specified in the TextIndex configuration. So

(?s ?sc ?lit) text:query (rdfs:label "protégé")

will bind the matching literals to ?lit, e.g.,

"zorn protégé a prés"@fr

Note it is necessary to include a variable to capture the Lucene score even if this value is not otherwise needed since the literal variable is determined by position.

Queries with graphs

Assuming that the text:graphField has been configured, then, when a triple is indexed, the graph that the triple resides in is included in the document and may be used to restrict searches or to retrieve the graph that a matching triple resides in.

For example:

select ?s ?lit
where {
  graph ex:G2 { (?s ?sc ?lit) text:query "zorn" } .
}

will restrict searches to triples with the default property that reside in graph, ex:G2.

On the other hand:

select ?g ?s ?lit
where {
  graph ?g { (?s ?sc ?lit) text:query "zorn" } .
}

will iterate over the graphs in the dataset, searching each in turn for matches.

If there is suitable structure to the graphs, e.g., a known rdf:type and depending on the selectivity of the text query and number of graphs, it may be more performant to express the query as follows:

select ?g ?s ?lit
where {
  (?s ?sc ?lit) text:query "zorn" .
  graph ?g { ?s a ex:Item } .
}

Queries across multiple Fields

As mentioned earlier, the text index uses the native Lucene query language; however, there are important constraints on how the Lucene query language is used within jena-text. In particular, explicit references to Lucene Fields within the query string are not supported. So how are Lucene queries that would otherwise refer to multiple Fields expressed?

The key is understanding that each triple is a separate document and so queries across Lucene Fields need to be expressed as SPARQL queries referring to the corresponding RDF properties. Note that there are typically three Fields in a document that are used during searching:

  1. the field corresponding to the property of the indexed triple,
  2. the field for the language of the literal (if configured), and
  3. the graph that the triple is in (if configured).

Given these it should be clear from the above that the Jena Text integration constructs a Lucene query from the property, query string, lang:xx, and SPARQL graph arguments.

For example, consider the following triples:

ex:SomePrinter 
    rdfs:label     "laser printer" ;
    ex:description "includes a large capacity cartridge" .

assuming an appropriate configuration, if we try to retrieve ex:SomePrinter with the following Lucene query string:

?s text:query "label:printer AND description:\"large capacity cartridge\""

then this query cannot find the expected results. Lucene interprets the AND as a request for documents that contain both a matching label field and a matching description field. However, as the discussion of document structure above shows, jena-text creates two separate documents here, one with a label field and one with a description field. An effective SPARQL query is therefore:

?s text:query (rdfs:label "printer") .
?s text:query (ex:description "large capacity cartridge") .

which leads to ?s being bound to ex:SomePrinter.

In other words, when a query is to involve two or more properties, it is expressed at the SPARQL level, as it were, rather than in Lucene's query language.

It is worth noting that the equivalent of a Lucene OR of Fields is expressed simply via SPARQL union:

{ ?s text:query (rdfs:label "printer") . }
union
{ ?s text:query (ex:description "large capacity cartridge") . }

Suppose the matching literals are required for the above; then:

(?s ?sc1 ?lit1) text:query (rdfs:label "printer") .
(?s ?sc2 ?lit2) text:query (ex:description "large capacity cartridge") .

will be the appropriate form to retrieve the subject and the associated literals, ?lit1 and ?lit2. (In general, the score variables, ?sc1 and ?sc2, must be distinct since it is very unlikely that the scores of the two Lucene queries will ever match.)

There is no loss of expressiveness of the Lucene query language versus the jena-text integration of Lucene. Any cross-field ANDs are replaced by concurrent SPARQL calls to text:query as illustrated above and uses of Lucene OR can be converted to SPARQL unions. Uses of Lucene NOT are converted to appropriate SPARQL filters.
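The reason a cross-field AND has to move to the SPARQL level can be seen in a small model where every triple is its own document. The sketch below is a simplification under stated assumptions: CrossFieldSketch is hypothetical, matching is a plain case-insensitive substring test rather than Lucene analysis, and the subject names are abbreviated.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: one document per triple, so no document holds two property fields.
class CrossFieldSketch {
    static final class Doc {
        final String subject;
        final String field;
        final String text;
        Doc(String subject, String field, String text) {
            this.subject = subject;
            this.field = field;
            this.text = text;
        }
    }

    // Subjects of documents whose given field contains the phrase (substring match).
    static Set<String> search(List<Doc> docs, String field, String phrase) {
        Set<String> subjects = new TreeSet<>();
        String needle = phrase.toLowerCase(Locale.ROOT);
        for (Doc d : docs) {
            if (d.field.equals(field) && d.text.toLowerCase(Locale.ROOT).contains(needle)) {
                subjects.add(d.subject);
            }
        }
        return subjects;
    }

    // Two text:query patterns sharing ?s behave like an intersection of subjects.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // A SPARQL union of two text:query patterns behaves like a set union.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }
}
```

With a label document and a description document for ex:SomePrinter, intersecting the two per-field result sets recovers ex:SomePrinter even though no single document matches both fields.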

Queries with Boolean Operators and Term Modifiers

On the other hand the various features of the Lucene query language are all available to be used for searches within a Field. For example, Boolean Operators on Terms:

?s text:query (ex:description "(large AND cartridge)")

and

(?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")

or fuzzy searches:

?s text:query (ex:description "include~")

and so on will work as expected.

Always surround the query string with ( ) if more than a single term or phrase is involved.

Good practice

From the above it should be clear that best practice, except in the simplest cases, is to use explicit text:query forms such as:

(?s ?sc ?lit) text:query (ex:someProperty "a single Field query")

possibly with limit and lang:xx arguments.

Further, the query engine does not have information about the selectivity of the text index and so effective query plans cannot be determined programmatically. It is helpful to be aware of the following two general query patterns.

Query pattern 1 – Find in the text index and refine results

Access to the text index is first in the query and used to find a number of items of interest; further information is obtained about these items from the RDF data.

SELECT ?s
{ ?s text:query (rdfs:label 'word' 10) ; 
     rdfs:label ?label ;
     rdf:type   ?type 
}

The text:query limit argument is useful when working with large indexes to limit results to the higher scoring results – results are returned in the order of scoring by the text search engine.
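The effect of the limit argument can be pictured as "sort by score, keep the first N". The sketch below is only an illustration: TopHitsSketch and Hit are hypothetical names, and in practice the ordering and cut-off are done by the text search engine, not by application code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of "highest scoring results first, at most 'limit' of them".
class TopHitsSketch {
    static final class Hit {
        final String subject;
        final double score;
        Hit(String subject, double score) {
            this.subject = subject;
            this.score = score;
        }
    }

    // Orders hits by descending score and keeps at most 'limit' of them.
    static List<Hit> top(List<Hit> hits, int limit) {
        return hits.stream()
                .sorted(Comparator.comparingDouble((Hit h) -> h.score).reversed())
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```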

Query pattern 2 – Filter results via the text index

By finding items of interest first in the RDF data, the text search can be used to restrict the items found still further.

SELECT ?s
{ ?s rdf:type     :book ;
     dc:creator  "John" .
  ?s text:query   (dc:title 'word') ; 
}

Configuration

The usual way to describe a text index is with a Jena assembler description. Configurations can also be built with code. The assembler describes a 'text dataset' which has an underlying RDF dataset and a text index. The text index describes the text index technology (Lucene or Elasticsearch) and the details needed for each.

A text index has an "entity map" which defines the properties to index, the name of the Lucene/Elasticsearch field for each, and the field used for storing the URI itself.

For simple RDF use, there will be one field, mapping a property to a text index field. More complex setups, with multiple properties per entity (URI) are possible.

Once configured, any data added to the text dataset is automatically indexed as well.

Text Dataset Assembler

The following is an example of a TDB dataset with a text index.

@prefix :        <http://localhost/jena_example/#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

## Example of a TDB dataset and text index
## Initialize TDB
[] ja:loadClass "org.apache.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
# Elasticsearch index
text:TextIndexES    rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    text:index     <#indexLucene> ;
    .

# A TDB dataset used for RDF storage
<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true ; # Optional
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/some/path/lucene-index> ;
    text:entityMap <#entMap> ;
    text:storeValues true ; 
    text:analyzer [ a text:StandardAnalyzer ] ;
    text:queryAnalyzer [ a text:KeywordAnalyzer ] ;
    text:queryParser text:AnalyzingQueryParser ;
    text:multilingualSupport true ;
 .

The text:TextDataset has two properties:

  • a text:dataset, e.g., a tdb:DatasetTDB, to contain the RDF triples; and

  • an index configured to use either text:TextIndexLucene or text:TextIndexES.

The <#indexLucene> instance of text:TextIndexLucene, above, has two required properties:

  • the text:directory file URI which specifies the directory that will contain the Lucene index files – if this has the value "mem" then the index resides in memory;

  • the text:entityMap, <#entMap>, which defines what properties are to be indexed and other features of the index;

and several optional properties:

  • text:storeValues controls the storing of literal values. It indicates whether values are stored or not – values must be stored for the ?literal return value to be available in text:query in SPARQL.

  • text:analyzer specifies the default analyzer configuration to be used during indexing and querying. The default analyzer defaults to Lucene's StandardAnalyzer.

  • text:queryAnalyzer specifies an optional analyzer that will be used to analyze the query string. If not set, the analyzer used to index a given field is used.

  • text:queryParser is optional and specifies an alternative query parser.

  • text:multilingualSupport enables Multilingual Support.

If using Elasticsearch then an index would be configured as follows:

<#indexES> a text:TextIndexES ;
      # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
    text:serverList "127.0.0.1:9300" ; 
      # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
    text:clusterName "elasticsearch" ; 
      # The number of shards for the index. Defaults to 1
    text:shards "1" ;
      # The number of replicas for the index. Defaults to 1
    text:replicas "1" ;         
      # Name of the Index. defaults to jena-text
    text:indexName "jena-text" ;
    text:entityMap <#entMap> ;
    .

and text:index <#indexES> ; would be used in the configuration of :text_dataset.

To use a text index assembler configuration in Java code, it is necessary to identify the dataset URI to be assembled, such as in:

Dataset ds = DatasetFactory.assemble(
    "text-config.ttl", 
    "http://localhost/jena_example/#text_dataset") ;

since the assembler contains two dataset definitions, one for the text dataset and one for the base data. Therefore, the application needs to identify the text dataset by its URI, http://localhost/jena_example/#text_dataset.

Entity Map definition

A text:EntityMap has several properties that condition what is indexed, what information is stored, and what analyzers are used.

<#entMap> a text:EntityMap ;
    text:defaultField     "label" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
         [ text:field "label" ; 
           text:predicate rdfs:label ]
         ) .

Default text field

The text:defaultField specifies the default field name that Lucene will use in a query that does not otherwise specify a field. For example,

?s text:query "\"bread and butter\""

will perform a search in the label field for the phrase "bread and butter".

Entity field

The text:entityField specifies the field name of the field that will contain the subject URI that is returned on a match. The value of the property is arbitrary so long as it is unique among the defined names.

UID Field and automatic document deletion

When the text:uidField is defined in the EntityMap then dropping a triple will result in the corresponding document, if any, being deleted from the text index. The value, "uid", is arbitrary and defines the name of a stored field in Lucene that holds a unique ID that represents the triple.

If you configure the index via Java code, you need to set this parameter to the EntityDefinition instance, e.g.

EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
docDef.setUidField("uid");

Note: If you migrate from an index without deletion support to an index with automatic deletion, you will need to rebuild the index to ensure that the uid information is stored.
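Why a stored unique ID makes deletion possible can be shown with a small model: if the ID is derived deterministically from the triple, dropping the triple recomputes the same ID and removes the matching document. The sketch below is an assumption-laden illustration. UidDeletionSketch is hypothetical, and hashing the graph, subject, predicate and object with SHA-256 is a guess made for the example, not a description of how jena-text computes the uid.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of uid-based deletion; not jena-text code.
class UidDeletionSketch {
    private final Map<String, String> docsByUid = new HashMap<>(); // uid -> indexed literal

    // ASSUMPTION: the uid is a SHA-256 hash over the quad's components.
    static String uid(String graph, String subject, String predicate, String object) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            String key = String.join("\n", graph, subject, predicate, object);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(key.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every Java platform
        }
    }

    void add(String g, String s, String p, String o) {
        docsByUid.put(uid(g, s, p, o), o);
    }

    // Dropping a triple recomputes the same uid and removes that document, if present.
    boolean delete(String g, String s, String p, String o) {
        return docsByUid.remove(uid(g, s, p, o)) != null;
    }

    int size() {
        return docsByUid.size();
    }
}
```

Without such an ID there is nothing in the index that ties a document back to one specific triple, which is consistent with the note above that an index built without the uid field has to be rebuilt.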

Language Field

The text:langField is the name of the field that will store the language attribute of the literal in the case of an rdf:langString. This Entity Map property is a key element of the Linguistic support with Lucene index.

Graph Field

Setting the text:graphField allows graph-specific indexing of the text index to limit searching to a specified graph when a SPARQL query targets a single named graph. The field value is arbitrary and serves to store the graph ID that a triple belongs to when the index is updated.

The Analyzer Map

The text:map is a list of analyzer specifications as described below.

Configuring an Analyzer

Text to be indexed is passed through a text analyzer that divides it into tokens and may perform other transformations such as eliminating stop words. If a Lucene or Elasticsearch text index is used, then by default the Lucene StandardAnalyzer is used.

In case of a TextIndexLucene the default analyzer can be replaced by another analyzer with the text:analyzer property on the text:TextIndexLucene resource in the text dataset assembler, for example with a SimpleAnalyzer:

<#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:analyzer [
            a text:SimpleAnalyzer
        ]
        .

It is possible to configure an alternative analyzer for each field indexed in a Lucene index. For example:

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; 
           text:predicate rdfs:label ;
           text:analyzer [
               a text:StandardAnalyzer ;
               text:stopWords ("a" "an" "and" "but")
           ]
         ]
         ) .

will configure the index to analyze values of the 'text' field using a StandardAnalyzer with the given list of stop words.
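What "tokenize, then drop stop words" means for indexed text can be sketched without Lucene. The following is a simplified stand-in, not Lucene's StandardAnalyzer: StopWordAnalyzerSketch is hypothetical, and it only lowercases, splits on non-alphanumeric characters, and removes the listed stop words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for an analyzer with a stop word list; not Lucene's StandardAnalyzer.
class StopWordAnalyzerSketch {
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t); // only surviving tokens would be indexed
            }
        }
        return tokens;
    }
}
```

With the stop words from the configuration above ("a", "an", "and", "but"), the input "Bread and butter, but no jam" would be reduced to the tokens bread, butter, no, jam.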

Other analyzer types that may be specified are SimpleAnalyzer and KeywordAnalyzer, neither of which has any configuration parameters. See the Lucene documentation for details of what these analyzers do. Jena also provides LowerCaseKeywordAnalyzer, which is a case-insensitive version of KeywordAnalyzer, and ConfigurableAnalyzer.

Support for the new LocalizedAnalyzer has been introduced in Jena 3.0.0 to deal with Lucene language specific analyzers. See Linguistic Support with Lucene Index for details.

Support for GenericAnalyzers has been introduced in Jena 3.4.0 to allow the use of Analyzers that do not have built-in support, e.g., BrazilianAnalyzer; that require constructor parameters not otherwise supported, e.g., a stop words FileReader or a stemExclusionSet; and finally the use of Analyzers not included in the bundled Lucene distribution, e.g., a SanskritIASTAnalyzer. See Generic and Defined Analyzer Support.

ConfigurableAnalyzer

ConfigurableAnalyzer was introduced in Jena 3.0.1. It allows more detailed configuration of text analysis parameters by independently selecting a Tokenizer and zero or more TokenFilters which are applied in order after tokenization. See the Lucene documentation for details on what each tokenizer and token filter does.

The available Tokenizer implementations are:

  • StandardTokenizer
  • KeywordTokenizer
  • WhitespaceTokenizer
  • LetterTokenizer

The available TokenFilter implementations are:

  • StandardFilter
  • LowerCaseFilter
  • ASCIIFoldingFilter

Configuration is done using Jena assembler like this:

text:analyzer [
  a text:ConfigurableAnalyzer ;
  text:tokenizer text:KeywordTokenizer ;
  text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
]

Here, text:tokenizer must be one of the four tokenizers listed above and the optional text:filters property specifies a list of token filters.
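The "one tokenizer, then filters applied in order" pipeline can be approximated in a few lines. This sketch is not the Lucene implementation: FilterChainSketch is hypothetical, the keyword tokenizer is modelled as "the whole input is one token", and accent folding is approximated with java.text.Normalizer rather than Lucene's ASCIIFoldingFilter.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical approximation of a tokenizer followed by an ordered list of token filters.
class FilterChainSketch {
    // Keyword-style tokenization: the entire input becomes a single token.
    static List<String> keywordTokenizer(String input) {
        return Collections.singletonList(input);
    }

    // Approximate ASCII folding: decompose characters, then strip combining marks.
    static String foldAccents(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Applies each filter to every token, in the order the filters are listed.
    static List<String> analyze(String input, List<UnaryOperator<String>> filters) {
        List<String> tokens = new ArrayList<>(keywordTokenizer(input));
        for (UnaryOperator<String> filter : filters) {
            tokens.replaceAll(filter);
        }
        return tokens;
    }
}
```

Passing foldAccents followed by a lowercasing function mirrors the filter order in the assembler fragment above: the whole literal stays one token, with accents removed and case normalized.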

Analyzer for Query

New in Jena 2.13.0.

An analyzer can be specified for the query string itself; it is used to find terms in the query text. If not set, then the analyzer used for the document will be used. The query analyzer is specified on the TextIndexLucene resource:

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    text:entityMap <#entMap> ;
    text:queryAnalyzer [
        a text:KeywordAnalyzer
    ]
    .

Alternative Query Parsers

New in Jena 3.1.0.

It is possible to select a query parser other than the default QueryParser.

The available QueryParser implementations are:

  • AnalyzingQueryParser: Performs analysis for wildcard queries. This is useful in combination with accent-insensitive wildcard queries.

  • ComplexPhraseQueryParser: Permits complex phrase query syntax, e.g., "(john jon jonathan~) peters*". This is useful for performing wildcard or fuzzy queries on individual terms in a phrase.

The query parser is specified on the TextIndexLucene resource:

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    text:entityMap <#entMap> ;
    text:queryParser text:AnalyzingQueryParser .

Elasticsearch currently doesn't support Analyzers beyond Standard Analyzer.

Configuration by Code

A text dataset can also be constructed in code as might be done for a purely in-memory setup:

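    // Example of building a text dataset with code.
    // Example is in-memory.
    import org.apache.jena.query.Dataset ;
    import org.apache.jena.query.DatasetFactory ;
    import org.apache.jena.query.text.EntityDefinition ;
    import org.apache.jena.query.text.TextDatasetFactory ;
    import org.apache.jena.vocabulary.RDFS ;
    import org.apache.lucene.store.Directory ;
    import org.apache.lucene.store.RAMDirectory ;

    // Base dataset
    Dataset ds1 = DatasetFactory.createMem() ;

    // Entity field "uri"; default field "text" mapped to rdfs:label
    EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label) ;

    // Lucene, in memory.
    Directory dir =  new RAMDirectory();

    // Join together into a dataset
    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, entDef) ;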

Graph-specific Indexing

jena-text supports storing information about the source graph into the text index. This allows for more efficient text queries when the query targets only a single named graph. Without graph-specific indexing, text queries do not distinguish named graphs and will always return results from all graphs.
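The difference a graph field makes can be sketched as an extra condition on each candidate document. The code below is a toy model under stated assumptions: GraphFieldSketch is hypothetical, matching is a plain substring test, and passing null for the graph stands in for an index that has no graph field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of graph-restricted search; not jena-text code.
class GraphFieldSketch {
    static final class Doc {
        final String uri;
        final String graph;
        final String text;
        Doc(String uri, String graph, String text) {
            this.uri = uri;
            this.graph = graph;
            this.text = text;
        }
    }

    // graph == null models an index without a graph field: all graphs are searched.
    static List<String> search(List<Doc> docs, String term, String graph) {
        List<String> subjects = new ArrayList<>();
        for (Doc d : docs) {
            boolean graphMatches = graph == null || graph.equals(d.graph);
            if (graphMatches && d.text.contains(term)) {
                subjects.add(d.uri);
            }
        }
        return subjects;
    }
}
```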

Support for graph-specific indexing is enabled by defining the name of the index field to use for storing the graph identifier.

If you use an assembler configuration, set the graph field using the text:graphField property on the EntityMap, e.g.

# Mapping in the index
# URI stored in field "uri"
# Graph stored in field "graph"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .

If you configure the index in Java code, you need to use one of the EntityDefinition constructors that support the graphField parameter, e.g.

    EntityDefinition entDef = new EntityDefinition("uri", "text", "graph", RDFS.label.asNode()) ;

Note: If you migrate from a global (non-graph-aware) index to a graph-aware index, you need to rebuild the index to ensure that the graph information is stored.

Linguistic support with Lucene index

Language tags associated with rdf:langStrings occurring as literals in triples may be used to enhance indexing and queries. The sub-sections below detail different settings with the index, and use cases with SPARQL queries.

Explicit Language Field in the Index

The language tag for object literals of triples can be stored (during triple insert/update) into the index to extend query capabilities. For that, the text:langField property must be set in the EntityMap assembler:

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;        
    text:langField        "lang" ;       
    .

If you configure the index via Java code, set this parameter on the EntityDefinition instance, e.g.

EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
docDef.setLangField("lang");

Note that configuring text:langField does not select a language-specific analyzer. It merely records the tag associated with an indexed rdf:langString.

SPARQL Linguistic Clause Forms

Once the langField is set, you can use it directly inside SPARQL queries. The lang:xx argument of text:query restricts matches to literals with a specific language tag. For example:

# target English literals
?s text:query (rdfs:label 'word' 'lang:en' )

# target unlocalized literals
?s text:query (rdfs:label 'word' 'lang:none')

# ignore language field
?s text:query (rdfs:label 'word')

See the discussion of querying above for further details.
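
As a rough mental model of the three clause forms, the plain-Java sketch below treats the lang: argument as a filter on the stored language tag. It is an illustration only and not jena-text code: the class and method names are hypothetical, and the exact, case-sensitive comparison of tags is an assumption.

```java
/** Illustration only: models the three lang clause forms as a filter on the stored tag. */
class LangClauseSketch {

    /**
     * @param storedTag language tag recorded in the langField, or null/empty if the literal had none
     * @param langArg   the third text:query argument, e.g. "lang:en" or "lang:none", or null if omitted
     * @return true if a literal with this stored tag would be targeted by the clause
     */
    static boolean accepts(String storedTag, String langArg) {
        if (langArg == null || !langArg.startsWith("lang:")) {
            return true; // no lang argument: the language field is ignored
        }
        String wanted = langArg.substring("lang:".length());
        boolean untagged = storedTag == null || storedTag.isEmpty();
        if (wanted.equals("none")) {
            return untagged; // lang:none targets literals without a language tag
        }
        return !untagged && wanted.equals(storedTag); // lang:xx targets literals tagged xx
    }

    public static void main(String[] args) {
        System.out.println(accepts("en", "lang:en"));   // true
        System.out.println(accepts("fr", "lang:en"));   // false
        System.out.println(accepts(null, "lang:none")); // true
        System.out.println(accepts("fr", null));        // true
    }
}
```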

LocalizedAnalyzer

You can specify a LocalizedAnalyzer to benefit from Lucene's language-specific analyzers (stemming, stop words, ...). Like any other analyzer, it can be set as the default analyzer for text indexing, for an individual field, or for queries.

Using an assembler configuration, the text:language property needs to be provided, e.g.:

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    text:entityMap <#entMap> ;
    text:analyzer [
        a text:LocalizedAnalyzer ;
        text:language "fr"
    ]
    .

This configures the index to analyze values of the default property field using a FrenchAnalyzer.

To configure the same example via Java code, you need to provide the analyzer to the index configuration object:

    TextIndexConfig config = new TextIndexConfig(def);
    Analyzer analyzer = Util.getLocalizedAnalyzer("fr");
    config.setAnalyzer(analyzer);
    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;

Here def, ds1 and dir are instances of the EntityDefinition, Dataset and Directory classes, respectively.

Note: You do not have to set the text:langField property with a single localized analyzer. Also note that the above configuration will use the FrenchAnalyzer for all strings indexed under the default property regardless of the language tag associated with the literal (if any).

Multilingual Support

Let us suppose that we have many triples with many localized literals in many different languages. It is possible to take all these languages into account for future mixed localized queries. Configure the text:multilingualSupport property to enable indexing and search via localized analyzers based on the language tag:

<#indexLucene> a text:TextIndexLucene ;
    text:directory "mem" ;
    text:multilingualSupport true;     
    .

Via Java code, set the multilingual support flag:

    TextIndexConfig config = new TextIndexConfig(def);
    config.setMultilingualSupport(true);
    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;

The multilingual index dynamically combines the localized analyzers of the supported languages with the storage of the langField property.

The multilingual analyzer becomes the default analyzer and the Lucene StandardAnalyzer is the default analyzer used when there is no language tag.

It is straightforward to refer to different languages in the same text search query:

SELECT ?s
WHERE {
    { ?s text:query ( rdfs:label 'institut' 'lang:fr' ) }
    UNION
    { ?s text:query ( rdfs:label 'institute' 'lang:en' ) }
}

Hence, the result set of the query will contain subjects related to "institute" (institution, institutional, ...) in both French and English.

Note: When multilingual indexing is enabled for a property, e.g., rdfs:label, two copies of each literal are actually indexed: one under the field name, e.g., "label", and one under the name "label_xx", where "xx" is the language tag.
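
The naming convention in this note can be sketched in a few lines of plain Java. This is an illustration only, not jena-text code: the class and method names are hypothetical, and the handling of literals without a language tag (only the plain field name is used) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: field names a literal is indexed under with multilingual support on. */
class MultilingualFieldNames {

    /** Returns the field names used for a literal of the given field and language tag. */
    static List<String> fieldsFor(String field, String langTag) {
        List<String> names = new ArrayList<>();
        names.add(field); // copy under the plain field name, e.g. "label"
        if (langTag != null && !langTag.isEmpty()) {
            names.add(field + "_" + langTag); // language-specific copy, e.g. "label_fr"
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor("label", "fr")); // [label, label_fr]
        System.out.println(fieldsFor("label", ""));   // [label]
    }
}
```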

Generic and Defined Analyzer Support

There are many Analyzers that do not have built-in support, e.g., BrazilianAnalyzer; require constructor parameters not otherwise supported, e.g., a stop words FileReader or a stemExclusionSet; or make use of Analyzers not included in the bundled Lucene distribution, e.g., a SanskritIASTAnalyzer. Two features have been added to enhance the utility of jena-text: 1) text:GenericAnalyzer; and 2) text:DefinedAnalyzer.

Generic Analyzer

A text:GenericAnalyzer includes a text:class which is the fully qualified class name of an Analyzer that is accessible on the Jena classpath. This is trivial for Analyzer classes included in the bundled Lucene distribution; for custom Analyzers it is a simple matter of adding a jar containing the custom Analyzer, and any associated Tokenizer and Filters, to the classpath.

In addition to the text:class it is generally useful to include an ordered list of text:params that will be used to select an appropriate constructor of the Analyzer class. If there are no text:params in the analyzer specification or if the text:params is an empty list then the nullary constructor is used to instantiate the analyzer. Each element of the list of text:params includes:

  • an optional text:paramName of type Literal that is useful to identify the purpose of a parameter in the assembler configuration
  • a required text:paramType which is one of:

    Type                 Description
    text:TypeAnalyzer    a subclass of org.apache.lucene.analysis.Analyzer
    text:TypeBoolean     a java boolean
    text:TypeFile        the String path to a file materialized as a java.io.FileReader
    text:TypeInt         a java int
    text:TypeString      a java String
    text:TypeSet         an org.apache.lucene.analysis.CharArraySet

  • a required text:paramValue with an object of the type corresponding to text:paramType

In the case of an analyzer parameter, the text:paramValue is any text:analyzer resource as described throughout this document.
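
How an ordered list of parameter types can pick out a constructor may be easier to see with plain Java reflection. The sketch below is not the jena-text implementation: the class name, the javaType mapping and the use of java.lang.StringBuilder as a stand-in for an Analyzer class are all assumptions made so that the example runs without Lucene on the classpath.

```java
import java.lang.reflect.Constructor;

/** Illustration only: selecting a constructor from an ordered list of parameter types. */
class ConstructorSelectionSketch {

    /** Hypothetical mapping from a few text:paramType names to Java types. */
    static Class<?> javaType(String paramType) {
        switch (paramType) {
            case "TypeBoolean": return boolean.class;
            case "TypeInt":     return int.class;
            case "TypeString":  return String.class;
            default: throw new IllegalArgumentException("Not covered by this sketch: " + paramType);
        }
    }

    /** Instantiates className with the constructor whose signature matches paramTypes in order. */
    static Object instantiate(String className, String[] paramTypes, Object[] paramValues) {
        try {
            Class<?>[] signature = new Class<?>[paramTypes.length];
            for (int i = 0; i < paramTypes.length; i++) {
                signature[i] = javaType(paramTypes[i]);
            }
            // An empty signature selects the nullary constructor.
            Constructor<?> ctor = Class.forName(className).getConstructor(signature);
            return ctor.newInstance(paramValues);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No usable constructor on " + className, e);
        }
    }

    public static void main(String[] args) {
        Object viaNullary = instantiate("java.lang.StringBuilder", new String[0], new Object[0]);
        Object viaString = instantiate("java.lang.StringBuilder",
                new String[] { "TypeString" }, new Object[] { "stop words" });
        System.out.println(viaNullary.toString().isEmpty()); // true
        System.out.println(viaString);                       // stop words
    }
}
```

Because the signature is built in list order, swapping two entries in the parameter list would look up a different constructor, which is why the order of text:params matters.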

An example of the use of text:GenericAnalyzer to configure an EnglishAnalyzer with stop words and stem exclusions is:

text:map (
     [ text:field "text" ; 
       text:predicate rdfs:label;
       text:analyzer [
           a text:GenericAnalyzer ;
           text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
           text:params (
                [ text:paramName "stopwords" ;
                  text:paramType text:TypeSet ;
                  text:paramValue ("the" "a" "an") ]
                [ text:paramName "stemExclusionSet" ;
                  text:paramType text:TypeSet ;
                  text:paramValue ("ing" "ed") ]
                )
       ]
     ]
     ) .

Here is an example of defining an instance of ShingleAnalyzerWrapper:

text:map (
     [ text:field "text" ; 
       text:predicate rdfs:label;
       text:analyzer [
           a text:GenericAnalyzer ;
           text:class "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
           text:params (
                [ text:paramName "defaultAnalyzer" ;
                  text:paramType text:TypeAnalyzer ;
                  text:paramValue [ a text:SimpleAnalyzer ] ]
                [ text:paramName "maxShingleSize" ;
                  text:paramType text:TypeInt ;
                  text:paramValue 3 ]
                )
       ]
     ]
     ) .

If you need an analyzer whose constructor parameter types are not included here, one approach is to define an AnalyzerWrapper that uses the available parameter types, such as file, to collect the information needed to instantiate the desired analyzer. An example of such an analyzer is the Kuromoji morphological analyzer for Japanese text, whose constructor takes parameters of types UserDictionary, JapaneseTokenizer.Mode, CharArraySet and Set<String>.

Defined Analyzers

The text:defineAnalyzers feature extends the Multilingual Support defined above. It can also be used to name analyzers defined via text:GenericAnalyzer so that a single (perhaps complex) analyzer configuration can be used in several places.

The text:defineAnalyzers property is used with text:TextIndexLucene to provide a list of analyzer definitions:

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    text:entityMap <#entMap> ;
    text:defineAnalyzers (
        [ text:addLang "sa-x-iast" ;
          text:analyzer [ . . . ] ]
        [ text:defineAnalyzer <#foo> ;
          text:analyzer [ . . . ] ]
    )
    .

References to a defined analyzer may be made in the entity map like:

text:analyzer [
    a text:DefinedAnalyzer ;
    text:useAnalyzer <#foo> ]

Extending multilingual support

The Multilingual Support described above allows a limited set of ISO 2-letter codes to be used to select among built-in analyzers, each instantiated with its nullary constructor. If you want to:

  • use a language not included, e.g., Brazilian; or
  • use additional constructors defining stop words, stem exclusions and so on; or
  • refer to custom analyzers that might be associated with generalized BCP-47 language tags, such as sa-x-iast for Sanskrit in the IAST transliteration,

then text:defineAnalyzers with text:addLang will add the desired analyzers to the multilingual support so that fields with the appropriate language tags will use the appropriate custom analyzer.

When text:defineAnalyzers is used with text:addLang, text:multilingualSupport is implicitly enabled if not already specified, and a warning is written to the log. For example:

    text:defineAnalyzers (
        [ text:addLang "sa-x-iast" ;
          text:analyzer [ . . . ] ]
    )

This adds an analyzer to be used when the text:langField has the value sa-x-iast during indexing and search.
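
Conceptually, text:addLang registers one more entry in a lookup from language tag to analyzer. The plain-Java sketch below models that lookup with analyzer names as strings. It is an illustration only: the class and its methods are hypothetical, and falling back to the default analyzer for an unregistered tag is an assumption (the text above only states the fallback for literals without a language tag).

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: choosing an analyzer by language tag, with a fallback. */
class LangAnalyzerRegistry {

    private final Map<String, String> analyzerByLang = new HashMap<>();
    private final String fallback;

    LangAnalyzerRegistry(String fallback) {
        this.fallback = fallback;
    }

    /** Mirrors the role of text:addLang: associate a language tag with an analyzer. */
    void addLang(String langTag, String analyzerName) {
        analyzerByLang.put(langTag, analyzerName);
    }

    /** Returns the analyzer registered for the tag, or the fallback if there is none. */
    String analyzerFor(String langTag) {
        if (langTag == null || langTag.isEmpty()) {
            return fallback; // no language tag
        }
        return analyzerByLang.getOrDefault(langTag, fallback);
    }

    public static void main(String[] args) {
        LangAnalyzerRegistry registry = new LangAnalyzerRegistry("StandardAnalyzer");
        registry.addLang("fr", "FrenchAnalyzer");               // a built-in language
        registry.addLang("sa-x-iast", "SanskritIASTAnalyzer");  // added via text:addLang
        System.out.println(registry.analyzerFor("sa-x-iast"));  // SanskritIASTAnalyzer
        System.out.println(registry.analyzerFor(null));         // StandardAnalyzer
    }
}
```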

Naming analyzers for later use

Repeating a text:GenericAnalyzer specification for use with multiple fields in an entity map may be cumbersome. The text:defineAnalyzer property is used in an element of a text:defineAnalyzers list to associate a resource with an analyzer so that it may be referred to later in a text:analyzer object. Assuming that an analyzer definition such as the following appears in the text:defineAnalyzers list:

[ text:defineAnalyzer <#foo> ;
  text:analyzer [ . . . ] ]

then in a text:analyzer specification in an entity map, for example, a reference to analyzer <#foo> is made via:

text:map (
     [ text:field "text" ; 
       text:predicate rdfs:label;
       text:analyzer [
           a text:DefinedAnalyzer ;
           text:useAnalyzer <#foo> ]
     ]
     ) .

This makes it straightforward to refer to the same (possibly complex) analyzer definition in multiple fields.

Storing Literal Values

New in Jena 3.0.0.

It is possible to configure the text index to store enough information in the text index to be able to access the original indexed literal values at query time. This is controlled by two configuration options. First, the text:storeValues property must be set to true for the text index:

<#indexLucene> a text:TextIndexLucene ;
    text:directory "mem" ;
    text:storeValues true;     
    .

Or, using Java code, use the setValueStored method of TextIndexConfig:

    TextIndexConfig config = new TextIndexConfig(def);
    config.setValueStored(true);

Additionally, setting the langField configuration option is recommended. See Linguistic Support with Lucene Index for details. Without the langField setting, the stored literals will not have language tag or datatype information.

At query time, the stored literals can be accessed by using a 3-element list of variables as the subject of the text:query property function. The literal value will be bound to the third variable:

(?s ?score ?literal) text:query 'word'

Working with Fuseki

The Fuseki configuration simply points to the text dataset as the fuseki:dataset of the service.

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "ds" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  :text_dataset ;
    .

Building a Text Index

When working at scale, or when preparing a published, read-only, SPARQL service, creating the index by loading the text dataset is impractical.
The index and the dataset can be built using command line tools in two steps: first load the RDF data, second create an index from the existing RDF dataset.

Step 1 - Building a TDB dataset

Note: If you have an existing TDB dataset then you can skip this step.

Build the TDB dataset:

java -cp $FUSEKI_HOME/fuseki-server.jar tdb.tdbloader --tdb=assembler_file data_file

using the copy of TDB included with Fuseki.

Alternatively, use one of the TDB utilities tdbloader or tdbloader2 which are better for bulk loading:

$JENA_HOME/bin/tdbloader --loc=directory  data_file

Step 2 - Build the Text Index

You can then build the text index with the jena.textindexer tool:

java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --desc=assembler_file

Because a Fuseki assembler description can have several dataset descriptions and several text indexes, it may be necessary to extract a single dataset and index description into a separate assembler file for use in loading.

Updating the index

If you allow updates to the dataset through Fuseki, the configured index will automatically be updated on every modification. This means that you do not have to run the above-mentioned jena.textindexer after updates; it is only needed when you want to rebuild the index from scratch.

Configuring Alternative TextDocProducers

The default behaviour when text indexing is to index a single property as a single field, generating a different Document for each indexed triple. To change this behaviour requires writing and configuring an alternative TextDocProducer.

To configure a TextDocProducer, say net.code.MyProducer, in a dataset assembly, use the property text:textDocProducer, e.g.:

<#ds-with-lucene> rdf:type text:TextDataset;
    text:index <#indexLucene> ;
    text:dataset <#ds> ;
    text:textDocProducer <java:net.code.MyProducer> ;
    .

where the java: URI gives the fully qualified Java class name (here net.code.MyProducer). The class must have either a single-argument constructor of type TextIndex, or a two-argument constructor (DatasetGraph, TextIndex). The TextIndex argument will be the configured text index, and the DatasetGraph argument will be the graph of the configured dataset.
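
The two supported constructor shapes can be looked up by reflection roughly as in the following plain-Java sketch. It is an illustration only, not the jena-text loading code: DatasetGraph and TextIndex here are empty stand-in interfaces, the producer classes are hypothetical, and trying the two-argument constructor before the single-argument one is an assumption.

```java
import java.lang.reflect.Constructor;

/** Illustration only: locating one of the two supported constructor shapes by reflection. */
class ProducerLoadingSketch {

    // Hypothetical stand-ins for the Jena types DatasetGraph and TextIndex.
    interface DatasetGraph {}
    interface TextIndex {}

    /** A producer with the single-argument constructor shape. */
    static class IndexOnlyProducer {
        final TextIndex index;
        public IndexOnlyProducer(TextIndex index) { this.index = index; }
    }

    /** A producer with the two-argument constructor shape. */
    static class DatasetAwareProducer {
        final DatasetGraph dataset;
        final TextIndex index;
        public DatasetAwareProducer(DatasetGraph dataset, TextIndex index) {
            this.dataset = dataset;
            this.index = index;
        }
    }

    /** Tries (DatasetGraph, TextIndex) first, then (TextIndex); this order is an assumption. */
    static Object create(Class<?> producerClass, DatasetGraph dataset, TextIndex index) {
        try {
            try {
                Constructor<?> twoArg = producerClass.getConstructor(DatasetGraph.class, TextIndex.class);
                return twoArg.newInstance(dataset, index);
            } catch (NoSuchMethodException e) {
                Constructor<?> oneArg = producerClass.getConstructor(TextIndex.class);
                return oneArg.newInstance(index);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No supported constructor on " + producerClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        DatasetGraph dataset = new DatasetGraph() {};
        TextIndex index = new TextIndex() {};
        System.out.println(create(IndexOnlyProducer.class, dataset, index).getClass().getSimpleName());
        System.out.println(create(DatasetAwareProducer.class, dataset, index).getClass().getSimpleName());
    }
}
```

A class offering neither constructor shape ends up in the outer catch block and is rejected with an IllegalArgumentException in this sketch.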

For example, to explicitly create the default TextDocProducer use:

...
    text:textDocProducer <java:org.apache.jena.query.text.TextDocProducerTriples> ;
...

TextDocProducerTriples produces a new document for each subject/field added to the dataset, using TextIndex.addEntity(Entity).

Example

The example class below is a TextDocProducer that only indexes ADDs of quads for which the subject already has at least one value for the property. It uses the two-argument constructor to give it access to the dataset so that it can count the (?G, S, P, ?O) quads with that subject and predicate, and delegates the indexing to TextDocProducerTriples if there are at least two values for that property (one of those values, of course, is the one that gives rise to this change()).

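  import java.util.Iterator;

  import org.apache.jena.graph.Node;
  import org.apache.jena.query.text.TextDocProducerTriples;
  import org.apache.jena.query.text.TextIndex;
  import org.apache.jena.sparql.core.DatasetGraph;
  import org.apache.jena.sparql.core.Quad;
  import org.apache.jena.sparql.core.QuadAction;

  public class Example extends TextDocProducerTriples {

      // Dataset being indexed, used to count existing values.
      final DatasetGraph dg;

      public Example(DatasetGraph dg, TextIndex indexer) {
          super(indexer);
          this.dg = dg;
      }

      @Override
      public void change(QuadAction qaction, Node g, Node s, Node p, Node o) {
          // Only ADDs are considered; all other actions are ignored.
          if (qaction == QuadAction.ADD) {
              if (alreadyHasOne(s, p)) super.change(qaction, g, s, p, o);
          }
      }

      // True if at least two quads, in any graph, have this subject and predicate.
      private boolean alreadyHasOne(Node s, Node p) {
          int count = 0;
          Iterator<Quad> quads = dg.find( null, s, p, null );
          while (quads.hasNext()) { quads.next(); count += 1; }
          return count > 1;
      }
  }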

Maven Dependency

The jena-text module is included in Fuseki. To use it within application code, add the following Maven dependency:

<dependency>
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-text</artifactId>
  <version>X.Y.Z</version>
</dependency>

adjusting the version X.Y.Z as necessary. This will automatically include a compatible version of Lucene.

For the Elasticsearch implementation, include the following Maven dependency:

<dependency>
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-text-es</artifactId>
  <version>X.Y.Z</version>
</dependency>

adjusting the version X.Y.Z as necessary.