org.apache.jena.assembler.assemblers.AssemblerBase

org.apache.jena.query.text.assembler.GenericTokenizerAssembler

All Implemented Interfaces: org.apache.jena.assembler.Assembler

public class GenericTokenizerAssembler extends org.apache.jena.assembler.assemblers.AssemblerBase

Creates generic tokenizers given a fully qualified Class name and a list of parameters for a constructor of the Class.

The parameters may be of the following types:

     text:TypeString    String
     text:TypeSet       org.apache.lucene.analysis.util.CharArraySet
     text:TypeFile      java.io.FileReader
     text:TypeInt       int
     text:TypeBoolean   boolean
     text:TypeAnalyzer  org.apache.lucene.analysis.Analyzer

Although this list of types is not exhaustive, it is straightforward to write a wrapper Analyzer that reads a file and uses its contents to initialize whatever other kinds of parameters a given Analyzer needs. The provided types cover the most common cases.

For example, org.apache.lucene.analysis.ja.JapaneseAnalyzer has a constructor with 4 parameters: a UserDictionary, a JapaneseTokenizer.Mode, a CharArraySet, and a Set<String>. A simple wrapper can read the values needed for the parameters whose types are not available in this extension, construct the required instances, and instantiate the JapaneseAnalyzer.

To add a custom Analyzer such as the wrapper above, put the Analyzer class and any associated filters, tokenizers, and other supporting classes on the Jena classpath, usually in a jar. All of the Analyzers included in the Lucene distribution bundled with Jena are also available as generic Analyzers.
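
As a sketch of how such a wrapper might be referenced once it is on the classpath, the entry below could be placed in a text:defineAnalyzers list. It assumes the text:GenericAnalyzer type from the same vocabulary and a hypothetical wrapper class whose only constructor takes a java.io.FileReader. The class name, the <#jaWrapper> resource, the parameter name, and the file path are all illustrative and are not part of Jena or Lucene.

    [text:defineAnalyzer <#jaWrapper> ;                        # hypothetical resource name
     text:analyzer [
       a text:GenericAnalyzer ;
       # hypothetical wrapper class; it must be on the Jena classpath
       text:class "org.example.analysis.JapaneseAnalyzerWrapper" ;
       text:params (
            # text:TypeFile is passed to the constructor as a java.io.FileReader
            [ text:paramName "config" ;
              text:paramType text:TypeFile ;
              text:paramValue "conf/japanese-analyzer.properties" ]   # illustrative path
            )
     ]]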

Each parameter object is specified with:

- an optional text:paramName, which may be used to document which parameter is represented
- a text:paramType, which is one of text:TypeString, text:TypeSet, text:TypeFile, text:TypeInt, text:TypeBoolean, or text:TypeAnalyzer
- a text:paramValue, which is an xsd:string, xsd:boolean, or xsd:int literal, or a resource

A parameter of type text:TypeSet must have a text:paramValue that is a list of zero or more strings.

A parameter of type text:TypeString, text:TypeFile, text:TypeBoolean, text:TypeInt or text:TypeAnalyzer must have a single text:paramValue of the appropriate type.
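
The rules above can be illustrated with a few parameter objects. This is a sketch only: the parameter names are hypothetical, the combination does not correspond to the constructor of any particular Lucene class, and text:StandardAnalyzer is assumed to be available in the same vocabulary.

    text:params (
         # text:TypeSet: the value is a list of zero or more strings
         [ text:paramName "stopwords" ;
           text:paramType text:TypeSet ;
           text:paramValue ( "the" "a" "an" ) ]
         # text:TypeBoolean: a single xsd:boolean value
         [ text:paramName "ignoreCase" ;
           text:paramType text:TypeBoolean ;
           text:paramValue true ]
         # text:TypeAnalyzer: a single resource describing an Analyzer
         [ text:paramName "delegate" ;
           text:paramType text:TypeAnalyzer ;
           text:paramValue [ a text:StandardAnalyzer ] ]
         # text:paramName is optional and is omitted here
         [ text:paramType text:TypeString ;
           text:paramValue "abc" ]
         )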

Example:

    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:entityMap <#entMap> ;
        text:defineAnalyzers (
            [text:addLang "sa-x-iast" ;
             text:analyzer [ . . . ]]
            [text:defineAnalyzer <#foo> ;
             text:analyzer [ . . . ]]
            [text:defineTokenizer <#bar> ;
             text:tokenizer [
               a text:GenericTokenizer ;
               text:class "org.apache.lucene.analysis.ngram.NGramTokenizer" ;
               text:params (
                    [ text:paramName "minGram" ;
                      text:paramType text:TypeInt ;
                      text:paramValue 3 ]
                    [ text:paramName "maxGram" ;
                      text:paramType text:TypeInt ;
                      text:paramValue 7 ]
                    )
              ]
            ]
        ) .
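
A tokenizer defined this way is only useful once an analyzer refers to it. The following entry is a sketch of how <#bar> might be used from another entry in the same text:defineAnalyzers list, assuming the text:ConfigurableAnalyzer type and the text:LowerCaseFilter filter from the Jena text vocabulary. The <#ngramAnalyzer> resource name is hypothetical.

    [text:defineAnalyzer <#ngramAnalyzer> ;        # hypothetical resource name
     text:analyzer [
       a text:ConfigurableAnalyzer ;
       text:tokenizer <#bar> ;                     # the tokenizer defined above
       text:filters ( text:LowerCaseFilter )
     ]]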

Nested Class Summary

     Modifier and Type   Class
     static class        GenericTokenizerAssembler.TokenizerSpec

Field Summary

Fields inherited from interface org.apache.jena.assembler.Assembler
content, defaultModel, documentManager, general, infModel, memoryModel, ontModel, ontModelSpec, prefixMapping, reasonerFactory, ruleSet, unionModel

Constructor Summary

     Constructor
     GenericTokenizerAssembler()

Method Summary

     Modifier and Type                         Method
     GenericTokenizerAssembler.TokenizerSpec   open(org.apache.jena.assembler.Assembler a, org.apache.jena.rdf.model.Resource root, org.apache.jena.assembler.Mode mode)

Methods inherited from class org.apache.jena.assembler.assemblers.AssemblerBase
getOptionalClassName, getRequiredResource, open, open, openModel, openModel

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details

- GenericTokenizerAssembler

  public GenericTokenizerAssembler()

Method Details

- open

  public GenericTokenizerAssembler.TokenizerSpec open(org.apache.jena.assembler.Assembler a, org.apache.jena.rdf.model.Resource root, org.apache.jena.assembler.Mode mode)

  Specified by: open in interface org.apache.jena.assembler.Assembler

  Specified by: open in class org.apache.jena.assembler.assemblers.AssemblerBase
