Class LanguagePrefixedTokenStream
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.alfresco.solr.schema.highlight.LanguagePrefixedTokenStream
- All Implemented Interfaces:
Closeable, AutoCloseable
public final class LanguagePrefixedTokenStream
extends org.apache.lucene.analysis.Tokenizer
A TokenStream decorator that dynamically determines the field type and the analyzer used to analyze an input text.
Although this class extends Tokenizer, it is not actually a tokenizer. Selecting the analyzer dynamically requires
access to an IndexSchema instance, and that is usually not possible from within the components involved in the
analysis chain (e.g. tokenizers, token filters, char filters).
The field type and the analyzer that control the text analysis are determined as follows:
- The input reader given to this chain is pre-processed in order to detect the locale language code at the very
beginning. The locale language prefix consists of:
  - an opening sentinel token #0;
  - a language code (two or three characters)
  - a closing sentinel token #0;
- If a language code has been found, it is used to build a field type name composed of the prefix "highlighted_text_" and the detected language code (e.g. highlighted_text_ + en = highlighted_text_en).
- If that field type doesn't exist in the schema, the same procedure is repeated using the prefix "text_" (e.g. text_ + en = text_en).
- If that field type doesn't exist in the schema either, the general text field type "text___" is used.
- The input text is analyzed using the (query or index) analyzer associated with the field type determined above.
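The lookup order described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual implementation of this class: the class and method names (FieldTypeResolver, detectLanguage, resolveFieldType) are hypothetical, the schema is modelled as a plain Set of field type names rather than an IndexSchema, the sentinel is assumed to be the NUL character, and falling back to "text___" when no language prefix is present is also an assumption.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of the documented API.
final class FieldTypeResolver {
    // Assumption: the sentinel written as "#0;" above is the NUL character.
    static final char SENTINEL = '\u0000';
    static final String FALLBACK_FIELD_TYPE = "text___";

    // Extracts the language code from a prefix of the form
    // SENTINEL + code + SENTINEL, where code has two or three characters.
    static Optional<String> detectLanguage(String text) {
        if (text == null || text.isEmpty() || text.charAt(0) != SENTINEL) {
            return Optional.empty();
        }
        int close = text.indexOf(SENTINEL, 1);
        int length = close - 1; // negative when no closing sentinel exists
        if (length < 2 || length > 3) {
            return Optional.empty();
        }
        return Optional.of(text.substring(1, close));
    }

    // Tries "highlighted_text_<lang>", then "text_<lang>", then the fallback.
    // The Set stands in for the field types declared in the schema.
    static String resolveFieldType(String text, Set<String> schemaFieldTypes) {
        Optional<String> language = detectLanguage(text);
        if (language.isPresent()) {
            for (String prefix : new String[] {"highlighted_text_", "text_"}) {
                String candidate = prefix + language.get();
                if (schemaFieldTypes.contains(candidate)) {
                    return candidate;
                }
            }
        }
        // Assumption: the general type is also used when no prefix is found.
        return FALLBACK_FIELD_TYPE;
    }

    public static void main(String[] args) {
        String input = SENTINEL + "en" + SENTINEL + "some text";
        // prints "highlighted_text_en"
        System.out.println(resolveFieldType(input, Set.of("highlighted_text_en", "text_en")));
        // prints "text_en"
        System.out.println(resolveFieldType(input, Set.of("text_en")));
    }
}
```

In the real class the chosen field type would then supply the query or index analyzer, depending on the mode; that step is omitted here because it depends on the Solr schema API.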
- Author:
- Andrea Gazzarini
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.State
Field Summary
Fields
protected org.apache.lucene.analysis.Analyzer analyzer
protected String fieldName
protected org.apache.solr.schema.IndexSchema indexSchema
protected AlfrescoAnalyzerWrapper.Mode mode
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Method Summary
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
fieldName
protected String fieldName
-
indexSchema
protected org.apache.solr.schema.IndexSchema indexSchema
mode
protected AlfrescoAnalyzerWrapper.Mode mode
-
analyzer
protected org.apache.lucene.analysis.Analyzer analyzer
-
-
Method Details
-
reset
- Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
- Throws:
IOException
-
incrementToken
- Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
-
end
- Overrides:
end in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class org.apache.lucene.analysis.Tokenizer
- Throws:
IOException
-