My data is already tokenized with an external resource and I'd like to use that data within lucene. My first idea would be to join those strings with a \x01 and use a WhiteSpaceTokenizer to split them again. Is there a better idea? (the input is in XML)
As bonus, this annotated data also contains synonyms, how would I inject them (represented as XML tags).
Lucene allows you to provide your own stream of tokens to the field, bypassing the tokenization step. To do that you can create your own subclass of TokenStream implementing incrementToken() and then call field.setTokenStream(new MyTokenStream(yourTokens)):
public class MyTokenStream extends TokenStream {

    private final CharTermAttribute charTermAtt;
    private final OffsetAttribute offsetAtt;
    private final Iterator<MyToken> listOfTokens;

    MyTokenStream(Iterator<MyToken> tokenList) {
        listOfTokens = tokenList;
        charTermAtt = addAttribute(CharTermAttribute.class);
        offsetAtt = addAttribute(OffsetAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (listOfTokens.hasNext()) {
            super.clearAttributes();
            MyToken myToken = listOfTokens.next();
            // Copy the token text and its offsets into the stream's attributes.
            charTermAtt.setLength(0);
            charTermAtt.append(myToken.getText());
            offsetAtt.setOffset(myToken.begin(), myToken.end());
            return true;
        }
        return false;
    }
}
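To attach such a pre-tokenized stream to a field, you hand it to the field before adding the document. A minimal sketch, assuming the Lucene 3.x-style Field API and an Iterator<MyToken> called myTokens produced by your external tokenizer (both names are placeholders):

Document doc = new Document();
// The string value is only used if you also store the field; indexing uses the token stream set below.
Field field = new Field("content", "", Field.Store.NO, Field.Index.ANALYZED);
field.setTokenStream(new MyTokenStream(myTokens));
doc.add(field);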
WhitespaceTokenizer is unfit for strings joined with 0x01. Instead, derive from CharTokenizer, overriding isTokenChar.
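A minimal sketch of such a tokenizer, assuming a Lucene 4.x-style CharTokenizer where isTokenChar receives an int code point (older versions take a char and require a Version argument in the constructor; the class name is made up):

import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;

public final class SeparatorTokenizer extends CharTokenizer {

    public SeparatorTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean isTokenChar(int c) {
        // Everything except the 0x01 separator belongs to a token.
        return c != '\u0001';
    }
}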
The main problem with this approach is that joining and then splitting again might be expensive; if it turns out to be too expensive, you can implement a trivial TokenStream that just emits the tokens from its input.
If by synonyms you mean that a term like "programmer" is expanded to a set of terms, say, {"programmer", "developer", "hacker"}, then I recommend emitting these at the same position. You can use a PositionIncrementAttribute to control this.
For an example of PositionIncrementAttribute usage, see my lemmatizing TokenStream which emits both word forms found in full text and their lemmas at the same position.
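A minimal sketch of the same idea, as a hypothetical TokenFilter (not the lemmatizer mentioned above) that queues synonyms for the token it just returned and emits them with a position increment of 0; synonymsFor(...) is a placeholder for whatever your XML annotations provide:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class SynonymEmittingFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private final Deque<String> pending = new ArrayDeque<String>();

    protected SynonymEmittingFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a synonym at the same position as the token we just returned.
            clearAttributes();
            termAtt.setEmpty().append(pending.poll());
            posIncrAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        pending.addAll(synonymsFor(termAtt.toString()));
        return true;
    }

    private Collection<String> synonymsFor(String term) {
        return Collections.emptyList(); // placeholder lookup
    }
}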
I have a question regarding best practices considering Java regular expressions/Strings manipulation.
I have a changing String template, let's say this time it looks like this:
/get/{id}/person
I have another String that matches this pattern, e.g.:
/get/1234ewq/person
Keep in mind that the pattern could change anytime, slashes could disappear etc.
I would like to extract the difference between the two of them i.e. the result of the processing would be 1234ewq.
I know I could iterate over them char by char and compare, but, if it is possible, I wanted to find some smart approach to it with regular expressions.
What would be the best Java approach?
Thank you.
To answer your question with a regex approach, I built a small example class that should hint at a direction you could take (see below).
The problem with this approach is that you dynamically create a regular expression that depends on your template strings. This means that you have to somehow verify that your templates do not interfere with the regex compilation and matching process itself.
Also, at the moment, if you use the same placeholder multiple times within a template, the resulting HashMap only contains the value of the last occurrence of that placeholder.
Normally this is the expected behaviour, but it depends on your strategy for filling your templates.
For template processing in general you could have a look at the mustache library.
Also, as Uli Sotschok mentioned, you would probably be better off using something like google-diff-match-patch.
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringExtractionFromTemplate {

    public static void main(String[] args) {
        String template = "/get/{id}/person";
        String filledTemplate = "/get/1234ewq/person";

        System.out.println(diffTemplateInsertion(template, filledTemplate).get("id"));
    }

    private static HashMap<String, String> diffTemplateInsertion(String template, String filledTemplate) {
        //language=RegExp
        String placeHolderPattern = "\\{(.+)}";
        HashMap<String, String> templateTranslation = new HashMap<>();

        // Turn the template into a regex by replacing each {placeholder} with a capturing group.
        String regexedTemplate = template.replaceAll(placeHolderPattern, "(.+)");
        Pattern pattern = Pattern.compile(regexedTemplate);

        Matcher templateMatcher = pattern.matcher(template);
        Matcher filledTemplateMatcher = pattern.matcher(filledTemplate);

        while (templateMatcher.find() && filledTemplateMatcher.find()) {
            if (templateMatcher.groupCount() == filledTemplateMatcher.groupCount()) {
                for (int i = 1; i <= templateMatcher.groupCount(); i++) {
                    // Map the placeholder name (without braces) to the value found in the filled template.
                    templateTranslation.put(
                            templateMatcher.group(i).replaceAll(placeHolderPattern, "$1"),
                            filledTemplateMatcher.group(i)
                    );
                }
            }
        }
        return templateTranslation;
    }
}
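One way to address the first caveat above (literal template text being interpreted as regex syntax) is to quote the fixed parts of the template and only turn the placeholders into capturing groups. A hedged sketch of a helper you could add to the class above; templateToRegex is a made-up name:

// Quote the fixed parts of the template so that characters such as '.' or '(' stay literal,
// and turn each {placeholder} into a capturing group.
private static String templateToRegex(String template) {
    Matcher m = Pattern.compile("\\{([^}]+)}").matcher(template);
    StringBuilder regex = new StringBuilder();
    int last = 0;
    while (m.find()) {
        regex.append(Pattern.quote(template.substring(last, m.start())));
        regex.append("(.+)");
        last = m.end();
    }
    regex.append(Pattern.quote(template.substring(last)));
    return regex.toString();
}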
I am currently in the process of upgrading a search engine application from Lucene 3.5.0 to version 4.10.3. There have been some substantial API changes in version 4 that break backward compatibility. I have managed to fix most of them, but a few issues remain that I could use some help with:
"cannot override final method from Analyzer"
The original code extended the Analyzer class and overrode tokenStream(...):
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    CharStream charStream = CharReader.get(reader);
    return new LowerCaseFilter(version,
            new SeparationFilter(version,
                    new WhitespaceTokenizer(version,
                            new HTMLStripFilter(charStream))));
}
But this method is final now and I am not sure how to understand the following note from the change log:
ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer implementations must now use Analyzer.TokenStreamComponents, rather than overriding .tokenStream() and .reusableTokenStream() (which are now final).
There is another problem in the method quoted above:
"The method get(Reader) is undefined for the type CharReader"
There seem to have been some considerable changes here, too.
"TermPositionVector cannot be resolved to a type"
This class is gone now in Lucene 4. Are there any simple fixes for this? From the change log:
The term vectors APIs (TermFreqVector, TermPositionVector, TermVectorMapper) have been removed in favor of the above flexible indexing APIs, presenting a single-document inverted index of the document from the term vectors.
Probably related to this:
"The method getTermFreqVector(int, String) is undefined for the type IndexReader."
Both problems occur here, for instance:
TermPositionVector termVector = (TermPositionVector) reader.getTermFreqVector(...);
("reader" is of Type IndexReader)
I would appreciate any help with these issues.
I found core developer Uwe Schindler's response to your question on the Lucene mailing list. It took me some time to wrap my head around the new API, so I need to write down something before I forget.
These notes apply to Lucene 4.10.3.
Implementing an Analyzer (1-2)
new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(new HTMLStripCharFilter(reader));
        TokenStream sink = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, sink);
    }
};
The constructor of TokenStreamComponents takes a source and a sink. The sink is the end result of your token stream, returned by Analyzer.tokenStream(), so set it to your filter chain. The source is the token stream before you apply any filters.
HTMLStripCharFilter, despite its name, is actually a subclass of java.io.Reader which removes HTML constructs, so you no longer need CharReader.
Term vector replacements (3-4)
Term vectors work differently in Lucene 4, so there are no straightforward method swaps. The specific answer depends on what your requirements are.
If you want positional information, you have to index your fields with positional information in the first place:
Document doc = new Document();
FieldType f = new FieldType();
f.setIndexed(true);
f.setStoreTermVectors(true);
f.setStoreTermVectorPositions(true);
doc.add(new Field("text", "hello", f));
Finally, in order to get at the frequency and positional info of a field of a document, you drill down the new API like this (adapted from this answer):
// IndexReader ir;
// int docID = 0;
Terms terms = ir.getTermVector(docID, "text");
terms.hasPositions(); // should be true if you set the field to store positions

TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;

// Explore the terms for this field
while ((term = termsEnum.next()) != null) {
    // Enumerate through documents, in this case only one
    DocsAndPositionsEnum docsEnum = termsEnum.docsAndPositions(null, null);
    int docIdEnum;
    while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < docsEnum.freq(); i++) {
            System.out.println(term.utf8ToString() + " " + docIdEnum + " "
                    + docsEnum.nextPosition());
        }
    }
}
It'd be nice if Terms.iterator() returned an actual Iterable.
(More specific problem details are below in the update) I have really long document field values. Tokens of these fields are of the form: word|payload|position_increment. (I need to control position increments and payload manually.)
I collect these compound tokens for the entire document, then join them with a '\t', and then pass this string to my custom analyzer.
(For really long field strings, something breaks in UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
The analyzer is just the following:
class AmbiguousTokenAnalyzer extends Analyzer {

    private PayloadEncoder encoder = new IntegerEncoder();

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
        TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
        sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
        sink.addAttribute(OffsetAttribute.class);
        sink.addAttribute(CharTermAttribute.class);
        sink.addAttribute(PayloadAttribute.class);
        sink.addAttribute(PositionIncrementAttribute.class);
        return new TokenStreamComponents(source, sink);
    }
}
CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter have an incrementToken() method in which the rightmost "|aaa" part of a token is processed.
The field is configured as:
attributeFieldType.setIndexed(true);
attributeFieldType.setStored(true);
attributeFieldType.setOmitNorms(true);
attributeFieldType.setTokenized(true);
attributeFieldType.setStoreTermVectorOffsets(true);
attributeFieldType.setStoreTermVectorPositions(true);
attributeFieldType.setStoreTermVectors(true);
attributeFieldType.setStoreTermVectorPayloads(true);
The problem is that if I pass the field to the analyzer as a whole (one huge string, via document.add(...)), it works fine, but if I pass it token by token, something breaks at the search stage.
As I read somewhere, these two approaches should produce the same index. Maybe my analyzer misses something?
UPDATE
Here is my problem in more detail: in addition to indexing, I need the multi-value field to be stored as-is. And if I pass it into the analyzer as multiple atomic tokens, it stores only the first of them.
What do I need to do to my custom analyzer to make it store all the atomic tokens concatenated eventually?
Well, it turns out that all the values are actually stored.
Here is what I get after indexing:
indexSearcher.doc(0).getFields("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:V|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:PR|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|0|1000 S|1|0>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|1|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:ADV|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
And the "single-field" version
indexSearcher.doc(0).getField("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
I don't know why getField() returns only the first value, but it seems that for my needs getFields() is OK.
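For completeness, a minimal sketch of reading all stored values of the multi-valued field, assuming a Lucene 4.x IndexSearcher named indexSearcher that is already open:

// Each stored value of the multi-valued "gramm" field comes back as its own IndexableField.
for (IndexableField field : indexSearcher.doc(0).getFields("gramm")) {
    System.out.println(field.stringValue());
}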
I would like to write a grep-type filter that takes a Guava CharStreams InputSupplier as its input and uses a CharStreams OutputSupplier as its output. It should only pass lines from the InputSupplier to the OutputSupplier if they satisfy a particular regular expression.
What is the correct design pattern/paradigm for doing this?
I would guess you would do the line filter like this:
InputSupplier<InputStreamReader> ris = CharStreams.newReaderSupplier(....
CharStreams.readLines(ris, new LineProcessor<....
and implementing the LineProcessor methods.
But what should LineProcessor.getResult() return - just success or failure? Should I be using a 'final' OutputSupplier in the surrounding function?
Or am I using completely the wrong API/approach?
A bit of pseudocode to demonstrate the best way would be much appreciated.
Thanks for your suggestions.
CharStreams.readLines returns a List<String> here. According to the source code, it asks the LineProcessor to accumulate lines and then returns its result. In my opinion your LineProcessor can be something like the one below:
// "regex" stands for whatever pattern String you want to filter on.
LineProcessor<List<String>> lineProcessor = new LineProcessor<List<String>>() {
    private final List<String> result = Lists.newArrayList();

    @Override
    public boolean processLine(String line) {
        if (line.matches(regex)) {
            result.add(line.trim());
        }
        return true; // continue processing
    }

    @Override
    public List<String> getResult() {
        return result;
    }
};
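To wire it up, a hedged sketch assuming a pre-Guava-15 API where InputSupplier and this readLines overload still exist; the input string is a placeholder and lineProcessor refers to the processor above:

InputSupplier<StringReader> ris = CharStreams.newReaderSupplier("one\nmatch me\ntwo");
List<String> matching = CharStreams.readLines(ris, lineProcessor);
// "matching" now holds only the lines that satisfied the regular expression;
// from here you can write them out through your OutputSupplier.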
I wrote a custom analyzer that uses ASCIIFoldingFilter in order to reduce extended Latin characters in location names to their plain Latin equivalents.
public class LocationNameAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String arg0, Reader reader) {
        //TokenStream result = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_36, reader);
        TokenStream result = new StandardFilter(tokenStream);
        result = new LowerCaseFilter(result);
        result = new ASCIIFoldingFilter(result);
        return result;
    }
}
I know it is full of deprecated stuff as it stands, but I will correct that later on. My problem right now is that when I apply this analyzer, I am able to find results using the plain Latin spelling, but not when searching for the name in its original form.
For example: "Munchen" brings me results related to Munich, but "München" does not anymore.
I assume that in my case the ASCIIFoldingFilter simply overwrites the characters in my stream, so the question is how to combine the two streams (the normal one and the folded Latin one).
You should apply your filter both when indexing and when searching; that way, the tokens used to search will be the same as the ones stored in the index.
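A minimal sketch of what that looks like with the Lucene 3.6-era API (the field name, directory, and query string are placeholders):

Analyzer analyzer = new LocationNameAnalyzer();

// Index with the folding analyzer...
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
IndexWriter writer = new IndexWriter(directory, config);

// ...and parse queries with the very same analyzer, so "München" is folded to
// "munchen" on both sides.
QueryParser parser = new QueryParser(Version.LUCENE_36, "name", analyzer);
Query query = parser.parse("München");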