(More specific problem details are in the update below.) I have really long document field values. The tokens of these fields are of the form word|payload|position_increment. (I need to control position increments and payloads manually.)
I collect these compound tokens for the entire document, join them with a '\t', and pass this string to my custom analyzer.
(For really long field strings, something breaks in UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
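For illustration, roughly how such a field value is assembled (the sample compound tokens are taken from the index dump shown further below; compoundTokens is just an illustrative name):
// Each compound token is word|payload|position_increment; the field value joins them with '\t'.
List<String> compoundTokens = Arrays.asList("S|3|1000", "V|1|1", "PR|1|1");
String fieldValue = String.join("\t", compoundTokens); // "S|3|1000\tV|1|1\tPR|1|1"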
The analyzer is just the following:
class AmbiguousTokenAnalyzer extends Analyzer {

    private PayloadEncoder encoder = new IntegerEncoder();

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
        TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
        sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
        sink.addAttribute(OffsetAttribute.class);
        sink.addAttribute(CharTermAttribute.class);
        sink.addAttribute(PayloadAttribute.class);
        sink.addAttribute(PositionIncrementAttribute.class);
        return new TokenStreamComponents(source, sink);
    }
}
CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter each have an incrementToken() method in which the rightmost "|aaa" part of a token is processed.
The field is configured as:
attributeFieldType.setIndexed(true);
attributeFieldType.setStored(true);
attributeFieldType.setOmitNorms(true);
attributeFieldType.setTokenized(true);
attributeFieldType.setStoreTermVectorOffsets(true);
attributeFieldType.setStoreTermVectorPositions(true);
attributeFieldType.setStoreTermVectors(true);
attributeFieldType.setStoreTermVectorPayloads(true);
The problem is: if I pass the field to the analyzer as one huge string (via document.add(...)), it works fine, but if I pass it token by token, something breaks at the search stage.
As I read somewhere, these two ways should produce the same index. Maybe my analyzer is missing something?
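For reference, a simplified sketch of the two variants I am comparing (attributeFieldType is configured as above; doc and compoundTokens are illustrative names, and the field name is the one from the index dump below):
// Variant 1: the whole field value as one huge string (works)
doc.add(new Field("gramm", String.join("\t", compoundTokens), attributeFieldType));

// Variant 2: one Field instance per compound token (breaks at the search stage)
for (String token : compoundTokens) {
    doc.add(new Field("gramm", token, attributeFieldType));
}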
UPDATE
Here is my problem in more detail: in addition to being indexed, the multi-valued field needs to be stored as-is. And if I pass it to the analyzer as multiple atomic tokens, only the first of them gets stored.
What do I need to do in my custom analyzer so that eventually all the atomic tokens are stored?
Well, it turns out that all the values are actually stored.
Here is what I get after indexing:
indexSearcher.doc(0).getFields("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:V|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:PR|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|0|1000 S|1|0>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|1|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:ADV|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
And the "single-field" version
indexSearcher.doc(0).getField("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
I don't know why getField() returns only the first value, but it seems that for my needs getFields() is OK.
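For completeness, a minimal sketch of reading the stored values back (field name and searcher as above):
// getField() returns only the first stored value of the multi-valued field ...
IndexableField first = indexSearcher.doc(0).getField("gramm");

// ... while getFields() returns all of them.
for (IndexableField f : indexSearcher.doc(0).getFields("gramm")) {
    System.out.println(f.stringValue());
}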
I have a countries map with the following contents:
England=24
Spain=21
Italy=10
etc
Then, I have a different cities map with the following contents:
London=10
Manchester=5
Madrid=7
Barcelona=4
Roma=3
etc
Currently, I am printing these results on screen:
System.out.println("\nCountries:");
Map<String, Long> countryMap = countTotalResults(orderDataList, OrderData::getCountry);
writeResultInCsv(countryMap);
countryMap.entrySet().stream().forEach(System.out::println);
System.out.println("\nCities:\n");
Map<String, Long> citiesMap = countTotalResults(orderDataList, OrderData::getCity);
writeResultInCsv(citiesMap);
citiesMap.entrySet().stream().forEach(System.out::println);
I want to write each line of my 2 maps in the same CSV file. I have the following code:
public void writeResultInCsv(Map<String, Long> resultMap) throws Exception {
File csvOutputFile = new File(RUTA_FICHERO_RESULTADO);
try (PrintWriter pw = new PrintWriter(csvOutputFile)) {
resultMap.entrySet().stream()
.map(this::convertToCSV)
.forEach(pw::println);
}
}
public String convertToCSV(String[] data) {
return Stream.of(data)
.map(this::escapeSpecialCharacters)
.collect(Collectors.joining("="));
}
public String escapeSpecialCharacters(String data) {
String escapedData = data.replaceAll("\\R", " ");
if (data.contains(",") || data.contains("\"") || data.contains("'")) {
data = data.replace("\"", "\"\"");
escapedData = "\"" + data + "\"";
}
return escapedData;
}
But I get a compilation error in the writeResultInCsv method, on the following line:
.map(this::convertToCSV)
This is the compilation error I get:
reason: Incompatible types: Entry is not convertible to String[]
How can I write this result to a CSV file in Java 8 in a simplified way?
This is the result and design that I want my CSV file to have:
Countries:
England=24
Spain=21
Italy=10
etc
Cities:
London=10
Manchester=5
Madrid=7
Barcelona=4
Roma=3
etc
Your resultMap.entrySet() is a Set<Map.Entry<String, Long>>. You then turn that into a Stream<Map.Entry<String, Long>> and call .map on it. Thus, the mapper you provide there needs to map objects of type Map.Entry<String, Long> to whatever you like, but you pass it the convertToCSV method, which maps string arrays.
Your code tries to join on comma (Collectors.joining(",")), but your desired output contains zero commas.
It feels like one of two things is going on:
You copy/pasted this code from somewhere, or it was provided to you, and you have no idea what any of it does. I would advise tearing this code into pieces: take each individual piece, experiment with it until you understand it, then put it back together, and now you know what you're looking at. At that point you would know that having Collectors.joining(",") in this makes no sense whatsoever, and that you're trying to map an entry of String, Long using a mapping function that maps string arrays, which obviously doesn't work.
You already know all this but haven't bothered to actually look at your code. That seems a bit surprising, so I don't think this is it. But if it is: the code you have is so unrelated to the job you want to do that you might as well remove it entirely and turn this question into: "I have this. I want this. How do I do it?"
NB: A text file listing key=value pairs is not usually called a CSV file.
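For what it's worth, a minimal sketch of one way to get the desired key=value layout, writing both maps into the same file one after the other (the file name, class name, and section titles here are illustrative):
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultWriter {

    // Writes a section title followed by one "key=value" line per entry.
    static void writeSection(PrintWriter pw, String title, Map<String, Long> map) {
        pw.println(title + ":");
        map.forEach((key, value) -> pw.println(key + "=" + value));
        pw.println();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> countries = new LinkedHashMap<>();
        countries.put("England", 24L);
        countries.put("Spain", 21L);

        Map<String, Long> cities = new LinkedHashMap<>();
        cities.put("London", 10L);
        cities.put("Manchester", 5L);

        // Open the file once and write both sections into it;
        // opening it twice with new PrintWriter(...) would truncate the first section.
        try (PrintWriter pw = new PrintWriter("result.txt")) {
            writeSection(pw, "Countries", countries);
            writeSection(pw, "Cities", cities);
        }
    }
}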
I am currently in the process of upgrading a search engine application from Lucene 3.5.0 to version 4.10.3. There have been some substantial API changes in version 4 that break backward compatibility. I have managed to fix most of them, but a few issues remain that I could use some help with:
"cannot override final method from Analyzer"
The original code extended the Analyzer class and overrode tokenStream(...).
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    CharStream charStream = CharReader.get(reader);
    return
        new LowerCaseFilter(version,
            new SeparationFilter(version,
                new WhitespaceTokenizer(version,
                    new HTMLStripFilter(charStream))));
}
But this method is final now and I am not sure how to understand the following note from the change log:
ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer implementations must now use Analyzer.TokenStreamComponents, rather than overriding .tokenStream() and .reusableTokenStream() (which are now final).
There is another problem in the method quoted above:
"The method get(Reader) is undefined for the type CharReader"
There seem to have been some considerable changes here, too.
"TermPositionVector cannot be resolved to a type"
This class is gone now in Lucene 4. Are there any simple fixes for this? From the change log:
The term vectors APIs (TermFreqVector, TermPositionVector, TermVectorMapper) have been removed in favor of the above flexible indexing APIs, presenting a single-document inverted index of the document from the term vectors.
Probably related to this:
"The method getTermFreqVector(int, String) is undefined for the type IndexReader."
Both problems occur here, for instance:
TermPositionVector termVector = (TermPositionVector) reader.getTermFreqVector(...);
("reader" is of Type IndexReader)
I would appreciate any help with these issues.
I found core developer Uwe Schindler's response to your question on the Lucene mailing list. It took me some time to wrap my head around the new API, so I need to write down something before I forget.
These notes apply to Lucene 4.10.3.
Implementing an Analyzer (1-2)
new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(new HTMLStripCharFilter(reader));
        TokenStream sink = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, sink);
    }
};
The constructor of TokenStreamComponents takes a source and a sink. The sink is the end result of your token stream, returned by Analyzer.tokenStream(), so set it to your filter chain. The source is the token stream before you apply any filters.
HTMLStripCharFilter, despite its name, is actually a subclass of java.io.Reader which removes HTML constructs, so you no longer need CharReader.
Term vector replacements (3-4)
Term vectors work differently in Lucene 4, so there are no straightforward method swaps. The specific answer depends on what your requirements are.
If you want positional information, you have to index your fields with positional information in the first place:
Document doc = new Document();
FieldType f = new FieldType();
f.setIndexed(true);
f.setStoreTermVectors(true);
f.setStoreTermVectorPositions(true);
doc.add(new Field("text", "hello", f));
Finally, in order to get at the frequency and positional info of a field of a document, you drill down the new API like this (adapted from this answer):
// IndexReader ir;
// int docID = 0;
Terms terms = ir.getTermVector(docID, "text");
terms.hasPositions(); // should be true if you set the field to store positions
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
// Explore the terms for this field
while ((term = termsEnum.next()) != null) {
    // Enumerate through documents, in this case only one
    DocsAndPositionsEnum docsEnum = termsEnum.docsAndPositions(null, null);
    int docIdEnum;
    while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < docsEnum.freq(); i++) {
            System.out.println(term.utf8ToString() + " " + docIdEnum + " "
                + docsEnum.nextPosition());
        }
    }
}
It'd be nice if Terms.iterator() returned an actual Iterable.
I have a custom item reader that transforms lines from a text file into my entity:
public class EntityItemReader extends AbstractItemStreamItemReader<MyEntity> {

    @Override
    public MyEntity read() {
        String line = delegate.read(); // the delegate (similar to a FlatFileItemReader) supplies the raw lines
        // analyze line and skip by condition
        // line.split(...)
        // create entity with line values
    }
}
This is similar to the FlatFileItemReader.
The MyEntity that is read will then be persisted to a DB by a JDBC item writer.
Problem: sometimes I have lines that contain values that should be skipped.
BUT when I just return null from the read() method of the reader, not only is this item skipped, the reading is terminated completely and all further lines are skipped, because a null element is the "signal" to all Spring readers that the input to be read is exhausted.
So: what can I do to skip specific lines by condition inside the reader if I cannot return null? Because by nature of the reader I'm forced to return an object here.
I think the good practice for filtering some lines is to use not the reader but a processor (in which you can return null when you want to filter a line out).
Please see http://docs.spring.io/spring-batch/trunk/reference/html/readersAndWriters.html :
6.3.2 Filtering Records
One typical use for an item processor is to filter out records before they are passed to the ItemWriter. Filtering is an action distinct from skipping; skipping indicates that a record is invalid whereas filtering simply indicates that a record should not be written.
For example, consider a batch job that reads a file containing three different types of records: records to insert, records to update, and records to delete. If record deletion is not supported by the system, then we would not want to send any "delete" records to the ItemWriter. But, since these records are not actually bad records, we would want to filter them out, rather than skip. As a result, the ItemWriter would receive only "insert" and "update" records.
To filter a record, one simply returns "null" from the ItemProcessor. The framework will detect that the result is "null" and avoid adding that item to the list of records delivered to the ItemWriter. As usual, an exception thrown from the ItemProcessor will result in a skip.
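A minimal sketch of such a filtering processor, with a hypothetical shouldBeSkipped() check standing in for the actual condition:
import org.springframework.batch.item.ItemProcessor;

public class SkippingLineProcessor implements ItemProcessor<MyEntity, MyEntity> {

    @Override
    public MyEntity process(MyEntity item) {
        // Returning null filters the item: it is dropped silently and never reaches the ItemWriter.
        if (item.shouldBeSkipped()) { // hypothetical condition
            return null;
        }
        return item;
    }
}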
I've had a similar problem in a more general case: I'm using a custom reader backed by an iterator over one object type, and it returns a new item (of a different type) for each object read. The problem is that some of those objects don't map to anything, so I'd like to return something that marks that.
Eventually I decided to define an INVALID_ITEM marker and return that. Another approach is to advance the iterator inside the read() method until the next valid item, returning null if hasNext() becomes false, but that is more cumbersome; a sketch of it follows.
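A rough sketch of that loop, with MyItem, SourceObject, sourceIterator, mapsToItem() and mapToItem() standing in for whatever types and mapping logic apply:
@Override
public MyItem read() {
    // Consume source objects until one of them maps to an item;
    // returning null once the iterator is exhausted ends the reading as usual.
    while (sourceIterator.hasNext()) {
        SourceObject next = sourceIterator.next();
        if (mapsToItem(next)) {
            return mapToItem(next);
        }
    }
    return null;
}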
Initially I also tried to throw a custom exception and tell Spring to skip the item on it, but that seemed to be ignored, so I gave up (and it isn't performant anyway if there are too many invalid objects).
I do not think you can have your cake and eat it too in this case (and after reading all the comments).
My best suggestion would be (as already noted) to throw a custom exception and skip 'on it'.
You can maybe optimize your entity creation or processes elsewhere so you don't lose so much performance.
Good luck.
We can handle it via a custom Dummy Object.
public class MyClass {

    private static MyClass DUMMYMyClassObject;

    private MyClass() {
        // create a blank object
    }

    public static MyClass getDummyMyClassObject() {
        if (DUMMYMyClassObject == null) {
            DUMMYMyClassObject = new MyClass();
        }
        return DUMMYMyClassObject;
    }
}
And just return it from the reader when the record should be skipped:
return MyClass.getDummyMyClassObject();
The dummy can then be ignored in the processor, by checking whether the object is the blank instance (or by whatever logic fits the private default constructor).
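A sketch of that processor-side check, filtering the shared dummy instance by identity:
import org.springframework.batch.item.ItemProcessor;

public class DummyFilteringProcessor implements ItemProcessor<MyClass, MyClass> {

    @Override
    public MyClass process(MyClass item) {
        // The shared dummy marks a skipped record; returning null filters it out before the writer.
        if (item == MyClass.getDummyMyClassObject()) {
            return null;
        }
        return item;
    }
}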
To skip lines, you can throw an exception whenever you want a line skipped, like below.
My Spring batch Step
@Bean
Step processStep() {
    return stepBuilderFactory.get("job step")
            .<String, String>chunk(1000)
            .reader(ItemReader)
            .writer(DataWriter)
            .faultTolerant()              // allowing Spring Batch to skip lines
            .skipLimit(1000)              // skip line limit
            .skip(CustomException.class)  // skip lines when this exception is thrown
            .build();
}
My Item reader
@Bean(name = "reader")
public FlatFileItemReader<String> fileItemReader() throws Exception {
    FlatFileItemReader<String> reader = new FlatFileItemReader<String>();
    reader.setResource(resourceLoader.getResource("c://file_location/file.txt"));
    CustomLineMapper lineMapper = new CustomLineMapper();
    reader.setLineMapper(lineMapper);
    return reader;
}
My custom line mapper
public class CustomLineMapper implements LineMapper<String> {

    @Override
    public String mapLine(String s, int i) throws Exception {
        if (condition) { // put your condition here when you want to skip lines
            throw new CustomException();
        }
        return s;
    }
}
I wrote a custom analyzer that uses ASCIIFoldingFilter in order to reduce extended Latin characters in location names to plain Latin.
public class LocationNameAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String arg0, Reader reader) {
        //TokenStream result = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_36, reader);
        TokenStream result = new StandardFilter(tokenStream);
        result = new LowerCaseFilter(result);
        result = new ASCIIFoldingFilter(result);
        return result;
    }
}
I know it is full of deprecated stuff as it stands, but I will correct that later on. My problem right now is that when I apply this analyzer, I can find results using plain Latin spellings, but not when searching for the original name.
For example: "Munchen" brings me results related to Munich, but "München" does not anymore.
I assume that in my case the ASCIIFoldingFilter simply overwrites the characters in my stream, so the question is how to add the two streams together (the normal one and the folded Latin one).
You should apply the same analysis both at index time and at search time; that way the tokens used to search will be the same as the ones stored in the index.
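A sketch of what that means with the 3.6-style API from the question: hand the same analyzer to both the IndexWriterConfig and the QueryParser (directory and the field name are placeholders):
Analyzer analyzer = new LocationNameAnalyzer();

// Index time: "München" is folded to "munchen" before it is written to the index.
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
IndexWriter writer = new IndexWriter(directory, config);

// Query time: the same analyzer folds the query text the same way,
// so searches for "München" and "Munchen" both look up "munchen".
QueryParser parser = new QueryParser(Version.LUCENE_36, "name", analyzer);
Query query = parser.parse("München");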
My data is already tokenized by an external resource and I'd like to use that data within Lucene. My first idea would be to join those strings with a \x01 and use a WhitespaceTokenizer to split them again. Is there a better idea? (The input is in XML.)
As a bonus, this annotated data also contains synonyms; how would I inject them (they are represented as XML tags)?
Lucene allows you to provide your own stream of tokens to the field, bypassing the tokenization step. To do that you can create your own subclass of TokenStream implementing incrementToken() and then call field.setTokenStream(new MyTokenStream(yourTokens)):
public class MyTokenStream extends TokenStream {

    CharTermAttribute charTermAtt;
    OffsetAttribute offsetAtt;
    final Iterator<MyToken> listOfTokens;

    MyTokenStream(Iterator<MyToken> tokenList) {
        listOfTokens = tokenList;
        charTermAtt = addAttribute(CharTermAttribute.class);
        offsetAtt = addAttribute(OffsetAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (listOfTokens.hasNext()) {
            super.clearAttributes();
            MyToken myToken = listOfTokens.next();
            charTermAtt.setLength(0);
            charTermAtt.append(myToken.getText());
            offsetAtt.setOffset(myToken.begin(), myToken.end());
            return true;
        }
        return false;
    }
}
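A hedged usage sketch, assuming the Lucene 4.x Field/FieldType API; the field name, FieldType settings, and the myTokens collection are illustrative:
FieldType type = new FieldType();
type.setIndexed(true);
type.setTokenized(true);

// An empty initial value; the pre-tokenized stream supplies the actual tokens.
Field field = new Field("text", "", type);
field.setTokenStream(new MyTokenStream(myTokens.iterator()));

Document doc = new Document();
doc.add(field);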
WhitespaceTokenizer is unfit for strings joined with 0x01. Instead, derive from CharTokenizer, overriding isTokenChar.
The main problem with this approach is that joining and then splitting again might be expensive; if it turns out to be too expensive, you can implement a trivial TokenStream that just emits the tokens from its input.
If by synonyms you mean that a term like "programmer" is expanded to a set of terms, say, {"programmer", "developer", "hacker"}, then I recommend emitting these at the same position. You can use a PositionIncrementAttribute to control this.
For an example of PositionIncrementAttribute usage, see my lemmatizing TokenStream which emits both word forms found in full text and their lemmas at the same position.
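As a rough sketch of the position-increment trick (not the linked code): each group below is a term followed by its synonyms, and the synonyms get a position increment of 0 so they land at the same position as the original term. The class and variable names are illustrative.
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class SynonymInjectingStream extends TokenStream {

    private final Iterator<List<String>> groups; // each group: a term followed by its synonyms
    private Iterator<String> current = Collections.<String>emptyIterator();
    private boolean firstInGroup = true;

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

    SynonymInjectingStream(Iterator<List<String>> groups) {
        this.groups = groups;
    }

    @Override
    public boolean incrementToken() {
        if (!current.hasNext()) {
            if (!groups.hasNext()) {
                return false;
            }
            current = groups.next().iterator();
            firstInGroup = true;
        }
        clearAttributes();
        termAtt.setEmpty().append(current.next());
        // 1 advances to a new position for the original term; 0 stacks a synonym on top of it.
        posIncrAtt.setPositionIncrement(firstInGroup ? 1 : 0);
        firstInGroup = false;
        return true;
    }
}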