How to get a Token from a Lucene TokenStream?

I'm trying to use Apache Lucene for tokenizing, and I am baffled by the process of obtaining Tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address my question, and I still can't figure it out.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29
Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?

Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
Edit: The new way
According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}

This is how it should be (a clean version of Adam's answer):
TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(cattr.toString());
}
stream.end();
stream.close();

For a more recent version, Lucene 7.3.1:
// Test the tokenizer
Analyzer testAnalyzer = new CJKAnalyzer();
String testText = "Test Tokenizer";
TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
try {
    ts.reset(); // Resets this stream to the beginning. (Required)
    while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));
        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println(" token end offset: " + offsetAtt.endOffset());
    }
    ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    ts.close(); // Release resources associated with this stream.
}
Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html

There are two variations in the OP question:
What is "the process to obtain Tokens from a TokenStream"?
"Can anyone explain how to get token-like information from a TokenStream?"
Recent versions of the Lucene documentation for Token say (emphasis added):
NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.
And TokenStream says its API:
... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.
The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. Reading through the documentation, it appears the Lucene developers made this change, in part, to reduce the number of individual objects created at a time.
But as some people have pointed out in the comments of those answers, they don't directly answer #1: how do you get a Token if you really want/need that type?
With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute just like the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply didn't show it.
It is important to note that while this approach will let you access the Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute, so if your goal is to build a collection of distinct Token objects to be used outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy.
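For what it's worth, here is a minimal sketch of that copying step using AttributeSource.captureState(), which snapshots all of the current token's attribute implementations so they survive the next incrementToken(). The analyzer, field name, and text are placeholders; this assumes the attribute-based API shown above:
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
stream.reset();

// Snapshot each token's full attribute state so it can be used outside the loop.
List<AttributeSource.State> states = new ArrayList<AttributeSource.State>();
while (stream.incrementToken()) {
    states.add(stream.captureState());
}
stream.end();

// Later: replay a snapshot back into the stream's attributes to read it.
stream.restoreState(states.get(0));
String firstTerm = termAtt.toString();

stream.close();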

Related

Using Lucene Analyzer Without Indexing - Is My Approach Reasonable?

My objective is to leverage some of Lucene's many tokenizers and filters to transform input text, but without the creation of any indexes.
For example, given this (contrived) input string...
" Someone’s - [texté] goes here, foo . "
...and a Lucene analyzer like this...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();
I want to get the following output:
someone's texte goes here foo
The below Java method does what I want.
But is there a better (i.e. more typical and/or concise) way that I should be doing this?
I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.
Here is the code:
Lucene 8.3.0 imports:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
My method:
private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);

    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}
I have been using this set-up for a few weeks without issue. I have not found a more concise approach. I think the code in the question is OK.
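If you want it slightly more concise, one small tidy-up (just a sketch; the behaviour is unchanged) is a try-with-resources block, since TokenStream implements Closeable in Lucene 8.x:
StringBuilder sb = new StringBuilder();
try (TokenStream ts = analyzer.tokenStream("myField", new StringReader(input))) {
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        sb.append(charTermAtt).append(' '); // CharTermAttribute is a CharSequence
    }
    ts.end();
}
return sb.toString().trim();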

How to test a Lucene Analyzer?

I'm not getting the expected results from my Analyzer and would like to test the tokenization process.
The answer to this question, How to use a Lucene Analyzer to tokenize a String?, was:
List<String> result = new ArrayList<String>();
TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
try {
    while (stream.incrementToken()) {
        result.add(stream.getAttribute(TermAttribute.class).term());
    }
} catch (IOException e) {
    // not thrown b/c we're using a string reader...
}
return result;
It uses TermAttribute to extract the tokens from the stream. The problem is that TermAttribute is no longer in Lucene 6.
What has it been replaced by?
What would the equivalent be with Lucene 6.6.0?
I'm pretty sure it was replaced by CharTermAttribute (javadoc).
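Keeping the same variable names, the snippet above updated for Lucene 6.6.0 would then look roughly like this (note that reset() is mandatory before incrementToken() in modern Lucene, and end()/close() should be called when done; try-with-resources handles the close):
List<String> result = new ArrayList<>();
try (TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords))) {
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        result.add(termAtt.toString());
    }
    stream.end();
} catch (IOException e) {
    // not thrown b/c we're using a string reader...
}
return result;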
This ticket is pretty old, but it documents the replacement:
https://issues.apache.org/jira/browse/LUCENE-2372

Storing Elements From A Text Document Into A Map Java

So I have a text document called stock.txt which contains the following:
AAPL; Apple Inc.
IBM; International Business Machines Corp.
KO; The Coca-Cola Company
FB; Facebook Inc.
SBUX; Starbucks Corp.
And I want to store each element in a HashMap with the stock code as the key and the name of the company as the value. I originally tried storing all of it in an ArrayList; however, when I wanted to print out one line, for example:
AAPL;Apple Inc.
I would do:
System.out.println(array.get(0));
and it would give me the output:
AAPL;Apple
and printing array.get(1) would give me the "Inc." part.
So my overarching question is: how do I make sure that I store these things properly in a HashMap, so that I can get the whole string "Apple Inc." into one part of the Map?
Thanks!
You can try the following:
InputStream stream = new FileInputStream(new File("path"));
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
String line;
String[] tok;
Map<String, String> map = new HashMap<String, String>();
while ((line = reader.readLine()) != null) {
    tok = line.split(";");
    map.put(tok[0].trim(), tok[1].trim());
}
reader.close();
System.out.println(map);
The above code reads the file from the given path, splits each line at the ; character, and stores the code/name pairs into the map.
Hope it helps.
There is no reason why you couldn't store the information into an ArrayList. The missing data has more to do with how you are reading and processing the file than the structure in which you are storing it. Having said that, using a HashMap will allow you to store the two parts of the line separately while maintaining the link between them. The ArrayList approach does not preserve that link - it is simply an ordered list.
Here's what I would do, which is similar to Darshan's approach (you will need Java 1.7+):
public HashMap<String, String> readStockFile(Path filePath, Charset charset)
{
    HashMap<String, String> map = new HashMap<>();

    try (BufferedReader fileReader = Files.newBufferedReader(filePath, charset))
    {
        final int STOCK_CODE_GROUP = 1;
        final int STOCK_NAME_GROUP = 2;

        /*
         * Regular expression - everything up to ';' goes in the stock code group,
         * everything after it in the stock name group, ignoring any whitespace after ';'.
         */
        final Pattern STOCK_PATTERN = Pattern.compile("^([^;]+);\\s*(.+)$");

        String nextLine = null;
        Matcher stockMatcher = null;

        while ((nextLine = fileReader.readLine()) != null)
        {
            stockMatcher = STOCK_PATTERN.matcher(nextLine.trim());

            if (stockMatcher.find(0))
                if (!map.containsKey(stockMatcher.group(STOCK_CODE_GROUP)))
                    map.put(stockMatcher.group(STOCK_CODE_GROUP),
                            stockMatcher.group(STOCK_NAME_GROUP));
        }
    }
    catch (IOException ioEx)
    {
        ioEx.printStackTrace(System.err); // Do something useful.
    }

    return map;
}
If you wish to retain insertion order in the map, substitute a LinkedHashMap for the HashMap. The regular expression classes (Pattern and Matcher) belong to the java.util.regex package, while Path and Charset are in java.nio.file and java.nio.charset respectively. You'll need to use Path.getRoot(), Path.resolve(String filePath) and Charset.forName(String charset) to set up your arguments properly.
You may also want to consider what to do if you encounter a line that is not properly formatted or if a stock appears in the file twice. These will form 'else' clauses to the two 'ifs'.
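A quick usage sketch (the file name and charset here are just example values):
Map<String, String> stocks =
        readStockFile(Paths.get("stock.txt"), StandardCharsets.UTF_8);
System.out.println(stocks.get("AAPL")); // prints "Apple Inc."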

How to give umlauts more weight in lucene?

I have a custom Analyzer for names. I'd like to give similar umlaut-matches more weight. Is that possible?
#Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
    VERSION = Version.LUCENE_4_9;
    final Tokenizer source = new StandardTokenizer(VERSION, reader);
    TokenStream result = new StandardFilter(VERSION, source);
    result = new LowerCaseFilter(VERSION, result);
    result = new ASCIIFoldingFilter(result);
    return new TokenStreamComponents(source, result);
}
Example query:
input: "Zur Mühle"
output (equal scores): "Zur Linde", "Zur Muehle".
Of course I'd like to get "Zur Muehle" as the top result. But how can I tell Lucene to score umlaut matches higher?
One way to do that is use payloads to boost terms containing umlauts. Please ask for further clarification if you need more details on using payloads.
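For illustration only, here is a rough sketch of the analysis side (the filter name and the 2.0f boost factor are invented for this example; it assumes the Lucene 4.x APIs from the question). The filter attaches a payload to any term that still contains an umlaut, so it must sit before the ASCIIFoldingFilter in the chain; making the payload actually affect ranking additionally requires something like a PayloadTermQuery together with a Similarity that overrides scorePayload:
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

final class UmlautPayloadFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    UmlautPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Arbitrary example boost for terms that still contain an umlaut.
        if (termAtt.toString().matches(".*[äöüÄÖÜß].*")) {
            payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(2.0f)));
        }
        return true;
    }
}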

How to chain multiple different InputStreams into one InputStream

I'm wondering if there is any idiomatic way to chain multiple InputStreams into one continuous InputStream in Java (or Scala).
What I need it for is to parse flat files that I load over the network from an FTP server. What I want to do is to take file[1..N], open up streams, and then combine them into one stream. So when file1 comes to an end, I want to start reading from file2, and so on, until I reach the end of fileN.
I need to read these files in a specific order; the data comes from a legacy system that produces files in batches, so data in one file depends on data in another file, but I would like to handle them as one continuous stream to simplify my domain logic interface.
I searched around and found PipedInputStream, but I'm not positive that is what I need. An example would be helpful.
It's right there in the JDK! Quoting the JavaDoc of SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
You want to concatenate an arbitrary number of InputStreams, while SequenceInputStream accepts only two. But since SequenceInputStream is also an InputStream, you can apply it recursively (nest them):
new SequenceInputStream(
    new SequenceInputStream(
        new SequenceInputStream(file1, file2),
        file3
    ),
    file4
);
...you get the idea.
See also
How do you merge two input streams in Java? (dup?)
This is done using SequenceInputStream, which is straightforward in Java, as Tomasz Nurkiewicz's answer shows. I had to do this repeatedly in a project recently, so I added some Scala-y goodness via the "pimp my library" pattern.
object StreamUtils {
  implicit def toRichInputStream(str: InputStream) = new RichInputStream(str)

  class RichInputStream(str: InputStream) {
    // a bunch of other handy Stream functionality, deleted

    def ++(str2: InputStream): InputStream = new SequenceInputStream(str, str2)
  }
}
With that, I can do stream sequencing as follows
val mergedStream = stream1 ++ stream2 ++ stream3
or even
val streamList = //some arbitrary-length list of streams, non-empty
val mergedStream = streamList.reduceLeft(_++_)
Another solution: first create a list of input streams, and then create the sequence of input streams:
List<InputStream> iss = Files.list(Paths.get("/your/path"))
        .filter(Files::isRegularFile)
        .map(f -> {
            try {
                return new FileInputStream(f.toString());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        })
        .collect(Collectors.toList());

InputStream sequence = new SequenceInputStream(Collections.enumeration(iss));
Here is another solution using a Vector. This example is Android-specific (it reads from AssetManager), but the Vector/Enumeration approach works in any Java:
AssetManager am = getAssets();
Vector<InputStream> v = new Vector<>(Constant.PAGES);
for (int i = 0; i < Constant.PAGES; i++) {
    String fileName = "file" + i + ".txt";
    InputStream is = am.open(fileName);
    v.add(is);
}
Enumeration<InputStream> e = v.elements();
SequenceInputStream sis = new SequenceInputStream(e);
InputStreamReader isr = new InputStreamReader(sis);
Scanner scanner = new Scanner(isr); // or use a BufferedReader
Here's a simple Scala version that concatenates an Iterator[InputStream]:
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
def concatInputStreams(streams: Iterator[InputStream]): InputStream =
  new SequenceInputStream(streams.asJavaEnumeration)
