Using Lucene Analyzer Without Indexing - Is My Approach Reasonable?

My objective is to leverage some of Lucene's many tokenizers and filters to transform input text, but without the creation of any indexes.
For example, given this (contrived) input string...
" Someone’s - [texté] goes here, foo . "
...and a Lucene analyzer like this...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();
I want to get the following output:
someone's texte goes here foo
The below Java method does what I want.
But is there a better (i.e. more typical and/or concise) way that I should be doing this?
I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.
Here is the code:
Lucene 8.3.0 imports:
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
My method:
private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();
    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}

I have been using this set-up for a few weeks without issue. I have not found a more concise approach. I think the code in the question is OK.
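If you want it slightly tighter, TokenStream implements Closeable, so the explicit finally block can become try-with-resources. A sketch of the same method with identical behavior:

private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();
    StringBuilder sb = new StringBuilder();
    // close() is called automatically, even if reset() or incrementToken() throws
    try (TokenStream ts = analyzer.tokenStream("myField", new StringReader(input))) {
        CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt).append(' '); // CharTermAttribute is a CharSequence
        }
        ts.end();
    }
    return sb.toString().trim();
}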

Related

Lucene how can i turn off "toLowerCase" in StandardAnalyzer?

I want to tokenize my text. I use tokenStream from StandardAnalyzer, but it applies "toLowerCase" by default.
My code:
ArrayList<String> toTextWord = new ArrayList<>();
Analyzer analyzer = new StandardAnalyzer();
try (TokenStream stream = analyzer.tokenStream("tags", new StringReader(iterStr))) {
    stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        CharTermAttribute token = stream.getAttribute(CharTermAttribute.class);
        System.out.println(token.toString());
        toTextWord.add(token.toString());
    }
} catch (Exception e) {
    e.printStackTrace();
}
How can I use StandardAnalyzer without "toLowerCase"? How can I turn off "toLowerCase" in this StandardAnalyzer?
You cannot turn off toLowerCase directly in the StandardAnalyzer.
You can create a custom analyzer which behaves the same way as the StandardAnalyzer, and then customize it to meet your needs:
Example using org.apache.lucene.analysis.custom.CustomAnalyzer:
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop")
        .build();
Now you can comment out (or remove) the lowercase token filter:
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("stop")
        .build();
Note that if you want to exactly match the default Standard Analyzer, then you should also comment out or remove the stop-word filter, since by default stopwords are not removed from the Standard Analyzer unless you provide an explicit list.
That gives us this:
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .build();
If I use the following input with my custom analyzer:
String iterStr = "Eric the quick brown fox jumps over Freddy the lazy dog, LOL.";
then the output from your code is as follows:
Eric
the
quick
brown
fox
jumps
over
Freddy
the
lazy
dog
LOL
Update
When using the CustomAnalyzer you can use string values to identify the different tokenizer and filter objects - such as "standard" and "lowercase", as used in my examples above.
If you want to avoid using these identifiers, you can use the relevant factory object with the NAME field:
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(StandardTokenizerFactory.NAME)
        .addTokenFilter(LowerCaseFilterFactory.NAME)
        .addTokenFilter(StopFilterFactory.NAME)
        .build();

How to test a Lucene Analyzer?

I'm not getting the expected results from my Analyzer and would like to test the tokenization process.
The answer to this question: How to use a Lucene Analyzer to tokenize a String?
List<String> result = new ArrayList<String>();
TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
try {
    while (stream.incrementToken()) {
        result.add(stream.getAttribute(TermAttribute.class).term());
    }
} catch (IOException e) {
    // not thrown b/c we're using a string reader...
}
return result;
It uses the TermAttribute to extract the tokens from the stream. The problem is that TermAttribute is no longer in Lucene 6.
What has it been replaced by?
What would the equivalent be with Lucene 6.6.0?
I'm pretty sure it was replaced by CharTermAttribute (javadoc).
The ticket is pretty old, but maybe the code was kept around a bit longer:
https://issues.apache.org/jira/browse/LUCENE-2372
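To make that concrete, here is a sketch of the same helper updated for Lucene 6.6.0, reusing the original snippet's analyzer, field, and keywords variables; CharTermAttribute.toString() takes the place of TermAttribute.term():

List<String> result = new ArrayList<>();
try (TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords))) {
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset(); // required before incrementToken() in modern Lucene
    while (stream.incrementToken()) {
        result.add(termAtt.toString());
    }
    stream.end();
}
return result;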

Get a tweet from html content in Java through either regex or at least without external libraries

How can I get the latest tweet from HTML content through either regex or at least without any external libraries? I am happy to use external libraries; I would just prefer not to. I just wanted to know how it would be possible. I have written the HTML download part in Java, and if anyone wants I will post it here.
So I'll do a bit of pseudo code so that I'm not only targeting Java developers. This is how my program looks so far:
1.) Load site("www.twitter.com/user123")
2.) Get initial string and write it to variable -> buffer
3.) Loop start
4.)     Append string -> buffer
5.)     If there is no more -> break
6.) Print buffer
Obviously the variable buffer will now hold raw HTML content. How can I sort through this to get the tweet? I have found a way, but it is too inconsistent. What I managed was to find the string which held the tweets and extract the content it surrounded, but too much inside that section changes, like the font size. I could write multiple if statements, but is there a neater solution?
Let me just start off by saying that jsoup is an amazing lightweight HTML parsing library. You can use things like CSS selectors and whatnot. If you ever decide to use a library jsoup will make your life a lot easier.
You can just query for the element with the class of TweetTextSize, then get the text content. This will give you all text, hashtags, and links. (The downside being pictures are also given in links)
Otherwise, you'll need to manually traverse the DOM. For example, use regex to find the beginning of the first TweetTextSize, and then just keep all text which is not between a < and a >.
Unfortunately, this second solution is volatile and may not work in the future, and you'll end up with a big glob of code which is overly complex and hard to debug.
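For illustration, a minimal jsoup sketch of the first approach. This assumes tweets are still rendered server-side with the TweetTextSize class (Twitter may well have changed its markup since), and the URL is the hypothetical one from the question:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TweetScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one call.
        Document doc = Jsoup.connect("https://www.twitter.com/user123").get();
        // CSS selector: every element carrying the TweetTextSize class.
        for (Element tweet : doc.select(".TweetTextSize")) {
            System.out.println(tweet.text()); // text content with tags stripped
        }
    }
}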
Simple answer if you want a regex and not a sophisticated third party library.
<p[^>]+js-tweet-text[^>]*>(.*)</p>
Try the above on the "view-source" of https://twitter.com/a
Thanks.
EDIT:
Source Code:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetSucker {

    public static void main(String[] args) throws Exception {
        // Download the raw HTML into a byte buffer.
        URLConnection urlConnection = new URL("https://twitter.com/a").openConnection();
        InputStream inputStream = urlConnection.getInputStream();
        String encoding = urlConnection.getContentEncoding();
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = inputStream.read(buffer)) != -1) {
            byteArrayOutputStream.write(buffer, 0, len);
        }

        // Decode using the declared charset, if the server sent one.
        String htmlContent;
        if (encoding != null) {
            htmlContent = new String(byteArrayOutputStream.toByteArray(), encoding);
        } else {
            htmlContent = new String(byteArrayOutputStream.toByteArray());
        }

        // Capture the inner text of every <p ... js-tweet-text ...> element.
        Pattern TWEET_PATTERN = Pattern.compile("(<p[^>]+js-tweet-text[^>]*>(.*)</p>)", Pattern.CASE_INSENSITIVE);
        Matcher matcher = TWEET_PATTERN.matcher(htmlContent);
        while (matcher.find()) {
            System.out.println("Tweet Found: " + matcher.group(2));
        }
    }
}
I know that you don't want any libraries, but if you want something really quick, this is working code in C#:
using (IE browser = new IE())
{
    browser.GoTo("https://twitter.com/user");
    List tweets = browser.List(Find.ById("stream-items-id"));
    if (tweets != null)
    {
        foreach (var tweet in tweets.ListItems)
        {
            var tweetText = tweet.Paras.FirstOrDefault();
            if (tweetText != null)
            {
                MessageBox.Show(tweetText.Text);
            }
        }
    }
}
This program uses a library called WatiN. If you use Visual Studio, go to the Tools menu, select "NuGet Package Manager", then "Manage NuGet Packages for Solution", then "Browse", and type "WatiN" in the search box. After you find the library, hit "Install". Once it is installed, add a reference in your code and a using statement:
using WatiN.Core;
You can just copy and paste the code above into a button handler and it will work; you only need to change the twitter.com/XXXXXX user name to list that user's tweets. Modify the code accordingly to meet your needs.

How to give umlauts more weight in lucene?

I have a custom Analyzer for names. I'd like to give similar umlaut-matches more weight. Is that possible?
@Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
    VERSION = Version.LUCENE_4_9;
    final Tokenizer source = new StandardTokenizer(VERSION, reader);
    TokenStream result = new StandardFilter(VERSION, source);
    result = new LowerCaseFilter(VERSION, result);
    result = new ASCIIFoldingFilter(result);
    return new TokenStreamComponents(source, result);
}
Example query:
input: "Zur Mühle"
output (equal scores): "Zur Linde", "Zur Muehle"
Of course I'd like to get "Zur Muehle" as the top result. But how can I tell Lucene to score umlaut matches higher?
One way to do that is use payloads to boost terms containing umlauts. Please ask for further clarification if you need more details on using payloads.
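As an illustration only (the class name is hypothetical, and the filter must sit before the ASCIIFoldingFilter in createComponents() so the umlauts are still visible), here is a minimal sketch of a TokenFilter that attaches a payload to tokens containing umlauts, which a payload-aware query or similarity could then use for boosting:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: marks each token that still contains an umlaut,
// so a payload-aware similarity/query can boost those matches at search time.
public final class UmlautPayloadFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public UmlautPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        boolean hasUmlaut = term.chars().anyMatch(c -> "äöüÄÖÜß".indexOf(c) >= 0);
        // The payload byte is read back at scoring time by a custom Similarity.
        payloadAtt.setPayload(new BytesRef(new byte[] { (byte) (hasUmlaut ? 1 : 0) }));
        return true;
    }
}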

How to get a Token from a Lucene TokenStream?

I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address my question.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29
Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
Edit: The new way
According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset();
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
This is how it should be (a clean version of Adam's answer):
TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(cattr.toString());
}
stream.end();
stream.close();
For a more recent version, Lucene 7.3.1:
// Test the tokenizer
Analyzer testAnalyzer = new CJKAnalyzer();
String testText = "Test Tokenizer";
TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

try {
    ts.reset(); // Resets this stream to the beginning. (Required)
    while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));
        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println("  token end offset: " + offsetAtt.endOffset());
    }
    ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    ts.close(); // Release resources associated with this stream.
}
Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html
There are two variations in the OP question:
What is "the process to obtain Tokens from a TokenStream"?
"Can anyone explain how to get token-like information from a TokenStream?"
Recent versions of the Lucene documentation for Token say (emphasis added):
NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.
And TokenStream says its API:
... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.
The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. Reading through the documentation, the Lucene developers suggest that this change was made, in part, to reduce the number of individual objects created at a time.
But as some people have pointed out in the comments of those answers, they don't directly answer #1: how do you get a Token if you really want/need that type?
With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute just like the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply didn't show it.
It is important to note that while this approach lets you access Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() will change the state of the Token returned from addAttribute, so if your goal is to build a collection of distinct Token objects to be used outside the loop, then you will need to do extra work to make a new Token object as a (deep?) copy.
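If the goal is simply to keep per-token values beyond the loop, one option that sidesteps Token entirely is AttributeSource.captureState(), which copies the current attribute values into a detached State object. A minimal sketch (tokenStream and the surrounding setup are assumed from the earlier answers):

List<AttributeSource.State> states = new ArrayList<>();
CharTermAttribute termAtt = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
    // Each State is an independent copy, so it survives the next incrementToken() call.
    states.add(tokenStream.captureState());
}
tokenStream.end();
tokenStream.close();

// Later: replay a saved State back into the stream's attributes to read it.
tokenStream.restoreState(states.get(0));
String firstTerm = termAtt.toString();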
