How to do query auto-completion/suggestions in Lucene? (Java)

I'm looking for a way to do query auto-completion/suggestions in Lucene. I've Googled around and played with it a bit, but all of the examples I've seen deal with setting up filters in Solr. We don't use Solr and aren't planning to move to it in the near future, and Solr is obviously just wrapping Lucene anyway, so I imagine there must be a way to do it!
I've looked into using EdgeNGramFilter, and I realise that I'd have to run the filter on the index fields, get the tokens out, and then compare them against the input query... I'm just struggling to connect the two into a bit of code, so help is much appreciated!
To be clear on what I'm looking for (I realised I wasn't being overly clear, sorry): I'm looking for a solution where, when searching for a term, it returns a list of suggested queries. When typing 'inter' into the search field, it should come back with a list of suggestions such as 'internet', 'international', etc.

Based on @Alexandre Victoor's answer, I wrote a little class based on the Lucene SpellChecker in the contrib package (and using the LuceneDictionary included in it) that does exactly what I want.
This allows re-indexing from a single source index with a single field, and provides suggestions for terms. Results are sorted by the number of matching documents with that term in the original index, so more popular terms appear first. Seems to work pretty well :)
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.Side;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
/**
* Search term auto-completer, works for single terms (so use on the last term
* of the query).
* <p>
* Returns more popular terms first.
*
* @author Mat Mannion, M.Mannion@warwick.ac.uk
*/
public final class Autocompleter {
private static final String GRAMMED_WORDS_FIELD = "words";
private static final String SOURCE_WORD_FIELD = "sourceWord";
private static final String COUNT_FIELD = "count";
private static final String[] ENGLISH_STOP_WORDS = {
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "i", "if", "in", "into", "is",
"no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
};
private final Directory autoCompleteDirectory;
private IndexReader autoCompleteReader;
private IndexSearcher autoCompleteSearcher;
public Autocompleter(String autoCompleteDir) throws IOException {
this.autoCompleteDirectory = FSDirectory.getDirectory(autoCompleteDir,
null);
reOpenReader();
}
public List<String> suggestTermsFor(String term) throws IOException {
// get the top 5 terms for query
Query query = new TermQuery(new Term(GRAMMED_WORDS_FIELD, term));
Sort sort = new Sort(COUNT_FIELD, true);
TopDocs docs = autoCompleteSearcher.search(query, null, 5, sort);
List<String> suggestions = new ArrayList<String>();
for (ScoreDoc doc : docs.scoreDocs) {
suggestions.add(autoCompleteReader.document(doc.doc).get(
SOURCE_WORD_FIELD));
}
return suggestions;
}
@SuppressWarnings("unchecked")
public void reIndex(Directory sourceDirectory, String fieldToAutocomplete)
throws CorruptIndexException, IOException {
// build a dictionary (from the spell package)
IndexReader sourceReader = IndexReader.open(sourceDirectory);
LuceneDictionary dict = new LuceneDictionary(sourceReader,
fieldToAutocomplete);
// code from
// org.apache.lucene.search.spell.SpellChecker.indexDictionary(
// Dictionary)
IndexReader.unlock(autoCompleteDirectory);
// use a custom analyzer so we can do EdgeNGramFiltering
IndexWriter writer = new IndexWriter(autoCompleteDirectory,
new Analyzer() {
public TokenStream tokenStream(String fieldName,
Reader reader) {
TokenStream result = new StandardTokenizer(reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new ISOLatin1AccentFilter(result);
result = new StopFilter(result,
ENGLISH_STOP_WORDS);
result = new EdgeNGramTokenFilter(
result, Side.FRONT, 1, 20);
return result;
}
}, true);
writer.setMergeFactor(300);
writer.setMaxBufferedDocs(150);
// go through every word, storing the original word (incl. n-grams)
// and the number of times it occurs
Map<String, Integer> wordsMap = new HashMap<String, Integer>();
Iterator<String> iter = (Iterator<String>) dict.getWordsIterator();
while (iter.hasNext()) {
String word = iter.next();
int len = word.length();
if (len < 3) {
continue; // too short: we bail, but "too long" is fine...
}
if (wordsMap.containsKey(word)) {
throw new IllegalStateException(
"This should never happen in Lucene 2.3.2");
// wordsMap.put(word, wordsMap.get(word) + 1);
} else {
// use the number of documents this word appears in
wordsMap.put(word, sourceReader.docFreq(new Term(
fieldToAutocomplete, word)));
}
}
for (String word : wordsMap.keySet()) {
// ok index the word
Document doc = new Document();
doc.add(new Field(SOURCE_WORD_FIELD, word, Field.Store.YES,
Field.Index.UN_TOKENIZED)); // orig term
doc.add(new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES,
Field.Index.TOKENIZED)); // grammed
doc.add(new Field(COUNT_FIELD,
Integer.toString(wordsMap.get(word)), Field.Store.NO,
Field.Index.UN_TOKENIZED)); // count
writer.addDocument(doc);
}
sourceReader.close();
// close writer
writer.optimize();
writer.close();
// re-open our reader
reOpenReader();
}
private void reOpenReader() throws CorruptIndexException, IOException {
if (autoCompleteReader == null) {
autoCompleteReader = IndexReader.open(autoCompleteDirectory);
} else {
autoCompleteReader.reopen();
}
autoCompleteSearcher = new IndexSearcher(autoCompleteReader);
}
public static void main(String[] args) throws Exception {
Autocompleter autocomplete = new Autocompleter("/index/autocomplete");
// run this to re-index from the current index, shouldn't need to do
// this very often
// autocomplete.reIndex(FSDirectory.getDirectory("/index/live", null),
// "content");
String term = "steve";
System.out.println(autocomplete.suggestTermsFor(term));
// prints [steve, steven, stevens, stevenson, stevenage]
}
}

Here's a transliteration of Mat's implementation into C# for Lucene.NET, along with a snippet for wiring a text box using jQuery's autocomplete feature.
<input id="search-input" name="query" placeholder="Search database." type="text" />
... jQuery autocomplete:
// don't navigate away from the field when pressing tab on a selected item
$( "#search-input" ).keydown(function (event) {
if (event.keyCode === $.ui.keyCode.TAB && $(this).data("autocomplete").menu.active) {
event.preventDefault();
}
});
$( "#search-input" ).autocomplete({
source: '@Url.Action("SuggestTerms")', // <-- ASP.NET MVC Razor syntax
minLength: 2,
delay: 500,
focus: function () {
// prevent value inserted on focus
return false;
},
select: function (event, ui) {
var terms = this.value.split(/\s+/);
terms.pop(); // remove dropdown item
terms.push(ui.item.value.trim()); // add completed item
this.value = terms.join(" ");
return false;
},
});
... here's the ASP.NET MVC Controller code:
//
// GET: /MyApp/SuggestTerms?term=something
public JsonResult SuggestTerms(string term)
{
if (string.IsNullOrWhiteSpace(term))
return Json(new string[] {});
term = term.Split().Last();
// Fetch suggestions
string[] suggestions = SearchSvc.SuggestTermsFor(term).ToArray();
return Json(suggestions, JsonRequestBehavior.AllowGet);
}
... and here's Mat's code in C#:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Store;
using Lucene.Net.Index;
using Lucene.Net.Search;
using SpellChecker.Net.Search.Spell;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Documents;
namespace Cipher.Services
{
/// <summary>
/// Search term auto-completer, works for single terms (so use on the last term of the query).
/// Returns more popular terms first.
/// <br/>
/// Author: Mat Mannion, M.Mannion@warwick.ac.uk
/// <seealso cref="http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene"/>
/// </summary>
///
public class SearchAutoComplete {
public int MaxResults { get; set; }
private class AutoCompleteAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
TokenStream result = new StandardTokenizer(kLuceneVersion, reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new ASCIIFoldingFilter(result);
result = new StopFilter(false, result, StopFilter.MakeStopSet(kEnglishStopWords));
result = new EdgeNGramTokenFilter(
result, Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.DEFAULT_SIDE, 1, 20);
return result;
}
}
private static readonly Lucene.Net.Util.Version kLuceneVersion = Lucene.Net.Util.Version.LUCENE_29;
private static readonly String kGrammedWordsField = "words";
private static readonly String kSourceWordField = "sourceWord";
private static readonly String kCountField = "count";
private static readonly String[] kEnglishStopWords = {
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "i", "if", "in", "into", "is",
"no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
};
private readonly Directory m_directory;
private IndexReader m_reader;
private IndexSearcher m_searcher;
public SearchAutoComplete(string autoCompleteDir) :
this(FSDirectory.Open(new System.IO.DirectoryInfo(autoCompleteDir)))
{
}
public SearchAutoComplete(Directory autoCompleteDir, int maxResults = 8)
{
this.m_directory = autoCompleteDir;
MaxResults = maxResults;
ReplaceSearcher();
}
/// <summary>
/// Find terms matching the given partial word that appear in the highest number of documents.</summary>
/// <param name="term">A word or part of a word</param>
/// <returns>A list of suggested completions</returns>
public IEnumerable<String> SuggestTermsFor(string term)
{
if (m_searcher == null)
return new string[] { };
// get the top terms for query
Query query = new TermQuery(new Term(kGrammedWordsField, term.ToLower()));
Sort sort = new Sort(new SortField(kCountField, SortField.INT));
TopDocs docs = m_searcher.Search(query, null, MaxResults, sort);
string[] suggestions = docs.ScoreDocs.Select(doc =>
m_reader.Document(doc.Doc).Get(kSourceWordField)).ToArray();
return suggestions;
}
/// <summary>
/// Open the index in the given directory and create a new index of word frequency for the
/// given index.</summary>
/// <param name="sourceDirectory">Directory containing the index to count words in.</param>
/// <param name="fieldToAutocomplete">The field in the index that should be analyzed.</param>
public void BuildAutoCompleteIndex(Directory sourceDirectory, String fieldToAutocomplete)
{
// build a dictionary (from the spell package)
using (IndexReader sourceReader = IndexReader.Open(sourceDirectory, true))
{
LuceneDictionary dict = new LuceneDictionary(sourceReader, fieldToAutocomplete);
// code from
// org.apache.lucene.search.spell.SpellChecker.indexDictionary(
// Dictionary)
//IndexWriter.Unlock(m_directory);
// use a custom analyzer so we can do EdgeNGramFiltering
var analyzer = new AutoCompleteAnalyzer();
using (var writer = new IndexWriter(m_directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED))
{
writer.MergeFactor = 300;
writer.SetMaxBufferedDocs(150);
// go through every word, storing the original word (incl. n-grams)
// and the number of times it occurs
foreach (string word in dict)
{
if (word.Length < 3)
continue; // too short: we bail, but "too long" is fine...
// ok index the word
// use the number of documents this word appears in
int freq = sourceReader.DocFreq(new Term(fieldToAutocomplete, word));
var doc = MakeDocument(fieldToAutocomplete, word, freq);
writer.AddDocument(doc);
}
writer.Optimize();
}
}
// re-open our reader
ReplaceSearcher();
}
private static Document MakeDocument(String fieldToAutocomplete, string word, int frequency)
{
var doc = new Document();
doc.Add(new Field(kSourceWordField, word, Field.Store.YES,
Field.Index.NOT_ANALYZED)); // orig term
doc.Add(new Field(kGrammedWordsField, word, Field.Store.YES,
Field.Index.ANALYZED)); // grammed
doc.Add(new Field(kCountField,
frequency.ToString(), Field.Store.NO,
Field.Index.NOT_ANALYZED)); // count
return doc;
}
private void ReplaceSearcher()
{
if (IndexReader.IndexExists(m_directory))
{
if (m_reader == null)
m_reader = IndexReader.Open(m_directory, true);
else
m_reader.Reopen();
m_searcher = new IndexSearcher(m_reader);
}
else
{
m_searcher = null;
}
}
}
}

My code is based on Lucene 4.2; it may help you.
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.spell.Dictionary;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea4pinyin.analyzer.lucene.IKAnalyzer4PinYin;
/**
*
*
* @author
* @version 2013-11-25 11:13:59 AM
*/
public class LuceneSpellCheckerDemoService {
private static final String INDEX_FILE = "/Users/r/Documents/jar/luke/youtui/index";
private static final String INDEX_FILE_SPELL = "/Users/r/Documents/jar/luke/spell";
private static final String INDEX_FIELD = "app_name_quanpin";
public static void main(String args[]) {
try {
//
PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new IKAnalyzer4PinYin(
true));
// read index conf
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_42, wrapper);
conf.setOpenMode(OpenMode.CREATE_OR_APPEND);
// read dictionary
Directory directory = FSDirectory.open(new File(INDEX_FILE));
RAMDirectory ramDir = new RAMDirectory(directory, IOContext.READ);
DirectoryReader indexReader = DirectoryReader.open(ramDir);
Dictionary dic = new LuceneDictionary(indexReader, INDEX_FIELD);
SpellChecker sc = new SpellChecker(FSDirectory.open(new File(INDEX_FILE_SPELL)));
//sc.indexDictionary(new PlainTextDictionary(new File("myfile.txt")), conf, false);
sc.indexDictionary(dic, conf, true);
String[] strs = sc.suggestSimilar("zhsiwusdazhanjiangshi", 10);
for (int i = 0; i < strs.length; i++) {
System.out.println(strs[i]);
}
sc.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

You can use the class PrefixQuery on a "dictionary" index. The class LuceneDictionary could be helpful too.
Take a look at the article linked below. It explains how to implement the "Did you mean?" feature available in modern search engines such as Google. You may not need something as complex as described in the article; however, it explains how to use the Lucene spell package.
One way to build a "dictionary" index would be to iterate over a LuceneDictionary; a minimal PrefixQuery sketch follows the article links below.
Hope it helps
Did You Mean: Lucene? (page 1)
Did You Mean: Lucene? (page 2)
Did You Mean: Lucene? (page 3)
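As a minimal sketch of the PrefixQuery idea (using a recent Lucene API; the "word" field name and the searcher are assumptions - it presumes a "dictionary" index with one term per document):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;

// match every dictionary entry starting with what the user has typed so far
PrefixQuery query = new PrefixQuery(new Term("word", "inter"));
TopDocs docs = searcher.search(query, 5); // top 5 completions, e.g. internet, international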

In addition to the above (much appreciated) post re: the C# conversion: if you're using .NET 3.5 you'll need to include the code for the EdgeNGramTokenFilter yourself - or at least I did, using Lucene 2.9.2 - as this filter is missing from the .NET version as far as I could tell. I had to find the .NET 4 version online in 2.9.3 and port it back. I hope this makes the procedure less painful for someone...
Edit: Please also note that the array returned by the SuggestTermsFor() function is sorted by count ascending; you'll probably want to reverse it to get the most popular terms first in your list.
using System.IO;
using System.Collections;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;
using Lucene.Net.Util;
namespace Lucene.Net.Analysis.NGram
{
/**
* Tokenizes the given token into n-grams of given size(s).
* <p>
* This {@link TokenFilter} creates n-grams from the beginning edge or ending edge of an input token.
* </p>
*/
public class EdgeNGramTokenFilter : TokenFilter
{
public static Side DEFAULT_SIDE = Side.FRONT;
public static int DEFAULT_MAX_GRAM_SIZE = 1;
public static int DEFAULT_MIN_GRAM_SIZE = 1;
// Replace this with an enum when the Java 1.5 upgrade is made, the impl will be simplified
/** Specifies which side of the input the n-gram should be generated from */
public class Side
{
private string label;
/** Get the n-gram from the front of the input */
public static Side FRONT = new Side("front");
/** Get the n-gram from the end of the input */
public static Side BACK = new Side("back");
// Private ctor
private Side(string label) { this.label = label; }
public string getLabel() { return label; }
// Get the appropriate Side from a string
public static Side getSide(string sideName)
{
if (FRONT.getLabel().Equals(sideName))
{
return FRONT;
}
else if (BACK.getLabel().Equals(sideName))
{
return BACK;
}
return null;
}
}
private int minGram;
private int maxGram;
private Side side;
private char[] curTermBuffer;
private int curTermLength;
private int curGramSize;
private int tokStart;
private TermAttribute termAtt;
private OffsetAttribute offsetAtt;
protected EdgeNGramTokenFilter(TokenStream input) : base(input)
{
this.termAtt = (TermAttribute)AddAttribute(typeof(TermAttribute));
this.offsetAtt = (OffsetAttribute)AddAttribute(typeof(OffsetAttribute));
}
/**
* Creates EdgeNGramTokenFilter that can generate n-grams in the sizes of the given range
*
* @param input {@link TokenStream} holding the input to be tokenized
* @param side the {@link Side} from which to chop off an n-gram
* @param minGram the smallest n-gram to generate
* @param maxGram the largest n-gram to generate
*/
public EdgeNGramTokenFilter(TokenStream input, Side side, int minGram, int maxGram)
: base(input)
{
if (side == null)
{
throw new System.ArgumentException("sideLabel must be either front or back");
}
if (minGram < 1)
{
throw new System.ArgumentException("minGram must be greater than zero");
}
if (minGram > maxGram)
{
throw new System.ArgumentException("minGram must not be greater than maxGram");
}
this.minGram = minGram;
this.maxGram = maxGram;
this.side = side;
this.termAtt = (TermAttribute)AddAttribute(typeof(TermAttribute));
this.offsetAtt = (OffsetAttribute)AddAttribute(typeof(OffsetAttribute));
}
/**
* Creates EdgeNGramTokenFilter that can generate n-grams in the sizes of the given range
*
* @param input {@link TokenStream} holding the input to be tokenized
* @param sideLabel the name of the {@link Side} from which to chop off an n-gram
* @param minGram the smallest n-gram to generate
* @param maxGram the largest n-gram to generate
*/
public EdgeNGramTokenFilter(TokenStream input, string sideLabel, int minGram, int maxGram)
: this(input, Side.getSide(sideLabel), minGram, maxGram)
{
}
public override bool IncrementToken()
{
while (true)
{
if (curTermBuffer == null)
{
if (!input.IncrementToken())
{
return false;
}
else
{
curTermBuffer = (char[])termAtt.TermBuffer().Clone();
curTermLength = termAtt.TermLength();
curGramSize = minGram;
tokStart = offsetAtt.StartOffset();
}
}
if (curGramSize <= maxGram)
{
if (!(curGramSize > curTermLength // if the remaining input is too short, we can't generate any n-grams
|| curGramSize > maxGram))
{ // if we have hit the end of our n-gram size range, quit
// grab gramSize chars from front or back
int start = side == Side.FRONT ? 0 : curTermLength - curGramSize;
int end = start + curGramSize;
ClearAttributes();
offsetAtt.SetOffset(tokStart + start, tokStart + end);
termAtt.SetTermBuffer(curTermBuffer, start, curGramSize);
curGramSize++;
return true;
}
}
curTermBuffer = null;
}
}
public override Token Next(Token reusableToken)
{
return base.Next(reusableToken);
}
public override Token Next()
{
return base.Next();
}
public override void Reset()
{
base.Reset();
curTermBuffer = null;
}
}
}

Related

Java Lucene search - is it possible to search a number in a range?

Using the Lucene libs, I need to make some changes to the existing search function:
Let's assume the following object:
Name: "Port Object 1"
Data: "TCP (1)/1000-2000"
And the query (or the search text) is "1142"
Is it possible to search for "1142" inside Data field and find the Port Object 1, since it refers to a range between 1000-2000?
I only managed to find the numeric range query, but that does not apply in this case, since I don't know the ranges...
package com.company;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
public class Main {
public static void main(String[] args) throws IOException, ParseException {
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "TCP (6)/1100-2000", "193398817");
addDoc(w, "TCP (6)/3000-4200", "55320055Z");
addDoc(w, "UDP (12)/50000-65000", "55063554A");
w.close();
// 2. query
String querystr = "1200";
Query q = new QueryParser("title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(q, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
Refer to the above code.
The query "1200" should find the first doc.
Later edit:
I think what I need is exactly the opposite of range search:
https://lucene.apache.org/core/5_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Range_Searches
Here is one approach, but it requires you to parse the range data into separate values, before your data can be indexed by Lucene. So, for example, from this:
"TCP (6)/1100-2000"
You would need to extract these two values (e.g. using a regex): 1100 and 2000.
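For instance, a small sketch of that extraction (the pattern is just one way to do it):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// pull the two bounds out of strings like "TCP (6)/1100-2000"
Pattern rangePattern = Pattern.compile("(\\d+)-(\\d+)\\s*$");
Matcher m = rangePattern.matcher("TCP (6)/1100-2000");
if (m.find()) {
    long min = Long.parseLong(m.group(1)); // 1100
    long max = Long.parseLong(m.group(2)); // 2000
}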
LongRange with ContainsQuery
Add a new field to each document (e.g. named "tcpRange") and define it as a LongRange field.
(There is also IntRange if you don't need long values.)
long[] min = { 1100 };
long[] max = { 2000 };
Field tcpRange = new LongRange("tcpRange", min, max);
The values are defined in arrays, because this range type can handle multiple ranges in one field. But we only need the one range in our case.
Then you can make use of the "contains" query to search for your specific value, e.g. 1200:
long[] searchValue = { 1200 };
Query containsQuery = LongRange.newContainsQuery("tcpRange", searchValue, searchValue);
Note: My examples are based on the latest version of Lucene (8.5). I believe this should apply to other earlier versions also.
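Putting the pieces together, a minimal end-to-end sketch (ByteBuffersDirectory and searcher.count are my own choices here, not taken from the question's code):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongRange;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class RangeContainsDemo {
    public static void main(String[] args) throws Exception {
        Directory index = new ByteBuffersDirectory();
        try (IndexWriter w = new IndexWriter(index, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // the bounds parsed out of "TCP (6)/1100-2000"
            doc.add(new LongRange("tcpRange", new long[] { 1100 }, new long[] { 2000 }));
            w.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            long[] value = { 1200 };
            Query q = LongRange.newContainsQuery("tcpRange", value, value);
            System.out.println("hits: " + searcher.count(q)); // prints "hits: 1"
        }
    }
}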
EDIT
Regarding additional questions asked in the comments to this answer...
The following method converts an IPv4 address to a long value. Using this allows IP address ranges to be handled (and the same LongRange approach as above can be used):
public long ipToLong(String ipAddress) {
long result = 0;
String[] ipAddressInArray = ipAddress.split("\\.");
for (int i = 3; i >= 0; i--) {
long ip = Long.parseLong(ipAddressInArray[3 - i]);
// left shifting 24, 16, 8, 0 with bitwise OR
result |= ip << (i * 8);
}
return result;
}
This also means valid subnet ranges do not have to be handled specially - any two IP addresses will generate a sequential set of numbers.
Credit to this mkyong site for the approach.
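A sketch of how that plugs into the LongRange approach above (the "ipRange" field name is mine):
// e.g. for a document covering "192.168.0.1-192.168.0.255"
long min = ipToLong("192.168.0.1");   // 3232235521
long max = ipToLong("192.168.0.255"); // 3232235775
doc.add(new LongRange("ipRange", new long[] { min }, new long[] { max }));

// ...and to look up "192.168.0.100" inside that range:
long[] value = { ipToLong("192.168.0.100") };
Query q = LongRange.newContainsQuery("ipRange", value, value);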
I managed to add another field, and it works now. Also, do you know how I could do the same search but for IPv4? E.g. if I search for something like "192.168.0.100" in a "192.168.0.1-192.168.0.255" string?
Hi @CristianNicolaePerjescu, I can't comment because of my reputation, but you can create a class that extends Field and add it to your Lucene index. For example:
public class InetAddressRange extends Field {
...
/**
* Create a new InetAddressRange from min/max value
* @param name field name. must not be null.
* @param min range min value; defined as an {@code InetAddress}
* @param max range max value; defined as an {@code InetAddress}
*/
public InetAddressRange(String name, final InetAddress min, final InetAddress max) {
super(name, TYPE);
setRangeValues(min, max);
}
...
}
And then add to the index:
document.add(new InetAddressRange("field", InetAddressFrom, InetAddressTo));
In your class you can add your own Query format, like:
public static Query newIntersectsQuery(String field, final InetAddress min, final InetAddress max) {
return newRelationQuery(field, min, max, QueryType.INTERSECTS);
}
/** helper method for creating the desired relational query */
private static Query newRelationQuery(String field, final InetAddress min, final InetAddress max, QueryType relation) {
return new RangeFieldQuery(field, encode(min, max), 1, relation) {
@Override
protected String toString(byte[] ranges, int dimension) {
return InetAddressRange.toString(ranges, dimension);
}
};
}
I hope this is helpful for you.

How to highlight Boolean FuzzyQueries in Lucene - boost must be a positive float?

I'm trying to be nice to users who make a lot of typos (like myself).
I'm trying to create a simple search page for some data. I build FuzzyQuerys inside a BooleanQuery because I want to allow for the user's typos, for example:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new FuzzyQuery(new Term("body", "pzza")), BooleanClause.Occur.SHOULD);
builder.add(new FuzzyQuery(new Term("body", "tcyoon")), BooleanClause.Occur.SHOULD);
BooleanQuery query = builder.build();
Searching works as expected, but the code I got from the Lucene 8.5 API docs to build the highlighting fails:
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.length; i++) {
int id = hits[i].doc;
Document doc = searcher.doc(id);
System.out.println("HIT:" + doc.get("url"));
String text = doc.get("body");
TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "body", analyzer);
TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);//highlighter.getBestFragments(tokenStream, text, 3, "...");
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
System.out.println((frag[j].toString()));
}
}
}
With error:
java.lang.IllegalArgumentException: boost must be a positive float, got -1.0
at org.apache.lucene.search.BoostQuery.<init>(BoostQuery.java:44)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:69)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:54)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:117)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:246)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:135)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:530)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:218)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:201)
The code uses a deprecated method, but I took it straight from the documentation.
Can somebody explain why I get this error? How can I create a highlighter that works with this query construction? Or do I need a different Query?
The following approach to highlighting uses Lucene v8.5.0 with the question's fuzzy boolean example.
In my stripped-down demo, the matched terms come back wrapped in the highlighter's tags (you can refine how the highlighted fragments are displayed, of course).
The highlighting code is as follows:
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
public class CustomHighlighter {
private static final String PRE_TAG = "<span class=\"hilite\">";
private static final String POST_TAG = "</span>";
public static String[] highlight(Query query, IndexSearcher searcher,
Analyzer analyzer, ScoreDoc hit, String fieldName)
throws IOException, InvalidTokenOffsetsException {
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter(PRE_TAG, POST_TAG);
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
int id = hit.doc;
Document doc = searcher.doc(id);
String text = doc.get(fieldName);
TokenStream tokenStream = TokenSources.getTokenStream(fieldName,
searcher.getIndexReader().getTermVectors(id), text, analyzer, -1);
int maxNumFragments = 10;
boolean mergeContiguousFragments = Boolean.TRUE;
TextFragment[] frags = highlighter.getBestTextFragments(tokenStream,
text, mergeContiguousFragments, maxNumFragments);
String[] highlightedText = new String[frags.length];
for (int i = 0; i < frags.length; i++) {
highlightedText[i] = frags[i].toString();
}
// control how you handle each fragment for display...
//for (TextFragment frag : frags) {
// if ((frag != null) && (frag.getScore() > 0)) {
// highlightedText = frag.toString();
// }
//}
return highlightedText;
}
}
The class is used as follows (where SearchResults is just one of my classes for collecting results, for later presentation to the user):
for (ScoreDoc hit : hits) {
String[] highlightedText = CustomHighlighter.highlight(query, searcher,
analyzer, hit, field);
String document = searcher.doc(hit.doc).get("path");
SearchResults.Match match = new SearchResults.Match(document, highlightedText, hit.score);
results.getMatches().add(match);
}
And the fuzzy query is this:
private static Query useFuzzyBooleanQuery() {
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new FuzzyQuery(new Term("contents", "pzza")), BooleanClause.Occur.SHOULD);
builder.add(new FuzzyQuery(new Term("contents", "tcyoon")), BooleanClause.Occur.SHOULD);
return builder.build();
}
The above code does not give me any deprecation warnings.
I can't explain why you get that specific "boost" error - I have not seen that myself, and I was not able to recreate it. But I did not try too hard, I confess.

Java code to compare Excel sheets does not work for larger files

I recently did a project in Java to compare Excel sheets in 2 different folders, generating the result in a summary folder created in the source folder directories. All of the code works fine except for files which have more than 10000 rows: for larger files it just creates an empty sheet instead of the compared mismatches. Here is the code I used - please help me out.
package com.validation.comparators;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
import org.bson.Document;
/**
* The utility class SheetComparator
*/
public class SheetComparator {
private SheetComparator() {
// The utility class
}
/**
* Compares the document equivalent of two sheets
*
* @param document1
* The document 1
* @param document2
* The document 2
* @return The compared output
*/
@SuppressWarnings("unchecked")
public static Document compare(Document document1, Document document2) {
List<String> headers = (List<String>) document1.get("headers");
List<Document> sheet1Rows = (List<Document>) document1.get("data");
List<Document> sheet2Rows = (List<Document>) document2.get("data");
List<Document> temp;
List<Document> comparedOutput = new ArrayList<>();
if (sheet1Rows.size() < sheet2Rows.size()) {
temp = sheet1Rows;
sheet1Rows = sheet2Rows;
sheet2Rows = temp;
}
int length = sheet1Rows.size();
int length2 = sheet2Rows.size();
for (int i = 0; i < length2; i++) {
Document sheet1Row = sheet1Rows.get(i);
Document sheet2Row = sheet2Rows.get(i);
Document comparedRow = new Document("row number",
new Document("value", sheet1Row.getString("row number")).append("color", "WHITE"));
Boolean completeMatch = true;
for (String header : headers) {
Boolean isNull = false;
String value1 = sheet1Row.getString(header).trim();
String value2 = sheet2Row.getString(header).trim();
if (StringUtils.isAnyBlank(value1, value2)) {
completeMatch = false;
isNull = true;
} else if (!StringUtils.equals(value1, value2)) {
completeMatch = false;
}
if (isNull) {
comparedRow.append(header, new Document("value", StringUtils.isBlank(value1) ? value2 : value1)
.append("color", "RED"));
} else {
comparedRow.append(header, new Document("value", value1).append("color", "WHITE"));
}
}
if (!completeMatch) {
comparedOutput.add(comparedRow);
}
}
for (int i = length2; i < length; i++) {
Document row = sheet1Rows.get(i);
Document comparedRow = new Document();
for (String header : headers) {
String value = row.getString(header);
comparedRow.put(header, new Document("value", value).append("color", "RED"));
}
comparedRow.append("row number",
new Document("value", row.getString("row number")).append("color", "WHITE"));
comparedOutput.add(comparedRow);
}
headers.add(0, "row number");
return new Document("data", comparedOutput).append("headers", headers);
}
}
Try setting the JVM's memory high. The problem is that you have to read the entire DOM object hierarchy.
Otherwise, you only need to read the documents sequentially: by sheet, by row, by cell.
So instead of having a DOM object in memory, you could:
Write the documents out as a sequential stream (a text document) to a file.
Convert both Excel files to text.
Read the two streams and make a diff for every token.
You can probably write a diff report immediately, as in the sketch below.
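A minimal sketch of that streaming diff, assuming both sheets have already been dumped to text files with one row per line (the file names are placeholders):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingDiff {
    public static void main(String[] args) throws IOException {
        try (BufferedReader left = Files.newBufferedReader(Paths.get("sheet1.txt"));
             BufferedReader right = Files.newBufferedReader(Paths.get("sheet2.txt"))) {
            int lineNo = 0;
            while (true) {
                String l = left.readLine();
                String r = right.readLine();
                lineNo++;
                if (l == null && r == null) {
                    break; // both files exhausted
                }
                if (l == null || r == null || !l.equals(r)) {
                    // report the mismatch immediately instead of buffering it
                    System.out.println("Mismatch at row " + lineNo + ": " + l + " | " + r);
                }
            }
        }
    }
}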

Fetch all the hyperlinks from a webpage, recursively, in Java

1. Fetch all contents from a webpage.
2. Fetch the hyperlinks from the webpage.
3. Repeat 1 & 2 for each fetched hyperlink.
4. Repeat the process until 200 hyperlinks are registered or there are no more hyperlinks to fetch.
I wrote a sample program, but due to my poor understanding of recursion my loop became infinite.
Please suggest how to fix the code so that it matches the expectation.
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Content
{
private static final String HTML_A_HREF_TAG_PATTERN =
"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
Pattern pattern;
public Content ()
{
pattern = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
}
private void fetchContentFromURL(String strLink) {
String content = null;
URLConnection connection = null;
try {
connection = new URL(strLink).openConnection();
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter("\\Z");
content = scanner.next();
}catch ( Exception ex ) {
ex.printStackTrace();
return;
}
fetchURL(content);
}
private void fetchURL ( String content )
{
Matcher matcher = pattern.matcher( content );
while(matcher.find()) {
String group = matcher.group();
if(group.toLowerCase().contains( "http" ) || group.toLowerCase().contains( "https" )) {
group = group.substring( group.indexOf( "=" )+1 );
group = group.replaceAll( "'", "" );
group = group.replaceAll( "\"", "" );
System.out.println("lINK "+group);
fetchContentFromURL(group);
}
}
System.out.println("DONE");
}
/**
* @param args
*/
public static void main ( String[] args )
{
new Content().fetchContentFromURL( "http://www.google.co.in" );
}
}
I am open to any other solution as well, but I want to stick with the core Java API only - no 3rd party libraries.
One possible option here is to remember all visited links to avoid cyclic paths. Here's how to achieve it with additional Set storage for already visited links:
public class Content {
private static final String HTML_A_HREF_TAG_PATTERN =
"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private Pattern pattern;
private Set<String> visitedUrls = new HashSet<String>();
public Content() {
pattern = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
}
private void fetchContentFromURL(String strLink) {
String content = null;
URLConnection connection = null;
try {
connection = new URL(strLink).openConnection();
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter("\\Z");
if (scanner.hasNext()) {
content = scanner.next();
visitedUrls.add(strLink);
fetchURL(content);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
private void fetchURL(String content) {
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
String group = matcher.group();
if (group.toLowerCase().contains("http") || group.toLowerCase().contains("https")) {
group = group.substring(group.indexOf("=") + 1);
group = group.replaceAll("'", "");
group = group.replaceAll("\"", "");
System.out.println("lINK " + group);
if (!visitedUrls.contains(group) && visitedUrls.size() < 200) {
fetchContentFromURL(group);
}
}
}
System.out.println("DONE");
}
/**
* @param args
*/
public static void main(String[] args) {
new Content().fetchContentFromURL("http://www.google.co.in");
}
}
I also fixed some other issues in fetching logic, now it works as expected.
Inside the fetchContentFromURL method you should record which URL you are currently fetching; if that URL has already been fetched, skip it. Otherwise two pages A and B that link to each other will cause your code to keep fetching forever.
In addition to JK1's answer, for achieving target 4 of your question you might want to maintain the count of hyperlinks as an instance variable. Rough pseudocode might be (you can adjust the exact count; alternatively, you can use the HashSet's size to know how many hyperlinks your program has parsed so far):
if (!visitedUrls.contains(group) && noOfHyperlinksVisited++ < 200) {
fetchContentFromURL(group);
}
However, I was not sure whether you want a total of 200 hyperlinks OR want to traverse to a depth of 200 links from the starting page. If it is the latter, you might wish to explore breadth-first search, which will let you know when you have reached your target depth.
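Either way, a rough breadth-first sketch (here capped at 200 total links; extractLinks is a hypothetical helper standing in for the fetch-and-parse logic shown above):
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// breadth-first crawl; extractLinks(url) is assumed to download the page
// and return the hrefs found on it (see the code above)
private void crawlBreadthFirst(String startUrl) {
    Set<String> visited = new HashSet<>();
    Queue<String> frontier = new ArrayDeque<>();
    frontier.add(startUrl);
    while (!frontier.isEmpty() && visited.size() < 200) {
        String url = frontier.poll();
        if (!visited.add(url)) {
            continue; // already seen; skipping avoids cycles between pages
        }
        System.out.println("LINK " + url);
        for (String link : extractLinks(url)) {
            if (!visited.contains(link)) {
                frontier.add(link);
            }
        }
    }
}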

How to convert Microsoft Locale ID (LCID) into language code or Locale object in Java

I need to translate a Microsoft locale ID, such as 1033 (for US English), into either an ISO 639 language code or directly into a Java Locale instance. (Edit: or even simply into the "Language - Country/Region" in Microsoft's table.)
Is this possible, and what's the easiest way? Preferably using only JDK standard libraries, of course, but if that's not possible, with a 3rd party library.
You could use GetLocaleInfo to do this (assuming you were running on Windows (win2k+)).
This C++ code demonstrates how to use the function:
#include "windows.h"
int main()
{
HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
if(INVALID_HANDLE_VALUE == stdout) return 1;
LCID Locale = 0x0c01; //Arabic - Egypt
int nchars = GetLocaleInfoW(Locale, LOCALE_SISO639LANGNAME, NULL, 0);
wchar_t* LanguageCode = new wchar_t[nchars];
GetLocaleInfoW(Locale, LOCALE_SISO639LANGNAME, LanguageCode, nchars);
WriteConsoleW(stdout, LanguageCode, nchars, NULL, NULL);
delete[] LanguageCode;
return 0;
}
It would not take much work to turn this into a JNA call. (Tip: emit constants as ints to find their values.)
Sample JNA code:
draw a Windows cursor
print Unicode on a Windows console
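And an untested sketch of what the GetLocaleInfoW binding might look like via JNA (assuming JNA 5.x; the interface mapping is mine, and 0x59 is LOCALE_SISO639LANGNAME from winnls.h):
import com.sun.jna.Native;
import com.sun.jna.win32.StdCallLibrary;

public interface Kernel32Locale extends StdCallLibrary {
    Kernel32Locale INSTANCE = Native.load("kernel32", Kernel32Locale.class);

    // LOCALE_SISO639LANGNAME from winnls.h
    int LOCALE_SISO639LANGNAME = 0x59;

    // wide-char variant, declared explicitly so no name-mangling options are needed
    int GetLocaleInfoW(int locale, int lcType, char[] lpLCData, int cchData);

    static void main(String[] args) {
        char[] buffer = new char[9];
        int len = INSTANCE.GetLocaleInfoW(1033, LOCALE_SISO639LANGNAME, buffer, buffer.length);
        // the returned length includes the terminating null
        System.out.println(new String(buffer, 0, Math.max(0, len - 1))); // prints "en"
    }
}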
Using JNI is a bit more involved, but is manageable for a relatively trivial task.
At the very least, I would look into using native calls to build your conversion database. I'm not sure if Windows has a way to enumerate the LCIDs, but there's bound to be something in .Net. As a build-level thing, this isn't a huge burden. I would want to avoid manual maintenance of the list.
As it started to look like there is no ready Java solution to do this mapping, we took the ~20 minutes to roll something of our own, at least for now.
We took the information from the horse's mouth, i.e. http://msdn.microsoft.com/en-us/goglobal/bb964664.aspx, and copy-pasted it (through Excel) into a .properties file like this:
1078 = Afrikaans - South Africa
1052 = Albanian - Albania
1118 = Amharic - Ethiopia
1025 = Arabic - Saudi Arabia
5121 = Arabic - Algeria
...
(You can download the file here if you have similar needs.)
Then there's a very simple class that reads the information from the .properties file into a map, and has a method for doing the conversion.
Map<String, String> lcidToDescription;
public String getDescription(String lcid) { ... }
And yes, this doesn't actually map to language code or Locale object (which is what I originally asked), but to Microsoft's "Language - Country/Region" description. It turned out this was sufficient for our current need.
Disclaimer: this really is a minimalistic, "dummy" way of doing it yourself in Java, and obviously keeping (and maintaining) a copy of the LCID mapping information in your own codebase is not very elegant. (On the other hand, neither would I want to include a huge library jar or do anything overly complicated just for this simple mapping.) So despite this answer, feel free to post more elegant solutions or existing libraries if you know of anything like that.
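For completeness, a fleshed-out sketch of that minimalistic class, using java.util.Properties directly (the resource name lcid.properties is a placeholder for wherever you keep the file):
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class LcidDescriptions {
    private final Properties lcidToDescription = new Properties();

    public LcidDescriptions() throws IOException {
        // the copy-pasted MSDN table, one "lcid = description" pair per line
        try (InputStream in = getClass().getResourceAsStream("/lcid.properties")) {
            lcidToDescription.load(in);
        }
    }

    /** e.g. getDescription("1033") returns Microsoft's "Language - Country/Region" text */
    public String getDescription(String lcid) {
        return lcidToDescription.getProperty(lcid);
    }
}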
The following code will programmatically create a mapping between Microsoft LCID codes and Java Locales, making it easier to keep the mapping up-to-date:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
/**
* @author Gili Tzabari
*/
public final class Locales
{
/**
* Maps a Microsoft LCID to a Java Locale.
*/
private final Map<Integer, Locale> lcidToLocale = new HashMap<>(LcidToLocaleMapping.NUM_LOCALES);
public Locales()
{
// Try loading the mapping from cache
File file = new File("lcid-to-locale.properties");
Properties properties = new Properties();
try (FileInputStream in = new FileInputStream(file))
{
properties.load(in);
for (Object key: properties.keySet())
{
String keyString = key.toString();
Integer lcid = Integer.parseInt(keyString);
String languageTag = properties.getProperty(keyString);
lcidToLocale.put(lcid, Locale.forLanguageTag(languageTag));
}
return;
}
catch (IOException unused)
{
// Cache does not exist or is invalid, regenerate...
lcidToLocale.clear();
}
LcidToLocaleMapping mapping;
try
{
mapping = new LcidToLocaleMapping();
}
catch (IOException e)
{
// Unrecoverable runtime failure
throw new AssertionError(e);
}
for (Locale locale: Locale.getAvailableLocales())
{
if (locale == Locale.ROOT)
{
// Special case that doesn't map to a real locale
continue;
}
String language = locale.getDisplayLanguage(Locale.ENGLISH);
String country = locale.getDisplayCountry(Locale.ENGLISH);
country = mapping.getCountryAlias(country);
String script = locale.getDisplayScript();
for (Integer lcid: mapping.listLcidFor(language, country, script))
{
lcidToLocale.put(lcid, locale);
properties.put(lcid.toString(), locale.toLanguageTag());
}
}
// Cache the mapping
try (FileOutputStream out = new FileOutputStream(file))
{
properties.store(out, "LCID to Locale mapping");
}
catch (IOException e)
{
// Unrecoverable runtime failure
throw new AssertionError(e);
}
}
/**
* @param lcid a Microsoft LCID code
* @return a Java locale
* @see https://msdn.microsoft.com/en-us/library/cc223140.aspx
*/
public Locale fromLcid(int lcid)
{
return lcidToLocale.get(lcid);
}
}
import com.google.common.collect.HashMultimap;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.SetMultimap;
import com.google.common.collect.Sets;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.bitbucket.cowwoc.preconditions.Preconditions;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* Generates a mapping between Microsoft LCIDs and Java Locales.
* <p>
* @see http://stackoverflow.com/a/32324060/14731
* @author Gili Tzabari
*/
final class LcidToLocaleMapping
{
private static final int NUM_COUNTRIES = 194;
private static final int NUM_LANGUAGES = 13;
private static final int NUM_SCRIPTS = 5;
/**
* The number of locales we are expecting. This value is only used for performance optimization.
*/
public static final int NUM_LOCALES = 238;
private static final List<String> EXPECTED_HEADERS = ImmutableList.of("lcid", "language", "location");
// [language] - [comment] ([script])
private static final Pattern languagePattern = Pattern.compile("^(.+?)(?: - (.*?))?(?: \\((.+)\\))?$");
/**
* Maps a country to a list of entries.
*/
private static final SetMultimap<String, Mapping> COUNTRY_TO_ENTRIES = HashMultimap.create(NUM_COUNTRIES,
NUM_LOCALES / NUM_COUNTRIES);
/**
* Maps a language to a list of entries.
*/
private static final SetMultimap<String, Mapping> LANGUAGE_TO_ENTRIES = HashMultimap.create(NUM_LANGUAGES,
NUM_LOCALES / NUM_LANGUAGES);
/**
* Maps a language script to a list of entries.
*/
private static final SetMultimap<String, Mapping> SCRIPT_TO_ENTRIES = HashMultimap.create(NUM_SCRIPTS,
NUM_LOCALES / NUM_SCRIPTS);
/**
* Maps a Locale country name to a LCID country name.
*/
private static final Map<String, String> countryAlias = ImmutableMap.<String, String>builder().
put("United Arab Emirates", "U.A.E.").
build();
/**
* A mapping between a country, language, script and LCID.
*/
private static final class Mapping
{
public final String country;
public final String language;
public final String script;
public final int lcid;
Mapping(String country, String language, String script, int lcid)
{
Preconditions.requireThat(country, "country").isNotNull();
Preconditions.requireThat(language, "language").isNotNull().isNotEmpty();
Preconditions.requireThat(script, "script").isNotNull();
this.country = country;
this.language = language;
this.script = script;
this.lcid = lcid;
}
@Override
public int hashCode()
{
return country.hashCode() + language.hashCode() + script.hashCode() + lcid;
}
@Override
public boolean equals(Object obj)
{
if (!(obj instanceof Mapping))
return false;
Mapping other = (Mapping) obj;
return country.equals(other.country) && language.equals(other.language) && script.equals(other.script) &&
lcid == other.lcid;
}
}
private final Logger log = LoggerFactory.getLogger(LcidToLocaleMapping.class);
/**
* Creates a new LCID to Locale mapping.
* <p>
* @throws IOException if an I/O error occurs while reading the LCID table
*/
LcidToLocaleMapping() throws IOException
{
Document doc = Jsoup.connect("https://msdn.microsoft.com/en-us/library/cc223140.aspx").get();
Element mainBody = doc.getElementById("mainBody");
Elements elements = mainBody.select("table");
assert (elements.size() == 1): elements;
for (Element table: elements)
{
boolean firstRow = true;
for (Element row: table.select("tr"))
{
if (firstRow)
{
// Make sure that columns are ordered as expected
List<String> headers = new ArrayList<>(3);
Elements columns = row.select("th");
for (Element column: columns)
headers.add(column.text().toLowerCase());
assert (headers.equals(EXPECTED_HEADERS)): headers;
firstRow = false;
continue;
}
Elements columns = row.select("td");
assert (columns.size() == 3): columns;
Integer lcid = Integer.parseInt(columns.get(0).text(), 16);
Matcher languageMatcher = languagePattern.matcher(columns.get(1).text());
if (!languageMatcher.find())
throw new AssertionError();
String language = languageMatcher.group(1);
String script = languageMatcher.group(2);
if (script == null)
script = "";
String country = columns.get(2).text();
Mapping mapping = new Mapping(country, language, script, lcid);
COUNTRY_TO_ENTRIES.put(country, mapping);
LANGUAGE_TO_ENTRIES.put(language, mapping);
if (!script.isEmpty())
SCRIPT_TO_ENTRIES.put(script, mapping);
}
}
}
/**
* Returns the LCID codes associated with a [country, language, script] combination.
* <p>
* @param language a language
* @param country a country (empty string if any country should match)
* @param script a language script (empty string if any script should match)
* @return an empty list if no matches are found
* @throws NullPointerException if any of the arguments are null
* @throws IllegalArgumentException if language is empty
*/
public Collection<Integer> listLcidFor(String language, String country, String script)
throws NullPointerException, IllegalArgumentException
{
Preconditions.requireThat(language, "language").isNotNull().isNotEmpty();
Preconditions.requireThat(country, "country").isNotNull();
Preconditions.requireThat(script, "script").isNotNull();
Set<Mapping> result = LANGUAGE_TO_ENTRIES.get(language);
if (result == null)
{
log.warn("Language '" + language + "' had no corresponding LCID");
return Collections.emptyList();
}
if (!country.isEmpty())
{
Set<Mapping> entries = COUNTRY_TO_ENTRIES.get(country);
result = Sets.intersection(result, entries);
}
if (!script.isEmpty())
{
Set<Mapping> entries = SCRIPT_TO_ENTRIES.get(script);
result = Sets.intersection(result, entries);
}
return result.stream().map(entry -> entry.lcid).collect(Collectors.toList());
}
/**
* @param name the locale country name
* @return the LCID country name
*/
public String getCountryAlias(String name)
{
String result = countryAlias.get(name);
if (result == null)
return name;
return result;
}
}
Maven dependencies:
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>18.0</version>
</dependency>
<dependency>
<groupId>org.bitbucket.cowwoc</groupId>
<artifactId>preconditions</artifactId>
<version>1.25</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.8.3</version>
</dependency>
Usage:
System.out.println("Language: " + new Locales().fromLcid(1033).getDisplayLanguage());
will print "Language: English".
Meaning, LCID 1033 maps to the English language.
NOTE: This only generates mappings for locales available on your runtime JVM. Meaning, you will only get a subset of all possible Locales. That said, I don't think it is technically possible to instantiate Locales that your JVM doesn't support, so this is probably the best we can do...
The first hit on Google for "Java LCID" is this javadoc:
gnu.java.awt.font.opentype.NameDecoder
private static java.util.Locale
getWindowsLocale(int lcid)
Maps a Windows LCID into a Java Locale.
Parameters:
lcid - the Windows language ID whose Java locale is to be retrieved.
Returns:
an suitable Locale, or null if the mapping cannot be performed.
I'm not sure where to go about downloading this library, but it's GNU, so it shouldn't be too hard to find.
Here is a script to paste into the F12 console to extract the mapping of the (currently) 273 languages to their LCID (to be used on https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c):
// extract data from https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c
const locales = {}, dataTable = document.querySelector('div.table-scroll-wrapper:nth-of-type(2)>table.protocol-table');
for (let i=1, l=dataTable.rows.length; i<l; i++) {
const row = dataTable.rows[i];
let locale = Number(row.cells[2].textContent.trim()); // hex to decimal
let name = row.cells[3].textContent.trim(); // cc-LL
if ((locale > 1024) && (name.indexOf('-') > 0)) // only cc-LL (languages, not countries)
locales[locale] = name;
}
console.table(locales); // 273 entries
