I'm working with the Lucene library, and I have the matching documents after executing a BooleanQuery.
I loop over the hits from the searcher, and for each hit I retrieve the Document and want to put its fields into a HashMap:
int docId = hits[i].doc;
Document doc = searcher.doc(docId);
HashMap<String, String> X = new HashMap<String, String>();
Now, how can I fill the HashMap X with the name_Field and the value_Field of each document?
You can iterate over document fields like this:
for (IndexableField field : doc.getFields()) {
    X.put(field.name(), field.stringValue());
}
But this will only work for fields that are stored in the index (those added with the Field.Store.YES flag). Also, if a field has several values in a document, this code has to be modified; a sketch follows.
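For the multi-valued case, a minimal sketch (assuming you widen the map's value type to String[]) could use Document.getValues, which returns all stored values for a field name:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;

// Collect every stored value per field name, not just the first one.
Map<String, String[]> multiValued = new HashMap<String, String[]>();
for (IndexableField field : doc.getFields()) {
    // getValues() returns all stored values for this field name;
    // repeated puts for a multi-valued field are harmless
    multiValued.put(field.name(), doc.getValues(field.name()));
}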
You could extend Lucene's Collector and then collect each document whichever way you want:
final IndexSearcher searcher = new IndexSearcher(indexReader);
final Map<String, String> docs = new HashMap<String, String>();
searcher.search(query, new Collector() {
    private int docBase;

    // ignore scorer
    public void setScorer(Scorer scorer) {
    }

    // accept docs out of order (for a BitSet it doesn't matter)
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public void collect(int docNum) throws IOException {
        // docNum is relative to the current reader, so add its docBase
        Document luceneDoc = searcher.doc(docNum + docBase);
        // name_Field and value_Field are the field names from your question
        docs.put(luceneDoc.get(name_Field), luceneDoc.get(value_Field));
    }

    public void setNextReader(AtomicReaderContext context) {
        this.docBase = context.docBase;
    }
});
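Note that this compiles against the Lucene 4.x Collector API. In Lucene 5.0 and later, Collector was split into a per-segment LeafCollector factory, and SimpleCollector is the usual base class for this kind of per-document callback.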
I have the following document in my database:
_id: ObjectId('63a73aec1afb1e4de760d9de')
uuid: "71e5db4e-ab05-4de2-9238-5660474c5156"
coins: 0
level: 1
currentXp: 1
upgrades: Object
durability: 0
luck: 1
Now I want to get the data from the nested object. I tried to get the durability int by doing this:
public static int getDurabilityLevel(UUID uuid) {
    Document filter = new Document("uuid", uuid.toString());
    int durabilityLevel = Main.getInstance().getDataConnection().getCollection().find(filter).first().getInteger("upgrades.durability");
    return durabilityLevel;
}
I also want to change the value of the luck integer. But if I try to change it, the durability integer disappears. I used this to change the value:
public static void setLuckLevel(UUID uuid, int level) {
    Document filter = new Document("uuid", uuid.toString());
    Document foundDocument = Main.getInstance().getDataConnection().getCollection().find(filter).first();
    if (foundDocument != null) {
        Document updateValue = new Document("Upgrades", new Document("luck", level));
        Document updateOperation = new Document("$set", updateValue);
        Main.getInstance().getDataConnection().getCollection().updateOne(foundDocument, updateOperation);
    }
}
I hope someone can help me with this simple problem. Thanks!
I was able to fix the problems myself. Here are my solutions:
This is my way to get data from the object:
public static int getDurabilityLevel(UUID uuid) {
    Document filter = new Document("uuid", uuid.toString());
    Document document = Main.getInstance().getDataConnection().getCollection().find(filter).first();
    Document object = (Document) document.get("upgrades");
    int durabilityLevel = object.getInteger("durability");
    return durabilityLevel;
}
And this is my way to change data in the object without deleting the other values:
public static void setDurabilityLevel(UUID uuid, int level) {
    Document filter = new Document("uuid", uuid.toString());
    Document foundDocument = Main.getInstance().getDataConnection().getCollection().find(filter).first();
    if (foundDocument != null) {
        Document updateValue = new Document("upgrades.durability", level);
        Document updateOperation = new Document("$set", updateValue);
        Main.getInstance().getDataConnection().getCollection().updateOne(foundDocument, updateOperation);
    }
}
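For what it's worth, the same targeted update can be written more compactly with the driver's Filters and Updates builders (a sketch, assuming the same collection access as above); the dot-notation key is what keeps the sibling fields of upgrades intact:

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public static void setLuckLevel(UUID uuid, int level) {
    // "upgrades.luck" targets only that nested field; "durability" is preserved
    Main.getInstance().getDataConnection().getCollection()
        .updateOne(eq("uuid", uuid.toString()), set("upgrades.luck", level));
}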
I have a MongoDB field at the path main.inner.leaf, and any of those fields may be absent.
In Java, to avoid a NullPointerException, I have to write:
String leaf = "";
if (document.get("main") != null &&
document.get("main", Document.class).get("inner") != null) {
leaf = document.get("main", Document.class)
.get("inner", Document.class).getString("leaf");
}
In this simple example there are only three levels (main, inner, and leaf), but my documents are deeper.
So is there a way avoiding me writing all these null checks?
Like this:
String leaf = document.getString("main.inner.leaf", "");
// "" is the deafult value if one of the levels doesn't exist
Or using a third party library:
String leaf = DocumentUtils.getNullCheck("main.inner.leaf", "", document);
Many thanks.
Since the intermediate attributes are optional, you really have to access the leaf value in a null-safe manner.
You could do this yourself using an approach like ...
if (document.containsKey("main")) {
Document _main = document.get("main", Document.class);
if (_main.containsKey("inner")) {
Document _inner = _main.get("inner", Document.class);
if (_inner.containsKey("leaf")) {
leafValue = _inner.getString("leaf");
}
}
}
Note: this could be wrapped up in a utility to make it more user friendly.
Or use a third-party library such as Commons BeanUtils.
But you cannot avoid null-safe checks, since the document structure is such that the intermediate levels might be null. All you can do is ease the burden of handling the null safety.
Here's an example test case showing both approaches:
@Test
public void readNestedDocumentsWithNullSafety() throws IllegalAccessException, NoSuchMethodException, InvocationTargetException {
    Document inner = new Document("leaf", "leafValue");
    Document main = new Document("inner", inner);
    Document fullyPopulatedDoc = new Document("main", main);

    assertThat(extractLeafValueManually(fullyPopulatedDoc), is("leafValue"));
    assertThat(extractLeafValueUsingThirdPartyLibrary(fullyPopulatedDoc, "main.inner.leaf", ""), is("leafValue"));

    Document emptyPopulatedDoc = new Document();

    assertThat(extractLeafValueManually(emptyPopulatedDoc), is(""));
    assertThat(extractLeafValueUsingThirdPartyLibrary(emptyPopulatedDoc, "main.inner.leaf", ""), is(""));

    Document emptyInner = new Document();
    Document partiallyPopulatedMain = new Document("inner", emptyInner);
    Document partiallyPopulatedDoc = new Document("main", partiallyPopulatedMain);

    assertThat(extractLeafValueManually(partiallyPopulatedDoc), is(""));
    assertThat(extractLeafValueUsingThirdPartyLibrary(partiallyPopulatedDoc, "main.inner.leaf", ""), is(""));
}
private String extractLeafValueUsingThirdPartyLibrary(Document document, String path, String defaultValue) {
    try {
        Object value = PropertyUtils.getNestedProperty(document, path);
        return value == null ? defaultValue : value.toString();
    } catch (Exception ex) {
        return defaultValue;
    }
}
private String extractLeafValueManually(Document document) {
    Document inner = getOrDefault(getOrDefault(document, "main"), "inner");
    return inner.get("leaf", "");
}

private Document getOrDefault(Document document, String key) {
    if (document.containsKey(key)) {
        return document.get(key, Document.class);
    } else {
        return new Document();
    }
}
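If you prefer the dotted-path style from the question, a hand-rolled walker along these lines is one way to wrap it up (a sketch; getString here is a hypothetical helper, not a driver method):

// Walks a dotted path such as "main.inner.leaf" through nested Documents,
// returning defaultValue if any level is missing or of the wrong type.
private static String getString(Document document, String dottedPath, String defaultValue) {
    String[] segments = dottedPath.split("\\.");
    Document current = document;
    for (int i = 0; i < segments.length - 1; i++) {
        Object next = current.get(segments[i]);
        if (!(next instanceof Document)) {
            return defaultValue;
        }
        current = (Document) next;
    }
    Object leaf = current.get(segments[segments.length - 1]);
    return leaf instanceof String ? (String) leaf : defaultValue;
}

Usage then matches the wished-for API: String leaf = getString(document, "main.inner.leaf", "");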
I'm trying to create a Term-Document matrix for a small corpus to experiment further with LSI. However, I couldn't find a way to do it with Lucene 4.4.
I know how to get TermVector for each document as following:
//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size()); //just testing
I thought I could just join all the term vectors together as columns of a matrix, but the term vectors of different documents have different sizes, and I don't know how to pad zeros into them. So, certainly, this method does not work.
Hence, I wonder if someone can show me how to create Term-Document vector with Lucene 4.4 please? (If possible, please show sample code).
If Lucene does not support this function, what is the other way you recommend to do it?
Many thanks,
I found the solution to my problem here. A very detailed example is given by Mr. Sujit, although the code is written for an older version of Lucene, so many things will have to be changed. I'll update the details when I finish my code.
Here is my solution that works on Lucene 4.4:
public class BuildTermDocumentMatrix {
    private final IndexReader reader;
    private final IndexSearcher searcher;
    private final File corpus;
    private final Map<String, Integer> termIdMap;

    public BuildTermDocumentMatrix(File index, File corpus) throws IOException {
        reader = DirectoryReader.open(FSDirectory.open(index));
        searcher = new IndexSearcher(reader);
        this.corpus = corpus;
        termIdMap = computeTermIdMap(reader);
    }

    /**
     * Map each term to a fixed integer so that we can build the document matrix later.
     * It's used to assign each term to a specific row in the Term-Document matrix.
     */
    private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
        Map<String, Integer> termIdMap = new HashMap<String, Integer>();
        int id = 0;
        Fields fields = MultiFields.getFields(reader);
        Terms terms = fields.terms("contents");
        TermsEnum itr = terms.iterator(null);
        BytesRef term = null;
        while ((term = itr.next()) != null) {
            String termText = term.utf8ToString();
            if (termIdMap.containsKey(termText))
                continue;
            //System.out.println(termText);
            termIdMap.put(termText, id++);
        }
        return termIdMap;
    }

    /**
     * Build the term-document matrix for the given directory.
     */
    public RealMatrix buildTermDocumentMatrix() throws IOException {
        // iterate through the directory to work with each doc
        int col = 0;
        int numDocs = countDocs(corpus);  // number of documents (helper not shown)
        int numTerms = termIdMap.size();  // total number of terms
        RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);

        for (File f : corpus.listFiles()) {
            if (!f.isHidden() && f.canRead()) {
                // I build the term-document matrix for a subset of the corpus, so
                // I need to look up each document by path name.
                // If you build it for the whole corpus, just iterate through all documents.
                String path = f.getPath();
                BooleanQuery pathQuery = new BooleanQuery();
                pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
                TopDocs hits = searcher.search(pathQuery, 1);

                // get the term vector for this document
                Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
                TermsEnum itr = termVector.iterator(null);
                BytesRef term = null;

                // compute each term's weight
                while ((term = itr.next()) != null) {
                    String termText = term.utf8ToString();
                    int row = termIdMap.get(termText);
                    long termFreq = itr.totalTermFreq();
                    long docCount = itr.docFreq();
                    double weight = computeTfIdfWeight(termFreq, docCount, numDocs); // helper not shown
                    tdMatrix.setEntry(row, col, weight);
                }
                col++;
            }
        }
        return tdMatrix;
    }
}
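For completeness, a minimal usage sketch (the index and corpus paths are placeholders, and countDocs/computeTfIdfWeight are the helpers elided above):

BuildTermDocumentMatrix builder =
        new BuildTermDocumentMatrix(new File("/path/to/index"), new File("/path/to/corpus"));
RealMatrix tdMatrix = builder.buildTermDocumentMatrix();
System.out.println("rows (terms): " + tdMatrix.getRowDimension()
        + ", cols (docs): " + tdMatrix.getColumnDimension());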
One can also refer to this code. With the latest Lucene versions it is quite easy:
public void testSparseFreqDoubleArrayConversion() throws Exception {
    Terms fieldTerms = MultiFields.getTerms(index, "text");
    if (fieldTerms != null && fieldTerms.size() != -1) {
        IndexSearcher indexSearcher = new IndexSearcher(index);
        for (ScoreDoc scoreDoc : indexSearcher.search(new MatchAllDocsQuery(), Integer.MAX_VALUE).scoreDocs) {
            Terms docTerms = index.getTermVector(scoreDoc.doc, "text");
            Double[] vector = DocToDoubleVectorUtils.toSparseLocalFreqDoubleArray(docTerms, fieldTerms);
            assertNotNull(vector);
            assertTrue(vector.length > 0);
        }
    }
}
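If you can't find DocToDoubleVectorUtils, it lives in Lucene's classification module (org.apache.lucene.classification.utils), so that artifact needs to be on the classpath.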
There are lots of questions on how to format the results of an SQL query as an HTML table, but I'd like to go the other way: given an arbitrary HTML table with a header row, I'd like to be able to extract information from one or more rows using SQL (or an SQL-like language). Simple to state, but apparently not so simple to accomplish.
Ultimately, I'd prefer to parse the HTML properly with something like libtidy or JSoup, but while the API documentation is usually reasonable, when it comes to examples or tutorials on actually using these libraries, you usually find an example of extracting the <title> tag (which could be accomplished with regexes) and no real-world examples of how to use the library. So a good resource or example code for one of the existing, established libraries would also be welcome.
Simple code for transforming a table into a list of tuples using JSoup looks like this:
public class Main {
    public static void main(String[] args) throws Exception {
        final String html =
                "<html><head/><body>" +
                "<table id=\"example\">" +
                "<tr><td>John</td><td>Doe</td></tr>" +
                "<tr><td>Michael</td><td>Smith</td></tr>" +
                "</table>" +
                "</body></html>";
        final List<Tuple> tuples = parse(html, "example");
        // ... here the parsed tuples can be used
    }

    private static final List<Tuple> parse(final String html, final String tableId) {
        final List<Tuple> tuples = new LinkedList<Tuple>();
        final Element table = Jsoup.parse(html).getElementById(tableId);
        final Elements rows = table.getElementsByTag("tr");
        for (final Element row : rows) {
            final Elements children = row.children();
            final int childCount = children.size();
            final Tuple tuple = new Tuple(childCount);
            for (final Element child : children) {
                tuple.addColumn(child.text());
            }
            tuples.add(tuple); // don't forget to collect the finished tuple
        }
        return tuples;
    }
}
public final class Tuple {
    private final String[] columns;
    private int cursor;

    public Tuple(final int size) {
        columns = new String[size];
        cursor = 0;
    }

    public String getColumn(final int no) {
        return columns[no];
    }

    public void addColumn(final String value) {
        columns[cursor++] = value;
    }
}
From here on you can, for example, load the tuples into an in-memory table with H2 and query it with regular SQL, as sketched below.
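A minimal sketch of that last step (assuming the two-column tuples from the example above and the H2 driver on the classpath; the table and column names are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;

// Load the parsed tuples into an in-memory H2 table and query them with SQL.
try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:scraped")) {
    try (Statement stmt = conn.createStatement()) {
        stmt.execute("CREATE TABLE people (first_name VARCHAR(255), last_name VARCHAR(255))");
    }
    try (PreparedStatement insert =
            conn.prepareStatement("INSERT INTO people VALUES (?, ?)")) {
        for (Tuple tuple : tuples) {
            insert.setString(1, tuple.getColumn(0));
            insert.setString(2, tuple.getColumn(1));
            insert.executeUpdate();
        }
    }
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT last_name FROM people WHERE first_name = 'John'")) {
        while (rs.next()) {
            System.out.println(rs.getString("last_name"));
        }
    }
}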
How do you get the matching fuzzy term and its offset when using Lucene fuzzy search?
IndexSearcher mem = ....(some standard code)
QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);
TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);
// the ~ triggers the fuzzy search as per "Lucene In Action"
The fuzzy search works fine. If a document contains the term "fuzzy" or "luzzy", it is matched. How do I get which term matched and what are their offsets?
I have made sure that all CONTENT_FIELDs are added with term vectors stored with positions and offsets.
There was no straightforward way of doing this; however, I reconsidered Jared's suggestion and was able to get a solution working.
I am documenting this here just in case someone else has the same issue.
Create a class that implements org.apache.lucene.search.highlight.Formatter:
public class HitPositionCollector implements Formatter {
    // MatchOffset is a simple DTO
    private List<MatchOffset> matchList;

    public HitPositionCollector() {
        matchList = new ArrayList<MatchOffset>();
    }

    // this is where the term's start and end offsets, as well as the actual term, are captured
    @Override
    public String highlightTerm(String originalText, TokenGroup tokenGroup) {
        if (tokenGroup.getTotalScore() > 0) {
            MatchOffset mo = new MatchOffset(tokenGroup.getToken(0).toString(),
                    tokenGroup.getStartOffset(), tokenGroup.getEndOffset());
            getMatchList().add(mo);
        }
        return originalText;
    }

    /**
     * @return the matchList
     */
    public List<MatchOffset> getMatchList() {
        return matchList;
    }
}
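The MatchOffset DTO isn't shown in the original post; a minimal version consistent with how it's used above and in the test below might look like this (field and accessor names inferred from the calls):

// Simple DTO holding a matched token and its character offsets.
public class MatchOffset {
    private final String token;
    private final int startPos;
    private final int endPos;

    public MatchOffset(String token, int startPos, int endPos) {
        this.token = token;
        this.startPos = startPos;
        this.endPos = endPos;
    }

    public String getToken() {
        return token;
    }

    public int getStartPos() {
        return startPos;
    }

    public int getEndPos() {
        return endPos;
    }
}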
Main Code
public void testHitsWithHitPositionCollector() throws Exception {
    System.out.println(" .... testHitsWithHitPositionCollector");
    String fuzzyStr = "bro*";
    QueryParser parser = new QueryParser(Version.LUCENE_30, "f", analyzer);
    Query fzyQry = parser.parse(fuzzyStr);

    TopDocs hits = searcher.search(fzyQry, 10);
    QueryScorer scorer = new QueryScorer(fzyQry, "f");
    HitPositionCollector myFormatter = new HitPositionCollector();

    // Highlighter(Formatter formatter, Scorer fragmentScorer)
    Highlighter highlighter = new Highlighter(myFormatter, scorer);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));

    Analyzer analyzer2 = new SimpleAnalyzer();
    int loopIndex = 0;
    //for (ScoreDoc sd : hits.scoreDocs) {
    Document doc = searcher.doc(hits.scoreDocs[0].doc);
    String title = doc.get("f");
    TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(),
            hits.scoreDocs[0].doc,
            "f",
            doc,
            analyzer2);
    String fragment = highlighter.getBestFragment(stream, title);
    System.out.println(fragment);
    assertEquals("the quick brown fox jumps over the lazy dog", fragment);

    MatchOffset mo = myFormatter.getMatchList().get(loopIndex++);
    assertTrue(mo.getEndPos() == 15);
    assertTrue(mo.getStartPos() == 10);
    assertTrue(mo.getToken().equals("brown"));
}