Lucene: Changing the default facet delimiter? (Java)

First post on this wonderful site!
My goal is to use hierarchical facets when searching an index with Lucene. However, my facets need to be delimited by a character other than '/' (in this case '~'). Example:
Categories
Categories~Category1
Categories~Category2
I have created a class, NewFacetIndexingParams, that implements the FacetIndexingParams interface (a copy of DefaultFacetIndexingParams with the DEFAULT_FACET_DELIM_CHAR param set to '~').
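For reference, a minimal sketch of what such a class might look like, assuming the 3.x contrib API where the delimiter is exposed through getFacetDelimChar() (if that method is final in your version, copying DefaultFacetIndexingParams wholesale, as described above, achieves the same):

import org.apache.lucene.facet.index.params.DefaultFacetIndexingParams;

// Hypothetical sketch: identical to DefaultFacetIndexingParams except for the delimiter.
public class NewFacetIndexingParams extends DefaultFacetIndexingParams {
    @Override
    public char getFacetDelimChar() {
        return '~'; // replaces the default delimiter
    }
}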
Paraphrased indexing code (using FSDirectory for both the index and the taxonomy):
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34)
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer)
IndexWriter writer = new IndexWriter(indexDir, config)
TaxonomyWriter taxo = new LuceneTaxonomyWriter(taxDir, OpenMode.CREATE)
Document doc = new Document()
// Add bunch of Fields... hidden for the sake of brevity
List<CategoryPath> categories = new ArrayList<CategoryPath>()
row.tags.split('\\|').each { tag ->
    def cp = new CategoryPath()
    tag.split('~').each {
        cp.add(it)
    }
    categories.add(cp)
}
NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo, facetIndexingParams)
categoryDocBuilder.setCategoryPaths(categories).build(doc)
writer.addDocument(doc)
// Commit and close both writer and taxo.
Search code paraphrased:
// Create index and taxonomy readers to get info from the index and taxonomy
IndexReader indexReader = IndexReader.open(indexDir)
TaxonomyReader taxo = new LuceneTaxonomyReader(taxDir)
Searcher searcher = new IndexSearcher(indexReader)
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", new StandardAnalyzer(Version.LUCENE_34))
parser.setAllowLeadingWildcard(true)
Query q = parser.parse(query)
TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true)
List<FacetResult> res = null
NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
FacetSearchParams facetSearchParams = new FacetSearchParams(facetIndexingParams)
CountFacetRequest cfr = new CountFacetRequest(new CategoryPath(""), 99)
cfr.setDepth(2)
cfr.setSortBy(SortBy.VALUE)
facetSearchParams.addFacetRequest(cfr)
FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo)
def cp = new CategoryPath("Category~Category1", (char)'~')
searcher.search(DrillDown.query(q, cp), MultiCollector.wrap(tdc, facetsCollector))
The results are always returned as a list of facets in the form "Category/Category1".
I have used the Luke tool to look at the index and it appears the facets are being delimited by the '~' character in the index.
What is the best route to do this? Any help is greatly appreciated!

I have figured out the issue. The search and indexing are working as they should; the problem was how I was getting the facet results. I was using:
res = facetsCollector.getFacetResults()
res.each { result ->
    result.getFacetResultNode().getLabel().toString()
}
What I needed to use was:
res = facetsCollector.getFacetResults()
res.each { result ->
    result.getFacetResultNode().getLabel().toString((char)'~')
}
The difference is the delimiter parameter passed to the toString method!
Easy to overlook, tough to find.
Hope this helps others.

Related

Lucene index: getting an empty result when querying

I am trying to query a Lucene index, but I am getting an empty result and the errors below in the log:
Traversal query (query without index): select [jcr:path] from [nt:base] where isdescendantnode('/test') and name='World'; consider creating an index
[async] The index update failed
org.apache.jackrabbit.oak.api.CommitFailedException: OakAsync0002: Missing index provider detected for type [counter] on index [/oak:index/counter]
I am using an RDB DocumentStore, and I have checked that the index and nodes are created in the nodes table. I tried the code below:
@Autowired
NodeStore rdbNodeStore;
// create repository
LuceneIndexProvider provider = new LuceneIndexProvider();
ContentRepository repository = new Oak(rdbNodeStore)
        .with(new OpenSecurityProvider())
        .with(new InitialContent())
        .with((QueryIndexProvider) provider)
        .with((Observer) provider)
        .with(new LuceneIndexEditorProvider())
        .withAsyncIndexing("async", 5)
        .createContentRepository();
// log in to the repository and retrieve a session
ContentSession contentSession = repository.login(null, null);
Root root = contentSession.getLatestRoot();
//create lucene index
Tree index = root.getTree("/");
Tree t = index.addChild("oak:index");
t = t.addChild("lucene");
t.setProperty("jcr:primaryType", "oak:QueryIndexDefinition", Type.NAME);
t.setProperty("compatVersion", Long.valueOf(2L), Type.LONG);
t.setProperty("type", "lucene", Type.STRING);
t.setProperty("async", "async", Type.STRING);
t = t.addChild("indexRules");
t = t.addChild("nt:base");
Tree propnode = t.addChild("properties");
Tree t1 = propnode.addChild("name");
t1.setProperty("name", "name");
t1.setProperty("propertyIndex", Boolean.valueOf(true), Type.BOOLEAN);
root.commit();
//Create TestNode
String h = "Hello" + System.currentTimeMillis();
String w = "World" + System.currentTimeMillis();
Tree test = root.getTree("/").addChild("test");
test.addChild("a").setProperty("name", Arrays.asList(new String[] { h, w }), Type.STRINGS);
test.addChild("b").setProperty("name", h);
root.commit();
//Search
String query = "select [jcr:path] from [nt:base] where isdescendantnode('/test') and name='World' option(traversal ok)";
List<String> paths = executeQuery(root, query, "JCR-SQL2", true, false);
for (String path : paths) {
    System.out.println("Path=" + path);
}
Can anyone share some sample code on how to create a Lucene index?
There are a couple of issues with what you're likely doing. The first is the error you're observing: since you're using InitialContent, which provisions an index with type="counter", you need .with(new NodeCounterEditorProvider()) while building the repository. That should avoid the error you are seeing.
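For illustration, a sketch of the adjusted builder chain; everything except the NodeCounterEditorProvider line is taken from the code in the question:

LuceneIndexProvider provider = new LuceneIndexProvider();
ContentRepository repository = new Oak(rdbNodeStore)
        .with(new OpenSecurityProvider())
        .with(new InitialContent())
        .with((QueryIndexProvider) provider)
        .with((Observer) provider)
        .with(new LuceneIndexEditorProvider())
        .with(new NodeCounterEditorProvider()) // supplies the editor for the [counter] index
        .withAsyncIndexing("async", 5)
        .createContentRepository();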
But your code would likely still not work, because Lucene indexes are asynchronous (which you've correctly configured). Due to that asynchronous behavior, you can't query immediately after adding the nodes.
I tried your code but had to add something like Thread.sleep(10*1000) before querying.
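For completeness, a sketch of what I mean; the sleep simply gives the async indexing lane (configured above with a 5-second cycle) time to run, and a real test would poll instead of sleeping:

root.commit();
// crude wait for the "async" indexing lane to pick up the new nodes
Thread.sleep(10 * 1000);
List<String> paths = executeQuery(root, query, "JCR-SQL2", true, false);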
As another side note, I'd recommend trying IndexDefinitionBuilder to build the Lucene index structure. That would let you replace
Tree index = root.getTree("/");
Tree t = index.addChild("oak:index");
t = t.addChild("lucene");
t.setProperty("jcr:primaryType", "oak:QueryIndexDefinition", Type.NAME);
t.setProperty("compatVersion", Long.valueOf(2L), Type.LONG);
t.setProperty("type", "lucene", Type.STRING);
t.setProperty("async", "async", Type.STRING);
t = t.addChild("indexRules");
t = t.addChild("nt:base");
Tree propnode = t.addChild("properties");
Tree t1 = propnode.addChild("name");
t1.setProperty("name", "name");
t1.setProperty("propertyIndex", Boolean.valueOf(true), Type.BOOLEAN);
root.commit();
with
IndexDefinitionBuilder idxBuilder = new IndexDefinitionBuilder();
idxBuilder.indexRule("nt:base").property("name").propertyIndex();
idxBuilder.build(root.getTree("/").addChild("oak:index").addChild("lucene"));
root.commit();
The latter approach, in my opinion, is less error-prone and more readable.

Lucene: Check index availability before creating it

I would like to know if there are ways to check the index's availability before creating it. I referred to a lot of threads, such as Lucene.NET - check if document exists in index, but my program didn't work.
I am using the TutorialsPoint example to implement this with Java 1.5.
Below is a little code snippet that I used:
LuceneTester.java
static String indexDir = "D:\\Lucene\\Index";
static String dataDir = "D:\\Lucene\\Data";
Main:
Directory directory = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(directory);
Term term = new Term("D:\\Lucene\\Data","record1.txt");
TermDocs docs = reader.termDocs(term);
if (docs.next()) {
    System.out.println("Already Indexed");
} else {
    tester = new LuceneTester();
    tester.createIndex();
    tester.search("Anish");
}
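Not part of the original thread, but for reference: if the goal is only to test whether an index already exists in the directory (rather than whether one particular document is present), Lucene 3.x has a static helper for that. A minimal sketch, assuming the 3.x API:

Directory directory = FSDirectory.open(new File(indexDir));
// true only if the directory contains a usable Lucene index
if (IndexReader.indexExists(directory)) {
    System.out.println("Already indexed");
} else {
    tester = new LuceneTester();
    tester.createIndex();
    tester.search("Anish");
}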

Fetch all facet records for a specific field in Lucene

I'm a beginner and have started learning Lucene. I've implemented a facet-counting program in Lucene 6.0.2 which outputs the counts for all facet fields. But now I want to search for a city, "California", and have the result show all facet counts with respect to that query. How do I do this?
Here is the code:
public List<FacetResult> runSearch() throws IOException {
    DirectoryReader indexReader = DirectoryReader.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(indexReader);
    TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
    FacetsConfig config = new FacetsConfig();

    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);

    List<FacetResult> results = new ArrayList<>();
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    results.add(facets.getTopChildren(10, "city"));
    results.add(facets.getTopChildren(10, "make"));
    results.add(facets.getTopChildren(10, "year"));
    results.add(facets.getTopChildren(10, "model"));

    indexReader.close();
    taxoReader.close();
    return results;
}
From your example I gather that your first search matched everything. To search for "* AND city:california" (pretty useless in this case, but it works as an example) you can build a DrillDownQuery from your original query and the filter, e.g.
Query baseQuery = new MatchAllDocsQuery();
DrillDownQuery ddQuery = new DrillDownQuery(config, baseQuery);
ddQuery.add("city", "california");
FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, ddQuery, 10, fc);
add can be called multiple times. If you call it for the same field, the values are combined with OR; if you call it for a different field, the restriction is combined with AND, e.g. (* AND (city:california OR city:york) AND year:1980), as sketched below.
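For instance, a short sketch of stacking several drill-down values on the question's index ("york" and "1980" are made-up example values):

DrillDownQuery ddQuery = new DrillDownQuery(config, new MatchAllDocsQuery());
ddQuery.add("city", "california"); // same dimension...
ddQuery.add("city", "york");       // ...ORed together
ddQuery.add("year", "1980");       // different dimension: ANDed
FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, ddQuery, 10, fc);
// facet counts computed from fc are now restricted to
// (city:california OR city:york) AND year:1980
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);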

Fetch FileStorageArea in FileNet by path [Document Moving]

I am a student and I'm new to FileNet. I am trying out some test code for moving a document's content.
Document doc = Factory.Document.getInstance(os, ClassNames.DOCUMENT,
        new Id("{33074B6E-FD19-4C0D-96FC-D809633D35BF}"));
FileStorageArea newDocClassFSA = Factory.FileStorageArea.fetchInstance(os,
        new Id("{3C6CEE68-D8CC-44A5-AEE7-CADE9752AA77}"), null);
doc.moveContent(newDocClassFSA);
doc.save(RefreshMode.REFRESH);
The thing is that I can fetch the document by its path, like this:
doc = Factory.Document.fetchInstance(os, "/DEMO/MASTERFILE/ZONE-X/Org.No-XXXXX/XXXX-X-XXXX-X.TIF",null);
but I can't fetch the StorageArea by path; it only takes an ID. Is there an easier way to move a file than this? How can I get the ID from the path without using queries?
There is no way to fetch a FileStorageArea by its path other than issuing a query and filtering by the RootDirectoryPath property.
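A minimal sketch of that query, assuming the standard CE Java API query classes (SearchSQL/SearchScope) and a hypothetical root directory path:

// Hypothetical path; substitute the storage area's actual root directory.
SearchSQL sql = new SearchSQL(
        "SELECT Id FROM FileStorageArea WHERE RootDirectoryPath = '/some/storage/root'");
SearchScope scope = new SearchScope(os);
IndependentObjectSet areas = scope.fetchObjects(sql, null, null, Boolean.FALSE);
// iterate 'areas' to read the Id of the matching FileStorageArea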
You can get the ID of the document using the path (here ruta holds the document path):
// Get the ID of the document
StringBuffer propertyNames = new StringBuffer();
propertyNames.append(PropertyNames.ID);
propertyNames.append(" ");
propertyNames.append(PropertyNames.PATH_NAME);

PropertyFilter pf = new PropertyFilter();
FilterElement felement = new FilterElement(Integer.valueOf(0), Long.valueOf(0),
        Boolean.TRUE, propertyNames.toString(), Integer.valueOf(0));
pf.addIncludeProperty(felement);

Document document = Factory.Document.fetchInstance(os, ruta, pf);
idDocument = document.get_Id().toString();
The idDocument string then holds it. Hope it helps.

Lucene 4.1 is ignoring FieldType.tokenized = false

I'm using Lucene 4.1 to index keyword/value pairs where the keywords and values are not real words; they are voltages, settings, etc. that should not be analyzed or tokenized, e.g. $P14R / 16777216. (This is FCS data, for any flow cytometrists out there.)
For indexing, I create a FieldType with indexed = true, stored = true, and tokenized = false. These mimic the ancient Field.Keyword from Lucene 1, for which I have the book. :-) I even freeze the fieldType.
I see these values in the debugger. I create the document and index.
When I read the index and document and look at the Fields in the debugger, I see all my fields. The names and fieldsData look correct. However, the FieldType is wrong. It shows indexed = true, stored = true, and tokenized = true. The result is that my searches (using a TermQuery) do not work.
How can I fix this? Thanks.
p.s. I am using a KeywordAnalyzer in the IndexWriterConfig. I'll try to post some demo code later, but it's off to my real job for today. :-)
DEMO CODE:
public class LuceneDemo {
    public static void main(String[] args) throws IOException {
        Directory lDir = new RAMDirectory();
        Analyzer analyzer = new KeywordAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(lDir, iwc);

        // BTW, Lucene, any way you could make this even more tedious???
        // Ever heard of builders, enums, or even old-fashioned bits?
        FieldType keywordFieldType = new FieldType();
        keywordFieldType.setStored(true);
        keywordFieldType.setIndexed(true);
        keywordFieldType.setTokenized(false);

        Document doc = new Document();
        doc.add(new Field("$foo", "$bar123", keywordFieldType));
        doc.add(new Field("contents", "$foo=$bar123", keywordFieldType));
        doc.add(new Field("$foo2", "$bar12345", keywordFieldType));
        Field onCreation = new Field("contents", "$foo2=$bar12345", keywordFieldType);
        doc.add(onCreation);
        System.out.println("When creating, the field's tokenized is " + onCreation.fieldType().tokenized());

        writer.addDocument(doc);
        writer.close();

        IndexReader reader = DirectoryReader.open(lDir);
        Document d1 = reader.document(0);
        Field readBackField = (Field) d1.getFields().get(0);
        System.out.println("When read back the field's tokenized is " + readBackField.fieldType().tokenized());

        IndexSearcher searcher = new IndexSearcher(reader);

        // exact match works
        Term term = new Term("$foo", "$bar123");
        Query query = new TermQuery(term);
        TopDocs results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);

        // partial match fails
        term = new Term("$foo", "123");
        query = new TermQuery(term);
        results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);

        // wildcard search works
        term = new Term("contents", "*$bar12345");
        query = new WildcardQuery(term);
        results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);
    }
}
The output will be:
When creating, the field's tokenized is false
When read back the field's tokenized is true
when searching for : $foo:$bar123 hits = 1
when searching for : $foo:123 hits = 0
when searching for : contents:*$bar12345 hits = 1
You can try to use a KeywordAnalyzer for the fields you don't want to tokenize.
If you need multiple analyzers (that is, if you have other fields that need tokenization), PerFieldAnalyzerWrapper is the way.
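A minimal sketch of that wrapper in Lucene 4.1 (the field names are taken from the demo code above):

Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("$foo", new KeywordAnalyzer());      // keep keyword fields as single tokens
perField.put("contents", new KeywordAnalyzer());

// fields not listed above fall back to StandardAnalyzer
Analyzer analyzer = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_41), perField);
// pass 'analyzer' to the IndexWriterConfig as in the demo code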
Note that many analyzers (StandardAnalyzer, for example) lower-case their tokens, so for fields indexed that way you need to convert your search strings to lower case first; KeywordAnalyzer leaves case untouched.
The demo code shows that the value of tokenized is different when you read it back. I'm not sure whether that is a bug.
But that isn't why the partial search fails. It fails because Lucene doesn't do partial term matches (unless you use a WildcardQuery), as explained elsewhere on Stack Overflow.
Been using Google so long I guess I didn't understand that. :-)
