I would like to know whether there is a way to check if the index already exists before creating it. I have looked at several threads, such as "Lucene.NET - check if document exists in index", but my program still didn't work.
I am using the TutorialsPoint example, implemented with Java 1.5.
Below is a small code snippet that I used:
LuceneTester.java
static String indexDir = "D:\\Lucene\\Index";
static String dataDir = "D:\\Lucene\\Data";
Main :
Directory directory = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(directory);
Term term = new Term("D:\\Lucene\\Data", "record1.txt");
TermDocs docs = reader.termDocs(term);
if (docs.next()) {
    System.out.println("Already Indexed");
} else {
    tester = new LuceneTester();
    tester.createIndex();
    tester.search("Anish");
}
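For reference, what I am essentially trying to arrive at is something like the sketch below (Lucene 3.x API, as in the tutorial). The "filename" field name is an assumption on my part - it has to match whatever field createIndex() actually writes, and I am not sure the directory-path field name in the snippet above is right either:

if (!IndexReader.indexExists(directory)) {
    // No index yet - build it first
    tester = new LuceneTester();
    tester.createIndex();
} else {
    IndexReader reader = IndexReader.open(directory);
    // "filename" is a guessed field name; use the field your indexer writes
    Term term = new Term("filename", "record1.txt");
    if (reader.docFreq(term) > 0) {
        System.out.println("Already Indexed");
    } else {
        tester = new LuceneTester();
        tester.createIndex();
        tester.search("Anish");
    }
    reader.close();
}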
Related
Since I am using v3 of the Google Drive API, I have to use a FileList instead of the parent and children lists, and now I want to search for the files inside a specific folder.
Can someone suggest what to do?
Here is the code I am using to search for a file:
private String searchFile(String mimeType, String fileName) throws IOException {
    Drive driveService = getDriveService();
    String fileId = null;
    String pageToken = null;
    do {
        FileList result = driveService.files().list()
                .setQ(mimeType)
                .setSpaces("drive")
                .setFields("nextPageToken, files(id, name)")
                .setPageToken(pageToken)
                .execute();
        for (File f : result.getFiles()) {
            System.out.printf("Found file: %s (%s)\n", f.getName(), f.getId());
            if (f.getName().equals(fileName)) {
                //fileFlag++;
                fileId = f.getId();
            }
        }
        pageToken = result.getNextPageToken();
    } while (pageToken != null);
    return fileId;
}
But this method gives me all of the files, which I don't want. I want to build a FileList that only contains the files inside a specific folder.
It is now possible to do this with the parents term in the q parameter of files.list. For example, if you want to find all spreadsheets in a folder with id folder_id, you can use the following q parameter (I am using Python in my example):
q="mimeType='application/vnd.google-apps.spreadsheet' and parents in '{}'".format(folder_id)
Remember that you first need to find out the id of the folder whose files you are looking for; you can do that with the same files.list call.
More information on the files.list method can be found here, and you can read more about the other terms you can use in the q parameter here.
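In the Java client used in the question, the same idea would look roughly like this (only a sketch; folderId is a placeholder for the id of the folder you are searching in, and driveService is the service object from the question):

// folderId is a placeholder - look it up first, e.g. with another files().list() call
String folderId = "your-folder-id";
FileList result = driveService.files().list()
        .setQ("mimeType='application/vnd.google-apps.spreadsheet' and '" + folderId + "' in parents")
        .setSpaces("drive")
        .setFields("nextPageToken, files(id, name)")
        .execute();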
To search in a specific directory you have to specify the following:
q : name = '2021' and mimeType = 'application/vnd.google-apps.folder' and '1fJ9TFZOe8G9PUMfC2Ts06sRnEPJQo7zG' in parents
This example searches for a folder called "2021" inside the folder with id 1fJ9TFZOe8G9PUMfC2Ts06sRnEPJQo7zG.
In my case, I'm writing code in C++, and the request URL would be:
string url = "https://www.googleapis.com/drive/v3/files?q=name+%3d+%272021%27+and+mimeType+%3d+%27application/vnd.google-apps.folder%27+and+trashed+%3d+false+and+%271fJ9TFZOe8G9PUMfC2Ts06sRnEPJQo7zG%27+in+parents";
Searching for files by folder name is not yet supported. It has been requested in this Google forum, but so far nothing has come of it. However, have a look at the alternative search filters available in Search for Files.
Be creative. For example, make sure the files within a certain folder contain a unique keyword, which you can then query using
fullText contains 'my_unique_keyword'
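With the Java client from the question this is just another q string passed to setQ; a sketch, where the keyword is whatever unique marker you put into your files:

FileList result = driveService.files().list()
        .setQ("fullText contains 'my_unique_keyword'")
        .setFields("nextPageToken, files(id, name)")
        .execute();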
You can use this approach to search for files in Google Drive:
Files.List request = this.driveService.files().list();
noOfRecords = 100;
request.setPageSize(noOfRecords);
request.setPageToken(nextPageToken);
String searchQuery = "(name contains 'Hello')";
if (StringUtils.isNotBlank(searchQuery)) {
    request.setQ(searchQuery);
}
request.execute();
I'm using Lucene 4.1 to index keyword/value pairs, where the keywords and values are not real words - they are things like voltages and settings that should not be analyzed or tokenized, e.g. $P14R / 16777216. (This is FCS data, for any flow cytometrists out there.)
For indexing, I create a FieldType with indexed = true, stored = true, and tokenized = false. These mimic the ancient Field.Keyword from Lucene 1, for which I have the book. :-) I even freeze the fieldType.
I see these values in the debugger. I create the document and index.
When I read the index and document and look at the Fields in the debugger, I see all my fields. The names and fieldsData look correct. However, the FieldType is wrong. It shows indexed = true, stored = true, and tokenized = true. The result is that my searches (using a TermQuery) do not work.
How can I fix this? Thanks.
p.s. I am using a KeywordAnalyzer in the IndexWriterConfig. I'll try to post some demo code later, but it's off to my real job for today. :-)
DEMO CODE:
public class LuceneDemo {
    public static void main(String[] args) throws IOException {
        Directory lDir = new RAMDirectory();
        Analyzer analyzer = new KeywordAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(lDir, iwc);

        // BTW, Lucene, anyway you could make this even more tedious???
        // ever heard of builders, Enums, or even old fashioned bits?
        FieldType keywordFieldType = new FieldType();
        keywordFieldType.setStored(true);
        keywordFieldType.setIndexed(true);
        keywordFieldType.setTokenized(false);

        Document doc = new Document();
        doc.add(new Field("$foo", "$bar123", keywordFieldType));
        doc.add(new Field("contents", "$foo=$bar123", keywordFieldType));
        doc.add(new Field("$foo2", "$bar12345", keywordFieldType));
        Field onCreation = new Field("contents", "$foo2=$bar12345", keywordFieldType);
        doc.add(onCreation);
        System.out.println("When creating, the field's tokenized is " + onCreation.fieldType().tokenized());

        writer.addDocument(doc);
        writer.close();

        IndexReader reader = DirectoryReader.open(lDir);
        Document d1 = reader.document(0);
        Field readBackField = (Field) d1.getFields().get(0);
        System.out.println("When read back the field's tokenized is " + readBackField.fieldType().tokenized());

        IndexSearcher searcher = new IndexSearcher(reader);

        // exact match works
        Term term = new Term("$foo", "$bar123");
        Query query = new TermQuery(term);
        TopDocs results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);

        // partial match fails
        term = new Term("$foo", "123");
        query = new TermQuery(term);
        results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);

        // wildcard search works
        term = new Term("contents", "*$bar12345");
        query = new WildcardQuery(term);
        results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);
    }
}
output will be:
When creating, the field's tokenized is false
When read back the field's tokenized is true
when searching for : $foo:$bar123 hits = 1
when searching for : $foo:123 hits = 0
when searching for : contents:*$bar12345 hits = 1
You can try to use a KeywordAnalyzer for the fields you don't want to tokenize.
If you need multiple analyzers (that is, if you have other fields that need tokenization), PerFieldAnalyzerWrapper is the way to go.
Also note that analyzers such as StandardAnalyzer lower-case tokens at index time, so for fields indexed that way you need to lower-case your search strings as well; a KeywordAnalyzer leaves the value untouched.
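If it helps, here is a minimal sketch of the PerFieldAnalyzerWrapper idea against the Lucene 4.1 API; the "description" field is purely hypothetical, standing in for whichever of your fields actually needs tokenization:

// org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper
Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("description", new StandardAnalyzer(Version.LUCENE_41)); // hypothetical tokenized field
// every other field falls back to the KeywordAnalyzer, i.e. one term per value
Analyzer analyzer = new PerFieldAnalyzerWrapper(new KeywordAnalyzer(), perField);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);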
The demo code shows that the value of tokenized is different when the field is read back; I'm not sure whether that is a bug or not.
But that isn't why the partial search fails. The partial search fails because a TermQuery only matches whole terms - Lucene doesn't do partial matches unless you use a WildcardQuery, as explained in other answers here on Stack Overflow.
I've been using Google for so long that I guess I just assumed otherwise. :-)
I need to normalize a CSV file. I followed this article by Jeff Heaton. This is some of my code:
File sourceFile = new File("Book1.csv");
File targetFile = new File("Book1_norm.csv");
EncogAnalyst analyst = new EncogAnalyst();
AnalystWizard wizard = new AnalystWizard(analyst);
wizard.wizard(sourceFile, true, AnalystFileFormat.DECPNT_COMMA);
final AnalystNormalizeCSV norm = new AnalystNormalizeCSV();
norm.analyze(sourceFile, false, CSVFormat.ENGLISH, analyst);
norm.setProduceOutputHeaders(false);
norm.normalize(targetFile);
The only difference between my code and the article's code is this line:
norm.setOutputFormat(CSVFormat.ENGLISH);
I tried to use it, but it seems that method doesn't exist in Encog 3.1.0. The error I get is the following (it looks like the problem is in the line norm.normalize(targetFile)):
Exception in thread "main" org.encog.app.analyst.AnalystError: Can't find column: 11700
at org.encog.app.analyst.util.CSVHeaders.find(CSVHeaders.java:187)
at org.encog.app.analyst.csv.normalize.AnalystNormalizeCSV.extractFields(AnalystNormalizeCSV.java:77)
at org.encog.app.analyst.csv.normalize.AnalystNormalizeCSV.normalize(AnalystNormalizeCSV.java:192)
at IEinSoftware.main(IEinSoftware.java:55)
I added a FAQ that shows how to normalize a CSV file. http://www.heatonresearch.com/faq/4/2
Here's a function to do it (this one is in C#, using the .NET version of Encog) - of course you still need to create the analyst first:
private EncogAnalyst _analyst;

public void NormalizeFile(FileInfo SourceDataFile, FileInfo NormalizedDataFile)
{
    var wizard = new AnalystWizard(_analyst);
    wizard.Wizard(SourceDataFile, _useHeaders, AnalystFileFormat.DecpntComma);

    var norm = new AnalystNormalizeCSV();
    norm.Analyze(SourceDataFile, _useHeaders, CSVFormat.English, _analyst);
    norm.ProduceOutputHeaders = _useHeaders;
    norm.Normalize(NormalizedDataFile);
}
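For a Java project like the one in the question, a roughly equivalent helper would be the following - just a sketch that reuses the exact Encog calls already shown in the question:

private EncogAnalyst analyst = new EncogAnalyst();

public void normalizeFile(File sourceFile, File targetFile, boolean useHeaders) {
    AnalystWizard wizard = new AnalystWizard(analyst);
    wizard.wizard(sourceFile, useHeaders, AnalystFileFormat.DECPNT_COMMA);
    AnalystNormalizeCSV norm = new AnalystNormalizeCSV();
    norm.analyze(sourceFile, useHeaders, CSVFormat.ENGLISH, analyst);
    norm.setProduceOutputHeaders(useHeaders);
    norm.normalize(targetFile);
}

Whatever value you choose for the headers flag, keep it consistent: in the question the wizard is called with true but analyze with false, which is worth double-checking, since a mismatch there can leave the analyst looking for columns in the wrong places.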
For different reasons I have to work with the latest release of Lucene's API.
The API isn't well documented yet, so I find myself unable to perform a simple addDocument().
Here is the Writer initialization:
analyzer = new StopAnalyzer(Version.LUCENE_40);
config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
writer = new IndexWriter(FSDirectory.open(new File(ConfigUtil.getProperty("lucene.directory"))), config);
The simple toDocument method:
public static Document getDocument(User user) {
    Document doc = new Document();

    FieldType storedType = new FieldType();
    storedType.setStored(true);
    storedType.setTokenized(false);

    // Store user data
    doc.add(new Field(USER_ID, user.getId().toString(), storedType));
    doc.add(new Field(USER_NAME, user.getFirstName() + user.getLastName(), storedType));

    FieldType unstoredType = new FieldType();
    unstoredType.setStored(false);
    unstoredType.setTokenized(true);

    // Analyze Location
    String tokens = "";
    if (user.getLocation() != null && !user.getLocation().isEmpty()) {
        for (Tag location : user.getLocation()) tokens += location.getName() + " ";
        doc.add(new Field(USER_LOCATION, tokens, unstoredType));
    }

    return doc;
}
When running:
Document userDoc = DocumentManager.getDocument(userWrap);
IndexAccess.getWriter().addDocument(userDoc);
This is the error message I get:
class org.apache.lucene.analysis.util.ReusableAnalyzerBase overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
It may be a simple matter but I cannot find any reference to help with this problem. I'm using a default analyzer and I followed a tutorial in order to avoid the deprecated Field.Index.ANALYZED
This is due to some kind of JAR version mismatch. You may be depending on a contrib JAR that in turn depends on a different version of Lucene. Try to get hold of the exact dependency set at runtime and look for version mismatches.
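If the project is built with Maven (an assumption - adapt to your build tool), listing the resolved dependency tree is a quick way to spot two different Lucene versions on the classpath:

mvn dependency:tree -Dincludes=org.apache.lucene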
First post on this wonderful site!
My goal is to use hierarchical facets for searching an index using Lucene. However, my facets need to be delimited by a character other than '/' (in this case, '~'). Example:
Categories
Categories~Category1
Categories~Category2
I have created a class that implements the FacetIndexingParams interface (a copy of DefaultFacetIndexingParams with the DEFAULT_FACET_DELIM_CHAR param set to '~').
Paraphrased indexing code (using FSDirectory for both index and taxonomy):
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34)
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer)
IndexWriter writer = new IndexWriter(indexDir, config)
TaxonomyWriter taxo = new LuceneTaxonomyWriter(taxDir, OpenMode.CREATE)
Document doc = new Document()
// Add bunch of Fields... hidden for the sake of brevity
List<CategoryPath> categories = new ArrayList<CategoryPath>()
row.tags.split('\\|').each{ tag ->
    def cp = new CategoryPath()
    tag.split('~').each{
        cp.add(it)
    }
    categories.add(cp)
}
NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo, facetIndexingParams)
categoryDocBuilder.setCategoryPaths(categories).build(doc)
writer.addDocument(doc)
// Commit and close both writer and taxo.
Search code paraphrased:
// Create index and taxonomoy readers to get info from index and taxonomy
IndexReader indexReader = IndexReader.open(indexDir)
TaxonomyReader taxo = new LuceneTaxonomyReader(taxDir)
Searcher searcher = new IndexSearcher(indexReader)
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", new StandardAnalyzer(Version.LUCENE_34))
parser.setAllowLeadingWildcard(true)
Query q = parser.parse(query)
TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true)
List<FacetResult> res = null
NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
FacetSearchParams facetSearchParams = new FacetSearchParams(facetIndexingParams)
CountFacetRequest cfr = new CountFacetRequest(new CategoryPath(""), 99)
cfr.setDepth(2)
cfr.setSortBy(SortBy.VALUE)
facetSearchParams.addFacetRequest(cfr)
FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo)
def cp = new CategoryPath("Category~Category1", (char)'~')
searcher.search(DrillDown.query(q, cp), MultiCollector.wrap(tdc, facetsCollector))
The results always return a list of facets in the form of "Category/Category1".
I have used the Luke tool to look at the index and it appears the facets are being delimited by the '~' character in the index.
What is the best route to do this? Any help is greatly appreciated!
I have figured out the issue. The search and indexing are working as they are supposed to; the problem was how I was reading the facet results. I was using:
res = facetsCollector.getFacetResults()
res.each{ result ->
    result.getFacetResultNode().getLabel().toString()
}
What I needed to use was :
res = facetsCollector.getFacetResults()
res.each{ result ->
    result.getFacetResultNode().getLabel().toString((char)'~')
}
The difference is the parameter passed to the toString function!
Easy to overlook, tough to find.
Hope this helps others.