Using Lucene, how to index TXT files into different fields? - java

I am using the NSF data, whose format is plain txt. I have indexed these files and can send a query and get several results. But how can I search within a selected field (e.g. title)? Since all of the NSF data are plain txt files, I don't think Lucene can recognize which part of a file is the "title" or anything else. Should I first convert the txt files to XML files (with tags telling Lucene which part is the "title")? Can Lucene do that? I have no idea how to split the txt files into several fields. Can anyone give me some suggestions? Thanks a lot!
BTW, every txt file looks like this:
---begin---
Title: Mitochondrial DNA and Historical Demography
Type: Award
Date: August 1, 1991
Number: 9000006
Abstract: asdajsfhsjdfhsjngfdjnguwiehfrwiuefnjdnfsd
----end----

You have to split the text into its several parts. You can then use the resulting strings to create a field for each part of the text, e.g. title.
Create your Lucene document with the fields like this:
Document doc = new Document();
// Field.Index.TOKENIZED is the pre-4.0 API; in recent Lucene versions use
// new TextField("title", titleString, Field.Store.NO) instead.
doc.add(new Field("title", titleString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("abstract", abstractString, Field.Store.NO, Field.Index.TOKENIZED));
and so on. After indexing the document you can search in the title like this: title:dna
More complex queries mixing multiple fields are also possible: +title:dna +abstract:"some example text" -number:935353
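A minimal sketch of the splitting step, assuming every field sits on its own "Name: value" line as in the sample record (a multi-line Abstract would need extra handling for continuation lines); the class name NsfParser is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class NsfParser {
    // Splits one NSF-style record into field-name -> value pairs.
    // Lines look like "Title: Mitochondrial DNA ...", so we split on the first colon.
    public static Map<String, String> parseRecord(String record) {
        Map<String, String> fields = new HashMap<>();
        for (String line : record.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                fields.put(name, value);
            }
        }
        return fields;
    }
}
```

Each value in the returned map can then be passed to the corresponding Field when building the Lucene document.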

Related

How to deal with identifier fields in Lucene?

I've stumbled upon a problem similar to the one described in this other question: I have a field named something like 'type', which is an identifier, i.e. it's case-sensitive and I want to use it for exact searches: no tokenisation, no similarity searches, just plain "find exactly 'Sport:01'". I might benefit from 'Sport*', but that's not very important in my case.
I cannot make it work: I thought the right kind of field to store this in is StringField.TYPE_STORED, with DOCS_AND_FREQS_AND_POSITIONS and setOmitNorms(true). However, this way I can't correctly resolve a query like +type:"RockMusic" +title:"a sample title" using the standard analyzer because, as far as I understand, the analyzer converts the input into lower case (i.e. rockmusic) while the type is stored in its original mixed-case form (hence, I cannot resolve it even if I remove the title clause).
I'd like to mix case-insensitive search over title with case-sensitive search over type, since I have cases where type := BRAIN is an acronym and is different from 'Brain'.
So, what's the best way to manage fields and searches like the above? Are there alternatives other than text and string fields?
I'm using Lucene 6.6.0, but this is a general issue, regarding multiple (all?) Lucene versions.
Some code showing the details is here (see testIdMixedCaseID*). The real use case is rather more complicated; if you want to take a look, the problem is with the field CC_FIELD, which might be 'BioProc', and nothing can be found in that case.
Please note I need to use plain Lucene, not Solr or Elasticsearch.
The following notes are based on Lucene 8.x, not on Lucene 6.6 - so there may be some syntax differences - but I take your point that any such differences should be incidental to your question.
Here are some notes, where I will focus on the following aspect of your question:
However, this way I can't correctly resolve a query like: +type:"RockMusic" +title:"a sample title" using the standard analyzer
I think there are 2 parts to this:
Firstly, the query example using "a sample title" will - as you say - not work well with how a standard analyzer works - for the reasons you state.
But, secondly, it is possible to combine the two types of query you want to use, in a way which I believe gets you what you need: An exact match for the type field (e.g. RockMusic) and a more traditional tokenized & case-insensitive result for the title field (a sample title).
Here is how I would do that:
Here is some simple test data:
public static void buildIndex() throws IOException {
    final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    Document doc;
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "another different title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "Rock Music", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);
    }
}
Here is the query code:
public static void doSearch() throws QueryNodeException, ParseException, IOException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
    IndexSearcher searcher = new IndexSearcher(reader);

    TermQuery typeQuery = new TermQuery(new Term("type", "RockMusic"));

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("title", analyzer);
    Query titleQuery = parser.parse("A Sample Title");

    Query query = new BooleanQuery.Builder()
            .add(typeQuery, BooleanClause.Occur.MUST)
            .add(titleQuery, BooleanClause.Occur.MUST)
            .build();

    System.out.println("Query: " + query.toString());
    System.out.println();

    TopDocs results = searcher.search(query, 100);
    ScoreDoc[] hits = results.scoreDocs;
    for (ScoreDoc hit : hits) {
        System.out.println("doc = " + hit.doc + "; score = " + hit.score);
        Document doc = searcher.doc(hit.doc);
        System.out.println("Type = " + doc.get("type")
                + "; Title = " + doc.get("title"));
        System.out.println();
    }
}
The output from the above query is as follows:
Query: +type:RockMusic +(title:a title:sample title:title)
doc = 0; score = 0.7016101
Type = RockMusic; Title = a sample title
doc = 1; score = 0.2743341
Type = RockMusic; Title = another different title
As you can see, this query is a little different from the one taken from your question.
But the list of found documents shows that (a) the Rock Music document was not found at all (good - because Rock Music does not match the "type" search term of RockMusic); and (b) the title a sample title got a far higher match score than the another different title document, when searching for A Sample Title.
Additional notes:
This query works by combining a StringField exact search with a more traditional TextField tokenized search - this latter search being processed by the StandardAnalyzer (matching how the data was indexed in the first place).
I am making an assumption about the score ranking being useful to you - but for title searches, I think that is reasonable.
This approach would also apply to your BRAIN vs. brain example, for StringField data.
(I also assume that, for a user interface, a user could select the "RockMusic" type value from a drop-down, and enter the "A Sample Title" search in an input field - but this is getting off-topic, I think).
You could obviously enhance the analyzer to include stop-words, and so on, as needed.
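One way to do that enhancement, sketched with Lucene's CustomAnalyzer builder (Lucene 8.x packages; the class name StopAnalyzerFactory is illustrative). This builds a StandardAnalyzer-like chain with a stop-word filter added:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class StopAnalyzerFactory {
    // Builds an analyzer equivalent to StandardAnalyzer plus an English
    // stop-word filter, using the SPI names of the factories.
    public static Analyzer build() throws IOException {
        return CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("stop") // default English stop-word set
                .build();
    }
}
```

Use this analyzer both in the IndexWriterConfig and in the QueryParser so indexing and searching stay consistent.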
Of course, my examples involve hard-coded data - but it would not take much to generalize this approach to handle dynamically-provided search terms.
Hope that this makes sense - and that I understood the problem correctly.
Going to answer myself...
I discovered what @andrewjames outlines in his excellent analysis by running a number of tests of my own. Essentially, fields like "type" don't play well with the standard analyzer; they are best indexed and searched with an analyzer like KeywordAnalyzer, which, in practice, stores the original value as-is and searches it accordingly.
Most real cases are like my example, i.e. a mix of ID-like fields, which need exact matching, and fields like 'title' or 'description', which best serve user searches with per-token matching, word-based scoring, stop-word elimination, etc.
Because of that, PerFieldAnalyzerWrapper (see also my sample code, linked above) is a great help: a wrapper analyzer that dispatches analysis to field-specific analyzers on a field-name basis.
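A minimal sketch of that setup (Lucene 8.x packages; adjust for 6.6, and note the class name Analyzers is illustrative): KeywordAnalyzer for the exact-match "type" field, StandardAnalyzer for everything else.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class Analyzers {
    // "type" keeps its original mixed-case value as a single token;
    // all other fields fall back to the standard tokenized analysis.
    public static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("type", new KeywordAnalyzer());
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
}
```

Pass the same wrapper to both the IndexWriterConfig and the QueryParser so each field is analyzed identically at index and query time.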
One thing to add: it's still not clear to me which analyzer is used when a query is built without a parser (e.g. using new TermQuery(new Term(fname, fval))), so for now I use a QueryParser.

Solr search not working properly

I am searching for String Kansas City in description field.
"q":"description: *Kansas City*", but I am getting the results for both Kansas and City. Also it is getting the results from content field as well. I am not sure why it is fetching results from content field. Please suggest me if I am doing any error in my query.
Your quoting is wrong; it should be, for example:
description:"kansas city"
What are the stars for?
After tokenizing and parsing, the query kansas city is split into "kansas" and "city", and filters are applied per the fieldType definition; each token is then searched in the field name specified.
description:*Kansas keeps the field name, but after tokenizing/word splitting, "city" becomes a separate word for which you didn't specify a field name, so by default it is searched in the default field (which may be content in your case):
defaultsearchfield:city*
So in your case the query is parsed as description:kansas and content:city. You can see this yourself by adding debugQuery=on to the URL in your browser.

blank output on IndriRunQuery in lemur project

I'm using Lemur for an NLP project, and I indexed some data successfully.
I want to run a query over the index files with the IndriRunQuery command.
parameter file:
<parameters>
  <index>PATH-TO-INDEX-DIRECTORY</index>
  <query>
    <number>1</number>
    <text>QUERY SAMPLE STRING</text>
  </query>
  <count>50</count>
</parameters>
There is no error, but there is no answer either: just a blank line in the output.
I found the answer myself.
My documents in the indexing step weren't in the format the Lemur documentation describes.
The documentation says to make the training documents in this format:
<DOC>
<DOCNO>DOCUMENT-ID</DOCNO>
<TEXT>DOCUMENT-PLAIN-TEXT</TEXT>
</DOC>
I indexed the documents again with buildIndex [parameterFile],
then ran IndriRunQuery and it worked well.

Extracting a column from a paragraph from a csv file using java

MAJOR ACC NO,MINOR ACC NO,STD CODE,TEL NO,DIST CODE
7452145,723456, 01,4213036,AAA
7254287,7863265, 01,2121920,AAA
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FOOCTYP,FOOCDES,FOOCAMT,FSTD,FTELNO,FNORECON,FXFRACCN,FLANGIND,CUR
12345,71234,7643234,AAA,001,DX,WLR Promotion - Insitu /Pre-Cabled PSTN Connection,-37.87,,,0,,E,EUR
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FORDNO,FREF,FCHGDES,FCHGAMT,CUR,FORENFRM,FORENTO
3242241,72349489,2345352,AAA,001,30234843P ,1,NEW CONNECTION - PRECABLED CHARGE,37.87,EUR,2123422,201201234
12123471,7618412389,76333232,AAA,001,3123443P ,2,BROKEN PERIOD RENTAL,5.40,EUR,201234523,20123601
I have a csv file something like the one above and I want to extract certain columns from it. For example, I want to extract the first column of the first paragraph. I'm kind of new to Java, but I am able to read the file; now I want to extract certain columns from the different paragraphs. Any help will be appreciated.
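A minimal sketch of the column extraction, assuming the lines of one "paragraph" have already been isolated (e.g. by treating each non-numeric header line as the start of a new paragraph) and that fields contain no embedded commas; for real-world CSV with quoting, a dedicated CSV library is safer. The class name CsvColumnExtractor is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class CsvColumnExtractor {
    // Extracts one column (by zero-based index) from the data lines of a
    // single CSV paragraph, header line excluded.
    public static List<String> extractColumn(List<String> lines, int column) {
        List<String> values = new ArrayList<>();
        for (String line : lines) {
            String[] cells = line.split(",", -1); // -1 keeps trailing empty cells
            if (column < cells.length) {
                values.add(cells[column].trim());
            }
        }
        return values;
    }
}
```

For the first paragraph above, extractColumn(dataLines, 0) would yield the MAJOR ACC NO values.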

Extracting word count From an XML File

(This question is related to the previous question I posted earlier on stackoverflow...here is the link
Extracting Values From an XML File Either using XPath, SAX or DOM for this Specific Scenario)
The question is that, keeping the above case in mind, instead of getting sentences, I would like to get the words written by each participant across all sentences. For example, if the word 'Budget' is used ten times in total, seven times by participant 'Dolske' and three times by others, then I need the list of all words and how many times each is written by each participant. Also the list of words in each turn?
What is the best strategy to achieve this? Any sample codes?
The XML is attached here (you can also check it in the referred question)
"(495584) Firefox - search suggestions passes wrong previous result to form history"
<Turn>
<Date>'2009-06-14 18:55:25'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
<Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
<Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
<Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 12:07:34'</Date>
<From>'Gavin Sharp'</From>
<Text>
<Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
<Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 13:17:56'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
<Sentence ID = "5.2"> &gt; (From update of attachment 383211 [details] [details])</Sentence>
<Sentence ID = "5.3"> &gt; Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
<Sentence ID = "5.4"> Good point.</Sentence>
<Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
<Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
</Text>
</Turn>
.....
and so on
Help will be highly appreciated...
To get all of the values, it's better to use a StAX parser; it is well suited to this kind of task. Then split all of the sentences into words and do whatever you want.
For example, create a model with a class Turn, where you store the author and the sentences, write services for this class, and go on. :)
To split a sentence into words, use split() or StringTokenizer, though StringTokenizer is a legacy class and split() is preferred. To use split(), create a temp array, like
stringArray = sentence.toString().split(" ");
or sentence.getValue(), whatever fits your model.
The method parameter is a regex; in your case it is a simple space, since that splits the sentence into words. Then you can iterate over the words and count what you need.
In the case of an ArrayList, use List.toArray() to get an array view of your list.
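A minimal StAX sketch of the counting step, assuming the turns are wrapped in a single root element so the input is well-formed XML; the class name WordCounter is illustrative:

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class WordCounter {
    // Returns participant -> (word -> count), accumulated over all <Sentence>
    // elements inside each <Turn>, keyed by the preceding <From> value.
    public static Map<String, Map<String, Integer>> countWords(String xml)
            throws XMLStreamException {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        String author = null;
        boolean collecting = false;
        boolean inFrom = false;
        StringBuilder text = new StringBuilder();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("From".equals(r.getLocalName())) {
                        inFrom = true;
                        collecting = true;
                        text.setLength(0);
                    } else if ("Sentence".equals(r.getLocalName())) {
                        collecting = true;
                        text.setLength(0);
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (collecting) text.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("From".equals(r.getLocalName())) {
                        author = text.toString().replace("'", "").trim();
                        inFrom = false;
                        collecting = false;
                    } else if ("Sentence".equals(r.getLocalName()) && !inFrom) {
                        Map<String, Integer> byWord =
                                counts.computeIfAbsent(author, k -> new HashMap<>());
                        for (String w : text.toString().toLowerCase().split("\\W+")) {
                            if (!w.isEmpty()) byWord.merge(w, 1, Integer::sum);
                        }
                        collecting = false;
                    }
                    break;
            }
        }
        return counts;
    }
}
```

From the resulting map you can read off, per participant, how often each word occurs; summing across participants gives the overall total for a word like 'Budget'.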
