Generate Term-Document matrix using Lucene 4.4 - java

I'm trying to create a term-document matrix for a small corpus so that I can experiment further with LSI. However, I couldn't find a way to do it with Lucene 4.4.
I know how to get the term vector for each document, as follows:
//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size()); //just testing
I thought I could just put all the term vectors together as columns to get the matrix. However, the term vectors of different documents have different sizes, and I don't know how to pad them with zeros, so this approach does not work.
Hence, I wonder if someone can show me how to create a term-document matrix with Lucene 4.4? (If possible, please show sample code.)
If Lucene does not support this, what other way would you recommend?
Many thanks,

I found the solution to my problem here. Mr. Sujit gives a very detailed example, although the code is written for an older version of Lucene, so many things have to be changed. I'll update the details when I finish my code.
Here is my solution, which works on Lucene 4.4:
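import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
// assuming Apache Commons Math 3.x; older releases use org.apache.commons.math.linear
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
public class BuildTermDocumentMatrix {
private final IndexReader reader;
private final IndexSearcher searcher;
private final File corpus;
private final Map<String, Integer> termIdMap; // term text -> row index
public BuildTermDocumentMatrix(File index, File corpus) throws IOException {
reader = DirectoryReader.open(FSDirectory.open(index));
searcher = new IndexSearcher(reader);
this.corpus = corpus;
termIdMap = computeTermIdMap(reader);
}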
/**
* Map each term to a fixed integer so that we can build the term-document matrix later.
* The integer is the row assigned to that term in the matrix.
*/
private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
Map<String,Integer> termIdMap = new HashMap<String,Integer>();
int id = 0;
Fields fields = MultiFields.getFields(reader);
Terms terms = fields.terms("contents");
TermsEnum itr = terms.iterator(null);
BytesRef term = null;
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
if (termIdMap.containsKey(termText))
continue;
//System.out.println(termText);
termIdMap.put(termText, id++);
}
return termIdMap;
}
/**
* build term-document matrix for the given directory
*/
public RealMatrix buildTermDocumentMatrix () throws IOException {
//iterate through directory to work with each doc
int col = 0;
int numDocs = countDocs(corpus); //get the number of documents here
int numTerms = termIdMap.size(); //total number of terms
RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);
for (File f : corpus.listFiles()) {
if (!f.isHidden() && f.canRead()) {
//I build term document matrix for a subset of corpus so
//I need to lookup document by path name.
//If you build for the whole corpus, just iterate through all documents
String path = f.getPath();
BooleanQuery pathQuery = new BooleanQuery();
pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
TopDocs hits = searcher.search(pathQuery, 1);
//get term vector
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
TermsEnum itr = termVector.iterator(null);
BytesRef term = null;
//compute term weight
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
int row = termIdMap.get(termText);
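// frequency of the term within this document
long termFreq = itr.totalTermFreq();
// a term vector covers a single document, so itr.docFreq() would always be 1;
// ask the index reader for the index-wide document frequency instead
long docCount = reader.docFreq(new Term("contents", term));
// computeTfIdfWeight and countDocs are helpers that are not shown in this answer
double weight = computeTfIdfWeight(termFreq, docCount, numDocs);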
tdMatrix.setEntry(row, col, weight);
}
col++;
}
}
return tdMatrix;
}
}
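The class above calls computeTfIdfWeight without showing it. Below is a minimal sketch of what such a helper could look like, assuming the common tf × ln(N/df) weighting; the original author may well have used a different variant, and the class name TfIdfSketch is hypothetical.

```java
/** Hypothetical stand-in for the helper that the answer calls but does not show. */
final class TfIdfSketch {

    /**
     * Assumed weighting: termFreq * ln(numDocs / docFreq).
     * Returns 0 when the term does not occur or the counts are not usable.
     */
    static double computeTfIdfWeight(long termFreq, long docFreq, int numDocs) {
        if (termFreq <= 0 || docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        double idf = Math.log((double) numDocs / (double) docFreq);
        return termFreq * idf;
    }

    public static void main(String[] args) {
        // A term occurring 3 times in one document and present in 2 of 8 documents.
        System.out.println(computeTfIdfWeight(3, 2, 8));
        // A term present in every document gets weight 0 under this scheme.
        System.out.println(computeTfIdfWeight(5, 8, 8));
    }
}
```

With this variant, terms that appear in every document contribute nothing to the matrix, which is usually what you want before running LSI.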

You can also refer to the following code. In recent Lucene versions this is quite easy.
public void testSparseFreqDoubleArrayConversion() throws Exception {
Terms fieldTerms = MultiFields.getTerms(index, "text");
if (fieldTerms != null && fieldTerms.size() != -1) {
IndexSearcher indexSearcher = new IndexSearcher(index);
for (ScoreDoc scoreDoc : indexSearcher.search(new MatchAllDocsQuery(), Integer.MAX_VALUE).scoreDocs) {
Terms docTerms = index.getTermVector(scoreDoc.doc, "text");
Double[] vector = DocToDoubleVectorUtils.toSparseLocalFreqDoubleArray(docTerms, fieldTerms);
assertNotNull(vector);
assertTrue(vector.length > 0);
}
}
}
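The padding problem from the question comes down to giving every document a vector over the same ordered vocabulary, with zeros for the terms it does not contain. Here is a plain-Java sketch of that idea, independent of Lucene. The class and method names are hypothetical, and the per-document frequency maps stand in for whatever a term vector would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class TermDocumentMatrixSketch {

    /** Assign each distinct term a fixed row index, in sorted term order. */
    static Map<String, Integer> buildTermIds(List<Map<String, Integer>> docs) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (Map<String, Integer> doc : docs) {
            vocabulary.addAll(doc.keySet());
        }
        Map<String, Integer> termIds = new LinkedHashMap<>();
        for (String term : vocabulary) {
            termIds.put(term, termIds.size());
        }
        return termIds;
    }

    /** Rows are terms, columns are documents; terms absent from a document stay at 0. */
    static double[][] buildMatrix(List<Map<String, Integer>> docs) {
        Map<String, Integer> termIds = buildTermIds(docs);
        double[][] matrix = new double[termIds.size()][docs.size()];
        for (int col = 0; col < docs.size(); col++) {
            for (Map.Entry<String, Integer> e : docs.get(col).entrySet()) {
                matrix[termIds.get(e.getKey())][col] = e.getValue();
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("lucene", 2);
        first.put("index", 1);
        Map<String, Integer> second = new HashMap<>();
        second.put("index", 3);
        second.put("matrix", 1);
        List<Map<String, Integer>> docs = new ArrayList<>();
        docs.add(first);
        docs.add(second);
        // One printed line per term row, in sorted term order.
        for (double[] row : buildMatrix(docs)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because the array is allocated once with the full vocabulary size, the zero padding happens implicitly: only the cells for terms that actually occur are overwritten.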

Related

How to extract key phrases from a given text with OpenNLP?

I'm using Apache OpenNLP and I'd like to extract the keyphrases of a given text. I'm already gathering entities, but I would like to have keyphrases.
The problem is that I can't use TF-IDF because I don't have models for that and I only have a single text (not multiple documents).
Here is some code (prototyped, not so clean):
public List<KeywordsModel> extractKeywords(String text, NLPProvider pipeline) {
SentenceDetectorME sentenceDetector = new SentenceDetectorME(pipeline.getSentencedetecto("en"));
TokenizerME tokenizer = new TokenizerME(pipeline.getTokenizer("en"));
POSTaggerME posTagger = new POSTaggerME(pipeline.getPosmodel("en"));
ChunkerME chunker = new ChunkerME(pipeline.getChunker("en"));
ArrayList<String> stopwords = pipeline.getStopwords("en");
Span[] sentSpans = sentenceDetector.sentPosDetect(text);
Map<String, Float> results = new LinkedHashMap<>();
SortedMap<String, Float> sortedData = new TreeMap(new MapSort.FloatValueComparer(results));
float sentenceCounter = sentSpans.length;
float prominenceVal = 0;
int sentences = sentSpans.length;
for (Span sentSpan : sentSpans) {
prominenceVal = sentenceCounter / sentences;
sentenceCounter--;
String sentence = sentSpan.getCoveredText(text).toString();
int start = sentSpan.getStart();
Span[] tokSpans = tokenizer.tokenizePos(sentence);
String[] tokens = new String[tokSpans.length];
for (int i = 0; i < tokens.length; i++) {
tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
}
String[] tags = posTagger.tag(tokens);
Span[] chunks = chunker.chunkAsSpans(tokens, tags);
for (Span chunk : chunks) {
if ("NP".equals(chunk.getType())) {
int npstart = start + tokSpans[chunk.getStart()].getStart();
int npend = start + tokSpans[chunk.getEnd() - 1].getEnd();
String potentialKey = text.substring(npstart, npend);
if (!results.containsKey(potentialKey)) {
boolean hasStopWord = false;
String[] pKeys = potentialKey.split("\\s+");
if (pKeys.length < 3) {
for (String pKey : pKeys) {
for (String stopword : stopwords) {
if (pKey.toLowerCase().matches(stopword)) {
hasStopWord = true;
break;
}
}
if (hasStopWord == true) {
break;
}
}
}else{
hasStopWord=true;
}
if (hasStopWord == false) {
int count = StringUtils.countMatches(text, potentialKey);
results.put(potentialKey, (float) (Math.log(count) / 100) + (float)(prominenceVal/5));
}
}
}
}
}
sortedData.putAll(results);
System.out.println(sortedData);
return null;
}
What it basically does is give me the noun phrases back, sorted by a prominence value (where in the text does the phrase occur?) and by counts.
But honestly, this doesn't work very well.
I also tried it with a Lucene analyzer, but the results were not good either.
So how can I achieve what I want? I already know of KEA, Maui-indexer etc. (but I'm afraid I can't use them because of the GPL).
Also of interest: which other algorithms can I use instead of TF-IDF?
Example:
This text: http://techcrunch.com/2015/09/04/etsys-pulling-the-plug-on-grand-st-at-the-end-of-this-month/
Good output in my opinion: Etsy, Grand St., solar chargers, maker marketplace, tech hardware
Finally, I found something:
https://github.com/srijiths/jtopia
It uses the POS tagger from OpenNLP or Stanford NLP. It is under the Apache License 2.0. I haven't measured precision and recall yet, but it delivers great results in my opinion.
Here is my code:
Configuration.setTaggerType("openNLP");
Configuration.setSingleStrength(6);
Configuration.setNoLimitStrength(5);
// if tagger type is "openNLP" then give the openNLP POS tagger path
//Configuration.setModelFileLocation("model/openNLP/en-pos-maxent.bin");
// if tagger type is "default" then give the default POS lexicon file
//Configuration.setModelFileLocation("model/default/english-lexicon.txt");
// if tagger type is "stanford "
Configuration.setModelFileLocation("Dont need that here");
Configuration.setPipeline(pipeline);
TermsExtractor termExtractor = new TermsExtractor();
TermDocument topiaDoc = new TermDocument();
topiaDoc = termExtractor.extractTerms(text);
//logger.info("Extracted terms : " + topiaDoc.getExtractedTerms());
Map<String, ArrayList<Integer>> finalFilteredTerms = topiaDoc.getFinalFilteredTerms();
List<KeywordsModel> keywords = new ArrayList<>();
for (Map.Entry<String, ArrayList<Integer>> e : finalFilteredTerms.entrySet()) {
KeywordsModel keyword = new KeywordsModel();
keyword.setLabel(e.getKey());
keywords.add(keyword);
}
I modified the Configuration file a bit so that the POSModel is loaded from the pipeline instance.

How to get all terms in index directory created by lucene 4.4.0

I'm using Lucene 4.4 to index my documents, and now I want to get all terms through an IndexReader. In older versions I used:
IndexReader reader = IndexReader.open(pathIndexDirectory);
TermEnum listTerm = reader.terms();
But in version 4.4 this no longer works.
What can I do now?
I am now doing it this way:
IndexReader reader = DirectoryReader.open(ramdir); // replaces the deprecated IndexReader.open(Directory)
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
Terms terms = fields.terms(field);
TermsEnum termsEnum = terms.iterator(null);
int count = 0;
while (termsEnum.next() != null) {
count++;
}
System.out.println(count);
}

Missing hits on lucene index search

I index one big database overview (just text fields) that the user must be able to search (see the indexFields method below). This search was previously done in the database with an ILIKE query, but that was slow, so the search now runs on the index. However, when I compare the results of the database query with the results of the index search, the index always returns far fewer results.
I'm not sure whether I am making a mistake in the indexing or in the search process. To me it all seems to make sense. Any ideas?
Here is the code. All advice appreciated!
// INDEXING
StandardAnalyzer analyzer = new StandardAnalyzer(
Version.LUCENE_CURRENT, stopSet); // stop set is empty
IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);
indexFields(writer);
writer.optimize();
writer.commit();
writer.close();
analyzer.close();
private void indexFields(IndexWriter writer) {
DetachedCriteria criteria = DetachedCriteria
.forClass(Activit.class);
int count = 0;
int max = 50000;
boolean existMoreToIndex = true;
List<Activit> result = new ArrayList<Activit>();
while (existMoreToIndex) {
try {
result = activitService.listPaged(count, max);
if (result.size() < max)
existMoreToIndex = false;
if (result.size() == 0)
return;
for (Activit ao : result) {
Document doc = new Document();
doc.add(new Field("id", String.valueOf(ao.getId()),
Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitOwner()!=null)
doc.add(new Field("field1", ao.getActivityOwner(),Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitResponsible() != null)
doc.add(new Field("field2", ao.getActivityResponsible(), Field.Store.YES,Field.Index.ANALYZED));
try {
writer.addDocument(doc);
} catch (CorruptIndexException e) {
e.printStackTrace();
}
count += max;
//SEARCH
public List<Activit> searchActivitiesInIndex(String searchCriteria) {
Set<String> stopSet = new HashSet<String>(); // empty because we do not want to remove stop words
Version version = Version.LUCENE_CURRENT;
String[] fields = {
"field1", "field2"};
try {
File tempFile = new File("C://testindex");
Directory INDEX_DIR = new SimpleFSDirectory(tempFile);
Searcher searcher = new IndexSearcher(INDEX_DIR, true);
QueryParser parser = new MultiFieldQueryParser(version, fields, new StandardAnalyzer(
version, stopSet));
Query query = parser.parse(searchCriteria);
TopDocs topDocs = searcher.search(query, 500);
ScoreDoc[] hits = topDocs.scoreDocs;
// here I always get a smaller hits length
searcher.close();
} catch (Exception e) {
e.printStackTrace();
}
}
Most likely the analyzer is doing something that you aren't expecting.
Open your index using Luke, you can see what your (analyzed) indexed documents look like, as well as your parsed queries - should let you see what's going wrong.
Also, can you give an example of searchCriteria? And the corresponding SQL query? Without that, it's hard to know if the indexing is done correctly. You may also not need to use MultiFieldQueryParser, which is quite inefficient.
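One possible explanation, offered as an assumption rather than a diagnosis of this particular index: SQL ILIKE '%term%' matches any substring, whereas a query against an analyzed Lucene field matches whole tokens, so searching for part of a word finds rows in the database but nothing in the index. The plain-Java sketch below imitates both behaviours. The split on non-alphanumeric characters is only a rough stand-in for StandardAnalyzer, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SubstringVsTokenSketch {

    /** Rough imitation of ILIKE '%needle%': case-insensitive substring match. */
    static boolean substringMatch(String text, String needle) {
        return text.toLowerCase(Locale.ROOT).contains(needle.toLowerCase(Locale.ROOT));
    }

    /** Rough imitation of a term query on an analyzed field: whole-token match. */
    static boolean tokenMatch(String text, String needle) {
        List<String> tokens =
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"));
        return tokens.contains(needle.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String value = "Project management office";
        System.out.println(substringMatch(value, "manage")); // "manage" is inside "management"
        System.out.println(tokenMatch(value, "manage"));     // but it is not a whole token
        System.out.println(tokenMatch(value, "management")); // the full word is a token
    }
}
```

If this turns out to be the cause, a prefix query such as manage* is closer to what ILIKE 'manage%' does. Checking the indexed tokens in Luke, as suggested above, would confirm or rule it out.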

How to get Lucene Fuzzy Search result's matching terms?

How do you get the matching fuzzy term and its offsets when using Lucene fuzzy search?
IndexSearcher mem = ....(some standard code)
QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);
TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);
// the ~ triggers the fuzzy search as per "Lucene In Action"
The fuzzy search works fine. If a document contains the term "fuzzy" or "luzzy", it is matched. How do I find out which term matched and what its offsets are?
I have made sure that all CONTENT_FIELDs are added with term vectors stored, including positions and offsets.
There was no straightforward way of doing this; however, I reconsidered Jared's suggestion and was able to get the solution working.
I am documenting this here just in case someone else has the same issue.
Create a class that implements org.apache.lucene.search.highlight.Formatter:
public class HitPositionCollector implements Formatter
{
// MatchOffset is a simple DTO
private List<MatchOffset> matchList;
public HitPositionCollector()
{
matchList = new ArrayList<MatchOffset>();
}
// this is where the start and end offsets, as well as the actual term, are captured
@Override
public String highlightTerm(String originalText, TokenGroup tokenGroup)
{
// only token groups with a positive score matched the query
if (tokenGroup.getTotalScore() > 0)
{
MatchOffset mo = new MatchOffset(tokenGroup.getToken(0).toString(), tokenGroup.getStartOffset(), tokenGroup.getEndOffset());
getMatchList().add(mo);
}
return originalText;
}
/**
* @return the matchList
*/
public List<MatchOffset> getMatchList()
{
return matchList;
}
}
Main Code
public void testHitsWithHitPositionCollector() throws Exception
{
System.out.println(" .... testHitsWithHitPositionCollector");
String fuzzyStr = "bro*";
QueryParser parser = new QueryParser(Version.LUCENE_30, "f", analyzer);
Query fzyQry = parser.parse(fuzzyStr);
TopDocs hits = searcher.search(fzyQry, 10);
QueryScorer scorer = new QueryScorer(fzyQry, "f");
HitPositionCollector myFormatter= new HitPositionCollector();
//Highlighter(Formatter formatter, Scorer fragmentScorer)
Highlighter highlighter = new Highlighter(myFormatter,scorer);
highlighter.setTextFragmenter(
new SimpleSpanFragmenter(scorer)
);
Analyzer analyzer2 = new SimpleAnalyzer();
int loopIndex=0;
//for (ScoreDoc sd : hits.scoreDocs) {
Document doc = searcher.doc( hits.scoreDocs[0].doc);
String title = doc.get("f");
TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(),
hits.scoreDocs[0].doc,
"f",
doc,
analyzer2);
String fragment = highlighter.getBestFragment(stream, title);
System.out.println(fragment);
assertEquals("the quick brown fox jumps over the lazy dog", fragment);
MatchOffset mo= myFormatter.getMatchList().get(loopIndex++);
assertTrue(mo.getEndPos()==15);
assertTrue(mo.getStartPos()==10);
assertTrue(mo.getToken().equals("brown"));
}

Indexing and Searching Date in Lucene

I tried indexing dates with the DateTools.dateToString() method. It works properly for indexing as well as searching.
But my already indexed data, which is referenced elsewhere, has each date indexed as new Date().getTime().
So my problem is how to run a range query on this data.
Is there any solution?
Thanks in advance.
You need to use a TermRangeQuery on your date field. For this to work properly, that field must always be indexed with DateTools.dateToString(). Here's a full example of indexing and searching on a date range with Lucene 3.0:
public class LuceneDateRange {
public static void main(String[] args) throws Exception {
// setup Lucene to use an in-memory index
Directory directory = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
MaxFieldLength mlf = MaxFieldLength.UNLIMITED;
IndexWriter writer = new IndexWriter(directory, analyzer, true, mlf);
// use the current time as the base of dates for this example
long baseTime = System.currentTimeMillis();
// index 10 documents with 1 second between dates
for (int i = 0; i < 10; i++) {
Document doc = new Document();
String id = String.valueOf(i);
String date = buildDate(baseTime + i * 1000);
doc.add(new Field("id", id, Store.YES, Index.NOT_ANALYZED));
doc.add(new Field("date", date, Store.YES, Index.NOT_ANALYZED));
writer.addDocument(doc);
}
writer.close();
// search for documents from 5 to 8 seconds after base, inclusive
IndexSearcher searcher = new IndexSearcher(directory);
String lowerDate = buildDate(baseTime + 5000);
String upperDate = buildDate(baseTime + 8000);
boolean includeLower = true;
boolean includeUpper = true;
TermRangeQuery query = new TermRangeQuery("date",
lowerDate, upperDate, includeLower, includeUpper);
// display search results
TopDocs topDocs = searcher.search(query, 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
System.out.println(doc);
}
}
public static String buildDate(long time) {
return DateTools.dateToString(new Date(time), Resolution.SECOND);
}
}
You'll get much better search performance if you use a NumericField for your date, and then NumericRangeFilter/Query to do the range search.
You just have to encode your date as a long or int. One simple way is to call the .getTime() method of your Date, but this may give far more resolution (milliseconds) than you need. If you only need resolution down to the day, you can encode the date as a YYYYMMDD integer.
Then, at search time, do the same conversion on your start/end Dates and run NumericRangeQuery/Filter.
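The YYYYMMDD encoding mentioned above can be sketched in plain Java as follows. Interpreting the date in UTC is an assumption (use whatever time zone your data is defined in), and the class and method names are hypothetical. The resulting int is what you would put into the numeric field and use as the range bounds; the Lucene calls themselves are not shown here.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

final class DateEncodingSketch {

    /** Encode a date as the integer YYYYMMDD, interpreted in UTC (an assumption). */
    static int toYyyymmdd(Date date) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTime(date);
        return cal.get(Calendar.YEAR) * 10000
                + (cal.get(Calendar.MONTH) + 1) * 100 // Calendar months are 0-based
                + cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        // 0 ms is 1970-01-01T00:00:00Z; 86,400,000 ms later it is 1970-01-02.
        System.out.println(toYyyymmdd(new Date(0L)));
        System.out.println(toYyyymmdd(new Date(86_400_000L)));
    }
}
```

Because the encoded integers sort in the same order as the dates they represent, a numeric range over them selects exactly the days between the two bounds.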
