Integrate WordNet with Solr 7.5.0 - Java

I am a beginner with Solr 7.5.0 and I don't know each and every module of it. As I'm building a question-answering system, I want to integrate WordNet so I can get better query responses. I googled it and found some methods and previous questions, but I'm really confused about how to do this in Solr 7.5.0 step by step.
Edit: Solr 7.5.0 has a WordnetSynonymParser class, so if anyone has worked on this, please guide me on how to use that class, or is there another way to do it? And can I use Python to do it?
Thanks in advance.

This article is very useful for this question. The integration of WordNet can be done as follows: WordNet ships a Prolog file ('wn_s.pl') that contains its synsets, and we can convert it into a synonyms.txt file that Solr can consume. To convert the wn_s.pl file we can use Syns2Syms.java, which generates a Synonyms.txt that we can feed to Solr.
But WordNet expansion will only yield marginal gains in relevance when the search is domain-specific, so simply creating your own synonyms list based on the common tokens in your index will give more relevance.
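Since the question asks specifically about WordnetSynonymParser: Lucene (and therefore Solr 7.5.0) can parse wn_s.pl directly, so the conversion step is optional. Below is a minimal, untested Java sketch of that route; the file path and the choice of EnglishAnalyzer/StandardTokenizer are my assumptions, not requirements:

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.synonym.WordnetSynonymParser;

public class WordnetAnalyzer extends Analyzer {
    private final SynonymMap synonyms;

    public WordnetAnalyzer(String wnPrologPath) throws Exception {
        // Parse the WordNet Prolog synset file (wn_s.pl) into a SynonymMap.
        WordnetSynonymParser parser =
                new WordnetSynonymParser(true, true, new EnglishAnalyzer());
        try (Reader reader = Files.newBufferedReader(Paths.get(wnPrologPath), StandardCharsets.UTF_8)) {
            parser.parse(reader);
        }
        this.synonyms = parser.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        // Expand each token to its WordNet synonyms at analysis time.
        TokenStream filtered = new SynonymGraphFilter(source, synonyms, true);
        return new TokenStreamComponents(source, filtered);
    }
}

Inside Solr itself, the same parser can be reached from the schema by pointing a synonym filter factory (e.g. solr.SynonymGraphFilterFactory) at the Prolog file with format="wordnet", which avoids writing any Java at all. And yes, you could preprocess the synsets into a plain synonyms.txt with Python instead; Solr only cares about the final synonyms file, not how it was produced.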

Related

Extract relevant sections or paragraphs from a document based on keywords

In one of our projects we have some HTML files stored in an Oracle database, though we could keep them as plain files or in a NoSQL database if that is more appropriate. We are given some keywords, and based on them we need to find the relevant sections in those files. The files are basic company declarations, news articles, financial reports, etc. We need to find sections pertaining to categories such as the following:
Risk, using keywords like 'crime', 'theft', 'litigation', 'accuse', etc.
High-rank changes, using keywords like 'will be leaving', 'Appointment of Certain Officers', 'Election of Director', etc.
Shareholder rights, using keywords like 'shareholder rights', 'shareholder lawsuits', 'financial restatements', etc.
There are other categories as well, each with its own defined keywords to be searched. So the requirement is to extract, category by category, the section or paragraph that is MOST relevant.
The emphasis is on high accuracy in finding the most relevant section.
If technologies like Solr, Elasticsearch, or Jackrabbit provide that, we are open to them. We just need to be pointed toward the right tech stack.
Currently we are trying Oracle Text search, but I believe we might have a better programmatic solution, perhaps using machine learning, NLP, or some Java library that does this. Kindly give me some insights. I am an experienced Java developer and work with machine learning and NLP. I am language-agnostic, so a good solution using any language or technique is welcome.
The direction you seem to be going with this question is word/phrase search [easy] vs. semantic search [hard]. Several people have worked on such solutions over the years [I met folks from a company in Scotland who were building a Java-based solution, but I can't recall the name]. Where you get into trouble with semantic search is that there are so many problem domains [each with very relevant taxonomies of its own] where the semantics of the same words or phrases differ widely. Then, of course, some folks make the "semantic" job easier by meta-tagging the data (examples: images, video, complex documents) and then searching the metadata.
When I was an Enterprise Architect a few years back, we used Verity to essentially Google the enterprise. I have no idea if it is still a product, but it leveraged Oracle Text and layered its own code on top of that.
Back in the day, the state of the art was what Forrester Research called "Connecting Data, Content, And Text With Organic Information Abstraction", but I don't know where the state of the practice is right now.
I'll bet Google might have some tools you could use :) .
Sounds like a fun project!!!

How to use Lucene FieldCache for search speed improvement?

I am using Lucene 3.6 and I am trying to use FieldCache. I have seen some posts but did not get a clear idea. Can anyone please suggest a link where I can find a proper example of FieldCache and how to use it while searching?
You don't typically use it directly; it is an internal API used by Lucene.
If you are extending Lucene's search API and need to use it, you will need to provide more details.
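For completeness, here is a rough sketch of where FieldCache typically surfaces in Lucene 3.6: a custom Collector that pulls a per-document int value out of the cache while collecting hits. The "price" field name is invented for the example, and the field must be indexed as a single un-tokenized value for the cache to work:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

// Reads a per-document int field straight from the FieldCache instead of
// loading stored fields for every hit.
public class PriceAwareCollector extends Collector {
    private int[] prices;   // one entry per document in the current segment
    private int docBase;
    private Scorer scorer;

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.docBase = docBase;
        // The first call per segment populates the cache; later calls are cheap lookups.
        this.prices = FieldCache.DEFAULT.getInts(reader, "price");
    }

    @Override
    public void collect(int doc) throws IOException {
        float score = scorer.score();
        int price = prices[doc];
        System.out.println("doc=" + (docBase + doc) + " score=" + score + " price=" + price);
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
}

You would pass an instance of this collector to IndexSearcher.search(query, collector); but unless you are doing custom collection, sorting, or function scoring like this, you can usually leave FieldCache to Lucene.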

How to use word2vec?

I have to build a lexical graph from the words in a corpus. For that, I need to write a program that uses word2vec.
The thing is, I'm new at this. I've been trying for 4 days now to find a way to use word2vec, but I'm lost. My big problem is that I don't even know where to find the code in Java (I heard about deeplearning4j, but I couldn't find the files on their website) or how to integrate it into my project...
One of the easiest ways to use the Word2Vec representation in your Java code is deeplearning4j, the one you mentioned. I assume you have already seen the main pages of the project. As for the code, check these links:
Github repository
Examples
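As a starting point, here is a small training sketch along the lines of the official deeplearning4j examples. It assumes a plain-text corpus with one sentence per line ("corpus.txt" is just a placeholder), and the exact builder options may differ slightly between dl4j versions:

import java.util.Collection;

import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecDemo {
    public static void main(String[] args) throws Exception {
        // One sentence per line; replace the path with your own corpus.
        SentenceIterator sentences = new BasicLineIterator("corpus.txt");

        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore very rare words
                .layerSize(100)        // embedding dimension
                .windowSize(5)
                .seed(42)
                .iterate(sentences)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();                     // train the model

        // Nearest neighbours of a word: a natural starting point for a lexical graph.
        Collection<String> similar = vec.wordsNearest("day", 10);
        System.out.println(similar);
    }
}

From there, building the lexical graph is a matter of adding an edge between each word and its nearest neighbours (or any pair whose cosine similarity exceeds a threshold you choose).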

How to send a search query using OpenSearch and Lucene?

I have an OpenSearch description document and am supposed to send my queries through it programmatically. Unfortunately, I'm really new to all of this. Are there any examples out there using Lucene and Java?
Thanks!
I don't know anything about OpenSearch, but I can give you some hints about Lucene. To search with Lucene you first have to build a Lucene index of your data (text files, XML files, a database, etc.).
There are a lot of examples of how to use Lucene all over the internet. This looks like quite a good example of how to build a Lucene index.
It covers everything: indexing, searching, and displaying results (it is a very simple example). Because your question is not very specific, this is as much as I can help. If you have a specific problem, I would be happy to help.
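In case the linked example disappears, here is a compact sketch of the same idea against a recent Lucene API (8.x/9.x); the field names and sample text are invented, and older versions differ mainly in the Directory and analyzer constructors:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneQuickStart {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();   // in-memory index, just for the demo

        // 1. Index a document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
            doc.add(new TextField("body", "Lucene is a Java search library", Field.Store.YES));
            writer.addDocument(doc);
        }

        // 2. Parse the user's query string, search, and display the results.
        Query query = new QueryParser("body", analyzer).parse("java search");
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title") + "  score=" + hit.score);
            }
        }
    }
}

The OpenSearch part would then be a thin layer on top: take the query string out of the URL built from the OpenSearch template and feed it to the QueryParser.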

java - tf*idf implementation?

I am basically creating a search engine, and I want to implement tf*idf to rank my XML documents against a search query. How do I implement it? Where do I start? Any help is appreciated.
It's surprising that the Weka library hasn't been mentioned here. Weka's StringToWordVector class implements TF-IDF.
I did this in the past, and I used Lucene to get the TF*IDF data.
It took a fair amount of fiddling around, though, so if there are other solutions people know to be easier, use them.
Start by looking at TermFreqVector and other classes in org.apache.lucene.index.
tfidf is a standalone Java package that calculates TF-IDF.
Apache Mahout:
https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/vectorizer/TFIDF.java
I believe it requires a Hadoop file system, which is a bit of extra work, but it works great.
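If all you need is the weighting itself, here is a minimal plain-Java sketch of the standard formula, tf(t, d) * log(N / df(t)), using a raw term count for tf and a +1 in the denominator to avoid division by zero; the tiny hard-coded corpus is only for illustration:

import java.util.List;

public class TfIdfSketch {
    // Term frequency: how often the term occurs in the document (raw count here).
    static long tf(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    // Inverse document frequency: log of (total docs / docs containing the term).
    static double idf(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / (1 + containing)); // +1 avoids division by zero
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("solr", "search", "engine"),
            List.of("lucene", "java", "search"),
            List.of("wordnet", "synonyms"));
        System.out.println(tfIdf(corpus.get(0), corpus, "search"));
    }
}

For real documents you would tokenize and lowercase the text first (Lucene's analyzers or any NLP library can do that), and often normalize tf by document length so long documents don't dominate.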
