Performing fuzzy matching - java

In my requirement I need to match names from one table to another table.
The source table might contain names such as Tony, Bill, Rob.
The target table might contain names such as Anthony, William, Robert.
Basically, the source table may contain nicknames/short names.
Is there any fuzzy logic/AI tool available in Java/SQL to perform such matches?
I know it can be done using the SQL Server fuzzy logic package, but this package comes with SQL Server Enterprise Edition, and my client doesn't want to upgrade to it.
Is there any other alternative, preferably open source/free of cost?

Search frameworks like Apache Lucene provide fuzzy matching queries. Try FuzzyQuery from Lucene.
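To get a feel for what edit-distance based fuzzy matching can and cannot do for the names in the question, a minimal plain-Java sketch may help. This is not Lucene's implementation; the class and method names are made up for illustration. As far as I know, Lucene's FuzzyQuery accepts at most 2 edits, so name pairs that are further apart would probably need something extra, such as a nickname lookup table.

```java
// Minimal sketch of plain Levenshtein edit distance (illustrative only,
// not Lucene code). Class and method names are hypothetical.
final class EditDistance {

    // Number of single-character insertions, deletions or substitutions
    // needed to turn a into b, ignoring case.
    static int levenshtein(String a, String b) {
        String s = a.toLowerCase();
        String t = b.toLowerCase();
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        String[][] pairs = {{"Tony", "Anthony"}, {"Bill", "William"}, {"Rob", "Robert"}};
        for (String[] p : pairs) {
            System.out.println(p[0] + " / " + p[1] + ": " + levenshtein(p[0], p[1]));
        }
    }
}
```

Under this measure Tony/Anthony and Rob/Robert are 3 edits apart and Bill/William is 4, which is why pure edit distance tends to work better for typos (Jon/John) than for nicknames.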

Related

Lucene Search Luke vs Hibernate Search different result

I am running the following lucene query phrase in luke:
+(debtorNumber:10200000 originalDebtorNumber:10200000) +(serviceName:"skype
for"^840.0 (serviceName:for* serviceId:for*) (serviceName:skype*
serviceId:skype*))
It shows the expected results at the top, for example:
Skype for Business for Managers
Microsoft Skype for Business Conferencing (Plan2)
Telephone dial-in for Skype for Business Conferencing
and so on.
The same query executed with Hibernate Search shows a different result :/
I am getting, for example, the following results:
antivirus protection for your PC, notebook or server
central administration for thin clients
The "skype for" results only come on the 3rd or 4th page.
The java code is:
SearchManager searchManager = Search.getSearchManager(cache);
CacheQuery<MyType> query = searchManager.getQuery(booleanQuery, MyType.class);
List<MyType> pagedResult = query
    .maxResults(criteria.getPageSize())
    .firstResult(Math.toIntExact(criteria.getOffset()))
    .list();
This logs the above query which I used in Luke
log.info("Lucene Search boolean query:" + booleanQuery);
Please advise.
There might be multiple reasons for the difference; let me try to compile a checklist.
Different index
The main difference I can think of is that Luke will always target a single index: the one you opened explicitly.
Hibernate Search will actually run the query on a composite view of all indexes containing MyType and indexed subclasses (and any shards you might have). Often that's just one index, but you possibly have multiple indexes opened?
That will affect the results, and definitely the scores.
Different Lucene version
Verify that the Luke version you're using is built on the exact same version of Lucene as your application.
Check the scoring
You can use a Projection query to have Infinispan Query / Hibernate Search explain the scores of all results it produced; this can be very useful to understand what is going on.
See FullTextQuery.EXPLANATION and FullTextQuery.SCORE in section Projections, and Example 105.
IndexReader
You can also use the SearchManager to get the low-level IndexReader(s) and run the query directly, by-passing Infinispan and Hibernate Search code.
SearchIntegrator si = searchManager.unwrap(SearchIntegrator.class);
si.getIndexReaderAccessor(). ...
That might help narrow down which component is affecting your expected scoring.
The IndexReaderAccessor can open an index by type or by name. When opened by name it will open the single index; when opened by type it will apply the rules to satisfy polymorphic queries and might return an aggregate. It might be interesting to experiment with both to verify that they return the same results.
...and check the basics
Make sure you're opening the same physical index :-)
In particular, recent versions of Infinispan might apply sharding transparently to improve data distribution in the cluster. This can be confusing when debugging scoring, especially when you're not aware of it.

Database Search with key words using jpa

I'm doing college work where I have to search by keywords. My entity is called Position and I'm using MySQL. The fields that I need to search are:
    - date
    - positionCode
    - title
    - location
    - status
    - company
    - tecnoArea
I need to search for the same word in all of these fields. To this end, I used the Criteria API to create a dynamic query. It is the same word for several fields and it should return as many results as possible. Do you have any advice on how to optimize the search on the database? Should I do several queries?
EDIT
I will use an OR constraint.
If you need to find the keyword at any position within the data, you will need to use LIKE with wildcards, e.g. title LIKE '%manager%'. Since date and positionCode (presumably a numeric type) are not likely to contain the keyword, I would omit these columns from the search for a very small performance gain.
Your query is going to need a serial read (a full table scan), because a pattern with a leading wildcard cannot use an ordinary index. That means all rows in the table have to be brought into main memory to evaluate the query and retrieve the result set. Given that a serial read is going to happen anyway, I do not think there is much you can do to optimize the query when searching multiple columns.
I am not familiar with the "criteria API to create dynamic queries", but dynamic queries in other systems are non-optimal: they must be parsed and evaluated every time they are run, and most query optimizers cannot use statistics for cost-based optimization the way they can with explicitly defined SQL.
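As a rough illustration of the single query with OR'd LIKE conditions described above, here is a small plain-Java sketch that only builds the condition string and the wildcard pattern. The class name, method names and the `:kw` parameter name are assumptions for the example; in a real application the string would be handed to JPA (or the same predicates built with the Criteria API) and the pattern bound as a parameter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds one OR'd LIKE condition over several columns.
final class KeywordSearchSql {

    // Produces e.g. "title LIKE :kw OR location LIKE :kw".
    // Column names must come from code, never from user input.
    static String whereClause(List<String> columns) {
        return columns.stream()
                .map(c -> c + " LIKE :kw")
                .collect(Collectors.joining(" OR "));
    }

    // Wraps the keyword in wildcards so it matches at any position.
    static String pattern(String keyword) {
        return "%" + keyword + "%";
    }

    public static void main(String[] args) {
        // date and positionCode are left out, as suggested above.
        List<String> columns =
                Arrays.asList("title", "location", "status", "company", "tecnoArea");
        System.out.println("SELECT p FROM Position p WHERE " + whereClause(columns));
        System.out.println("kw = " + pattern("manager"));
    }
}
```

One query with a single bound parameter keeps the work to one table scan, whereas running a separate query per column would repeat the scan.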
Not sure what your database is.
If it is Oracle, you can use Oracle text.
The below link might be useful :
http://swiss-army-development.blogspot.com/2012/02/keyword-search-via-oracle-text.html

Hibernate Search query for class

I'm using Hibernate Search 4.4.0 and I ran into a problem recently.
For example, I have 2 classes, INDEXING and DATA_PROPERTY. There is no association between the two of them, and I can't change them or create a new class to associate them.
Part of Lucene indexing:
mapping.entity(DatatypeProperty.class).indexed().providedId()
.property("rdfResource",ElementType.FIELD).field().analyze(Analyze.NO).store(Store.YES)
.property("partitionValue", ElementType.FIELD).field().analyze(Analyze.NO)
mapping.entity(Indexing.class).indexed().providedId()
.property("rdfResource",ElementType.FIELD).field().analyze(Analyze.NO).store(Store.YES)
Now in the SQL, I use
SELECT IND.RDF_RESOURCE
FROM INDEXING IND, DATA_PROPERTY DP
WHERE IND.RDF_RESOURCE = DP.RDF_RESOURCE
AND IND.OBJECT_TYPE_ID_INDEXED IN (........)
AND DP.PARTITION_VALUE IN (......)
AND .......
How can I translate IND.RDF_RESOURCE = DP.RDF_RESOURCE in Hibernate Search???
I thought maybe I could use one query to find all the RDF_RESOURCE values of class DatatypeProperty and then match all of them in the query for class Indexing. But that seems very inefficient.
Does anyone have a better way to do this?
I have 2 classes INDEXING and DATA_PROPERTY. There is no association
between 2 of them. And I can't change them or create a new class to
associate 2 of them.
In this case you are between a rock and a hard place. You will need to associate the records somehow and the most obvious choice is via an association. Also, you cannot compare a SQL join with a free text based index provided by Lucene.
One potential solution could be to write a custom bridge which at indexing time executes the join and indexes the relevant data, so that you can target it directly via your query. Whether this works for you will depend on your use case. In your example setup, I don't see any field which would benefit from free text search. I can only assume that you are only showing parts of your code. If not, why don't you just stick with SQL?

Appengine Search API vs Datastore

I am trying to decide whether I should use App-engine Search API or Datastore for an App-engine Connected Android Project. The only distinction that the google documentation makes is
... an index search can find no more than 10,000 matching documents.
The App Engine Datastore may be more appropriate for applications that
need to retrieve very large result sets.
Given that I am already very familiar with the Datastore: Will someone please help me, assuming I don't need 10,000 results?
Are there any advantages to using the Search API versus using Datastore for my queries (per the quote above, it seems sensible to use one or the other)? In my case the end user must be able to search, update existing entries, and create new entities. For example if my app is a bookstore, the user must be able to add new books, add reviews to existing books, search for a specific book.
My data structure is such that the content will be supplied by the end user. Document vs Datastore entity: which is cheaper to update? $$, etc.
Can they supplement each other: Datastore and Search API? What's the advantage? Why would someone consider pairing the two? What's the catch/cost?
Some other info:
The datastore is a transactional system, which is important in many use cases. The search API is not. For example, you can't put and delete a document in a search index in a single transaction.
The datastore has a lot in common with a NoSQL DB like Cassandra, while the search API is really a textual search engine, very similar to something like Lucene. If you understand how a reverse (inverted) index works, you'll get a better understanding of how the search API works.
A very good reason to combine usage of the datastore API and the search API is that the datastore makes it very difficult to do some types of queries (e.g. free text queries, geospatial queries) that the search API handles very easily. Thus, you could store your main entities in the datastore, but then use the search API if you need to search in ways the datastore doesn't allow. Down the road, I think it would be great if the datastore and search API were more tightly integrated, for example by letting you do free text search against indexed Text fields, where app engine would automatically create a search Document Index behind the scenes for you.
The key difference is that with the Datastore you cannot search inside entities. If you have a book called "War and peace", you cannot find it if a user types "war peace" in a search box. The same with reviews, etc. Therefore, it's not really an option for you.
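The "War and peace" example can be made concrete with a toy reverse (inverted) index in plain Java. This is only a sketch of the general idea behind text search engines, not how the Search API is actually implemented, and all names in it are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lowercased word to the ids of the
// documents that contain it. Illustrative only.
final class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    void add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
        }
    }

    // Returns the documents that contain every word of the query.
    List<String> search(String query) {
        Set<Integer> result = null;
        for (String token : tokenize(query)) {
            Set<Integer> ids = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        List<String> hits = new ArrayList<>();
        if (result != null) {
            for (int id : result) {
                hits.add(docs.get(id));
            }
        }
        return hits;
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("War and peace");
        index.add("Peace talks");
        System.out.println(index.search("war peace"));
    }
}
```

Because the lookup goes from words to documents, a query like "war peace" finds the book even though that exact string never appears in the title, which an equality filter on a title property cannot do.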
The most serious con of Search API is Eventual Consistency as stated here:
https://developers.google.com/appengine/docs/java/search/#Java_Consistency
It means that when you add or update a record with the Search API, the change may not be reflected immediately. Imagine a case where a user uploads a book or updates their account settings, and nothing appears to change because the change hasn't reached all servers yet.
I think the Search API is only good for one thing: search. It basically acts as a search engine for your data in Datastore.
So my advice is to keep the data for which the user expects an immediate result in Datastore, and use the Search API to search data for which the user doesn't expect an immediate result.
The Datastore only provides a few query operators (=, !=, <, >). Nested filters and multiple inequalities are either costly or impossible (timeouts), and search results may include a lot of false positives. You can do partial string search by tokenizing, but this will bloat your entities. The best way around these limitations is to use Structured Properties and/or Ancestor Queries.
The Search API, on the other hand, runs a full-text search on search documents, which is faster and more accurate than NDB queries and does not rely on tokenized data. The downside is that it relies on the documents being kept up to date.
Use Datastore to process your data (create, update, delete), then run a function that puts the data into documents grouped into indexes, and run the searches using the Search API.

can Lucene be used to search inside db?

Can we use Lucene to search text stored in DB?
I saw this article that shows how to use it for normal articles stored as files
http://javatechniques.com/blog/lucene-in-memory-text-search-example/
Can someone suggest?
Look at the below question from their FAQ. If you are using Hibernate then I recommend you to consider Hibernate Search.
How can I use Lucene to index a database?
You should use the Compass Framework. It's built upon Lucene and integrates nicely with several ORMs
Update: you should now use ElasticSearch instead (thanks Pangea)
Can we use Lucene to search text stored in DB?
Yes, you can. Lucene can work with data from different kinds of database tables (MySQL, etc.). In order to search text stored in a DB, Lucene needs to index all the data you want to search.
But don't forget: Lucene is just an index. To access Lucene, that is, to search inside it or to start an import, you need a second piece of software to use (control, ...) the data inside Lucene.
This could be Solr, for example: http://lucene.apache.org/solr/
On the RDBMS you don't need a full-text index for that anymore.
