Implementing search using Java technology (Java web) - java

I want to implement search functionality in a web application that I am building using Java technology. I need to search through the database, depending on the user's query, and display the results. How can I go about doing this (please note that I am using Java technology)? Thanks.

You can use a product like http://lucene.apache.org/core/ or http://lucene.apache.org/solr/ for this instead of writing it on your own.
Lucene is a high-performance search engine library for documents.
Solr is built on top of Lucene and provides additional features (like hit highlighting, faceted search, database integration, and rich-document (Word, PDF, ..) search).
Lucene analyzes your text data and builds up an index. When performing a search, you run a Lucene query against this index.
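To give a feel for what that looks like in code, here is a minimal sketch assuming a recent Lucene version on the classpath; the index path and field names are made up for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        FSDirectory dir = FSDirectory.open(Paths.get("/tmp/myindex"));

        // Index a document (content would come from your database rows).
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Invoice for ACME Corp", Field.Store.YES));
            doc.add(new TextField("body", "Payment due within 30 days", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Run a query against the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("payment");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }
}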

Assuming you mean free text searching of the data in the database...
For free text searching, Lucene and/or Solr are very good solutions. They work by creating a separate index of the data in your database. It is up to you to either pull the data from the database and index it using Lucene/Solr, or arrange the code that writes to the database to also update the Lucene/Solr index. Given what you have said, it sounds like this is being retrofitted to an existing database, so pulling the data and indexing it may be the best solution. In this case Solr is probably a better fit, as it is a packaged solution.
Another option would be Hibernate Search. Again this would be a solution to use if you are starting out. It would be more difficult to add after the fact.
Also bear in mind that some databases support free text searching in addition to normal relational queries, and these could be worth a look. SQL Server certainly has full-text search capabilities, and I would imagine other databases have some sort of support. I am not too sure how you access these, but I would expect to be able to do it using SQL via JDBC. It is likely to be database-specific, though.
If you just mean normal SQL searching, then there is a whole load of Java EE technologies: plain JDBC, Spring templates, ORM technologies (JPA, JDO, Hibernate, etc.). The list goes on, and it would be difficult to suggest any particular approach without a lot more information.
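For the plain-JDBC case, the simplest form of user-driven search is a parameterized LIKE query. A minimal sketch; the connection URL, table and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlSearch {
    public static void main(String[] args) throws Exception {
        String userQuery = "invoice"; // would come from the HTTP request in a web app

        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, title FROM documents WHERE title LIKE ?")) {
            ps.setString(1, "%" + userQuery + "%"); // bind the parameter, never concatenate
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": " + rs.getString("title"));
                }
            }
        }
    }
}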

Related

Autocomplete with Java, Redis, Elasticsearch, Mongo

I have to implement an autocomplete with over 500,000 names which may later increase to over 4 million names.
The backend is a Java REST web service using Spring. Should I use MongoDB, Redis or Elasticsearch for storing and querying/searching the names?
It's a critical search use case. MongoDB and Redis are great for key-based lookups but are not meant for search purposes, while Elasticsearch is a distributed search engine built specifically for this use case.
Before choosing a system, you should understand how your feature works internally. Below are the considerations for selecting one.
Non-functional requirements for your feature:
What would be the total number of search queries per second (QPS)?
How frequently would you be updating the documents (i.e., the names in your example)?
What is the SLA between a name being updated and it appearing in search results?
What is the SLA for your search results?
Some functional requirements:
How should autocomplete behave: prefix or infix search on the names?
What is the minimum number of characters a user should type before autocomplete results are shown?
How frequently can the above requirements change?
Elasticsearch indexes documents in an inverted index and works on token matches (which can easily be customized to suit business requirements), hence it is super fast at searching. Redis and MongoDB do not have this structure internally and shouldn't be used for this use case. You shouldn't have any doubt about choosing Elasticsearch over these to implement autocomplete.
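To show how little is needed to get started, here is a minimal prefix-search sketch against Elasticsearch's REST API using only the JDK's HTTP client (Java 11+). The index name "names" and field "name" are assumptions; a production autocomplete would more likely use edge n-grams or the completion suggester:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsAutocompleteSketch {
    public static void main(String[] args) throws Exception {
        // Match names starting with the typed prefix "joh".
        String body = "{ \"query\": { \"match_phrase_prefix\": { \"name\": \"joh\" } } }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/names/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON hits containing the matching names
    }
}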
You can use RediSearch (https://oss.redislabs.com/redisearch/). It's a free-text search engine built on top of Redis as a Redis module. It also has an autocomplete feature.

Should I use SolrJ to convert a Lucene project into a browser-based search engine?

My current search engine involves two desktop applications based on Lucene (Java). One is dedicated to indexing internal documents, the other one to searching.
Now I have been asked to offer the search engine as a web page. So my first thought was to use Solr, and I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether the PDF originates from a scanned document, and we limit the number of pages that will be OCRed in scanned PDFs, since only the first pages are valuable for search. For now everything works via calls to the Lucene API, in classes with lots of ifs!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(J) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I do it?
Thank you very much in advance for your advice.
While this is rather opinion-based, I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, external to Solr (i.e. reuse your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
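As a sketch of what that swap looks like (assuming SolrJ 7.x to match the 7.4 guide linked above; the core name "documents" and the field names are hypothetical and must match your schema):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrPdfIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "pdf-0001");
            doc.addField("title", "Scanned contract");
            // The body comes from your existing custom PDF/OCR pipeline, unchanged.
            doc.addField("body", extractTextWithCustomPdfPipeline("contract.pdf"));
            solr.add(doc);
            solr.commit();
        }
    }

    // Placeholder standing in for the PDF processing code you already have.
    static String extractTextWithCustomPdfPipeline(String path) {
        return "text produced by the existing OCR/PDF logic for " + path;
    }
}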
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much as possible of the indexing code is outside of Solr, since Solr will only be concerned with the actual text, and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.

One SQL query to access multiple data sources in Java (Oracle, Excel, SQL Server)

I need to develop an application that can get data from multiple data sources (Oracle, Excel, Microsoft SQL Server, and so on) using one SQL query. For example:
SELECT o.employeeId, count(o.orderId)
FROM employees#excel e, customers#microsoftsql c, orders#oracle o
WHERE o.employeeId = e.employeeId and o.customerId = c.customerId
GROUP BY o.employeeId;
This SQL and the data sources must be changeable dynamically by the Java program. My customers want to write and run SQL-like queries across different databases and storage at the same time, with GROUP BY, HAVING, COUNT, SUM and so on, in the web interface of my application. Other requirements are performance and light weight.
I have found these ways to do it (with the drawbacks I see; please correct me if I am wrong):
Apache Spark (drawbacks: a heavy solution, better suited for Big Data, slow if you need up-to-date information without caching it in Spark),
Distributed queries in the database server (database links in Oracle, linked servers in Microsoft SQL Server, Power Query in Excel) - drawbacks: problems changing data sources dynamically from a Java program, and problems working with Excel,
Prestodb (drawbacks: a heavy solution, better suited for Big Data),
Apache Drill (drawbacks: quite a young solution, some problems with non-latest ODBC drivers and some bugs),
Apache Calcite (a light framework used by Apache Drill; drawbacks: still quite a young solution),
Doing the join across data sources manually (drawbacks: a lot of work to develop a correct join, "group by" on the result set, finding the best execution plan, and so on).
Maybe you know another way (using free open-source solutions), or can give me advice from your experience about the approaches above? Any help would be greatly appreciated.
UnityJDBC is a commercial JDBC driver that wraps multiple data sources and allows you to treat them as if they were all part of the same database. It works as follows:
You define a "schema file" to describe each of your databases. The schema file resembles something like:
...
<TABLE>
  <semanticTableName>Database1.MY_TABLE</semanticTableName>
  <tableName>MY_TABLE</tableName>
  <numTuples>2000</numTuples>
  <FIELD>
    <semanticFieldName>MY_TABLE.MY_ID</semanticFieldName>
    <fieldName>MY_ID</fieldName>
    <dataType>3</dataType>
    <dataTypeName>DECIMAL</dataTypeName>
...
You also have a central "sources file" that references all of your schema files and gives connection information, and it looks like this:
<SOURCES>
  <DATABASE>
    <URL>jdbc:oracle:thin:@localhost:1521:xe</URL>
    <USER>scott</USER>
    <PASSWORD>tiger</PASSWORD>
    <DRIVER>oracle.jdbc.driver.OracleDriver</DRIVER>
    <SCHEMA>MyOracleSchema.xml</SCHEMA>
  </DATABASE>
  <DATABASE>
    <URL>jdbc:sqlserver://localhost:1433</URL>
    <USER>sa</USER>
    <PASSWORD>Password123</PASSWORD>
    <DRIVER>com.microsoft.sqlserver.jdbc.SQLServerDriver</DRIVER>
    <SCHEMA>MySQLServerSchema.xml</SCHEMA>
  </DATABASE>
</SOURCES>
You can then use unity.jdbc.UnityDriver to allow your Java code to run SQL that joins across databases, like so:
String sql = "SELECT *\n" +
    "FROM MyOracleDB.Whatever, MySQLServerDB.Something\n" +
    "WHERE MyOracleDB.Whatever.whatever_id = MySQLServerDB.Something.whatever_id";
stmt.execute(sql);
So it looks like UnityJDBC provides the functionality you need. However, I have to say that any solution that allows users to execute arbitrary SQL joining tables across different databases sounds like a recipe for bringing your databases to their knees. The solution I would actually recommend for your type of requirements is to run ETL processes from all of your data sources into a single data warehouse and let your users query that; how to define those processes and your data warehouse is definitely too broad for a Stack Overflow question.
One appropriate solution is the DataNucleus platform, which has JDO, JPA and REST APIs. It has support for almost every RDBMS (PostgreSQL, MySQL, SQL Server, Oracle, DB2, etc.) and for NoSQL datastores (map-based, graph-based, document-based, etc.), database web services, LDAP, and documents like XLS, ODF and XML.
Alternatively you can use EclipseLink, which also supports RDBMS, NoSQL, database web services and XML.
By using JDOQL, which is part of the JDO API, the requirement of having one query to access multiple datastores will be met. Both solutions are open source, relatively lightweight and performant.
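As a minimal JDOQL sketch (the persistence unit "MyUnit" and the Order class are hypothetical; which datastore backs the unit is configured declaratively, not in code):

import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Query;
import java.util.List;

public class JdoQuerySketch {
    public static void main(String[] args) {
        PersistenceManagerFactory pmf = JDOHelper.getPersistenceManagerFactory("MyUnit");
        PersistenceManager pm = pmf.getPersistenceManager();
        try {
            // The same JDOQL works regardless of which datastore backs the unit.
            Query q = pm.newQuery("SELECT FROM mypackage.Order WHERE amount > 100");
            List<?> results = (List<?>) q.execute();
            System.out.println(results.size() + " matching orders");
        } finally {
            pm.close();
        }
    }
}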
Why did I suggest this solution?
From your requirements, it is understood that the datastore will be your customer's choice and you are not looking for a Big Data solution.
You prefer open-source solutions, which are lightweight and performant.
Considering your use case, you might require a data management platform with polyglot persistence behaviour, which has the ability to leverage multiple datastores based on your/your customer's use cases.
To read more about polyglot persistence:
https://dzone.com/articles/polyglot-persistence-future
https://www.mapr.com/products/polyglot-persistence
SQL dialects are tied to the database management system: SQL Server requires different SQL statements than an Oracle server.
My suggestion is to use JPA. It is completely independent of your database management system and makes development in Java much more efficient.
The downside is that you cannot combine several database systems with JPA out of the box (as in a 1:1 relation between a SQL Server and an Oracle database). You could, however, create several EntityManagerFactories (one for each database) and link them together in your code.
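A minimal sketch of the several-factories approach; the persistence unit names and the Order/Customer entities are hypothetical, and each unit would be configured in persistence.xml with its own JDBC driver and URL:

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;
import java.util.List;

public class TwoDatabasesSketch {
    public static void main(String[] args) {
        EntityManagerFactory oracleEmf = Persistence.createEntityManagerFactory("oracleUnit");
        EntityManagerFactory sqlServerEmf = Persistence.createEntityManagerFactory("sqlServerUnit");

        EntityManager oracleEm = oracleEmf.createEntityManager();
        EntityManager sqlServerEm = sqlServerEmf.createEntityManager();

        // Query each database separately, then combine the results in Java code.
        List<?> orders = oracleEm
                .createQuery("SELECT o FROM Order o WHERE o.amount > 100")
                .getResultList();
        List<?> customers = sqlServerEm
                .createQuery("SELECT c FROM Customer c")
                .getResultList();
        // ...join orders to customers manually here, e.g. by customerId
    }
}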
Pros for JPA in this scenario:
write database-management-system-independent JPQL queries
reduce the required Java code
Cons for JPA:
you cannot relate entities from different databases (like in a 1:1 relationship)
you cannot query several databases with one query (combining tables from different databases in a GROUP BY or similar)
I would recommend Presto and Calcite.
Performance and light weight don't always go hand in hand.
Presto: quite a lot of proven usage; as you have said, "Big Data". It performs well and scales well. I don't quite know what lightweight means specifically, but if requiring fewer machines is part of it, you can definitely scale it down according to your needs.
Calcite: embedded in a lot of data analytics libraries like Drill, Kylin and Phoenix. It does what you need ("connecting to multiple DBs") and, most importantly, it is lightweight.
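A minimal Calcite federation sketch; the model file path, schema names and table names are all made up. The model file is where you declare the schemas Calcite federates, e.g. a JDBC adapter for Oracle and a file/CSV adapter for the Excel export:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CalciteFederationSketch {
    public static void main(String[] args) throws Exception {
        // Requires calcite-core on the classpath; the driver registers itself.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:calcite:model=/path/to/model.json");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT o.\"employeeId\", COUNT(*) "
                   + "FROM \"oracle\".\"orders\" o "
                   + "JOIN \"excel\".\"employees\" e ON o.\"employeeId\" = e.\"employeeId\" "
                   + "GROUP BY o.\"employeeId\"")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}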
Having experience with some of the candidates (Apache Spark, Prestodb, Apache Drill) makes me choose Prestodb. Even though it is used mostly for Big Data, I think it is easy to set up and it has support for (almost) everything you are asking for. There are plenty of resources available online (including running it in Docker), it has excellent documentation and an active community, plus support from two companies (Facebook & Netflix).
Multiple Databases on Multiple Servers from Different Vendors
The most challenging case is when the databases are on different servers and some of the servers run different database software. For example, the customers database may be hosted on machine X on Oracle, and the orders database may be hosted on machine Y with Microsoft SQL Server. Even if both databases are hosted on machine X but one is on Oracle and the other on Microsoft SQL Server, the problem is the same: somehow the information in these databases must be shared across the different platforms. Many commercial databases support this feature using some form of federation, integration components, or table linking (e.g. IBM, Oracle, Microsoft), but support in the open-source databases (HSQL, MySQL, PostgreSQL) is limited.
There are various techniques for handling this problem:
Table linking and federation - link tables from one source into another for querying
Custom code - write code and multiple queries to manually combine the data
Data warehousing/ETL - extract, transform, and load the data into another source
Mediation software - write one query that is translated by a mediator to extract the data required
Maybe a vague idea: try Apache Solr. Take the different data sources and import the data into Apache Solr. Once the data is indexed, you can write different queries against it.
It is an open-source search platform that makes sure your search is fast.
That's what the Hibernate framework is for: Hibernate has its own query language, HQL, mostly identical to SQL. Hibernate acts as middleware to convert HQL queries into database-specific queries.
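A minimal HQL sketch, assuming Hibernate 5.2+ with a hibernate.cfg.xml on the classpath; the Employee entity is hypothetical, and the same HQL runs unchanged whether the configuration points at Oracle or SQL Server:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;
import java.util.List;

public class HqlSketch {
    public static void main(String[] args) {
        SessionFactory factory = new Configuration().configure().buildSessionFactory();
        try (Session session = factory.openSession()) {
            // Hibernate translates this HQL into the dialect of the configured database.
            List<?> result = session
                    .createQuery("from Employee e where e.name like :q")
                    .setParameter("q", "%Smith%")
                    .getResultList();
            System.out.println(result.size() + " matches");
        }
    }
}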

Preloaded Document based Desktop Application

I want to develop a desktop application that allows users to search through JSON files.
These files (around 50,000) are predefined and should be shipped with the application itself.
My question is: what would be the best way to ship these documents with the application while still allowing users to search for documents containing certain values; in SQL terms: show all documents where some JSON value within the document is like %Example%.
I thought about using some kind of NoSQL solution, preloading the files into the db and bundle it with the app. I've looked at some solutions, but I'm not really sure which one would be best suited for my needs or if it's even the best approach.
Bottom line is, I can't have my users install a db on their system, that is way too complicated.
I'd prefer a solution suitable for Java or Python.
Thanks for your help!
You can use an embedded database: either a memory-based database (like HSQL) or a file-based database like SQLite.
None of these require any installation by your end users. You just have to package the libraries (and, of course, the engine itself) as part of your application's install bundle.
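A minimal sketch of the SQLite route, assuming the sqlite-jdbc driver on the classpath and a docs.db file (hypothetical name, with a hypothetical documents table holding each JSON file as text) pre-built at packaging time:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EmbeddedJsonSearch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:docs.db");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT path, content FROM documents WHERE content LIKE ?")) {
            ps.setString(1, "%Example%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("path"));
                }
            }
        }
    }
}

If a plain LIKE scan over 50,000 documents turns out to be too slow, SQLite's FTS5 extension gives proper full-text indexing with the same zero-install deployment.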
If you are looking for a k/v store, then the good ol' Berkeley DB should suffice. If you are really looking for an "embedded NoSQL solution", try MooDB.
Mongo DB comes in an embeddable version: https://github.com/flapdoodle-oss/embedmongo.flapdoodle.de
I've used it for integration testing (mocking a Mongo server) and it works really well!
Anytime I read "document" and "search", I also think of Solr: http://lucene.apache.org/solr/

Implementing full text search in Java EE

I am making an application where I am going to need full text search, so I found Compass, but that project is no longer maintained and has been replaced by elasticsearch. However, I don't understand it. Is it its own server that I need to make requests (GET, PUT, etc.) against and then parse the JSON response? Are there no annotations like in Compass? I don't understand how this is a replacement and how I would use it with Java EE.
Or are there other better projects to use?
Elasticsearch is a great choice nowadays; if you liked Compass you'll love it. Have a look at this answer that the author gave here, in which he explains why he went ahead and created elasticsearch after Compass. In fact, both elasticsearch and Solr make the use of Lucene pretty easy, also adding some features to it. You basically get a whole search engine server which is able to index your data, which you can then query in order to retrieve the data that you indexed.
Elasticsearch exposes RESTful APIs and it's JSON based, but if you are looking for annotations in the Compass style you can have a look at the Object Search Engine Mapper for ElasticSearch.
I would say give Lucene or Solr a try. They maintain an index of your documents on the file system for fast searching.
I would recommend either:
Elasticsearch, if the system is big and needs clustering;
Lucene or Solr, if you want to code at a low level;
Hibernate Search, if you are using Hibernate as your ORM (see the sketch below).
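If you go the Hibernate Search route, the Compass-style annotations still exist. A minimal sketch assuming Hibernate Search 5-style annotations; the Article entity and its fields are made up for illustration:

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// Annotating the entity is enough for Hibernate Search to keep a Lucene
// index in sync with it as you persist and update Article instances.
@Entity
@Indexed
public class Article {

    @Id
    private Long id;

    @Field // makes the title full-text searchable
    private String title;

    @Field
    private String body;
}

Querying then goes through org.hibernate.search.Search.getFullTextSession(session) and createFullTextQuery(...), so you stay inside the Hibernate API rather than talking to a separate server.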
