I am using Spring Data (Mongo) for my web application (close to a social networking website). Now, I wish to provide search capabilities over the content written within the application (such as posts, tags, friends, etc.).
I believe Lucene/Solr is one of the better libraries for such cases, but I'm not sure how to integrate it with Spring Data (or whether Spring has some built-in support for it).
Would appreciate help (documentation, links, blog posts, etc.) on this!
Though this question has been around for a while, you may want to have a look at Spring Data Solr: https://github.com/SpringSource/spring-data-solr/
The Spring Data Solr project provides a familiar Spring Data-style API for querying data from Solr. Read the examples for a quick overview.
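For a flavour of the API, here is a minimal sketch of a document and repository; the Post class, its fields and the core name are made up, and annotation attributes vary slightly between versions of the library.

import java.util.List;

import org.apache.solr.client.solrj.beans.Field;
import org.springframework.data.annotation.Id;
import org.springframework.data.solr.core.mapping.SolrDocument;
import org.springframework.data.solr.repository.SolrCrudRepository;

// Maps the class to a Solr core; older versions use solrCoreName instead of collection.
@SolrDocument(collection = "posts")
public class Post {
    @Id @Field
    private String id;

    @Field
    private String content;

    @Field
    private List<String> tags;

    // getters and setters omitted for brevity
}

// Spring Data derives the Solr query from the method name, as with other stores.
interface PostRepository extends SolrCrudRepository<Post, String> {
    List<Post> findByTags(String tag);
}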
I found a good read here - http://adeithzya.wordpress.com/2011/08/25/using-apache-solr-with-spring-framework - that hits the nail on the head!
Integrating them is relatively easy; the difficult part is maintaining data consistency between them. For example, how would you answer these questions:
How and when do you intend to perform CRUD with Mongo and Solr? Do you write to Mongo first (with or without waiting for a confirmation?) and then to Solr?
If you're using async writes with Mongo, what happens when you send the data to Solr and then get an exception from Mongo (the data exists in Solr but not in Mongo)?
What happens if you get an error while trying to write to Solr (the data exists in Mongo but not in Solr)?
What if you delete something from Mongo, and right after that someone performs a search and Solr returns the deleted document because it still has it indexed?
The point is that there will be a window of inconsistency where Mongo and Solr are not in sync, and you probably want to handle at least some of these cases; one common pattern is sketched below.
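To make the trade-off concrete, here is a rough sketch of that pattern: treat Mongo as the system of record, index in Solr only after the Mongo write is acknowledged, and queue failed index writes for retry. Every type in it is a hypothetical placeholder, not a real library API.

class Post {
    private final String id;
    Post(String id) { this.id = id; }
    String getId() { return id; }
}

interface MongoPostRepository { void save(Post post); }      // your Mongo DAO
interface SolrPostIndexer { void index(Post post); }         // your Solr indexer
interface RetryQueue { void enqueue(String postId); }        // re-index queue

class PostService {
    private final MongoPostRepository mongo;
    private final SolrPostIndexer solr;
    private final RetryQueue retryQueue;

    PostService(MongoPostRepository mongo, SolrPostIndexer solr, RetryQueue retryQueue) {
        this.mongo = mongo;
        this.solr = solr;
        this.retryQueue = retryQueue;
    }

    void createPost(Post post) {
        // 1. Mongo is the system of record: persist there first and wait for the ack.
        mongo.save(post);
        try {
            // 2. Index in Solr only after Mongo confirms; search lags writes slightly.
            solr.index(post);
        } catch (Exception e) {
            // 3. The post now exists in Mongo but is not searchable yet: queue a
            //    re-index instead of failing the whole request.
            retryQueue.enqueue(post.getId());
        }
    }
}

Deletes need the mirror-image handling: remove from Solr as well, and accept that a search may briefly return a document that is already gone from Mongo.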
My current search engine involves two desktop applications based on Lucene (Java): one dedicated to indexing internal documents, the other to searching.
Now I have been asked to offer the search engine as a web page, so my first thought was to use Solr. I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether a PDF originates from a scanned document, and we limit the number of pages that will be OCRed in a scanned PDF, since only the first pages are valuable for search. For now everything works via calls to the Lucene API in classes with lots of if statements!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(J) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I do it?
Thank you very much in advance for your advice.
While this is rather opinion-based, I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, external to Solr (i.e. reuse your existing code that pushes data directly to a Lucene index today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ to submit the document to Solr instead (a sketch follows below). That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to upgrade Solr in the future, or to switch to a newer version of the PDF library you're using for parsing, without having to coordinate and integrate it into Solr.
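As a rough illustration, assuming SolrJ 6+ and a made-up core and schema, the submission side could look like this:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ExternalIndexer {
    public static void main(String[] args) throws Exception {
        // "documents" and the field names are placeholders; point this at your core.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "Example PDF");
            // The text your existing PDF/OCR pipeline already extracted:
            doc.addField("content", "text produced by the current indexing code");
            solr.add(doc);
            solr.commit();
        }
    }
}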
It'll also allow you to run the indexing code completely separately from Solr, and if you decide to drop Solr for another HTTP-interfaced technology in the future (for example Elasticsearch, which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much as possible of the indexing code lives outside of Solr, since Solr will then only be concerned with the actual text and won't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.
Note: There is a good chance I'm not using the correct terminology here, and that may be the reason I'm not finding the answers to my question. I apologize upfront if this has already been answered; please just direct me there.
I am looking for an open source framework written in Java that would allow me to build pluggable data connectors (and that obviously has some built in already), with something like a query language (an abstraction layer) that would translate into any of those connections.
For example: I would be able to say:
Fetch 1 record from a Mongo DB that matches name='John Doe'
and get JSON as a response
or I could say
Fetch all records from a MySQL DB that matches name='John Doe'
and get JSON as a response
If not exactly what I described, I am willing to work with anything that would have a part of this solved.
Thank you in advance!
You're not going to find a "Swiss army knife" data abstraction framework that does all of the above. Perhaps the closest thing to what you're asking for would be JPA providers for both Mongo and MySQL (Hibernate is a well-regarded JPA provider for MySQL, and a quick Google search turns up Kundera, DataNucleus and Hibernate OGM for Mongo). This will let you map your data to Java objects, which may be a step beyond what you asked for, since you explicitly asked for JSON; however, there are numerous options for mapping the resulting objects to JSON if you need to present JSON to a user or another system (Jackson comes to mind for this; a sketch follows below).
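For example, a minimal sketch of the Jackson step, with a made-up Person entity:

import com.fasterxml.jackson.databind.ObjectMapper;

import javax.persistence.Entity;
import javax.persistence.Id;

@Entity
public class Person {
    @Id
    private Long id;
    private String name;

    public Person() { }
    public Person(Long id, String name) { this.id = id; this.name = name; }
    public Long getId() { return id; }
    public String getName() { return name; }
}

class JsonExample {
    public static void main(String[] args) throws Exception {
        // In a real application the entity would come from the JPA provider.
        Person person = new Person(1L, "John Doe");
        String json = new ObjectMapper().writeValueAsString(person);
        System.out.println(json); // {"id":1,"name":"John Doe"}
    }
}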
Try YADA, an open source data-abstraction framework.
From the README:
YADA is like a Universal Remote Control for data.
For example, what if you could access
any data set
at any data source
in any format
from any environment
using just a URL
with just one-time configuration?
You can with YADA.
Or, what if you could get data
from multiple sources
in different formats
merging the results
into a single set
on-the-fly
with uniform column names
using just one URL?
You can with YADA.
Full disclosure: I am the creator of YADA.
I intend to make a niche search engine. I am using apache-nutch-1.6 as the crawler and apache-solr-3.6.2 as the searcher. I must say there is very little up-to-date information on the web about these technologies.
I followed this tutorial (http://wiki.apache.org/nutch/NutchTutorial) and have successfully installed Nutch and Solr on my Ubuntu system. I was also able to inject the seed URL into the webdb and perform the crawl.
Using the Solr interface at http://localhost:8983/solr/admin, I can also query the crawled results, but the output I receive is not user-friendly (screenshot omitted).
Am I missing something here? The earlier apache-nutch-0.7 shipped a WAR which generated clean HTML output (screenshot omitted). How do I achieve this? If anyone could point me to an up-to-date tutorial or guidebook, it would be highly appreciated.
A couple of things:
If you are just starting, do not use Solr 3.6; go straight to the latest 4.1+. A bunch of things have changed and a lot of new features have been added.
You seem to be saying that you will expose Solr and its UI directly to the general web - that's a really bad idea, as Solr is completely unsecured and allows web-based delete queries. You really want a business layer in the middle.
With Solr 4.1 there is a pretty admin UI, and there is also a /browse page that shows how to use Velocity to build pages backed by Solr. Or have a look at something like Project Blacklight for an example of how to put a UI over Solr.
I found the link below, which answered my query:
http://cmusphinx.sourceforge.net/2012/06/building-a-java-application-with-apache-nutch-and-solr/
I agree; after reading the content available at the above link, I felt rather annoyed with myself for not finding it sooner.
The Solr package provides all the required objects to query Solr.
In fact, the essential jars are just solr-solrj-3.4.0.jar, commons-httpclient-3.1.jar and slf4j-api-1.6.4.jar.
Anyone can build a Java search engine using these objects to query the index and put a fancy UI on top; a query sketch follows below.
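For illustration, here is a minimal query sketch. It uses the modern SolrJ client (the 3.x jars mentioned above used CommonsHttpSolrServer rather than HttpSolrClient), and the core and field names are made up:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/nutch").build()) {
            SolrQuery query = new SolrQuery("content:nutch"); // field:term syntax
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("url") + " : "
                        + doc.getFieldValue("title"));
            }
        }
    }
}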
Thanks again.
I've been building Java backends with Spring, Hibernate and RDBMSs for a while now. I also regularly work on mobile applications for iOS and Android.
So I have a full technology stack to use for this task; however, I am looking for something perhaps more advanced that better fits the requirements. I have had some thoughts about it, but I'd better first explain how my current systems work and then what I want my upcoming systems to look like.
Currently using
Spring Framework to connect everything together
Hibernate with Entity beans for persistence
MySQL or others as RDBMS
DTO objects created with Dozer
RESTful API to expose services
DTOs are transferred in JSON format
This setup works. But I have the feeling that it's just too much work and life could be simpler with other technologies.
What I am looking for
On the mobile side, I want to receive data for the current screen that I could easily cache. JSON is already serialized and would be easy to save to disk in the mobile application, without using yet another database. So the question is: how could I store the data in the backend so that I can retrieve it more easily, without using entity beans, DTOs and Dozer to convert between them? Isn't there a database solution that already delivers JSON? What about graph databases, for example, like OrientDB or Neo4j?
I definitely want to stick with Java and Spring, and I am open to replacements for Hibernate, the RDBMS, entity beans and DTOs.
Looking forward to your answers!
Your current design ("this setup works") has the qualities a good system should have: it is tiered, with good separation of concerns.
If I understand your requirement correctly, your argument is: if my end data format is JSON, why not store the data in JSON format and get rid of a lot of plumbing code/effort in the middle tier?
That would let you fetch the data from storage and pass it straight on to the requesting client. That is your requirement in a nutshell; please correct me if I am wrong.
Now, JSON is more of a textual notation and less of a storage format. JSON is generally consumed by the view tier of an MVC architecture, as it is easy to render on the screen using JavaScript.
Your reasoning for using a NoSQL DB which directly delivers JSON is credible, given that the end client is going to be a mobile app.
The overall architecture looks good and is well optimized for mobile access.
Now, coming to NoSQL JSON storage, the following document-store NoSQL DBs support a JSON interface:
i. CouchDB
ii. JasDB
iii. SchemaFreeDB
You can evaluate any of these to suit your needs.
(full disclosure - I'm an engineer with Kinvey, a BaaS provider)
One option you might consider is using a Backend-as-a-Service (BaaS). Most BaaS providers use JSON to transfer data over the wire, which sounds like it would be compatible with your requirements.
In addition, you'll typically get a lot of common mobile app functionality baked in (e.g. push notifications, file storage and CDN infrastructure, user management, etc.). This can be especially useful if you are building multiple apps, each with its own backend; rather than reinventing the wheel each time, you simply spin up a new backend.
One last but important note: pricing. A lot depends on your use case, but from what I've seen, a BaaS provider is usually significantly cheaper than rolling your own solution on AWS or another cloud provider, especially since most providers offer a free tier.
Even though this question is a bit old, here is a quick alternative to an RDBMS: MongoDB. It is a document database with document-level locking, and it scales really well.
Main point: it uses JSON as its document storage format (actually Binary JSON, a.k.a. BSON, but that is just a superset). Inserting a document into the database is as easy as
db.collection.insert(JSON);
on the mongo shell and
DBObject bson = (DBObject) JSON.parse(JSONstr); // com.mongodb.util.JSON from the legacy driver
collection.insert(bson);
in the java driver.
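Put together, a minimal sketch with the legacy Java driver; the database, collection and document are made up:

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.util.JSON;

public class MongoJsonInsert {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            DB db = client.getDB("mydb");
            DBCollection collection = db.getCollection("posts");
            String jsonStr = "{\"name\": \"John Doe\", \"tags\": [\"mongo\", \"json\"]}";
            // Parse the JSON string straight into BSON and store it as-is.
            DBObject bson = (DBObject) JSON.parse(jsonStr);
            collection.insert(bson);
        } finally {
            client.close();
        }
    }
}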
I am making an application where I am going to need full-text search, so I found Compass, but that project is no longer maintained and has been replaced by Elasticsearch. However, I don't understand it. Is it its own server that I need to make requests (GET, PUT, etc.) against and then parse the JSON response? Are there no annotations like in Compass? I don't understand how this is a replacement, or how I would use it with Java EE.
Or are there other better projects to use?
Elasticsearch is a great choice nowadays; if you liked Compass, you'll love it. Have a look at the answer the Compass author gave here, in which he explains why he went ahead and created Elasticsearch after Compass. In fact, Elasticsearch and Solr both make using Lucene pretty easy, while also adding features on top of it. You basically get a whole search engine server which is able to index your data, and which you can then query to retrieve the data you indexed.
Elasticsearch exposes RESTful APIs and is JSON based, but if you are looking for annotations in the Compass style, you can have a look at the Object Search Engine Mapper for ElasticSearch. A sketch of the raw REST interaction follows below.
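To give a feel for the RESTful/JSON side, here is a minimal sketch that indexes one document using nothing but the JDK; the host, index, type and document are made up:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsIndexExample {
    public static void main(String[] args) throws Exception {
        // PUT a JSON document at /index/type/id, the core Elasticsearch operation.
        URL url = new URL("http://localhost:9200/posts/post/1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        String json = "{\"title\": \"Hello\", \"body\": \"Full text search with Elasticsearch\"}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // 201 on first create
        conn.disconnect();
    }
}

In practice you would use one of the official clients rather than raw HTTP, but every operation ultimately maps to a request like this one.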
I would suggest giving Lucene or Solr a try. Lucene stores its index as files on the file system (an inverted index of your documents), which makes searching fast.
I would recommend either:
Elasticsearch, if the system is big and needs clustering.
Lucene or Solr, if you want to code at a lower level.
Hibernate Search, if you are using Hibernate as your ORM (a minimal sketch follows below).
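To illustrate the Hibernate Search option, here is a minimal sketch with a made-up entity, assuming the classic hibernate-search annotations (Hibernate Search 5.x style):

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// @Indexed tells Hibernate Search to keep a Lucene index for this entity,
// and @Field marks the properties that become full-text searchable.
@Entity
@Indexed
public class Book {
    @Id
    private Long id;

    @Field
    private String title;

    @Field
    private String description;

    // getters and setters omitted for brevity
}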