I am using the Twitter streaming API, where I am tracking 20,000 keywords, like
https://stream.twitter.com/1/statuses/filter.json?delimited=length&track=api,software,hardwate,etc
Right now I am using a sequential search (a for loop), but it takes a very long time to check one tweet against 20,000 keywords.
Is there a better search method available in Java for data coming from high-traffic HTTP/web streams?
If your data doesn't have to be processed in real time, you can use information retrieval (IR) techniques.
Have a back-end server that indexes all the data for you "overnight" (1). It will build an inverted index and listen for requests from your app.
Your app will then query the back-end server (instead of the streaming server), asking it for the wanted keywords as queries, using standard IR techniques.
You can use Apache Lucene to help you. Lucene is a mature open source information retrieval library, so it can help you with both indexing and querying.
Hope that helps
(1) Here "overnight" means one of these:
If there is a time when the app is inactive, indexing could be done then.
There are some libraries that support an index being queried and built at the same time. I cannot recall whether Lucene is one of them.
You can use two servers; at any point in time, one will be building the index and the other will be available for queries.
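To make the Lucene suggestion concrete, here is a minimal indexing-and-querying sketch, assuming a recent Lucene version; the index path, field name, and sample text are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class TweetIndexSketch {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get("tweet-index")); // hypothetical path

            // "overnight" indexing step: add each collected tweet as a document
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("text", "some tweet text about software and apis", Field.Store.YES));
                writer.addDocument(doc);
            }

            // query step: ask the index for one of the 20,000 keywords
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("text", "software")), 10);
                System.out.println("hits for 'software': " + hits.totalHits);
            }
        }
    }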
We have a requirement to incorporate an Excel-based tool into a Java web application. This Excel tool has a set of master data and a couple of result outputs using formula calculations on the master data.
The master data can be captured in the database with relational tables. We are looking for the best way to provide the capability to capture, validate, and evaluate formulas.
So far we have looked at using scripting engines (Nashorn) and providing formula support using eval. We would like to know how people are doing this elsewhere.
I've searched and found two possible libraries that could be useful for you; please have a look:
http://mathparser.org/
http://mathparser.org/mxparser-hello-world/mxparser-hello-world-java/
https://lallafa.objecthunter.net/exp4j/
https://lallafa.objecthunter.net/exp4j/#Evaluating_an_expression_asynchronously
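For example, evaluating a formula with exp4j looks roughly like the following sketch; the expression and variable values are just placeholders:

    import net.objecthunter.exp4j.Expression;
    import net.objecthunter.exp4j.ExpressionBuilder;

    public class FormulaSketch {
        public static void main(String[] args) {
            // the formula text could come from the Excel tool's master data (assumption)
            Expression e = new ExpressionBuilder("3 * sin(y) - 2 / (x - 2)")
                    .variables("x", "y")
                    .build()
                    .setVariable("x", 2.3)
                    .setVariable("y", 3.14);
            double result = e.evaluate();
            System.out.println(result);
        }
    }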
It depends on how big your data is and what your required SLA is, and also on what kind of formulas and other functions you want to support.
For example, consider a function like sum or max. Now, say the master data is in a relational table containing 10K rows. You could pull all of this data into a Java app and compute the sum (or run any other function). However, imagine the table contained 500K rows: streaming all 500K rows to the Java app would take some time and consume a lot of CPU and network bandwidth (database resources, local CPU resources). A better-optimized approach in that case is to index that column in the database and let the database do all the hard work for you.
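To illustrate the "let the database do the hard work" point, here is a rough JDBC sketch; the table, column, and credentials are made up for illustration. The SUM runs inside MySQL, so only one number crosses the network:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SumInDatabaseSketch {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/masterdata"; // placeholder connection details
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT SUM(amount) FROM master_rows WHERE category = ?")) {
                ps.setString(1, "SALES");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        // the aggregation happened in the database, not in the Java app
                        System.out.println("sum = " + rs.getBigDecimal(1));
                    }
                }
            }
        }
    }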
Personally, I don't like using eval. I would rather parse the user input to determine what actions to take.
I am assuming that the data is not big enough to require big data tools.
I don't want to use Lucene because I think it is too heavy.
Is there an easier way to implement this (for millions of records)?
If you don't want to have to worry about performance, I recommend you take a look at Amazon Web Services' new CloudSearch service. It's fast and scales as your needs scale. It can also handle millions of documents without a problem and supports wildcard searches (e.g. quo* would retrieve Quora).
Check it out here.
Obviously this isn't how it definitely works at either Quora or Google, as I haven't had the pleasure of working at either; this is just how I'd go about doing it.
The first thing to obtain is a list of search terms - I'm assuming you don't want to know how this is done, as it will really depend on all sorts of things, but basically you're either going to do a select distinct title from pages (in the case of the autocomplete on Wikipedia) or something much more advanced in the case of Google's.
The next step is also pretty simple at a high level: you need to perform the query select title from titles where title like 'Qu%' in the case of the user typing Qu into the search box. The list of titles is then returned to the browser as the response to some kind of Ajax request, perhaps in the form of JSON or similar. And you need to do it as fast as possible - that's where it becomes difficult.
How do they do it so quickly? There are probably four things to bear in mind.
They have LOTS of machines handling the requests. Bear in mind that Google's autocomplete is turned on by default and works in (almost?) all languages. That's a lot of searches against the autocomplete index. A lot more than there will be against the web index itself: for each web search request, Google will probably have processed 3 or 4 autocomplete requests.
They're probably doing it in memory. Google is already known to store its web indexes in memory, so I would expect them to be doing the same with this.
Specialised software (this is where it gets really interesting). While a traditional database or a NoSQL database could do this, and do it quickly, I would expect the big boys to actually be doing it with specialised code whose sole purpose is to provide autocomplete suggestions. The SQL statement I provided above was purely to demonstrate the logical request that would be needed. You're probably looking at some kind of specialised tree, such as a suffix tree, radix tree, or similar (see the sketch after this list).
Sharding. To cope with the quantity of data and the number of machines handling the requests, you're going to need to shard. That is, ensure that a certain subset of all the machines involved only processes requests that begin with one or more letters, e.g. a group of X machines processing searches that begin with a certain letter or even 2 letters. That means you've got more machines, but they don't each have to have the whole index to hand. How does a particular group of machines get chosen? You're either routing once the request is in your data centre, or you could route on the client side (e.g. in your JavaScript, decide which IP to query based upon the first X letters of the search term).
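To make point 3 a bit more concrete, here is a small, self-contained prefix-tree sketch in Java; it only illustrates the data structure, not how Google or Quora actually implement it:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal in-memory prefix tree: insert titles once, then collect everything under a prefix.
    public class AutocompleteTrie {
        private final Map<Character, AutocompleteTrie> children = new HashMap<>();
        private boolean endOfTitle;

        public void insert(String title) {
            AutocompleteTrie node = this;
            for (char c : title.toLowerCase().toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new AutocompleteTrie());
            }
            node.endOfTitle = true;
        }

        public List<String> suggest(String prefix) {
            AutocompleteTrie node = this;
            for (char c : prefix.toLowerCase().toCharArray()) {
                node = node.children.get(c);
                if (node == null) return new ArrayList<>(); // nothing starts with this prefix
            }
            List<String> results = new ArrayList<>();
            collect(node, new StringBuilder(prefix.toLowerCase()), results);
            return results;
        }

        private void collect(AutocompleteTrie node, StringBuilder current, List<String> results) {
            if (node.endOfTitle) results.add(current.toString());
            for (Map.Entry<Character, AutocompleteTrie> e : node.children.entrySet()) {
                current.append(e.getKey());
                collect(e.getValue(), current, results);
                current.deleteCharAt(current.length() - 1);
            }
        }

        public static void main(String[] args) {
            AutocompleteTrie trie = new AutocompleteTrie();
            trie.insert("Quora");
            trie.insert("Quantum mechanics");
            trie.insert("Queen");
            System.out.println(trie.suggest("qu")); // prints all three (lowercased) titles
        }
    }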
So, that's how I would do it. Not having had the experience of the enormous datasets Google/Quora are dealing with, I'm sure there are things that I've not considered. But, it's a start.
And, here's how I have done it, purely in an experimental environment at home:
I had a simple list of a good few hundred thousand titles to search. These were loaded into a dedicated MongoDB collection, which had a single index defined on it. I then had a Play Framework controller in front of it and used jQuery's autocomplete plugin to do the search.
Obviously this is tiny compared with what you are looking for, but MongoDB should provide the same kind of performance for your dataset, provided you follow the recommendations (i.e. good hardware, lots of RAM, keep the indexes in memory). In addition, Mongo supports sharding, and the Play Framework is share-nothing, so adding new machines to cope with the load should your userbase grow would be straightforward in this situation.
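For what it's worth, the prefix lookup in that setup boils down to something like the following sketch with the MongoDB Java driver; the database, collection, and field names are placeholders:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    public class MongoAutocompleteSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> titles = client.getDatabase("autocomplete")
                        .getCollection("titles"); // made-up database/collection names

                // the single index mentioned above, on the field we search by prefix
                titles.createIndex(Indexes.ascending("title"));

                // an anchored regex keeps the query a prefix scan that can use the index
                for (Document d : titles.find(Filters.regex("title", "^Qu")).limit(10)) {
                    System.out.println(d.getString("title"));
                }
            }
        }
    }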
By the way, Mongo is by no means the only solution, traditional SQL databases will be up to the job too, of course - I was just using Mongo for other reasons.
First, for autocomplete you should aim to get the response back to the user in <= 100ms if you want something that appears fast. That should be your first concern; any setup that can't do that probably won't be good enough for users. In my own tests in Firefox using Firebug, Google's autocomplete returned in about 50ms and Quora's in about 65ms.
See, e.g.
http://stackoverflow.com/questions/536300/what-is-the-shortest-perceivable-application-response-delay
Apparently, Quora uses prefix matching, not full-text search, which makes it faster. To roll your own fast prefix-based autocomplete, which should be sufficient for many cases but won't handle things like misspellings via fuzzy matching, try an in-memory data store like Redis. The details can be seen here:
http://charlesleifer.com/blog/powerful-autocomplete-with-redis-in-under-200-lines-of-python/
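If you go the Redis route from Java, one common pattern (not necessarily the exact approach in the linked article) is to keep the titles in a sorted set with equal scores and use ZRANGEBYLEX for the prefix lookup. A rough sketch using the Jedis client, with a made-up key and sample titles:

    import redis.clients.jedis.Jedis;

    public class RedisAutocompleteSketch {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // load titles once into a sorted set with equal scores,
                // so members are ordered lexicographically
                jedis.zadd("titles", 0, "quora");
                jedis.zadd("titles", 0, "quantum mechanics");
                jedis.zadd("titles", 0, "queen");

                // ZRANGEBYLEX returns every member between the two bounds,
                // i.e. everything starting with "qu" (works for ASCII titles)
                for (String title : jedis.zrangeByLex("titles", "[qu", "[qu\uffff")) {
                    System.out.println(title);
                }
            }
        }
    }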
I haven't been able to get CloudSearch (95-125ms in the browser fetching from the endpoint directly, as measured by Firebug, and 20-30ms longer accessing the endpoint via cURL in PHP) down to the low latencies of Google and Quora cited above, regardless of the simplicity of the search query. An Elasticsearch cluster is a bit faster. These observations obviously depend on the use case and probably don't generalize well, but they are something to think about.
I am using a MySQL database for my webapp.
I need to search over multiple tables and multiple columns; it is very similar to full-text searching over those columns.
I need to know your experience of using a full-text search framework (e.g. Solr/Lucene/MapReduce/Hadoop, etc.) over using plain SQL, in terms of:
Speed performance
Extra space usage
Extra CPU usage (is it continuously building the index?)
How long it takes to build the index before it is ready for use
Please let me know your experience of using these frameworks.
Thanks a lot!
To answer your questions:
1.) I have a database with roughly 5 million docs. MySQL full-text search needs 2-3 minutes; Solr/Lucene needs roughly 200-400 milliseconds for the same search.
2.) The space you need depends on your configuration, the number of copyFields, and whether you store the data or only index it. In my configuration the full DB is indexed but only the metadata is stored, so a 30GB DB needs 40GB for Solr/Lucene. Keep in mind that if you want to (re)optimize your index, you temporarily need 100% of the index size again.
3.) If you migrate from a MySQL full-text index to Lucene/Solr, you save CPU power. MySQL full-text search needs much more CPU power than Solr full-text search -> see answer 1.)
4.) It depends on the number of documents, the size of the documents, and the disk speed. Of course, CPU performance is very important. Indexing does not scale well over multiple CPUs: 2 big cores are much faster than 8 small ones.
Indexing 5 million docs (44GB) in my environment takes 2-3 hours on a dual-core VMware server.
5.) Migrating from a MySQL full-text index to a Lucene/Solr full-text index was the best idea ever. ;-) But you will probably have to redesign your application.
// Edit, to answer the question "Will the Lucene index get updated immediately after some insert statements?"
It depends on your Solr configuration, but it is possible.
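For example, with the SolrJ client you can ask Solr to make a newly added document searchable within a given time window via commitWithin; whether that actually behaves as near real time still depends on your Solr setup. A rough sketch with a placeholder URL, core, and field names:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class NearRealTimeAddSketch {
        public static void main(String[] args) throws Exception {
            // core name and URL are placeholders
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "row-42");
                doc.addField("text", "freshly inserted row");

                // ask Solr to make the document searchable within ~1 second
                solr.add(doc, 1000);
            }
        }
    }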
Q1: Lucene is usually faster and more powerful in terms of features (if correctly implemented)
Q2: if you don't store the original content, it's usually 20-30% of the original (indexed) content
Q4: It depends on the size of the content you want to index and on the amount of processing you'll be doing (you can have your own analyzers, etc.), then on your hardware... you'll have to do a benchmark. For one of my projects, it took 15 minutes to build a 500MB index (out-of-the-box performance, no tweaks attempted); for another, it took 3 days to build a huge 17GB index.
I need to search over a petabyte of data in CSV-format files. After indexing with Lucene, the index is double the size of the original files. Is it possible to reduce the index size? How do I distribute Lucene index files over Hadoop, and how do I use them in a search environment? Or is it necessary to use Solr to distribute the Lucene index? My requirement is instant search over petabytes of files.
Hadoop and MapReduce are based on batch-processing models. You're not going to get instant response times out of them; that's just not what these tools are designed to do. You might be able to speed up your indexing with Hadoop, but it isn't going to do what you want for querying.
Take a look at Lucandra, which is a Cassandra-based back end for Lucene. Cassandra is another distributed data store, developed at Facebook if I recall correctly, designed for faster access times in a more query-oriented access model than Hadoop.
Any decent off the shelf search engine (like Lucene) should be able to provide search functionality over the size of data you have. You may have to do a bit of work up front to design the indexes and configure how the search works, but this is just config.
You won't get instant results but you might be able to get very quick results. The speed will probably depend on how you set it up and what kind of hardware you run on.
You mention that the indexes are larger than the original data. This is to be expected. Indexing usually includes some form of denormalisation. The size of the indexes is often a trade off with speed; the more ways you slice and dice the data in advance, the quicker it is to find references.
Lastly, you mention distributing the indexes; this is almost certainly not something you want to do. The practicalities of distributing many petabytes of data are pretty daunting. What you probably want is to have the indexes sitting on a big, fat computer somewhere and to provide search services on the data (bring the query to the data, don't take the data to the query).
If you want to avoid changing your implementation, you should decompose your Lucene index into 10, 20, or even more indices and query them in parallel. It worked in my case: I created 8 indices for 80GB of data, and I needed to implement a search that works on a developer machine (Intel Core Duo, 3GB RAM).
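A rough sketch of what "query them in parallel" can look like with the Lucene API, assuming a recent Lucene version; the shard directories, field name, and thread count are made up:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ShardedSearchSketch {
        public static void main(String[] args) throws Exception {
            // index directories index-0 .. index-7 are placeholders for the 8 shards
            IndexReader[] shards = new IndexReader[8];
            for (int i = 0; i < shards.length; i++) {
                shards[i] = DirectoryReader.open(FSDirectory.open(Paths.get("index-" + i)));
            }

            ExecutorService pool = Executors.newFixedThreadPool(4);
            try (MultiReader all = new MultiReader(shards)) {
                // the executor lets the searcher fan the query out over the sub-indices in parallel
                IndexSearcher searcher = new IndexSearcher(all, pool);
                TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
                System.out.println(hits.totalHits);
            } finally {
                pool.shutdown();
            }
        }
    }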
I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed: some searches will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return hit counts for manually entered keywords, and TermEnums in the automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much or all of this using hashmaps in Java to work out the counts, but if I use Lucene, my question concerns the best way to store the words for counting. Taking a single RSS feed: is it wise to have Lucene create a temporary index in memory and pass the words and hit counts out so that other programs can write them to a database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time, so that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries, which I'm not sure about yet.
Hope this makes sense.
Mr Morgan.
From the description you have given in the question, I think Lucene alone will be sufficient (no need for MySQL or Solr). The Lucene API is also easy to use, and you won't need to change your front-end code.
From every RSS feed you can create a Document having three fields, namely title, description, and date. The date should preferably be a NumericField. You can then append every document to the Lucene index as the feeds arrive.
How frequently do you want the system to automatically generate the popular terms? For example, do you want to show the users the "most popular terms last week", etc.? If so, you can use a NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can find the document frequency of each term in the retrieved documents to identify the most popular terms. (Do not forget to remove the stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be the stopwords.)
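A minimal sketch of that document layout; note that on recent Lucene versions, LongPoint and LongPoint.newRangeQuery are the rough equivalents of the older NumericField and NumericRangeFilter mentioned above. Field names, sample text, and the date range are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongPoint;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class FeedIndexSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // in-memory index, just for the sketch

            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("title", "Example feed item title", Field.Store.YES));
                doc.add(new TextField("description", "Words from the feed description", Field.Store.NO));
                doc.add(new LongPoint("date", System.currentTimeMillis())); // poll timestamp
                writer.addDocument(doc);
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                long now = System.currentTimeMillis();
                // documents whose date falls between x and y (here: the last 7 days)
                TopDocs inRange = searcher.search(
                        LongPoint.newRangeQuery("date", now - 7L * 24 * 60 * 60 * 1000, now), 100);
                System.out.println(inRange.totalHits);
            }
        }
    }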
I can recommend you check out Apache Solr. In a nutshell, Solr is a web-enabled front end to Lucene that simplifies integration and also provides value-added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word-counting feature you are asking about, Solr has a concept of "faceting" which will exactly fit the problem you're describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/
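As an illustration, a facet query with the SolrJ client looks roughly like the sketch below; it returns, for each term in the chosen field, the number of documents containing it. The URL, core, and field names are placeholders:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class WordCountFacetSketch {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/feeds").build()) {
                SolrQuery query = new SolrQuery("*:*");
                query.setFacet(true);
                query.addFacetField("title");   // counts per indexed term in this field
                query.setFacetLimit(20);        // top 20 most popular terms

                QueryResponse response = solr.query(query);
                for (FacetField.Count c : response.getFacetField("title").getValues()) {
                    System.out.println(c.getName() + " -> " + c.getCount());
                }
            }
        }
    }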
Solr is definitely the way to go although I would caution against using it with Apache Tomcat on Windows as the install process is a bloody nightmare. More than happy to guide you through it if you like as I have it working perfectly now.
You might also consider the full-text indexing capabilities of MySQL, which are far easier than Lucene.
Regards