I'm looking to implement Lucene for a webapp I'm working on and I'm more or less looking for a "best practice".
What I'm trying to achieve is to have a web request come in (via an ajax post) and add a doc to my lucene index with the information posted.
However, what I'm unsure of is:
Can I run Lucene in the context of a web request, or do I need to run it separately and have my requests write to a folder (which the separate Lucene process would monitor, loading docs from the files it finds)?
I've been doing a lot of searching into how to achieve this but I'm not finding many/any results, so I think I'm totally off base here.
I think the deciding factor here is just what sort of response time you want users (or the ajax client) to see, and whether you want to guarantee that when the request completes the document has actually been added. That said, adding a document to a Lucene index is usually relatively fast (less than a few milliseconds is not unusual), so you can probably do it in the context of the web request unless you have extremely strict timing requirements. Of course, indexing speed will depend on the document size and the complexity of your tokenizing and analysis.
(And if the request just queues a document for later indexing, then the client doesn't know for sure whether it was indexed when the response comes back. You would have to come up with some other way of letting clients know when a document was indexed or if there was an error indexing it, if they care about that.)
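For a concrete picture, here is a minimal sketch of indexing on the request thread, assuming a Lucene 3.x-era Field API and a single application-wide IndexWriter that a startup listener has stashed in the servlet context; the attribute name and field names are made up:

    import java.io.IOException;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Hypothetical servlet handling the AJAX POST; Lucene allows one writer per index,
    // so the IndexWriter is created once at startup and shared via the servlet context.
    public class AddDocumentServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            IndexWriter writer = (IndexWriter) getServletContext().getAttribute("indexWriter");

            Document doc = new Document();   // Lucene 3.x-era Field API
            doc.add(new Field("title", req.getParameter("title"), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("body", req.getParameter("body"), Field.Store.YES, Field.Index.ANALYZED));

            writer.addDocument(doc);   // typically a few milliseconds
            writer.commit();           // makes the document durable before we reply, at the cost of latency
            resp.setStatus(HttpServletResponse.SC_CREATED);
        }
    }

Committing on every request is the simplest way to guarantee the document is really there before the response goes back; if you can tolerate a small window of loss, batching commits in the background is cheaper.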
One thing you may want to consider is periodically optimizing the index to keep space requirements small and queries fast. Optimizing can take a long time, so it's not something you want to do after every addition, and you probably don't want to do it in the context of a web request.
Related
I don't want to use Lucene because I think it is too heavy.
Is there any easier way to implement this (millions of records)?
If you don't want to have to worry about performance, I recommend you take a look at Amazon Web Services' new CloudSearch service. It's fast and scales as your needs scale. It can also handle millions of documents without a problem and supports wildcard searches (e.g. quo* would retrieve Quora).
Check it out here.
Obviously this isn't necessarily how it works at either Quora or Google, as I haven't had the pleasure of working at either... this is just how I'd go about doing it.
The first thing to obtain is a list of search terms - I'm assuming you don't want to know how this is done, as it will really depend on all sorts of things, but basically you're either going to do a select distinct title from pages (in the case of the autocomplete on Wikipedia) or something much more advanced in the case of Google.
The next step is also pretty simple at a high level: you need to perform the query select title from titles where title like 'Qu%' in the case of the user typing Qu into the search box. The list of titles is then returned to the browser as the response to some kind of Ajax request, perhaps in the form of JSON or similar. And you need to do it as fast as possible - that's where it becomes difficult.
How do they do it so quickly? There are probably four things to bear in mind.
They have LOTS of machines handling the requests. Bear in mind that Google's autocomplete is turned on by default and works in (almost?) all languages. That's a lot of searches against the autocomplete index. A lot more than there will be against the web index itself: for each web search request, Google will probably have processed 3 or 4 autocomplete requests.
They're probably doing it in memory. Google is already known to store its web indexes in memory, so I would expect them to be doing the same with this.
Specialised software (this is where it gets really interesting). While a traditional database or a NoSQL database could do this, and do it quickly, I would expect the big boys to actually be doing it with specialised code whose sole purpose is to provide autocomplete suggestions. The SQL statement I provided above was purely to demonstrate the logical request that would be needed. You're probably looking at some kind of specialised tree, such as a suffix tree, radix tree, or similar (see the sketch after this list).
Sharding. To cope with the quantity of data and the number of machines handling the requests, you're going to need to shard. That is, ensure that a certain subset of all the machines involved only processes requests that begin with one or more letters, e.g. a group of X machines processing searches that begin with a certain letter or even 2 letters. That means that you've got more machines, but they don't each have to have the whole index to hand. How does a particular group of machines get chosen? You're either routing once the request is in your data centre, or you could route on the client side (e.g. in your JavaScript, decide which IP to query based upon the first X letters of the search term).
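To make point 3 concrete, here is a minimal in-memory prefix trie in Java. It's purely illustrative - no ranking, no memory limits, no compression into a radix tree, and the suggestion order is arbitrary - and certainly not how Google or Quora actually do it:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal prefix trie for autocomplete suggestions (illustrative only).
    class AutocompleteTrie {
        private final Map<Character, AutocompleteTrie> children = new HashMap<>();
        private boolean isWord;

        void insert(String title) {
            AutocompleteTrie node = this;
            for (char c : title.toLowerCase().toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new AutocompleteTrie());
            }
            node.isWord = true;
        }

        List<String> suggest(String prefix, int limit) {
            AutocompleteTrie node = this;
            for (char c : prefix.toLowerCase().toCharArray()) {
                node = node.children.get(c);
                if (node == null) return Collections.emptyList();   // nothing starts with this prefix
            }
            List<String> out = new ArrayList<>();
            collect(node, new StringBuilder(prefix.toLowerCase()), out, limit);
            return out;
        }

        private void collect(AutocompleteTrie node, StringBuilder current, List<String> out, int limit) {
            if (out.size() >= limit) return;
            if (node.isWord) out.add(current.toString());
            for (Map.Entry<Character, AutocompleteTrie> e : node.children.entrySet()) {
                current.append(e.getKey());
                collect(e.getValue(), current, out, limit);
                current.deleteCharAt(current.length() - 1);
            }
        }

        public static void main(String[] args) {
            AutocompleteTrie trie = new AutocompleteTrie();
            trie.insert("Quora");
            trie.insert("Quorum");
            trie.insert("Quotation");
            System.out.println(trie.suggest("Quo", 10));   // e.g. [quora, quorum, quotation]
        }
    }

A real implementation would pre-compute ranked completions at each node so the hot path is a lookup rather than a traversal.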
So, that's how I would do it. Not having had the experience of the enormous datasets Google/Quora are dealing with, I'm sure there are things that I've not considered. But, it's a start.
And, here's how I have done it, purely in an experimental environment at home:
I had a simple list of a good few hundred thousand titles to search. These were loaded into a dedicated MongoDB collection, which had a single index defined on it. I then had a Play Framework controller in front of it and used jQuery's autocomplete plugin to do the search.
Obviously this is tiny compared with what you are looking for, but MongoDB should provide the same kind of performance for your dataset provided you follow the recommendations (i.e. good hardware, lots of RAM, keep the indexes in memory). In addition, Mongo supports sharding, and the Play Framework is shared-nothing, so adding new machines to cope with the load should your userbase grow would be straightforward in this situation.
By the way, Mongo is by no means the only solution, traditional SQL databases will be up to the job too, of course - I was just using Mongo for other reasons.
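For what it's worth, the query side of that experiment boiled down to an anchored regex against the indexed field. Here's a rough sketch with the MongoDB Java driver (the database, collection and field names are just for illustration; my actual setup sat behind a Play controller and jQuery's autocomplete plugin rather than a standalone snippet like this):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    public class TitleLookup {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> titles =
                        client.getDatabase("autocomplete").getCollection("titles");
                titles.createIndex(Indexes.ascending("title"));   // the single index mentioned above

                // A case-sensitive regex anchored with ^ can use that index as a prefix scan.
                for (Document doc : titles.find(Filters.regex("title", "^Qu")).limit(10)) {
                    System.out.println(doc.getString("title"));
                }
            }
        }
    }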
First, for autocomplete you should aim to get the response back to the user in <= 100ms if you want something that appears fast. That should be your first concern. Any setup that can't do that probably won't be good enough for users. In my own tests in Firefox using Firebug, Google's autocomplete returned in about 50ms and Quora's in about 65ms.
See, e.g.
http://stackoverflow.com/questions/536300/what-is-the-shortest-perceivable-application-response-delay
Apparently, Quora uses prefix matching, not full-text search, which makes it faster. To roll your own fast prefix-based autocomplete - which should be sufficient for many cases, but won't handle things like misspellings via fuzzy matching, etc. - try an in-memory data store like Redis. The details can be seen here:
http://charlesleifer.com/blog/powerful-autocomplete-with-redis-in-under-200-lines-of-python/
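If you want a feel for that approach in Java rather than Python, here's a rough sketch using Jedis and a Redis sorted set: give every title the same score so ZRANGEBYLEX becomes a pure lexicographic prefix scan. The key name and titles are made up, it assumes ASCII input, and unlike the linked article it does no scoring or ranking:

    import redis.clients.jedis.Jedis;

    public class RedisPrefixSearch {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Index time: all members share score 0 so the set is ordered purely lexicographically.
                jedis.zadd("titles", 0, "quora");
                jedis.zadd("titles", 0, "quorum");
                jedis.zadd("titles", 0, "quotation");

                // Query time: everything between "quo" and "quo" followed by a high code point.
                var matches = jedis.zrangeByLex("titles", "[quo", "[quo\uffff");
                matches.forEach(System.out::println);   // quora, quorum, quotation
            }
        }
    }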
I haven't been able to get CloudSearch down to the low latencies of Google and Quora I cited, regardless of the simplicity of the search query (95-125ms in the browser fetching from the endpoint directly as measured by Firebug, and 20-30ms longer accessing the endpoint via cURL in PHP). An Elasticsearch cluster is a bit faster. These statements obviously depend upon use case and probably don't generalize well, but they're something to think about.
I'm trying to optimize the performance of a page load in GAE, and I'm a bit stumped as to what is taking so long to serve the page.
When I first got appstats running I found the page was calling about 500-600 RPC calls. I've now got that down to 3.
However, I'm still seeing a massive load of extra time in App Stats. Another page on my site (using the same django framework + templating) loads in about 60ms, doing a small query to a small data set.
Question is, what is this overhead, and where should I be looking for trouble points?
The data in the request has about 350 records, and about 30 properties per record. I'm fine with the data call itself taking the datastore API time, but it's the other time I'm confused about. The data does get stepped through a biggish iterator, and I've now used fetch on most of these requests to keep the RPC calls down and make sure things are in memory rather than being queried as they go.
Slow Request - Look at all the extra blue
Fast Request - RPC blue is matched against overall blue
EDIT
OK, so I have created a new model called FastModel, and copied the bare minimum items needed for the page to it, so it can load as quickly as possible, and it does make a big difference. Seems there are things on the Model that slow it all down. Will investigate further.
Deserializing 350 records, especially large ones, takes a long time. That's probably what's taking up the bulk of your execution time.
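As a rough illustration of "deserialize less" (this is not the asker's setup, and the kind and property names are invented): on the Java runtime you could fetch only the properties the page actually renders with a projection query, instead of hydrating all ~30 properties per record, which is the same idea as the FastModel approach above.

    import java.util.List;

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.FetchOptions;
    import com.google.appengine.api.datastore.PropertyProjection;
    import com.google.appengine.api.datastore.Query;

    // Illustrative only: pull just the two properties the page needs,
    // so far less data has to be deserialized per entity.
    public class SlimPageQuery {
        public List<Entity> loadRows() {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Query q = new Query("Record");
            q.addProjection(new PropertyProjection("title", String.class));
            q.addProjection(new PropertyProjection("updated", java.util.Date.class));
            return ds.prepare(q).asList(FetchOptions.Builder.withLimit(350));
        }
    }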
I'm attempting to transfer a large two-dimensional array (17955 x 3) from my server to the client using asynchronous RPC calls. This is taking a very long time, which is especially bad because the data is needed in order to initialize the application. I've read that using a JSON object might be faster, but I'm not sure how to do the conversion in Java as I'm pretty new to the language and GWT, and I don't know if the speed difference is significant. I also read somewhere that I can zip the data, but I only read that in a forum and I'm not sure if it's actually possible as I couldn't find information about it elsewhere. Is there any way to transfer large amounts of data from server to client? Thanks for your time.
Read this article on adding JSON capabilities to GWT. Regarding compression, this article explains gzipping with GWT.
Also, the size of your array is still very large even with the compression you may achieve with gzipping, which will vary depending on how much data is repeated in your array. You may want to consider logically breaking the array up into multiple RPC calls if at all possible, as sketched below.
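Here's a rough sketch of what breaking it up could look like with a GWT-RPC service pair. The service name, method and chunk size are invented for the example; the point is just that the client can start working on the first rows while the rest arrive:

    import com.google.gwt.user.client.rpc.AsyncCallback;
    import com.google.gwt.user.client.rpc.RemoteService;
    import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;

    // Hypothetical service pair: the client asks for one chunk at a time
    // instead of the whole 17955 x 3 array in a single call.
    @RemoteServiceRelativePath("matrix")
    interface MatrixService extends RemoteService {
        double[][] getRows(int offset, int limit);
    }

    interface MatrixServiceAsync {
        void getRows(int offset, int limit, AsyncCallback<double[][]> callback);
    }

    class MatrixLoader {
        private static final int CHUNK = 1000;

        // Request the first chunk, then keep asking until the server returns an empty one.
        void load(final MatrixServiceAsync service, final int offset) {
            service.getRows(offset, CHUNK, new AsyncCallback<double[][]>() {
                public void onSuccess(double[][] rows) {
                    if (rows.length == 0) return;              // done
                    // ... append rows to the client-side model, start rendering early ...
                    load(service, offset + rows.length);       // fetch the next chunk
                }
                public void onFailure(Throwable caught) {
                    // ... surface the error, possibly retry this chunk ...
                }
            });
        }
    }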
I would recommend revisiting your design if your application needs such a large amount of data to initialize.
As others pointed out, you should reconsider your design, because even if you are able to solve the data transfer speed issue somehow, you will likely find other issues waiting for you:
Processing large amounts of data in the browser can be slow.
A lot of data means a lot of used-up memory.
What you can think about is:
Partitioning the data:
How is your user going to cope with a lot of data? Your user will probably need some kind of user interface aid to be able to work with such a huge amount of data. If you are going to use paging, tabs or other means to partition the data for the user's consumption, why not load the data on demand? For example, you can load a single page of records if you are using a paging grid, or you can load a single tab's worth of records if you are going to use tabs. Similarly, if you are going to allow filtering on the records, you can set a default filter after the load to keep the data to a minimum.
Summarizing the data:
You can also summarize the data on the server if you are not going to show each row to the user. For example, you can initially show a summary for each group of records and let the user drill down into a specific group.
I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed, some will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return the counts of hits for manually entered keywords and TermEnums in automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much / all of this using hashmaps in Java to work out counts, but if I use Lucene, my question concerns the best way to store the words for counting. To take a single RSS feed, is it wise to have Lucene create a temporary index in memory, and pass the words and hit counts out so other programs can write them to a database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time, so that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries, which I'm not sure about yet.
Hope this makes sense.
Mr Morgan.
From the description you have given in the question, I think Lucene alone will be sufficient (no need for MySQL or Solr). The Lucene API is also easy to use, and you won't need to change your frontend code.
From every RSS feed, you can create a Document with three fields: title, description and date. The date should preferably be a NumericField. You can then add each document to the Lucene index as the feeds arrive.
How frequently do you want the system to automatically generate the popular terms? For example, do you want to show users the "most popular terms last week", etc.? If so, then you can use a NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can then find the document frequency of each term in the retrieved documents to find the most popular terms. (Do not forget to remove the stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be the stopwords.)
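A rough sketch of both steps with the Lucene 3.x-era API those class names come from (the RAMDirectory, field values and one-week window are just for illustration; NumericRangeFilter works the same way as the NumericRangeQuery used here):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class FeedIndexer {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();   // swap for FSDirectory.open(...) to persist
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));

            // One document per feed item: analyzed text fields plus a numeric timestamp.
            Document doc = new Document();
            doc.add(new Field("title", "Example item title", Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("description", "Example item description", Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new NumericField("date", Field.Store.YES, true).setLongValue(System.currentTimeMillis()));
            writer.addDocument(doc);
            writer.commit();

            // "Items from the last week" as a numeric range on the date field.
            long now = System.currentTimeMillis();
            Query lastWeek = NumericRangeQuery.newLongRange("date",
                    now - 7L * 24 * 3600 * 1000, now, true, true);
            IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
            TopDocs hits = searcher.search(lastWeek, 100);
            System.out.println(hits.totalHits + " items in the last week");
        }
    }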
I can recommend you check out Apache Solr. In a nutshell, Solr is a web-enabled front end to Lucene that simplifies integration and also provides value-added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word-counting feature you are asking about, Solr has a concept of "faceting" which exactly fits the problem you're describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/
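To give a feel for the faceting Solr provides, here's a rough SolrJ sketch (4.x-era HttpSolrServer) that asks for the most frequent values of a field; the core name and the "words" field are invented for the example:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PopularWords {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/feeds");

            SolrQuery query = new SolrQuery("*:*");   // facet over the whole index
            query.setFacet(true);
            query.addFacetField("words");             // the indexed field to count terms in
            query.setFacetLimit(20);                  // top 20 values
            query.setFacetMinCount(1);

            QueryResponse response = solr.query(query);
            FacetField words = response.getFacetField("words");
            for (FacetField.Count count : words.getValues()) {
                System.out.println(count.getName() + " -> " + count.getCount());
            }
        }
    }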
Solr is definitely the way to go although I would caution against using it with Apache Tomcat on Windows as the install process is a bloody nightmare. More than happy to guide you through it if you like as I have it working perfectly now.
You might also consider the full-text indexing capabilities of MySQL, far easier than Lucene.
Regards
I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is why Nutch is eliminated, for example...
Does anybody know whether such a web crawler exists and, if the answer is yes, where I can find it?
Tnx...
What you're asking for is two components:
Web crawler
Lucene-based automated indexer
First a word of encouragement: been there, done that. I'll tackle both of the components individually from the point of view of making your own, since I don't believe you could use Lucene to do what you've requested without really understanding what's going on underneath.
Web crawler
So you have a web site/directory you want to "crawl" through to collect specific resources. Assuming that it's any common web server which lists directory contents, making a web crawler is easy: Just point it to the root of the directory and define rules for collecting the actual files, such as "ends with .txt". Very simple stuff, really.
The actual implementation could be something like this: use HttpClient to get the actual web pages/directory listings, and parse them in whatever way you find most efficient, such as using XPath to select all the links from the fetched document, or just parsing it with a regex using Java's readily available Pattern and Matcher classes. If you decide to go the XPath route, consider using JDOM for DOM handling and Jaxen for the actual XPath.
Once you get the actual resources you want, such as a bunch of text files, you need to identify the type of data to be able to know what to index and what you can safely ignore. For simplicity's sake I'm assuming these are plaintext files with no fields or anything and won't go deeper into that, but if you have multiple fields to store, I suggest you make your crawler produce 1..n specialized beans with accessors and mutators (bonus points: make the bean immutable, don't allow accessors to mutate the internal state of the bean, create a copy constructor for the bean) to be used in the other component.
In terms of API calls, you should have something like HttpCrawler#getDocuments(String url) which returns a List<YourBean> to use in conjunction with the actual indexer.
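A bare-bones sketch of that crawler using Apache HttpClient and a regex. For brevity it returns the raw file contents as strings rather than the beans described above, assumes the listing uses relative links, and leaves out politeness delays and error handling:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpCrawler {
        // The "ends with .txt" rule from above, applied to links in the directory listing.
        private static final Pattern LINK = Pattern.compile("href=\"([^\"]+\\.txt)\"");

        public List<String> getDocuments(String directoryUrl) throws Exception {
            List<String> texts = new ArrayList<>();
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                Matcher m = LINK.matcher(fetch(client, directoryUrl));
                while (m.find()) {
                    texts.add(fetch(client, directoryUrl + m.group(1)));   // relative link -> full URL
                }
            }
            return texts;
        }

        private String fetch(CloseableHttpClient client, String url) throws Exception {
            try (CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                return EntityUtils.toString(response.getEntity());
            }
        }
    }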
Lucene-based automated indexer
Beyond the obvious stuff with Lucene, such as setting up a directory and understanding its threading model (only one write operation is allowed at any time; multiple reads can exist even while the index is being updated), you of course want to feed your beans to the index. The five minute tutorial I already linked to basically does exactly that; look at the example addDoc(..) method and just replace the String with YourBean.
Note that Lucene's IndexWriter does have some cleanup methods which are handy to execute in a controlled manner. For example, calling IndexWriter#commit() only after a bunch of documents have been added to the index is good for performance, and calling IndexWriter#optimize() occasionally to make sure the index isn't getting hugely bloated over time is a good idea too. Always remember to close the index as well, to avoid unnecessary LockObtainFailedExceptions being thrown; as with all IO in Java, such an operation should of course be done in a finally block.
Caveats
You need to remember to expire your Lucene index's contents every now and then too, otherwise you'll never remove anything, and it'll get bloated and eventually just die because of its own internal complexity.
Because of the threading model you most likely need to create a separate read/write abstraction layer for the index itself to ensure that only one instance can write to the index at any given time.
Since the source data acquisition is done over HTTP, you need to consider the validation of data and possible error situations, such as the server not being available, to avoid any kind of malformed indexing and client hangs.
You need to know what you want to search from the index to be able to decide what you are going to put into it. Note that indexing by date must be done so that you split the date into, say, year, month, day, hour, minute and second fields instead of a millisecond value, because when doing range queries against a Lucene index, [0 TO 5] actually gets transformed into +0 +1 +2 +3 +4 +5, which means the range query dies out very quickly because there's a maximum number of query sub-parts.
With this information I do believe you could make your own special Lucene indexer in less than a day, three if you want to test it rigorously.
Take a look at the Solr search server and Nutch (a crawler); both are related to the Lucene project.