Searching is not responsive during indexing with Lucene

Searching is not responsive during indexing with Lucene - java

When I re-index the DB data of my application, and there is a search executed on the same time, the thread that runs the search is going to sleep until the re-indexing is done. I assume that the indexing methods are thread-safe in order to prevent change of the data while indexing. Is there any built in way in Lucene to make it responsive only for search (where the data is not being changed)? Or should I start thinking about something on my own? I'm running my application on a Tomcat server.
Thanks, Tomer

I assume that you are actually rebuilding the index (or reindexing everything from scratch, as opposed to reindexing individual documents). While the index is being rebuilt, you cannot perform the queries against it, because it's not in consistent state.
The simplest solution that is often used is to rebuild the index in the background (while still performing the queries against the old one) and then replace it with the fresh one.
If the problem you are facing is connected with frequent server crashes, it might be worthwhile to look at some more systematical approach like the one that is implemented for example in Zoie -- it records subsequent indexing requests, so it can recover from the last correct snapshot of the index.

Related

How to reboot application without losing the TreeMap kept in memory?

In a Spring Boot application, I keep a TreeMap in memory. I'm doing around 10,000 operations per second, and it may increase. To improve performance, I kept data in memory. I want my app to be able to start from the same state when application is restarted.
There are some methods I could find for this.
Keeping data on Hazelcast.
In this case I don't risk losing the data unless the Hazelcast dies, but if the Hazelcast dies, I can't restore data. Additionally, I don't think it makes sense to sync that amount of operations on Hazlecast.
Synchronizing events to database.
Here, my risk of data loss is very low. However, I need to execute a query after each operation. This may affect performance. Also, I need to handle exceptions on database update.
Synchronizing data in batches
There is only one ready solution that I could find here, MapDB. I'm planning to try it but I haven't tried it yet. If there is a more reliable, optimized sink solution that also uses db instead of file, I would prefer to use it.
Any recommendation to solve this question?

Do you need a Map or a TreeMap ?
Is collating sequence relevant for storage, for access or neither.
For Hazelcast, the chance for data loss is configurable. You set up a cluster with the level of resilience you want. This is the same as with disk, if you have one disk and it fails, you lose data. If you have two and one goes offline, you don't lose data. You allocate hardware for the level of resilience you need. Three is the recommended minimum.
(10,000 per second isn't worrying either, 1,000,000,000 has been done. Sync to an external store can be immediate or in batches)
Disclaimer, I work for Hazelcast, but I think your question is more fundamental -- how do you keep your store available.
Simply, don't restart.
Clustered solutions are the answer here. If you have multiple nodes, the service as a whole stays running even if a few nodes go offline.
Do rolling bounces.
If you must restart everything at once, what matters is how quickly can your service bring all data back and what does it do when the restore is 50% done (is 50% data visible?). Immediate replication to elsewhere is only really necessary if you have a clustered solution that hasn't been configured for resilience. Saving intermittently is fine if you have solved resilience.
So, configure your storage so that it doesn't go offline, makes the solution options for backup/restore all the easier.

Stale Lucene index when using multiple machines

I've got a Java/Hibernate/MySQL application up and running, and it works very nicely.
Recently I've been using Lucene (Hibernate Search) to speed up the searching and avoid round trips to the database by using projection. That works great too, except that the index gets stale when the application gets used on multiple machines. Lucene does a good job of updating the local index when changes are made locally, but it can't see changes from other machines.
Currently, I am:
reindexing in full once a week
updating a "last modified" time on all records, and updating the local index at start time based on anything modified since last indexing
But this doesn't work for deletions. If something gets deleted on one machine, it still turns up in searches on other machines.
Is there a 'standard' way to deal with this? I can think of a few options, none of which excite me:
reindex in full every night (still stale during the day, though)
maintain a table of deleted records so that I can use it to update locally
perform a round trip to the db at startup time to find all entries in the index but not in the db
add some sort of trigger to the db to record something somewhere when something gets deleted (this would work for updates as well as deletions)
Hard to believe this is a new problem, but I couldn't find any convincing answers.
Any help much appreciated.

Concurrent calls to a custom plugin processed 1 at a time

I developed a plugin of my own in Neo4j in order to speed the process of inserting node. Mainly because I needed to insert node and relationship only if they didn't exists before which can be too slow using the REST API.
If I try to call my plugin a 100 time, inserting roughly 100 nodes and 100 relationship each time, it take approximately 350ms on each call. Each call is inserting different nodes, in order to rule out locking cause.
However if I parallelize my calls (2, 3 , 4... at time), the response time drop accordingly to the parallelism degree. It takes 750ms to insert my 200 objects when I do 2 call at a time, 1000ms when I do 3 etc.
I'm calling my plugin from a .NET MVC controller, using HttpWebRequest. I set the maxConnection to 10000, and I can see all the TCP connection opened.
I investigated a little on this issue but it seems very wrong. I must have done something wrong, either in my neo4j configuration, or in my plugin code. Using VisualVM I found out that the threads launched by Neo4j to handle my calls are working sequentially. See the picture linked.
http://i.imgur.com/vPWofTh.png
My conf :
Windows 8, 2 core
8G of RAM
Neo4j 2.0M03 installed as a service with no conf tuning
Hope someone will be able to help me. As it is, I will be unable to use Neo4j in production, where there will be tens of concurrent calls, which cannot be done sequentially.

Neo4j is transactional. Every commit triggers an IO operation on filesystem which needs to run in a synchronized block - this explains the picture you've attached. Therefore it's best practice to run writes single threaded. Any pre-processing prior can of course benefit from parallelizing.
In general for maximum performance go with the stable version (1.9.2 as of today). Early milestone builds are not optimized yet, so you might get a wrong picture.
Another thing to consider is the transaction size used in your plugin. 10k to 50k in a single transaction should give you best results. If your transactions are very small, transactional overhead is significant, in case of huge transactions, you need lots of memory.
Write performance is heavily driven by the performance of underlying IO subsystem. If possible use fast SSD drives, even better stripe then.

Downloading A Large SQLite Database From Server in Binary vs. Creating It On The Device

I have an application that requires the creation and download of a significantly large SQLite database. Depending on the user's data, creation of the db and the syncing of data from the server can take upwards of 20 to 25 minutes (some customers have a LOT of data). The data is downloaded as JSON and processed with Android's built in JSON classes.
To account for OutOfMemory issues I was having with some devices, I needed to limit the per-call download from the server to 500 records at a time. But, as of now, all of the above is working successfully - although slow.
Recently, there has been talk from my team of creating the complete SQLite db on the server side and then just downloading it to the device in binary in an effort to speed things up. I've never done this before. Is this indeed a viable option OR should I just be looking into speeding up the processing of the JSON through a 3rd party lib like GSON or Jackson.
Thanks in advance for your input.

From my experience with mobile devices, reinventing synchronization is an overkill most of the time. It obviously depends on the hardware, software and amounts of data you're working with. But most of the time long operation execution times on mobile devices are caused by faulty design, careless coding or specifics of embedded systems not taken into consideration.
Unfortunately, I can only give you some hints which you may consider, given pretty vague description of issues you're facing. I mean "LOT" doesn't mean much to me - I've seen mobile apps with DBs containing millions of records running pretty smoothly and ones that had around a 1K records running horribly slow and causing UI to freeze. You also didn't mentioned what OS version and device (or at least it's capabilities) you're using. What's the server configuration, what software is installed, what libraries/frameworks are used and in what modes. It all matters when you want to really speed things up.
Apart of encoding being gzip (which I believe you left default, which is on), you should give this ideas a try:
Streaming! - make sure both the client and the server use a streaming version of JSON API and use buffered streams. If either doesn't - replace it with a library that does. Jackson has one of the fastest streaming API. Sure it's more cumbersome to write a (de)serializer, but it pays off. When done properly, none of the sides must create a buffer large enough for (de)serialization of all the data, fill it with contents, and then parse/write it. Instead, a much smaller buffer is allocated and filled gradually as successive fields are serialized. When this buffer gets filled, it's contents is immediately sent to the other end of data channel. There it can be deserialized right away. The process continues until all data have been transmitted in small chunks. It makes the data interchange much more fluent and less resource-intensive.
For large batch inserts or updates use prepared statements. It also sometimes helps to insert your data without constraints and then create them - that way, for example, an index can be computed in one run instead of for each insert. Don't use transactions (they require maintaining extra database logs) or commit every 300 rows to minimize the overhead. If you're updating existing database and atomic modifications are necessary - load new data to a temporary database and, if everything is ok, replace old database with new one on the fly.
Almost always some data can be precomputed and stored on an sd-card for example. Or it can be loaded directly to an sd-card as a prepared SQLite DB in the company. If a task requires data that is so large that an import takes more than 10 minutes, you probably shouldn't do that task on mobile devices in the first place.

MySQL performance

I have this LAMP application with about 900k rows in MySQL and I am having some performance issues.
Background - Apart from the LAMP stack , there's also a Java process (multi-threaded) that runs in its own JVM. So together with LAMP & java, they form the complete solution. The java process is responsible for inserts/updates and few selects as well. These inserts/updates are usually in bulk/batch, anywhere between 5-150 rows. The PHP front-end code only does SELECT's.
Issue - the PHP/SELECT queries become very slow when the java process is running. When the java process is stopped, SELECT's perform alright. I mean the performance difference is huge. When the java process is running, any action performed on the php front-end results in 80% and more CPU usage for mysqld process.
Any help would be appreciated.
MySQL is running with default parameters & settings.
Software stack -
Apache - 2.2.x
MySQL -5.1.37-1ubuntu5
PHP - 5.2.10
Java - 1.6.0_15
OS - Ubuntu 9.10 (karmic)

What engine are you using for MySQL? The thing to note here is if you're using MyISAM, then you're going to have locking issues due to the table locking that engine uses.
From: MySQL Table Locking
Table locking is also disadvantageous
under the following scenario:
* A session issues a SELECT that takes a long time to run.
* Another session then issues an UPDATE on the same table. This session
waits until the SELECT is finished.
* Another session issues another SELECT statement on the same table.
Because UPDATE has higher priority than SELECT, this SELECT waits for the UPDATE to finish,
after waiting for the first SELECT to finish.
I won't repeat them here, but the page has some tips on increasing concurrency on a table within MySQL. Obviously, one option would be to change to an engine like InnoDB which has a more complex row locking mechanism that for high concurrency tables can make a huge difference in performance. For more info on InnoDB go here.
Prior to changing the engine though it would probably be worth looking at the other tips like making sure your table is indexed properly, etc. as this will increase select and update performance regardless of the storage engine.
Edit based on user comment:
I would say it's one possible solution based on the symptoms you've described, but it may not be
the one that will get you where you want to be. It's impossible to say without more information.
You could be doing full table scans due to the lack of indexes. This could be causing I/O contention
on your disk, which just further exasterbates the table locks used by MyISAM. If this is the case then
the root of the cause is the improper indexing and rectifying that would be your best course of action
before changing storage engines.
Also, make sure your tables are normalized. This can have profound implications on performance
especially on updates. Normalized tables can allow you to update a single row instead of hundreds or
thousands in an un-normalized table. This is due to unduplicated values. It can also save huge amounts
of I/O on selects as the db can more efficiently cache data blocks. Without knowing the structure of
the tables you're working with or the indexes you have present it's difficult to provide you with a
more detailed response.
Edit after user attempted using InnoDB:
You mentioned that your Java process is multi-threaded. Have you tried running the process with a single thread? I'm wondering if maybe it's possibly you're sending the same rows to update out to multiple threads and/or the way you're updating across threads is causing locking issues.
Outside of that, I would check the following:
Have you checked your explain plans to verify you have reasonable costs and that the query is actually using the indexes you have?
Are your tables normalized? More specifically, are you updating 100 rows when you could update a single record if the tables were normalized?
Is it possible that you're running out of physical memory when the Java process is running and the machine is busy swapping stuff in and out?
Are you flooding your disk (a single disk?) with more IOPs than it can reasonably handle?

We'd need to know a lot more about the system to say if thats normal or how to solve the problem.
with about 900k rows in MySQL
I would say that makes it very small - so if its performing badly then you're going seriously wrong somewhere.
Enable the query log to see exactly what queries are running, prioritize based on the product of frequency and duration. Have a look at the explain plans, create some indexes. Think about splitting the database across multiple disks.
HTH
C.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.