neo4j "empty" database takes up a lot of disk space - java

I've inserted ~2M nodes (via the Java API), and deleted them after a day or two of usage (through Java too). Now my db has got 16k nodes, and weighs 6 GB.
Why wasn't this space freed?
What may be the cause?

The data/graph.db directory contains multiple items:
Store itself, split into multiple files
Indexes
Transaction log files
Log files (messages.log)
All your operations are stored in the transaction logs and then expire according to the keep_logical_logs setting. I'm not sure what the default value is, but I presume that you might have quite some space in use there.
I'd suggest checking what is taking up the space.
Also, we have sometimes seen that the space in use (as reported by du, for example) differs depending on whether Neo4j is running or stopped.

In addition to Alberto's answer, the store is not compacted. It leaves the empty records for reuse, and they will stay there forever. As far as I know, there is no available tool to compact the store (I've considered writing one myself, but usually convince myself that there aren't that many use cases affected by this).
If you do have a lot of churn where you are inserting and deleting records often, it's a good idea to restart your database often so it will reuse the records that it has marked as deleted.
As Alberto mentions, one of the first things I set when I install a new Neo4j (the other being the heap size) is keep_logical_logs, to something like 1-7 days. If you let the logs grow forever (the default), they will get quite large.
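If you run Neo4j embedded (as the question does), the same setting can be passed programmatically when the database is created. A minimal sketch, assuming a Neo4j 2.x-style embedded API; builder and setting names may differ in your version:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;
    import org.neo4j.graphdb.factory.GraphDatabaseSettings;

    public class Neo4jConfigExample {
        public static void main(String[] args) {
            // Keep transaction logs for 7 days instead of forever; the server-side
            // equivalent is keep_logical_logs=7 days in neo4j.properties.
            GraphDatabaseService db = new GraphDatabaseFactory()
                    .newEmbeddedDatabaseBuilder("data/graph.db")
                    .setConfig(GraphDatabaseSettings.keep_logical_logs, "7 days")
                    .newGraphDatabase();
            // ... use the database ...
            db.shutdown();
        }
    }
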

Related

Replacing a huge dump file with an efficient lookup Java key-value text store

I have a huge dump file - 12GB of text containing millions of entries. Each entry has a numeric id, some text, and other irrelevant properties. I want to convert this file into something that will provide an efficient look-up. That is, given an id, it would return the text quickly. The limitations:
Embedded in Java, preferably without an external server or foreign language dependencies.
Reads and writes to the disk, not in-memory - I don't have 12GB of RAM.
Does not blow up too much - I don't want to turn a 12GB file into a 200GB index. I don't need full-text search, sorting, or anything fancy - just key-value lookup.
Efficient - It's a lot of data and I have just one machine, so speed is an issue. Tools that can store large batches and/or work well with several threads are preferred.
Storing more than one field is nice, but not a must. The main concern is the text.
Your recommendations are welcomed!
I would use Java Chronicle or something like it (partly because I wrote it) because it is designed to access large amounts of data (larger than your machine) somewhat randomly.
It can store any number of fields in text or binary formats (or a combination if you wish). It adds 8 bytes per record you want to be able to randomly access. It doesn't support deleting records (you can mark them for reuse), but you can update and add new records.
It can only have a single writer thread, but it can be read by any number of threads on the same machine (even different processes).
It doesn't support batching, but it can read/write millions of entries per second with typical sub-microsecond latency (except for random reads/writes which are not in memory).
It uses next to no heap (<1 MB for TBs of data).
Records are addressed by a sequential id; if your ids aren't sequential, you can build a table to do just that translation.
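This is not Chronicle's actual API, but a minimal sketch of the same idea in plain Java: a data file holding length-prefixed text records, plus an index file with one 8-byte offset per record, so a lookup by sequential id is two seeks. All names here are illustrative assumptions:

    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    public class OffsetIndexedStore implements AutoCloseable {
        private final RandomAccessFile data;   // UTF-8 text records, length-prefixed
        private final RandomAccessFile index;  // 8 bytes per record: offset into the data file

        public OffsetIndexedStore(String basePath) throws Exception {
            data = new RandomAccessFile(basePath + ".data", "rw");
            index = new RandomAccessFile(basePath + ".index", "rw");
        }

        /** Appends a record and returns its sequential id. */
        public long append(String text) throws Exception {
            byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
            long offset = data.length();
            data.seek(offset);
            data.writeInt(bytes.length);
            data.write(bytes);
            long id = index.length() / 8;
            index.seek(id * 8);
            index.writeLong(offset);
            return id;
        }

        /** Random-access read by sequential id: one seek in each file. */
        public String read(long id) throws Exception {
            index.seek(id * 8);
            long offset = index.readLong();
            data.seek(offset);
            byte[] bytes = new byte[data.readInt()];
            data.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        @Override
        public void close() throws Exception {
            data.close();
            index.close();
        }
    }

If your ids are not sequential, a small id-to-record-number map (or a second index file) provides the translation mentioned above.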
BTW: You can buy 32 GB for less than $200. Perhaps it's time to get more memory. ;)
Why not use Java DB - the database that comes with the JDK?
It'll store the info on disk and be efficient in terms of lookups, provided you index properly. It'll run in-JVM, so you don't need a separate server/service. You talk to it using standard JDBC.
I suspect it'll be pretty efficient. This database has a long history (it's Apache Derby, which started life at IBM), and a lot of effort has been expended on it in terms of robustness and efficiency.
You'll obviously need to do an initial onboarding of the data to create the database, but that's a one-off task.
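A minimal sketch of that approach with embedded Derby/Java DB, assuming derby.jar is on the classpath; the table and column names are just placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DerbyLookup {
        public static void main(String[] args) throws Exception {
            // ";create=true" creates the on-disk database on first use.
            try (Connection conn = DriverManager.getConnection("jdbc:derby:dumpdb;create=true")) {
                conn.createStatement().executeUpdate(
                    "CREATE TABLE entries (id BIGINT PRIMARY KEY, body CLOB)");

                // One-off onboarding: insert each entry from the dump (batched for speed).
                try (PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO entries VALUES (?, ?)")) {
                    insert.setLong(1, 42L);
                    insert.setString(2, "some text");
                    insert.addBatch();
                    insert.executeBatch();
                }

                // Lookup by id: the primary-key index makes this fast.
                try (PreparedStatement query =
                         conn.prepareStatement("SELECT body FROM entries WHERE id = ?")) {
                    query.setLong(1, 42L);
                    try (ResultSet rs = query.executeQuery()) {
                        if (rs.next()) {
                            System.out.println(rs.getString(1));
                        }
                    }
                }
            }
        }
    }
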

Updating an item/document takes between 1-2 seconds in a small index

We have a small index - less than 1 MB in size and covering roughly 10,000 documents. The only fields that are stored are quite short, which explains the small index size.
After the documents are loaded into the index, an update of an existing document can take between 1 and 2 seconds (there's quite a variance in this range though). We've tried utilizing various best practices (such as those in the Lucene wiki) but can't find what's wrong. We've even gone ahead and are now using RAMDirectory to remove the possibility of IO being the problem.
Is this really the performance to expect?
UPDATE
As requested below, I'm adding some more details:
We're treating Lucene as a black box; we just measure the time it takes to reindex/update an object. We don't know what's going on inside.
The objects (or documents, in Lucene's terms) are quite small, with a total size of about 2 KB of data each.
A code snippet outlining your entire update procedure would help. Are you committing after each update? This is not necessary and for top performance you must use Near Realtime Readers. Newer Lucene versions have an NRTManager that handles most of the boilerplate involved.
In many cases the best practice is to commit only rarely or never (except when shutting down). If your service shuts down ungracefully, you lose your index, but even if you didn't, you'd have to rebuild it upon restart anyway to account for all the changes that happened in the meantime.
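A minimal sketch of the update-without-commit pattern. I'm using SearcherManager here, which plays the same near-real-time role as NRTManager in more recent Lucene versions; exact class names and signatures vary by release, so treat this as illustrative:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;

    public class NrtUpdateExample {
        // writer and searcherManager are long-lived objects created once at startup.
        public static void updateObject(IndexWriter writer, SearcherManager searcherManager,
                                        String id, String text) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new TextField("body", text, Field.Store.NO));

            // Replaces the old version of this document; note: no commit() here.
            writer.updateDocument(new Term("id", id), doc);

            // Make the change visible to searches via a near-real-time refresh
            // instead of a (much more expensive) commit.
            searcherManager.maybeRefresh();

            IndexSearcher searcher = searcherManager.acquire();
            try {
                // ... run queries against the fresh reader ...
            } finally {
                searcherManager.release(searcher);
            }
        }
    }
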

How much extra space/RAM/CPU is used by Apache Solr?

I am using a MySQL database for my webapp.
I need to search over multiple tables and multiple columns; it's very similar to full-text searching inside those columns.
I'd like to know your experience of using a full-text search engine (e.g. Solr/Lucene/MapReduce/Hadoop etc.) versus plain SQL, in terms of:
Speed/performance
Extra space usage
Extra CPU usage (is it continuously building the index?)
How long it takes to build the index, or for it to be ready for use
Please let me know your experience of using these frameworks.
Thanks a lot!
To answer your questions:
1.) I have a database with roughly 5 million docs. MySQL full-text search needs 2-3 minutes; Solr/Lucene needs roughly 200-400 milliseconds for the same search.
2.) The space you need depends on your configuration, the number of copyFields, and whether you store the data or only index it. In my configuration the full DB is indexed but only metadata is stored, so a 30 GB DB needs 40 GB for Solr/Lucene. Keep in mind that if you want to (re)optimize your index, you temporarily need 100% of the index size again.
3.) If you migrate from the MySQL full-text index to Lucene/Solr, you save CPU power. MySQL full-text search needs much more CPU power than Solr full-text search -> see answer 1.)
4.) It depends on the number of documents, the size of the documents, and the disk speed. Of course CPU performance is very important too. Indexing does not scale well over multiple CPUs; 2 big cores are much faster than 8 small cores.
Indexing 5 million docs (44 GB) in my environment takes 2-3 hours on a dual-core VMware server.
5.) Migrating from the MySQL full-text index to a Lucene/Solr full-text index was the best idea ever. ;-) But you will probably have to redesign your application.
Edit, to answer the question "Will the Lucene index get updated immediately after some insert statements?":
It depends on your Solr configuration, but it is possible.
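For example, with SolrJ you can ask Solr to make a newly added document searchable within a bounded time instead of committing on every insert. A hedged sketch: the client class is HttpSolrServer in older SolrJ releases (HttpSolrClient in newer ones), and the URL and field names are placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrCommitWithinExample {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "row-42");
            doc.addField("text", "content copied from the MySQL row");

            // commitWithin = 5000 ms: Solr commits on its own within 5 seconds,
            // so the document becomes searchable soon after the insert without
            // an explicit (and expensive) commit per update.
            solr.add(doc, 5000);
        }
    }
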
Q1: Lucene is usually faster and more powerful in terms of features (if correctly implemented).
Q2: If you don't store the original content, the index is usually 20-30% of the size of the original (indexed) content.
Q4: It depends on the size of the content you want to index, on the amount of processing you'll be doing (you can have your own analyzers, etc.), and on your hardware... you'll have to do a benchmark. For one of my projects it took 15 minutes to build a 500 MB index (out-of-the-box performance, no tweaks attempted); for another, it took 3 days to build a huge 17 GB index.

Are all .class files in my Java application loaded into memory after application start?

I am making an app for Android. In my Activity I need to load an array of about 10,000 strings. Loading it from the database was slow, so I decided to put it directly into one .java file (as a private field). I have about 20 of these classes containing string arrays, and my question is: are all the classes loaded into memory after my application is started? If so, the Activity in which I need these strings would load quickly, but the application as a whole would have a slow start...
Is there another way to very quickly load a 10,000-string array from a file?
UPDATE:
Why do I need these strings? My Android app lets you find "journeys" in Prague's public transit - you choose a departure stop and an arrival stop and it finds your journey (have a look here). My app has a suggestions feature - you enter the letter "c" as your departure stop and a suggestions ListView appears with stops starting with "c". For these suggestions I need the strings. Fetching the suggestions from the database is slow (about 400 ms on a G1).
First, 400 ms to perform a simple database query is really slow - so slow that I'd suspect there is some problem in your database schema (e.g. indices) or your database connection configuration.
But if you are serious about not using a database, there are a couple of alternatives to what you are currently doing:
Arrange for the classes containing the arrays to be lazily loaded as required, using Class.forName(...). If you implement it right, it should be possible for the garbage collector to reclaim the classes after they have been loaded and the strings have been added to your primary data structure.
Turn the 10,000 strings into a flat file, put the file into your app's JAR, and use Class.getResourceAsStream(...) to open the file and read it into the in-memory array (a sketch of this follows the list).
As above, but using an indexed file and replacing the array with a data structure that lets you read the strings from the file lazily. (This will be a bit more complicated, but if you are worried about the memory consumed by the 10,000 strings, it will help address that.)
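A minimal sketch of the second option; the resource name stops.txt is just a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class StopNameLoader {
        /** Reads one stop name per line from a flat file bundled with the app. */
        public static List<String> loadStops() throws Exception {
            List<String> stops = new ArrayList<String>(10000);
            InputStream in = StopNameLoader.class.getResourceAsStream("/stops.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    stops.add(line);
                }
            }
            return stops;
        }
    }

On Android the same idea also works with an asset file opened via getAssets().open(...).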
A class is loaded only when it is first referenced.
Though you need an array of 10,000 strings, you may not need all of them at once. This is where the concept of paging comes in. This link indicates that paging is often done in Android. Initially keep only a small part of the array in memory and, as you need more, keep loading it into memory and unloading any previous data that is no longer wanted.
For example, in any table the user sees at most 50 records at a time before he has to scroll (assuming his screen is not the size of an IMAX movie theatre). When he scrolls, load the next chunk of data and unload anything that is now invisible to the user.
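A minimal sketch of that chunked loading with Android's SQLite API; the stops table and name column are assumptions about the app's schema:

    import java.util.ArrayList;
    import java.util.List;

    import android.database.Cursor;
    import android.database.sqlite.SQLiteDatabase;

    public class StopPager {
        private static final int PAGE_SIZE = 50;

        /** Loads only one page of stop names; earlier pages can be discarded. */
        public static List<String> loadPage(SQLiteDatabase db, int page) {
            List<String> names = new ArrayList<String>(PAGE_SIZE);
            String sql = "SELECT name FROM stops ORDER BY name LIMIT " + PAGE_SIZE
                       + " OFFSET " + (page * PAGE_SIZE);
            Cursor cursor = db.rawQuery(sql, null);
            try {
                while (cursor.moveToNext()) {
                    names.add(cursor.getString(0));
                }
            } finally {
                cursor.close();
            }
            return names;
        }
    }
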
When is a Type Loaded? This is a surprisingly tricky question to answer. This is due in large part to the significant flexibility afforded, by the JVM spec, to JVM implementations. Loading must be performed before linking and linking must be performed before initialization. The VM spec does stipulate the timing of initialization. It strictly requires that a type be initialized on its first active use (see Appendix A for a list of what constitutes an "active use"). This means that loading (and linking) of a type MUST be performed at or before that type's first active use.
From http://www.developer.com/java/other/article.php/2248831/Java-Class-Loading-The-Basics.htm
I don't think that you will be happy maintaining 10,000 strings hardcoded in Java files.
Rather, check whether you are using the right database for your problem and whether your indices are set correctly. A missing or wrong index can cause really poor performance.
Additionally, you should limit the number of results returned by the query, but make sure you don't fetch the entries one by one.
If nothing fits, you can still preload the strings from the database at startup.
You could preload, say, 10 entries for each character. When a character is keyed in, you can then preload the entries for that character followed by another, and so on.
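A minimal sketch of that per-character preload; StopDao is a hypothetical stand-in for however you actually query the database:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SuggestionCache {
        private final Map<String, List<String>> cache = new HashMap<String, List<String>>();

        /** Preload a handful of suggestions for every first letter at startup. */
        public void preloadFirstLetters(StopDao dao) {
            for (char c = 'a'; c <= 'z'; c++) {
                String prefix = String.valueOf(c);
                cache.put(prefix, dao.findStopsStartingWith(prefix, 10));
            }
        }

        /** Serve cached suggestions instantly; fall back to the database for longer prefixes. */
        public List<String> suggest(StopDao dao, String prefix) {
            List<String> cached = cache.get(prefix);
            if (cached != null) {
                return cached;
            }
            List<String> fromDb = dao.findStopsStartingWith(prefix, 10);
            cache.put(prefix, fromDb);
            return fromDb;
        }

        /** Hypothetical data-access interface; adapt to your own database code. */
        public interface StopDao {
            List<String> findStopsStartingWith(String prefix, int limit);
        }
    }
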

How to handle large lists of data

We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase the memory limits, we hesitate to do so since it requires a high allocation even though most of the time it's not necessary.
We are considering using a customized java.util.List implementation to spool to disk when we hit peak loads like this, but under lighter circumstances will remain in memory.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
The application is essentially a batch processor that loads data from multiple database tables and runs extensive business logic on it. All of the data in the list is required, since aggregate operations are part of that logic.
I just came across this post which offers a very good option: STXXL equivalent in Java
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
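A minimal sketch of that approach using Guava's AbstractIterator; Record and RecordReader are hypothetical stand-ins for your record type and data source:

    import java.util.Iterator;
    import com.google.common.collect.AbstractIterator;

    /** Hypothetical stand-ins for your record type and data source. */
    interface Record {}
    interface RecordReader {
        Record next();  // returns null when the source is exhausted
    }

    public class LazyRecords implements Iterable<Record> {
        private final RecordReader reader;  // e.g. a DB cursor or file reader

        public LazyRecords(RecordReader reader) {
            this.reader = reader;
        }

        @Override
        public Iterator<Record> iterator() {
            return new AbstractIterator<Record>() {
                @Override
                protected Record computeNext() {
                    Record next = reader.next();  // one record at a time, never the whole set
                    return next != null ? next : endOfData();
                }
            };
        }
    }

The batch processor then walks the Iterable with an ordinary for-each loop, so only the current record (plus whatever aggregates you maintain) is in memory at any time.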
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it is hard to tell from what you've described.
I'm an optimist and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3-5 days.
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
I would also question why you need to load all of the data in memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.
