Using an in-memory DB for data analysis operations - Java

We are working on a solution that crunches log files generated by systems and runs various analysis operations on these logs to produce different views that help triage issues, for example, building a sequence of error messages that repeat across the logs.
Currently we load the log data into Java collections and perform all operations by iterating and searching through these collections, which is hurting performance. We are instead considering loading the data into a database and firing queries at it to get optimized search results. For that we are considering an in-memory DB, which should perform better than a persistent store since disk reads/writes are minimized.
The amount of data to be analyzed at a time may reach a few GB (2-4 GB) and hence may exceed the RAM available on the machine.
Question:
What options can be considered for such an in-memory DB? Is GridGain a good option?
Most of our solutions will be deployed on a single node, so distributed capabilities are not a priority. What other in-memory DBs can be recommended for this purpose?

You could try a column-store in-memory database. Column stores usually achieve a better compression ratio than row-store databases and are designed for analytical workloads. Examples are MonetDB (open source), Vertica, InfiniDB, and so on.
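If you want to prototype the load-then-query approach before picking a column store, an embedded in-memory database such as H2 (which also comes up below) is a zero-setup way to measure the gain over scanning collections. A minimal sketch, assuming a hypothetical log_entry table; H2 is a row store, so treat it only as a baseline:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class InMemoryLogAnalysis {
        public static void main(String[] args) throws Exception {
            // Purely in-memory database: contents vanish when the JVM exits.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:logs");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE log_entry(ts TIMESTAMP, level VARCHAR(10), message VARCHAR(4000))");
                st.execute("CREATE INDEX idx_level ON log_entry(level)");
                // ... bulk-insert the parsed log lines here, ideally in JDBC batches ...

                // Example analysis: error messages that repeat across the logs.
                ResultSet rs = st.executeQuery(
                    "SELECT message, COUNT(*) AS occurrences FROM log_entry " +
                    "WHERE level = 'ERROR' GROUP BY message " +
                    "HAVING COUNT(*) > 1 ORDER BY occurrences DESC");
                while (rs.next()) {
                    System.out.printf("%dx %s%n", rs.getLong("occurrences"), rs.getString("message"));
                }
            }
        }
    }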

Related

Java: Is it a bad practice to store file streams in a database?

I'm reading file streams for a certain group of files and storing them in a database as the bytea type. But when I try to read the streams back from the database and write them to a file, it takes very long and I eventually get an out-of-memory exception. Is there an alternative that works more efficiently, with or without a database involved?
Databases were designed with a key problem in mind: given a bunch of data, where we don't know the kinds of reports that will be generated, how can we store the data in a manner that preserves its inner relationships and permits any reporting format we can think of?
Files lack a few key characteristics of databases. A file has only a single structure, "characters in order", and no means of integrated report building; reporting is confined to simple searches, whose results have little context unless shown within the rest of the file.
In short, if you aren't using the database's features, please don't use the database.
Many people do store files in databases, because they have one handy and, instead of writing support for filesystem storage, they cut and paste the database storage code. Let's explore the consequences:
- Backups and restores become problematic, because the database grows very quickly and the bandwidth needed to back up and restore is a function of the database's size.
- Replication rebuilds in fail-safe databases take longer (I've seen some run so long that the replica couldn't catch up with the rate of change in the primary database).
- Queries that (accidentally) reference the files in bulk spike the CPU, possibly starving access to the rest of the system (depends on the database).
- The bandwidth consumed returning the results of those queries steals system resources, preventing other queries from returning their results (better on some databases, worse on others).
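That said, if the files must stay in the database, the out-of-memory error in the question typically comes from materializing each bytea value as one huge byte[]. A minimal sketch that copies the column to disk in fixed-size chunks instead; the table and column names are hypothetical:

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class StreamFileFromDb {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/mydb", "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT content FROM stored_file WHERE id = ?")) {
                ps.setLong(1, 42L);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        // Copy in fixed-size chunks instead of building one huge byte[].
                        try (InputStream in = rs.getBinaryStream("content");
                             OutputStream out = new FileOutputStream("restored.bin")) {
                            byte[] buf = new byte[8192];
                            for (int n; (n = in.read(buf)) != -1; ) {
                                out.write(buf, 0, n);
                            }
                        }
                    }
                }
            }
        }
    }

Note that the stock PostgreSQL JDBC driver may still buffer the whole bytea value while fetching the row; for truly large files, PostgreSQL's Large Object API is the fully streaming alternative.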

Is H2 a persistent alternative to Java collections with a disk backend?

I'm still looking for Java collections that are persistent yet have comparable access times. The real data should stay on disk, but for faster access I need a cache in RAM, so content can be streamed from the file into main memory.
I read that H2 has such a cache function. Is there an option to cache the whole file on startup?
And can somebody say something about the performance?
Currently I have more than 100,000 items in a Java HashMap (the value is a custom class that contains a byte array).
Thank you!
Partially. The H2 MVStore can be used as a persistent java.util.Map, but not as a list, stack, and so on. The H2 database itself is a relational database with SQL and JDBC APIs, and the latest version uses the MVStore as the default storage engine.
Other projects, such as MapDB, support features similar to the MVStore.
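For illustration, a minimal sketch of the MVStore map API; the file name and cache size are illustrative, and sizing the read cache near your data set is what keeps hot entries in RAM:

    import org.h2.mvstore.MVMap;
    import org.h2.mvstore.MVStore;

    public class MvStoreMapExample {
        public static void main(String[] args) {
            // Single-file store on disk; recently used pages are kept in a RAM cache.
            MVStore store = new MVStore.Builder()
                    .fileName("items.mv")
                    .cacheSize(256)   // read cache in MB -- size it toward your data set
                    .open();
            try {
                MVMap<Integer, byte[]> map = store.openMap("items");
                map.put(1, new byte[] {1, 2, 3});   // persisted transparently
                byte[] value = map.get(1);          // served from the cache when hot
                System.out.println(value.length);
            } finally {
                store.close();
            }
        }
    }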

Cloud Storage vs. Datastore latency

I have a simple data file I want to store. I don't need any indexes or queries performed on it, so I could put it in Cloud Storage. BUT the latency of fetching the file is very important. What latency can I expect when fetching a file from Cloud Storage versus fetching an entity from the Datastore?
I could not find a good reference on this...
You shouldn't expect a specific latency, as it will vary depending on a large number of things. If the file is that important, just package it with the program's files when you distribute it, if that's possible.
If the file fits within the Datastore entity size limit (1 MB), then storing it there makes sense.
I have seen lower latency on Datastore retrieval than on GCS (again, it depends heavily on the size of the object).
Another advantage of using Datastore is the NDB Python interface, which transparently caches the entity in memcache.
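Given that the numbers vary, the only reliable figures are the ones you measure in your own environment. A rough probe using the google-cloud Java client libraries; the bucket, object, and kind names are hypothetical, and a single timing like this ignores warm-up and caching effects:

    import com.google.cloud.datastore.Datastore;
    import com.google.cloud.datastore.DatastoreOptions;
    import com.google.cloud.datastore.Entity;
    import com.google.cloud.datastore.Key;
    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class FetchLatencyProbe {
        public static void main(String[] args) {
            Storage storage = StorageOptions.getDefaultInstance().getService();
            Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

            long t0 = System.nanoTime();
            byte[] fromGcs = storage.readAllBytes(BlobId.of("my-bucket", "data-file"));
            long gcsMs = (System.nanoTime() - t0) / 1_000_000;

            Key key = datastore.newKeyFactory().setKind("DataFile").newKey("data-file");
            long t1 = System.nanoTime();
            Entity fromDatastore = datastore.get(key);
            long dsMs = (System.nanoTime() - t1) / 1_000_000;

            System.out.printf("GCS: %d ms (%d bytes), Datastore: %d ms (%s)%n",
                    gcsMs, fromGcs.length, dsMs, fromDatastore == null ? "miss" : "hit");
        }
    }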

PostgreSQL, Hibernate : Moving contents of db to text file/XML file for storage purposes

I am working on a Spring-MVC application in which the database is growing big. The space is mostly consumed by chat message history, along with other things like old notifications, which are not that useful.
Because of this, we thought of moving that old data to text/XML files to give the DB some room to breathe and thereby improve query performance. Indexes are not that useful because there are too many insertions.
I wanted to know whether PostgreSQL or Hibernate has support for such a task, where data is picked out of the DB and saved to plain files that can be accessed later, resulting in at least some performance gains.
I have only started looking into this, so I don't have much in hand to show. Kindly let me know if you have any questions.
Thanks a lot.
I would use PostgreSQL JSON storage and have two databases:
- the current operations DB, the one you are moving data away from to slim it down
- the archive database, where old data is aggregated to save storage
You can then move data from the current database into the archive database without compromising ACID properties, and aggregate the old data to simplify retrieval by grouping related entities under some common root entity, which you then use to access your old data.
The current operations database remains small, while the archive database can be shared, so it's easier to configure the current one for high performance and the archive one for scalability.
Hibernate doesn't support this out of the box, but you can implement it using custom Hibernate types and JTA transactions.
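As a sketch of the move step, here is one way to do it with plain JDBC against a single PostgreSQL instance; the table and column names are hypothetical, and with two physically separate databases you would read from one connection and write to the other, which is where the JTA transactions come in:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ChatArchiver {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/appdb", "user", "password")) {
                conn.setAutoCommit(false);  // move and delete atomically
                try (Statement st = conn.createStatement()) {
                    // Collapse each old message row into a single JSONB document.
                    st.executeUpdate(
                        "INSERT INTO chat_message_archive (archived_at, payload) " +
                        "SELECT now(), to_jsonb(m) FROM chat_message m " +
                        "WHERE m.created_on < now() - interval '6 months'");
                    // now() is fixed per transaction, so this deletes exactly the archived rows.
                    st.executeUpdate(
                        "DELETE FROM chat_message " +
                        "WHERE created_on < now() - interval '6 months'");
                    conn.commit();
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }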

Storing JSON objects: SQLite vs serialization to disk

I will be building an app that pulls down JSON objects from a web service, in the low hundreds, each relatively small, say 20 KB.
The app won't be doing much more than displaying these POJOs, downloading new and updated ones when available, and deleting out-of-date ones. What would be the preferred method for persistent storage of these objects? I guess the two main contenders are storing them in an SQLite DB, maybe using ORMLite to cut down on the overhead, or just serializing the objects to disk, probably in one large file, with a very fast JSON parser.
Any ideas on which would be preferred?
You could consider using CouchDB as a cache between the mobile client and your web service.
CouchDB would run as a service on the internet, caching the objects from the web service. On the client you can use TouchDB-Android: https://github.com/couchbaselabs/TouchDB-iOS/wiki/Why-TouchDB%3F . TouchDB-Android can synchronize automatically with the CouchDB instance running on the internet, and the application itself would then access TouchDB only. TouchDB automatically detects whether or not there is an internet connection, so your application keeps running even without internet.
Advantages:
- Caching of JSON calls
- The client keeps working when the internet connection is down and synchronizes automatically when it is up again.
- Takes load off your web service, and you can scale.
We used this setup before to let Android software work seamlessly even when the internet connection dropped frequently and the service we pulled data from was quite slow and had limited capacity.
A DBMS such as SQLite comes with querying, indexing, and sorting capabilities (and other standard SQL DBMS features); consider whether you need any of these. How many objects are you planning to have in the production environment? If, say, a million, the disk-serialization approach might not scale.
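If you go the SQLite/ORMLite route, a minimal sketch of what the storage layer could look like; the entity and its fields are hypothetical, and it uses ORMLite's JDBC connection source so it runs standalone (on Android you would use ORMLite's Android bindings instead):

    import com.j256.ormlite.dao.Dao;
    import com.j256.ormlite.dao.DaoManager;
    import com.j256.ormlite.field.DatabaseField;
    import com.j256.ormlite.jdbc.JdbcConnectionSource;
    import com.j256.ormlite.support.ConnectionSource;
    import com.j256.ormlite.table.DatabaseTable;
    import com.j256.ormlite.table.TableUtils;

    public class JsonDocStore {

        @DatabaseTable(tableName = "json_doc")
        public static class JsonDoc {
            @DatabaseField(id = true)
            String id;
            @DatabaseField
            long updatedAt;
            @DatabaseField
            String rawJson;        // the ~20 KB payload, stored verbatim
            JsonDoc() { }          // ORMLite requires a no-arg constructor
        }

        public static void main(String[] args) throws Exception {
            ConnectionSource source = new JdbcConnectionSource("jdbc:sqlite:docs.db");
            try {
                TableUtils.createTableIfNotExists(source, JsonDoc.class);
                Dao<JsonDoc, String> dao = DaoManager.createDao(source, JsonDoc.class);

                JsonDoc doc = new JsonDoc();
                doc.id = "item-1";
                doc.updatedAt = System.currentTimeMillis();
                doc.rawJson = "{\"name\":\"example\"}";
                dao.createOrUpdate(doc);        // insert new, replace updated
                dao.deleteById("stale-item");   // drop out-of-date objects
            } finally {
                source.close();
            }
        }
    }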
