Justification of the need for an in-memory database - java

My use case is as follows --
I have a database table with 1000+ entries. The table is currently updated infrequently, but I expect that to change in the future. Some of its columns contain strings of considerable length.
Now I am in the process of writing a UI application that will have some mouseover events that will display texts derived from the aforementioned database table.
For this use case, I have decided to write a backend 'server' that hosts an in-memory database containing all the data from the aforementioned table. On startup, the UI app will cache the data it needs from the in-memory database hosted by the backend server.
Does my use case justify using an in-memory database? If not, what alternatives should I consider?
EDIT 1 --
My use case also involves running multiple searches of varying complexity on the database very frequently.
Thanks
p1ng

Seems like an excellent use-case for an in-memory database. Writing it yourself, on the other hand, is probably not the way to go.
There are plenty of existing options for just about any imaginable scenario: http://en.wikipedia.org/wiki/In-memory_database
If you're doing complex searches on text data, Lucene is quite excellent. It has special in-memory storage backends, but really, it doesn't matter for such a tiny dataset - it will always be quickly cached anyway.
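The client-side cache described in the question can be sketched in plain Java. This is only a minimal illustration, assuming the table boils down to an integer id and one long text column; `TooltipCache` and its methods are hypothetical names, and for genuinely complex searches an existing library such as Lucene remains the better fit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical read-only cache: rows are loaded once and searched in memory.
class TooltipCache {
    // id -> long text column; LinkedHashMap keeps the load order stable.
    private final Map<Integer, String> textById = new LinkedHashMap<>();

    // In a real application the rows would come from a query at startup.
    void put(int id, String text) {
        textById.put(id, text);
    }

    // Direct lookup, e.g. for a mouseover on a known entry.
    String get(int id) {
        return textById.get(id);
    }

    // Case-insensitive substring search; a linear scan over ~1000 rows.
    List<Integer> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : textById.entrySet()) {
            if (e.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

The UI would fill the cache once at startup and answer mouseover events from `get`, so no database round trip happens per event.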

Related

Informix, MySQL and Oracle blob contains

We have an application that runs with any of IBM Informix, MySQL and Oracle, and we are using Java with Hibernate to connect to the database. We will store XML, CSV and other text-based files inside the database (clob column). The entities in Java are byte[] objects.
One feature request to the application is now to "grep" content inside the data. So I need to find all files with a specific content.
On regular char/varchar fields I can use like '%xyz%', but this is not working on byte[] / blobs.
The first approach was to load each entity, convert the byte[] into a String and use the contains method in Java. If the user enters filter parameters on other (non-clob) columns, I apply those filters before testing the clob, in order to reduce the number of blobs I have to scan.
That worked quite well for 100 files (clobs) and as long as the application and database are on the same server. But I think it will get really slow with 1,000,000 files in the database, especially when the database is not on the same network. So I think that is not a good idea.
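For reference, the first approach amounts to something like the following sketch. It assumes the content is UTF-8 text and that the candidate rows have already been loaded into a map; `BlobGrep` and `grep` are hypothetical names, and in the real application the byte[] values would come from the Hibernate entities.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class BlobGrep {
    // Returns the ids of all entries whose decoded content contains the term.
    // Assumes the stored bytes are UTF-8 text; other encodings would need
    // their own decoding step.
    static List<Long> grep(Map<Long, byte[]> contentById, String term) {
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<Long, byte[]> entry : contentById.entrySet()) {
            String text = new String(entry.getValue(), StandardCharsets.UTF_8);
            if (text.contains(term)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Every candidate row has to be transferred and decoded in full, which is why the cost grows with both the number of files and the network distance to the database.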
My next thought was to create a database procedure, but I am not sure whether this is possible on Informix, MySQL and Oracle.
The last, and least favored, option is to stop storing the content in a clob. Maybe I can use a different datatype for that?
Does anyone have a good idea how to implement this? I need a solution for all three DBMS. The application knows which kind of DBMS it is connected to, so it would be okay to have three different solutions (one for each DBMS).
I am completely open to changing what kind of datatype I use (BLOB, CLOB ...) — I can modify that as I want.
Note: the clobs will range from about 5 KiB to about 500 KiB, with a maximum of 1 MiB.
Look into Apache Lucene or another text-indexing library.
https://en.wikipedia.org/wiki/Lucene
http://en.wikipedia.org/wiki/Full_text_search
If you go with a DB-specific solution like Oracle Text, you will have to implement a custom solution for each database. I know from experience that Oracle Text takes significant time to learn and involves a lot of tweaking to get just right.
Also, if you use a DB solution you would receive different results in each DB even if the data sets were the same (each DB would have its own methods of indexing and retrieving the data).
By going with a third-party solution like Lucene, you only have to learn one solution, and results will be consistent regardless of the DB.

H2 performance recommendations

I'm currently working with a somewhat larger database, and though I have no specific issues, I would like some recommendations, if anyone has any.
The database is 2.2 gigabyte (after recreation/compacting). It contains about 50 tables. One of those tables contains a blob plus some metadata. It currently has about 22000 rows. If I remove the blobs from the table (UPDATE table SET blob = null), the database size is reduced to about 200 megabyte (after recreation/compacting). The metadata is accessed a lot, the blobs however are not that often needed.
The database URL I currently use is:
jdbc:h2:D:/data;AUTO_SERVER=true;MVCC=true;CACHE_SIZE=524288
It runs in our Java VM which has 4GB max heap.
Some things I was wondering:
Would running H2 in a separate process have any impact on performance (for better or for worse)?
Would it help to have the blobs in a separate table with a 1-1 relation to the metadata? I could imagine it would help with the caching, not having the blobs in the way?
The internet seems divided on whether to include blobs in a database or write them to files on a filesystem with a link in the DB. Any H2-specific advice here?
The answer for you depends on the growth rate of your blob data. If, for example, your data set is going to grow at 10% per week, then there is little point in trying to extend the use of H2 to store blob data (as it will quickly outpace the available heap memory). If instead the blob data is the biggest it will ever be, then attempting to use H2 might make sense.
To answer your questions about H2:
1) Running H2 in a separate process will allow H2 to claim the majority of that process's heap space, which makes the heap available to H2 much easier to control. However, you'll also be adding the overhead of having a separate process to maintain and monitor. So the answer is "it depends on your operating environment and goals". If you have the people and time, running H2 in a separate process might make sense. But if that's true, then you should probably consider just running an appropriate blob storage platform instead.
2) Yes, you're correct that storing the blobs in a separate table would help with caching - in the case that you don't often need the blobs. It should also help with retrieval times, as H2 won't have to read past the blobs to find the metadata.
3) Note that "the internet" represents many thousands of people with almost as many different specific use cases. You'll need to filter down your use case into requirements, and then apply the logic you glean from others.
4) My personal advice is, if you're trying to make a scalable and maintainable platform - use the right tools. H2, or any other relational database, is most often not the right tool for storing many large blobs. I'd recommend that you investigate using a key/value store.

Bulk Update in Oracle from CSV file

I have a table and a CSV file, and I want to update the table from the CSV file.
The CSV file looks as follows (no delta):
1,yes
2,no
3,yes
4,yes
Steps in Java:
Read the CSV file and build two lists, yesContainList and noContainList.
Add the id values that have yes and no to the respective list.
Turn each list into a comma-separated string.
Update the table using the comma-separated string.
It works fine, but it becomes rather slow when I have to handle lakhs (hundreds of thousands) of records.
Could anyone tell me whether this is the correct way, or whether there is a better way to do this update?
There are 2 basic techniques to do this:
sqlldr
Use an external table.
Both methods are explained here:
Update a column in table using SQL*Loader?
Jobs such as bulk operations, imports, exports or heavy SQL operations should not be run outside the RDBMS, because of performance issues.
By fetching and sending large tables through ODBC-like APIs you will suffer network round trips, memory usage, I/O hits ....
When designing a client-server application (like J2EE), do you design a heavy batch operation to be called and controlled synchronously from the user interface layer, or do you design a server-side process triggered by a client command?
Think of your Java code as the UI layer and the RDBMS as the server side.
BTW, RDBMSs have built-in features for these operations, like SQL*Loader in Oracle.
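If the update nevertheless has to stay in Java, JDBC batching is a common way to cut down the number of round trips compared with one statement per row. The sketch below is an alternative to the SQL*Loader and external-table routes recommended above, not a replacement for them. The table and column names (`my_table`, `flag`, `id`) and the class name are placeholders, and the input is assumed to be well-formed `id,value` lines.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CsvBulkUpdate {
    // Parses lines such as "1,yes" into id -> flag; blank lines are skipped.
    static Map<Long, String> parse(List<String> lines) {
        Map<Long, String> flagById = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split(",", 2);
            flagById.put(Long.parseLong(parts[0].trim()), parts[1].trim());
        }
        return flagById;
    }

    // Sends the updates in batches of batchSize statements.
    // "my_table", "flag" and "id" are placeholder names.
    static void update(Connection con, Map<Long, String> flagById, int batchSize)
            throws SQLException {
        String sql = "UPDATE my_table SET flag = ? WHERE id = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int pending = 0;
            for (Map.Entry<Long, String> e : flagById.entrySet()) {
                ps.setString(1, e.getValue());
                ps.setLong(2, e.getKey());
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // flush the final partial batch
            }
        }
    }
}
```

Transaction handling is left to the caller here; with auto-commit disabled, a single commit after the last batch would be the usual choice.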

Options when storing all data in memory doesn't scale

I've written a Java application that users install on their desktop. It crawls websites, storing the data about each page in a LinkedList. The application allows users to view all the pages crawled in a JTable.
This works great for small sites, but doesn't scale very well. Currently users have to allocate more memory (which translates to a -Xmx when starting Java) for larger crawls.
My current thinking is to move to storing all the data in a database, possibly using something like HSQLDB.
Are there any other approaches I should be considering?
A relational DB is not a good place to store web page data. Could you save the pages on disk? If you want to search the crawl results, try the Apache Lucene search engine. Loading all the results into memory at once is not reasonable. You can paginate the JTable model and use soft references to cache some results while paginating.
A relational database is probably the right approach for this case. Reasons:
It'll enable you to handle larger-than-memory crawls.
If you keep the link data in separate tables from the considerably larger volumes of page data, you may still be able to fit all your links in memory, which will be pretty important from a performance and searching perspective
It will give you an easy way of persisting crawled data (in case this is needed in the future)
It's pretty well known / standard technology
There are good open source database implementations available (H2 or JavaDB would probably be my first choices as they are embeddable and written in pure Java)
The relational features could turn out to be useful, for example queries on link data
It doesn't sound like you have the data volumes or availability requirements that might push you towards a NoSQL-type solution
You have basically 4 options:
Store the data in flat files
Store the data in a database
Somehow transmit the data to "the cloud" (I have no idea how)
Somehow "pare" the data down to the essentials, knowing that you can re-extract the full info when needed
You can also do a variant of 4 to gain some space -- rather than a "rich" object structure, compress each distinct datum into a single String or byte[] or some such that you keep in an array or arraylist vs a linked list. This can reduce your storage requirements by 2X or more. Less "object oriented", but sometimes reality intervenes.
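The compact-storage variant could look roughly like this. It is a sketch under assumptions: the per-page fields (URL, title, status code) and the name `CompactPageStore` are invented for illustration, and fields are assumed not to contain tab characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CompactPageStore {
    // One UTF-8 byte[] per page instead of a rich object graph.
    private final List<byte[]> rows = new ArrayList<>();

    // Fields are joined with a tab; assumes no field contains a tab itself.
    void add(String url, String title, int statusCode) {
        String line = url + "\t" + title + "\t" + statusCode;
        rows.add(line.getBytes(StandardCharsets.UTF_8));
    }

    int size() {
        return rows.size();
    }

    // Decodes a single row on demand, e.g. when a table model asks for cells.
    String[] get(int index) {
        return new String(rows.get(index), StandardCharsets.UTF_8).split("\t", -1);
    }
}
```

A JTable model could call `size()` for its row count and `get(row)[column]` for cell values, so decoded strings only exist for the rows currently being displayed.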
Try storing the page data in db4o http://community.versant.com , an object database. Object databases handle complex objects (e.g. with lots of siblings) better than relational databases.

Is it a good idea to process a large amount of data directly in the database?

I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
Is doing some of the processing in the database, rather than in the application, a good idea?
When is this recommended and when not?
Are there pros and cons?
Is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and dirty. My concern was that I can't do in the database what I can do in Java, but I don't know if this is true.
Just an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to find out whether a word between two tokens classified as `Proper Name` is part of the name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
My concern was that I can't do in the database what I can do in
Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if the work involves calling a lot of disparate SQL statements that can be combined in a stored procedure, then you should do the processing in the stored procedure and call it from your Java application. This way you avoid making several network trips to the database server.
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery, which a lot of modern databases support.
Just an example: I have a table called Token. At the moment it has
180,000 rows, but this will increase to over 10 million rows. I need
to do some processing to find out whether a word between two tokens
classified as `Proper Name` is part of the name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then going through them is not a good idea. If certain parameters of the data can be put in the WHERE clause of a SQL statement to limit the amount of data being fetched, that is the way to go in my opinion. You will of course need to run explains on your SQL, check that the correct indices are in place, and check the index cluster ratio and the type of index; all of that will make a difference. If you can't fully eliminate all "improper names", then you should try to get rid of as many as you can with SQL and process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want to create a batch application that stages the data before the web application queries it.
I hope my explanation makes sense. Please let me know if you have questions.
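Whatever rows remain after filtering in SQL can be processed as a stream instead of being loaded into one big list. The sketch below assumes the tokens arrive in document order through an `Iterator` (for example one wrapping a JDBC result set) and that each row carries a word and a proper-name flag; the `Token` class and the rule "one non-name word directly between two proper names" are simplifying assumptions, not the asker's actual schema or logic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ProperNameScan {
    // Hypothetical shape of one row of the Token table.
    static final class Token {
        final String word;
        final boolean properName;

        Token(String word, boolean properName) {
            this.word = word;
            this.properName = properName;
        }
    }

    // Walks the tokens in order, keeping only the previous two in memory,
    // and collects every non-name word directly between two proper names.
    static List<String> candidates(Iterator<Token> tokens) {
        List<String> found = new ArrayList<>();
        Token first = null;
        Token middle = null;
        while (tokens.hasNext()) {
            Token last = tokens.next();
            if (first != null && first.properName
                    && !middle.properName && last.properName) {
                found.add(middle.word);
            }
            first = middle;
            middle = last;
        }
        return found;
    }
}
```

Only the sliding window and the matches are held in memory, so the size of the Token table affects the running time but not the heap needed for the scan itself.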
Going to the DB directly for every single thing is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or tools such as Hibernate, which can cache data in memory so that you don't need to query the DB for every operation. Indexing tools such as Lucene are also very popular and could solve your problem of hitting the DB every time.
