Maintaining a billion of key:value pairs in a file

In Java, how can I store around a billion key-value pairs in a file, with the ability to update and query the values dynamically whenever necessary?

If for some reason a database is out of the question, then you need to answer the following question about your problem:
What is the mix of the following operations?
Insert
Read
Modify
Delete
Search
Once you have a good guess at the ratio of these operations, try selecting the appropriate data structure for use in your file. I'd recommend starting with this book as a good catalog of options:
http://www.amazon.com/Introduction-Algorithms-Second-Thomas-Cormen/dp/0262032937
You'll want to select a data structure with the best average and worst case runtimes for your most common operations.
Good Luck

Old question, but this is a case for log files. You do not want to copy a billion records every time you do a delete. Instead, log all "transactions" (updates) to a new, separate file, and break these log files up into reasonable sizes.
To read a tuple, start at the newest log file and search backward until you find your key, then stop. To update or insert, just append a new record to the most recent log file. A delete is still a log entry.
A batch coalesce process needs to run periodically; it scans the log files from newest to oldest and writes out a new master. As the logs are read, each key seen for the first time gets written to the new master, and duplicate (older) records for that key are skipped, until you make it all the way through. If you encounter a delete record, add the key to a separate delete list, skip the record, and ignore subsequent records with that key.
That made it sound simple, but remember that you may want to block/chunk your files, since you will likely scan the log files in reverse, or at least seek() to the maximum size and write in reverse instead of reading in reverse.
I have done this exact thing with billions of lines of data. You're just re-inventing sequential access databases.
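The scheme above can be sketched roughly as follows. This is a minimal illustration, not the answerer's actual implementation: the class name LogStore, the tab-separated "P/D" record format, and the segment naming are all assumptions, keys and values are assumed to contain no tabs or newlines, and compaction keeps the set of seen keys in memory, which would itself need to be chunked or spilled to disk for a billion keys.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical append-only key-value log split into segment files. */
class LogStore {
    private final Path dir;
    private final List<Path> segments = new ArrayList<>(); // oldest first
    private final int maxRecordsPerSegment;
    private int recordsInCurrent = 0;
    private int nextId = 0;

    LogStore(Path dir, int maxRecordsPerSegment) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.maxRecordsPerSegment = maxRecordsPerSegment;
        roll();
    }

    /** Starts a new, empty segment file; later writes go there. */
    private void roll() throws IOException {
        Path p = dir.resolve(String.format("%06d.log", nextId++));
        Files.createFile(p);
        segments.add(p);
        recordsInCurrent = 0;
    }

    private void append(String line) throws IOException {
        if (recordsInCurrent >= maxRecordsPerSegment) roll();
        Files.write(segments.get(segments.size() - 1),
                (line + "\n").getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
        recordsInCurrent++;
    }

    /** Inserts and updates are both plain appends. */
    void put(String key, String value) throws IOException { append("P\t" + key + "\t" + value); }

    /** A delete is just another log entry (a tombstone). */
    void delete(String key) throws IOException { append("D\t" + key); }

    /** Scans newest segment first, newest record first; returns null if absent or deleted. */
    String get(String key) throws IOException {
        for (int i = segments.size() - 1; i >= 0; i--) {
            List<String> lines = Files.readAllLines(segments.get(i), StandardCharsets.UTF_8);
            for (int j = lines.size() - 1; j >= 0; j--) {
                String[] f = lines.get(j).split("\t", 3);
                if (f.length >= 2 && f[1].equals(key)) return f[0].equals("P") ? f[2] : null;
            }
        }
        return null;
    }

    /** Coalesces all segments into one master: newest record per key wins, deleted keys are dropped. */
    void compact() throws IOException {
        Set<String> seen = new HashSet<>();
        List<String> live = new ArrayList<>();
        for (int i = segments.size() - 1; i >= 0; i--) {
            List<String> lines = Files.readAllLines(segments.get(i), StandardCharsets.UTF_8);
            for (int j = lines.size() - 1; j >= 0; j--) {
                String[] f = lines.get(j).split("\t", 3);
                if (f.length >= 2 && seen.add(f[1]) && f[0].equals("P")) live.add(lines.get(j));
            }
        }
        Path master = dir.resolve(String.format("%06d.log", nextId++));
        Files.write(master, live, StandardCharsets.UTF_8); // write the new master before deleting old logs
        for (Path old : segments) Files.delete(old);
        segments.clear();
        segments.add(master);
        recordsInCurrent = live.size();
    }
}
```

Reading whole segments with readAllLines keeps the sketch short; a real version would read each segment in fixed-size blocks from the end, as the answer suggests.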

You leave out a lot of details, but...
Are the keys static? What about the values? Are they fixed size? Why not use a database?
If you don't want to use a database then use a memory mapped file.
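As a rough illustration of the memory-mapped approach, the sketch below assumes fixed-size values (one long per slot) and keys that map directly to slot numbers; the class name MappedLongTable is hypothetical. Note that a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes (about 2 GB), so a billion 8-byte values would need several mappings over different regions of the file.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical fixed-width table: slot i holds one long at byte offset i * 8. */
class MappedLongTable implements AutoCloseable {
    private static final int SLOT_BYTES = Long.BYTES;
    private final FileChannel channel;
    private final MappedByteBuffer buf;

    MappedLongTable(Path file, int slots) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        // Mapping in READ_WRITE mode grows the file to the mapped size if it is smaller.
        buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) slots * SLOT_BYTES);
    }

    long get(int slot) { return buf.getLong(slot * SLOT_BYTES); }

    void put(int slot, long value) { buf.putLong(slot * SLOT_BYTES, value); }

    /** Forces pending changes to disk before closing the channel. */
    @Override
    public void close() throws IOException {
        buf.force();
        channel.close();
    }
}
```

Variable-size keys or values would need an extra index on top of this, which is where the questions about static keys and fixed sizes come in.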

Can you use a database? Managing such a large file would be a pain.
Edit: if the file requirement is mostly meant to avoid machine communication failures, downtime and similar situations, maybe you could use an embedded database. This way you would be freed from the large-file manipulation problems and still get all the advantages a database can give you. I have used Apache Derby as an embedded database with wonderful results. Java DB is Oracle supported and based on Derby.

Related

Parse large number of xml files in java

I'll be getting a large number of xml files (numbering in tens of thousands every few minutes) from an MQ. The xml files aren't very big. I have to extract the information and save it into a database. I cannot use third party libraries unfortunately (except the apache commons). What strategies/techniques are normally used in this scenario? Is there any xml parser in java or apache which can handle such situations well?
I might also add that I'm using jdk 1.4

Based on the comments and discussion around this topic - I would like to propose a consolidated solution.
Parse the XML files using SAX. As @markspace mentioned, you should go with SAX, which is built in and has good performance.
Use bulk inserts if possible. Since you plan to insert a large amount of data, consider what type of data you are reading and storing in the database. Do all the XML files share the same schema (which means they correspond to a single table in the database), or do they represent different objects (which means you would end up inserting data into multiple tables)?
If all the XML files share a schema and go into the same table, consider batching these data objects and bulk-inserting them into the database. This will definitely perform better in terms of both time and resources (you open a single connection to persist a batch, as opposed to one connection per object). Of course, you will need to spend some time tuning your batch size and deciding on an error-handling strategy for batch inserts (discard the whole batch vs. discard only the erroneous rows).
If the schemas of the XML files differ, consider grouping similar XMLs together so that you can bulk-insert each group later.
Finally, and this is important: ensure that you release all resources, such as file handles and database connections, once you are done processing or when you encounter errors. In simple words, use try-catch-finally in the right places.
While by no means complete, I hope this answer gives you a set of critical checkpoints to consider while writing scalable, performant code.
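A minimal sketch of the SAX plus batch-insert combination might look like the following. The XML shape (an order element with an id attribute and an amount child), the orders table, and the class names are all hypothetical, chosen only for illustration. The code uses generics and try-with-resources for brevity; on JDK 1.4 you would use raw collections and an explicit try/finally instead, but the SAX and JDBC batch calls themselves (addBatch, executeBatch) are available there.

```java
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class OrderLoader {
    /** Collects (id, amount) pairs from hypothetical <order id="..."><amount>...</amount></order> documents. */
    static class OrderHandler extends DefaultHandler {
        final List<String[]> rows = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String id;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0); // start collecting text for the new element
            if ("order".equals(qName)) id = attrs.getValue("id");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("amount".equals(qName)) rows.add(new String[] { id, text.toString().trim() });
        }
    }

    /** Streams one XML document through SAX and returns the extracted rows. */
    static List<String[]> parse(InputStream xml) throws Exception {
        OrderHandler handler = new OrderHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return handler.rows;
    }

    /** Sends all rows to the database in one JDBC batch; the statement is always closed. */
    static void insertBatch(Connection conn, List<String[]> rows) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO orders (id, amount) VALUES (?, ?)")) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```

Rows parsed from many small files can be accumulated and passed to insertBatch once the chosen batch size is reached.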

Result Set to Multi Hash Map

I have a situation here. I have a huge database with more than 10 columns and millions of rows. I am using a matching algorithm which matches each input record against the values in the database.
The database operation takes a lot of time when there are millions of records to match. I am thinking of using a multi-hash map or some other ResultSet alternative so that I can hold the whole table in memory and avoid hitting the database again.
Can anybody tell me what I should do?

I don't think this is the right way to go. You are trying to do the database's work manually in Java. I'm not saying that you are not capable of doing this, but most databases have been developed over many years and are quite good at doing exactly the thing that you want.
However, databases need to be configured correctly for a given type of query to execute fast. So my suggestion is that you first check whether you can tweak the database configuration to improve the performance of the query. The most common fix is to add the right indexes to your table. Read How MySQL Uses Indexes or the corresponding part of the manual of your particular database for more information.
The other thing is, if you have that much data, storing everything in main memory is probably not faster and might even be infeasible. Not to mention that you would have to transfer all the data first.
In any case, try to use a profiler to identify the bottleneck of the program first. Maybe the problem is not even on the database side.

Why is file system storage faster than SQL databases

Extending this thread - I would just like to know why it's faster to retrieve files from a file system, rather than a MySQL database. If one were to benchmark the two to see which would retrieve the most data (multiple types of data) over 10 minutes - which one would win?
If a file system is truly faster, then why not just store everything in a file system and replace a database with csv or xml?
EDIT 1:
I found a good resource for alternate storage options for java
EDIT 2:
I'm looking for a Java API/Jar that has the functionality of a SQL Database Server Engine (or at least some of it) that uses XML for data storage (preferably). If you know of something, please leave a comment below.

At the end of the day the database does just store the data in the file system. It's all the useful stuff on top of just the raw data that makes you decide to use a database.
If you can replicate the functionality, scalability, robustness, integrity, etc, etc of a database system using CSV and still make it perform faster than a relational database then yes I'd suggest doing it your way.
It'd take you a few years to get there though.
Of course, relational systems are not the only way to store data. There are object-oriented database systems (db4o, InterSystems Cache) and document-based systems (RavenDB).
Performance is also relative to the style and volume of data you are working with and what you intend to do with it - I'm not going to even try and discuss that, it's too open ended.
I will also not start the follow on discussion: if memory is truly faster than the file system, why not just store everything in memory? :-)
This also seems similar to another question I answered a long while ago:
Is C# really slower than say C++?
Basically stuff isn't always done just for performance.

MySQL uses the file system the same as everything else on a computer. To retrieve a single piece of data, or a table of data, there is no faster way than directly from the file system. MySQL would just add a small bit of overhead to that file system pull.
If you need to do some intelligent selecting, match some rows, or filter that data, MySQL is going to do that faster than most other options. The database server provides you calculation and data manipulation power that a filesystem can't.

When you have mixed/structured data, a DBMS is the only solution. For example, try to get the name, surname and country of all the customers stored in your DB, but only those born in 1981 and living in Rome. If you have this data in files on the filesystem, how do you easily get only the required data without scanning all your files, and how do you join the returned data?
A DBMS gives you much more than that.
Many DBMSs store their data in files.
This abstraction layer lets you retrieve data in an easy, standard and structured way.

The difference is in how the desired data is located.
In a file system, locating the desired data means searching through all existing data until you find it.
Databases provide indexing which results in locating the desired data almost immediately (within ~12 comparisons) regardless of the amount of data.
What we want is an indexed file system - lucky for us, we have them. They are called databases.

JAVA : file exists Vs searching large xml db

I'm quite new to Java programming and am writing my first desktop app. The app takes a unique ISBN and first checks whether it is already held in the local DB. If it is, the app just reads from the local DB; if not, it requests the data from isbndb.com and enters it into the local DB, which is in XML format. What I'm wondering is which of the following two methods would create the least overhead when checking whether an entry already exists.
Method 1.) File Exists.
On creating a DB entry, the app would create a separate file for every ISBN, named after the ISBN (e.g. 3846504937540.xml), and would use the file-exists method to check whether an entry already exists for the user-provided ISBN.
Method 2.) SAX XML Parser.
All entries would be entered into a single large XML file and when checking for existing entries the SAX XML Parser would be used to parse the file and then the user provided isbn would be checked against those in the XML DB for a match.
Note :
The resulting entries could number in the thousands over time.
Any information would be greatly appreciated.

I don't think either of your methods is all that great. I strongly suggest using a DBMS to store the data. If you don't have a DBMS on the system, or if you want an app that can run on systems without an installed DBMS, take a look at using SQLite. You can use it from Java with SQLiteJDBC by David Crawshaw.
As far as your two methods are concerned, the first will generate a huge amount of file clutter, not to mention maintenance and consistency headaches. The second method will be slow once you have a sizable number of entries, because you basically have to read (on average) half the database for every query. With a DBMS, you can avoid this by defining indexes for the info you need to look up quickly. The DBMS will automatically maintain the indexes.

I don't much like the idea of relying on the file system for that task. I don't know how critical your application is, but many things can happen to these XML files :) Plus, if the folder gets very big, you would need to think about splitting these files into some hierarchical folder structure to get decent performance.
On the other hand, I don't see why you would use an XML file as a database if you need to update it frequently.
I would use a relational database, and add a new record to a table for each entry, with an index on the isbn_number column.
If you are dealing with thousands of records, you may very well go with SQLite, and you can replace it with a more powerful non-embedded DB if you ever need to, with no (or little :) ) code modification.

I think you'd be better off using a DBMS instead of either of your two methods.

If you want the least overhead just for checking existence, then option 1 is probably what you want, since it's a direct lookup. Parsing XML for every check requires you to pass through the whole XML file in the worst case. You could add caching to option 2, but that gets more complicated than option 1.
With option 1, though, beware that there are practical limits on how many files you can store in a single directory, so you will probably have to store the XML files in multiple layers (for example /xmldb/38/46/3846504937540.xml).
That said, neither of your options is a good way to store data in the long run; you will find them quite restrictive and hard to manage as the data grows.
People have already recommended using a DBMS, and I agree. On top of that, I would suggest looking into a document-based database like MongoDB.
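For reference, the layered directory layout mentioned for option 1 could be sketched like this. The class name IsbnFiles, the two-level split on the first four characters, and the ISBN validation pattern are assumptions made for illustration; the validation also keeps a user-supplied ISBN from being used to escape the root directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper mapping an ISBN to root/38/46/3846504937540.xml. */
class IsbnFiles {
    /** Builds the two-level path from the first four characters of the ISBN. */
    static Path pathFor(Path root, String isbn) {
        // Assumed format check: 10 to 13 characters, digits or a trailing-style X.
        if (!isbn.matches("[0-9Xx]{10,13}")) {
            throw new IllegalArgumentException("Not an ISBN: " + isbn);
        }
        return root.resolve(isbn.substring(0, 2))
                   .resolve(isbn.substring(2, 4))
                   .resolve(isbn + ".xml");
    }

    /** The existence check is a single file system lookup, with no parsing. */
    static boolean isCached(Path root, String isbn) {
        return Files.exists(pathFor(root, isbn));
    }

    /** Creates the intermediate directories on demand, then writes the entry. */
    static void store(Path root, String isbn, String xml) throws IOException {
        Path file = pathFor(root, isbn);
        Files.createDirectories(file.getParent());
        Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
    }
}
```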

Extend your db table to not only include the XML string but also the ISBN number.
Then you select the XML column based on the ISBN column.
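Query, as a Java string: "select XMLString from cacheTable where isbn = ?", with the ISBN bound through a PreparedStatement parameter rather than concatenated into the SQL, which avoids SQL injection.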
A different approach could be to use an ORM like Hibernate.
With an ORM, instead of saving the whole XML document in one column, you use a different column for each element and attribute, and you could even split your document up over several tables for a simpler long-term design.

What's the most efficient way to load data from a file to a collection on-demand?

I'm working on a Java project that will allow users to parse multiple files with potentially thousands of lines. The information parsed will be stored in different objects, which will then be added to a collection.
Since the GUI won't need to load ALL these objects at once and keep them in memory, I'm looking for an efficient way to load/unload data from files, so that data is only loaded into the collection when a user requests it.
I'm just evaluating options right now. I've also thought about the case where, after loading a subset of the data into the collection and presenting it in the GUI, the user goes back to previously viewed data: what is the best way to reload it? Re-run the parser, populate the collection and populate the GUI again? Or find a way to keep the collection in memory, or serialize/deserialize the collection itself?
I know that loading/unloading subsets of data can get tricky if some sort of data filtering is performed. Let's say that I filter on ID, so my new subset will contain data from two previously analyzed subsets. This would be no problem if I kept a master copy of the whole data in memory.
I've read that google-collections are good and efficient when handling big amounts of data, and offer methods that simplify lots of things so this might offer an alternative to allow me to keep the collection in memory. This is just general talking. The question on what collection to use is a separate and complex thing.
Do you know what's the general recommendation on this type of task? I'd like to hear what you've done with similar scenarios.
I can provide more specifics if needed.

You can embed a database, such as HSQLDB, into the application. That way you parse the files once and then use SQL to run both simple and complex queries.
HSQLDB (HyperSQL DataBase) is the leading SQL relational database engine written in Java. It has a JDBC driver and supports nearly full ANSI-92 SQL (BNF tree format) plus many SQL:2008 enhancements. It offers a small, fast database engine which offers in-memory and disk-based tables and supports embedded and server modes. Additionally, it includes tools such as a command line SQL tool and GUI query tools.

If you have tons of data, lots of files, and you are short on memory, you can do an initial scan of each file to index it. If the file is divided into records by line feeds, and you know how to read a record, you could index your records by byte location. Later, when you want to read a certain set of indices, you do a fast lookup to find which byte ranges you need, and read only those ranges from the file. When you don't need those items anymore, they will be GCed. You will never hold more items in the heap than you need.
This would be a simple solution. I'm sure you can find a library to provide you with more features.
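A minimal sketch of such a byte-offset index is shown below, assuming one record per line with "\n" line endings and UTF-8 text; the class name LineIndex is hypothetical. It uses RandomAccessFile to seek directly to a record rather than a plain InputStream, which is one way to read an arbitrary byte range.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical index of line start offsets, built with one sequential scan. */
class LineIndex {
    private final File file;
    // offsets.get(n) is where record n starts; the last entry is the file length.
    private final List<Long> offsets = new ArrayList<>();

    LineIndex(File file) throws IOException {
        this.file = file;
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    offsets.add(pos);
                    atLineStart = false;
                }
                if (b == '\n') atLineStart = true;
                pos++;
            }
            offsets.add(pos); // sentinel: end of the last record
        }
    }

    int size() { return offsets.size() - 1; }

    /** Seeks to record n and reads only its bytes; nothing else is loaded. */
    String read(int n) throws IOException {
        long start = offsets.get(n);
        long end = offsets.get(n + 1);
        byte[] bytes = new byte[(int) (end - start)];
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);
            raf.readFully(bytes);
        }
        String line = new String(bytes, StandardCharsets.UTF_8);
        return line.endsWith("\n") ? line.substring(0, line.length() - 1) : line;
    }
}
```

Only the offsets stay in memory (8 bytes per record plus boxing overhead here); a primitive long array would shrink that further for very large files.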
