Parse large number of XML files in Java

I'll be getting a large number of XML files (tens of thousands every few minutes) from an MQ. The XML files aren't very big. I have to extract the information and save it into a database. Unfortunately I cannot use third-party libraries (except Apache Commons). What strategies/techniques are normally used in this scenario? Is there any XML parser in Java or Apache which can handle such situations well?
I might also add that I'm using JDK 1.4.

Based on the comments and discussion around this topic - I would like to propose a consolidated solution.
Parsing XML files using SAX - As @markspace mentioned, you should go
with SAX, which is built in and has good performance (a combined sketch appears at the end of this answer).
Use BULK INSERTS if possible - Since you plan to insert a large
amount of data, consider what type of data you are reading and
storing into the database. Do all the XML files share the same
schema (which means they correspond to a single table in the
database), or do they represent different objects (which means you
would end up inserting data into multiple tables)?
If all the XML files need to be inserted into the same table in the
database, then consider batching these data objects and
bulk-inserting them into the database. This will definitely perform
better in terms of time as well as resources (you would open only a
single connection to persist a batch as opposed to one connection
per object). Of course you would need to spend some time tuning
your batch size and also deciding the error-handling strategy for
batch inserts (discard all vs. discard only the erroneous ones).
If the schemas of the XML files are different, then consider grouping
similar XMLs together so that you can BULK INSERT these groups
later.
Finally - and this is important: ensure that you release all
resources such as file handles, database connections etc. once you
are done with processing, or in case you encounter errors. In simple
words, use try-catch-finally in the correct places.
While by no means complete, I hope this answer provides a set of critical checkpoints that you need to consider while writing scalable, performant code.
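A rough sketch of how the SAX parsing and batch-insert points could fit together (JDK 1.4 friendly, so raw collections and no try-with-resources; the table name, columns and XML element names below are made up for illustration):

import java.io.File;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XmlBatchLoader extends DefaultHandler {

    private final List rows = new ArrayList();   // each entry is a String[] {id, name}
    private String currentId;
    private StringBuffer text = new StringBuffer();

    // Collect the fields we care about while SAX streams through the document.
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0);
        if ("record".equals(qName)) {
            currentId = atts.getValue("id");
        }
    }

    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) {
        if ("name".equals(qName)) {
            rows.add(new String[] { currentId, text.toString() });
        }
    }

    // Parse one file and add its rows to the in-memory batch.
    public void parse(String file) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(file), this);
    }

    // Flush everything collected so far in a single JDBC batch.
    public void flush(Connection con) throws Exception {
        PreparedStatement ps = con.prepareStatement("insert into my_table (id, name) values (?, ?)");
        try {
            for (int i = 0; i < rows.size(); i++) {
                String[] row = (String[]) rows.get(i);
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
            rows.clear();
        } finally {
            ps.close();
        }
    }
}

You would call parse() for each incoming file and flush() once the batch reaches a tuned size, closing the connection in a finally block as noted above.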

Related

What is the best approach for saving statistical data on a file using spring framework?

What is the best approach for saving statistical data to a file using the Spring framework? Is there any available library that offers reading and updating the data in a file, or should I build my own IO code?
I already have a relational database, but I don't like the approach of creating an additional table to save the calculated values spread across multiple tables with joins, and I also don't want to add more complexity to the project by using an additional database for just one task, like MongoDB.
To understand the complexity of this report, imagine you are drawing a chart of the total number of daily transactions for a full year, with billions of records at any time, plus a lot of extra information (totals and averages in different currencies at different rates).
So my approach was to generate those data in a file on a regular basis, so that later I don't need to generate them again once requested, and only append the new dates to the file as they become available.
Is this approach fine? And what is the best library to do that in an efficient way?
Update
I found this answer useful for understanding why people sometimes prefer flat files over a relational or non-relational database:
Is it faster to access data from files or a database server?
I would prefer to use MongoDB for such purposes, but if you need a simple approach, you can write your data to a CSV/Excel file.
Just using I/O:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

List<String> data = new ArrayList<>();
data.add("head1;head2;head3");   // header row
data.add("a;b;c");
data.add("e;f;g");
data.add("9;h;i");
Files.write(Paths.get("my.csv"), data);   // writes one line per list entry
That is all)
How to convert your own object to such a string ('field1;field2') I think you know.
You can also use the apache-poi library or a CSV library, but I think plain I/O like this is much faster.
Files.write(Paths.get("my.csv"), data, StandardOpenOption.APPEND);
If you want to append data to an existing file, there are many different options in StandardOpenOption.
For reading you should use Files.readAllLines(Paths.get("my.csv")); it will return a list of strings.
You can also read lines in a given range.
But if you need to retrieve a single column, update two columns with a WHERE condition, and so on, you should read about MongoDB or other non-relational databases. It is difficult to cover MongoDB here; you should read the documentation.
Enjoy)
I found a library that can be used to read/write CSV files easily and map them to objects as well: Jackson dataformats.
Find an example with Spring.
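For reference, a minimal sketch of the Jackson CSV idea, assuming the jackson-dataformat-csv module is on the classpath; the Transaction POJO and file name are hypothetical:

import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical POJO; @JsonPropertyOrder fixes the CSV column order.
@JsonPropertyOrder({ "date", "count", "total" })
class Transaction {
    public String date;
    public int count;
    public double total;
    public Transaction() { }
    public Transaction(String date, int count, double total) {
        this.date = date; this.count = count; this.total = total;
    }
}

public class CsvExample {
    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = mapper.schemaFor(Transaction.class).withHeader();

        // Write a list of POJOs as CSV rows with a header line.
        List<Transaction> stats = Arrays.asList(new Transaction("2023-01-01", 42, 1000.0));
        mapper.writer(schema).writeValue(new File("stats.csv"), stats);

        // Read them back into objects.
        MappingIterator<Transaction> it =
                mapper.readerFor(Transaction.class).with(schema).readValues(new File("stats.csv"));
        List<Transaction> loaded = it.readAll();
        System.out.println(loaded.size() + " rows loaded");
    }
}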

Bulk Update in Oracle from CSV file

I have a table and a CSV file; what I want to do is update the table from the CSV.
The CSV file is as follows (no delta):
1,yes
2,no
3,yes
4,yes
Steps through Java:
What I have done is read the CSV file and build two lists, yesContainList and noContainList,
added the id values which have yes and no to the respective lists,
converted each list into a comma-separated string,
and updated the table with that comma-separated string.
It's working fine, but when I have to handle lakhs (hundreds of thousands) of records it is somewhat slow.
Could anyone tell me whether this is the correct way, or whether there is a better way to do this update?
There are 2 basic techniques to do this:
sqlldr
Use an external table.
Both methods are explained here:
Update a column in table using SQL*Loader?
Doing jobs like bulk operations, imports, exports or heavy SQL operations outside the RDBMS is not recommended due to performance issues.
By fetching and sending large tables through ODBC-like APIs you will suffer network round trips, memory usage, IO hits, and so on.
When designing a client-server application (like J2EE), would you design a heavy batch operation to be called and controlled synchronously from the user-interface layer, or would you design a server-side process triggered by a client's command?
Think about your Java code as the UI layer and the RDBMS as the server side.
BTW, RDBMSs have built-in features for these operations, like SQL*Loader in Oracle.

JAVA : file exists Vs searching large xml db

I'm quite new to Java programming and am writing my first desktop app. This app takes a unique ISBN and first checks to see if it's already held in the local DB; if it is, then it just reads from the local DB, and if not, it requests the data from isbndb.com and enters it into the DB. The local DB is in XML format. Now what I'm wondering is which of the following two methods would create the least overhead when checking to see if the entry already exists.
Method 1.) File Exists.
On creating said DB entry the app would create a separate file for every ISBN number, named after the ISBN (e.g. 3846504937540.xml), and when checking would use the file-exists method to check if an entry already exists for the user-provided ISBN.
Method 2.) SAX XML Parser.
All entries would be entered into a single large XML file, and when checking for existing entries the SAX XML parser would be used to parse the file and the user-provided ISBN would be checked against those in the XML DB for a match.
Note :
The resulting entries could number in the thousands over time.
Any information would be greatly appreciated.
I don't think either of your methods is all that great. I strongly suggest using a DBMS to store the data. If you don't have a DBMS on the system, or if you want an app that can run on systems without an installed DBMS, take a look at using SQLite. You can use it from Java with SQLiteJDBC by David Crawshaw.
As far as your two methods are concerned, the first will generate a huge amount of file clutter, not to mention maintenance and consistency headaches. The second method will be slow once you have a sizable number of entries, because you basically have to read (on average) half the database for every query. With a DBMS, you can avoid this by defining indexes for the info you need to look up quickly. The DBMS will automatically maintain the indexes.
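A rough sketch of the SQLite suggestion, assuming an SQLite JDBC driver (SQLiteJDBC or the compatible sqlite-jdbc) is on the classpath; the table and file names are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BookCache {
    public static void main(String[] args) throws Exception {
        Class.forName("org.sqlite.JDBC");   // register the driver
        Connection con = DriverManager.getConnection("jdbc:sqlite:books.db");
        try {
            Statement st = con.createStatement();
            st.executeUpdate("CREATE TABLE IF NOT EXISTS books (isbn TEXT PRIMARY KEY, xml TEXT)");
            st.close();

            // The existence check uses the primary-key index, so it stays fast as the table grows.
            PreparedStatement ps = con.prepareStatement("SELECT 1 FROM books WHERE isbn = ?");
            ps.setString(1, "3846504937540");
            ResultSet rs = ps.executeQuery();
            boolean cached = rs.next();
            rs.close();
            ps.close();
            System.out.println(cached ? "already in local DB" : "fetch from isbndb.com");
        } finally {
            con.close();
        }
    }
}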
I don't much like the idea of relying on the file system for that task: I don't know how critical your application is, but many things may happen to these xml files :) plus, if the folder gets very, very big, you would need to think about splitting these files into some hierarchical folder structure to have decent performance.
On the other hand, I don't see why you would use an XML file as a database if you need to update it frequently.
I would use a relational database, and add a new record in a table for each entry, with an index on the isbn_number column.
If you are in the thousands of records, you may very well go with sqlite, and you can replace it with a more powerful non-embedded DB if you ever need it, with no (or little :) ) code modification.
I think you'd better use a DBMS instead of your 2 methods.
If you want the least overhead just for checking existence, then option 1 is probably what you want, since it's a direct lookup. Parsing XML each time for checking requires you to pass through the whole XML file in the worst case. You can add caching to option 2, but that gets more complicated than option 1.
With option 1, though, you need to beware that there is a limit to how many files you can store under a directory, so you probably have to store the XML files in multiple layers (for example /xmldb/38/46/3846504937540.xml).
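For illustration, a small sketch of that multi-layer layout (the base directory and the 13-digit ISBN split are assumptions):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class XmlStore {
    // e.g. 3846504937540 -> /xmldb/38/46/3846504937540.xml
    static File fileFor(String isbn) {
        return new File("/xmldb/" + isbn.substring(0, 2) + "/" + isbn.substring(2, 4),
                        isbn + ".xml");
    }

    static boolean exists(String isbn) {
        return fileFor(isbn).isFile();
    }

    static void save(String isbn, String xml) throws IOException {
        File f = fileFor(isbn);
        f.getParentFile().mkdirs();   // create the shard directories on demand
        FileWriter w = new FileWriter(f);
        try { w.write(xml); } finally { w.close(); }
    }
}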
That said, neither of your options is a good way to store data in the long run; you will find they become quite restrictive and hard to manage as the data grows.
People have already recommended using a DBMS and I agree. On top of that I would suggest you look into a document-based database like MongoDB.
Extend your db table to not only include the XML string but also the ISBN number.
Then you select the XML column based on the ISBN column.
Query: Java escaped, "select XMLString from cacheTable where isbn='"+ isbn +"'"
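If you go this route, a parameterized query is safer than concatenating the ISBN into the SQL string; a rough sketch using the table and column names from the query above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class XmlLookup {
    // Returns the cached XML for an ISBN, or null if it is not in the table yet.
    static String findXml(Connection con, String isbn) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
                "select XMLString from cacheTable where isbn = ?");
        try {
            ps.setString(1, isbn);
            ResultSet rs = ps.executeQuery();
            String xml = rs.next() ? rs.getString(1) : null;
            rs.close();
            return xml;
        } finally {
            ps.close();
        }
    }
}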
A different approach could be to use an ORM like Hibernate.
With an ORM, instead of saving the whole XML document in one column, you use a different column for each element and attribute, and you could even split your document over several tables for a simpler long-term design.
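A minimal sketch of what such a mapping could look like with JPA/Hibernate annotations (the columns shown are hypothetical):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Instead of one big XML column, each element/attribute becomes its own column.
@Entity
@Table(name = "books")
public class Book {

    @Id
    @Column(name = "isbn")
    private String isbn;

    @Column(name = "title")
    private String title;

    @Column(name = "author")
    private String author;

    protected Book() { }   // no-arg constructor required by Hibernate

    public Book(String isbn, String title, String author) {
        this.isbn = isbn;
        this.title = title;
        this.author = author;
    }

    public String getIsbn() { return isbn; }
    public String getTitle() { return title; }
    public String getAuthor() { return author; }
}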

Maintaining a billion key:value pairs in a file

With Java, how can I store around a billion key-value pairs in a file, with the possibility of dynamically updating and querying the values whenever necessary?
If for some reason a database is out of the question, then you need to answer the following question about your problem:
What is the mix of the following operations?
Insert
Read
Modify
Delete
Search
Once you have a good guess at the ratio of these operations, try selecting the appropriate data structure for use in your file. I'd recommend starting with this book as a good catalog of options:
http://www.amazon.com/Introduction-Algorithms-Second-Thomas-Cormen/dp/0262032937
You'll want to select a data structure with the best average and worst case runtimes for your most common operations.
Good Luck
Old question, but this is a case for log files. You do not want to be copying a billion records over every time you do a delete. This can be solved by logging all "transactions" or updates to a new and separate file. These files should be broken up into reasonable sizes.
To read a tuple, you start at the newest log file until you find your key, then stop. To update or insert you just add a new record into the most recent log file. A delete is still a log entry.
A batch coalesce process needs to be run periodically to scan each log file and write out a new master. As it is read, each NEW key gets written to the new master and duplicate (old) keys are skipped until you make it all the way through. If you encounter a delete record, mark it in a separate delete list, skip the record, and ignore subsequent records with that key.
That makes it sound simple, but remember you may want to block/chunk your file, as you will likely scan the log files in reverse, or at least seek() to the maximum size and work backwards rather than reading forward.
I have done this exact thing with billions of lines of data. You're just re-inventing sequential access databases.
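As a very reduced illustration of that log-file scheme (the record format, tombstone marker and class names are all made up; a real coalesce over billions of records would stream/merge rather than build an in-memory map):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogStore {
    private static final String TOMBSTONE = "\u0000DELETED";
    private final List<File> logs;   // oldest first, last entry is the active log
    public LogStore(List<File> logsOldestFirst) { this.logs = logsOldestFirst; }

    // Inserts, updates and deletes are all just appends to the newest log.
    public void put(String key, String value) throws IOException {
        PrintWriter w = new PrintWriter(new FileWriter(logs.get(logs.size() - 1), true));
        try { w.println(key + "\t" + value); } finally { w.close(); }
    }
    public void delete(String key) throws IOException { put(key, TOMBSTONE); }

    // Read: scan logs newest first, and within a log keep the last value seen for the key.
    public String get(String key) throws IOException {
        for (int i = logs.size() - 1; i >= 0; i--) {
            String hit = null;
            BufferedReader r = new BufferedReader(new FileReader(logs.get(i)));
            try {
                String line;
                while ((line = r.readLine()) != null) {
                    int tab = line.indexOf('\t');
                    if (tab > 0 && line.substring(0, tab).equals(key)) {
                        hit = line.substring(tab + 1);
                    }
                }
            } finally { r.close(); }
            if (hit != null) return TOMBSTONE.equals(hit) ? null : hit;
        }
        return null;
    }

    // Coalesce: replay every log oldest->newest, then write one new master without tombstones.
    // (In-memory map used only to keep the sketch short.)
    public void coalesce(File master) throws IOException {
        Map<String, String> latest = new HashMap<String, String>();
        for (int i = 0; i < logs.size(); i++) {
            BufferedReader r = new BufferedReader(new FileReader(logs.get(i)));
            try {
                String line;
                while ((line = r.readLine()) != null) {
                    int tab = line.indexOf('\t');
                    if (tab > 0) latest.put(line.substring(0, tab), line.substring(tab + 1));
                }
            } finally { r.close(); }
        }
        PrintWriter w = new PrintWriter(new FileWriter(master));
        try {
            for (Map.Entry<String, String> e : latest.entrySet()) {
                if (!TOMBSTONE.equals(e.getValue())) w.println(e.getKey() + "\t" + e.getValue());
            }
        } finally { w.close(); }
    }
}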
You leave out a lot of details, but...
Are the keys static? What about the values? Are they fixed size? Why not use a database?
If you don't want to use a database then use a memory mapped file.
Can you use a database? Managing such a large file would be a pain.
Edit: if the file requirement is mostly to avoid machine communication failures, downtime and similar situations, maybe you could use an embedded database. This way you would be freed from the large file manipulation problems and still use all the advantages a database can give you. I already used Apache Derby as an embedded database with wonderful results. Java DB is Oracle supported and based on Derby.
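A minimal sketch of the embedded-Derby idea applied to the key-value case (the database name, table and the update-then-insert "upsert" are illustrative choices, not a prescribed pattern):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class EmbeddedKvStore {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
        // ";create=true" creates the database directory on first run.
        Connection con = DriverManager.getConnection("jdbc:derby:kvdb;create=true");
        try {
            Statement st = con.createStatement();
            try {
                st.executeUpdate("CREATE TABLE kv (k VARCHAR(255) PRIMARY KEY, v VARCHAR(4000))");
            } catch (SQLException alreadyExists) {
                // table was created on a previous run - ignore
            }
            st.close();

            // Poor man's upsert: try an update first, insert only if nothing was updated.
            PreparedStatement upd = con.prepareStatement("UPDATE kv SET v = ? WHERE k = ?");
            upd.setString(1, "value1");
            upd.setString(2, "key1");
            if (upd.executeUpdate() == 0) {
                PreparedStatement ins = con.prepareStatement("INSERT INTO kv (k, v) VALUES (?, ?)");
                ins.setString(1, "key1");
                ins.setString(2, "value1");
                ins.executeUpdate();
                ins.close();
            }
            upd.close();
        } finally {
            con.close();
        }
    }
}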

What's the most efficient way to load data from a file to a collection on-demand?

I'm working on a Java project that will allow users to parse multiple files with potentially thousands of lines. The parsed information will be stored in different objects, which will then be added to a collection.
Since the GUI won't require loading ALL these objects at once and keeping them in memory, I'm looking for an efficient way to load/unload data from files, so that data is only loaded into the collection when a user requests it.
I'm just evaluating options right now. I've also thought about the case where, after loading a subset of the data into the collection and presenting it on the GUI, what the best way would be to reload previously viewed data: re-run the parser and repopulate the collection and GUI, or perhaps find a way to keep the collection in memory, or serialize/deserialize the collection itself?
I know that loading/unloading subsets of data can get tricky if some sort of data filtering is performed. Let's say that I filter on ID, so my new subset will contain data from two previously analyzed subsets. This would be no problem if I kept a master copy of the whole data in memory.
I've read that google-collections is good and efficient when handling big amounts of data, and offers methods that simplify lots of things, so it might offer an alternative that allows me to keep the collection in memory. This is just talking in general terms; the question of which collection to use is a separate and complex thing.
Do you know what's the general recommendation on this type of task? I'd like to hear what you've done with similar scenarios.
I can provide more specifics if needed.
You can embed a database into the application, like HSQLDB. That way you parse the files the first time and then use SQL to do simple and complex queries.
HSQLDB (HyperSQL DataBase) is the leading SQL relational database engine written in Java. It has a JDBC driver and supports nearly full ANSI-92 SQL (BNF tree format) plus many SQL:2008 enhancements. It offers a small, fast database engine which offers in-memory and disk-based tables and supports embedded and server modes. Additionally, it includes tools such as a command line SQL tool and GUI query tools.
If you have tons of data, lots of files, and you are short on memory, you can do an initial scan of the file to index it. If the file is divided into records by line feeds, and you know how to read a record, you can index your records by byte location. Later, if you want to read a certain set of indices, you do a fast lookup to find which byte ranges you need, and read those from the file's input stream. When you don't need those items anymore, they will be GCed. You will never hold more items than you need in the heap.
This would be a simple solution. I'm sure you can find a library to provide you with more features.
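A bare-bones sketch of that byte-offset indexing idea using RandomAccessFile (it assumes one record per line and a single-byte-per-character encoding such as ASCII):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    private final RandomAccessFile raf;
    private final List<Long> offsets = new ArrayList<Long>();

    // One pass over the file records the byte offset where each record starts.
    public LineIndex(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
        long pos = 0;
        while (raf.readLine() != null) {
            offsets.add(Long.valueOf(pos));   // offset of the line just read
            pos = raf.getFilePointer();       // start of the next line
        }
    }

    public int size() { return offsets.size(); }

    // Later, load only the requested record by seeking straight to its offset.
    public String record(int i) throws IOException {
        raf.seek(offsets.get(i).longValue());
        return raf.readLine();
    }

    public void close() throws IOException { raf.close(); }
}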
