Serializing a very large list - java

I'm getting a large amount of data from a database query and building objects from the rows. I end up with a list of about 1M of these objects that I want to serialize to disk for later use. The problem is that it barely fits in memory now and won't fit at all in the future, so I need some way to serialize, say, the first 100k, then the next 100k, and so on, and also to read the data back in 100k increments.
I could write some obvious code that checks whether the list is getting too big and then writes it to file 'list1', then 'list2', and so on, but maybe there's a better way to handle this?

You could go through the result set, create each object, and feed it immediately to an ObjectOutputStream, which writes it to the file.

Read the objects one by one from the DB
Don't put them into a list but write them into the file as you get them from the DB
Never keep more than a single object in RAM. When reading the data back, terminate the loop when readObject() signals the end of the file (it throws an EOFException at EOF; it only returns null if you wrote a null marker yourself). A sketch of both sides follows below.
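A minimal sketch of both sides, assuming a Serializable MyObject class; the table, column and file names are made up for illustration:

import java.io.*;
import java.sql.*;

static void dumpObjects(Connection con) throws Exception {
    try (PreparedStatement ps = con.prepareStatement("SELECT id, name FROM my_table");
         ResultSet rs = ps.executeQuery();
         ObjectOutputStream out = new ObjectOutputStream(
                 new BufferedOutputStream(new FileOutputStream("objects.bin")))) {
        while (rs.next()) {
            // never more than one object in memory at a time
            out.writeObject(new MyObject(rs.getLong("id"), rs.getString("name")));
            out.reset(); // clear the stream's back-reference table so it doesn't grow with every object
        }
    }
}

static void readObjects() throws Exception {
    try (ObjectInputStream in = new ObjectInputStream(
            new BufferedInputStream(new FileInputStream("objects.bin")))) {
        while (true) {
            try {
                MyObject obj = (MyObject) in.readObject();
                // process obj ...
            } catch (EOFException endOfFile) {
                break; // readObject() throws EOFException at the end of the stream
            }
        }
    }
}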

I assume you have already checked that it's really necessary to save the data to disk; it couldn't just stay in the database, could it?
To handle data that is too big, you need to make it smaller :-)
One idea is to get the data by chunks:
chunk the request itself, so you never build this huge list (it will become a point of failure sooner or later)
serialize your smaller list of objects
then loop (see the sketch after this list)
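A rough sketch of that loop, assuming MySQL-style LIMIT/OFFSET paging and a Serializable MyObject; all table, column and file names are placeholders:

import java.io.*;
import java.sql.*;
import java.util.*;

static void exportInChunks(Connection con) throws Exception {
    final int chunkSize = 100_000;
    int offset = 0;
    int fileIndex = 0;
    while (true) {
        List<MyObject> chunk = new ArrayList<>(chunkSize);
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, name FROM my_table ORDER BY id LIMIT ? OFFSET ?")) {
            ps.setInt(1, chunkSize);
            ps.setInt(2, offset);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    chunk.add(new MyObject(rs.getLong("id"), rs.getString("name")));
                }
            }
        }
        if (chunk.isEmpty()) {
            break; // no more rows
        }
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream("list" + fileIndex++ + ".bin")))) {
            out.writeObject(chunk); // one small list per file: list0.bin, list1.bin, ...
        }
        offset += chunk.size();
    }
}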

Think about setting the fetch size for the JDBC driver as well; for example, the JDBC driver for MySQL defaults to fetching the whole result set into memory.
Read here for more information: fetch size
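For illustration, the streaming setup for MySQL Connector/J looks roughly like this (con is an existing java.sql.Connection, and the table and column names are made up; other drivers honour a plain setFetchSize(n) hint, so check your driver's documentation):

// MySQL Connector/J only streams rows one at a time when the statement is
// forward-only, read-only and the fetch size is Integer.MIN_VALUE.
Statement stmt = con.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                     ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table");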

It seems that you are retrieving a large dataset from the DB, converting it into a list of objects, and serializing it in a single shot.
Don't do that; sooner or later it will crash the application.
Instead you should
minimize the amount of data retrieved from the database in one go (say 1,000 records instead of 1M)
convert them into business objects
and serialize them.
Then perform the same procedure until the last record is processed.
This way you can avoid the memory problem.

ObjectOutputStream will work, but it has more overhead. I think DataOutputStream/DataInputStream is a better choice.
Just read/write the values one by one and let a buffered stream worry about the buffering. For example, you can do something like this,
DataOutputStream os = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream("myfile")));
for (int num : numbers) {
    os.writeInt(num);
}
os.close();
One gotcha with both object and data streams is that write(int) only writes one byte. Use writeInt(int) instead.
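Reading the values back mirrors the write side; a DataInputStream has no record count or length header, so the end of the data is signalled by an EOFException (the file name is reused from the example above):

try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream("myfile")))) {
    while (true) {
        int num;
        try {
            num = in.readInt();
        } catch (EOFException endOfFile) {
            break; // no length header, so EOF marks the end of the data
        }
        // process num ...
    }
}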

Related

Consequences of using StepExecutionContext/JobExecutionContext to share a HashMap with large values

I have a requirement in which I retrieve values in one reader of a step using SQL statements, and the next reader makes the same request.
I do not want to make another request if the data has already been fetched in the first reader; instead I want to pass that collection (possibly a HashMap) to the next step.
For this I have gone through the following link on SO :
How can we share data between the different steps of a Job in Spring Batch?
In many of the comments it is mentioned that 'data must be short'.
Also it is mentioned in one response that: these contexts are good to share strings or simple values, but not for sharing collections or huge amounts of data.
My understanding is that passing that HashMap means only a reference to it is passed.
It would be good to know the possible consequences of passing it beforehand, and any better alternative approach.
Passing data between steps is indeed done via the execution context. However, you should be careful about the size of the data you put in the execution context, as it is persisted between steps.
I do not want to make another request if the data is already fetched in the First reader and pass that collection (possibly a HashMap) to next step
You can read the data from the database only once and put it in a cache. The second reader can then get the data from the cache. This would be faster than reading the data from the database a second time.
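As a minimal illustration of that cache idea in plain Java (the RowCache holder and its key/value types are assumptions, not Spring Batch APIs; a Spring-managed singleton bean would play the same role):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical application-scoped holder: the first reader fills it while reading
// from the database, the second reader serves its items from here instead of re-querying.
public class RowCache {
    private static final Map<Long, String> CACHE = new ConcurrentHashMap<>();

    public static void put(Long id, String value) {
        CACHE.put(id, value);
    }

    public static String get(Long id) {
        return CACHE.get(id);
    }

    public static boolean isEmpty() {
        return CACHE.isEmpty();
    }
}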
Hope this helps.

How to cache only part of the RDD in Spark?

I have a PairRDD<Metadata, BigData>.
I want to do two actions: one on all the data in the RDD and then another action on only the Metadata.
The input comes from reading massive files, which I don't want to repeat.
I understand that the classic thing to do is to use cache() or persist() on the input RDD so that it is kept in memory:
JavaPairRDD<Metadata, Bigdata> inputRDD = expensiveSource();
JavaPairRDD<Metadata, Bigdata> cachedRDD = inputRDD.cache();
cachedRDD.foreach(doWorkOnAllData);
cachedRDD.keys().foreach(doWorkOnMetadata);
The problem is that the input is so big that it doesn't fit in memory and cache() therefore does not do anything.
I could use persist() to spill it to disk, but since the data is so big, saving and reading all that data will actually be slower than reading the original source.
I could use MEMORY_ONLY_SER to gain a bit of space, but it is probably not enough, and even then serializing the whole thing when I am only interested in 0.1% of the data seems silly.
What I want is to cache only the key part of my PairRDD. I thought I could do that by calling cache() on the keys() RDD:
JavaPairRDD<Metadata, Bigdata> inputRDD = expensiveSource();
JavaRDD<Metadata> cachedRDD = inputRDD.keys().cache();
inputRDD.foreach(doWorkOnAllData);
cachedRDD.foreach(doWorkOnMetadata);
But in that case it doesn't seem to cache anything and just goes back to loading the source.
Is it possible to only put a part of the data in cache? The operation on the metadata is ridiculously small but I have to do it after the operation on the whole data.
Spark will only serve your RDD from the cache if you access it through inputRDD.keys() again.
What you can try is: JavaRDD<Metadata> keys = inputRDD.keys().cache(); to cache your JavaRDD<Metadata>.
Then to build your cachedRDD you do:
JavaPairRDD<Metadata, Bigdata> cachedRDD = keys.join(bigDataRDD) (pseudocode: the keys RDD would first have to be turned into a pair RDD for join to apply)
Also, if your RDD is huge, reading from the cache is slowest the first time, because the RDD has to be saved first; the next time you read it, it will be faster.

Processing a big list from DB in Java

I have a big list of over 20,000 items to be fetched from the DB and processed daily in a simple console-based Java app.
What is the best way to do that? Should I fetch the list in small sets and process them, or should I fetch the complete list into an array and process it? Keeping it all in an array means a huge memory requirement.
Note: There is only one column to process.
Processing means I have to pass the string in that column somewhere else as a SOAP request.
The 20,000 items are strings of length 15.
It depends. 20,000 is not really a big number. If you are only processing 20,000 short strings or numbers, the memory requirement isn't that large; but if it's 20,000 images, that is a bit larger.
There's always a tradeoff. Multiple chunks of data mean multiple trips to the database, but a single trip means more memory. Which is more important to you? Also, can your data be chunked, or do you, for example, need record 1 in order to process record 1000?
These are all things to consider. Hopefully they help you come to what design is best for you.
Correct me if I am wrong: fetch it little by little, and also provide a rollback operation for it.
If the job can be done at the database level, I would do it using SQL scripts. Should this be impossible, I recommend loading small pieces of your data with two columns: the ID column and the column that needs to be processed.
This will give you better performance during the processing, and if anything crashes you will not lose all the processed data. In a crash case you will, however, need to know which rows have already been processed; this can be done with a third column or by saving the last processed ID after each round (see the sketch below).
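A sketch of that chunked loop, using the ID column as a bookmark; sendSoapRequest, loadLastProcessedId and saveLastProcessedId are hypothetical helpers standing in for the real work, and the table/column names are invented:

import java.sql.*;

static void processInChunks(Connection con) throws Exception {
    final int chunkSize = 1_000;
    long lastId = loadLastProcessedId(); // hypothetical: restore the bookmark after a crash
    while (true) {
        int rowsInChunk = 0;
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, value_column FROM my_table WHERE id > ? ORDER BY id LIMIT ?")) {
            ps.setLong(1, lastId);
            ps.setInt(2, chunkSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    sendSoapRequest(rs.getString("value_column")); // hypothetical SOAP call
                    lastId = rs.getLong("id");
                    rowsInChunk++;
                }
            }
        }
        saveLastProcessedId(lastId); // hypothetical: persist the bookmark after each chunk
        if (rowsInChunk < chunkSize) {
            break; // last chunk reached
        }
    }
}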

Which is the best way to go - store table data to java arraylist or access table when and where needed?

The problem is like this:
I am retrieving the latitude and longitude of points and storing them in an Agent class.
The class has the members id, latitude and longitude, and associated functions.
At the start of the Java app it reads all rows from a local MySQL table (holding coordinate values) and stores them in an ArrayList of these Agent objects. It then iterates through this ArrayList, calls the necessary functions, and draws a graph at the end.
This works fine for a table of 8K-10K rows, but for another table with about 200,000 rows it starts giving a heap size error in Java.
I am thinking of changing the code to access the DB every time the coordinates of an agent are needed. That should reduce the memory required, right? However, there are many loops using the Iterator of this Agent ArrayList, and I am not sure how to handle those.
So which would be the better option: reading from the table every time, or keeping things as they are and increasing the Java memory allocation?
Thanks for reading.
Try to only load data as you need it. We don't have very much information, but I am assuming there is no need for the user to see or interact with all 10,000 data points at once. Besides the physical limitations of the device's screen size, realistically a user will just be overwhelmed by that much information.
I would try implementing a paging system. Again, I have no idea what you are using the Agent class for, but try to only load as many Agents as you are showing to the user.
Your strategy has a serious drawback: a big amount of data is read from disk, stored in the database's memory, sent to the Java app, and finally placed in the JVM's memory. This takes a lot of time and resources.
Use cursors; then you'll be able to read from the DB and iterate over the result set in the Java app at the same time, and the JVM won't have to allocate so much memory (a sketch follows the links below).
Some documentation:
http://forge.mysql.com/wiki/Cursors
http://dev.mysql.com/doc/refman/5.6/en/connector-j-reference-implementation-notes.html (Read ResultSet section)
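Roughly what that looks like with a streaming result set; the Agent constructor, the per-agent process() call and the table/column names are placeholders, and the fetch-size trick is the MySQL-specific one from the implementation notes above:

import java.sql.*;

static void processAgents(Connection con) throws Exception {
    try (Statement stmt = con.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                              ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(Integer.MIN_VALUE); // MySQL Connector/J: stream rows instead of buffering the whole result set
        try (ResultSet rs = stmt.executeQuery("SELECT id, latitude, longitude FROM agents")) {
            while (rs.next()) {
                Agent agent = new Agent(rs.getInt("id"),
                                        rs.getDouble("latitude"),
                                        rs.getDouble("longitude"));
                process(agent); // hypothetical per-agent work; nothing is kept in an ArrayList
            }
        }
    }
}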

Is it faster to access a java list (arraylist) compared to accessing the same data in a mysql database?

I have the MySQL database on the local machine where I'm running the Java program.
I plan to create an ArrayList of all the entries of a particular table. From that point onwards I will not access the database to get a particular entry in the table; instead I will use the ArrayList created. Is this going to be faster or slower than accessing the database to grab a particular entry in the table?
Please note that the table I'm interested has about 2 million entries.
Thank you.
More info: I need only two fields, one of type Long and one of type String. The index of the table is Long, not int.
No, it's going to be much slower, because to find an element in an ArrayList you have to scan through it sequentially until your element is found.
It can be faster for a few hundred entries, because you don't have the connection overhead, but with two million entries MySQL is going to win, provided that you create the correct indexes. Only retrieve the rows that you actually need each time.
Why are you thinking of doing this? Are you experiencing slow queries?
To find out, activate the slow query log in your my.cnf by uncommenting (or adding) the following lines.
# Here you can see queries with especially long duration
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 1
Then see which queries take a long time and run them with EXPLAIN in front; consider adding an index where the EXPLAIN output tells you none is being used, or just post a new question with your CREATE TABLE statement and an example query to optimize.
This question is too vague and can easily go either way, depending on:
How many fields in each record, how big are the fields?
What kind of access are you going to perform? Text search? Sequential?
For example, if each record consists of a couple of bytes of data, it's much faster to store them all in memory (not necessarily in an ArrayList, though). You may want to put them into a TreeSet, for example.
It depends on what you will do with the data. If you just want a few rows, only those should be fetched from the DB. If you know that you need ALL the data, go ahead and load the whole table into Java if it fits in memory. What will you do with it afterwards? Sequential or random reading? Will the data be changed? A Map or Set could be a faster alternative depending on how the collection will be used.
Whether it is faster or slower is measurable: time it. It is definitely faster to work with structures stored in memory than with data tables located on disk, provided you have enough memory and you don't have 20 users running the same process at the same time.
How do you access the data? Do you have an integer index?
First, accessing an ArrayList is much faster than accessing a database, because accessing memory is much faster than accessing a hard disk.
If the number of entries in the list is big, and I guess it is, then you should consider using a "direct access" data structure such as a HashMap, which will act like a database table where values are referenced by their keys (a short illustration follows).
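As an illustration of that keyed lookup (the query, table and column names are invented; the question only mentions one Long field and one String field):

import java.sql.*;
import java.util.HashMap;
import java.util.Map;

static Map<Long, String> loadLookup(Connection con) throws SQLException {
    Map<Long, String> byId = new HashMap<>(2_000_000);
    try (Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table")) {
        while (rs.next()) {
            byId.put(rs.getLong("id"), rs.getString("name"));
        }
    }
    return byId;
}

// usage: Map<Long, String> lookup = loadLookup(connection);
//        String value = lookup.get(42L); // O(1) lookup by key, no sequential scan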
