Spring Batch - using an in-memory database for huge file processing - Java

I am using Spring Batch to process a huge amount of data (150 GB) and produce a 60 GB output file. I am using a vertical scaling approach with 15 threads (step partitioning).
The job execution details are stored in an in-memory database. CPU utilization is high because everything runs on a single machine and the file is huge, but the server has a good configuration: a 32-core processor, and I am using 10 GB of memory for this process.
My question is: if I move the job metadata to a separate database, will it reduce the CPU utilization? Also, is using an in-memory database in production a bad choice/decision?
Regards,
Shankar

When you talk about moving from the in-memory db to a separate db, you are just talking about the batch runtime tables (job_instance, job_execution, step_execution, ...), right?
If so, I wouldn't expect the CPU usage to drop a lot. Depending on your chunk size, far more CPU is needed for your data processing than for updating the batch runtime tables.
Whether using an in-memory db in production is a good decision or not depends on your needs. Two points to consider:
You lose any information that was written into the batch runtime tables. This could be helpful for debugging sessions or simply to have a kind of history, but you can also "persist" such information in logfiles.
You will not be able to implement a restartable job. This could be an issue if your job takes hours to complete, but for a job that only reads from a file, writes to a file, and completes within a couple of minutes, it is not really a problem.
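For what it's worth, pointing the job repository at an external database is mostly a matter of exposing a different DataSource. A minimal sketch, assuming Spring Batch 4 with @EnableBatchProcessing picking up the single DataSource bean, and with a hypothetical PostgreSQL host, database, and credentials (the BATCH_* schema must already exist there):

import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

@Configuration
@EnableBatchProcessing
public class BatchRepositoryConfig {

    // With @EnableBatchProcessing, the JobRepository is built on the DataSource
    // found in the context, so the batch runtime tables (BATCH_JOB_INSTANCE,
    // BATCH_JOB_EXECUTION, BATCH_STEP_EXECUTION, ...) end up in this database.
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("org.postgresql.Driver");          // hypothetical driver
        ds.setUrl("jdbc:postgresql://db-host:5432/batch_meta");  // hypothetical URL
        ds.setUsername("batch_user");                            // hypothetical credentials
        ds.setPassword("secret");
        return ds;
    }
}

In a real production setup you would of course replace DriverManagerDataSource with a pooled implementation.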

Related

Spring Batch Performance Improvement for a complex job

I have a Spring Batch job that runs on a daily basis and has around 100k records to process. I have configured my batch as below.
ItemReader: I use a JdbcCursorItemReader that reads data from a single table (this table holds all the source records). The chunk size is 1000.
ItemProcessor: Here I added logic to perform validation for every record. Validation includes checking the data for correctness, and once the validations are complete I have to check a few more tables for this record.
ItemWriter: Here I update the final tables based on the validation results. (This is a bulk operation, and I use JdbcTemplate.batchUpdate for faster processing; see the sketch below.)
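For illustration only, a bulk writer of that shape might look like the following sketch, assuming the Spring Batch 4 ItemWriter signature; ValidatedRecord, the table, and the columns are hypothetical names standing in for the real ones:

import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;

public class ValidatedRecordWriter implements ItemWriter<ValidatedRecordWriter.ValidatedRecord> {

    private final JdbcTemplate jdbcTemplate;

    public ValidatedRecordWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends ValidatedRecord> items) {
        // One JDBC batch per chunk (1000 records) instead of one statement per record.
        jdbcTemplate.batchUpdate(
            "UPDATE target_table SET status = ? WHERE id = ?",
            new BatchPreparedStatementSetter() {
                @Override
                public void setValues(PreparedStatement ps, int i) throws SQLException {
                    ValidatedRecord r = items.get(i);
                    ps.setString(1, r.getStatus());
                    ps.setLong(2, r.getId());
                }

                @Override
                public int getBatchSize() {
                    return items.size();
                }
            });
    }

    // Hypothetical domain object produced by the processor.
    public static class ValidatedRecord {
        private final long id;
        private final String status;

        public ValidatedRecord(long id, String status) {
            this.id = id;
            this.status = status;
        }

        public long getId() { return id; }
        public String getStatus() { return status; }
    }
}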
Results:
Processing 104,000 records took around 140 minutes. Since this runs on a daily basis and many other jobs run in parallel in production, I want to improve the performance of this batch.
Can someone suggest a better way to speed this batch up? (I tried the multi-threaded approach provided by Spring Batch, using a TaskExecutor in the step config, but I got cursor issues in the reader, shown below.)
Caused by: org.springframework.dao.InvalidDataAccessResourceUsageException: Unexpected cursor position change.
at org.springframework.batch.item.database.AbstractCursorItemReader.verifyCursorPosition(AbstractCursorItemReader.java:368)
at org.springframework.batch.item.database.AbstractCursorItemReader.doRead(AbstractCursorItemReader.java:452)
at org.springframework.batch.item.support.AbstractItemCountingItemStreamItemReader.read(AbstractItemCountingItemStreamItemReader.java:88)
at org.springframework.batch.core.step.item.SimpleChunkProvider.doRead(SimpleChunkProvider.java:91)
at org.springframework.batch.core.step.item.FaultTolerantChunkProvider.read(FaultTolerantChunkProvider.java:87)
[Screenshot: CPU sample inside the ItemProcessor]
Use VisualVM to find the bottlenecks inside your application.
Since processing 104,000 records takes around 140 minutes, profiling will give you better insight into where the performance hits actually are.
VisualVM walkthrough:
Open VisualVM, connect to your application => Sampler => CPU => CPU Samples.
Take snapshots at various times and analyse where the time is spent. That alone should give you enough data for optimisation.
Note: VisualVM ships with the Oracle JDK 8 distribution; you can simply type jvisualvm at the command prompt/terminal. If it is not there, download it separately.

Spark Thrift Server for exposing big files?

We have set up a Thrift Server with Spark 2.0 in Mesos client mode.
When we try to query a 170 MB Parquet file (select * from the table), it always fails with a Java out-of-memory error (Java heap space),
even though there are a couple of executors/workers and the executors' tasks complete successfully (as seen in the Spark UI).
The query finally completes successfully once the JVM memory is increased to 25 GB and the Spark driver memory gets 21 GB. The bottleneck seems to be the driver memory itself.
Kryo serialization is used (spark.kryoserializer.buffer.max=1024m), the files are stored in an S3 bucket, and YARN is not used.
Why does the driver consume that much memory for such a simple query?
What other parameters/configuration can help support large data sets and concurrent JDBC connections?
Thanks.
Q1: Parquet files are compressed; when loaded into memory they are decompressed. What's more, Java objects, including strings, carry their own overhead, and if you have lots of small strings the cost can be considerable.
Q2: I'm not sure about Spark 2.0, but in some previous versions you could use the incremental collect option to fetch the result set batch by batch.
As @paul said, don't trust the file size.
Parquet is a columnar storage file format, so retrieving data with "*" is really not a good idea, though it works well for group-by queries.
The driver's role is to manage the worker executors and then hand you the query result at the end, so all your data will be collected on the driver.
Try limiting your query and selecting specific fields rather than *.
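As a rough sketch of that advice, a JDBC client could project only the columns it needs and cap the row count; this assumes the standard Hive JDBC driver is on the classpath and uses hypothetical host, database, table, and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerClient {

    public static void main(String[] args) throws Exception {
        // The Spark Thrift Server speaks the HiveServer2 protocol,
        // so the plain Hive JDBC driver can talk to it.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://thrift-host:10000/default", "user", ""); // hypothetical endpoint
             Statement stmt = conn.createStatement();
             // Project only the needed columns and cap the row count
             // instead of "select *", so far less data lands on the driver.
             ResultSet rs = stmt.executeQuery(
                 "SELECT col_a, col_b FROM the_table LIMIT 1000")) {
            while (rs.next()) {
                System.out.println(rs.getString("col_a") + "\t" + rs.getString("col_b"));
            }
        }
    }
}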

Why does every DiskStorage flush in EhCache take 4 seconds?

We are using Ehcache 2.6.2. Because we need high survivability we use only the DiskStorage, not the MemoryStorage.
After every data update in the program, we flush the data to disk.
After a while, the cache.data file exceeded 1 GB. When the data file was 250 MB, the flush took 250 ms; at 1 GB it takes 3.5 s.
Our objects are about 20 KB each, so there are a lot of them.
Is there a way to split the data file into a few smaller files and let Ehcache handle it?
We would prefer a solution involving only configuration changes, not code changes, because this is a production environment.
Environment details:
WebSphere 7 with IBM Java 1.6 and Ehcache 2.6.2 on AIX 6.1 64-bit.
In Ehcache 2.6.2 all cache data is always on disk anyway, because the storage model changed, so you could actually gain speed by using memory storage in addition to the disk storage.
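Purely as an illustration of that idea (the same thing is normally expressed in ehcache.xml, which matches the configuration-only preference), a sketch using the Ehcache 2.x programmatic API with a hypothetical cache name and sizes:

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.config.CacheConfiguration;

public class HybridCacheSketch {

    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();

        // Keep a hot set of entries on the heap and persist everything to the disk store.
        CacheConfiguration config = new CacheConfiguration("dataCache", 10000) // up to 10k entries in memory
                .overflowToDisk(true)   // classic 2.x attribute; the disk store holds the full data set
                .diskPersistent(true)   // survive restarts, as the question requires
                .eternal(true);
        Cache cache = new Cache(config);
        manager.addCache(cache);

        cache.put(new Element("key-1", "value-1"));
        cache.flush(); // force the disk store to be written, as described in the question
        manager.shutdown();
    }
}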
What do you mean when you say:
After every data update we have in the program, we flush the data to the disk.
Regarding the performance of the disk store, there is one option that you can try:
<cache diskAccessStripes="4" ...>
...
</cache>
where the diskAccessStripes attribute takes a power-of-two value. Try small values first and see whether you gain anything. The exact effect of this attribute depends on many factors: hardware, operating system, and the usage patterns of your application.

Concurrent calls to a custom plugin are processed one at a time

I developed my own plugin for Neo4j in order to speed up node insertion, mainly because I needed to insert nodes and relationships only if they did not already exist, which can be too slow through the REST API.
If I call my plugin 100 times, inserting roughly 100 nodes and 100 relationships on each call, each call takes approximately 350 ms. Each call inserts different nodes, in order to rule out locking as the cause.
However, if I parallelize my calls (2, 3, 4... at a time), the response time grows with the degree of parallelism: it takes 750 ms to insert my 200 objects when I make 2 calls at a time, 1000 ms when I make 3, and so on.
I call my plugin from a .NET MVC controller, using HttpWebRequest. I set maxConnection to 10000, and I can see all the TCP connections being opened.
I investigated this issue a little, and something seems very wrong. I must have done something wrong, either in my Neo4j configuration or in my plugin code. Using VisualVM I found that the threads launched by Neo4j to handle my calls work sequentially. See the linked picture.
http://i.imgur.com/vPWofTh.png
My configuration:
Windows 8, 2 cores
8 GB of RAM
Neo4j 2.0M03 installed as a service with no configuration tuning
I hope someone will be able to help me. As it stands, I will not be able to use Neo4j in production, where there will be tens of concurrent calls that cannot be handled sequentially.
Neo4j is transactional. Every commit triggers an IO operation on the filesystem, which needs to run in a synchronized block; this explains the picture you attached. It is therefore best practice to run writes single-threaded. Any pre-processing beforehand can of course benefit from parallelization.
In general, for maximum performance go with the stable version (1.9.2 as of today). Early milestone builds are not yet optimized, so you might get a distorted picture.
Another thing to consider is the transaction size used in your plugin. 10k to 50k operations in a single transaction should give you the best results. If your transactions are very small, the transactional overhead is significant; with huge transactions, you need a lot of memory.
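As a rough sketch of that sizing advice, assuming the embedded org.neo4j.graphdb API that server plugins have access to (the list of names and the property name are hypothetical placeholders):

import java.util.List;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class BatchedInserter {

    private static final int OPS_PER_TX = 10000; // within the suggested 10k-50k window

    public void insert(GraphDatabaseService db, List<String> names) {
        int index = 0;
        while (index < names.size()) {
            int end = Math.min(index + OPS_PER_TX, names.size());
            // One transaction per block of ~10k creations instead of one per node,
            // so the commit (and its IO) happens far less often.
            try (Transaction tx = db.beginTx()) {
                for (int i = index; i < end; i++) {
                    Node node = db.createNode();
                    node.setProperty("name", names.get(i));
                }
                tx.success();
            }
            index = end;
        }
    }
}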
Write performance is heavily driven by the performance of the underlying IO subsystem. If possible, use fast SSD drives; even better, stripe them.

Fastest way to store data from sensors in Java

I am currently writing a Java application that receives data from various sensors. How often this happens varies, but I believe my application will receive signals about 100k times per day. I would like to log the data received from a sensor every time the application receives a signal. Because the application does much more than just log sensor data, performance is an issue. I am looking for the best and fastest way to log the data, so I might not use a database, but rather write to a file and keep one file per day.
So what is faster, using a database or logging to files? No doubt there are also a lot of options for which logging software to use. Which is best for my purpose if logging to a file is the best option?
The data stored might be used later for analytical purposes, so please keep this in mind as well.
First of all, I would recommend that you use log4j (or any other logging framework).
You can use a JDBC appender that writes into the db, or any kind of file appender that writes into a file. The point is that your code will be generic enough to be changed later if you like.
In general, files are much faster than db access, but there is room for optimization here.
If performance is critical, you can use batching/asynchronous calls to the logging infrastructure.
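As an illustration of keeping the code generic, a minimal sketch of the logging call behind a facade (here SLF4J, which log4j can sit behind; the logger name and message format are arbitrary choices). Whether the records go to a rolling file, a JDBC appender, or an async appender is then purely a configuration matter:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SensorDataLogger {

    // The destination (rolling file, JDBC, async appender) is chosen in the
    // logging configuration, so this code does not change if the storage does.
    private static final Logger SENSOR_LOG = LoggerFactory.getLogger("sensor.data");

    public void onSignal(String sensorId, double value, long timestampMillis) {
        // Parameterized logging avoids string concatenation when the level is disabled.
        SENSOR_LOG.info("{};{};{}", sensorId, value, timestampMillis);
    }
}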
A free database on a cheap PC should be able to record 10 records per second easily.
A tuned database on a good system, or a logger on a cheap PC, should be able to write 100 records/lines per second easily.
A tuned logger should be able to write 1000 lines per second easily.
A fast binary logger can write 1 million records per second easily (depending on the size of the record).
Your requirement of about 100k signals per day works out to roughly 1.2 records per second, which you should be able to achieve any way you like. I assume you will want to query your data, so you will want it in a database eventually; that is where I would put it.
Ah, the world of embedded systems. I had a similar problem when working with a hovercraft. I solved it with a separate computer (you can do the same with a separate program) on the local area network that would just sit and listen as a server for the logs I sent to it. The file-writer program was written in C++. This solves two of your problems: first, the obvious performance gain while writing the logs; second, the Java program is freed from writing any logs at all (it only acts as a proxy) and can concentrate on its performance-critical tasks. Using a DB for this would be overkill, unless you use something like SQLite.
Good luck!
