I have a requirement where one Reader of a Step retrieves values using SQL statements, and the next reader makes the same request.
I do not want to make another request if the data has already been fetched by the first reader; instead I want to pass that collection (possibly a HashMap) to the next step.
For this I have gone through the following link on SO:
How can we share data between the different steps of a Job in Spring Batch?
In many of the comments it is mentioned that 'data must be short'.
It is also mentioned in one response that these contexts are good for sharing strings or simple values, but not for sharing collections or huge amounts of data.
By passing that HashMap, I assume only a reference to it will be passed.
It would be good to know the possible consequences of this beforehand, and whether there is a better alternative approach.
Passing data between steps is indeed done via the execution context. However, you should be careful about the size of the data you put in the execution context, as it is persisted between steps.
I do not want to make another request if the data has already been fetched by the first reader; instead I want to pass that collection (possibly a HashMap) to the next step
You can read the data from the database only once and put it in a cache. The second reader can then get the data from the cache. This would be faster than reading the data from the database a second time.
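A minimal sketch of that idea, assuming a plain Spring singleton acting as the cache and a hypothetical Record type; this is not a Spring Batch feature, just a bean that the first reader populates and the second reader queries:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

import org.springframework.stereotype.Component;

@Component
public class RecordCache {

    private final ConcurrentMap<String, List<Record>> cache = new ConcurrentHashMap<>();

    // The first reader calls this with the real JDBC query as the loader; the second
    // reader calls it with the same key and gets the cached result without a second query.
    public List<Record> getOrLoad(String key, Supplier<List<Record>> loader) {
        return cache.computeIfAbsent(key, k -> loader.get());
    }

    // Evict after the job finishes so the data does not outlive the run.
    public void evict(String key) {
        cache.remove(key);
    }
}

Keep in mind that this keeps the whole collection on the heap for the duration of the job, which is exactly the trade-off the execution context warning is about.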
Hope this helps.
Related
I am using javax.cache (JCache) together with a database. I use the cache's API to get/put/delete entities, and the database sits behind the cache; for this, I am using a CacheLoader and a CacheWriter.
So the SQL constructs map to the cache API as follows:
SELECT -> get
INSERT -> put
DELETE -> delete
If an entry is already present in the cache and I update it, I will only receive the new value in the 'write' method. But since the value is already present in the database, I need to use an UPDATE query.
How do I identify which database operation to perform in the cache's 'put' operation?
Note: UPSERT is not a good option from a performance point of view.
If you put the value in the cache, you can first check whether the key is already there; in that case you need an UPDATE. If the key was not present, you need an INSERT. It sounds like you could benefit from an ORM with an L2 cache, such as Hibernate, which handles all these scenarios (and many more) for you.
There are several ways I can think of. Basically these are variations of:
Metadata in the database
Within an entity I typically have additional fields (timestamps for insert and update, and a modification counter) which are maintained by the object-relational mapper (ORM). That is very useful for debugging. The CacheWriter can check whether the insert timestamp is set: if yes, it is an update; if no, it is an insert.
It does not matter whether the value gets evicted in the meantime, as long as your application reads the latest contents through the cache and writes a modified version of it.
If your application does not read the data before modifying it, or this happens very often, I suggest caching a flag like insertedAlready. That leads to three-way logic: inserted, not inserted, not in the cache = don't know yet. In the latter case you need to do a read before the update or insert in the CacheWriter.
Metadata in the cache only
The cached object stores additional information about whether the object was read from the database before, like:
class CachedDbValue<V> {
    boolean insertedAlready;
    V databaseContent;
}
The code facing your application needs to wrap the database data into the cached value.
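A minimal sketch of how the CacheWriter could use that flag, assuming hypothetical MyEntity and MyEntityDao types; one possible convention is for the CacheLoader to set insertedAlready to true whenever the value comes from the database:

import java.util.Collection;

import javax.cache.Cache;
import javax.cache.integration.CacheWriter;
import javax.cache.integration.CacheWriterException;

public class DbValueCacheWriter implements CacheWriter<Long, CachedDbValue<MyEntity>> {

    private final MyEntityDao dao;

    public DbValueCacheWriter(MyEntityDao dao) {
        this.dao = dao;
    }

    @Override
    public void write(Cache.Entry<? extends Long, ? extends CachedDbValue<MyEntity>> entry)
            throws CacheWriterException {
        CachedDbValue<MyEntity> value = entry.getValue();
        if (value.insertedAlready) {
            dao.update(entry.getKey(), value.databaseContent);   // row known to exist -> UPDATE
        } else {
            dao.insert(entry.getKey(), value.databaseContent);   // first write -> INSERT
        }
    }

    @Override
    public void writeAll(Collection<Cache.Entry<? extends Long, ? extends CachedDbValue<MyEntity>>> entries)
            throws CacheWriterException {
        // simplified; a real implementation should remove successfully written entries on failure
        entries.forEach(this::write);
    }

    @Override
    public void delete(Object key) throws CacheWriterException {
        dao.delete((Long) key);
    }

    @Override
    public void deleteAll(Collection<?> keys) throws CacheWriterException {
        keys.forEach(this::delete);
    }
}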
Side note 1: Don't read the object from the cache and modify the instance directly, always make a copy. Modifying the object directly may have different unwanted effects with different JCache implementations. Also check my explanation here: javax.cache store by reference vs. store by value
Side note 2: You are building a caching ORM layer by yourself. Maybe use an existing one.
Suppose we have a MongoDB collection called Threads, where each document has a typed collection field holding the replies to the original post.
When the user hits reply, we want to create a new post instance and append it to the replies field. This can easily be done as follows:
var thread = threadRepository.findById(threadId);
thread.getReplies().add(post);
threadRepository.save(thread);
But a question arises: is this solution scalable? What if there are 1 million replies to that thread?
My main question is:
Will they all be loaded in memory?
If yes, wouldn't it be a waste if all we wanted to do is create a new reply? What is the recommended solution?
If all the replies are nested within the thread document, then yes, they will be loaded into memory unless you explicitly specify which fields to load via @Query(fields=...).
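For completeness, a minimal sketch of such a projection, assuming a Spring Data MongoDB repository for the Thread document; the fields clause excludes the potentially huge replies array when loading:

import java.util.Optional;

import org.springframework.data.mongodb.repository.MongoRepository;
import org.springframework.data.mongodb.repository.Query;

public interface ThreadRepository extends MongoRepository<Thread, String> {

    // Loads the thread document without its replies array.
    @Query(value = "{ '_id' : ?0 }", fields = "{ 'replies' : 0 }")
    Optional<Thread> findByIdWithoutReplies(String id);
}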
In order to modify the document without having to load it into memory, consider an update instead of a replace operation.
Update update = new Update().push("replies", post);
template.updateFirst(query(where("id").is(thread.id)), update, Thread.class);
With $push it's possible to append an item to an array. Please see the MongoDB reference documentation for more details.
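A self-contained version of the same update, assuming a MongoTemplate is available and that Thread and Post are the question's domain classes with a String id:

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

public class ThreadReplyService {

    private final MongoTemplate template;

    public ThreadReplyService(MongoTemplate template) {
        this.template = template;
    }

    // Appends a single reply without loading the existing replies array into memory.
    public void addReply(String threadId, Post post) {
        Query query = new Query(Criteria.where("id").is(threadId));
        Update update = new Update().push("replies", post);
        // Thread.class refers to the mapped Thread document (import it), not java.lang.Thread
        template.updateFirst(query, update, Thread.class);
    }
}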
In case the intention is to potentially store millions of replies that way, then please keep at least the 16 MB document size limit in mind and consider a feasible loading strategy.
I'm using Spring Batch in chunk mode for processing items.
I read them in bulk (6000 items per chunk), process them one by one, and write them all at once. I read them via a JdbcCursorItemReader, which is very convenient for bulk reading and processing.
The problem is that once they are read, I need to retrieve additional data from another source. The simplest way is to do it in the processor, calling a custom method like getAdditionalDataById(String id).
The drawback is that this consumes a lot of time. So I would like to retrieve that additional data in bulk too: just after reading 6000 items, get their ids and call something like
getAllAdditionalDataByIdIn(List<String> ids).
But I don't know where I can insert my piece of code, as the @AfterRead annotation fires after each item and not after the bulk read. The same goes for @BeforeProcess.
The only solution I have found so far is to do nothing in the processor, and instead fetch the additional information in the writer, process the items in the writer, and write them in the writer (it's a custom writer).
Any help will be appreciated.
I'm using Spring Batch 4.0.1, reading from a SQL Server database, and writing to Elasticsearch. The additional data is stored in Elasticsearch too.
I've searched a bit in the code and a lot in the documentation, but can't find any annotation, or anything else, that could help me.
The problem is that once they are read, I need to retrieve additional data from another source. The simplest way is to do it in the processor, calling a custom method like getAdditionalDataById(String id). The drawback is that this consumes a lot of time.
This is known as the driving query pattern where an item processor is used to enrich items with additional data (from another datasource for instance). This pattern can indeed introduce some performance issues as it requires an additional query for each item.
So I would like to retrieve that additional data in bulk too: just after reading 6000 items, get their ids and call something like getAllAdditionalDataByIdIn(List<String> ids).
The closest you can get is ItemWriteListener#beforeWrite, where you get access to the list of items before they are written. With the list of items in scope, you can collect their IDs and call your getAllAdditionalDataByIdIn(List<String> ids) method.
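A minimal sketch of that approach with Spring Batch 4, assuming hypothetical MyItem, AdditionalData and AdditionalDataRepository types; the listener is registered on the step via .listener(...):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.batch.core.ItemWriteListener;

public class EnrichmentListener implements ItemWriteListener<MyItem> {

    private final AdditionalDataRepository repository;

    public EnrichmentListener(AdditionalDataRepository repository) {
        this.repository = repository;
    }

    @Override
    public void beforeWrite(List<? extends MyItem> items) {
        // One bulk lookup for the whole chunk instead of one query per item.
        List<String> ids = items.stream().map(MyItem::getId).collect(Collectors.toList());
        Map<String, AdditionalData> byId = repository.getAllAdditionalDataByIdIn(ids);
        items.forEach(item -> item.setAdditionalData(byId.get(item.getId())));
    }

    @Override
    public void afterWrite(List<? extends MyItem> items) {
        // nothing to do
    }

    @Override
    public void onWriteError(Exception exception, List<? extends MyItem> items) {
        // nothing to do
    }
}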
Hope this helps.
Let's assume there is a form on the frontend with several dropdowns whose data (objects, not just strings) is unlikely to change in the future but is of a reasonable size, so it looks a little bit weird to put it into the frontend.
Do you create tables for this data in the backend and fetch it from there, even though the backend will likely never use or change it?
Could you point me to some resources where I can read about these conventions?
If you are the owner of this data, it is more efficient to keep it on the frontend in some constants file; it makes no difference whether the entries are objects or strings. For example, create a class DropdownOption and store an array of these objects.
If you decide to keep it in the database and provide the data via a REST API, consider the performance cost: every request has to reach your endpoint, open a transaction, fetch the data from the database, close the transaction, and map the entities to DTOs before anything is returned to the frontend. More data means more time.
Further to Ilia Ilin's answer, an additional thing to consider is how you'd like the data to behave once a value is updated or removed, if this data set is referenced anywhere.
If you load the data on the frontend, then any modification will not apply to previously stored data.
If you store the data in a relational DB and fetch it on the frontend, any modification will cascade to all previous data references.
I'm getting a large amount of data from a database query and making objects out of it. I end up with a list of about 1M of these objects, and I want to serialize that to disk for later use. The problem is that it barely fits in memory and won't fit in the future, so I need some scheme to serialize, say, the first 100k, then the next 100k, and so on, and also to read the data back in 100k increments.
I could write some obvious code that checks whether the list gets too big and then writes it to file 'list1', then 'list2', and so on, but maybe there's a better way to handle this?
You could go through the data, create each object, and feed it immediately to an ObjectOutputStream, which writes it to the file (see the sketch after this list).
Read the objects one by one from the DB
Don't put them into a list but write them into the file as you get them from the DB
Never keep more than a single object in RAM. When you read the data back, terminate the reading loop when readObject() throws an EOFException (end of file), or write a null marker as the last object and stop when readObject() returns null.
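A minimal sketch of that streaming approach, assuming a Serializable MyObject type and leaving the row mapping and consumption as placeholders:

import java.io.*;
import java.sql.*;

public abstract class StreamingSerializer {

    // Write each row straight to the stream; the full list never exists in memory.
    public void export(Connection connection, File file) throws SQLException, IOException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT * FROM my_table");
             ResultSet rs = ps.executeQuery();
             ObjectOutputStream out = new ObjectOutputStream(
                     new BufferedOutputStream(new FileOutputStream(file)))) {
            while (rs.next()) {
                out.writeObject(mapRow(rs));
            }
        }
    }

    // Read the objects back one by one; EOFException marks the end of the stream.
    public void importBack(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                try {
                    process((MyObject) in.readObject());
                } catch (EOFException endOfFile) {
                    break;
                }
            }
        }
    }

    protected abstract MyObject mapRow(ResultSet rs) throws SQLException;  // map columns to a MyObject

    protected abstract void process(MyObject value);                       // consume one object at a time
}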
I assume you have checked that it's really necessary to save the data to disk; it couldn't stay in the database, could it?
To handle data that is too big, you need to make it smaller :-)
One idea is to get the data by chunks:
start with the request, so you don't build this huge list (because that will become a point of failure sooner or later)
serialize your smaller list of objects
then loop
Also think about setting the fetch size for the JDBC driver; for example, the JDBC driver for MySQL defaults to fetching the whole result set into memory.
read here for more information: fetch size
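A minimal sketch of enabling row-by-row streaming with the MySQL Connector/J driver so the whole result set is not buffered in memory; the table name and row handling are placeholders:

import java.sql.*;

public final class StreamingQuery {

    public static void stream(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // MySQL's driver only streams row by row when the fetch size is Integer.MIN_VALUE;
            // most other drivers take a positive hint such as 1000.
            statement.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = statement.executeQuery("SELECT * FROM my_table")) {
                while (rs.next()) {
                    // map and serialize the current row here, one at a time
                }
            }
        }
    }
}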
It seems that you are retrieving a large dataset from the database, converting it into a list of objects, and serializing it in a single shot.
Don't do that; it may eventually lead to an application crash.
Instead you have to:
minimize the amount of data retrieved from the database in one go (say 1000 records instead of 1M),
convert them into business objects,
and serialize them.
Then repeat the same procedure until the last record.
This way you can avoid the performance problem.
ObjectOutputStream will work, but it has more overhead. I think DataOutputStream/DataInputStream is a better choice.
Just read/write the values one by one and let the stream worry about buffering. For example, you can do something like this:
DataOutputStream os = new DataOutputStream(new FileOutputStream("myfile"));
for (...)
os.writeInt(num);
One gotcha with both object and data streams is that write(int) only writes one byte. Please use writeInt(int).
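To make the write/read pair concrete, here is a minimal sketch assuming int values and the "myfile" path from the snippet above; a length prefix tells the reader when to stop:

import java.io.*;

public class DataStreamExample {

    public static void write(int[] values) throws IOException {
        try (DataOutputStream os = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("myfile")))) {
            os.writeInt(values.length);   // store the count so the reader knows when to stop
            for (int num : values) {
                os.writeInt(num);         // writeInt(int), not write(int), which writes a single byte
            }
        }
    }

    public static int[] read() throws IOException {
        try (DataInputStream is = new DataInputStream(
                new BufferedInputStream(new FileInputStream("myfile")))) {
            int count = is.readInt();
            int[] values = new int[count];
            for (int i = 0; i < count; i++) {
                values[i] = is.readInt();
            }
            return values;
        }
    }
}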