I have a requirement to read a large data set from a Postgres database and make it accessible via a REST API endpoint. The client consuming the data will then need to transform the data into CSV format (we might need to support JSON and XML later on).
On the server side we are using Spring Boot v2.1.6.RELEASE and spring-jdbc v5.1.8.RELEASE.
I tried using paging, looping through all the pages, storing the results in a list and returning that list, but this resulted in an OutOfMemoryError because the data set does not fit into memory.
Streaming the large data set looks like a good way to handle memory limits.
Is there any way I can return a Stream of all the database entities and have the REST API return the same to the client? How would the client deserialize this stream?
Are there any alternatives to this?
If your data is so huge that it doesn't fit into memory - I'm thinking gigabytes or more - then it's probably too big to reasonably provide as a single HTTP response. You would hold the connection open for a very long time, and if anything goes wrong mid-way through, the client has to start all over from the beginning, potentially throwing away minutes of transfer.
A more user-friendly API would introduce pagination. Your caller could specify a page size and the index of the page to fetch as part of their request.
For example:
/my-api/some-collection?size=100&page=50
This would represent fetching 100 items, starting from the 5000th (items 5000 - 5099, assuming zero-based page numbering).
Perhaps you could place some reasonable constraints on the page size based on what you are able to load into memory at one time.
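For instance, a rough sketch of such an endpoint with Spring MVC and JdbcTemplate - here the Item type, the items table and the 1000-row cap are placeholders, not a definitive implementation:

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ItemController {

    private final JdbcTemplate jdbcTemplate;

    public ItemController(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // GET /my-api/some-collection?size=100&page=50
    @GetMapping("/my-api/some-collection")
    public List<Item> page(@RequestParam(defaultValue = "100") int size,
                           @RequestParam(defaultValue = "0") int page) {
        // Cap the page size so a single request can never exhaust the heap.
        int cappedSize = Math.min(size, 1000);
        long offset = (long) page * cappedSize;
        // Postgres LIMIT/OFFSET paging; a stable ORDER BY is needed for consistent pages.
        return jdbcTemplate.query(
                "SELECT id, payload FROM items ORDER BY id LIMIT ? OFFSET ?",
                (rs, rowNum) -> new Item(rs.getLong("id"), rs.getString("payload")),
                cappedSize, offset);
    }

    // Hypothetical row type used only for this sketch.
    public static class Item {
        public final long id;
        public final String payload;
        Item(long id, String payload) { this.id = id; this.payload = payload; }
    }
}

The client can then pull page after page and convert each one to CSV as it goes, so neither side ever holds the whole data set in memory.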
Related
How can I send a large volume of JSON data to a Spring controller? Say I have large JSON data of about 100k or 1000k records which I need to send to my REST controller in Spring or Spring Boot; what is the best/most efficient approach to solve the problem?
I know that the data can be sent using the request body, but I think sending such a large volume of data in the request body of a REST API is not efficient. I may be wrong here; please correct me if I am.
And the data needs to be stored in the database as quickly as possible, so I need a quick and reliable approach to the problem.
There are two parts to your problem.
1. How to receive such a huge volume: If there is a huge volume of data being received, it's generally a good idea to save it locally as a file (straight from the incoming stream) and process that data asynchronously. (Make sure you set an appropriately high read timeout, or the data stream might be interrupted.)
2. How do you process such a huge file: With big files, the memory footprint needs to be minimal. For XML, SAX parsers are the gold standard. I found this library, which is very similar to SAX parsing, but for JSON:
http://rapidjson.org/md_doc_sax.html
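Note that rapidjson is a C++ library; on the Java side, Jackson's streaming JsonParser gives a comparable SAX-style, token-by-token model. A rough sketch of reading a previously saved file of records without loading it all into memory - the array layout and the id/name fields are assumptions about your payload:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;
import java.io.IOException;

public class StreamingJsonReader {

    // Assumes the saved file looks like: [ {"id": 1, "name": "..."}, ... ]
    public static void process(File savedRequestBody) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(savedRequestBody)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("Expected a JSON array of records");
            }
            // Advance token by token; only one record's fields are in memory at a time.
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                long id = 0;
                String name = null;
                while (parser.nextToken() != JsonToken.END_OBJECT) {
                    String field = parser.getCurrentName();
                    parser.nextToken(); // move to the field value
                    if ("id".equals(field)) {
                        id = parser.getLongValue();
                    } else if ("name".equals(field)) {
                        name = parser.getText();
                    } else {
                        parser.skipChildren(); // ignore unknown or nested fields
                    }
                }
                saveRecord(id, name);
            }
        }
    }

    private static void saveRecord(long id, String name) {
        // Placeholder: collect records into a JDBC batch insert here.
    }
}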
You can use a reactive approach and stream the data.
With Spring, use MediaType.APPLICATION_STREAM_JSON_VALUE as the produced media type and Flux as the return type.
On the client side, subscribe to the stream and process the data as it arrives, or use Spring Batch to save the data to the database.
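Roughly, something like the following - a sketch only, where Foo and FooRepository stand in for your own entity and reactive repository:

import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

// Hypothetical entity and reactive repository used only for this sketch.
class Foo { public long id; public String value; }
interface FooRepository { Flux<Foo> findAll(); }

@RestController
class FooStreamController {

    private final FooRepository repository;

    FooStreamController(FooRepository repository) {
        this.repository = repository;
    }

    // Server side: emit entities as they are read, never materializing the full list.
    @GetMapping(value = "/foos", produces = MediaType.APPLICATION_STREAM_JSON_VALUE)
    Flux<Foo> streamFoos() {
        return repository.findAll();
    }
}

class FooStreamClient {

    // Client side: subscribe to the stream and handle each Foo as it arrives.
    void consume() {
        WebClient.create("http://localhost:8080")
                .get()
                .uri("/foos")
                .accept(MediaType.APPLICATION_STREAM_JSON_VALUE)
                .retrieve()
                .bodyToFlux(Foo.class)
                .subscribe(foo -> {
                    // e.g. append one CSV line per Foo, or hand the item to Spring Batch
                });
    }
}

APPLICATION_STREAM_JSON_VALUE serializes the Flux as newline-delimited JSON, so the client can deserialize and process one item at a time instead of buffering the whole response.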
I have a restriction on the memory that my application uses.
I'm generating files consisting of data returned from Elasticsearch.
As I need to get all the data stored in ES, I'm using the Scroll API to fetch it and put it into some Collection<Foo>. For now, ES contains several thousand Foo records that get returned.
So it is more of an imperative approach, because I'm loading all the data like:
List<Foo> allFoos = FooSource.loadAllFoo();
And then doing some processing, after which I'm saving the results to different files.
Then these files are accessible from a REST endpoint.
So what I'm looking for is some advice on how to limit the memory usage that this approach causes.
I was thinking that instead of putting all the data in memory and then operating on it, I could do some sequential processing as the data becomes available.
Do I need RxJava or Spring Batch in this case? It would help if you could provide some code samples; even pseudocode would be fine.
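The kind of sequential processing I have in mind would look roughly like this - an untested sketch using the Elasticsearch high-level REST client's scroll API, writing each batch straight to the output file instead of collecting everything into a List<Foo>; the client setup, index name and output format are placeholders:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class FooExporter {

    private final RestHighLevelClient client; // assumed to be configured elsewhere

    public FooExporter(RestHighLevelClient client) {
        this.client = client;
    }

    public void exportToFile(Path target) throws IOException {
        SearchRequest search = new SearchRequest("foo"); // hypothetical index name
        search.source(new SearchSourceBuilder().size(1000)); // 1000 hits per scroll batch
        search.scroll(TimeValue.timeValueMinutes(1L));

        try (BufferedWriter writer = Files.newBufferedWriter(target)) {
            SearchResponse response = client.search(search, RequestOptions.DEFAULT);
            String scrollId = response.getScrollId();
            SearchHit[] hits = response.getHits().getHits();

            while (hits != null && hits.length > 0) {
                // Write this batch immediately; only one batch is ever held in memory.
                for (SearchHit hit : hits) {
                    writer.write(hit.getSourceAsString());
                    writer.newLine();
                }
                SearchScrollRequest scroll = new SearchScrollRequest(scrollId);
                scroll.scroll(TimeValue.timeValueMinutes(1L));
                response = client.scroll(scroll, RequestOptions.DEFAULT);
                scrollId = response.getScrollId();
                hits = response.getHits().getHits();
            }

            ClearScrollRequest clear = new ClearScrollRequest();
            clear.addScrollId(scrollId);
            client.clearScroll(clear, RequestOptions.DEFAULT);
        }
    }
}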
I have a MongoDB database that contains a Poll collection.
The Poll collection has a number of Poll documents. This could be a large number of documents.
I am using Java Servlet for serving HTTP requests.
How can I implement a feed kind of retrieval mechanism at the server side?
For example, in the first request I want to retrieve documents 1 to 10, then 11 to 20, and so on...
As the view scrolls, I want to fetch the data from the server and send it to the client.
Does MongoDB provide a way to do this?
I think what you are looking for is pagination. You could use the limit and skip methods with your find query.
First request
db.Poll.find().skip(0).limit(10)
Second request
db.Poll.find().skip(10).limit(10)
...
...
Note: You should also sort your find on some field.
db.Poll.find().skip(10).limit(10).sort({_id:-1})
For more info on the cursor methods you could look here: http://docs.mongodb.org/manual/reference/method/js-cursor/
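If you are issuing the query from your servlet with the MongoDB Java driver, the equivalent looks roughly like this (the database name and page size are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class PollFeed {

    private static final int PAGE_SIZE = 10;

    private final MongoCollection<Document> polls;

    public PollFeed() {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        // "polldb" is a hypothetical database name.
        this.polls = client.getDatabase("polldb").getCollection("Poll");
    }

    // page 0 returns documents 1-10, page 1 returns 11-20, and so on.
    public Iterable<Document> page(int page) {
        return polls.find()
                .sort(Sorts.descending("_id")) // stable order, as noted above
                .skip(page * PAGE_SIZE)
                .limit(PAGE_SIZE);
    }
}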
I want to build a reports-based application that retrieves a very large amount of data from an Oracle DB and displays it to the user, so my solution was to put up a Java-based web service that returns a large amount of data. Is there a standard way to stream a response rather than trying to return a huge chunk of data at once?
You can consider a PAGING mechanism. Just display the required set of rows at once, then on request move to the next set of rows.
From the database end, you can limit the results and fetch a certain number of rows at a time.
If you are on Oracle 12c, the Top-N row limiting functionality (OFFSET ... FETCH) is readily available.
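For example, a rough sketch of driving the 12c row-limiting clause from plain JDBC; the table and column names are placeholders:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ReportPageDao {

    // Oracle 12c OFFSET ... FETCH row-limiting clause; "report_rows" is a placeholder table.
    private static final String PAGE_SQL =
            "SELECT id, payload FROM report_rows ORDER BY id "
          + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";

    public void printPage(Connection connection, int page, int pageSize) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(PAGE_SQL)) {
            statement.setInt(1, page * pageSize);
            statement.setInt(2, pageSize);
            try (ResultSet rs = statement.executeQuery()) {
                while (rs.next()) {
                    // Stream each row to the response instead of buffering the full report.
                    System.out.println(rs.getLong("id") + "," + rs.getString("payload"));
                }
            }
        }
    }
}

Each request then touches only one page of the report instead of pulling the entire result set across the wire.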
I'm new to open source stacks and have been playing with Hibernate/JPA/JDBC and memcache. I have a large data set per JDBC query, and I'll possibly have a number of these large data sets that I eventually bind to a chart.
However, I'm very focused on performance and want to avoid hitting the database on every page load just to display the chart on my web page.
Are there some examples of how (memcache, Redis, local or distributed) and where to cache this data (JSON or raw result data) so it can be loaded from memory? I also need to figure out how to refresh the cache, unless it uses a time-based eviction algorithm (e.g. a 30 min expiry, so new data is fetched from the database query instead of the cache, or perhaps an automated feed of data into the cache every x hrs/min/etc.).
Thanks!
This is a typical problem and the solution is not straightforward. There are many factors which determine your design. Here is what we did some time ago.
Since our queries to extract the data were a bit complex (they took around a minute to execute) and the data set was large, we populated memcached from a batch job which pulled data from the database every hour and pushed it to memcached. By keeping the cache expiry larger than the batch interval, we made sure that there would always be data in the cache.
There was another use case for dynamic caching: on receiving a request for data, we first checked memcached and, if the data was not found, queried the database, fetched the data, pushed it to memcached and returned the results. But I would advise this approach only when your database queries are simple and fast enough not to degrade the overall response time.
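A rough sketch of that check-cache-then-database pattern with the spymemcached client; the key naming, TTL and loader are assumptions rather than what we actually wired up:

import java.util.List;
import java.util.function.Supplier;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class ChartDataCache {

    private static final int TTL_SECONDS = 30 * 60; // 30-minute expiry, as in the question

    private final MemcachedClient memcached;

    public ChartDataCache() throws java.io.IOException {
        this.memcached = new MemcachedClient(AddrUtil.getAddresses("localhost:11211"));
    }

    // Read-through: serve from cache when present, otherwise load, cache and return.
    // Cached values must be Serializable (e.g. an ArrayList).
    @SuppressWarnings("unchecked")
    public List<String> chartData(String cacheKey, Supplier<List<String>> databaseLoader) {
        Object cached = memcached.get(cacheKey);
        if (cached != null) {
            return (List<String>) cached;
        }
        List<String> fresh = databaseLoader.get(); // the (possibly slow) JDBC query
        memcached.set(cacheKey, TTL_SECONDS, fresh);
        return fresh;
    }
}

The hourly batch described above is the same idea driven from a scheduled job: it simply calls set(...) with an expiry longer than the refresh interval.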
You can also use Hibernate's second-level cache. Whether this feature can be used efficiently depends on your database schema, queries, etc.
Hibernate has built-in support for 2nd level caching. Take a look at EhCache for example.
Also see: http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/performance.html#performance-cache
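For illustration only, marking an entity as cacheable in the second-level cache usually looks something like this; the exact region-factory/EhCache configuration depends on the Hibernate version in use:

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Requires second-level caching to be switched on, e.g.
//   hibernate.cache.use_second_level_cache=true
// plus an EhCache region factory configured for your Hibernate version.
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class ChartPoint {

    @Id
    private Long id;

    private double value;

    // getters and setters omitted for brevity
}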