Streaming a large number of small objects with Java - java

A client and a server application need to be implemented in Java. The scenario requires reading a large number of small objects from a database on the server side and sending them to the client.
This is not about transferring large files; rather, it requires streaming a large number of small objects to the client.
The number of objects to be sent from server to client in a single request could be one or one million (let's assume the number of clients is limited for the sake of discussion; ignore throttling).
The total size of the objects will in most cases be too big to hold in memory, so a way is needed to defer the read-and-send operation on the server side until the client requests the next object.
Based on my previous experience, the WCF framework in .NET supports the scenario above with:
- a transferMode of StreamedResponse
- the ability to return an IEnumerable of objects
- deferred serialization with the help of yield return
Is there a Java framework that can stream objects as they are requested while keeping the connection to the client open?
NOTE: This may sound like a very general question, but I am hoping the specific details above will lead to a clear answer that benefits me and possibly others.

A standard approach is to use a form of pagination and fetch the results in chunks that can be held temporarily in memory. How to do that depends on the specifics, but a basic JDBC approach would be to first execute a statement to find out the number of records and then fetch them in chunks. For example, Oracle has a ROWNUM pseudo-column that you can use to manage the ranges of records to return; other databases have similar options.
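A minimal JDBC sketch of that idea, assuming an Oracle database; the ITEMS table, its ID and PAYLOAD columns, and the chunk size are all invented for illustration. This variant skips the initial COUNT and just reads until a chunk comes back empty:

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fetch rows in fixed-size chunks using Oracle's
// classic nested-query ROWNUM pagination pattern.
public final class ChunkedReader {
    private static final int CHUNK_SIZE = 1000; // tunable assumption

    static List<byte[]> readChunk(Connection conn, int chunkIndex) throws SQLException {
        String sql =
            "SELECT payload FROM (" +
            "  SELECT t.payload, ROWNUM rn FROM (" +
            "    SELECT payload FROM items ORDER BY id" +
            "  ) t WHERE ROWNUM <= ?" +
            ") WHERE rn > ?";
        int lo = chunkIndex * CHUNK_SIZE;
        int hi = lo + CHUNK_SIZE;
        List<byte[]> chunk = new ArrayList<>(CHUNK_SIZE);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, hi);
            ps.setInt(2, lo);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    chunk.add(rs.getBytes(1));
                }
            }
        }
        return chunk; // an empty list signals "no more data"
    }
}
```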

You could use ObjectOutputStream / ObjectInputStream to do this.
The key to making this work is to periodically call reset() on the output stream. If you don't, the sending and receiving ends will build up a massive map containing references to every object sent or received over the stream.
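A minimal sketch of a sending loop with periodic reset(); the interval of 1000 is an arbitrary assumption to tune:

```java
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.Iterator;

// Stream objects one at a time, resetting the handle table periodically so
// neither side accumulates references to everything sent so far.
public final class ObjectStreamer {
    private static final int RESET_INTERVAL = 1000;

    static void streamAll(Iterator<?> objects, OutputStream rawOut) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(rawOut)) {
            int count = 0;
            while (objects.hasNext()) {
                out.writeObject(objects.next());
                if (++count % RESET_INTERVAL == 0) {
                    out.reset();  // drop back-references on both ends
                    out.flush();
                }
            }
        }
    }
}
```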
However, there may be issues with keeping a single request / response (or database cursor) open for a long time. And resuming a stream that failed could be problematic. So your solution should probably combine the above with some kind of pagination.
The other thing to note is that a scalable solution needs to keep network latency from becoming the bottleneck. It may be worth implementing a receiver thread that eagerly pulls objects off the stream and buffers them in a (bounded) queue.
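A sketch of such a receiver, assuming the ObjectInputStream approach above; the queue capacity and the poison-pill sentinel are illustrative choices:

```java
import java.io.EOFException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One thread pulls objects off the socket as fast as the network allows and
// parks them in a bounded queue; the processing thread drains the queue.
public final class EagerReceiver {
    private static final Object POISON = new Object(); // end-of-stream marker
    private final BlockingQueue<Object> buffer = new ArrayBlockingQueue<>(5000);

    void start(InputStream rawIn) {
        Thread t = new Thread(() -> {
            try (ObjectInputStream in = new ObjectInputStream(rawIn)) {
                while (true) {
                    buffer.put(in.readObject());
                }
            } catch (EOFException eof) {
                // normal termination
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                try { buffer.put(POISON); } catch (InterruptedException ignored) { }
            }
        });
        t.setDaemon(true);
        t.start();
    }

    /** Returns the next object, or null once the stream is exhausted. */
    Object take() throws InterruptedException {
        Object o = buffer.take();
        return o == POISON ? null : o;
    }
}
```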

Related

Some questions regarding architecture/design for this use case

My application needs to work as middleware: it receives orders (in the form of XML) from various
customers, each order containing a supplier ID. Once it gets an order, it needs to send an order request
to the appropriate supplier, also as XML. I am of two minds about three aspects of this. Here they are:
Questions:
What I am planning at a high level is that as soon as a request comes in, I put it on a JMS queue. (Now I am not sure
whether I should create a queue for each supplier or whether one queue is sufficient. I think one queue will be sufficient,
as maintaining a large number of queues would be overhead. The advantage of maintaining a separate queue per supplier is that messages can be processed faster, since there would be a separate consumer on each queue.)
Before putting the object on the queue
I need to do some business validations. Also, the structure of the input XML I am receiving and the output XML I need to send to the supplier are different. For this I am planning to convert the input XML to a Java object and put that on the queue,
so that validation can be done with ease on the consumer side. Another thought is to not convert the XML into a Java object at all: just get the element
values via an XPath/XStream API, validate them, and put the XML string on the queue as-is. Then, on the consumer side, convert the XML to a Java object and from there to the different XML format. Which is the better way of doing this?
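For what it's worth, the second option might look roughly like this; the element path (/order/supplierId) is invented, since the actual XML structure isn't given:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hedged sketch of "validate via XPath, keep the raw XML string on the queue".
public final class XpathValidator {
    static boolean isValid(String orderXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(orderXml)));
        XPath xp = XPathFactory.newInstance().newXPath();
        String supplierId = xp.evaluate("/order/supplierId", doc);
        // Minimal business check; real validation would be richer.
        return supplierId != null && !supplierId.isEmpty();
    }
}
```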
Now, my requirement is that the consumer on the queue processes the messages every 5 hours and sends the XML requests
to the suppliers. I am planning to use the Quartz scheduler here, which will pick up the messages one by one and send each to the corresponding
supplier based on the supplier ID. Here is my question: if my job picks up the messages one by one and sends them to the suppliers,
it will be too slow. I am planning to handle this by having the Quartz job create a thread pool of, say, ten threads
that concurrently process the messages from the queue. (So there will be multiple consumers on the queue; I think that's not valid for a queue. Do I need a topic here instead of a queue?) Is the second approach better, or is there something better still?
I am expecting a load of 50k requests per hour, which means around 15 requests per second.
Your basic requirements are:
Get orders from customers as XML (you have not said how you are receiving them).
Do basic business validation.
Send the orders to the suppliers.
And you are expecting 50k requests per hour (you haven't provided an approximate order size).
Assuming an average order size of 10 KB, around 500 MB would be needed just to hold an hour's worth of orders in the queue (irrespective of the number of queues). I am not sure which environment you are running.
For Point #1
I would choose a single queue instead of multiple queues.
- Choose an appropriate persistent store.
I am assuming you would be using a distributed queue, so that it can easily scale when adding clusters.
For Point #2
I would convert to a POJO (your own format) and perform the business validation on that, so that later, if you want to extend the business validation to a rule engine or add other conversions, it is easy to maintain.
- Basically, accept the input in any form (XML / POJO / JSON ...) and convert it into a middle format (you can write a custom validator / conversion utility on top of the middle format), and keep mappings between the common format and the input and output formats. That way you can write formatters and reuse them, and changing the format for any specific supplier will have no impact later. Try to externalize the format mapping.
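A rough sketch of that middle-format idea; all names here (CanonicalOrder, OrderFormatter) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// One canonical order POJO as the "middle format".
public final class CanonicalOrder {
    public String supplierId;
    public String customerId;
    public List<String> itemIds = new ArrayList<>();
}

// One formatter per external representation. The supplierId -> OrderFormatter
// mapping can live in external configuration, so adding a new supplier format
// does not require touching the pipeline code.
interface OrderFormatter {
    CanonicalOrder parse(String externalXml); // input format -> middle format
    String render(CanonicalOrder order);      // middle format -> supplier format
}
```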
For Point #3
In your case, an order needs to be processed only once, so I would go with a queue, and you can have multiple message listeners. Message listeners receive orders asynchronously, you can have multiple listeners on a queue, and each listener runs on its own thread.
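A minimal JMS sketch of that setup; the queue name "OrderQueue" and the number of listeners are assumptions:

```java
import javax.jms.*;

// Several sessions on one connection, each with its own MessageListener on
// the same queue. JMS delivers each message to exactly one of them, so you
// get concurrent processing without needing a topic.
public final class OrderConsumers {
    static void start(ConnectionFactory factory, int listeners) throws JMSException {
        Connection connection = factory.createConnection();
        for (int i = 0; i < listeners; i++) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("OrderQueue");
            MessageConsumer consumer = session.createConsumer(queue);
            consumer.setMessageListener(message -> {
                // each listener runs on its own delivery thread
                processOrder(message);
            });
        }
        connection.start();
    }

    static void processOrder(Message message) {
        // convert to the canonical form and send to the supplier here
    }
}
```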
Is there a problem with sending the orders as soon as they are received? It would be good for you as well as the suppliers to avoid a heavy load at a particular time.
Since you are the middleware, you should handle data quickly at the point of contact, to keep your hands free for more incoming requests. Therefore you must find a way to classify the incoming data as quickly and with as little memory as possible. Leave the processing of the data to modules more specific to the problem; a receptionist just directs the guests to the right spot.
If you really have to read and understand the received data in your specialized worker later on, use a thread pool. This way you can process the data in parallel without too much worry about running out of memory; just choose your pool size sensibly and use only one pool. You could use a listener pattern to signal new incoming data to the worker multiton. You should avoid JAXB, or better, avoid complete deserialization of the data if possible; it eats up memory like hell.
I would not use JMS, because your "messages" are relevant to only one listener.
If it is possible, send the message on as soon as the worker is done with its work. If not, use a persistent store. That way you can later prove you processed the data, and if something goes wrong or you have to update your software, you do not have to worry about volatile data.

Transferring large arrays from server to client in GWT

I'm attempting to transfer a large two-dimensional array (17955 x 3) from my server to the client using asynchronous RPC calls. This is taking a very long time, which is especially bad because the data is needed to initialize the application. I've read that using a JSON object might be faster, but I'm not sure how to do the conversion in Java, as I'm pretty new to the language and to GWT, and I don't know if the speed difference is significant. I also read somewhere that I can zip the data, but I only read that in a forum and I'm not sure if it's actually possible, as I couldn't find information on it elsewhere. Is there any way to transfer large amounts of data from server to client? Thanks for your time.
Read this article on adding JSON capabilities to GWT. In regard to compression, this article explains gzipping with GWT.
Also, the size of your array is still very large even with the compression you may achieve with gzipping, which will vary depending on how much data is repeated in your array. You may want to consider logically breaking the array up into multiple RPC calls if at all possible.
I would recommend revisiting your design if your application needs such a large amount of data to initialize.
As others have pointed out, you should reconsider your design, because even if you manage to solve the data-transfer speed issue, you will likely find other problems waiting for you:
Processing a large amount of data in the browser can be slow.
A lot of data means a lot of used-up memory.
What you can think about is:
Partitioning the data:
How is your user going to cope with a lot of data? Your user will probably need some kind of user-interface aid to be able to work with such a huge amount of data. If you are going to use paging, tabs, or other means to partition the data for the user's consumption, why not load the data on demand? For example, you can load a single page of records if you are using a paging grid, or a single tab's worth of records if you are going to use tabs. Similarly, if you are going to allow filtering on the records, you can set a default filter after the load to keep the data to a minimum.
Summarizing the data:
You can also summarize the data on the server if you are not going to show each row to the user. For example, you can initially show a summary for each group of records and let the user drill down into a specific group.

C++/Java serialization library suitable for socket stream?

I need to write a server in C++/Obj-C that can receive streamed data from several clients built in Java and C++. The challenge: I need to serialize and deserialize the data structures efficiently. One C++ client will be generating 128x96x2-dimensional float arrays, plus some metadata, about 30 times per second (video features). A Java client will be generating a smaller feature vector -- probably 200 values, 1-10 times per second. I've about resigned myself to rolling my own implementation, but before I do, I'd like to ask for recommendations.
Google Protocol Buffers supports your required languages and streaming of serialized data structures, but I am not sure how you would best handle those large arrays. There is some ongoing work here in this area of protobuf for Java - background here.
With this in mind, you might be able to produce something that works using Java and C++ protobuf, with custom code in C++ to handle the Java array encoding in that branch.
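For the streaming part, here is a sketch of how length-delimited protobuf messages could be read and written on the Java side. FeatureFrame is a hypothetical generated class (the .proto in the comment is invented), and the C++ side would need matching length-prefix framing, which stock C++ protobuf does not provide out of the box:

```java
import java.io.InputStream;
import java.io.OutputStream;

// Assumed .proto, for illustration only:
//
//   message FeatureFrame {
//     repeated float values = 1 [packed = true];
//     int64 timestamp_ms = 2;
//   }
//
// writeDelimitedTo/parseDelimitedFrom prefix each message with its length,
// which gives you message framing over a raw socket stream.
public final class FrameStream {
    static void send(FeatureFrame frame, OutputStream out) throws Exception {
        frame.writeDelimitedTo(out);
    }

    static FeatureFrame receive(InputStream in) throws Exception {
        return FeatureFrame.parseDelimitedFrom(in); // null at end of stream
    }
}
```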

JMS Topic message size

Our application uses a topic to push messages to a small set of subscribers. What sort of things should I look for when modeling a JMS message with respect to the size of the actual message to be pushed? Are there any known limits, or is it application-server specific? Any best practices or suggestions on this topic (pun unintended)?
You are likely to hit practical limits before you hit technical ones. That is, message lengths may have technical limits in the lengths that can be expressed in an int or long, but that's unlikely to be the first constraint you hit.
Message lengths up in the megabytes tend to be heavyweight. Think in terms of a few KB as the ballpark you want to be in.
A technique sometimes used is to send small messages saying "Item 12345 has been updated"; consumers then go retrieve the data associated with item 12345 from a database or other storage mechanism. That way each client gets only the data it needs, and we don't spray large chunks of data around when subscribers may not need all of it.
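A minimal sketch of that notify-then-fetch idea in JMS; the property name itemId is an invented convention:

```java
import javax.jms.*;

// Publish only the item's ID; each subscriber fetches the full record itself,
// and only if it actually needs it.
public final class UpdateNotifier {
    static void notifyUpdate(Session session, MessageProducer producer, long itemId)
            throws JMSException {
        Message msg = session.createMessage(); // body-less message; tiny on the wire
        msg.setLongProperty("itemId", itemId);
        producer.send(msg);
    }
    // Subscribers read msg.getLongProperty("itemId") and load the record
    // from the database or other storage mechanism.
}
```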
I suggest you check the book Enterprise Integration Patterns, where a lot of patterns dealing with issues like the one you are asking about are analyzed exhaustively. In particular, if the size of your message is large, you can use a Message Sequence to solve the problem:
Huge amounts of data: Sometimes applications want to transfer a really large data structure, one that may not fit comfortably in a single message. In this case, break the data into more manageable chunks and send them as a Message Sequence. The chunks have to be sent as a sequence, and not just a bunch of messages, so that the receiver can reconstruct the original data structure.
Quoted from http://www.eaipatterns.com/MessageConstructionIntro.html
The home page for a brief description of each pattern of the book is available at http://www.eaipatterns.com/index.html
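A hedged sketch of what a Message Sequence producer might look like in JMS; the property names and the 64 KB chunk size are illustrative assumptions, not part of the pattern itself:

```java
import javax.jms.*;

// Split a large payload into fixed-size chunks and tag each with its
// position, the total count, and a shared sequence id, so the receiver can
// reassemble the original data structure in order.
public final class MessageSequenceSender {
    static void sendInChunks(Session session, MessageProducer producer,
                             byte[] payload, String sequenceId) throws JMSException {
        final int chunkSize = 64 * 1024; // tunable assumption
        int total = (payload.length + chunkSize - 1) / chunkSize;
        for (int i = 0; i < total; i++) {
            int from = i * chunkSize;
            int length = Math.min(chunkSize, payload.length - from);
            BytesMessage msg = session.createBytesMessage();
            msg.setStringProperty("sequenceId", sequenceId);
            msg.setIntProperty("position", i);
            msg.setIntProperty("total", total);
            msg.writeBytes(payload, from, length);
            producer.send(msg);
        }
    }
}
```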
It is implementation specific. As you might expect, smaller is better. TIBCO, for instance, recommends keeping message sizes under 100 KB.
Small messages are faster, obviously. Having said that, the underlying JMS server implementation may improve performance using, for instance, message compression, as WebLogic 9 does. (http://download.oracle.com/docs/cd/E13222_01/wls/docs92/perform/jmstuning.html#wp1150012)

Java: Advice on handling large data volumes. (Part Deux)

Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I wish to institute a good design for the data access. Typically what will happen is this:
One way or another, all the data will be read during the course of processing.
Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
Give me the first 10 kilobytes of all my files!
Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then let the OS worry about the details of caching, reading, flushing, etc.
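A minimal sketch of that approach; it assumes each file fits in a single mapping (under 2 GB) and ignores lifecycle details like unmapping:

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Map each file once, then slice frames out of it; the OS page cache does
// the actual I/O scheduling. The mapping stays valid after the channel closes.
public final class MappedFrameReader {
    private final MappedByteBuffer map;

    MappedFrameReader(String path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Copies length bytes starting at offset into a fresh array. */
    byte[] frame(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = map.duplicate(); // so readers don't share a position
        view.position(offset);
        view.get(out);
        return out;
    }
}
```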
#Will
Pretty good results. A quick comparison of reading a large binary file:
Test 1 - Basic sequential read with RandomAccessFile.
2656 ms
Test 2 - Basic sequential read with buffering.
47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame buffering optimization.
16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems already do a better job of implementing file-system caching than you are likely to do without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized to reuse recently used data (and to read ahead of the most recently used data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've made a number of contributions to the project, and it would be worth reviewing the source code, if nothing else to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
#Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This is to answer the part about minimizing I/O traffic. On the Java side, all you can really do is wrap your input streams in BufferedInputStreams (the data is binary, so character Readers don't apply here). Aside from that, your operating system will handle other optimizations, like keeping recently read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
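Since the data is binary, the buffering would look something like this; the 64 KB buffer size is just an assumption to tune:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// The "just add buffering" baseline: BufferedInputStream turns many small
// reads into a few large ones; OS read-ahead does the rest.
public final class BufferedOpen {
    static DataInputStream open(String path) throws IOException {
        return new DataInputStream(
                new BufferedInputStream(new FileInputStream(path), 64 * 1024));
    }
}
```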
I had someone recommend hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask why you are using files as your system of record, and what that gains you over using a database. A database certainly gives you the ability to structure your data, and given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.
