I'm attempting to transfer a large two-dimensional array (17955 x 3) from my server to the client using asynchronous RPC calls. This is taking a very long time, which is especially bad because the data is needed to initialize the application. I've read that using a JSON object might be faster, but I'm not sure how to do the conversion in Java, as I'm pretty new to the language and to GWT, and I don't know if the speed difference is significant. I also read somewhere that I can zip the data, but I only read that in a forum and I'm not sure if it's actually possible, as I couldn't find information about it elsewhere. Is there a better way to transfer large amounts of data from the server to the client? Thanks for your time.
Read this article on adding JSON capabilities to GWT. Regarding compression, this article explains gzipping with GWT.
Also, the size of your array is still very large even with the compression you may achieve with gzipping, which will vary depending on how much data is repeated in your array. You may want to consider logically breaking the array up into multiple RPC calls if at all possible.
I would recommend revisiting your design if your application needs such a large amount of data to initialize.
As others pointed out, you should reconsider your design, because even if you are able to solve the data transfer speed issue somehow, you will likely find other issues waiting for you:
Processing large amounts of data in the browser can be slow.
A lot of data means a lot of used-up memory.
What you can think about is:
Partitioning the data:
How is your user going to cope with a lot of data? Your user will probably need some kind of user interface aid to be able to work with such a large dataset. If you are going to use paging, tabs or other means to partition the data for the user's consumption, why not load the data on demand? For example, you can load a single page of records if you are using a paging grid, or you can load a single tab's worth of records if you are going to use tabs. Similarly, if you are going to allow filtering on the records, you can set a default filter after the load to keep the data to a minimum.
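As a rough illustration of on-demand loading, here is a minimal GWT RPC sketch that returns one page of records at a time instead of the whole 17955-row array up front. The RecordService name, the Record type and the "records" path are made up for this example, and the matching RecordServiceAsync interface used on the client side is omitted for brevity.

import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;
import java.util.List;

// Hypothetical service: the client asks for one page of rows at a time,
// so only the data for the currently visible page crosses the wire.
@RemoteServiceRelativePath("records")
public interface RecordService extends RemoteService {
    // Returns rows [startRow, startRow + pageSize); Record is a placeholder
    // for whatever serializable type the rows are mapped to.
    List<Record> getPage(int startRow, int pageSize);
}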
Summarizing the data:
You can also summarize the data on the server if you are not going to show each row to the user. For example, you can initially show a summary for each group of records and let the user drill down into a specific group.
A client and a server application need to be implemented in Java. The scenario requires reading a large number of small objects from a database on the server side and sending them to the client.
This is not about transferring large files; rather, it requires streaming a large number of small objects to the client.
The number of objects to be sent from the server to the client in a single request could be one or one million (let's assume the number of clients is limited for the sake of discussion - ignore throttling).
The total size of the objects will, in most cases, be too big to hold in memory. A way to defer the read-and-send operation on the server side until the client requests the object is needed.
Based on my previous experience, the WCF framework in .NET supports the scenario above with:
a transferMode of StreamedResponse
the ability to return an IEnumerable of objects
deferred serialization with the help of yield
Is there a Java framework that can stream objects as they are requested while keeping the connection open with the client?
NOTE: This may sound like a very general question, but I am hoping the specific details given will lead to a clear answer benefiting me and possibly others.
A standard approach is to use a form of pagination and get the results in chunks which can be accommodated temporarily in memory. How to do that specifically depends on the database, but a basic JDBC approach would be to first execute a statement to find out the number of records and then fetch them in chunks. For example, Oracle has a ROWNUM pseudocolumn that you can use to manage the ranges of records to return. Other databases have other options.
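For instance, a rough JDBC sketch of the Oracle ROWNUM approach might look like the following; the items table, its columns and the chunk size are placeholders to adapt to your schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChunkedReader {
    private static final int CHUNK_SIZE = 1000; // tune to what fits in memory

    public void readInChunks(Connection conn) throws Exception {
        // Classic ROWNUM pagination: the inner query numbers the rows,
        // the outer query keeps only the current chunk.
        String sql = "SELECT id, payload FROM ("
                   + " SELECT t.*, ROWNUM rn FROM items t WHERE ROWNUM <= ?"
                   + ") WHERE rn > ?";
        int offset = 0;
        boolean more = true;
        while (more) {
            more = false;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, offset + CHUNK_SIZE);
                ps.setInt(2, offset);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        more = true;
                        // send rs.getLong("id") / rs.getBytes("payload") to the client here
                    }
                }
            }
            offset += CHUNK_SIZE;
        }
    }
}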
You could use ObjectOutputStream / ObjectInputStream to do this.
The key to making this work would be to periodically call reset() on the output stream. If you don't do that, the sending and receiving ends will build a massive map that contains references to all objects sent / received over the stream.
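A minimal sketch of that pattern, assuming a plain socket or servlet output stream and an arbitrary batch size of 1000:

import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.Iterator;

public final class ObjectStreamer {
    private static final int RESET_EVERY = 1000; // assumed batch size, tune as needed

    public static <T> void stream(OutputStream raw, Iterator<T> objects) throws Exception {
        ObjectOutputStream out = new ObjectOutputStream(raw);
        int count = 0;
        while (objects.hasNext()) {
            out.writeObject(objects.next());
            if (++count % RESET_EVERY == 0) {
                out.reset(); // clears the back-reference table on both ends
            }
        }
        out.flush();
    }
}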
However, there may be issues with keeping a single request / response (or database cursor) open for a long time. And resuming a stream that failed could be problematic. So your solution should probably combine the above with some kind of pagination.
The other thing to note is that a scalable solution needs to prevent network latency from becoming the bottleneck. It may be worth implementing a receiver thread that eagerly pulls objects from the stream and buffers them in a (bounded) queue.
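For example, a hedged sketch of such a receiver, using an ArrayBlockingQueue as the bounded buffer (the capacity of 1024 and the expectedCount parameter are assumptions):

import java.io.ObjectInputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class PrefetchingReceiver<T> {
    private final BlockingQueue<T> buffer = new ArrayBlockingQueue<>(1024);

    // Starts a daemon thread that eagerly reads objects off the wire and
    // buffers them, so network latency overlaps with processing.
    @SuppressWarnings("unchecked")
    public void start(ObjectInputStream in, int expectedCount) {
        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < expectedCount; i++) {
                    buffer.put((T) in.readObject()); // blocks when the queue is full
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        reader.setDaemon(true);
        reader.start();
    }

    public T take() throws InterruptedException {
        return buffer.take(); // blocks until the reader thread has produced an object
    }
}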
What is the fastest way to populate a Hazelcast data grid? Reading through the documentation, I can see a couple of variants:
Use multithreading and IMap.set
Use multithreading and IMap.putAll
Use a Distributed Execution in order to start populating the grid from all participants.
My performance benchmark shows that IMap.putAll is faster than IMap.set. However, the Hazelcast documentation states that IMap.putAll does not come with guarantees that everything will be inserted atomically.
Can someone clarify what the fastest way to populate a data grid with data would be?
Is variant number 3 good?
I see the same three options. Anyhow, as you mentioned, option two does not guarantee that everything is put into the map atomically, but if you just load data and wait for all threads to finish loading via IMap::putAll, you should be fine.
Apart from that, IMap::set would be the alternative. In any case, you want to multithread the loading process. I would play around a bit with different thread counts; also, loading data from a client is normally recommended to keep the nodes free for storage operations.
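To illustrate, here is a rough sketch of multithreaded loading from a Hazelcast client with IMap::putAll in batches. The map name, batch size, thread count and the Hazelcast 4/5-style import of IMap are assumptions to adjust for your setup.

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GridLoader {
    public static void main(String[] args) throws Exception {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<Long, String> map = client.getMap("records"); // map name is an example
        ExecutorService pool = Executors.newFixedThreadPool(8); // try different thread counts

        int batchSize = 10_000;
        for (long start = 0; start < 1_000_000; start += batchSize) {
            final long from = start;
            pool.submit(() -> {
                Map<Long, String> batch = new HashMap<>();
                for (long key = from; key < from + batchSize; key++) {
                    batch.put(key, "value-" + key); // stand-in for real data
                }
                map.putAll(batch); // one bulk operation per batch instead of per-entry calls
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.shutdown();
    }
}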
I personally have never benchmarked your third option; anyhow, it would be possible as well. I'm just not sure it is worth the additional work.
How much data do you want to load that you're concerned it could be slow? Do you already know that loading is slow? Do you use Java Serialization (this is a huge performance killer)? Do you use indexes (those have to be generated while putting data)?
There are normally a lot of optimizations you can apply to speed up not only data loading but also normal operation.
I'm in the process of setting up a system which will have to repeatedly parse large amounts of text (as a String or StringBuffer - which might be better?) acquired from a data source. The text will be displayed and may consist of several thousand words, and each time the text is parsed, each word may have to be checked against a list of 550 stop words. This will allow the words to be filtered from the display.
So I wonder about performance, as this could be going on in multiple servlet sessions at any one time; is it better to check each word against a MySQL database table (MyISAM or InnoDB) using an index, or simply to store the 550 words in a Java array or ArrayList within the servlet context so they can be read more quickly?
In other words, it's a trade-off between database I/O and storing 550 strings in memory.
Any advice?
Thanks
Mr Morgan.
Assuming that the "data source" is not your database, you can get better performance by doing the stopword search in memory rather than asking the database to do it. It stands to reason:
Any algorithm that the database uses can equally be used as your in-memory algorithm.
By running the algorithm locally, you avoid the cost of sending the text to the database and sending the results back.
It is also likely that you can implement a better algorithm for detecting the stop words than a general-purpose database engine could. And the memory needed for a data structure representing the 550 stop words should be trivial compared with the space used by the rest of your application, the servlet container and all of the libraries you use.
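To make that concrete, here is a minimal sketch of an in-memory filter backed by a HashSet, which could be built once and stored as a ServletContext attribute so all sessions share the same read-only copy. The class name and the sample words are illustrative.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordFilter {
    private final Set<String> stopWords;

    public StopWordFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Returns the text with every stop word removed; lookups in a HashSet
    // are constant time, so several thousand words are checked almost instantly.
    public String filter(String text) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (!stopWords.contains(word.toLowerCase())) {
                out.append(word).append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "and")); // sample subset of the 550
        System.out.println(new StopWordFilter(stops).filter("the quick brown fox and a dog"));
        // prints: quick brown fox dog
    }
}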
550 strings is a very small amount of data for today's servers: you don't need the database; it will be much slower.
I recommend using a standard Java Properties file, since you don't have that much data. This lets you use the standard Internationalization/Locale features.
This assumes, of course, that the copy changes fairly slowly. But that is usually the case.
We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase memory limits, we hesitate to do so since it requires having a high allocation when most of the time it's not necessary.
We are considering using a customized java.util.List implementation to spool to disk when we hit peak loads like this, but under lighter circumstances will remain in memory.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
The application is, essentially, a batch processor that loads data from multiple database tables and applies extensive business logic to it. All of the data in the list is required since aggregate operations are part of that logic.
I just came across this post which offers a very good option: STXXL equivalent in Java
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
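For example, here is a sketch of a lazy iterator over database rows, assuming Guava's AbstractIterator is on the classpath; the String row type and the column name are placeholders.

import com.google.common.collect.AbstractIterator;
import java.sql.ResultSet;

// Steps through a JDBC ResultSet one row at a time; nothing is loaded
// until the consumer actually advances the iterator.
public class RowIterator extends AbstractIterator<String> {
    private final ResultSet rs;

    public RowIterator(ResultSet rs) {
        this.rs = rs;
    }

    @Override
    protected String computeNext() {
        try {
            if (rs.next()) {
                return rs.getString("payload"); // column name is an example
            }
            return endOfData(); // signals that the underlying data is exhausted
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}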
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it is hard to tell from what you've described.
I'm an optimist, and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3 to 5 days.
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
I would also question why you need to load all of the data in memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.
Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I wish to establish a good design for the data access. Typically what will happen is this:
One way or another, all the data will be read during the course of processing.
Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
Give me the first 10 kilobytes of all my files!
Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then, let the OS worry about the details of caching, read, flushing etc.
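A minimal sketch of that approach, reading an arbitrary frame out of a memory-mapped file; the path, offset and length arguments are whatever your frame layout dictates.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedReader {
    // Maps only the requested region; the OS pages bytes in as they are touched
    // and keeps recently used pages cached across calls.
    public static byte[] readFrame(String path, long offset, int length) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] frame = new byte[length];
            buf.get(frame);
            return frame;
        }
    }
}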
#Will
Pretty good results. A quick comparison of reading a large binary file:
Test 1 - Basic sequential read with RandomAccessFile.
2656 ms
Test 2 - Basic sequential read with buffering.
47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame buffering optimization.
16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems already do a better job of implementing file system caching than you are likely to manage without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized to reuse recently used data (and reading ahead of the most recently used data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've made a number of contributions to the project, and it would be worth reviewing the source code, if nothing else, to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
#Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This is to answer the part about minimizing I/O traffic. On the Java side, all you can really do is wrap your readers in BufferedReaders. Aside from that, your operating system will handle other optimizations like keeping recently-read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
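Since the files in question are binary, the analogous wrapper would be BufferedInputStream rather than BufferedReader. A small sketch follows; the 64 KB buffer size is just a guess to tune.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class BufferedRead {
    public static long countBytes(String path) throws Exception {
        // Each read() is served from the 64 KB buffer, so the underlying file
        // is hit far less often than with an unbuffered stream.
        try (InputStream in = new BufferedInputStream(new FileInputStream(path), 64 * 1024)) {
            long total = 0;
            while (in.read() != -1) {
                total++;
            }
            return total;
        }
    }
}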
I had someone recommend hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask yourself why you are using files as your system of record, and what gains that gives you over using a database. A database certainly gives you the ability to structure your data. Given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.