I have a scenario in which:
A HUGE input file with a specific format, delimited with \n, has to be read; it has almost 20 million records.
Each record has to be read and processed by sending it to a server in a specific format.
=====================
I am thinking about how to design it:
- Read the file (NIO).
- The thread that reads the file can put those chunks onto a JMS queue.
- Create n threads representing the n servers (to which the data is to be sent). These n threads, running in parallel, can each pick up one chunk at a time and execute that chunk by sending requests to the server.
Can you suggest if the above is fine, or do you see any flaw(s) :) ? Also it would be great if you can suggest a better way / better technologies to do this.
Thank you!
Updated: I wrote a program to read that file with 20m records. Using Apache Commons IO (a file iterator) I read the file in chunks (10 lines at a time), and it read the file in 1.2 seconds. How good is this? Should I think of going to NIO? (When I put in a log statement to print the chunks, it took almost 26 seconds!)
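For reference, a minimal sketch of that chunked Commons IO read (the file name is a placeholder and the chunk handling is left as a stub); timing it with and without the per-chunk logging is what shows that the logging, not the reading, dominates:

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;

    public class ChunkedReadTimer {
        public static void main(String[] args) throws IOException {
            File input = new File("records.txt");  // hypothetical path
            int chunkSize = 10;                     // lines per chunk, as in the question

            long start = System.nanoTime();
            long chunks = 0;
            LineIterator it = FileUtils.lineIterator(input, "UTF-8");
            try {
                List<String> chunk = new ArrayList<>(chunkSize);
                while (it.hasNext()) {
                    chunk.add(it.nextLine());
                    if (chunk.size() == chunkSize || !it.hasNext()) {
                        chunks++;       // hand the chunk off to a queue/worker here
                        chunk.clear();  // note: no per-chunk logging, which is what cost the 26 seconds
                    }
                }
            } finally {
                it.close();
            }
            System.out.printf("Read %d chunks in %d ms%n",
                    chunks, (System.nanoTime() - start) / 1_000_000);
        }
    }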
20 million records isn't actually that many, so first off I would try just processing it normally; you may find the performance is fine.
After that you will need to measure things.
You need to read from the disk sequentially to get good speed there, so that part must be single threaded.
You don't want the disk read waiting for the networking, or the networking waiting for the disk reads, so dropping the data read into a queue is a good idea. You will probably want a chunk size larger than one line for optimum performance, though. Measure the performance at different chunk sizes to see.
You may find that the network sending is already faster than the disk reading. If so, then you are done; if not, then at that point you can spin up more threads reading from the queue and test with them.
So your tuning factors are:
- chunk size
- number of threads
Make sure you measure performance over a decent sized amount of data for various combinations to find the one that works best for your circumstances.
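A minimal sketch of that reader-thread-plus-queue shape, assuming an in-memory BlockingQueue and a hypothetical sendToServer() call standing in for the real request code; CHUNK_SIZE and WORKERS are the two tuning factors listed above:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;
    import java.util.stream.Stream;

    public class ChunkedUploader {
        static final int CHUNK_SIZE = 1000;  // tuning factor 1: lines per chunk
        static final int WORKERS = 4;        // tuning factor 2: sender threads
        static final List<String> POISON = new ArrayList<>();  // end-of-input marker

        public static void main(String[] args) throws Exception {
            BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(100);
            ExecutorService senders = Executors.newFixedThreadPool(WORKERS);

            for (int i = 0; i < WORKERS; i++) {
                senders.submit(() -> {
                    List<String> chunk;
                    while ((chunk = queue.take()) != POISON) {
                        sendToServer(chunk);  // placeholder for the real network call
                    }
                    return null;
                });
            }

            // Single reader thread: sequential disk access, chunks handed off to the queue.
            try (Stream<String> lines = Files.lines(Paths.get("records.txt"), StandardCharsets.UTF_8)) {
                List<String> chunk = new ArrayList<>(CHUNK_SIZE);
                for (String line : (Iterable<String>) lines::iterator) {
                    chunk.add(line);
                    if (chunk.size() == CHUNK_SIZE) {
                        queue.put(chunk);
                        chunk = new ArrayList<>(CHUNK_SIZE);
                    }
                }
                if (!chunk.isEmpty()) queue.put(chunk);
            }
            for (int i = 0; i < WORKERS; i++) queue.put(POISON);  // one marker per worker
            senders.shutdown();
            senders.awaitTermination(1, TimeUnit.HOURS);
        }

        static void sendToServer(List<String> chunk) { /* hypothetical request code */ }
    }

Since each queue entry is already a list of records, the same chunk doubles as the batch suggested in the next answer.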
I believe you could batch the records instead of sending them one at a time. You could avoid unnecessary network hops, given the volume of data that needs to be processed by the server.
Related
I am trying to read a huge file which contains one word (of varying length) per line.
I want to read it with multiple threads depending on the string length.
For example, thread one reads the lines whose word has length one, thread two reads length two, and so on.
Is there any way to achieve this? If there is, how will the performance be affected?
I found these examples, but I can't put them together.
Reference 1 : Multithread file reading
Reference 2 : How to read files in multithreaded mode?
You can use multiple threads; however, it won't be any faster. To find all the lines of a given length you have to read all the other lines anyway.
Is there any way to achieve this?
Read all the lines and ignore the ones you filter out.
What you can do is process different lines in different threads; however, whether that helps or is slower depends on how CPU-intensive the per-line work is.
Reading a file in multithreaded mode can only make things slower, since the disk drive has to move its heads between multiple points of reading. Instead, transfer the computational work from the reading thread to worker thread(s).
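A minimal sketch of that shape applied to the word-length question, assuming one single-threaded executor per length bucket and a hypothetical handleWord() method standing in for the per-word processing; the file itself is still read by one thread:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Stream;

    public class LengthDispatcher {
        static final int MAX_LENGTH = 32;  // assumed upper bound on word length

        public static void main(String[] args) throws IOException, InterruptedException {
            // One single-threaded executor per word length: "thread one" sees length-1 words, etc.
            ExecutorService[] workers = new ExecutorService[MAX_LENGTH + 1];
            for (int i = 0; i <= MAX_LENGTH; i++) {
                workers[i] = Executors.newSingleThreadExecutor();
            }

            // A single reader thread keeps the disk access sequential.
            try (Stream<String> lines = Files.lines(Paths.get("words.txt"), StandardCharsets.UTF_8)) {
                lines.forEach(word -> {
                    int len = Math.min(word.length(), MAX_LENGTH);
                    workers[len].submit(() -> handleWord(word));  // hand the CPU work to the right thread
                });
            }

            for (ExecutorService w : workers) {
                w.shutdown();
                w.awaitTermination(1, TimeUnit.MINUTES);
            }
        }

        static void handleWord(String word) { /* hypothetical per-word processing */ }
    }

Any speed-up here comes only from handleWord() running off the reading thread; as the answers above say, the reading itself is not made faster by adding threads.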
I am wondering if there is a way to optimize reading from disk in Java. For example, I want to print the contents of all text files in some directory, but after uppercasing them. I can create another thread to uppercase them, but can I optimize reading by adding another thread (or threads) to read files too? I mean 2, 3 or more threads reading different files from disk. Is there some optimization for doing this or not? I hope that I explained the problem clearly.
I want to print the contents of all text files
This is most likely your bottleneck. If not, you should focus on what your bottleneck is, as optimising anything else is likely to complicate your code for no benefit.
I can create another thread to uppercase them,
You can, though passing the work to another thread could be more expensive than making it uppercase, depending on how you do this.
can I optimize reading by adding another thread (or threads) to read files too?
Possibly. How many disks do you have? If you have one disk, it can usually only do one thing at a time.
I mean 2, 3 or more threads reading different files from disk.
Most desktop drives can only do one operation at a time.
Is there some optimization for doing this or not?
Yes, but as I said, until you know what your bottleneck is, it's hard to jump to a solution.
I can create another thread to uppercase them
That's actually going in the right direction, but simply making all letters uppercase doesn't take enough time to really matter unless you're processing really large chunks of the file.
Because the standard single-threaded model of read-then-process means you're either reading data or processing it, when you could be doing both at the same time.
For example, you could be creating a series of highly compressed (say, JPEG2000 because it's so CPU intensive) images from a large video stream file. You could have one thread reading frames from the stream, placing them into a queue to process, and then have N threads each processing a frame into an image.
You'd tune the number of threads reading data and the number of threads processing data to keep both your disks and CPUs maximally busy without excess contention.
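Applied to the uppercase example from the question, a minimal sketch of that reader-plus-workers shape (the directory path and file glob are placeholders):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class UppercasePrinter {
        public static void main(String[] args) throws IOException, InterruptedException {
            ExecutorService workers = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            // One thread walks the directory and does all the disk reads...
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("some-directory"), "*.txt")) {
                for (Path file : dir) {
                    byte[] bytes = Files.readAllBytes(file);  // disk work stays on this thread
                    workers.submit(() -> {                    // ...the CPU work goes to the pool
                        String text = new String(bytes, StandardCharsets.UTF_8);
                        System.out.println(text.toUpperCase());
                    });
                }
            }
            workers.shutdown();
            workers.awaitTermination(10, TimeUnit.MINUTES);
        }
    }

As the answers above point out, uppercasing is so cheap that the pool probably won't pay for itself here; the pattern only wins when the per-file processing is genuinely expensive.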
There are some cases where you can use multiple threads to read from a single file to get better performance. But you need a system designed from the ground up to do that. You need lots of disks (less so if they're SSDs), a pretty substantial IO infrastructure along with a system that has a lot of IO bandwidth, and then you need a file system that can handle multiple simultaneous access to a single file. Then the code you have to write to get better performance from reading using more than one thread has to match things like the physical layout of your files on disk.
That works best if you're doing lots of random reads from a file spread over multiple devices. Like a large, high-powered database server.
For example, let's say I have a huge data file spread over four or five disks (or even RAID arrays), with the file spread out over the disks in 64KB chunks. A handful of threads doing 64KB reads would be ideal to read or write such a file in a random-access mode. Let's say everything is really fast and you can read or write 1 GB/sec from such a file.
But if you turn around and just try to copy that data in a stream, you can still use multiple threads to get maximum performance - say 1 GB/sec - but if you just used a single thread to do read() calls in 1 MB chunks you'd probably get 950 MB/sec - or 95% of maximum multithreaded read performance.
I've actually benchmarked such systems and most of the time, multithreaded IO isn't worth the trouble unless you've invested a lot of money in your hardware and software (opensource file systems tend not to do this very well - you need to get into the realm of IBM's GPFS and Oracle's (nee LSC's then Sun's) QFS) and you know exactly what you're doing when you set it up.
I am creating a multithreaded env using ExecutorService. All my threads are doing the same thing.
They are getting the data from the DB, preparing the PDF using iText and writing the PDF to a location on the D drive.
But I noticed a weird thing. As I am increasing the number of threads, my end-to-end process becomes slower.
For 1 thread - 4000 pdf generated in 1 hour
For 2 threads - 3500 pdf generated in 1 hour
For 3 threads - 3200 pdf generated in 1 hour
For 4 threads - 3000 pdf generated in 1 hour
Using a logger, it became clear that getting the data from the DB is very fast; the bottleneck is the PDF writing operation.
Somewhere I read that on Windows, writing multiple files to the same directory simultaneously is slower than writing them sequentially.
If that is true, what other logic can I implement to get higher performance?
Thank you.
Environment details
OS - Windows 7 ,32 bit
RAM - 3 GB
Processor - Core i3
JDK - 1.6
DB - PostgreSql 9.3
Size of PDF - varies between 500KB and 2 MB
The HDD can only write to one portion of the disk at a time, so if you have several different threads (or even processes) writing at the same time, the disk has to move its heads all over the place, writing a bit to file A here, a bit to file B there, etc. This is why it's actually slower to split this task into threads: you're making the HDD work harder.
If you have any CPU-intensive tasks, they can frequently be multiplexed across a couple of threads to get benefits on any modern CPU, but as soon as you're dealing with a singleton resource like a specific HDD, you're generally better off sticking to a single thread for that aspect of what you're doing.
Writing to an HDD is a blocking IO operation, so you will gain nothing by doing it with multithreading. With an HDD you will actually experience a slowdown. If you switch to an SSD, then it's possible that you will not experience a slowdown from multithreaded disk access (or the slowdown will at least be smaller than with an HDD), but there will be no improvement either.
The situation might be different if you have a RAID, but it depends on the type of the RAID.
To increase performance in your scenario you should split the work across the threads in the following way:
1) Have one IO thread for reading/writing from the disk (or, alternatively, one IO thread for reading and another IO thread for writing - that would be even better).
2) Have a separate thread for calculations. This thread should not do any IO operations on the disk.
The IO thread simply reads data from the disk and passes it into a queue (let's call it the "input queue"). The "calculations thread" then picks the data up from the "input queue", processes it, and puts the results into another queue (let's call it the "results queue"). The IO thread can then pick the data up from the "results queue" and write it to disk.
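For the PDF scenario, a minimal sketch of that split, assuming the DB fetch plus iText rendering is the CPU-side work and a single writer thread drains a results queue to the D drive (generatePdf(), the output directory and the counts are placeholders; written against a current JDK rather than 1.6):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.*;

    public class PdfPipeline {
        // Result of the CPU-side work: file name plus rendered bytes.
        static final class RenderedPdf {
            final String name;
            final byte[] bytes;
            RenderedPdf(String name, byte[] bytes) { this.name = name; this.bytes = bytes; }
        }

        static final RenderedPdf POISON = new RenderedPdf("", new byte[0]);

        public static void main(String[] args) throws Exception {
            BlockingQueue<RenderedPdf> results = new ArrayBlockingQueue<>(16);
            ExecutorService generators = Executors.newFixedThreadPool(3);  // DB fetch + iText rendering

            // Single writer thread: the HDD only ever sees one sequential write at a time.
            Thread writer = new Thread(() -> {
                try {
                    RenderedPdf pdf;
                    while ((pdf = results.take()) != POISON) {
                        Files.write(Paths.get("D:/pdfs", pdf.name), pdf.bytes);  // directory assumed to exist
                    }
                } catch (InterruptedException | IOException e) {
                    throw new RuntimeException(e);
                }
            });
            writer.start();

            for (int i = 0; i < 4000; i++) {
                final int id = i;
                generators.submit(() -> {
                    results.put(new RenderedPdf("doc-" + id + ".pdf", generatePdf(id)));
                    return null;
                });
            }
            generators.shutdown();
            generators.awaitTermination(1, TimeUnit.HOURS);
            results.put(POISON);
            writer.join();
        }

        static byte[] generatePdf(int id) { return new byte[0]; }  // placeholder for the DB + iText work
    }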
I have a big file, more than 1 GB, and I want to search it for the occurrences of a certain word.
So I want to split the task over several threads, where each thread will handle a portion of the file.
What is the best approach to do this? I thought about reading the file into several fixed-size buffers and passing each thread a buffer.
Is there a better way to do this?
[EDIT] I want to execute each thread on a different device.
A ByteBuffer, say on a RandomAccessFile, would be feasible for files < 2 GB (2^31 bytes).
The general solution would be to use FileChannel, with its MappedByteBuffer.
With several buffers one must take care to have overlapping buffers, so that a word straddling a buffer boundary can still be found.
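A minimal sketch of that FileChannel/MappedByteBuffer approach, with each mapped region overlapping the next by the word length so that boundary matches aren't lost (the file name, word and region size are placeholders; one task per region is submitted to a small pool):

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class MappedWordCount {
        public static void main(String[] args) throws Exception {
            byte[] word = "needle".getBytes(StandardCharsets.US_ASCII);
            long regionSize = 256L * 1024 * 1024;  // 256 MB regions, each mapped separately

            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Long>> counts = new ArrayList<>();

            try (FileChannel ch = FileChannel.open(Paths.get("big.txt"), StandardOpenOption.READ)) {
                long size = ch.size();
                for (long pos = 0; pos < size; pos += regionSize) {
                    // Overlap each region by word.length - 1 bytes so boundary matches are counted exactly once.
                    long len = Math.min(regionSize + word.length - 1, size - pos);
                    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                    counts.add(pool.submit(() -> countOccurrences(buf, word)));
                }
                long total = 0;
                for (Future<Long> f : counts) total += f.get();
                System.out.println("Occurrences: " + total);
            } finally {
                pool.shutdown();
            }
        }

        // Naive byte-by-byte scan of one mapped region.
        static long countOccurrences(MappedByteBuffer buf, byte[] word) {
            long count = 0;
            for (int i = 0; i + word.length <= buf.limit(); i++) {
                int j = 0;
                while (j < word.length && buf.get(i + j) == word[j]) j++;
                if (j == word.length) count++;
            }
            return count;
        }
    }

As the next answer notes, the scan itself is trivial compared to the disk reads, so on a single spinning disk this parallelism buys little; it mainly helps when the file is already in the page cache or sits on storage that handles concurrent reads well.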
Reading the file into the buffers will probably take just as long as just doing the search (the extra processing required to search is tiny compared to the time needed to read the file off the disk - in fact the search could probably be done in the time the thread would otherwise just be waiting for data).
Searching multiple locations in the file at once will be very slow on most storage systems.
The real question is whether you are only searching each file once or whether you search them frequently. If only once, then you have no real choice but to scan the file and take the time. If you are doing it frequently, then you could consider indexing the contents somehow.
Consider using Hadoop MapReduce.
If you want to execute threads (= divided tasks) on different devices, the input file should be on a distributed file system such as HDFS (Hadoop Distributed File System). MapReduce is a mechanism to divide one job into multiple tasks and run them on different machines in parallel.
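If the different-devices requirement is real, a minimal Hadoop sketch in the classic word-count shape (the HDFS paths and the search word are placeholders; the framework splits the file across the cluster for you):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordOccurrenceJob {
        // Emits ("needle", 1) for every occurrence of the target word in a line.
        public static class OccurrenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final String TARGET = "needle";  // hypothetical search word
            private final Text outKey = new Text(TARGET);
            private final IntWritable one = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (token.equals(TARGET)) ctx.write(outKey, one);
                }
            }
        }

        // Sums the counts produced by the mappers across the cluster.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word-occurrences");
            job.setJarByClass(WordOccurrenceJob.class);
            job.setMapperClass(OccurrenceMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/big.txt"));  // input already on HDFS
            FileOutputFormat.setOutputPath(job, new Path("/data/word-count-out"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }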
I need a disk-based key-value store that can sustain high write and read performance for large data sets. Tall order, I know.
I'm trying the C BerkeleyDB (5.1.25) library from java and I'm seeing serious performance problems.
I get solid 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents the performance drops like a rock, then it recovers for a while, then drops again, etc. This happens more and more frequently, up to the point where most of the time I can't get more than 60 docs/s with a few isolated peaks of 12K docs/s after 10 million docs. My db type of choice is HASH but I also tried BTREE and it is the same.
I tried using a pool of 10 db's and hashing the docs among them to smooth out the performance drops; this increased the write throughput to 50K docs/s but didn't help with the performance drops: all 10 db's slowed to a crawl at the same time.
I presume that the files are being reorganized, and I tried to find a config parameter that affects when this reorganization takes place, so each of the pooled db's would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, reserving space using the setHashNumElements config option so it wouldn't spend time growing the file, but every tweak made it much worse.
I'm about to give berkeleydb up and try much more complex solutions like cassandra, but I want to make sure I'm not doing something wrong in berkeleydb before writing it off.
Anybody here with experience achieving sustained write performance with berkeleydb?
Edit 1:
I tried several things already:
Throttling the writes down to 500/s (less than the average I got after writing 30 million docs in 15 hours, which indicates the hardware is capable of writing 550 docs/s). Didn't work: once a certain number of docs has been written, performance drops regardless.
Write incoming items to a queue. This has two problems: A) It defeats the purpose of freeing up ram. B) The queue eventually blocks because the periods during which BerkeleyDB freezes get longer and more frequent.
In other words, even if I throttle the incoming data to stay below the hardware capability and use ram to hold items while BerkeleyDB takes some time to adapt to the growth, as this time gets increasingly longer, performance approaches 0.
This surprises me because I've seen claims that it can handle terabytes of data, yet my tests show otherwise. I still hope I'm doing something wrong...
Edit 2:
After giving it some more thought and with Peter's input, I now understand that as the file grows larger, a batch of writes will get spread farther apart and the likelihood of them falling into the same disk cylinder drops, until it eventually reaches the seeks/second limitation of the disk.
But BerkeleyDB's periodic file reorganizations are killing performance much earlier than that, and in a much worse way: it simply stops responding for longer and longer periods of time while it shuffles stuff around. Using faster disks or spreading the database files among different disks does not help. I need to find a way around those throughput holes.
What I have seen with high rates of disk writes is that the system cache will fill up (giving lightning performance up to that point), but once it fills up, the application, and even the whole system, can slow dramatically, even stop.
Your underlying physical disk should sustain at least 100 writes per second. Any more than that is an illusion supported by cleverer caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.
I suggest you consider a disk controller cache. Its battery-backed memory would need to be about the size of your data.
Another option is to use SSD drives if the updates are bursty (they can do 10K+ writes per second as they have no moving parts); with caching, this should give you more than you need, but SSDs have a limited number of writes.
BerkeleyDB does not perform file reorganizations unless you're manually invoking the compaction utility. There are several causes of the slowdown:
- Writes to keys are in random-access fashion, which causes much higher disk I/O load.
- Writes are durable by default, which forces a lot of extra disk flushes.
- A transactional environment is being used, in which case checkpoints cause a slowdown when flushing changes to disk.
When you say "documents", do you mean to say that you're using BDB for storing records larger than a few kbytes? BDB overflow pages have more overhead, and so you should consider using a larger page size.
This is an old question and the problem is probably gone, but I have recently had similar problems (insert speed dropping dramatically after a few hundred thousand records) and they were solved by giving more cache to the database (DB->set_cachesize). With 2 GB of cache the insert speed was very good and more or less constant up to 10 million records (I didn't test further).
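For reference, a minimal sketch of the same idea from Java, assuming the com.sleepycat.db bindings over the C library that the question uses (the 2 GB figure and the environment home directory are just the values from this answer):

    import java.io.File;

    import com.sleepycat.db.Environment;
    import com.sleepycat.db.EnvironmentConfig;

    public class BdbCacheSetup {
        public static void main(String[] args) throws Exception {
            EnvironmentConfig config = new EnvironmentConfig();
            config.setAllowCreate(true);
            config.setInitializeCache(true);               // DB_INIT_MPOOL: enable the shared memory pool
            config.setCacheSize(2L * 1024 * 1024 * 1024);  // 2 GB cache, the equivalent of DB->set_cachesize

            Environment env = new Environment(new File("bdb-home"), config);
            // ... open databases from this environment and load records ...
            env.close();
        }
    }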
We have used BerkeleyDB (BDB) at work and have seen similar performance trends. BerkeleyDB uses a Btree to store its key/value pairs. As the number of entries keeps increasing, the depth of the tree increases. BerkeleyDB caching works by loading trees into RAM so that a tree traversal does not incur file IO (reading from disk).
I need a disk-based key-value store that can sustain high write and read performance for large data sets.
Chronicle Map is a modern solution for this task. It's much faster than BerkeleyDB on both reads and writes, and is much more scalable in terms of concurrent access from multiple threads/processes.
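A minimal sketch of a persisted Chronicle Map for this kind of key-value workload (the entry count, average key/value sizes and the file path are assumptions used to size the store up front):

    import java.io.File;
    import java.io.IOException;

    import net.openhft.chronicle.map.ChronicleMap;

    public class DocStore {
        public static void main(String[] args) throws IOException {
            // Chronicle Map is sized up front: expected entry count plus average key/value sizes.
            ChronicleMap<String, byte[]> docs = ChronicleMap
                    .of(String.class, byte[].class)
                    .name("docs")
                    .entries(20_000_000)
                    .averageKey("doc-12345678")
                    .averageValueSize(4096)                    // assumed average document size in bytes
                    .createPersistedTo(new File("docs.dat"));  // off-heap store backed by this file

            docs.put("doc-1", new byte[]{1, 2, 3});
            byte[] value = docs.get("doc-1");
            System.out.println(value.length);
            docs.close();
        }
    }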