Download files asynchronously through FTP in Java

I need to download multiple files through FTP in Java. For this I wrote code using FTPClient which downloads the files one by one.
I need to take files from one server and download them to another network. After writing the code, I found that each download takes a long time because the file sizes are huge (more than 10GB). I decided to multithread the process, i.e. download multiple files at a time. Can anybody help me write the FTP download in a multithreaded environment?
Although I feel that multithreading won't help, as the network bandwidth would remain the same and would be divided among multiple threads, leading to slow downloads again. Please suggest!!

You have different stuff to check first:
your download speed
remote server's upload speed
maximum server upload speed for each connection
If the server limits the transfer speed for a single connection to a threshold lower than its maximum transfer speed, you can gain something by using multi-threading (e.g. with a limit of 10 Kb/s per connection and a maximum upload of 100 Kb/s, you can theoretically run 10 downloads in parallel). If not, multi-threading will not help you.
Also, if your download is already saturated (all your bandwidth is filled by a single download, or the server's upload bandwidth is greater than your download bandwidth), multi-threading will not help either.
If multi-threading will be useful in your case, just instantiate a new connection for each file and run it in a separate thread.
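A minimal sketch of that pattern, assuming one fresh connection per file and a bounded pool. The downloadOne method here is a made-up stub standing in for the real FTP logic (a real Apache Commons Net FTPClient call is shown only in the comment; host, paths, and credentials would be yours):

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class ParallelFtpDownload {

    // Placeholder for the real per-file download: open a fresh FTP
    // connection, retrieve the file, close the connection.
    static void downloadOne(String remotePath, Path localDir) throws Exception {
        // e.g. with Apache Commons Net:
        //   FTPClient ftp = new FTPClient();
        //   ftp.connect(host); ftp.login(user, pass);
        //   try (OutputStream out = Files.newOutputStream(localDir.resolve(name))) {
        //       ftp.retrieveFile(remotePath, out);
        //   }
        //   ftp.logout(); ftp.disconnect();
        String name = Paths.get(remotePath).getFileName().toString();
        Files.writeString(localDir.resolve(name), "contents of " + remotePath); // stub: simulate the transfer
    }

    public static void main(String[] args) throws Exception {
        List<String> remoteFiles = List.of("/data/a.bin", "/data/b.bin", "/data/c.bin");
        Path localDir = Files.createTempDirectory("ftp-demo");

        // Bounded pool: one connection per file, but never more than 2 in flight.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<?>> results = new ArrayList<>();
        for (String remote : remoteFiles) {
            results.add(pool.submit(() -> { downloadOne(remote, localDir); return null; }));
        }
        for (Future<?> f : results) f.get(); // propagate any download failure
        pool.shutdown();

        System.out.println("downloaded=" + Files.list(localDir).count());
    }
}
```

Tuning the pool size (2 here) against the server's per-connection limit is the whole game; a larger pool only helps while the server still has spare upload capacity.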

I feel that multithreading won't help as bandwidth of the network would remain same and would be divided among multiple threads and leading to slow download again.
That could well be true. Indeed, if you have too many threads trying to download files at the same time, you are likely to either overload the FTP server or cause network congestion. Both can result in a net decrease in the overall data rate.
The solution is to use a bounded thread pool for the download threads, and tune the pool size.
It is also a good idea to reuse connections where possible, since creating a connection and authenticating the user take time ... and CPU resources at both ends.

Related

What is the right way to create/write a large file in Java that is generated by a user?

I have looked at examples that describe best practices for file write/create operations, but I have not seen one that covers my requirements. I have to create a class which reads the contents of one file, does some data transformation, writes the transformed contents to a different file, and then sends that file to a web service. Both files can ultimately be quite large, up to 20 MB, and it is unpredictable when these files will be created because they are generated by users. There could be two minutes between occurrences of this process, or several could happen in the same second. The system is not crazy in the sense of hundreds of these operations per second, but there could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?
The first question you should ask is whether you need to write the file to disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care whether the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk is if you communicate between your processes via disk files (i.e. your producer writes a file to disk, sends some notification to your consumer, and your consumer afterwards reads the file from disk, for example based on a file name it receives in the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
Regarding adding multiple threads, you should test whether that improves your application's performance or not. If your application is already I/O intensive, adding more threads will add even more contention on your I/O streams, which would result in performance degradation.
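The buffered read/transform/write advice above can be sketched as follows; the uppercasing is a made-up placeholder for the real transformation, and the temp files stand in for the user-generated input:

```java
import java.io.*;
import java.nio.file.*;

public class TransformFile {
    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("in", ".txt");
        Files.write(in, java.util.List.of("alpha", "beta"));
        Path out = Files.createTempFile("out", ".txt");

        // Buffered streaming: only one line is held in memory at a time,
        // never the whole (potentially 20 MB) file.
        try (BufferedReader reader = Files.newBufferedReader(in);
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.toUpperCase()); // placeholder transformation
                writer.newLine();
            }
        }
        System.out.println(String.join(",", Files.readAllLines(out)));
    }
}
```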
Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but may be overkill if your needs are just a couple of files, given the additional set-up.
Given the description you have given, I think you should perform the operations for each file on one thread; i.e. one thread will download the file, process it, and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But on a modern (e.g. Linux) operating system on a typical modern machine, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time).
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)
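The one-task-per-file approach described above might be sketched like this; download, process, and upload are stand-in stub methods here, not a real API:

```java
import java.util.*;
import java.util.concurrent.*;

public class PipelinePerFile {
    static String download(String name) { return "raw:" + name; }       // stub
    static String process(String data)  { return data.toUpperCase(); }  // stub
    static String upload(String data)   { return "sent " + data; }      // stub

    public static void main(String[] args) throws Exception {
        // Bounded pool size is the main tuning knob; expose it as config.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<String>> results = new ArrayList<>();
        for (String name : List.of("a", "b", "c")) {
            // One task = download + process + upload for one file, so no
            // phase can run ahead of the others and pile up intermediates.
            results.add(pool.submit(() -> upload(process(download(name)))));
        }
        for (Future<String> f : results) System.out.println(f.get());
        pool.shutdown();
    }
}
```

Because each task carries a file through all three phases, no synchronization between phases is needed and no intermediate results accumulate on disk.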

Is it safer/better to sequentially upload files in android or is it better to do it concurrently with multiple threads?

I am creating an Android application that requires me to upload files to Box.com, and I was wondering whether it is safer to upload my files sequentially or concurrently. There will be a very large number of files, so I'm a little worried about doing them concurrently.
Can you please tell me the advantages and disadvantages of both?
Thank you very much for your time and assistance in this matter.
Actually, based on our experiences, I'd have to say sequential is the way to go. Here's why:
Speed: For a typical consumer (not business) network connection, your upload speed is much lower than your download speed. A sequential upload will probably use the maximum bandwidth, whereas 2 or 3 concurrent uploads might each use 1/2 or 1/3 of the available bandwidth and each take 2x/3x as long to finish. So concurrency wouldn't necessarily give you a speed advantage, especially on older devices.
Overhead: If you are handling encryption or compression, etc, yourself, the CPU overhead to do this for parallel uploads will be higher, meaning lower battery life. In any case, I would recommend a library like Retrofit to interact with the API.
Safety: If your network connection is interrupted, concurrent uploads leave you with multiple failed uploads or potentially corrupted files online, while a sequential approach limits the risk to one file. Resuming uploads from there should be more manageable than with multiple failed uploads.

Pull files concurrently using single SFTP connection in Java - Improving SFTP performance

I need to pull the files concurrently from remote server using single SFTP connection in Java code.
I've already found a few links describing how to pull the files one by one over a single connection.
Like:
Use sftpChannel.ls("Path to dir"), which returns the list of files in the given path as a Vector; then iterate over the vector and download each file with sftpChannel.get().
But I want to pull multiple files concurrently for eg. 2 files at a time on single connection.
Thank You!
The ChannelSftp.get method returns an InputStream.
So you can call get multiple times, acquiring a stream for each download, and then keep polling the streams until all reach end-of-file.
Though I do not see what advantage this gives you over a sequential download.
If you want to improve performance, you first need to know what the bottleneck is.
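The polling-multiple-streams pattern could be sketched like this; the ByteArrayInputStreams are stand-ins for the InputStreams a real ChannelSftp.get(remotePath) call would return, so the skeleton runs on its own:

```java
import java.io.*;
import java.util.*;

public class InterleavedStreams {
    public static void main(String[] args) throws IOException {
        // Stand-ins for the streams returned by ChannelSftp.get(...)
        List<InputStream> streams = new ArrayList<>(List.of(
                new ByteArrayInputStream("file-one".getBytes()),
                new ByteArrayInputStream("file-two".getBytes())));
        List<ByteArrayOutputStream> outputs = new ArrayList<>();
        for (int i = 0; i < streams.size(); i++) outputs.add(new ByteArrayOutputStream());

        byte[] buf = new byte[4];
        boolean anyOpen = true;
        while (anyOpen) {           // keep polling every stream until all hit EOF
            anyOpen = false;
            for (int i = 0; i < streams.size(); i++) {
                int n = streams.get(i).read(buf);
                if (n > 0) {
                    outputs.get(i).write(buf, 0, n);
                    anyOpen = true; // at least one stream still has data
                }
            }
        }
        for (ByteArrayOutputStream out : outputs) System.out.println(out);
    }
}
```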
The typical bottlenecks are:
Network speed: If you are saturating the network speed already, you cannot improve anything.
Network latency: If the latency is the bottleneck, increasing the size of the SFTP request queue may help. Use the ChannelSftp.setBulkRequests method (the default is 16, so use a higher number).
CPU: If the CPU is the bottleneck, you either have to improve efficiency of the encryption implementation, or spread the load across CPU cores. Spreading the encryption load of a single session/connection is tricky and would have to be supported on low-level SSH implementation. I do not think JSch or any other implementation supports that.
Disk: If a disk drive (local or remote) is the bottleneck (unlikely), the parallel transfers as shown above may help, even when using a single connection, if the parallel transfers use a different disk drive each.
For more in-depth information, see my answers to:
Why is FileZilla SFTP file transfer max capped at 1.3MiB/sec instead of saturating available bandwidth? rsync and WinSCP are even slower
Why is FileZilla so much faster than PSFTP?

Processing large number of text files in java

I am working on an application which has to read and process ~29K files (~500GB) every day. The files are in zipped format and available on an FTP server.
What I have done: I download the files from the FTP server, unzip them, and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a small number). I've written some code and tested it for ~3.5K files (~32GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that each file has to be read line by line, and each line has to be written to a new file with some modifications (some information removed and some new information added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow; do it in memory instead.
Maybe there is a lot of garbage collection. See if you can reuse objects.
Maybe the network is the bottleneck, etc.
You can, for example, use visualvm.
It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
Some Ideas:
You can parallelize the processing of the information you read: hand batches of read lines to one thread (out of a pool), which processes them sequentially.
Use java.nio instead of java.io see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of a profiler, simply write log messages and measure the duration in multiple parts of your application
Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
If you are interested in parallel computing, try Apache Spark; it is meant to do exactly what you are looking for.
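The read-line-by-line, modify, write-to-a-new-file requirement could be sketched with java.nio like this; the filter/tag rules and file names are made up, and the fixed pool caps the number of files processed at once:

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class BatchTransform {
    // Read line by line, drop lines we don't want, tag the rest (placeholder rules).
    static Path transform(Path in) throws Exception {
        Path out = in.resolveSibling(in.getFileName() + ".out");
        try (Stream<String> lines = Files.lines(in)) {
            Files.write(out, lines.filter(l -> !l.startsWith("secret"))
                                  .map(l -> l + " [processed]")
                                  .collect(Collectors.toList()));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("batch");
        for (int i = 0; i < 3; i++)
            Files.write(dir.resolve("f" + i + ".txt"), List.of("secret " + i, "keep " + i));

        ExecutorService pool = Executors.newFixedThreadPool(2); // small, fixed pool
        List<Future<Path>> results = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path f : files) results.add(pool.submit(() -> transform(f)));
        }
        long total = 0;
        for (Future<Path> r : results) total += Files.readAllLines(r.get()).size();
        pool.shutdown();
        System.out.println("kept lines: " + total); // one kept line per input file
    }
}
```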

How can I get crawler4j to download all links from a page more quickly?

What I do is:
- crawl the page
- fetch all links on the page and put them in a list
- start a new crawler, which visits each link in the list
- download them
There must be a quicker way, where I can download the links directly when I visit the page? Thx!
crawler4j automatically does this process for you. You first add one or more seed pages; these are the pages that are fetched and processed first. crawler4j then extracts all the links on these pages and passes them to your shouldVisit function. If you really want to crawl all of them, this function should simply return true for all URLs. If you only want to crawl pages within a specific domain, you can check the URL and return true or false based on that.
The URLs for which your shouldVisit returns true are then fetched by crawler threads, and the same process is performed on them.
The example code here is a good sample for starting.
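The filtering logic inside shouldVisit might look like this standalone sketch (the domain and extension list are made up; in real crawler4j code this check lives in your WebCrawler subclass's shouldVisit override, which receives a WebURL rather than a String):

```java
public class VisitFilter {
    // Hypothetical filter, like the body of crawler4j's shouldVisit.
    static boolean shouldVisit(String url) {
        String lower = url.toLowerCase();
        // Skip binary content we don't want to fetch.
        if (lower.matches(".*\\.(gif|jpg|png|pdf|zip)$")) return false;
        // Only stay within one (made-up) domain.
        return lower.startsWith("https://www.example.com/");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("https://www.example.com/page.html")); // in domain
        System.out.println(shouldVisit("https://www.example.com/logo.png"));  // binary
        System.out.println(shouldVisit("https://other.org/page.html"));       // off domain
    }
}
```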
The general approach would be to separate the crawling, and the downloading tasks into separate worker Threads, with a maximum number of Threads, depending on your memory requirements (i.e. maximum RAM you want to use for storing all this info).
However, crawler4j already gives you this functionality. By splitting downloading and crawling into separate threads, you maximize the utilization of your connection, pulling down as much data as your connection, and the servers providing the information, can handle. The natural limitation is that, even if you spawn 1,000 threads, if the servers are only giving you content at 0.3 KB per second, that's still only 300 KB per second you'll be downloading. You just don't have any control over that aspect of it, I'm afraid.
The other way to increase the speed is to run the crawler on a system with a fatter pipe to the internet, since your maximum download speed is, I'm guessing, the limiting factor to how fast you can get data currently. For example, if you were running the crawling on an AWS instance (or any of the cloud application platforms), you would benefit from their extremely high speed connections to backbones, and shorten the amount of time it takes to crawl a collection of websites by effectively expanding your bandwidth far beyond what you're going to get at a home or office connection (unless you work at an ISP, that is).
It's theoretically possible that, in a situation where your pipe is extremely large, the limitation starts to become the maximum write speed of your disk, for any data that you're saving to local (or network) disk storage.
