Inconsistent output from multithreaded FTP InputStreams - java

I'm trying to create a Java program that downloads certain asset files from an FTP server to a local file. Because my (free) FTP server doesn't support file sizes over a few megabytes, I decided to split up the files when they are uploaded and recombine them when the program downloads them. This works, but it is rather slow, because for each file it has to get the InputStream, which takes some time.
The FTP server I use has a way to download the files without actually logging into the server, so I'm using this code to get the InputStream:
private static final InputStream getInputStream(String file) throws IOException {
return new URL("http://site.website.com/path/" + file).openStream();
}
To get the InputStream of a part of the asset file I'm using this code:
public static InputStream getAssetInputStream(String asset, int num) throws IOException, FTPException {
try {
return getInputStream("assets/" + asset + "_" + num + ".raf");
} catch (Exception e) {
// error handling
}
}
Because the getAssetInputStream(String, int) method takes some time to run (especially if the file size is more than a megabyte), I decided to make the code that actually downloads the file multi-threaded. Here is where my problem lies.
final Map<Integer, Boolean> done = new HashMap<Integer, Boolean>();
final Map<Integer, byte[]> parts = new HashMap<Integer, byte[]>();
for (int i = 0; i < numParts; i++) {
final int part = i;
done.put(part, false);
new Thread(new Runnable() {
@Override
public void run() {
try {
InputStream is = FTP.getAssetInputStream(asset, part);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[DOWNLOAD_BUFFER_SIZE];
int len = 0;
while ((len = is.read(buf)) > 0) {
baos.write(buf, 0, len);
curDownload.addAndGet(len);
totAssets.addAndGet(len);
}
parts.put(part, baos.toByteArray());
done.put(part, true);
} catch (IOException e) {
// error handling
} catch (FTPException e) {
// error handling
}
}
}, "Download-" + asset + "-" + i).start();
}
while (done.values().contains(false)) {
try {
Thread.sleep(100);
} catch(InterruptedException e) {
e.printStackTrace();
}
}
File assetFile = new File(dir, "assets/" + asset + ".raf");
assetFile.createNewFile();
FileOutputStream fos = new FileOutputStream(assetFile);
for (int i = 0; i < numParts; i++) {
fos.write(parts.get(i));
}
fos.close();
This code works, but not always. When I run it on my desktop computer, it works almost always. Not 100% of the time, but often it works just fine. On my laptop, which has a far worse internet connection, it almost never works. The result is a file that is incomplete. Sometimes it downloads 50% of the file, sometimes 90%; it differs every time.
Now, if I replace the .start() by .run(), the code works just fine, 100% of the time, even on my laptop. It is, however, incredibly slow, so I'd rather not use .run().
Is there a way I could change my code so it does work multi-threaded? Any help will be appreciated.

Firstly, get your FTP server replaced. There are plenty of free FTP servers that serve files of arbitrary size and offer additional features, but I digress...
Your code seems to have many unrelated problems that could potentially all cause the behavior you are seeing, addressed below:
You have race conditions: the done and parts maps are plain HashMaps that are accessed from multiple threads without any synchronization. This can corrupt the maps and leave the threads with inconsistent views of them, potentially causing done.values().contains(false) to return the wrong answer, so the waiting loop may misjudge whether every part has been stored.
You are calling done.values().contains() repeatedly at a high frequency. While the Javadoc doesn't state it explicitly, a hash map most likely traverses every value in O(n) fashion to check whether it contains a given value. Because other threads are modifying the map at the same time, the behavior is undefined. According to the values() Javadoc:
If the map is modified while an iteration over the collection is in progress (except through the iterator's own remove operation), the results of the iteration are undefined.
You are calling new URL("http://site.website.com/path/" + file).openStream(); but stating that you are using FTP. The http:// in the link defines the protocol that openStream() uses, and http:// is not ftp://. I'm not sure whether this is a typo, whether you meant HTTP, or whether you have an HTTP server serving identical files.
Any thread that throws any kind of Exception will cause the code to fail, since not all parts will ever be marked "completed" (given your busy-wait loop design). Granted, you may have redacted some other logic that guards against this, but otherwise it is a potential problem with the code.
You aren't closing any of the streams you've opened. This could mean that the underlying socket is also left open. Not only does this constitute a resource leak; if the server has some sort of limit on the number of simultaneous connections, new connections will fail because the old, completed transfers were never closed.
Based on the issues above, I propose moving the download logic into a Callable task and running them through an ExecutorService as follows:
LinkedList<Callable<byte[]>> tasksToExecute = new LinkedList<>();
// Populate tasks to run
for(int i = 0; i < numParts; i++){
final int part = i;
// Lambda that downloads a single part and returns its bytes
tasksToExecute.add(() -> {
InputStream is = null;
try{
is = FTP.getAssetInputStream(asset, part);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[DOWNLOAD_BUFFER_SIZE];
int len = 0;
while((len = is.read(buf)) > 0){
baos.write(buf, 0, len);
curDownload.addAndGet(len);
totAssets.addAndGet(len);
}
return baos.toByteArray();
}catch(IOException e){
// handle exception
}catch(FTPException e){
// handle exception
}finally{
if(is != null){
try{
is.close();
}catch(IOException ignored){}
}
}
return null;
});
}
// Retrieve an ExecutorService instance, note the use of work stealing pool is Java 8 only
// This can be substituted for newFixedThreadPool(nThreads) for Java < 8 as well for tight control over number of simultaneous links
ExecutorService executor = Executors.newWorkStealingPool(4);
// Tells the executor to execute all the tasks and give us the results
List<Future<byte[]>> resultFutures = executor.invokeAll(tasksToExecute);
// Populates the file
File assetFile = new File(dir, "assets/" + asset + ".raf");
assetFile.createNewFile();
try(FileOutputStream fos = new FileOutputStream(assetFile)){
// Iterate through the futures, writing them to file in order
for(Future<byte[]> result : resultFutures){
byte[] partData = result.get();
if(partData == null){
// exception occured during downloading this part, handle appropriately
}else{
fos.write(partData);
}
}
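}catch(IOException | InterruptedException | ExecutionException ex){
// handle exception (Future.get() can throw InterruptedException or ExecutionException)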
}
Using the executor service, you further optimize your multi-threading scenario, since the threads themselves are reused, which saves on thread creation costs, and the parts are written to the output file in order. Note that invokeAll() only returns once every task has completed; if you want the file to start being written as soon as the leading parts are available, submit the tasks individually with submit() and call get() on the resulting futures in order.
As mentioned, too many simultaneous connections could cause the server to reject connections (or, even more dangerously, send an EOF that makes you think the part was downloaded). In that case, the number of worker threads can be limited with newFixedThreadPool(nThreads) to ensure that at most nThreads downloads happen concurrently at any given time.
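As a rough sketch of the fixed-pool variant mentioned above: the class name OrderedPartAssembler, the PartSource interface and the maxConnections parameter are all hypothetical stand-ins (PartSource takes the place of the real per-part download), and the result is collected in memory rather than written to a file. Tasks are submitted one by one and the futures are consumed in part order.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OrderedPartAssembler {

    /** Hypothetical stand-in for the real per-part download. */
    interface PartSource {
        byte[] fetch(int part) throws IOException;
    }

    /** Downloads all parts with at most maxConnections running at once and joins them in order. */
    static byte[] assemble(PartSource source, int numParts, int maxConnections)
            throws IOException, InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(maxConnections);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (int i = 0; i < numParts; i++) {
                final int part = i;
                Callable<byte[]> task = () -> source.fetch(part);
                futures.add(executor.submit(task));
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // get() blocks only until this particular part is ready, so part 0 can be
            // appended while later parts are still downloading.
            for (Future<byte[]> future : futures) {
                try {
                    out.write(future.get());
                } catch (ExecutionException e) {
                    throw new IOException("A part failed to download", e.getCause());
                }
            }
            return out.toByteArray();
        } finally {
            executor.shutdownNow(); // cancel any remaining work if we bailed out early
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = assemble(part -> new byte[] {(byte) part}, 5, 2);
        System.out.println(data.length);
    }
}
```

Because a failed part surfaces as an exception instead of a null entry, an incomplete file is never silently produced.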

Related

Two Threads Executing Same Method

I am developing an API request and I'm using multithreading. In the output I'm getting the same request generated twice, by two threads. When I debugged it, I saw that two threads are calling the same method. I need help resolving this issue.
This is my pseudo code
public void run() {
logger.debug("Thread " + currentThread().getName() + " Running");
String message = "";
Connection connection = null;
InputStream fileinput = null;
Properties properties = new Properties();
try {
File file = new File("/home/sridhar.anirudh/eclipse-workspace/API/Change.properties");
fileinput = new FileInputStream(file);
properties.load(fileinput);
soapEndpointUrl = properties.getProperty("endpoint_url");
soapAction = properties.getProperty("soap_action");
} catch (Exception e) {
e.printStackTrace();
}
try {
connection = Database.getInstance().getConnection();
} catch (SQLException e1) {
logger.error("Failed To Get Connection " + e1.getMessage());
return;
}
if (CATEGORY.equalsIgnoreCase("fraudrestriction")) {
String soapResponse = callSoapWebServiceFraudRestriction(soapEndpointUrl, soapAction);
String response_status = "";
if (soapResponse.contains("<tns:Description>SUCCESS</tns:Description>") &&
soapResponse.contains("<tns:Code>ERR_000</tns:Code>")) {
response_status = "SUCCESS";
If you kick off two copies of the thread, they will both run, creating the effect you see.
You can create multiple worker threads, but you need to allocate the work between those workers such that each performs a subset of the total workload.
Since you're (seemingly) parsing and processing a file, and making a network service request in response to that file's contents, it's not clear how you intend to divide up the work. That's the key; to use multiple threads to improve throughput, you the programmer must devise a means of partitioning the work between those threads.
As an analogy, if you have one (human) worker working on a job, simply hiring a second worker won't get the job completed any faster unless the work is divided between those workers. That division is your problem. There's nothing magical about threads that can do this for you.
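To make the idea of dividing the work concrete, here is a minimal sketch under assumed names (PartitionedWork, partition and processAll are hypothetical, and the per-request work is only a placeholder). Each worker receives its own non-overlapping slice of the requests, so no request is handled twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitionedWork {

    /** Splits items into at most `workers` contiguous, non-overlapping slices. */
    static <T> List<List<T>> partition(List<T> items, int workers) {
        List<List<T>> slices = new ArrayList<>();
        int chunk = (items.size() + workers - 1) / workers; // ceiling division
        for (int start = 0; start < items.size(); start += chunk) {
            slices.add(items.subList(start, Math.min(start + chunk, items.size())));
        }
        return slices;
    }

    /** Runs one task per slice and returns how many requests were handled in total. */
    static int processAll(List<String> requests, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> slice : partition(requests, workers)) {
                Callable<Integer> task = () -> {
                    int handled = 0;
                    for (String request : slice) {
                        // Placeholder: the real per-request work (e.g. the web service call) goes here.
                        handled++;
                    }
                    return handled;
                };
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```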

Reading stdout of nodejs from Java (using apache commons exec). Thread safe or not?

I'm trying to write a torrent streaming client in Java using webtorrent-cli, which runs on NodeJS. When installed as a node module, webtorrent-cli gives a nice webtorrent.cmd script which can be used to work with it. When download for a torrent starts, the cli updates the standard output each second with useful details like download speed, % of torrent downloaded, seeds available etc.
To observe such a "dynamic" stdout in Java (with commons exec), I am using the following snippet:
private static Thread processCreator() {
return new Thread(() -> {
try {
// Read stdout in a thread safe manner (hopefully)
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
PumpStreamHandler handler = new PumpStreamHandler(baos);
String command = getCommand();
CommandLine cmd = CommandLine.parse(command);
Executor cmdExecutor = new DefaultExecutor();
cmdExecutor.setStreamHandler(handler);
// Schedule a service to print the content of baos each second
final ScheduledExecutorService service = Executors.newSingleThreadScheduledExecutor();
service.scheduleAtFixedRate(() -> {
try {
// Read and reset atomically
synchronized (baos) {
System.out.println(baos.toString("UTF-8"));
// Resetting so that buffer size doesn't grow arbitrarily
baos.reset();
}
}
catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
}, 0, 1, TimeUnit.SECONDS);
cmdExecutor.execute(cmd);
// Let the remaining bytes be processed
sleep(1000);
// Shutdown
service.shutdown();
} catch (IOException ioe) {
ioe.printStackTrace();
}
});
}
public static void main(String[] args) throws InterruptedException {
Thread process = processCreator();
process.start();
process.join();
}
I'm concerned about how the ByteArrayOutputStream is being written. The class itself is thread safe, but if the implementation writes to the buffer byte by byte, or in such a way that the "updated output" (from webtorrent-cli) is only partially written to the buffer by the time the scheduled service acquires the monitor and starts processing, then that's going to cause problems. In this case, because I'm just printing the content of the buffer, it won't be much trouble, I guess. But I have to process the output and extract a couple of details in the fixed-rate scheduled service.
I can think of a different way to achieve proper coordination (e.g. observe the completeness of an update by marking the event when the buffer receives the bytes that form the first line of webtorrent-cli's stdout, and mark the update as completed when the buffer receives the bytes that form the last line; each update has identical first and last lines, or at least a few bytes at the beginning and end are identical). But that would be a bit more work than this.
My question is: can I be certain that a write to the buffer happens in a single atomic call to ByteArrayOutputStream.write(byte[], ...)? I hope I've explained my question well enough. If you need more details, let me know in the comments. BTW, when the code above is run, the output suggests that coordination is being properly managed. But maybe I'm just lucky that the race condition has been avoided so far?

Efficiently making multiple GET requests to the same url in Java

I need to make multiple GET requests to the same URL but with different queries. I will be doing this on a mobile device (Android), so I need to optimise as much as possible. I learnt from watching an Android web seminar by Google that it takes ~200ms to connect to a server and that there are also various other delays involved in making data calls. I'm just wondering if there's a way I can optimise the process of making multiple requests to the same URL to avoid some of these delays?
I have been using the below method so far but I have been calling it 6 times, one for each GET request.
//Make a GET request to url with headers.
//The function returns the contents of the retrieved file
public String getRequest(String url, String query, Map<String, List<String>> headers) throws IOException{
String getUrl = url + "?" + query;
BufferedInputStream bis = null;
try {
URLConnection connection = new URL(getUrl).openConnection();
for(Map.Entry<String, List<String>> h : headers.entrySet()){
for(String s : h.getValue()){
connection.addRequestProperty(h.getKey(), s);
}
}
bis = new BufferedInputStream(connection.getInputStream());
StringBuilder builder = new StringBuilder();
int byteRead;
while ((byteRead = bis.read()) != -1)
builder.append((char) byteRead);
bis.close();
return builder.toString();
} catch (MalformedURLException e) {
throw e;
} catch (IOException e) {
throw e;
}
}
If you expect a different result for every request, and you cannot combine requests by adding more than one GET variable to the same request, then you cannot avoid the 6 calls.
However, you can use multiple threads to run your requests simultaneously. You may use a thread pool approach with Java's built-in ExecutorService. I would suggest using an ExecutorCompletionService to run your requests. As the processing time is network-bound rather than CPU-bound, you may use more threads than you have CPU cores.
For instance, in some of my projects I use 10+, sometimes 50+ threads (in a thread pool) to retrieve URL data simultaneously, even though I only have 4 CPU cores.
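A minimal sketch of the ExecutorCompletionService approach might look like the following. The class name ParallelRequests and the fetchAll method are hypothetical; each Callable is assumed to wrap one call to something like the getRequest method from the question. Responses are collected in completion order, not submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelRequests {

    /** Runs all requests on a pool of `threads` workers and returns responses as they complete. */
    static List<String> fetchAll(List<Callable<String>> requests, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<String> completion = new ExecutorCompletionService<>(pool);
            for (Callable<String> request : requests) {
                completion.submit(request);
            }
            List<String> responses = new ArrayList<>();
            for (int i = 0; i < requests.size(); i++) {
                // take() blocks until the next finished request is available.
                responses.add(completion.take().get());
            }
            return responses;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the responses must be matched to their queries, each Callable can return a small object carrying both the query and the body instead of a bare String.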

Best practice for reading / writing to a java server socket

How do you design a read and write loop which operates on a single socket (which supports parallel read and write operations)? Do I have to use multiple threads? Is my (java) solution any good? What about that sleep command? How do you use that within such a loop?
I'm trying to use 2 Threads:
Read
public void run() {
InputStream clientInput;
ByteArrayOutputStream byteBuffer;
BufferedInputStream bufferedInputStream;
byte[] data;
String dataString;
int lastByte;
try {
clientInput = clientSocket.getInputStream();
byteBuffer = new ByteArrayOutputStream();
bufferedInputStream = new BufferedInputStream(clientInput);
while(isRunning) {
while ((lastByte = bufferedInputStream.read()) > 0) {
byteBuffer.write(lastByte);
}
data = byteBuffer.toByteArray();
dataString = new String(data);
byteBuffer.reset();
}
} catch (IOException e) {
e.printStackTrace();
}
}
Write
public void run() {
OutputStream clientOutput;
byte[] data;
String dataString;
try {
clientOutput = clientSocket.getOutputStream();
while(isOpen) {
if(!commandQueue.isEmpty()) {
dataString = commandQueue.poll();
data = dataString.getBytes();
clientOutput.write(data);
}
Thread.sleep(1000);
}
clientOutput.close();
}
catch (IOException e) {
e.printStackTrace();
}
catch (InterruptedException e) {
e.printStackTrace();
}
}
Read fails to deliver a proper result, since there is no -1 sent.
How do I solve this issue?
Is this sleep / write loop a good solution?
There are basically three ways to do network I/O:
Blocking. In this mode reads and writes will block until they can be fulfilled, so if you want to do both simultaneously you need separate threads for each.
Non-blocking. In this mode reads and writes will return zero (Java) or in some languages (C) a status indication (return == -1, errno=EAGAIN/EWOULDBLOCK) when they cannot be fulfilled, so you don't need separate threads, but you do need a third API that tells you when the operations can be fulfilled. This is the purpose of the select() API.
Asynchronous I/O, in which you schedule the transfer and are given back some kind of a handle via which you can interrogate the status of the transfer, or, in more advanced APIs, a callback.
You should certainly never use the while (in.available() > 0)/sleep() style you are using here. InputStream.available() has few correct uses and this isn't one of them, and the sleep is literally a waste of time. The data can arrive within the sleep time, and a normal read() would wake up immediately.
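The blocking style with one thread per direction could be sketched roughly as follows. Several things here are assumptions rather than part of the original code: messages are newline-delimited (so the reader does not depend on a -1 ever arriving), the command queue is a BlockingQueue, and the STOP sentinel and the class name BlockingSocketLoops are hypothetical. In real use, the streams would come from clientSocket.getInputStream() and clientSocket.getOutputStream(), with each loop running on its own thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

final class BlockingSocketLoops {

    /** Hypothetical sentinel that tells the writer loop to stop. */
    static final String STOP = "__STOP__";

    /** Writer loop: take() blocks until a command is queued, so no sleep() is needed. */
    static void writeLoop(BlockingQueue<String> commands, OutputStream out)
            throws IOException, InterruptedException {
        while (true) {
            String command = commands.take();
            if (command.equals(STOP)) {
                break;
            }
            out.write((command + "\n").getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
    }

    /** Reader loop: readLine() blocks until a full line arrives; null means the peer closed. */
    static List<String> readLoop(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        List<String> messages = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            messages.add(line); // a real reader would dispatch each message here
        }
        return messages;
    }
}
```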
You should use a boolean variable rather than while(true), so that you can shut the thread down cleanly when you want to. Also, yes, you should create multiple threads, one per connected client, as each thread will block until new data is received (with DataInputStream.read(), for example). And no, this is not really a design question: each library, framework, or language has its own way of listening on a socket. For example, to listen on a socket in Qt you would use what are called "signals and slots", not an infinite loop.

java: decompress files into string too slow

Here is how I compressed the string into a file:
public static void compressRawText(File outFile, String src) {
FileOutputStream fo = null;
GZIPOutputStream gz = null;
try {
fo = new FileOutputStream(outFile);
gz = new GZIPOutputStream(fo);
gz.write(src.getBytes());
gz.flush();
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
gz.close();
fo.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Here is how I decompressed it:
static int BUFFER_SIZE = 8 * 1024;
static int STRING_SIZE = 2 * 1024 * 1024;
public static String decompressRawText(File inFile) {
InputStream in = null;
InputStreamReader isr = null;
StringBuilder sb = new StringBuilder(STRING_SIZE);//constant resizing is costly, so set the STRING_SIZE
try {
in = new FileInputStream(inFile);
in = new BufferedInputStream(in, BUFFER_SIZE);
in = new GZIPInputStream(in, BUFFER_SIZE);
isr = new InputStreamReader(in);
char[] cbuf = new char[BUFFER_SIZE];
int length = 0;
while ((length = isr.read(cbuf)) != -1) {
sb.append(cbuf, 0, length);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
in.close();
} catch (Exception e1) {
e1.printStackTrace();
}
}
return sb.toString();
}
The decompression seems to take forever. I have a feeling that I am doing too many redundant steps in the decompression part. Any idea how I could speed it up?
EDIT: I have modified the code to the above based on the recommendations given below:
1. I changed the pattern to simplify my code a bit, but if I can't use IOUtils, is it still OK to use this pattern?
2. I set the StringBuilder capacity to 2M, as suggested by entonio. Should I set it a little higher? Memory is still OK; I still have around 10M available according to the heap monitor in Eclipse.
3. I cut the BufferedReader and added a BufferedInputStream, but I am still not sure about the BUFFER_SIZE. Any suggestions?
The above modification has reduced the time taken to loop over all my 30 2M files from almost 30 seconds to around 14, but I need to get it under 10. Is that even possible on Android? Basically, I need to process about 60M of text in total, which I have divided into 30 files of 2M each. Before I start processing each string, I timed how long it takes just to loop over all the files and read each file's contents into a String in memory. Since I don't have much experience: would it be better to use 60 files of 1M instead? Or is there any other improvement I should adopt? Thanks.
ALSO: Since physical IO is quite time consuming, and since the compressed versions of my files are all quite small (around 2K from 2M of text), is it possible to still do the above, but on a file that is already mapped into memory, possibly using Java NIO? Thanks.
The BufferedReader's only purpose is the readLine() method, which you don't use, so why not just read from the InputStreamReader? Also, decreasing the buffer size may help. You should probably also specify the encoding both when reading and when writing, though that shouldn't have an impact on performance.
edit: more data
If you know the size of the string ahead of time, you should add a length parameter to decompressRawText and use it to initialise the StringBuilder. Otherwise it will be constantly resized in order to accommodate the result, and that's costly.
edit: clarification
2MB implies a lot of resizes. There is no harm if you specify a capacity higher than the length you end up with after reading (other than temporarily using more memory, of course).
You should wrap the FileInputStream with a BufferedInputStream before wrapping it with a GZIPInputStream, rather than using a BufferedReader.
The reason is that, depending on implementation, any of the various input classes in your decoration hierarchy could decide to read on a byte-by-byte basis (and I'd say the InputStreamReader is most likely to do this). And that would translate into many read(2) calls once it gets to the FileInputStream.
Of course, this may just be superstition on my part. But, if you're running on Linux, you can always test with strace.
Edit: one nice pattern to follow when building up a bunch of stream delegates is to use a single InputStream variable. Then you only have one thing to close in your finally block (and can use Jakarta Commons IOUtils to avoid lots of nested try-catch-finally blocks).
InputStream in = null;
try
{
in = new FileInputStream("foo");
in = new BufferedInputStream(in);
in = new GZIPInputStream(in);
// do something with the stream
}
finally
{
IOUtils.closeQuietly(in);
}
Add a BufferedInputStream between the FileInputStream and the GZIPInputStream.
Similarly when writing.
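Putting the suggestions from the answers together, a round trip could be sketched as below. This assumes try-with-resources is available (Java 7 or later, which may not hold for older Android targets) and that UTF-8 is an acceptable encoding; the class name GzipText and the expectedChars parameter are hypothetical. Closing the outermost stream also closes the streams it wraps.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipText {

    /** Writes text as gzip-compressed UTF-8, buffering between the gzip stream and the file. */
    static void compress(File outFile, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(outFile))),
                StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    /** Reads gzip-compressed UTF-8 text; expectedChars pre-sizes the builder to avoid resizing. */
    static String decompress(File inFile, int expectedChars) throws IOException {
        StringBuilder sb = new StringBuilder(expectedChars);
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(inFile))),
                StandardCharsets.UTF_8)) {
            char[] buf = new char[8 * 1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```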
