HDFS read using multithreading - java

I am reading files from an HDFS directory with multiple threads, using a Producer-Consumer model backed by a BlockingQueue.
Here is my code.
Producer class:
public void readURLS() {
final int capacity = Integer.MAX_VALUE;
BlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);
try {
FileSystem hdfs = FileSystem.get(hadoopConf);
FileStatus[] status = hdfs.listStatus(new Path("MYHDFS_PATH"));
int i = 0;
for (FileStatus file : status) {
LOG.info("Thread {} started: ", i++);
LOG.info("Reading file {} ", file.getPath().getName());
new Thread(new FetchData(queue, file.getPath(), hadoopConf)).start();
}
} catch (IOException e) {
LOG.error("IOException occured while listing files from HDFS directory");
}
}
FetchData:
@Override
public void run() {
LOG.info("Inside reader to start reading the files ");
try (BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader
(FileSystem.get(hadoopConf).open(file), StandardCharsets.UTF_8))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
if (Thread.interrupted()) {
throw new InterruptedException();
}
LOG.info("Line is :{}", line);
queue.put(line);
}
} catch (IOException e) {
LOG.error("file : {} ", file.toString());
// run() cannot throw a checked IOException, so wrap it
throw new UncheckedIOException(e);
} catch (InterruptedException e) {
LOG.error("An error has occurred: ", e);
Thread.currentThread().interrupt();
}
}
While executing the code, it throws an InterruptedIOException:
java.io.IOException: Failed on local exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected
Any idea why? My intent is to loop over each file and read each one with a separate thread.

I'm seeing the same behavior when using HDFS from multiple (many!) threads, and I don't know the answer to the question "why?", but keeping the number of threads that access HDFS concurrently low seems to help.
In your case I would recommend using an ExecutorService with a limited number of threads, and fine-tuning that number until you no longer get exceptions.
So, create the ExecutorService (with 10 threads as a starting point):
final ExecutorService executorService = Executors.newFixedThreadPool(10);
and instead of your
new Thread(new FetchData(queue, file.getPath(), hadoopConf)).start();
do
executorService.submit(new FetchData(queue, file.getPath(), hadoopConf));
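Putting it together, readURLS might then look roughly like this (the pool size, the shutdown handling, and the log messages are assumptions, not part of the original code):
public void readURLS() {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    ExecutorService executorService = Executors.newFixedThreadPool(10);
    try {
        FileSystem hdfs = FileSystem.get(hadoopConf);
        for (FileStatus file : hdfs.listStatus(new Path("MYHDFS_PATH"))) {
            LOG.info("Submitting reader for file {}", file.getPath().getName());
            executorService.submit(new FetchData(queue, file.getPath(), hadoopConf));
        }
    } catch (IOException e) {
        LOG.error("IOException occurred while listing files from HDFS directory", e);
    } finally {
        // Stop accepting new tasks; already-submitted FetchData tasks keep running.
        executorService.shutdown();
    }
}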
Another improvement: since org.apache.hadoop.fs.FileSystem implements Closeable, you should close it. In your code every thread creates a new FileSystem instance but never closes it, so I would extract it into a variable inside your try:
try (FileSystem fileSystem = FileSystem.get(hadoopConf);
BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader
(fileSystem.open(file), StandardCharsets.UTF_8))) {
UPDATE:
Although the above code seems to be the right approach for the Closeable objects, by default FileSystem.get will return cached instances from the
/** FileSystem cache */
static final Cache CACHE = new Cache();
and thus things will break horribly when close() is called on them.
You could either disable the FileSystem cache by setting the fs.hdfs.impl.disable.cache config parameter to true, or make sure the FileSystem instance(s) are only closed once all workers have finished. It also seems that you could use a single FileSystem instance for all your workers, although I can't find any confirmation in the javadocs that this will work properly without extra synchronisation.
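For illustration, a minimal sketch of the cache-disabling option (only the property key comes from the paragraph above; the rest of the wiring is assumed):
Configuration hadoopConf = new Configuration();
// Ask FileSystem.get to return fresh, uncached instances, so each worker
// may safely close the FileSystem it opened in its try-with-resources block.
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true);

// Alternative: keep the cache, share one FileSystem.get(hadoopConf) instance across
// all workers, and close it only after executorService.awaitTermination(...) returns.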

Related

Cancel eclipse plugin job

In an Eclipse plugin Job I run a process that performs long operations without much output.
I want to be able to cancel the Job if the user requests it, but with the implementation below, the Job does not stop unless the process prints something to the output stream.
Process process = new ProcessBuilder(args).start();
Scanner scanner = new Scanner(
(new InputStreamReader(
process.getInputStream(), UTF_8_CHARSET)));
while (scanner.hasNext())
{
if (monitor.isCanceled()) {
// user canceled the job, destroy the process to return
process.destroy();
break;
}
CONSOLE.getStream().println(scanner.nextLine());
}
try {
process.waitFor();
} catch (InterruptedException e) {
Activator.log(ERROR, e.getMessage(), e);
} finally {
process.destroyForcibly();
}
Do I have other options to handle the cancelling of the job and to stop the process faster instead of waiting for a new line feed?
You should put the code that reads the process output stream into a separate thread, and in your main loop just wait for the process to end with a short timeout so you can check for cancellation.
So the main code would be something like:
Process process = new ProcessBuilder(args).start();
// Read standard output
new StreamConsumer(process.getInputStream()).start();
// You should also always read stderr
new StreamConsumer(process.getErrorStream()).start();
// Wait for process to end and check for cancel
while (!process.waitFor(100, TimeUnit.MILLISECONDS)) {
if (monitor.isCanceled()) {
// user canceled the job, destroy the process to return
process.destroy();
break;
}
}
And StreamConsumer is something like:
public class StreamConsumer extends Thread
{
private final InputStream _input;
public StreamConsumer(final InputStream inputStream)
{
super();
_input = inputStream;
setDaemon(true);
}
@Override
public void run()
{
// TODO your code to read the stream
}
}
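One hypothetical way to fill in that TODO (the console call mirrors the question's code; the charset is an assumption) is to simply drain the stream line by line:
@Override
public void run()
{
    // Drain the stream so the child process never blocks on a full pipe buffer.
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(_input, StandardCharsets.UTF_8)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            CONSOLE.getStream().println(line); // or log/discard as appropriate
        }
    }
    catch (IOException e)
    {
        // The stream was closed because the process ended or was destroyed.
    }
}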
Instead of a Scanner, use a BufferedReader, which provides the ready() method: a non-blocking way of telling whether there is something to read, unlike Scanner's nextLine(), which blocks until something is actually read.
Small example implementation:
volatile static boolean stop = false;
public static void main(String[] args) throws InterruptedException
{
// Create a new thread that reads input from System.in
new Thread(() -> {
BufferedReader read = new BufferedReader(new InputStreamReader(System.in));
while (!stop)
{
try
{
// If there is something to read read it and print it
if (read.ready())
{
System.out.println(read.readLine());
}
}
catch (IOException e)
{
// Do some handling...
}
}
}).start();
// Wait 5 seconds then stop the thread by making the flag false.
Thread.sleep(5000);
stop = true;
}
Your flag would obviously be driven by the Process. The takeaway is to use a non-blocking operation (a peek to see whether there is actually something to print) instead of a blocking one.
In my case I didn't know how long the process would run, so a timeout was not an option.
My alternative solution is to start a separate monitor thread that watches the state of the IProgressMonitor. If the progress monitor enters the cancelled state, the executing thread is interrupted.
public static void startMonitorThread(IProgressMonitor monitor) {
final Thread runThread = Thread.currentThread();
new Thread("Monitor") {
@Override
public void run() {
while (runThread.isAlive() && !monitor.isCanceled()) {
try {
Thread.sleep(100);
} catch (InterruptedException exception) {
break;
}
}
if (runThread.isAlive()) {
runThread.interrupt();
}
};
}.start();
}
The next snippet shows an example of how this can be used:
startMonitorThread(monitor);
final Process process = new ProcessBuilder("...").start();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
while (process.isAlive()) {
if (reader.ready()) {
final String line = reader.readLine();
// ... do something with line
} else {
Thread.sleep(10);
}
}
}
At first I made the same mistake of using a Scanner, which blocks (so the interruption didn't work). As Ben's answer mentions, it is better to use a BufferedReader and the non-blocking ready() method to check whether there is more to read.

Thread race condition just hangs while using PipedOutputStream

I am using piped streams to convert an OutputStream to an InputStream, because the AWS Java SDK does not allow putting objects on S3 using OutputStreams.
I'm using the code below; however, it will intermittently just hang. This code is in a web application, but currently there is no load on the application...I am just trying it out on my personal computer.
ByteArrayOutputStream os = new ByteArrayOutputStream();
PipedInputStream inpipe = new PipedInputStream();
final PipedOutputStream out = new PipedOutputStream(inpipe);
try {
String xmpXml = "<dc:description>somedesc</dc:description>"
JpegXmpRewriter rewriter = new JpegXmpRewriter();
rewriter.updateXmpXml(isNew1,os, xmpXml);
new Thread(new Runnable() {
public void run () {
try {
// write the original OutputStream to the PipedOutputStream
println "starting writeto"
os.writeTo(out);
out.close();
println "ending writeto"
} catch (IOException e) {
System.out.println("Some exception)
}
}
}).start();
ObjectMetadata metadata1 = new ObjectMetadata();
metadata1.setContentLength(os.size());
client.putObject(new PutObjectRequest("test-bucket", "167_sample.jpg", inpipe, metadata1));
}
catch (Exception e) {
System.out.println("Some exception")
}
finally {
isNew1.close()
os.close()
}
Instead of bothering with the complexities of starting another thread, instantiating two concurrent classes, and then passing data from thread to thread, all to solve nothing but a minor limitation in the provided JDK API, you should just create a simple specialization of the ByteArrayOutputStream:
class BetterByteArrayOutputStream extends ByteArrayOutputStream {
public ByteArrayInputStream toInputStream() {
return new ByteArrayInputStream(buf, 0, count);
}
}
This converts it to an input stream with no copying.
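Roughly, the upload path from the question could then look like this (a sketch; the rewriter call, bucket, and key mirror the question, and error handling is omitted):
BetterByteArrayOutputStream os = new BetterByteArrayOutputStream();
new JpegXmpRewriter().updateXmpXml(isNew1, os, xmpXml); // same call as in the question

ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(os.size());
client.putObject(new PutObjectRequest("test-bucket", "167_sample.jpg", os.toInputStream(), metadata));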

Exception propagation within PipedInputStream and PipedOutputStream

I have a data producer that runs in a separate thread and pushes generated data into PipedOutputStream which is connected to PipedInputStream. A reference of this input stream is exposed via public API so that any client can use it. The PipedInputStream contains a limited buffer which, if full, blocks the data producer. Basically, as the client reads data from the input stream, new data is generated by the data producer.
The problem is that the data producer may fail and throw an exception, and since the producer runs in a separate thread, there is no nice way to get that exception to the client.
What I do is catch the exception and close the input stream. That results in an IOException with the message "Pipe closed" on the client side, but I would really like to give the client the real reason behind it.
This is a rough code of my API:
public InputStream getData() {
final PipedInputStream inputStream = new PipedInputStream(config.getPipeBufferSize());
final PipedOutputStream outputStream = new PipedOutputStream(inputStream);
Thread thread = new Thread(() -> {
try {
// Start producing the data and push it into output stream.
// The production may fail and throw an Exception with the reason
} catch (Exception e) {
try {
// What to do here?
outputStream.close();
inputStream.close();
} catch (IOException e1) {
}
}
});
thread.start();
return inputStream;
}
I have two ideas how to fix that:
Store the exception in the parent object and expose it to the client via the API, i.e. if reading fails with an IOException, the client can ask the API for the reason.
Extend / re-implement the piped streams so that I could pass a reason to the close() method. Then the IOException thrown by the stream could contain that reason as a message.
Any better ideas?
Coincidentally, I just wrote similar code to allow GZip compression of a stream. You don't need to extend PipedInputStream; a FilterInputStream will do. Return a wrapped version, e.g.:
final PipedInputStream in = new PipedInputStream();
final InputStreamWithFinalExceptionCheck inWithException = new InputStreamWithFinalExceptionCheck(in);
final PipedOutputStream out = new PipedOutputStream(in);
Thread thread = new Thread(() -> {
try {
// Start producing the data and push it into output stream.
// The production may fail and throw an Exception with the reason
} catch (final IOException e) {
inWithException.fail(e);
} finally {
inWithException.countDown();
}
});
thread.start();
return inWithException;
And then InputStreamWithFinalExceptionCheck is just
private static final class InputStreamWithFinalExceptionCheck extends FilterInputStream {
private final AtomicReference<IOException> exception = new AtomicReference<>(null);
private final CountDownLatch complete = new CountDownLatch(1);
public InputStreamWithFinalExceptionCheck(final InputStream stream) {
super(stream);
}
@Override
public void close() throws IOException {
try {
complete.await();
final IOException e = exception.get();
if (e != null) {
throw e;
}
} catch (final InterruptedException e) {
throw new IOException("Interrupted while waiting for synchronised closure");
} finally {
super.close();
}
}
public void fail(final IOException e) {
exception.set(Preconditions.checkNotNull(e));
}
public void countDown() {complete.countDown();}
}
This is my implementation, derived from the accepted answer above (https://stackoverflow.com/a/33698661/5165540), where I don't use the CountDownLatch complete.await(), since it would cause a deadlock if the InputStream were abruptly closed before the writer had finished writing the full content.
I still record the exception caught while the PipedOutputStream is in use, and I create the PipedOutputStream in the spawned thread, using try-with-resources to ensure it gets closed, while the Supplier waits until the two streams are piped.
Supplier<InputStream> streamSupplier = new Supplier<InputStream>() {
@Override
public InputStream get() {
final AtomicReference<IOException> osException = new AtomicReference<>();
final CountDownLatch piped = new CountDownLatch(1);
final PipedInputStream is = new PipedInputStream();
FilterInputStream fis = new FilterInputStream(is) {
@Override
public void close() throws IOException {
try {
IOException e = osException.get();
if (e != null) {
//Exception thrown by the write will bubble up to InputStream reader
throw new IOException("IOException in writer", e);
}
} finally {
super.close();
}
};
};
Thread t = new Thread(() -> {
try (PipedOutputStream os = new PipedOutputStream(is)) {
piped.countDown();
writeIozToStream(os, projectFile, dataFolder);
} catch (final IOException e) {
osException.set(e);
}
});
t.start();
try {
piped.await();
} catch (InterruptedException e) {
t.interrupt();
Thread.currentThread().interrupt();
}
return fis;
}
};
Calling code is something like
try (InputStream is = streamSupplier.get()) {
//Read stream in full
}
So when the InputStream is closed, this is signalled to the PipedOutputStream, eventually causing a "Pipe closed" IOException, which is ignored at that point.
If I instead kept the complete.await() line in the FilterInputStream close(), I could suffer a deadlock (the PipedInputStream trying to close, waiting on complete.await(), while the PipedOutputStream waits forever in PipedInputStream awaitSpace).

Handling multiple threads in a Client-Server architecture for File synchronization

For a homework assignment I had to implement a Client-Server application in Java that automatically updates txt files in a server directory based on changes to the corresponding files on the client side ("had to", because I'm past the deadline).
I have a package that handles the changes in the files correctly, but I'm stumped about how to handle changes in multiple files. My approach was to use a separate thread for each file in the client directory, with a corresponding thread for each file on the server side. This approach works for a single file, but not for multiple files.
The code below is on the client side and calls each file's thread's CheckFileState method to handle the updates.
while(true){
for (Map.Entry<String, SynchronisedFile> entry : fileList.entrySet()) {
try {
System.err.println("SyncTest: calling fromFile.CheckFileState()");
sstt.start();
entry.getValue().CheckFileState();
} catch (IOException e) {
e.printStackTrace();
System.exit(-1);
} catch (InterruptedException e) {
e.printStackTrace();
System.exit(-1);
}
try {
Thread.sleep(10000);
} catch (InterruptedException e) {
e.printStackTrace();
System.exit(-1);
}
}
}
And on the server side, if I start a single thread using:
Thread sstt = new Thread(new SyncThreadServer(sfileList.entrySet().iterator().next().getValue(),clientSocket));
sstt.start();
It works as expected. But if I start the server-side threads at the same time (each of which contains methods for decoding the JSON messages from the input stream) using:
for (Map.Entry<String, SynchronisedFile> entry : sfileList.entrySet())
{
Thread sstt = new Thread(new SyncThreadServer(entry.getValue(),clientSocket));
sstt.setName(entry.getKey());
sstt.start();
}
Threads start reading JSON messages from the input stream that were intended for other files' threads. I'd like to be able to stop the server-side loop from starting the next thread, at least until the CheckFileState method is complete for one file/thread. But I might still run into problems after the initial stage, when all the threads are running at the same time. Any solutions on how to handle multiple threads in this case? (All threads use a single socket.)
Edit: As I understand it, this has to do with synchronization. Threads for other files on the server are accessing the input stream before the first thread has even finished processing the input meant for it. The code of the server thread is below. I need to somehow block the other threads from accessing the input stream until the first one has finished using it. Any help would be much appreciated. Thanks.
public class SyncThreadServer implements Runnable {
SynchronisedFile toFile; // this would be on the Server //Is an instance of the syncfile class, should be able to proc insts
Socket clientSocket;
public SyncThreadServer(SynchronisedFile tf, Socket aClientSocket){
toFile=tf;
clientSocket = aClientSocket;
}
@Override
public void run() {
Instruction inst = null;
InstructionFactory instFact=new InstructionFactory();
while(true){
{
try{
DataInputStream in = new DataInputStream(clientSocket.getInputStream());
DataOutputStream out = new DataOutputStream(clientSocket.getOutputStream());
String smsg = in.readUTF();
Instruction receivedInst = instFact.FromJSON(smsg);
System.err.println(smsg);
// The Server processes the instruction
toFile.ProcessInstruction(receivedInst);
//if(receivedInst.Type().equals("EndUpdate")){
// out.writeUTF("NEXT"); //TODO: Change to Json
// out.flush();}
//else
//{
out.writeUTF("GO"); //TODO: Change to Json
out.flush();
}
//}
catch (IOException e) {
e.printStackTrace();
System.exit(-1); // just die at the first sign of trouble
} catch (BlockUnavailableException e) {
// The server does not have the bytes referred to by the block hash.
try {
DataOutputStream out2 = new DataOutputStream(clientSocket.getOutputStream());
DataInputStream in2 = new DataInputStream(clientSocket.getInputStream());
out2.writeUTF("NGO"); //TODO: Change to Json
out2.flush();
String msg2 = in2.readUTF();
Instruction receivedInst2 = instFact.FromJSON(msg2);
toFile.ProcessInstruction(receivedInst2);
if(receivedInst2.Type().equals("EndUpdate")){
out2.writeUTF("NEXT"); //TODO: Change to Json
out2.flush();}
else
{
out2.writeUTF("GO"); //TODO: Change to Json
out2.flush();
}
} catch (IOException e1) {
e1.printStackTrace();
System.exit(-1);
} catch (BlockUnavailableException e1) {
assert(false); // a NewBlockInstruction can never throw this exception
}
}
// } //And here
}
}
}
}

Synchronized access to file

I want to implement my own logger that writes logs to a file. It can be called from many threads, and the issue is how to synchronize access to the log file.
private synchronized static void writeToFile(String tag, String msg,
Throwable tr, Context ctx) {
try {
String s = System.getProperty("line.separator");
File f = Environment.getExternalStorageDirectory();
Log.i(TAG, "Path to app " + f.getAbsolutePath());
File l = new File(f, "log.txt");
if (!l.exists()) {
l.createNewFile();
}
String e = Log.getStackTraceString(tr);
StringBuilder b = new StringBuilder();
b.append(HttpCommand.getDateForUrl(System.currentTimeMillis()));
b.append(tag);
b.append(msg);
b.append(e);
b.append(s);
OutputStream out = new FileOutputStream(l);
out.write(b.toString().getBytes());
out.flush();
out.close();
} catch (IOException e) {
Log.e(TAG, "Failed to create backup");
}
}
Is it enough to synchronize access by locking on the class, as below, if I call it from different threads?
synchronized(X.class) {
writeToFile();
}
Since your writeToFile() method is synchronized, you don't have to wrap calls to it in a synchronized block.
See here for more info:
http://docs.oracle.com/javase/tutorial/essential/concurrency/locksync.html
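To illustrate the equivalence (MyLogger is a placeholder for your logger class): a static synchronized method locks on the Class object, so the explicit block below acquires the very same monitor and is redundant.
// Inside the logger class: both of these acquire the same monitor, MyLogger.class.
writeToFile(tag, msg, tr, ctx);              // static synchronized method locks MyLogger.class

synchronized (MyLogger.class) {              // explicit block on the same lock - redundant
    writeToFile(tag, msg, tr, ctx);
}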
You can use a ConcurrentLinkedQueue: put all messages onto the queue, and create a thread that takes messages from the queue and writes them to the file.
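A minimal sketch of that approach (all names are placeholders, not a drop-in implementation):
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.ConcurrentLinkedQueue;

class QueuedFileLogger {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

    QueuedFileLogger(File logFile) {
        Thread writer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter(new FileWriter(logFile, true))) {
                while (!Thread.currentThread().isInterrupted()) {
                    String line = queue.poll();
                    if (line != null) {
                        out.println(line);   // only this thread ever touches the file
                        out.flush();
                    } else {
                        Thread.sleep(50);    // queue empty, back off briefly
                    }
                }
            } catch (IOException | InterruptedException e) {
                // writer shutting down
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    void log(String line) {
        queue.offer(line);                   // safe to call from any thread
    }
}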
