When I run this test
import java.io.File;
import java.io.FileWriter;

public class Test extends Thread {
String str;
Test(String s) {
this.str = s;
}
@Override
public void run() {
try {
FileWriter fw = new FileWriter("1.txt", true);
for (char c : str.toCharArray()) {
System.out.print(c);
fw.write(c);
}
fw.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) throws Exception {
new File("1.txt").delete();
new Test("11111111111111111111").start();
new Test("22222222222222222222").start();
}
}
the console shows the exact order in which the characters are written to 1.txt:
2222222222222222111211111211111121211111
but in 1.txt I see a different result
2222222222222222222211111111111111111111
why is that?
Intermediate buffers. Modern operating systems usually buffer file writes so that whole sectors can be written at once, avoiding excessive hard-drive head seeks, allowing DMA transfers, and so on.
This may be an example of asynchronous I/O writes.
The kernel updates the corresponding pages in the page cache and marks them dirty (meaning they still need to be written to disk). Control then quickly returns to the threads, which keep running and printing to the console in whatever order the scheduler runs them. The dirty data is flushed to disk later, at a more convenient time (low CPU load) and in a more efficient way (sequentially bunched writes), which is why each thread's characters end up in the file as one contiguous run.
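To see the effect of buffering for yourself, here is a small variation of the test above (a sketch, not part of the original question; the file name 2.txt is arbitrary) that flushes the FileWriter after every single character. With each character pushed through the writer's buffer immediately, the ordering in the file comes out much closer to the console output, although the two are still not guaranteed to match exactly, since the console print and the file write are not performed atomically as a pair:

import java.io.File;
import java.io.FileWriter;

public class FlushTest extends Thread {
    private final String str;

    FlushTest(String s) {
        this.str = s;
    }

    @Override
    public void run() {
        try (FileWriter fw = new FileWriter("2.txt", true)) {
            for (char c : str.toCharArray()) {
                System.out.print(c);
                fw.write(c);
                fw.flush(); // force the character out of the writer's buffer right away
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new File("2.txt").delete();
        new FlushTest("11111111111111111111").start();
        new FlushTest("22222222222222222222").start();
    }
}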
This is actually a design question / problem. And I am not sure if writing and reading the file is an ideal solution here. Nonetheless, I will outline what I am trying to do below:
I have the following static method; once the reqStreamingData method of cs is called, it starts retrieving data from the client server constantly, at a rate of one record every 150 milliseconds.
public static void streamingDataOperations(ClientSocket cs) throws InterruptedException, IOException{
// call - retrieve streaming data constantly from client server,
// and write a line in the csv file at a rate of 150 milliseconds
// using bufferedWriter and printWriter (print method).
// Note that the flush method of bufferedWriter is never called,
// I would assume the data is in fact being written in buffered memory
// not the actual file.
cs.reqStreamingData(output_file); // <- this method comes from client's API.
// I would like to add another thread (the data processing thread) that repeats every 15 minutes.
// I am aware I can do that by creating a class that extends TimerTask and fixing a schedule.
// Now when this thread runs, there are things I want to do.
// 1. flush the last 15 minutes of data to the output_file (note: no synchronized methods or statements are used here, so no object is being locked)
// 2. process the data in R
// 3. wait for the output in R to come back
// 4. clear the file contents, so that it always stores only the data from the last 15 minutes
}
Now, I am not well versed in multithreading. My concerns are:
The request-data thread and the data-processing thread read and write the file simultaneously, but at different rates. I am not sure whether the data-processing thread would delay the request-data thread by a significant amount, since it has a much more computationally heavy task to carry out. Given that they are two separate threads, would any error or exception occur here?
I am not keen on the idea of writing and reading the same file at the same time, but because I have to use R to process and store the data in an R data frame in real time, I really cannot think of another way to approach this. Are there any better alternatives?
Is there a better design to tackle this problem?
I understand that this is a lengthy problem. Please let me know if you need more information.
The lines (CSV, or any other text) can be written to a temporary file. When the processing side is ready to pick the data up, the only synchronization needed is around the moment the temporary file is swapped for a new one. This guarantees that the producer never writes to the file the consumer is currently processing.
Once that is done, the producer continues adding lines to the newer file, while the consumer flushes and closes the old one and then moves it to the location your R application expects.
To further clarify the approach, here is a sample implementation:
public static void main(String[] args) throws IOException {
// in this sample these dirs are supposed to exist
final String workingDirectory = "./data/tmp";
final String outputDirectory = "./data/csv";
final String outputFilename = "r.out";
final int addIntervalSeconds = 1;
final int drainIntervalSeconds = 5;
final FileBasedTextBatch batch = new FileBasedTextBatch(Paths.get(workingDirectory));
final ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
final ScheduledFuture<?> producer = executor.scheduleAtFixedRate(
() -> batch.add(
// adding formatted date/time to imitate another CSV line
LocalDateTime.now().format(DateTimeFormatter.ISO_DATE_TIME)
),
0, addIntervalSeconds, TimeUnit.SECONDS);
final ScheduledFuture<?> consumer = executor.scheduleAtFixedRate(
() -> batch.drainTo(Paths.get(outputDirectory, outputFilename)),
0, drainIntervalSeconds, TimeUnit.SECONDS);
try {
// awaiting some limited time for demonstration
producer.get(30, TimeUnit.SECONDS);
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
catch (ExecutionException e) {
System.err.println("Producer failed: " + e);
}
catch (TimeoutException e) {
System.out.println("Finishing producer/consumer...");
producer.cancel(true);
consumer.cancel(true);
}
executor.shutdown();
}
static class FileBasedTextBatch {
private final Object lock = new Object();
private final Path workingDir;
private Output output;
public FileBasedTextBatch(Path workingDir) throws IOException {
this.workingDir = workingDir;
output = new Output(this.workingDir);
}
/**
* Adds another line of text to the batch.
*/
public void add(String textLine) {
synchronized (lock) {
output.writer.println(textLine);
}
}
/**
* Moves currently collected batch to the file at the specified path.
* The file will be overwritten if it exists.
*/
public void drainTo(Path targetPath) {
try {
final long startNanos = System.nanoTime();
final Output output = getAndSwapOutput();
final long elapsedMillis =
TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
System.out.printf("Replaced the output in %d millis%n", elapsedMillis);
output.close();
Files.move(
output.file,
targetPath,
StandardCopyOption.ATOMIC_MOVE,
StandardCopyOption.REPLACE_EXISTING
);
}
catch (IOException e) {
System.err.println("Failed to drain: " + e);
throw new IllegalStateException(e);
}
}
/**
* Replaces the current output with the new one, returning the old one.
* The method is supposed to execute very quickly to avoid delaying the producer thread.
*/
private Output getAndSwapOutput() throws IOException {
synchronized (lock) {
final Output prev = this.output;
this.output = new Output(this.workingDir);
return prev;
}
}
}
static class Output {
final Path file;
final PrintWriter writer;
Output(Path workingDir) throws IOException {
// performs very well on local filesystems when the working directory is empty;
// if too slow, it may be replaced with UUID-based name generation
this.file = Files.createTempFile(workingDir, "csv", ".tmp");
this.writer = new PrintWriter(Files.newBufferedWriter(this.file));
}
void close() {
if (this.writer != null) {
this.writer.flush();
this.writer.close();
}
}
}
A part of my application writes data to a .csv file in the following way:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class ExampleWriter {
public static final int COUNT = 10_000;
public static final String FILE = "test.csv";
public static void main(String[] args) throws Exception {
try (OutputStream os = new FileOutputStream(FILE)){
os.write(239);
os.write(187);
os.write(191);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
for (int i = 0; i < COUNT; i++) {
writer.write(Integer.toString(i));
writer.newLine();
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println(checkLineCount(COUNT, new File(FILE)));
}
public static String checkLineCount(int expectedLineCount, File file) throws Exception {
BufferedReader expectedReader = new BufferedReader(new FileReader(file));
try {
int lineCount = 0;
while (expectedReader.readLine() != null) {
lineCount++;
}
if (expectedLineCount == lineCount) {
return "correct";
} else {
return "incorrect";
}
}
finally {
expectedReader.close();
}
}
}
The file will be opened in Excel, and all kinds of languages are present in the data. The os.write calls prefix the file with a UTF-8 byte order mark so that all characters display correctly.
Somehow the number of lines in the file does not match the count in the loop, and I cannot figure out why. Any help on what I am doing wrong here would be greatly appreciated.
You simply need to flush and close your writer (so the buffered characters actually reach the file) before opening the file for input and counting. Try adding:
writer.flush();
writer.close();
inside your try block, after the for-loop in the main method.
(As a side note).
Note that using a BOM is optional and (in many cases) reduces the portability of your files, because not all consuming applications are able to handle it well. It does not guarantee that the file has the advertised character encoding. So I would recommend removing the BOM. When using Excel, just select the file and choose UTF-8 as the encoding.
You are not flushing the stream. Refer to the Oracle docs for OutputStream.flush() for more info,
which say:
Flushes this output stream and forces any buffered output bytes to be
written out. The general contract of flush is that calling it is an
indication that, if any bytes previously written have been buffered by
the implementation of the output stream, such bytes should immediately
be written to their intended destination. If the intended destination
of this stream is an abstraction provided by the underlying operating
system, for example a file, then flushing the stream guarantees only
that bytes previously written to the stream are passed to the
operating system for writing; it does not guarantee that they are
actually written to a physical device such as a disk drive.
The flush method of OutputStream does nothing.
You need to flush as well as close the stream. There are two ways:
manually call close() and flush().
use try-with-resources
As I can see from your code, you have already used try-with-resources, and the BufferedWriter class also implements Closeable and Flushable, so declare the writer inside the try-with-resources statement as shown below:
public static void main(String[] args) throws Exception {
try (OutputStream os = new FileOutputStream(FILE); BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8))){
os.write(239);
os.write(187);
os.write(191);
for (int i = 0; i < COUNT; i++) {
writer.write(Integer.toString(i));
writer.newLine();
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println(checkLineCount(COUNT, new File(FILE)));
}
When COUNT is 1, the code in main() will write a file with two lines, a line with data plus an empty line afterwards. Then you call checkLineCount(COUNT, file) expecting that it will return 1 but it returns 2 because the file has actually two lines.
Therefore if you want the counter to match you must not write a new line after the last line.
(As another side note).
Notice that writing CSV files the way you are doing it is really bad practice. CSV is not as easy as it may look at first sight! So, unless you really know what you are doing (and are aware of all the CSV quirks), use a library!
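For example (a sketch, assuming the Apache Commons CSV library is on the classpath; any other CSV library works along the same lines), the library takes care of quoting, escaping and record separators for you:

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class CsvWithLibrary {
    public static void main(String[] args) throws IOException {
        try (Writer writer = Files.newBufferedWriter(Paths.get("test.csv"), StandardCharsets.UTF_8);
             CSVPrinter printer = new CSVPrinter(writer, CSVFormat.DEFAULT)) {
            for (int i = 0; i < 10_000; i++) {
                // values containing commas, quotes or newlines are escaped by the library
                printer.printRecord(i, "some text, with a comma", "another column");
            }
        }
    }
}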
This is not the same as this question: JUnit: How to simulate System.in testing?, which is about mocking stdin.
What I want to know is how to test (as in TDD) that a simple Java class with a main method waits for input.
My test:
@Test
public void appRunShouldWaitForInput(){
long startMillis = System.currentTimeMillis();
// NB obviously you'd want to run this next line in a separate thread with some sort of timeout mechanism...
// that's an implementation detail I've omitted for the sake of avoiding clutter!
App.main( null );
long endMillis = System.currentTimeMillis();
assertThat( endMillis - startMillis ).isGreaterThan( 1000L );
}
My SUT main:
public static void main(String args[]) {
BufferedReader br = null;
try {
br = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter something : ");
String input = br.readLine();
} catch (IOException e) {
e.printStackTrace();
}
... test fails. The code does not wait. But when you run the app at the command prompt it does indeed wait.
NB: by the way, I also tried setting stdin to something else:
System.setIn(new ByteArrayInputStream( dummy.getBytes()));
scanner = new Scanner(System.in);
... this did not hold up the test.
As a much more general rule, static methods (such as main methods) are difficult to test. For this reason, you almost never call the main method (or any other static method) from your test code. A common pattern to work around this is to convert this:
public class App {
public static void main(String args[]) {
BufferedReader br = null;
try {
br = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter something : ");
String input = br.readLine();
} catch (IOException e) {
e.printStackTrace();
}
}
}
to this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;

public class App {
private final InputStream input;
private final PrintStream output;
public App(InputStream input, PrintStream output) {
this.input = input;
this.output = output;
}
public static void main(String[] args) {
new App(System.in, System.out).start();
}
public void start() {
BufferedReader br = null;
try {
br = new BufferedReader(new InputStreamReader(input));
output.print("Enter something : ");
String nextInput = br.readLine();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Now your test becomes this:
@Test
public void appRunShouldWaitForInput(){
ByteArrayOutputStream output = new ByteArrayOutputStream();
// As you have already noted, you would need to kick this off on another thread and use a blocking implementation of InputStream to test what you want to test.
new App(new ByteArrayInputStream(new byte[0]), new PrintStream(output)).start();
assertThat(output.toByteArray().length, is(0));
}
The key idea is that, when you run the app "for real", i.e. via the main method, it uses the standard input and output streams. However, when you run it from your tests, it uses purely in-memory input/output streams which you have full control over in your test. ByteArrayOutputStream is just one example, but you can see in my example that the test is able to inspect the actual bytes that have been written to the output stream.
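To actually exercise the "waits for input" behaviour mentioned in the comment above, one option (a sketch only, assuming JUnit 4 and the refactored App from this answer) is to run start() on a separate thread and feed it through a pipe that the test controls:

import java.io.ByteArrayOutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

import org.junit.Test;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class AppWaitsForInputTest {

    @Test
    public void appRunShouldWaitForInput() throws Exception {
        PipedOutputStream keyboard = new PipedOutputStream();
        PipedInputStream appInput = new PipedInputStream(keyboard);
        ByteArrayOutputStream appOutput = new ByteArrayOutputStream();

        // run the app on its own thread; it should block inside readLine()
        Thread appThread = new Thread(() -> new App(appInput, new PrintStream(appOutput)).start());
        appThread.start();

        // give it a moment; it must still be alive because nothing has been "typed" yet
        appThread.join(500);
        assertTrue(appThread.isAlive());

        // now type a line; the app should consume it and finish
        keyboard.write("hello\n".getBytes(StandardCharsets.UTF_8));
        keyboard.flush();
        appThread.join(2000);
        assertFalse(appThread.isAlive());
    }
}

Note that the join(500) is still a timing heuristic (the "delay" smell the next answer warns about), but it exercises the blocking behaviour end to end without touching System.in.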
In general, you can't test whether your program is waiting until something happens (because you can't test whether it waits forever).
What is usually done in such cases:
In your case: don't test it. readLine is provided by an external library that has been tested pretty intensively. The cost of testing this is much higher than the value of such a test. Refactor and test your business code (string/stream operations), not the infrastructure (system input).
In the general case this is testing concurrent programming. It's hard, so people usually look for a useful simplification:
You can just test whether the output of your program is correct for the input provided in the tests.
For some really hard problems the above technique is combined with running the test thousands of times (and forcing thread switches if possible) to detect errors in concurrent code (violated invariants).
Before your test provides the input data, you can check whether your program is in the correct state (a waiting thread, correct field values, etc.).
In the worst case, you can use a delay in the test before providing input, to be sure your program waits for and uses the input. This technique is often used because it's simple, but it's smelly, and if you add more tests like that your whole suite gets slower.
I have the following code in my application which does two things:
Parses a file containing 'n' records.
For each record in the file, two web service calls are made.
public static List<String> parseFile(String fileName) {
List<String> idList = new ArrayList<String>();
try {
BufferedReader cfgFile = new BufferedReader(new FileReader(new File(fileName)));
String line = null;
cfgFile.readLine(); // the first line is skipped (presumably a header)
while ((line = cfgFile.readLine()) != null) {
if (!line.trim().equals("")) {
String [] fields = line.split("\\|");
idList.add(fields[0]);
}
}
cfgFile.close();
} catch (IOException e) {
System.out.println(e+" Unexpected File IO Error.");
}
return idList;
}
When I try to parse a file with 1 million lines of records, the Java process fails after processing a certain amount of data with java.lang.OutOfMemoryError: Java heap space. I can partly figure out that the Java process stops because of the huge amount of data being provided. Kindly suggest how to proceed with data this large.
EDIT: Will this part of the code, new BufferedReader(new FileReader(new File(fileName))), read the whole file, and is it affected by the size of the file?
The problem you have is that you are accumulating all the data in the list. The best way to approach this is to do it in a streaming fashion: do not accumulate all the ids in the list, but call your web service for each row, or accumulate a smaller buffer and then make the call.
Opening the file and creating the BufferedReader will have no impact on memory consumption, as the bytes from the file will be read (more or less) line by line. The problem is at this point in the code, idList.add(fields[0]);: the list will grow as large as the file, because you keep accumulating all of the file data into it.
Your code should do something like this:
while ((line = cfgFile.readLine()) != null) {
if (!line.trim().equals("")) {
String [] fields = line.split("\\|");
callToRemoteWebService(fields[0]);
}
}
Increase your Java heap size using the -Xms and -Xmx options. If not set explicitly, the JVM sets the heap size to its ergonomic defaults, which in your case is not enough. Read this paper to find out more about tuning memory in the JVM: http://www.oracle.com/technetwork/java/javase/tech/memorymanagement-whitepaper-1-150020.pdf
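If you want to confirm what maximum heap the JVM actually picked up after setting those flags, one small check (just a sketch) is to print the reported limit at startup:

public class HeapCheck {
    public static void main(String[] args) {
        // run with e.g.: java -Xms512m -Xmx2g HeapCheck
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("max heap: %d MB%n", maxBytes / (1024 * 1024));
    }
}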
EDIT: An alternative is to do this in a producer-consumer fashion to exploit parallel processing. The general idea is to create a producer thread that reads the file and queues tasks for processing, and n consumer threads that consume them. A very general idea (for illustrative purposes) is the following:
// blocking queue holding the tasks to be executed
final SynchronousQueue<Callable<Void>> queue = new SynchronousQueue<>();
// reads the file and submit tasks for processing
final Runnable producer = new Runnable() {
public void run() {
BufferedReader in = null;
try {
in = new BufferedReader(new FileReader(new File(fileName)));
String line = null;
while ((line = in.readLine()) != null) {
if (!line.trim().equals("")) {
String[] fields = line.split("\\|");
// this will block if there are not available consumer threads to process it...
queue.put(new Callable<Void>() {
public Void call() {
process(fields); // process(...) is the (hypothetical) method that handles one record
return null;
}
});
}
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
} catch (IOException e) {
e.printStackTrace();
} finally {
// close the buffered reader here...
}
}
};
// Consumes the tasks submitted from the producer. Consumers can be pooled
// for parallel processing.
final Runnable consumer = new Runnable() {
public void run() {
try {
while (true) {
// this method blocks if there are no items left for processing in the queue...
Callable<Void> task = queue.take();
task.call();
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
} catch (Exception e) {
e.printStackTrace();
}
}
};
Of course you have to write code that manages the lifecycle of the consumer and producer threads. The right way to do this would be by implementing it using an Executor.
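As a rough illustration of that Executor-based variant (a sketch only, using the same hypothetical process(String[] fields) placeholder as the snippet above), a bounded queue plus CallerRunsPolicy keeps memory usage flat, because the reading thread slows down instead of queueing tasks without limit:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StreamingParser {

    // hypothetical stand-in for the two web service calls made per record
    static void process(String[] fields) {
        System.out.println("processing id " + fields[0]);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String fileName = args[0];
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),              // bounded queue: at most 100 pending rows
                new ThreadPoolExecutor.CallerRunsPolicy()); // when full, the reader runs the task itself
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            in.readLine(); // skip the first line, as the original parseFile does
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    final String[] fields = line.split("\\|");
                    executor.submit(() -> process(fields));
                }
            }
        } finally {
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}

With CallerRunsPolicy the file-reading thread processes a row itself whenever the pool is saturated, which naturally throttles reading to the speed of the web service calls.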
When you want to work with big data, you have two choices:
Use a big enough heap to fit all the data. This will "work" for a while, but if your data size is unbounded, it will eventually fail.
Work with the data incrementally. Only keep part of the data (of a bounded size) in memory at any one time. This is the ideal solution, as it will scale to any amount of data.
I am looking to read the contents of files in Java. I have about 8000 files whose contents I need to read into a HashMap of (path, contents). I think using threads would be an option to speed up the process.
From what I know, reading all 8000 files in separate threads at once is not feasible (we may want to limit the number of threads). Any comments on that? Also, I am new to threading in Java; can anyone help me get started on this?
So far I have this pseudocode:
public class ThreadingTest extends Thread {
public HashMap<String, String > contents = new HashMap<String, String>();
public ThreadingTest(ArrayList<String> paths)
{
for(String s : paths)
{
// paths is paths to files.
// Have threading here for each path going to get contents from a
// file
//Not sure how to limit and start threads here
readFile(s);
Thread t = new Thread();
t.start();
}
}
public String readFile(String path) throws IOException
{
FileReader reader = new FileReader(path);
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(reader);
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
return sb.toString();
}
}
Any help in completing the threading part would be appreciated. Thanks.
Short answer: Read the files sequentially. Disk I/O doesn't parallelize well.
Long Answer: Threading might improve the read performance if the disks are good at random access (SSD disks are) or if the files are placed on several different disks, but if they're not you're just likely to end up with a lot of cache misses and waiting for the disks to seek the right read position. (You may still end up there even if your disks are good at random access.)
If you want to measure instead of guess, use Executors.newFixedThreadPool to create an ExecutorService which can read your files in parallel. Experiment with different thread counts, but don't be surprised if one reader thread per physical disk gives you the best performance.
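A minimal sketch of such a measurement (the file paths and the thread count are placeholders to adjust for your setup):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelReadExperiment {
    public static void main(String[] args) throws Exception {
        // placeholder paths: substitute your ~8000 files here
        List<Path> paths = Arrays.asList(Paths.get("a.txt"), Paths.get("b.txt"));
        int threads = 4; // experiment: try 1, 2, 4, ... and compare the timings
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, String> contents = new ConcurrentHashMap<>();

        long start = System.nanoTime();
        List<Future<Void>> futures = new ArrayList<>();
        for (Path p : paths) {
            Callable<Void> readTask = () -> {
                byte[] bytes = Files.readAllBytes(p);
                contents.put(p.toString(), new String(bytes, StandardCharsets.UTF_8));
                return null;
            };
            futures.add(pool.submit(readTask));
        }
        for (Future<Void> f : futures) {
            f.get(); // rethrows any read failure wrapped in an ExecutionException
        }
        pool.shutdown();
        System.out.printf("read %d files in %d ms using %d threads%n",
                contents.size(), (System.nanoTime() - start) / 1_000_000, threads);
    }
}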
This is a typical task for a thread pool. See the tutorial here: http://download.oracle.com/javase/tutorial/essential/concurrency/pools.html
import java.io.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;
public class PooledFileProcessing {
private Map<String, String> contents = Collections.synchronizedMap(new HashMap<String, String>());
// Integer.MAX_VALUE items max
private LinkedBlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();
private ExecutorService executor = new ThreadPoolExecutor(
5, // five workers by default
20, // up to twenty workers (with an unbounded queue the pool will in practice stay at the core size)
1, TimeUnit.MINUTES, // idle thread dies in one minute
workQueue
);
public void process(final String basePath) {
visit(new File(basePath));
System.out.println(workQueue.size() + " jobs still in queue");
executor.shutdown();
try {
executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
System.out.println("interrupted while awaiting termination");
}
System.out.println(contents.size() + " files indexed");
}
public void visit(final File file) {
if (!file.exists()) {
return;
}
if (file.isFile()) { // skip the dirs
executor.submit(new RunnablePullFile(file));
}
// traverse children
if (file.isDirectory()) {
final File[] children = file.listFiles();
if (children != null && children.length > 0) {
for (File child : children) {
visit(child);
}
}
}
}
public static void main(String[] args) {
new PooledFileProcessing().process(args.length == 1 ? args[0] : System.getProperty("user.home"));
}
protected class RunnablePullFile implements Runnable {
private final File file;
public RunnablePullFile(File file) {
this.file = file;
}
public void run() {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
StringBuilder sb = new StringBuilder();
String line;
while (
(line=reader.readLine()) != null &&
sb.length() < 8192 /* remove this check for a nice OOME or swap thrashing */
) {
sb.append(line);
}
contents.put(file.getPath(), sb.toString());
} catch (IOException e) {
System.err.println("failed on file: '" + file.getPath() + "': " + e.getMessage());
} finally {
// close the reader whether the read succeeded or failed
if (reader != null) {
try {
reader.close();
} catch (IOException e1) {
// ignore that one
}
}
}
}
}
}
From my experience, threading helps - use a thread pool and play with values around 1..2 threads per core.
Just take care with the hash map - consider putting data into the map via a synchronized method only. I remember I once had some ugly issues in a similar project, and they were related to concurrent modifications of a central hash map.
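For instance (just a sketch of the two usual options, not tied to the code above), either funnel every put through one synchronized method, or use a map implementation that is already thread-safe:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedResults {
    // option 1: plain HashMap, but every write goes through a synchronized method
    private final Map<String, String> plain = new HashMap<>();

    public synchronized void putResult(String path, String text) {
        plain.put(path, text);
    }

    // option 2: a ConcurrentHashMap needs no external synchronization for put/get
    private final Map<String, String> concurrent = new ConcurrentHashMap<>();

    public void putResultConcurrent(String path, String text) {
        concurrent.put(path, text);
    }
}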
Just some quick tips.
First of all, to get started on threads, you should look at the Runnable interface or the Thread class. To make a thread you either implement the interface with a class or extend the Thread class. You can also make anonymous threads, but I dislike the readability of those unless it's something SUPER simple.
Next, some notes on processing text with multiple threads, because it just so happens I have experience in exactly this! Keep in mind that if the files are large and a single file takes a noticeably long time to process, you will want to monitor your CPU. In my case I was doing lots of calculations and lookups while processing, which added hugely to the load, so in the end I could only use as many threads as I had processors because each thread was so labor intensive. So keep that in mind: you want to monitor the effect each thread has on the processor.
I'm not sure having threads for this would really speed up the process if all the files are on the same physical disk. It could even slow things down because the disk would have to constantly switch from one location to the other.