Can Java multithreading optimize writing multiple files? - java

I have a file of 400+ GB like:
ID Data ...4000+columns
001 dsa
002 Data
… …
17201297 asdfghjkl
I wish to split the file up by ID to get faster data retrieval, like:
mylocation/0/0/1/data.json
mylocation/0/0/2/data.json
.....
mylocation/1/7/2/0/1/2/9/7/data.json
My code is working fine, but whichever writer I use (closing them after the loop ends), it takes at least 159,206 milliseconds per 0.001% of the file creation.
In that case, could multithreading be an option to reduce the time (e.g. writing 100 or 1k files at a time)?
My current code is:
int percent = 0;
File file = new File(fileLocation + fileName);
FileReader fileReader = new FileReader(file); // to read input file
BufferedReader bufReader = new BufferedReader(fileReader);
BufferedWriter fw = null;
LinkedHashMap<String, BufferedWriter> fileMap = new LinkedHashMap<>();
int dataCounter = 0;
while ((theline = bufReader.readLine()) != null) {
    String generatedFilename = generatedFile + chrNo + "//" + directory + "gnomeV3.json";
    Path generatedJsonFilePath = Paths.get(generatedFilename);
    if (!Files.exists(generatedJsonFilePath)) { // create directory and file
        Files.createDirectories(generatedJsonFilePath.getParent());
        Files.createFile(generatedJsonFilePath);
    }
    String jsonData = DBFileMaker(chrNo, theline, pos);
    if (fileMap.containsKey(generatedFilename)) {
        fw = fileMap.get(generatedFilename);
        fw.write(jsonData);
    } else {
        fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(generatedFilename)));
        fw.write(jsonData);
        fileMap.put(generatedFilename, fw);
    }
    if (dataCounter == 172 * percent) { // as I know my number of rows
        long millisec = stopwatch.elapsed(TimeUnit.MILLISECONDS);
        System.out.println("Upto: " + pos + " as " + (Double) (0.001 * percent)
                + "% completion successful." + " took: " + millisec + " milliseconds");
        percent++;
    }
    dataCounter++;
}
for (BufferedWriter generatedFiles : fileMap.values()) {
    generatedFiles.close();
}

That really depends on your storage. Magnetic disks really like sequential writes, so multithreading would probably have a bad effect on their performance. However, SSDs may benefit from multithreaded writing.
What you should do is either split your work into two threads, where one thread creates the buffers of data to be written to disk and the second thread only writes the data. That way your disk always stays busy and does not wait for more data to be generated.
Or have a single thread that generates the buffers to be written, but use java.nio to write the data asynchronously while it goes on generating the next buffer.
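A minimal sketch of the first variant: a reader/formatter thread hands finished chunks to a dedicated writer thread over a bounded BlockingQueue. The file paths and the per-line "formatting" are placeholders, not the DBFileMaker logic from the question:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedWrite {
    private static final String POISON = new String(); // end-of-input marker, compared by identity

    public static void main(String[] args) throws Exception {
        String inputPath = args[0];  // the big source file
        String outputPath = args[1]; // a single target here; the per-ID fan-out would live in the consumer
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000); // bounded, so the reader can't run away

        Thread writerThread = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new FileWriter(outputPath))) {
                String chunk;
                while ((chunk = queue.take()) != POISON) {
                    out.write(chunk); // the disk stays busy while the producer formats the next chunk
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        writerThread.start();

        try (BufferedReader in = new BufferedReader(new FileReader(inputPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                // stand-in for DBFileMaker(...): turn the line into whatever should be written
                queue.put(line + "\n");
            }
        }
        queue.put(POISON); // tell the writer there is no more data
        writerThread.join();
    }
}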

Related

How to copy large data files line by line?

I have a 35GB CSV file. I want to read each line, and write the line out to a new CSV if it matches a condition.
try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
    try (BufferedReader br = Files.newBufferedReader(Paths.get("source.csv"))) {
        br.lines().parallel()
            .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
            .forEach(line -> {
                try {
                    writer.write(line + "\n");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    }
}
This takes approx. 7 minutes. Is it possible to speed up that process even more?
If it is an option you could use GZipInputStream/GZipOutputStream to minimize disk I/O.
Files.newBufferedReader/Writer use a default buffer size, 8 KB I believe. You might try a larger buffer.
Converting to String (Unicode) slows things down (and uses twice the memory). UTF-8 decoding is not as simple as StandardCharsets.ISO_8859_1.
It would be best if you could work with bytes for the most part and convert only specific CSV fields to String.
A memory mapped file might be the most appropriate. Parallelism might be used over file ranges, splitting up the file.
try (FileChannel sourceChannel = new RandomAccessFile("source.csv","r").getChannel(); ...
MappedByteBuffer buf = sourceChannel.map(...);
This will become a bit more code, getting lines right on (byte) '\n', but it is not overly complex.
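To make that a bit more concrete, here is a simplified, single-region sketch: it maps (at most) the first ~2 GB of the file and walks the raw bytes looking for '\n' without decoding whole lines. The file name is a placeholder, and the parallel split into several mapped ranges is left out:
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("source.csv", "r");
             FileChannel sourceChannel = raf.getChannel()) {
            long size = Math.min(sourceChannel.size(), Integer.MAX_VALUE); // one map() region is limited to ~2 GB
            MappedByteBuffer buf = sourceChannel.map(FileChannel.MapMode.READ_ONLY, 0, size);

            long lines = 0;
            int longest = 0;
            int lineStart = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == (byte) '\n') {
                    // bytes [lineStart, buf.position() - 1) are one line, still undecoded;
                    // decode only the CSV fields you actually need instead of the whole line
                    longest = Math.max(longest, buf.position() - 1 - lineStart);
                    lines++;
                    lineStart = buf.position();
                }
            }
            System.out.println("lines: " + lines + ", longest line: " + longest + " bytes");
        }
    }
}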
You can try this:
try (BufferedWriter writer = new BufferedWriter(new FileWriter(targetFile), 1024 * 1024 * 64)) {
try (BufferedReader br = new BufferedReader(new FileReader(sourceFile), 1024 * 1024 * 64)) {
I think it will save you one or two minutes. On my machine the test finishes in about 4 minutes just by specifying the buffer sizes.
Could it be faster? Try this:
final char[] cbuf = new char[1024 * 1024 * 128];
try (Writer writer = new FileWriter(targetFile)) {
    try (Reader br = new FileReader(sourceFile)) {
        int cnt = 0;
        while ((cnt = br.read(cbuf)) > 0) {
            // add your code to process/split the buffer into lines.
            writer.write(cbuf, 0, cnt);
        }
    }
}
This should save you three or four minutes.
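For completeness, here is one way the "process/split the buffer into lines" placeholder above could be filled in: scan each chunk for '\n', apply the filter per line, and carry any incomplete trailing line over to the next read. This is only a rough sketch; the paths and the blank-line filter stand in for the real ones:
import java.io.FileReader;
import java.io.FileWriter;
import java.io.Reader;
import java.io.Writer;

public class ChunkedLineCopy {
    public static void main(String[] args) throws Exception {
        String sourceFile = args[0]; // the large input CSV
        String targetFile = args[1]; // the filtered output CSV
        final char[] cbuf = new char[1024 * 1024 * 128];
        StringBuilder carry = new StringBuilder(); // partial line left over from the previous chunk
        try (Writer writer = new FileWriter(targetFile);
             Reader br = new FileReader(sourceFile)) {
            int cnt;
            while ((cnt = br.read(cbuf)) > 0) {
                int lineStart = 0;
                for (int i = 0; i < cnt; i++) {
                    if (cbuf[i] == '\n') {
                        carry.append(cbuf, lineStart, i - lineStart);
                        String line = carry.toString();
                        if (!line.trim().isEmpty()) { // stand-in for the real filter condition
                            writer.write(line);
                            writer.write('\n');
                        }
                        carry.setLength(0);
                        lineStart = i + 1;
                    }
                }
                carry.append(cbuf, lineStart, cnt - lineStart); // keep the incomplete tail for the next read
            }
            if (!carry.toString().trim().isEmpty()) { // flush a last line without a trailing newline
                writer.write(carry.toString());
                writer.write('\n');
            }
        }
    }
}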
If that's still not enough (I guess you are asking because you need to run the task repeatedly), and you want it done in one minute or even a couple of seconds, then you should process the data, load it into a DB, and handle the task with multiple servers.
Thanks for all your suggestions. The fastest version I came up with exchanges the writer for a BufferedOutputStream, which gave approximately a 25% improvement:
try (BufferedReader reader = Files.newBufferedReader(Paths.get("sample.csv"))) {
    try (BufferedOutputStream writer = new BufferedOutputStream(Files.newOutputStream(Paths.get("target.csv")), 1024 * 16)) {
        reader.lines().parallel()
            .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
            .forEach(line -> {
                try {
                    writer.write((line + "\n").getBytes());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    }
}
Still, the BufferedReader performs better than the BufferedInputStream in my case.

How to read a file using multiple threads in Java when a high throughput(3GB/s) file system is available

I understand that for a normal Spindle Drive system, reading files using multiple threads is inefficient.
This is a different case: I have a high-throughput file system available to me, which provides read speeds of up to 3 GB/s, along with 196 CPU cores and 2 TB of RAM.
A single-threaded Java program reads the file at 85-100 MB/s at most, so I have the potential to do better than a single thread. I have to read files as big as 1 TB and I have enough RAM to load them.
Currently I use the following or something similar, but need to write something with multi-threading to get better throughput:
Java 7 Files: 50 MB/s
List<String> lines = Files.readAllLines(Paths.get(path), encoding);
Java commons-io: 48 MB/s
List<String> lines = FileUtils.readLines(new File("/path/to/file.txt"), "utf-8");
The same with guava: 45 MB/s
List<String> lines = Files.readLines(new File("/path/to/file.txt"), Charset.forName("utf-8"));
Java Scanner Class: Very Slow
Scanner s = new Scanner(new File("filepath"));
ArrayList<String> list = new ArrayList<String>();
while (s.hasNext()){
list.add(s.next());
}
s.close();
I want to be able to load the file and build the same ArrayList, in the correct sorted sequence, as fast as possible.
There is another question that reads similarly, but it is actually different, because:
That question discusses systems where multi-threaded file I/O is physically impossible to make efficient; thanks to technological advancements, we now have systems designed to support high-throughput I/O, so the limiting factor is CPU/software, which can be overcome by multithreading the I/O.
The other question does not answer how to write code that multithreads the I/O.
Here is a solution for reading a single file with multiple threads.
Divide the file into N chunks, read each chunk in a thread, then merge them in order. Beware of lines that cross chunk boundaries. This is the basic idea suggested by user slaks.
Benchmarking the implementation below with multiple threads on a single 20 GB file:
1 Thread : 50 seconds : 400 MB/s
2 Threads: 30 seconds : 666 MB/s
4 Threads: 20 seconds : 1GB/s
8 Threads: 60 seconds : 333 MB/s
Equivalent Java7 readAllLines() : 400 seconds : 50 MB/s
Note: this may only work on systems that are designed to support high-throughput I/O, not on usual personal computers.
package filereadtests;
import java.io.*;
import static java.lang.Math.toIntExact;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.Charset;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class FileRead implements Runnable
{
    private FileChannel _channel;
    private long _startLocation;
    private int _size;
    int _sequence_number;

    public FileRead(long loc, int size, FileChannel chnl, int sequence)
    {
        _startLocation = loc;
        _size = size;
        _channel = chnl;
        _sequence_number = sequence;
    }

    @Override
    public void run()
    {
        try
        {
            System.out.println("Reading the channel: " + _startLocation + ":" + _size);
            //allocate memory
            ByteBuffer buff = ByteBuffer.allocate(_size);
            //Read file chunk to RAM
            _channel.read(buff, _startLocation);
            //chunk to String
            String string_chunk = new String(buff.array(), Charset.forName("UTF-8"));
            System.out.println("Done Reading the channel: " + _startLocation + ":" + _size);
        } catch (Exception e)
        {
            e.printStackTrace();
        }
    }
    //args[0] is path to read file
    //args[1] is the size of thread pool; need to try different values to find the sweet spot
    public static void main(String[] args) throws Exception
    {
        FileInputStream fileInputStream = new FileInputStream(args[0]);
        FileChannel channel = fileInputStream.getChannel();
        long remaining_size = channel.size(); //get the total number of bytes in the file
        long chunk_size = remaining_size / Integer.parseInt(args[1]); //file_size/threads

        //Max allocation size allowed is ~2GB
        if (chunk_size > (Integer.MAX_VALUE - 5))
        {
            chunk_size = (Integer.MAX_VALUE - 5);
        }

        //thread pool
        ExecutorService executor = Executors.newFixedThreadPool(Integer.parseInt(args[1]));

        long start_loc = 0; //file pointer
        int i = 0;          //loop counter
        while (remaining_size >= chunk_size)
        {
            //launches a new thread
            executor.execute(new FileRead(start_loc, toIntExact(chunk_size), channel, i));
            remaining_size = remaining_size - chunk_size;
            start_loc = start_loc + chunk_size;
            i++;
        }
        //load the last remaining piece
        executor.execute(new FileRead(start_loc, toIntExact(remaining_size), channel, i));

        //Tear down: stop accepting new tasks and wait for all threads to finish
        executor.shutdown();
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);

        System.out.println("Finished all threads");
        fileInputStream.close();
    }
}
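The run() method above reads and decodes its chunk but does not keep the result. If the goal is to rebuild the contents in the original order (the list the question asks for), one option, which is not part of the answer above, is to submit Callable tasks and collect the Futures in submission order, merging the raw bytes before decoding so that lines and multi-byte characters crossing chunk boundaries stay intact. A rough sketch:
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedChunkRead {
    public static void main(String[] args) throws Exception {
        int threads = Integer.parseInt(args[1]);
        try (FileInputStream in = new FileInputStream(args[0]);
             FileChannel channel = in.getChannel()) {
            long fileSize = channel.size();
            long chunkSize = Math.min(fileSize / threads + 1, Integer.MAX_VALUE - 8);

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<byte[]>> chunks = new ArrayList<>();
            for (long pos = 0; pos < fileSize; pos += chunkSize) {
                final long start = pos;
                final int size = (int) Math.min(chunkSize, fileSize - pos);
                chunks.add(pool.submit(() -> {
                    ByteBuffer buf = ByteBuffer.allocate(size);
                    // positional reads do not touch the channel's shared position,
                    // so several threads can read the same channel safely
                    while (buf.hasRemaining() && channel.read(buf, start + buf.position()) > 0) {
                        // keep reading until the chunk is full (a single read may return less)
                    }
                    return Arrays.copyOf(buf.array(), buf.position());
                }));
            }
            pool.shutdown();

            // futures come back in submission order, so the concatenation preserves file order;
            // for files bigger than ~2 GB you would decode chunk by chunk instead of building one array
            ByteArrayOutputStream merged = new ByteArrayOutputStream();
            for (Future<byte[]> chunk : chunks) {
                merged.write(chunk.get());
            }
            // decode once, after merging, so chunk boundaries cannot split characters or lines
            List<String> lines = Arrays.asList(
                    new String(merged.toByteArray(), StandardCharsets.UTF_8).split("\n"));
            System.out.println("lines read: " + lines.size());
        }
    }
}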
You should first try the Java 7 Files.readAllLines:
List<String> lines = Files.readAllLines(Paths.get(path), encoding);
Using a multithreaded approach is probably not a good option, as it will force the filesystem to perform random reads (which is never a good thing on a file system).

The fastest ways to check if a file exists in java

Currently I am tasked with making a tool in Java that can check whether a link is correct. The links are fed from the Jericho HTML Parser, and my job is only to check whether the file exists / the link is correct. That part is done; the hard part is optimizing it, since my code runs (I have to say) rather sluggishly, at 65 ms per run.
public static String checkRelativeURL(String originalFileLoc, String relativeLoc){
StringBuilder sb = new StringBuilder();
String absolute = Common.relativeToAbsolute(originalFileLoc, relativeLoc); //built in function to replace the link from relative link to absolute path
sb.append(absolute);
sb.append("\t");
try {
Path path = Paths.get(absolute);
sb.append(Files.exists(path));
}catch (InvalidPathException | NullPointerException ex) {
sb.append(false);
}
sb.append("\t");
return sb.toString();
}
and it is these lines that take the 65 ms:
Path path = Paths.get(absolute);
sb.append(Files.exists(path));
I have tried using
File file = new File(absolute);
sb.append(file.isFile());
It still runs at around 65-100 ms.
So is there any faster way to check whether a file exists than this?
I am processing more than 70k HTML files and every millisecond counts. Thanks :(
EDIT:
I tried listing all the files into a List, and it doesn't really help, since it takes more than 20 minutes just to list all the files....
The code that I use to list all the files:
static public void listFiles2(String filepath){
Path path = Paths.get(filepath);
File file = null;
String pathString = new String();
try {
if(path.toFile().isDirectory()){
DirectoryStream<Path> stream = Files.newDirectoryStream(path);
for(Path entry : stream){
file = entry.toFile();
pathString = entry.toString();
if(file.isDirectory()){
listFiles2(pathString);
}
if (file.isFile()){
filesInProject.add(pathString);
System.out.println(pathString);
}
}
stream.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
If you know the target OS set in advance (which is usually the case), ultimately the fastest way to list that many files is through a shell, by invoking a process, e.g. using Runtime.exec.
On Windows you can do it with
dir /s /b
On Linux
ls -R -1
You can check what the OS is and use the appropriate command (report an error, or fall back to a directory stream, if neither is supported).
If you want simplicity and don't need to report progress, you can avoid dealing with the process I/O and store the list in a temporary file, e.g. ls -R -1 > /tmp/filelist.txt. Alternatively, you can read from the process output directly, with a buffered stream, a reader or the like, using a large enough buffer.
On an SSD it will complete in the blink of an eye, and on a modern HDD in seconds (half a million files is not a problem with this approach).
Once you have the list, you can approach it differently depending on the maximum file count and the memory requirements. If the requirements are loose, e.g. for a desktop program, you can get by with very simple code, e.g. pre-loading the complete file list into a HashSet and checking existence when needed. Shortening paths by removing the common root will require much less memory. You can also reduce memory by keeping only a hash of each filename instead of the full name (common-root removal will probably save more).
Or you can optimize further if you wish; the question now reduces to checking the existence of a string in a set of strings stored in memory or in a file, which has many well-known optimal solutions.
Below is a very loose, simplistic sample for Windows. It executes dir on an HDD (not SSD) drive root with ~400K files, reads the list, and benchmarks (well, kind of) time and memory for the string-set and MD5-set approaches:
public static void main(String args[]) throws Exception {
final Runtime rt = Runtime.getRuntime();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
long time = System.currentTimeMillis();
// windows command: cd to t:\ and run recursive dir
Process p = rt.exec("cmd /c \"t: & dir /s /b > filelist.txt\"");
if (p.waitFor() != 0)
throw new Exception("command has failed");
System.out.println("done executing shell, took "
+ (System.currentTimeMillis() - time) + "ms");
System.out.println();
File f = new File("T:/filelist.txt");
// load into hash set
time = System.currentTimeMillis();
Set<String> fileNames = new HashSet<String>(500000);
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
new FileInputStream(f), StandardCharsets.UTF_8),
50 * 1024 * 1024)) {
for (String line = reader.readLine(); line != null; line = reader
.readLine()) {
fileNames.add(line);
}
}
System.out.println(fileNames.size() + " file names loaded took "
+ (System.currentTimeMillis() - time) + "ms");
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
time = System.currentTimeMillis();
// check files
for (int i = 0; i < 70_000; i++) {
StringBuilder fileToCheck = new StringBuilder();
while (fileToCheck.length() < 256)
fileToCheck.append(Double.toString(Math.random()));
if (fileNames.contains(fileToCheck.toString())) // toString() so the set compares real String keys
System.out.println("to prevent optimization, never executes");
}
System.out.println();
System.out.println("hash set 70K checks took "
+ (System.currentTimeMillis() - time) + "ms");
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
// Test memory/performance with MD5 hash set approach instead of full
// names
time = System.currentTimeMillis();
Set<String> nameHashes = new HashSet<String>(50000);
MessageDigest md5 = MessageDigest.getInstance("MD5");
for (String name : fileNames) {
String nameMd5 = new String(md5.digest(name
.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8);
nameHashes.add(nameMd5);
}
System.out.println();
System.out.println(fileNames.size() + " md5 hashes created, took "
+ (System.currentTimeMillis() - time) + "ms");
fileNames.clear();
fileNames = null;
System.gc();
Thread.sleep(100);
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
time = System.currentTimeMillis();
// check files
for (int i = 0; i < 70_000; i++) {
StringBuilder fileToCheck = new StringBuilder();
while (fileToCheck.length() < 256)
fileToCheck.append(Double.toString(Math.random()));
String md5ToCheck = new String(md5.digest(fileToCheck.toString()
.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8);
if (nameHashes.contains(md5ToCheck))
System.out.println("to prevent optimization, never executes");
}
System.out.println("md5 hash set 70K checks took "
+ (System.currentTimeMillis() - time) + "ms");
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
}
Output:
mem 3 Mb
done executing shell, took 5686ms
403108 file names loaded took 382ms
mem 117 Mb
hash set 70K checks took 283ms
mem 117 Mb
403108 md5 hashes created, took 486ms
mem 52 Mb
md5 hash set 70K checks took 366ms
mem 48 Mb

What is the best way to write and append a large file in java

I have a java program that sends a series of GET requests to a webservice and stores the response body as a text file.
I have implemented the following example code (much of the code is filtered out to highlight the relevant part), which appends to the text file, writing each response as a new line at the EOF. The code works, but performance suffers as the file grows bigger.
The total size of the data is almost 4 GB, and each request appends about 500 KB to 1 MB of data on average.
do
{
//send the GET request & fetch data as string
String resultData = HTTP.GET <uri>;
// buffered writer to create a file
BufferedWriter writer = new BufferedWriter(new FileWriter(path, true));
//write or append the file
writer.write(resultData + "\n");
}
while(resultData.exists());
These files are created on a daily basis and moved to HDFS for Hadoop consumption and as a real-time archive. Is there a better way to achieve this?
1) You are opening a new writer every time, without closing the previous writer object.
2) Don't open the file for each write operation; instead, open it before the loop and close it after the loop.
BufferedWriter writer = new BufferedWriter(new FileWriter(path, true));
do{
String resultData = HTTP.GET <uri>;
writer.write(resultData + "\n");
}while(resultData.exists());
writer.close();
3) The default buffer size of BufferedWriter is 8192 characters. Since you have 4 GB of data, I would increase the buffer size to improve performance, but at the same time make sure your JVM has enough memory to hold the buffer.
BufferedWriter writer = new BufferedWriter(new FileWriter(path, true), 8192 * 4);
do{
String resultData = HTTP.GET <uri>;
writer.write(resultData + "\n");
}while(resultData.exists());
writer.close();
4) Since you are making a GET web service call, performance also depends on the response time of the web service.
According to this answer, Java difference between FileWriter and BufferedWriter, what you are doing right now is inefficient.
The code you provided is incomplete: brackets are missing and there is no close statement for the writer. But if I understand correctly, for every resultData you open a new buffered writer and call write once. This means that the buffer is just overhead; you should use the FileWriter directly.
If what you want is to fetch data in a loop and write it to a single file, then you should do something like this:
try( BufferedWriter writer = new BufferedWriter(new FileWriter("PATH_HERE", true)) ) {
String resultData = "";
do {
//send the GET request & fetch data as string
resultData = HTTP.GET <uri>;
//write or append the file
writer.write(resultData + "\n");
} while(resultData != null && !resultData.isEmpty());
} catch(Exception e) {
e.printStackTrace();
}
The above uses try-with-resources, which handles closing the writer after exiting the try block. It is available from Java 7 onwards.

Existing file slowing down java program

I'm running a few methods together that take in many text files, read their contents, then write things about their contents to a new file. The problem I have is that when the output file already exists, the program is very slow. If I delete the file and run the program, it is very fast. I'm using BufferedReader and BufferedWriter for my I/O. I feel like there's a simple answer that I'm just not finding. Thanks in advance! I'd rather not post code if possible, sorry!
EDIT:
here's, very generally, what's going on:
File path= new File("some path");
try {
BufferedWriter writer = new BufferedWriter(new FileWriter(path, false));
//do some string manipulation
writer.append(string);
writer.newLine();
...
//once done
writer.close();
}catch(IOException e) {
//... handle this ...
}
The problem is that when this file exists, everything is slow. If it doesn't then it is fast.
I would revisit whatever it is you're doing when you say " //do some string manipulation".
Here is what I noticed with > 1000 iterations:
the time it takes to get the file handle and close the writer generally remains the same
the inner loop operation with the string "ABCDEFGHIJKLMNOPQRSTUVWXYZ" has a mean variance of 98 ms
the same inner loop operation with a string four times that size shows much larger variation in run time; sometimes the program finished in 2 seconds, sometimes it took 20 seconds.
I also did a version of this test where the file was always deleted first. It made no difference. Here's the code I ran:
public static void main(String[] args) {
long s = System.currentTimeMillis();
File path = new File("output.txt");
long stop = System.currentTimeMillis();
System.out.println("handle acquired " + (stop - s) );
try {
BufferedWriter writer = new BufferedWriter(new FileWriter(path, false));
//do some string manipulation
s = System.currentTimeMillis();
for (int i =0; i < 10000000; i++) {
String string = "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ";
writer.append(string);
writer.newLine();
}
stop = System.currentTimeMillis();
System.out.println("loop end " + (stop - s) );
s = System.currentTimeMillis();
writer.close();
stop = System.currentTimeMillis();
System.out.println("writer closed " + (stop - s) );
}catch(IOException e) {
//... handle this ...
}
}
