I want to read from a file from two different places concurrently, and I also want to use buffered I/O streams for efficiency. I tried to work something out on my own using the Java API, but it isn't working. Can anybody help? I need it for an external merge sort. Thanks!
You can use a RandomAccessFile, which gives you seekable (random) access to a file; its FileChannel can also memory-map the file, much like mmap in C.
I found an example of this:
try {
File file = new File("filename");
// Create a read-only memory-mapped file
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, roChannel.size());
// Create a read-write memory-mapped file
FileChannel rwChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer wrBuf = rwChannel.map(FileChannel.MapMode.READ_WRITE, 0, rwChannel.size());
// Create a private (copy-on-write) memory-mapped file.
// Any write to this buffer results in a private copy of the data.
FileChannel pvChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer pvBuf = pvChannel.map(FileChannel.MapMode.PRIVATE, 0, pvChannel.size());
} catch (IOException e) {
e.printStackTrace();
}
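For the original question of reading from two places of the same file concurrently, a minimal sketch (the file name, offsets and buffer sizes are placeholders) can simply open two RandomAccessFile instances on the same path, each with its own independent file pointer:

import java.io.IOException;
import java.io.RandomAccessFile;

public class TwoPointerRead {
    public static void main(String[] args) throws IOException {
        // Hypothetical file and offsets, purely for illustration.
        try (RandomAccessFile first = new RandomAccessFile("data.bin", "r");
             RandomAccessFile second = new RandomAccessFile("data.bin", "r")) {
            first.seek(0);                   // run #1 starts at the beginning
            second.seek(first.length() / 2); // run #2 starts in the middle
            byte[] buf1 = new byte[8192];
            byte[] buf2 = new byte[8192];
            int n1 = first.read(buf1);       // each instance keeps its own position
            int n2 = second.read(buf2);
            System.out.println("Read " + n1 + " and " + n2 + " bytes from two positions");
        }
    }
}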
Edit: you stated you can't use a RandomAccessFile, which is the only way to skip up and down through the file. If you're stuck without it, then you must read the file sequentially, but that doesn't mean that you can't open multiple readers on the same file, each with its own position.
I put together the following test/sample and it shows clearly that you can open the file "twice" with different read pointers and sequentially sum two halves of the file. Again, if you need random access, you must use a RandomAccessFile, and that's what I'd suggest, but here you go:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FileTest {
public static void main(String[] args) throws IOException, InterruptedException, ExecutionException{
File temp = File.createTempFile("asfd", "");
BufferedWriter wrt = new BufferedWriter(new FileWriter(temp));
int testLength = 10000;
int numWidth = String.valueOf(testLength).length();
int targetSum = 0;
for(int i = 0; i < testLength; i++){
// each line guaranteed to have a good number of characters for our test
wrt.write(String.format("%0"+ numWidth +"d\n", i));
targetSum += i;
}
wrt.close();
BufferedReader rdr1 = new BufferedReader(new FileReader(temp));
BufferedReader rdr2 = new BufferedReader(new FileReader(temp));
rdr2.skip((numWidth+1)*testLength / 2); // skip first half of the lines
Summer sum1 = new Summer(rdr1, testLength / 2);
Summer sum2 = new Summer(rdr2, testLength / 2);
ExecutorService executor = Executors.newFixedThreadPool(2);
Future<Integer> halfSum1 = executor.submit(sum1);
Future<Integer> halfSum2 = executor.submit(sum2);
System.out.println("Total sum = " + (halfSum1.get() + halfSum2.get()) + " reference " + targetSum);
executor.shutdown();
rdr1.close();
rdr2.close();
temp.delete();
}
private static class Summer implements Callable<Integer>{
private BufferedReader rdr;
private int limit;
public Summer(BufferedReader rdr, int limit) throws IOException{
this.rdr = rdr;
this.limit = limit;
}
@Override
public Integer call() throws Exception {
System.out.println(Thread.currentThread().getName() + " started " + System.currentTimeMillis());
int sum = 0;
for(int i = 0; i < limit; i++){
sum += Integer.valueOf(rdr.readLine());
// uncomment to see interleaving of threads:
//System.out.println(Thread.currentThread().getName());
}
System.out.println(Thread.currentThread().getName() + " finished " + System.currentTimeMillis());
return sum;
}
}
}
What's to stop you from simply opening the file twice, and working with it as if it were two independent files?
File inputFile = new File("src/SameFileTwice.java");
BufferedReader in1 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
BufferedReader in2 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
try {
String strLine;
while ((strLine = in1.readLine()) != null && (strLine = in2.readLine()) != null) {
System.out.println(strLine);
}
} finally {
in1.close();
in2.close();
}
For a homework assignment, I need to implement external sorting such that I can sort a 10GB file with 1GB physical memory. Currently, I'm using a BufferedReader on the large file and constructing/sorting the smaller files sequentially. Then in the merge step, I have BufferedReaders open for all small files and a single BufferedWriter for the large final file where I write to the large file using the merge k sorted lists algorithm with a PriorityQueue. This works, but it needs to be faster (take half as much time to be exact).
Both the splitting step and the merging step currently run sequentially. I think I can at least split and sort the files in parallel using multiple threads; with memory-mapped files, the OS would take care of optimally paging data in and out of physical memory. I was wondering if there is a way for Java to do this using parallel streams, something along the lines of:
largeFile.splitInParallel(100000)
.lines()
.map((s) -> new LineObject(s))
.sorted()
.forEach(writeSmallFileToDisk)
where the argument to splitInParallel is the number of lines I want in the smaller files. Any help is appreciated, thanks!
EDIT:
My code is
public class Main {
private static final int BUFFER_SIZE = 10_000_000;
/**
* A main method to run examples.
*
* @param args not used
*/
public static void main(String[] args) throws IOException {
System.out.println("Starting...");
String file = args[0];
int batchSize = Integer.parseInt(args[1]);
try {
FileInputStream fin = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fin, BUFFER_SIZE);
BufferedReader br = new BufferedReader(new InputStreamReader(bis), BUFFER_SIZE);
int lineNumber = 0;
int batchId = 0;
String line;
TaxiEntry[] batch = new TaxiEntry[batchSize];
int i = 0;
while ((line = br.readLine()) != null) {
TaxiEntry taxiEntry = parseLine(line);
batch[i++] = taxiEntry;
lineNumber++;
if (lineNumber % batchSize == 0) {
String outputFileName = String.format("batches/batch_%d.txt", batchId);
BufferedWriter bf = new BufferedWriter(new FileWriter(outputFileName, true), BUFFER_SIZE);
Arrays.parallelSort(batch, 0, i);
for (int j = 0; j < i; j++) {
bf.write(batch[j].toString());
if (j != i - 1) {
bf.newLine();
}
}
batchId++;
i = 0;
bf.close();
}
}
// Sort and write the final (possibly partial) batch
String outputFileName = String.format("batches/batch_%d.txt", batchId);
BufferedWriter bf = new BufferedWriter(new FileWriter(outputFileName, true), BUFFER_SIZE);
Arrays.parallelSort(batch, 0, i);
for (int j = 0; j < i; j++) {
bf.write(batch[j].toString());
if (j != i - 1) {
bf.newLine();
}
}
batchId++;
bf.close();
System.out.println("Processed " + lineNumber + " lines");
merge(batchId);
} catch (IOException e) {
e.printStackTrace();
}
}
public static void merge(int numBatches) throws IOException {
System.out.println("Starting merge...");
// Open readers
BufferedReader[] readers = new BufferedReader[numBatches];
for (int i = 0; i < numBatches; i++) {
String file = String.format("batches/batch_%d.txt", i);
FileInputStream fin = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fin, BUFFER_SIZE);
BufferedReader br = new BufferedReader(new InputStreamReader(bis), BUFFER_SIZE);
readers[i] = br;
}
// Merge
String outputFileName = "result/final.txt";
BufferedWriter bf = new BufferedWriter(new FileWriter(outputFileName, true), BUFFER_SIZE);
PriorityQueue<IndexedTaxiNode> curEntries = new PriorityQueue<>();
for (int i = 0; i < numBatches; i++) {
BufferedReader reader = readers[i];
String next = reader.readLine();
if (next != null) {
TaxiEntry curr = parseLine(next);
curEntries.add(new IndexedTaxiNode(curr, i));
}
}
while (!curEntries.isEmpty()) {
// get max from curEntries
IndexedTaxiNode maxNode = curEntries.remove();
bf.write(maxNode.toString());
bf.newLine();
int index = maxNode.index;
String next = readers[index].readLine();
if (next != null) {
TaxiEntry newEntry = parseLine(next);
curEntries.add(new IndexedTaxiNode(newEntry, index));
}
}
bf.close();
for (BufferedReader reader : readers) {
reader.close();
}
}
public static TaxiEntry parseLine(String line) {
return new TaxiEntry(line, Double.parseDouble(line.split(",")[16]));
}
}
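There is no splitInParallel in the standard file APIs, but a rough sketch of the same idea is to keep the disk reads sequential and hand each batch to an ExecutorService for sorting and writing. This is only a sketch: the batch size, the "batches/" output naming, and the plain String sort below are assumptions standing in for the TaxiEntry logic.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSplitSketch {
    // Sketch only: reading stays sequential (the disk is serial anyway),
    // while sorting and writing of each batch happens on a thread pool.
    static void splitAndSort(String inputPath, int batchSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<?>> pending = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(inputPath))) {
            List<String> batch = new ArrayList<>(batchSize);
            int batchId = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    final List<String> toSort = batch;
                    final String out = "batches/batch_" + batchId++ + ".txt"; // assumes batches/ exists
                    pending.add(pool.submit(() -> {
                        Collections.sort(toSort); // stand-in for sorting by the TaxiEntry key
                        try {
                            Files.write(Paths.get(out), toSort);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }));
                    batch = new ArrayList<>(batchSize);
                }
            }
            // the final partial batch is omitted here for brevity
        }
        for (Future<?> f : pending) {
            f.get(); // surface any worker failure
        }
        pool.shutdown();
    }
}

Whether this helps depends on whether the job is actually CPU-bound during the sort; if the disk is the bottleneck, parallel sorting alone won't halve the total time.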
Doing some timings, I found that the time to read from disk and the time to sort are of a similar order of magnitude.
System.out.println("Begin loading file");
// do loading stuff
System.out.format("elapsed %.03f ms%n%n", (finishTime - startTime) / 1e6);
System.out.println("Sorting lines");
// do sorting stuff
System.out.format("elapsed %.03f ms%n", (finishTime - startTime) / 1e6);
Console output is:
Begin loading file
elapsed 918.933 ms
Sorting lines
elapsed 1360.896 ms
I used a modest file of about 150 MB for the timings. It might not be a good idea to have lots of threads all reading from disk at the same time.
My suggestion, for what it's worth, is to have one thread do all of the disk reading and writing, and another thread that concurrently does the sorting. I could only see a way to do this for the splitting and sorting phase.
For the splitting phase, you cannot read all the segments in one go because that would consume too much memory. So you read a few segments, write a few, read a few, and so on. The idea of this interleaving is to keep the disk continuously busy by delegating the sorting to another thread: hopefully, by the time the disk is ready to write a segment, the sort on that segment has completed, so the disk never has to wait.
List<String> lines = new ArrayList<>();
int i = 0;
while (someCondition()) {
String line = reader.readLine();
lines.add(line);
if (lines.size() == BATCH_SIZE) {
sendMsgToWorker(lines); // send to worker thread
if (i == MAX_MESSAGE_QUEUE - 1) {
for (int j = 0; j < MAX_MESSAGE_QUEUE; j++) {
List<String> sortedLines = waitForLineFromWorker(); // wait for worker thread
writeTmpFile(sortedLines);
}
}
lines = new ArrayList<>();
i = (i + 1) % MAX_MESSAGE_QUEUE;
}
}
An outline for the splitting and sorting phase is shown above, without covering any edge cases. The amount of memory used would be proportional to BATCH_SIZE * MAX_MESSAGE_QUEUE.
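One concrete way to realize the sendMsgToWorker/waitForLineFromWorker hand-off is a bounded BlockingQueue between the reader thread and a sorter thread. This is a sketch only; it departs slightly from the outline in that the sorter thread also writes each temp file, and the batch size, queue depth and file naming are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ReaderSorterSketch {
    private static final int BATCH_SIZE = 100_000;
    private static final int MAX_MESSAGE_QUEUE = 4;                    // bounds memory use
    private static final List<String> POISON_PILL = new ArrayList<>(); // empty batch signals "done"

    public static void main(String[] args) throws Exception {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(MAX_MESSAGE_QUEUE);

        Thread sorter = new Thread(() -> {
            int fileIndex = 0;
            try {
                List<String> batch;
                while ((batch = queue.take()) != POISON_PILL) {
                    Collections.sort(batch);                           // stand-in for the real sort key
                    Files.write(Paths.get("tmp_" + fileIndex++ + ".txt"), batch);
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        sorter.start();

        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> lines = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
                if (lines.size() == BATCH_SIZE) {
                    queue.put(lines);                                  // blocks while the queue is full
                    lines = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!lines.isEmpty()) {
                queue.put(lines);
            }
        }
        queue.put(POISON_PILL);                                        // tell the sorter to finish
        sorter.join();
    }
}

The bounded queue is what caps memory use: put() blocks when MAX_MESSAGE_QUEUE unsorted batches are already waiting, so the reader can never race too far ahead of the sorter.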
Unfortunately, I don't see a way to apply concurrency to the phase of merging the multiple files. The disk is just the disk so cannot go any faster even with multiple threads.
You could try investigating parallel quicksort, but the problem with quicksort is choosing a pivot point so that the partitions end up a reasonable size.
I have a file which I would like to read in Java and split this file into n (user input) output files. Here is how I read the file:
int n = 4;
BufferedReader br = new BufferedReader(new FileReader("file.csv"));
try {
String line = br.readLine();
while (line != null) {
line = br.readLine();
}
} finally {
br.close();
}
How do I split the file - file.csv into n files?
Note - Since the number of entries in the file is of the order of 100k, I can't store the file contents in an array and then split it and save it into multiple files.
Since one file can be very large, each split file could be large as well.
Example:
Source File Size: 5GB
Num Splits: 5: Destination
File Size: 1GB each (5 files)
There is no way to read such a large split chunk in one go, even if we had that much memory. Basically, for each split we can read a fixed-size byte array, which should be feasible in terms of both performance and memory.
NumSplits: 10 MaxReadBytes: 8KB
public static void main(String[] args) throws Exception
{
RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
long numSplits = 10; //from user input, extract it from args
long sourceSize = raf.length();
long bytesPerSplit = sourceSize/numSplits ;
long remainingBytes = sourceSize % numSplits;
int maxReadBufferSize = 8 * 1024; //8KB
for(int destIx=1; destIx <= numSplits; destIx++) {
BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+destIx));
if(bytesPerSplit > maxReadBufferSize) {
long numReads = bytesPerSplit/maxReadBufferSize;
long numRemainingRead = bytesPerSplit % maxReadBufferSize;
for(int i=0; i<numReads; i++) {
readWrite(raf, bw, maxReadBufferSize);
}
if(numRemainingRead > 0) {
readWrite(raf, bw, numRemainingRead);
}
}else {
readWrite(raf, bw, bytesPerSplit);
}
bw.close();
}
if(remainingBytes > 0) {
BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+(numSplits+1)));
readWrite(raf, bw, remainingBytes);
bw.close();
}
raf.close();
}
static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
byte[] buf = new byte[(int) numBytes];
int val = raf.read(buf);
if (val != -1) {
bw.write(buf, 0, val); // write only the bytes actually read
}
}
import java.io.*;
import java.util.Scanner;
public class split {
public static void main(String args[])
{
try{
// Reading file and getting no. of files to be generated
String inputfile = "C:/test.txt"; // Source File Name.
double nol = 2000.0; // No. of lines to be split and saved in each output file.
File file = new File(inputfile);
Scanner scanner = new Scanner(file);
int count = 0;
while (scanner.hasNextLine())
{
scanner.nextLine();
count++;
}
System.out.println("Lines in the file: " + count); // Displays no. of lines in the input file.
double temp = (count/nol);
int temp1=(int)temp;
int nof=0;
if(temp1==temp)
{
nof=temp1;
}
else
{
nof=temp1+1;
}
System.out.println("No. of files to be generated :"+nof); // Displays no. of files to be generated.
//---------------------------------------------------------------------------------------------------------
// Actual splitting of file into smaller files
FileInputStream fstream = new FileInputStream(inputfile);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
for (int j=1;j<=nof;j++)
{
FileWriter fstream1 = new FileWriter("C:/New Folder/File"+j+".txt"); // Destination File Location
BufferedWriter out = new BufferedWriter(fstream1);
for (int i=1;i<=nol;i++)
{
strLine = br.readLine();
if (strLine!= null)
{
out.write(strLine);
if(i!=nol)
{
out.newLine();
}
}
}
out.close();
}
in.close();
}catch (Exception e)
{
System.err.println("Error: " + e.getMessage());
}
}
}
Though it's an old question, for reference I'm listing the code I used to split large files to any size; it works with any Java version above 1.4.
Sample split and join methods are shown below:
public void join(String FilePath) {
long leninfile = 0, leng = 0;
int count = 1, data = 0;
try {
File filename = new File(FilePath);
//RandomAccessFile outfile = new RandomAccessFile(filename,"rw");
OutputStream outfile = new BufferedOutputStream(new FileOutputStream(filename));
while (true) {
filename = new File(FilePath + count + ".sp");
if (filename.exists()) {
//RandomAccessFile infile = new RandomAccessFile(filename,"r");
InputStream infile = new BufferedInputStream(new FileInputStream(filename));
data = infile.read();
while (data != -1) {
outfile.write(data);
data = infile.read();
}
leng++;
infile.close();
count++;
} else {
break;
}
}
outfile.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public void split(String FilePath, long splitlen) {
long leninfile = 0, leng = 0;
int count = 1, data;
try {
File filename = new File(FilePath);
//RandomAccessFile infile = new RandomAccessFile(filename, "r");
InputStream infile = new BufferedInputStream(new FileInputStream(filename));
data = infile.read();
while (data != -1) {
filename = new File(FilePath + count + ".sp");
//RandomAccessFile outfile = new RandomAccessFile(filename, "rw");
OutputStream outfile = new BufferedOutputStream(new FileOutputStream(filename));
while (data != -1 && leng < splitlen) {
outfile.write(data);
leng++;
data = infile.read();
}
leninfile += leng;
leng = 0;
outfile.close();
count++;
}
} catch (Exception e) {
e.printStackTrace();
}
}
The complete Java code is available at the File Split in Java Program link.
A clean solution, though it loads the entire file into memory:
all lines of the file are read into List<String> rowsOfFile;
edit maxSizeFile to choose the maximum size of a single split file.
public void splitFile(File fileToSplit) throws IOException {
long maxSizeFile = 10000000; // 10 MB
StringBuilder buffer = new StringBuilder((int) maxSizeFile);
int sizeOfRows = 0;
int recurrence = 0;
String fileName;
List<String> rowsOfFile;
rowsOfFile = Files.readAllLines(fileToSplit.toPath(), Charset.defaultCharset());
for (String row : rowsOfFile) {
buffer.append(row).append(System.lineSeparator()); // keep the line breaks in the output
sizeOfRows += row.getBytes(StandardCharsets.UTF_8).length;
if (sizeOfRows >= maxSizeFile) {
fileName = generateFileName(recurrence);
File newFile = new File(fileName);
try (PrintWriter writer = new PrintWriter(newFile)) {
writer.print(buffer.toString());
}
recurrence++;
sizeOfRows = 0;
buffer = new StringBuilder();
}
}
// last rows
if (sizeOfRows > 0) {
fileName = generateFileName(recurrence);
File newFile = new File(fileName);
try (PrintWriter writer = new PrintWriter(newFile)) {
writer.print(buffer.toString());
}
}
Files.delete(fileToSplit.toPath());
}
Method to generate the file name:
public String generateFileName(int numFile) {
String extension = ".txt";
return "myFile" + numFile + extension;
}
Have a counter to count the number of entries; let's say one entry per line.
Step 1: create a new subfile and set counter = 0.
Step 2: increment the counter as you read each entry from the source file into the buffer.
Step 3: when the counter reaches the number of entries you want in each subfile, flush the buffer's contents to the subfile and close it.
Step 4: jump back to Step 1 while there is still data to read from the source file.
There's no need to loop through the file twice. You could estimate the size of each chunk as the source file size divided by the number of chunks needed, then just stop filling a chunk with data once its size exceeds that estimate.
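A sketch of that single-pass idea, assuming line-oriented data and an invented .partN naming scheme:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class EstimatedChunkSplit {
    static void split(File source, int n) throws IOException {
        long targetBytes = source.length() / n;   // estimated size of each chunk
        try (BufferedReader reader = new BufferedReader(new FileReader(source))) {
            int part = 1;
            long written = 0;
            BufferedWriter writer = new BufferedWriter(new FileWriter(source.getName() + ".part" + part));
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
                written += line.getBytes().length + 1;   // rough count: line bytes plus one for the newline
                if (written >= targetBytes && part < n) {
                    writer.close();                      // this chunk is "full enough", start the next one
                    part++;
                    written = 0;
                    writer = new BufferedWriter(new FileWriter(source.getName() + ".part" + part));
                }
            }
            writer.close();
        }
    }
}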
Here is one that worked for me; I used it to split a 10 GB file. It also lets you add a header and a footer, which is very useful when splitting document-based formats such as XML and JSON, because you need to add the document wrapper to the new split files.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
public class FileSpliter
{
public static void main(String[] args) throws IOException
{
splitTextFiles("D:\\xref.csx", 750000, "", "", null);
}
public static void splitTextFiles(String fileName, int maxRows, String header, String footer, String targetDir) throws IOException
{
File bigFile = new File(fileName);
int i = 1;
String ext = fileName.substring(fileName.lastIndexOf("."));
String fileNoExt = bigFile.getName().replace(ext, "");
File newDir = null;
if(targetDir != null)
{
newDir = new File(targetDir);
}
else
{
newDir = new File(bigFile.getParent() + "\\" + fileNoExt + "_split");
}
newDir.mkdirs();
try (BufferedReader reader = Files.newBufferedReader(Paths.get(fileName)))
{
String line = null;
int lineNum = 1;
Path splitFile = Paths.get(newDir.getPath() + "\\" + fileNoExt + "_" + String.format("%02d", i) + ext);
BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
while ((line = reader.readLine()) != null)
{
if(lineNum == 1)
{
System.out.print("new file created '" + splitFile.toString());
if(header != null && header.length() > 0)
{
writer.append(header);
writer.newLine();
}
}
writer.append(line);
if (lineNum >= maxRows)
{
if(footer != null && footer.length() > 0)
{
writer.newLine();
writer.append(footer);
}
writer.close();
System.out.println(", " + lineNum + " lines written to file");
lineNum = 1;
i++;
splitFile = Paths.get(newDir.getPath() + "\\" + fileNoExt + "_" + String.format("%02d", i) + ext);
writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
}
else
{
writer.newLine();
lineNum++;
}
}
if(lineNum <= maxRows) // early exit
{
if(footer != null && footer.length() > 0)
{
writer.newLine();
lineNum++;
writer.append(footer);
}
}
writer.close();
System.out.println(", " + lineNum + " lines written to file");
}
System.out.println("file '" + bigFile.getName() + "' split into " + i + " files");
}
}
The code below splits a big file into smaller files with fewer lines each.
long linesWritten = 0;
int count = 1;
try {
File inputFile = new File(inputFilePath);
InputStream inputFileStream = new BufferedInputStream(new FileInputStream(inputFile));
BufferedReader reader = new BufferedReader(new InputStreamReader(inputFileStream));
String line = reader.readLine();
String fileName = inputFile.getName();
String outfileName = outputFolderPath + "\\" + fileName;
while (line != null) {
File outFile = new File(outfileName + "_" + count + ".split");
Writer writer = new OutputStreamWriter(new FileOutputStream(outFile));
while (line != null && linesWritten < linesPerSplit) {
writer.write(line);
writer.write(System.lineSeparator()); // keep one record per line in the split file
line = reader.readLine();
linesWritten++;
}
writer.close();
linesWritten = 0; // reset for the next file
count++; // next file count
}
reader.close();
} catch (Exception e) {
e.printStackTrace();
}
Split a file into multiple chunks (an in-memory operation). Here I'm splitting any file into chunks of 500 KB (500,000 bytes):
public static List<ByteArrayOutputStream> splitFile(File f) {
List<ByteArrayOutputStream> datalist = new ArrayList<>();
try {
int sizeOfFiles = 500000;
byte[] buffer = new byte[sizeOfFiles];
try (FileInputStream fis = new FileInputStream(f); BufferedInputStream bis = new BufferedInputStream(fis)) {
int bytesAmount = 0;
while ((bytesAmount = bis.read(buffer)) > 0) {
try (OutputStream out = new ByteArrayOutputStream()) {
out.write(buffer, 0, bytesAmount);
out.flush();
datalist.add((ByteArrayOutputStream) out);
}
}
}
} catch (Exception e) {
e.printStackTrace(); // report rather than swallow the error
}
return datalist;
}
I am a bit late to answer, but here's how I did it:
Approach:
First I determine how many bytes each of the individual files should contain, then I split the large file by bytes. Only one file chunk's worth of data is loaded into memory at a time.
Example: if a 5 GB file is split into 10 files, then only 500 MB worth of bytes are loaded into memory at a time, held in the buffer variable in the splitBySize method below.
Code explanation:
The method splitFile first gets the number of bytes each of the individual file chunks should contain by calling the getSizeInBytes method, then it calls the splitBySize method, which splits the large file by size (i.e. maxChunkSize is the number of bytes each file chunk will contain).
public static List<File> splitFile(File largeFile, int noOfFiles) throws IOException {
return splitBySize(largeFile, getSizeInBytes(largeFile.length(), noOfFiles));
}
public static List<File> splitBySize(File largeFile, int maxChunkSize) throws IOException {
List<File> list = new ArrayList<>();
int numberOfFiles = 0;
try (InputStream in = Files.newInputStream(largeFile.toPath())) {
final byte[] buffer = new byte[maxChunkSize];
int dataRead = in.read(buffer);
while (dataRead > -1) {
list.add(stageLocally(buffer, dataRead));
numberOfFiles++;
dataRead = in.read(buffer);
}
}
System.out.println("Number of files generated: " + numberOfFiles);
return list;
}
private static int getSizeInBytes(long totalBytes, int numberOfFiles) {
if (totalBytes % numberOfFiles != 0) {
totalBytes = ((totalBytes / numberOfFiles) + 1)*numberOfFiles;
}
long x = totalBytes / numberOfFiles;
if (x > Integer.MAX_VALUE){
throw new NumberFormatException("Byte chunk too large");
}
return (int) x;
}
Full Code:
public class StackOverflow {
private static final String INPUT_FILE_PATH = "/Users/malkesingh/Downloads/5MB.zip";
private static final String TEMP_DIRECTORY = "/Users/malkesingh/temp";
public static void main(String[] args) throws IOException {
File input = new File(INPUT_FILE_PATH);
File outPut = fileJoin2(splitFile(input, 5));
try (InputStream in = Files.newInputStream(input.toPath()); InputStream out = Files.newInputStream(outPut.toPath())) {
System.out.println(IOUtils.contentEquals(in, out));
}
}
public static List<File> splitFile(File largeFile, int noOfFiles) throws IOException {
return splitBySize(largeFile, getSizeInBytes(largeFile.length(), noOfFiles));
}
public static List<File> splitBySize(File largeFile, int maxChunkSize) throws IOException {
List<File> list = new ArrayList<>();
int numberOfFiles = 0;
try (InputStream in = Files.newInputStream(largeFile.toPath())) {
final byte[] buffer = new byte[maxChunkSize];
int dataRead = in.read(buffer);
while (dataRead > -1) {
list.add(stageLocally(buffer, dataRead));
numberOfFiles++;
dataRead = in.read(buffer);
}
}
System.out.println("Number of files generated: " + numberOfFiles);
return list;
}
private static int getSizeInBytes(long totalBytes, int numberOfFiles) {
if (totalBytes % numberOfFiles != 0) {
totalBytes = ((totalBytes / numberOfFiles) + 1)*numberOfFiles;
}
long x = totalBytes / numberOfFiles;
if (x > Integer.MAX_VALUE){
throw new NumberFormatException("Byte chunk too large");
}
return (int) x;
}
private static File stageLocally(byte[] buffer, int length) throws IOException {
File outPutFile = File.createTempFile("temp-", "split", new File(TEMP_DIRECTORY));
try(FileOutputStream fos = new FileOutputStream(outPutFile)) {
fos.write(buffer, 0, length);
}
return outPutFile;
}
public static File fileJoin2(List<File> list) throws IOException {
File outPutFile = File.createTempFile("temp-", "unsplit", new File(TEMP_DIRECTORY));
FileOutputStream fos = new FileOutputStream(outPutFile);
for (File file : list) {
Files.copy(file.toPath(), fos);
}
fos.close();
return outPutFile;
}}
import java.util.*;
import java.io.*;
public class task13 {
public static void main(String[] args)throws IOException{
Scanner s =new Scanner(System.in);
System.out.print("Enter path:");
String a=s.next();
File f=new File(a+".txt");
Scanner st=new Scanner(f);
System.out.println(f.canRead()+"\n"+f.canWrite());
long l=f.length();
System.out.println("Length is:"+l);
System.out.print("Enter no.of partitions:");
int p=s.nextInt();
long x=l/p;
st.useDelimiter("\\Z");
String t=st.next();
int j=0;
System.out.println("Each File Length is:"+x);
for(int i=1;i<=p;i++){
File ft=new File(a+"-"+i+".txt");
ft.createNewFile();
int g=(j*(int)x);
int h=(j+1)*(int)x;
if(g<=l&&h<=l){
FileWriter fw=new FileWriter(a+"-"+i+".txt");
String v=t.substring(g,h);
fw.write(v);
j++;
fw.close();
}
}
}
}
I'm writing a potentially long list of items to a file. The items I'm writing are of variable length. If the file size produced is greater than 10M it should be broken up into multiple files. To aid performance I'm currently using a BufferedWriter as below:
final FileOutputStream fos = new FileOutputStream(file);
final OutputStreamWriter osr = new OutputStreamWriter(fos, "UTF-8");
final BufferedWriter bw = new BufferedWriter(osr);
By doing this, though, I'm not able to accurately monitor the size of the file I'm writing. I could flush often or remove the buffering, but of course that would have a performance impact. What's the best option here? Ideally I'd like the sizes of the files produced to be as close to the 10 MB mark as possible.
In this case, don't use a BufferedWriter but a simple FileOutputStream (which you have already). Write (and count the size of) your string's .getBytes("UTF-8"). You may have to write additional newlines as needed, but that's simple enough.
This way you know in advance exactly how many bytes you are about to write.
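A minimal sketch of that byte-counting approach (the 10 MB limit, the baseName naming scheme, and the writeItems helper are illustrative assumptions, not part of the question's code):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class SizeLimitedWriter {
    private static final long MAX_BYTES = 10L * 1024 * 1024;  // 10 MB per file (assumed limit)
    private static final byte[] NEWLINE =
            System.getProperty("line.separator").getBytes(StandardCharsets.UTF_8);

    static void writeItems(List<String> items, String baseName) throws IOException {
        int fileIndex = 0;
        long bytesWritten = 0;
        FileOutputStream out = new FileOutputStream(baseName + "_" + fileIndex);
        for (String item : items) {
            byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
            if (bytesWritten + bytes.length + NEWLINE.length > MAX_BYTES && bytesWritten > 0) {
                out.close();                              // this file would pass 10 MB, roll to the next one
                fileIndex++;
                bytesWritten = 0;
                out = new FileOutputStream(baseName + "_" + fileIndex);
            }
            out.write(bytes);                             // we know exactly how many bytes go out
            out.write(NEWLINE);
            bytesWritten += bytes.length + NEWLINE.length;
        }
        out.close();
    }
}

You can still wrap the stream in a BufferedOutputStream for throughput; since the byte count is taken before each write, buffering doesn't affect the size bookkeeping.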
The following example uses Java 7's try-with-resources statement; if you're targeting an earlier platform you'll have to manually close the streams.
final int MAX_BYTES = 1024 * 1024 * 10;
final int NEWLINE_BYTES = System.getProperty("line.separator")
.getBytes("UTF-8").length;
int bytesWritten = 0;
int fileIndex = 0;
while (existsMoreData()) {
try (
FileOutputStream fos = new FileOutputStream(
getFileNameForIndex(fileIndex), true); // append: the current file is reopened on every pass
OutputStreamWriter osr = new OutputStreamWriter(fos, "UTF-8");
BufferedWriter bw = new BufferedWriter(osr)) {
String toWrite = getCurrentStringToWrite();
int bytesOfString = toWrite.getBytes("UTF-8").length;
if (bytesWritten + bytesOfString + NEWLINE_BYTES > MAX_BYTES
&& bytesWritten > 0 /* an oversized single part still gets written to its own file */ ) {
// need to start a new file
fileIndex++;
bytesWritten = 0;
continue; // auto-closed because of try-with-resources
} else {
bw.write(toWrite, 0, toWrite.length());
bw.newLine();
bytesWritten += bytesOfString + NEWLINE_BYTES;
incrementDataToWrite();
}
} catch (IOException ie) {
ie.printStackTrace();
}
}
Possible implementations:
String[] data = someLongString.split("\n");
int currentPart = 0;
private boolean existsMoreData() {
return currentPart + 1 < data.length;
}
private String getCurrentStringToWrite() {
return data[currentPart];
}
private void incrementDataToWrite() {
currentPart++;
}
private String getFileNameForIndex(int index) {
final String BASE_NAME = "/home/codebuddy/somefile";
return String.format("%s_%s.txt", BASE_NAME, index);
// equivalent to:
// return BASE_NAME + "_" + index + ".txt";
}
Is there any possibility that my following BufferedReader is able to put the input directly into a byte[]?
public static Runnable reader() throws IOException {
Log.e("Communication", "reader");
din = new DataInputStream(sock.getInputStream());
brdr = new BufferedReader(new InputStreamReader(din), 300);
boolean done = false;
while (!done) {
try {
char[] buffer = new char[200];
int length = brdr.read(buffer, 0, 200);
String message = new String(buffer, 0, length);
btrar = message.getBytes("ISO-8859-1");
int i=0;
for (int counter = 0; counter < message.length(); counter++) {
i++;
System.out.println(btrar[counter] + " = " + " btrar " + i);
}
...
That's the relevant part of the reader, please have a look.
I want the input to go directly into btrar.
is there any possibility my following BufferedReader is able to put the input directly into a byte[]?
Any Reader is designed to let you read characters, not bytes. To read binary data, just use an InputStream - using BufferedInputStream to buffer it if you want.
It's not really clear what you're trying to do, but you can use something like:
BufferedInputStream input = new BufferedInputStream(sock.getInputStream());
while (!done) {
// TODO: Rename btrar to something more meaningful
int bytesRead = input.read(btrar);
// Do something with the data...
}
Is there any reason to prefer a CharBuffer to a char[] in the following:
CharBuffer buf = CharBuffer.allocate(DEFAULT_BUFFER_SIZE);
while( in.read(buf) >= 0 ) {
out.append( buf.flip() );
buf.clear();
}
vs.
char[] buf = new char[DEFAULT_BUFFER_SIZE];
int n;
while( (n = in.read(buf)) >= 0 ) {
out.write( buf, 0, n );
}
(where in is a Reader and out is a Writer)?
No, there's really no reason to prefer a CharBuffer in this case.
In general, though, CharBuffer (and ByteBuffer) can really simplify APIs and encourage correct processing. If you were designing a public API, it's definitely worth considering a buffer-oriented API.
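To illustrate what a buffer-oriented API can look like, here is a small hypothetical interface (not an existing one) where the caller allocates and reuses a single CharBuffer:

import java.io.IOException;
import java.nio.CharBuffer;

// Hypothetical buffer-oriented API: the caller owns and reuses the buffer.
interface TokenSource {
    /** Fills the buffer with as many chars as fit; returns -1 when the source is exhausted. */
    int readInto(CharBuffer buffer) throws IOException;
}

class BufferOrientedExample {
    static void drain(TokenSource source, Appendable out) throws IOException {
        CharBuffer buf = CharBuffer.allocate(8192);   // allocated once, reused for every call
        while (source.readInto(buf) >= 0) {
            buf.flip();                               // switch from filling to draining
            out.append(buf);                          // CharBuffer is a CharSequence
            buf.clear();                              // ready for the next fill
        }
    }
}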
I wanted to mini-benchmark this comparison.
Below is the class I have written.
The thing is I can't believe that the CharBuffer performed so badly. What have I got wrong?
EDIT: Since the 11th comment below I have edited the code and the timings; performance is better all round, but there is still a significant difference in times. I also tried the out2.append((CharBuffer) buff.flip()) option mentioned in the comments, but it was much slower than the write option used in the code below.
Results: (time in ms)
char[] : 3411
CharBuffer: 5653
public class CharBufferScratchBox
{
public static void main(String[] args) throws Exception
{
// Some Setup Stuff
String smallString =
"1111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000";
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 1000; i++)
{
stringBuilder.append(smallString);
}
String string = stringBuilder.toString();
int DEFAULT_BUFFER_SIZE = 1000;
int ITTERATIONS = 10000;
// char[]
StringReader in1 = null;
StringWriter out1 = null;
Date start = new Date();
for (int i = 0; i < ITTERATIONS; i++)
{
in1 = new StringReader(string);
out1 = new StringWriter(string.length());
char[] buf = new char[DEFAULT_BUFFER_SIZE];
int n;
while ((n = in1.read(buf)) >= 0)
{
out1.write(
buf,
0,
n);
}
}
Date done = new Date();
System.out.println("char[] : " + (done.getTime() - start.getTime()));
// CharBuffer
StringReader in2 = null;
StringWriter out2 = null;
start = new Date();
CharBuffer buff = CharBuffer.allocate(DEFAULT_BUFFER_SIZE);
for (int i = 0; i < ITTERATIONS; i++)
{
in2 = new StringReader(string);
out2 = new StringWriter(string.length());
int n;
while ((n = in2.read(buff)) >= 0)
{
out2.write(
buff.array(),
0,
n);
buff.clear();
}
}
done = new Date();
System.out.println("CharBuffer: " + (done.getTime() - start.getTime()));
}
}
If this is the only thing you're doing with the buffer, then the array is probably the better choice in this instance.
CharBuffer has lots of extra chrome on it, but none of it is relevant in this case - and will only slow things down a fraction.
You can always refactor later if you need to make things more complicated.
The difference, in practice, is actually <10%, not 30% as others are reporting.
To read and write a 5 MB file 24 times, here are my numbers, taken using a profiler. On average they were:
char[] = 4139 ms
CharBuffer = 4466 ms
ByteBuffer = 938 (direct) ms
Individual tests a couple times favored CharBuffer.
I also tried replacing the File-based IO with In-Memory IO and the performance was similar. If you are trying to transfer from one native stream to another, then you are better off using a "direct" ByteBuffer.
With less than 10% performance difference, in practice, I would favor the CharBuffer. Its syntax is clearer, there are fewer extraneous variables, and you can do more direct manipulation on it (i.e. anything that asks for a CharSequence).
Benchmark is below... it is slightly wrong as the BufferedReader is allocated inside the test-method rather than outside... however, the example below allows you to isolate the IO time and eliminate factors like a string or byte stream resizing its internal memory buffer, etc.
public static void main(String[] args) throws Exception {
File f = getBytes(5000000);
System.out.println(f.getAbsolutePath());
try {
System.gc();
List<Main> impls = new java.util.ArrayList<Main>();
impls.add(new CharArrayImpl());
//impls.add(new CharArrayNoBuffImpl());
impls.add(new CharBufferImpl());
//impls.add(new CharBufferNoBuffImpl());
impls.add(new ByteBufferDirectImpl());
//impls.add(new CharBufferDirectImpl());
for (int i = 0; i < 25; i++) {
for (Main impl : impls) {
test(f, impl);
}
System.out.println("-----");
if(i==0)
continue; //reset profiler
}
System.gc();
System.out.println("Finished");
return;
} finally {
f.delete();
}
}
static int BUFFER_SIZE = 1000;
static File getBytes(int size) throws IOException {
File f = File.createTempFile("input", ".txt");
FileWriter writer = new FileWriter(f);
Random r = new Random();
for (int i = 0; i < size; i++) {
writer.write(Integer.toString(5));
}
writer.close();
return f;
}
static void test(File f, Main impl) throws IOException {
InputStream in = new FileInputStream(f);
File fout = File.createTempFile("output", ".txt");
try {
OutputStream out = new FileOutputStream(fout, false);
try {
long start = System.currentTimeMillis();
impl.runTest(in, out);
long end = System.currentTimeMillis();
System.out.println(impl.getClass().getName() + " = " + (end - start) + "ms");
} finally {
out.close();
}
} finally {
fout.delete();
in.close();
}
}
public abstract void runTest(InputStream ins, OutputStream outs) throws IOException;
public static class CharArrayImpl extends Main {
char[] buff = new char[BUFFER_SIZE];
public void runTest(InputStream ins, OutputStream outs) throws IOException {
Reader in = new BufferedReader(new InputStreamReader(ins));
Writer out = new BufferedWriter(new OutputStreamWriter(outs));
int n;
while ((n = in.read(buff)) >= 0) {
out.write(buff, 0, n);
}
}
}
public static class CharBufferImpl extends Main {
CharBuffer buff = CharBuffer.allocate(BUFFER_SIZE);
public void runTest(InputStream ins, OutputStream outs) throws IOException {
Reader in = new BufferedReader(new InputStreamReader(ins));
Writer out = new BufferedWriter(new OutputStreamWriter(outs));
int n;
while ((n = in.read(buff)) >= 0) {
buff.flip();
out.append(buff);
buff.clear();
}
}
}
public static class ByteBufferDirectImpl extends Main {
ByteBuffer buff = ByteBuffer.allocateDirect(BUFFER_SIZE * 2);
public void runTest(InputStream ins, OutputStream outs) throws IOException {
ReadableByteChannel in = Channels.newChannel(ins);
WritableByteChannel out = Channels.newChannel(outs);
int n;
while ((n = in.read(buff)) >= 0) {
buff.flip();
out.write(buff);
buff.clear();
}
}
}
I think that CharBuffer and ByteBuffer (as well as any other xBuffer) were meant for reusability, so you can buf.clear() them instead of allocating a new buffer every time.
If you don't reuse them, you're not using their full potential and it will add extra overhead. However, if you're planning on scaling this function, it might be a good idea to keep them around.
You should be careful with CharBuffer in some Java 6 versions: there is a bug in CharBuffer#subSequence(). You cannot get a subsequence from the second half of the buffer, since the implementation confuses capacity and remaining. I observed the bug in Java 6u11 and 6u12.
The CharBuffer version is slightly less complicated (one less variable), encapsulates buffer size handling and makes use of a standard API. Generally I would prefer this.
However, there is still one good reason to prefer the array version, in some cases at least: CharBuffer was only introduced in Java 1.4, so if you are deploying to an earlier version you can't use CharBuffer (unless you roll your own or use a backport).
P.S. If you use a backport, remember to remove it once you catch up to the version containing the "real" version of the backported code.