How to deal with a huge, one-line file in Java

I need to read a huge file (15+GB) and perform some minor modifications (add some newlines so a different parser can actually work with it). You might think that there are already answers for doing this normally:
Reading a very huge file in java
How to read a large text file line by line using Java?
but my entire file is on one line.
My general approach so far is very basic:
// ReaderUTF8 is not a JDK class; new InputStreamReader(..., StandardCharsets.UTF_8) would be the standard equivalent.
// writer and originalFileSize are set up elsewhere and are not shown here.
char[] buffer = new char[X];
BufferedReader reader = new BufferedReader(new ReaderUTF8(new FileInputStream(new File("myFileName"))), X);
char[] bufferOut = new char[X + a little];
int bytesRead = -1;
int i = 0;
int offset = 0;
long totalBytesRead = 0;
int countToPrint = 0;
while ((bytesRead = reader.read(buffer)) >= 0) {
    for (i = 0; i < bytesRead; i++) {
        if (buffer[i] == '}') {
            bufferOut[i + offset] = '}';
            offset++;
            bufferOut[i + offset] = '\n';
        } else {
            bufferOut[i + offset] = buffer[i];
        }
    }
    writer.write(bufferOut, 0, bytesRead + offset);
    offset = 0;
    totalBytesRead += bytesRead;
    countToPrint += 1;
    if (countToPrint == 10) {
        countToPrint = 0;
        System.out.println("Read " + ((double) totalBytesRead / originalFileSize * 100) + " percent.");
    }
}
writer.flush();
After some experimentation, I've found that a value of X larger than a million gives optimal speed - it looks like I'm getting about 2% every 10 minutes, while a value of X of ~60,000 only got 60% in 15 hours. Profiling reveals that I'm spending 96+% of my time in the read() method, so that's definitely my bottleneck. As of writing this, my 8 million X version has finished 32% of the file after 2 hours and 40 minutes, in case you want to know how it performs long-term.
Is there a better approach for dealing with such a large, one-line file? As in, is there a faster way of reading this type of file that gives me a relatively easy way of inserting the newline characters?
I am aware that different languages or programs could probably handle this gracefully, but I'm limiting this to a Java perspective.

You are making this far more complicated than it needs to be. By just making use of the buffering already provided by the standard classes you should get a throughput of at least several MB per second without any hassle.
This simple test program processes 1GB in less than 2 minutes on my PC (including creating the test file):
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class TestFileProcessing {

    public static void main(String[] argv) {
        try {
            long time = System.currentTimeMillis();
            File from = new File("C:\\Test\\Input.txt");
            createTestFile(from, StandardCharsets.UTF_8, 1_000_000_000);
            System.out.println("Created file in: " + (System.currentTimeMillis() - time) + "ms");
            time = System.currentTimeMillis();
            File to = new File("C:\\Test\\Output.txt");
            doIt(from, to, StandardCharsets.UTF_8);
            System.out.println("Converted file in: " + (System.currentTimeMillis() - time) + "ms");
        } catch (IOException e) {
            throw new RuntimeException(e.getMessage(), e);
        }
    }

    public static void createTestFile(File file, Charset encoding, long size) throws IOException {
        Random r = new Random(12345);
        try (OutputStream fout = new FileOutputStream(file);
             BufferedOutputStream bout = new BufferedOutputStream(fout);
             Writer writer = new OutputStreamWriter(bout, encoding)) {
            for (long i = 0; i < size; ++i) {
                int c = r.nextInt(26);
                if (c == 0)
                    writer.write('}');
                else
                    writer.write('a' + c);
            }
        }
    }

    public static void doIt(File from, File to, Charset encoding) throws IOException {
        try (InputStream fin = new FileInputStream(from);
             BufferedInputStream bin = new BufferedInputStream(fin);
             Reader reader = new InputStreamReader(bin, encoding);
             OutputStream fout = new FileOutputStream(to);
             BufferedOutputStream bout = new BufferedOutputStream(fout);
             Writer writer = new OutputStreamWriter(bout, encoding)) {
            int c;
            while ((c = reader.read()) >= 0) {
                if (c == '}')
                    writer.write('\n');
                writer.write(c);
            }
        }
    }
}
As you can see, no elaborate logic or excessive buffer sizes are used; the streams closest to the hardware, the FileInputStream/FileOutputStream, are simply buffered.
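If the single-character read()/write() calls ever do become the limiting factor on a particular machine, the same transformation can also be done with bulk reads into a char array while still relying on the standard buffered streams. This is only a sketch under the assumption that the input is UTF-8 and that a newline should be inserted after every '}' (as in the question); the class and method names are made up for illustration:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BulkNewlineInserter {
    public static void insertNewlines(String from, String to) throws IOException {
        char[] buffer = new char[64 * 1024];
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(from), StandardCharsets.UTF_8));
             Writer writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(to), StandardCharsets.UTF_8))) {
            int n;
            while ((n = reader.read(buffer)) >= 0) {
                int start = 0;
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '}') {
                        // write everything up to and including the '}', then the newline
                        writer.write(buffer, start, i - start + 1);
                        writer.write('\n');
                        start = i + 1;
                    }
                }
                // write whatever is left of this chunk
                writer.write(buffer, start, n - start);
            }
        }
    }
}

Whether this helps at all depends on the JVM and the disk; the per-character version above is already likely to be I/O-bound.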

Related

Large file (5 GB) hashing with bcrypt takes long time (246sec, ~4 minutes)

I have a very simple piece of code, which is designed to avoid OutOfMemory exceptions. For this reason, the file is streamed and chunks are created; each chunk is hashed, and the final hash is generated from these (this is called a hash list: https://en.wikipedia.org/wiki/Hash_list).
My questions are the following:
Hashing with bcrypt takes a very long time (246 sec for a 5 GB file). I'm not sure whether this is a real problem (or is it normal?), but I consider the time very, very long. (Will it freeze the program?)
Also, is it possible to speed it up? I'm planning to parallelize the hashing of the list elements; is that a good approach? Should I do it?
The code sample is using a 5GB test file downloaded from: https://testfiledownload.com/
My code is the following:
package main;
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
import org.springframework.util.StopWatch;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) throws IOException {
// Sample test file downloaded from: https://testfiledownload.com/
StopWatch sw = new StopWatch();
sw.start();
String h_text = encryptor(iparser("C:\\Users\\Thend\\Desktop\\test_5gb\\5gb.test"));
sw.stop();
System.out.println(" Hash: "+h_text+" | Execution time: "+String.format("%.5f", sw.getTotalTimeMillis() / 1000.0f)+"sec");
}
public static List<byte[]> iparser (String file) throws IOException {
List<byte[]> temp = new ArrayList<>();
if (Files.size(Path.of(file)) > 104857600) { // 100 MB = 104857600 Bytes (in binary)
// Add chunk by chunk
try (FileInputStream fis = new FileInputStream(file)) {
byte[] buffer = new byte[10485760]; // 10 MB chunk
int len;
while ((len = fis.read(buffer)) > 0) {
temp.add(buffer);
}
return temp;
}
} else {
// Add whole
try (FileInputStream fis = new FileInputStream(file)) {
byte[] buffer = new byte[(int) file.length()]; // Add whole
int len;
while ((len = fis.read(buffer)) > 0) {
temp.add(buffer);
}
return temp;
}
}
}
public static String encryptor(List<byte[]> list) {
BCryptPasswordEncoder bcpe = new BCryptPasswordEncoder();
ArrayList<String> temp = new ArrayList<>();
if (list.size() > 1) {
// If there is more than one element in the list
list.forEach((n) -> {
//String tohash = new String(n, StandardCharsets.UTF_8);
String tohash = new String(n);
String hashedByteArray = bcpe.encode(tohash);
temp.add((hashedByteArray.split(Pattern.quote("$"))[3]).substring(22));
});
return bcpe.encode(String.join("", temp));
} else {
// If there is only one element in the list
return bcpe.encode(String.join("", temp));
}
}
}
The console output:
Hash: $2a$10$60BSOrudT4BT3RC4dqcooupGd6fmg0/LU0RLGhBTSLvbZgypGuyBq | Execution time: 246,88400sec
Process finished with exit code 0
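As an aside (unrelated to the bcrypt timing itself): iparser as posted adds the same 10 MB buffer object to the list on every iteration and ignores len, so all list elements end up referring to the bytes of the last read. A minimal sketch of a chunk reader that stores a copy of exactly the bytes read per chunk might look like this (class and method names are mine, for illustration):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkReader {
    // Reads the file in 10 MB chunks; each list element is its own copy of the bytes actually read.
    public static List<byte[]> readChunks(String file) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        byte[] buffer = new byte[10 * 1024 * 1024];
        try (FileInputStream fis = new FileInputStream(file)) {
            int len;
            while ((len = fis.read(buffer)) > 0) {
                // copy only the len bytes that were read, not the whole (possibly stale) buffer
                chunks.add(Arrays.copyOf(buffer, len));
            }
        }
        return chunks;
    }
}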

Java FileInputStream FileOutputStream difference in the run

Could someone tell me why the first run is wrong? (The return code is 0, but the written file is only half the size of the original one.)
Thanks in advance!
public class FileCopyFisFos {

    public static void main(String[] args) throws IOException {
        FileInputStream fis = new FileInputStream("d:/Test1/OrigFile.MP4");
        FileOutputStream fos = new FileOutputStream("d:/Test2/DestFile.mp4");

        // 1. run
        // while (fis.read() != -1) {
        //     int len = fis.read();
        //     fos.write(len);
        // }

        // 2. run
        // int len;
        // while ((len = fis.read()) != -1) {
        //     fos.write(len);
        // }

        fis.close();
        fos.close();
    }
}
FileInputStream's read() method is documented as follows:
Reads a byte of data from this input stream. This method blocks if no input is yet available.
So assigning its return value to a variable, such as:
while ((len = fis.read()) != -1)
prevents the byte of data just read from the stream from being lost, as the result of every read() call is stored in your len variable.
The first run, by contrast, skips one of every two bytes from the stream, because the read() executed in the while condition is never assigned to a variable. So the stream advances with half of the bytes never being read into len (and a trailing -1 may even be written):
while (fis.read() != -1) {  // reads a byte of data (but not saved)
    int len = fis.read();   // next byte of data saved
    fos.write(len);         // possible -1 written here
}
#aran and others already pointed out the solution to your problem.
However there are more sides to this, so I extended your example:
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
public class FileCopyFisFos {
public static void main(final String[] args) throws IOException {
final File src = new File("d:/Test1/OrigFile.MP4");
final File sink = new File("d:/Test2/DestFile.mp4");
{
final long startMS = System.currentTimeMillis();
final long bytesCopied = copyFileSimple(src, sink);
System.out.println("Simple copy transferred " + bytesCopied + " bytes in " + (System.currentTimeMillis() - startMS) + "ms");
}
{
final long startMS = System.currentTimeMillis();
final long bytesCopied = copyFileSimpleFaster(src, sink);
System.out.println("Simple+Fast copy transferred " + bytesCopied + " bytes in " + (System.currentTimeMillis() - startMS) + "ms");
}
{
final long startMS = System.currentTimeMillis();
final long bytesCopied = copyFileFast(src, sink);
System.out.println("Fast copy transferred " + bytesCopied + " bytes in " + (System.currentTimeMillis() - startMS) + "ms");
}
System.out.println("Test completed.");
}
static public long copyFileSimple(final File pSourceFile, final File pSinkFile) throws IOException {
try (
final FileInputStream fis = new FileInputStream(pSourceFile);
final FileOutputStream fos = new FileOutputStream(pSinkFile);) {
long totalBytesTransferred = 0;
while (true) {
final int readByte = fis.read();
if (readByte < 0) break;
fos.write(readByte);
++totalBytesTransferred;
}
return totalBytesTransferred;
}
}
static public long copyFileSimpleFaster(final File pSourceFile, final File pSinkFile) throws IOException {
try (
final FileInputStream fis = new FileInputStream(pSourceFile);
final FileOutputStream fos = new FileOutputStream(pSinkFile);
BufferedInputStream bis = new BufferedInputStream(fis);
BufferedOutputStream bos = new BufferedOutputStream(fos);) {
long totalBytesTransferred = 0;
while (true) {
final int readByte = bis.read();
if (readByte < 0) break;
bos.write(readByte);
++totalBytesTransferred;
}
return totalBytesTransferred;
}
}
static public long copyFileFast(final File pSourceFile, final File pSinkFile) throws IOException {
try (
final FileInputStream fis = new FileInputStream(pSourceFile);
final FileOutputStream fos = new FileOutputStream(pSinkFile);) {
long totalBytesTransferred = 0;
final byte[] buffer = new byte[20 * 1024];
while (true) {
final int bytesRead = fis.read(buffer);
if (bytesRead < 0) break;
fos.write(buffer, 0, bytesRead);
totalBytesTransferred += bytesRead;
}
return totalBytesTransferred;
}
}
}
The hints that come along with that code:
There is the java.nio package that usually does those things a lot faster and in less code (see the sketch after these hints).
Copying single bytes is 1'000-40'000 times slower than bulk copying.
Using try-with-resources is the best way to avoid problems with reserved/locked resources like files etc.
If you solve something that is quite commonplace, I suggest you put it in a utility class of your own or even your own library.
There are helper classes like BufferedInputStream and BufferedOutputStream that take care of efficiency greatly; see example copyFileSimpleFaster().
But as usual, it is the quality of the concept that has the most impact on the implementation; see example copyFileFast().
There are even more advanced concepts (similar to java.nio) that take into account things like OS caching behaviour etc., which will give performance another kick.
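For the java.nio route mentioned in the first hint, a whole-file copy can be reduced to a single call. This is only a sketch of the standard java.nio.file API, not part of the timing harness above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class NioCopy {
    public static long copyFileNio(final Path source, final Path sink) throws IOException {
        // Files.copy lets the JDK choose the copy strategy for the underlying file system
        Files.copy(source, sink, StandardCopyOption.REPLACE_EXISTING);
        return Files.size(sink);
    }
}

It could be called, for example, as copyFileNio(src.toPath(), sink.toPath()) with the File objects from main.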
Check my outputs, or run it on your own, to see the differences in performance:
Simple copy transferred 1608799 bytes in 12709ms
Simple+Fast copy transferred 1608799 bytes in 51ms
Fast copy transferred 1608799 bytes in 4ms
Test completed.

I can't change the speed of a music file without a disturbing noise appearing

I am trying to change the speed of an audio file. If I do it with whole-number values everything is fine, but once I start using double values things get messy: my code works with all the x.5 values but not with any other decimal, and in my case I want to use a speed factor of 1.3. All I get is a file where you can barely hear anything but an annoying noise.
Here is the code that I am using:
import javax.swing.JOptionPane;
import javax.sound.sampled.*;
import java.net.URL;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.ByteArrayInputStream;
import java.util.Date;

class AcceleratePlayback {
    public static void main(String[] args) throws Exception {
        // double playBackSpeed = 1.5;     // works
        double playBackSpeed = 1.3;        // doesn't work (the case this question is about)
        File file1 = new File("Sample2.wav");
        File file2 = new File("DEF.wav");
        AudioInputStream ais = AudioSystem.getAudioInputStream(file1);
        AudioFormat af = ais.getFormat();
        int frameSize = af.getFrameSize();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] b = new byte[2 ^ 16];       // note: ^ is XOR in Java, so this is 18, not 65536
        int read = 1;
        while (read > -1) {
            read = ais.read(b);
            if (read > 0) {
                baos.write(b, 0, read);
            }
        }
        byte[] b1 = baos.toByteArray();
        byte[] b2 = new byte[(int) (b1.length / playBackSpeed)];
        for (int i = 0; i < (b2.length / frameSize); i++) {
            for (int j = 0; j < frameSize; j++) {
                b2[(i * frameSize) + j] = b1[(int) ((i * frameSize * playBackSpeed) + j)];
            }
        }
        ByteArrayInputStream bais = new ByteArrayInputStream(b2);
        AudioInputStream aisAccelerated =
            new AudioInputStream(bais, af, b2.length / frameSize);
        AudioSystem.write(aisAccelerated, AudioFileFormat.Type.WAVE, file2);
    }
}
Your read position falls on odd boundaries: because of the truncation, the byte you read from starts at an odd offset. Use the following to start from an even location:
for (int i = 0; i < (b2.length / frameSize); i++) {
    int ind = (int) (i * frameSize * playBackSpeed);
    if ((ind % 2) == 1) ind++;   // move the read position to an even offset
    for (int j = 0; j < frameSize; j++) {
        b2[(i * frameSize) + j] = b1[ind + j];
    }
}
Or you can change the jump to 4:
if((ind%4)>0) ind+=(4-(ind%4));
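More generally, instead of hard-coding a step of 2 or 4, the source index can be snapped back to the start of a frame using the frameSize already available in the question's code; a sketch:

for (int i = 0; i < (b2.length / frameSize); i++) {
    int ind = (int) (i * frameSize * playBackSpeed);
    ind -= ind % frameSize;      // snap back to the start of a frame
    for (int j = 0; j < frameSize; j++) {
        b2[(i * frameSize) + j] = b1[ind + j];
    }
}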

Fast parsing of strings of numbers in java

I have found plenty of different suggestions on how to parse an ASCII file containing double precision numbers into an array of doubles in Java. What I currently use is roughly the following:
FileInputStream stream = new FileInputStream(fname);
BufferedReader breader = new BufferedReader(new InputStreamReader(stream));
Scanner scanner = new java.util.Scanner(breader);
double[] array = new double[size]; // size is known upfront
int idx = 0;
try {
    while (idx < size) {
        array[idx] = scanner.nextDouble();
        idx++;
    }
} catch (Exception e) { /* ... */ }
For an example file with 1 million numbers this code takes roughly 2 seconds. Similar code written in C, using fscanf, takes 0.1 second (!) Clearly I got it all wrong. I guess calling nextDouble() so many times is the wrong way to go because of the overhead, but I cannot figure out a better way.
I am no Java expert and hence I need a little help with this: can you tell me how to improve this code?
Edit The corresponding C code follows
FILE *fd = fopen(fname, "r+");
double *vals = calloc(sizeof(double), size);
int idx = 0, nel;
do {
    nel = fscanf(fd, "%lf", vals + idx);
    idx++;
} while (nel != -1);
(Summarizing some of the things that I already mentioned in the comments:)
You should be careful with manual benchmarks. The answer to the question How do I write a correct micro-benchmark in Java? points out some of the basic caveats. However, this case is not so prone to the classical pitfalls. In fact, the opposite might be the case: When the benchmark solely consists of reading a file, then you are most likely not benchmarking the code, but mainly the hard disc. This involves the usual side effects of caching.
However, there obviously is an overhead beyond the pure file IO.
You should be aware that the Scanner class is very powerful and convenient. But internally, it is a beast consisting of large regular expressions and hides a tremendous complexity from the user - a complexity that is not necessary at all when your intention is to only read double values!
There are solutions with less overhead.
Unfortunately, the simplest solution is only applicable when the numbers in the input are separated by line separators. Then, reading this file into an array could be written as
double result[] =
Files.lines(Paths.get(fileName))
.mapToDouble(Double::parseDouble)
.toArray();
and this could even be rather fast. When there are multiple numbers in one line (as you mentioned in the comment), then this could be extended:
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
So regarding the general question of how to efficiently read a set of double values from a file, separated by whitespaces (but not necessarily separated by newlines), I wrote a small test.
This should not be considered as a real benchmark, and be taken with a grain of salt, but it at least tries to address some basic issues: It reads files with different sizes, multiple times, with different methods, so that for the later runs, the effects of hard disc caching should be the same for all methods:
Updated to generate sample data as described in the comment, and added the stream-based approach
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.StreamTokenizer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Random;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.stream.Stream;
public class ReadingFileWithDoubles
{
private static final int MIN_SIZE = 256000;
private static final int MAX_SIZE = 2048000;
public static void main(String[] args) throws IOException
{
generateFiles();
long before = 0;
long after = 0;
double result[] = null;
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
for (int i=0; i<10; i++)
{
before = System.nanoTime();
result = readWithScanner(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithScanner " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStreamTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStreamTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithBufferAndStringTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithBufferAndStringTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStream(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStream " +
(after - before) / 1e6 +
", result " + result);
}
}
}
private static double[] readWithScanner(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
Scanner scanner = new Scanner(br))
{
// Do this to avoid surprises on systems with a different locale!
scanner.useLocale(Locale.ENGLISH);
int idx = 0;
double array[] = new double[size];
while (idx < size)
{
array[idx] = scanner.nextDouble();
idx++;
}
return array;
}
}
private static double[] readWithStreamTokenizer(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StreamTokenizer st = new StreamTokenizer(br);
st.resetSyntax();
st.wordChars('0', '9');
st.wordChars('.', '.');
st.wordChars('-', '-');
st.wordChars('e', 'e');
st.wordChars('E', 'E');
double array[] = new double[size];
int index = 0;
boolean eof = false;
do
{
int token = st.nextToken();
switch (token)
{
case StreamTokenizer.TT_EOF:
eof = true;
break;
case StreamTokenizer.TT_WORD:
double d = Double.parseDouble(st.sval);
array[index++] = d;
break;
}
} while (!eof);
return array;
}
}
// This one is reading the whole file into memory, as a String,
// which may not be appropriate for large files
private static double[] readWithBufferAndStringTokenizer(
String fileName, int size) throws IOException
{
double array[] = new double[size];
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StringBuilder sb = new StringBuilder();
char buffer[] = new char[1024];
while (true)
{
int n = br.read(buffer);
if (n == -1)
{
break;
}
sb.append(buffer, 0, n);
}
int index = 0;
StringTokenizer st = new StringTokenizer(sb.toString());
while (st.hasMoreTokens())
{
array[index++] = Double.parseDouble(st.nextToken());
}
return array;
}
}
private static double[] readWithStream(
String fileName, int size) throws IOException
{
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
return result;
}
private static void generateFiles() throws IOException
{
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
if (!new File(fileName).exists())
{
System.out.println("Creating "+fileName);
writeDoubles(new FileOutputStream(fileName), n);
}
else
{
System.out.println("File "+fileName+" already exists");
}
}
}
private static void writeDoubles(OutputStream os, int n) throws IOException
{
OutputStreamWriter writer = new OutputStreamWriter(os);
Random random = new Random(0);
int numbersPerLine = random.nextInt(4) + 1;
for (int i=0; i<n; i++)
{
writer.write(String.valueOf(random.nextDouble()));
numbersPerLine--;
if (numbersPerLine == 0)
{
writer.write("\n");
numbersPerLine = random.nextInt(4) + 1;
}
else
{
writer.write(" ");
}
}
writer.close();
}
}
It compares 4 methods:
Reading with a Scanner, as in your original code snippet
Reading with a StreamTokenizer
Reading the whole file into a String, and dissecting it with a StringTokenizer
Reading the file as a Stream of lines, which are then flat-mapped to a Stream of tokens, which are then mapped to a DoubleStream
Reading the file as one large String may not be appropriate in all cases: When the files become (much) larger, then keeping the whole file in memory as a String may not be a viable solution.
A test run (on a rather old PC, with a slow hard disc drive (no solid state)) showed roughly these results:
...
size = 1024000, readWithScanner 9932.940919, result [D@1c7353a
size = 1024000, readWithStreamTokenizer 1187.051427, result [D@1a9515
size = 1024000, readWithBufferAndStringTokenizer 1172.235019, result [D@f49f1c
size = 1024000, readWithStream 2197.785473, result [D@1469ea2
...
Obviously, the scanner imposes a considerable overhead that may be avoided when reading more directly from the stream.
This may not be the final answer, as there may be more efficient and/or more elegant solutions (and I'm looking forward to see them!), but maybe it is helpful at least.
EDIT
A small remark: There is a certain conceptual difference between the approaches in general. Roughly speaking, the difference lies in who determines the number of elements that are read. In pseudocode, this difference is
double array[] = new double[size];
for (int i=0; i<size; i++)
{
array[i] = readDoubleFromInput();
}
versus
double array[] = new double[size];
int index = 0;
while (thereAreStillNumbersInTheInput())
{
double d = readDoubleFromInput();
array[index++] = d;
}
Your original approach with the scanner was written like the first one, while the solutions that I proposed are more similar to the second. But this should not make a large difference here, assuming that the size is indeed the real size, and potential errors (like too few or too many numbers in the input) don't appear or are handled in some other way.
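For completeness, a sketch of the second shape with such a check added (still pseudocode, using the same placeholder methods as above):

double array[] = new double[size];
int index = 0;
while (thereAreStillNumbersInTheInput()) {
    double d = readDoubleFromInput();
    if (index >= size) {
        throw new IllegalStateException("More numbers in the input than expected (" + size + ")");
    }
    array[index++] = d;
}
if (index < size) {
    throw new IllegalStateException("Expected " + size + " numbers but found only " + index);
}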

Splitting a .gz file into specified file sizes in Java

This is my first post, so I'm not sure how apt my description of the issue is.
Below is a program I have written to split a .gz file into files of whatever size the user wants. The parent .gz file is getting split, but not into the size specified in the code.
For example, in main I have said I want the parent file to be split into files of size 1 MB. But on executing the code, it gets split into a number of files of different sizes. Can someone help me pinpoint where I am going wrong? Any help would be great, as I have run out of ideas.
package com.bitsighttech.collection.packaging;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.log4j.Logger;
public class FileSplitter
{
private static Logger logger = Logger.getLogger(FileSplitter.class);
private static final long KB = 1024;
private static final long MB = KB * KB;
public List<File> split(File inputFile, String splitSize)
{
int expectedNoOfFiles =0;
List<File> splitFileList = new ArrayList<File>();
try
{
double parentFileSizeInB = inputFile.length();
Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
Matcher m = p.matcher(splitSize);
m.matches();
String FileSizeString = m.group(1);
System.out.println("FileSizeString----------------------"+FileSizeString);
String unit = m.group(2);
double fileSizeInMB = 0;
try {
if (unit.toLowerCase().equals("kb"))
fileSizeInMB = Double.parseDouble(FileSizeString) / KB;
else if (unit.toLowerCase().equals("mb"))
fileSizeInMB = Double.parseDouble(FileSizeString);
else if (unit.toLowerCase().equals("gb"))
fileSizeInMB = Double.parseDouble(FileSizeString) * KB;
}
catch (NumberFormatException e) {
logger.error("invalid number [" + fileSizeInMB + "] for expected file size");
}
System.out.println("fileSizeInMB----------------------"+fileSizeInMB);
double fileSize = fileSizeInMB * MB;
long fileSizeInByte = (long) Math.ceil(fileSize);
double noOFFiles = parentFileSizeInB/fileSizeInByte;
expectedNoOfFiles = (int) Math.ceil(noOFFiles);
System.out.println("0000000000000000000000000"+expectedNoOfFiles);
GZIPInputStream in = new GZIPInputStream(new FileInputStream(inputFile));
DataInputStream datain = new DataInputStream(in);
BufferedReader fis = new BufferedReader(new InputStreamReader(datain));
int count= 0 ;
int splinterCount = 1;
GZIPOutputStream outputFileWriter = null;
while ((count = fis.read()) != -1)
{
System.out.println("count----------------------1 "+count);
int outputFileLength = 0;
outputFileWriter = new GZIPOutputStream(new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles + ".gz"));
while ( (count = fis.read()) != -1
&& outputFileLength < fileSizeInByte
) {
outputFileWriter.write(count);
outputFileLength ++;
count = fis.read();
}
System.out.println("count----------------------2 "+count);
//outputFileWriter.finish();
outputFileWriter.close();
splinterCount ++;
}
fis.close();
datain.close();
in.close();
outputFileWriter.close();
System.out.println("Finished");
}catch(Exception e)
{
logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
return null;
}
logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
return splitFileList;
}
public static void main(String args[])
{
String filePath1 = "F:\\filename.gz";
File file = new File(filePath1);
FileSplitter fileSplitter = new FileSplitter();
String splitlen = "1 MB";
int noOfFilesSplit = 3;
fileSplitter.split(file, splitlen);
}
}
Andreas' answer covers your main question, but there are a lot of problems in that code. Most importantly, you're throwing out one byte for each 'split' (the outer while calls fis.read() and ignores the value).
Why are you wrapping your gzip input stream in a DataInputStream and a BufferedReader if you're still reading it a byte at a time?
Edit
Ah, and you're also throwing out the last byte of each split, too (except for the very last one).
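A loop shape that calls read() exactly once per iteration avoids both problems. Roughly, as a sketch that keeps the question's variable names but reads the GZIPInputStream directly (since raw bytes are being copied, the DataInputStream/BufferedReader wrappers are dropped):

GZIPInputStream in = new GZIPInputStream(new FileInputStream(inputFile));
int count = in.read();                        // read ahead exactly once
int splinterCount = 1;
while (count != -1) {
    GZIPOutputStream out = new GZIPOutputStream(
        new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles + ".gz"));
    long outputFileLength = 0;
    while (count != -1 && outputFileLength < fileSizeInByte) {
        out.write(count);                     // write the byte we already have in hand
        outputFileLength++;
        count = in.read();                    // then read exactly one more
    }
    out.close();
    splinterCount++;
}
in.close();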
Hard to tell, but it looks to me like you're counting the uncompressed bytes. The compressed chunks (resulting files) will be smaller.
When you compress data with gzip the output file size depends on the complexity of data. Here you are compressing equally sized blocks, but their compressed sizes are different. No lossless compression algorithm reduces the size of input by a constant factor.
If you want splinters of equal size you should split the compressed data instead of decompressing first. But that of course means that the splinters have to be decompressed in order and you can't decompress one without reading the ones that precede it.
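A sketch of that alternative, splitting the already-compressed bytes into fixed-size parts without decompressing (class name, file naming, and buffer size are illustrative; each part is not a valid .gz file on its own, so the parts only become useful again once concatenated back in order):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class RawSplitter {
    // Splits the compressed file into parts of at most partSize bytes, without decompressing.
    // Concatenating the parts in order restores the original .gz file exactly.
    public static int split(String inputFile, String outputPrefix, long partSize) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        int part = 0;
        try (FileInputStream in = new FileInputStream(inputFile)) {
            boolean eof = false;
            while (!eof) {
                long written = 0;
                FileOutputStream out = null;
                try {
                    while (written < partSize) {
                        int want = (int) Math.min(buffer.length, partSize - written);
                        int read = in.read(buffer, 0, want);
                        if (read < 0) {
                            eof = true;
                            break;
                        }
                        if (out == null) {
                            // open the part lazily so no empty file is created at the end
                            part++;
                            out = new FileOutputStream(outputPrefix + ".part" + part);
                        }
                        out.write(buffer, 0, read);
                        written += read;
                    }
                } finally {
                    if (out != null) out.close();
                }
            }
        }
        return part;
    }
}

For the question's input, something like split("F:\\filename.gz", "F:\\ff\\filename", 1024 * 1024) would produce 1 MB parts.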
