Storing a large binary file

Storing a large binary file - java

Are there any ways to store a large binary file like 50 MB in the ten files with 5 MB?
thanks
are there any special classes for doing this?

Use a FileInputStream to read the file and a FileOutputStream to write it.
Here a simple (incomplete) example (missing error handling, writes 1K chunks)
public static int split(File file, String name, int size) throws IOException {
FileInputStream input = new FileInputStream(file);
FileOutputStream output = null;
byte[] buffer = new byte[1024];
int count = 0;
boolean done = false;
while (!done) {
output = new FileOutputStream(String.format(name, count));
count += 1;
for (int written = 0; written < size; ) {
int len = input.read(buffer);
if (len == -1) {
done = true;
break;
}
output.write(buffer, 0, len);
written += len;
}
output.close();
}
input.close();
return count;
}
and called like
File input = new File("C:/data/in.gz");
String name = "C:/data/in.gz.part%02d"; // %02d will be replaced by segment number
split(input, name, 5000 * 1024));

Yes, there are. Basically just count the bytes which you write to file and if it hits a certain limit, then stop writing, reset the counter and continue writing to another file using a certain filename pattern so that you can correlate the files with each other. You can do that in a loop. You can learn here how to write to files in Java and for the remnant just apply the primary school maths.

Related

How to read a File character-by-character in reverse without running out-of-memory?

The Story
I've been having a problem lately...
I have to read a file in reverse character by character without running out of memory.
I can't read it line-by-line and reverse it with StringBuilder because it's a one-line file that takes up to a gig (GB) of I/O space.
Hence it would take up too much of the JVM's (and the System's) Memory.
I've decided to just read it character by character from end-to-start (back-to-front) so that I could process as much as I can without consuming too much memory.
What I've Tried
I know how to read a file in one go:
(MappedByteBuffer+FileChannel+Charset which gave me OutOfMemoryExceptions)
and read a file character-by-character with UTF-8 character support
(FileInputStream+InputStreamReader).
The problem is that FileInputStream's #read() only calls #read0() which is a native method!
Because of that I have no idea about the underlying code...
Which is why I'm here today (or at least until this is done)!

This will do it (but as written it is not very efficient).
just skip to the last location read less one and read and print the character.
then reset the location to the mark, adjust size and continue.
File f = new File("Some File name");
int size = (int) f.length();
int bsize = 1;
byte[] buf = new byte[bsize];
try (BufferedInputStream b =
new BufferedInputStream(new FileInputStream(f))) {
while (size > 0) {
b.mark(size);
b.skip(size - bsize);
int k = b.read(buf);
System.out.print((char) buf[0]);
size -= k;
b.reset();
}
} catch (IOException ioe) {
ioe.printStackTrace();
}
This could be improved by increasing the buffer size and making equivalent adjustments in the mark and skip arguments.
Updated Version
I wasn't fully satisfied with my answer so I made it more general. Some variables could have served double duty but using meaningful names helps clarify how they are used.
Mark must be used so reset can be used. However, it only needs to be set once and is set to position 0 outside of the main loop. I do not know if marking closer to the read point is more efficient or not.
skipCnt - initally set to fileLength it is the number of bytes to skip before reading. If the number of bytes remaining is greater than the buffer size, then the skip count will be skipCnt - bsize. Else it will be 0.
remainingBytes - a running total of how many bytes are still to be read. It is updated by subtracting the current readCnt.
readCnt - how many bytes to read. If remainingBytes is greater than bsize then set to bsize, else set to remainingBytes
The while loop continuously reads the file starting near the end and then prints the just read information in reverse order. All variables are updated and the process repeats until the remainingBytes reaches 0.
File f = new File("some file");
int bsize = 16;
int fileSize = (int)f.length();
int remainingBytes = fileSize;
int skipCnt = fileSize;
byte[] buf = new byte[bsize];
try (BufferedInputStream b =
new BufferedInputStream(new FileInputStream(f))) {
b.mark(0);
while(remainingBytes > 0) {
skipCnt = skipCnt > bsize ? skipCnt - bsize : 0;
b.skip(skipCnt);
int readCnt = remainingBytes > bsize ? bsize : remainingBytes;
b.read(buf,0,readCnt);
for (int i = readCnt-1; i >= 0; i--) {
System.out.print((char) buf[i]);
}
remainingBytes -= readCnt;
b.reset();
}
} catch (IOException ioe) {
ioe.printStackTrace();
}

This doesn't support multi byte UTF-8 characters
Using a RandomAccessFile you can easily read a file in chunks from the end to the beginning, and reverse each of the chunks.
Here's a simple example:
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.stream.IntStream;
class Test {
private static final int BUF_SIZE = 10;
private static final int FILE_LINE_COUNT = 105;
public static void main(String[] args) throws Exception {
// create a large file
try (FileWriter fw = new FileWriter("largeFile.txt")) {
IntStream.range(1, FILE_LINE_COUNT).mapToObj(Integer::toString).forEach(s -> {
try {
fw.write(s + "\n");
} catch (IOException e) {
throw new RuntimeException(e);
}
});
}
// reverse the file
try (RandomAccessFile raf = new RandomAccessFile("largeFile.txt", "r")) {
long size = raf.length();
byte[] buf = new byte[BUF_SIZE];
for (long i = size - BUF_SIZE; i > -BUF_SIZE; i -= BUF_SIZE) {
long offset = Math.max(0, i);
long readSize = Math.min(i + BUF_SIZE, BUF_SIZE);
raf.seek(offset);
raf.read(buf, 0, (int) readSize);
for (int j = (int) readSize - 1; j >= 0; j--) {
System.out.print((char) buf[j]);
}
}
}
}
}
This uses a very small file and very small chunks so that you can test it easily. Increase those constants to see it work on a larger scale.
The input file contains newlines to make it easy to read the output, but the reversal doesn't depend on the file "having lines".

How to get the new content from the file

Scenario:
1.Create fromX.txt and toY.txt file (content has to be appended and will come from another logic)
2.check every second fromX.txt file for new addition if yes write it to toY.txt
how to get the just new content fromX.txt file?
I have tried implementing it by counting number of lines and looking for any change in it.
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}

You implement it like this:
Open the using RandomAccessFile
Seek to where the end-of-file was last time. (If this is the first time, seek to the start of the file.)
Read until you reach the new end-of-file.
Record where the end-of-file is.
Close the RandomAccessFile
Record the position as a byte offset from the start of the file, and use the same value for seeking.
You can modify the above to reuse the RandomAccessFile object rather than opening / closing it each time.
UPDATE - The javadocs for RandomAccessFile are here. Look for the seek and getFilePointer methods.

How to read bytes from a file, whereas the result byte[] is exactly as long

I want the result byte[] to be exactly as long as the file content. How to achieve that.
I am thinking of ArrayList<Byte>, but it doe not seem to be efficient.

Personally I'd go the Guava route:
File f = ...
byte[] content = Files.toByteArray(f);
Apache Commons IO has similar utility methods if you want.
If that's not what you want, it's not too hard to write that code yourself:
public static byte[] toByteArray(File f) throws IOException {
if (f.length() > Integer.MAX_VALUE) {
throw new IllegalArgumentException(f + " is too large!");
}
int length = (int) f.length();
byte[] content = new byte[length];
int off = 0;
int read = 0;
InputStream in = new FileInputStream(f);
try {
while (read != -1 && off < length) {
read = in.read(content, off, (length - off));
off += read;
}
if (off != length) {
// file size has shrunken since check, handle appropriately
} else if (in.read() != -1) {
// file size has grown since check, handle appropriately
}
return content;
} finally {
in.close();
}
}

I'm pretty sure File#length() doesn't iterate through the file. (Assuming this is what you meant by length()) Each OS provides efficient enough mechanisms to find file size without reading it all.

Allocate an adequate buffer (if necessary, resize it while reading) and keep track of how many bytes read. After finishing reading, create a new array with the exact length and copy the content of the reading buffer.

Small function that you can use :
// Returns the contents of the file in a byte array.
public static byte[] getBytesFromFile(File file) throws IOException {
InputStream is = new FileInputStream(file);
// Get the size of the file
long length = file.length();
// You cannot create an array using a long type.
// It needs to be an int type.
// Before converting to an int type, check
// to ensure that file is not larger than Integer.MAX_VALUE.
if (length > Integer.MAX_VALUE) {
throw new RuntimeException(file.getName() + " is too large");
}
// Create the byte array to hold the data
byte[] bytes = new byte[(int)length];
// Read in the bytes
int offset = 0;
int numRead = 0;
while (offset < bytes.length
&& (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) {
offset += numRead;
}
// Ensure all the bytes have been read in
if (offset < bytes.length) {
throw new IOException("Could not completely read file "+file.getName());
}
// Close the input stream and return bytes
is.close();
return bytes;
}

extracting contents of ZipFile entries when read from byte[] (Java)

I have a zip file whose contents are presented as byte[] but the original file object is not accessible. I want to read the contents of each of the entries. I am able to create a ZipInputStream from a ByteArrayInputStream of the bytes and can read the entries and their names. However I cannot see an easy way to extract the contents of each entry.
(I have looked at Apache Commons but cannot see an easy way there either).
UPDATE #Rich's code seems to solve the problem, thanks
QUERY why do both examples have a multiplier of * 4 (128/512 and 1024*4) ?

If you want to process nested zip entries from a stream, see this answer for ideas. Because the inner entries are listed sequentially they can be processed by getting the size of each entry and reading that many bytes from the stream.
Updated with an example that copies each entry to Standard out:
ZipInputStream is;//obtained earlier
ZipEntry entry = is.getNextEntry();
while(entry != null) {
copyStream(is, out, entry);
entry = is.getNextEntry();
}
...
private static void copyStream(InputStream in, OutputStream out,
ZipEntry entry) throws IOException {
byte[] buffer = new byte[1024 * 4];
long count = 0;
int n = 0;
long size = entry.getSize();
while (-1 != (n = in.read(buffer)) && count < size) {
out.write(buffer, 0, n);
count += n;
}
}

It actually uses the ZipInputStream as the InputStream (but don't close it at the end of each entry).

It's a little bit tricky to calculate the start of next ZipEntry. Please see this example included in JDK 6,
public static void main(String[] args) {
try {
ZipInputStream is = new ZipInputStream(System.in);
ZipEntry ze;
byte[] buf = new byte[128];
int len;
while ((ze = is.getNextEntry()) != null) {
System.out.println("----------- " + ze);
// Determine the number of bytes to skip and skip them.
int skip = (int)ze.getSize() - 128;
while (skip > 0) {
skip -= is.skip(Math.min(skip, 512));
}
// Read the remaining bytes and if it's printable, print them.
out: while ((len = is.read(buf)) >= 0) {
for (int i=0; i<len; i++) {
if ((buf[i]&0xFF) >= 0x80) {
System.out.println("**** UNPRINTABLE ****");
// This isn't really necessary since getNextEntry()
// automatically calls it.
is.closeEntry();
// Get the next zip entry.
break out;
}
}
System.out.write(buf, 0, len);
}
}
is.close();
} catch (Exception e) {
e.printStackTrace();
}
}

How can I get the count of line in a file in an efficient way? [duplicate]

This question already has answers here:
Number of lines in a file in Java
(19 answers)
Closed 6 years ago.
I have a big file. It includes approximately 3.000-20.000 lines. How can I get the total count of lines in the file using Java?

BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
int lines = 0;
while (reader.readLine() != null) lines++;
reader.close();
Update: To answer the performance-question raised here, I made a measurement. First thing: 20.000 lines are too few, to get the program running for a noticeable time. I created a text-file with 5 million lines. This solution (started with java without parameters like -server or -XX-options) needed around 11 seconds on my box. The same with wc -l (UNIX command-line-tool to count lines), 11 seconds. The solution reading every single character and looking for '\n' needed 104 seconds, 9-10 times as much.

Files.lines
Java 8+ has a nice and short way using NIO using Files.lines. Note that you have to close the stream using try-with-resources:
long lineCount;
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
lineCount = stream.count();
}
If you don't specify the character encoding, the default one used is UTF-8. You may specify an alternate encoding to match your particular data file as shown in the example above.

use LineNumberReader
something like
public static int countLines(File aFile) throws IOException {
LineNumberReader reader = null;
try {
reader = new LineNumberReader(new FileReader(aFile));
while ((reader.readLine()) != null);
return reader.getLineNumber();
} catch (Exception ex) {
return -1;
} finally {
if(reader != null)
reader.close();
}
}

I found some solution for this, it might useful for you
Below is the code snippet for, count the no.of lines from the file.
File file = new File("/mnt/sdcard/abc.txt");
LineNumberReader lineNumberReader = new LineNumberReader(new FileReader(file));
lineNumberReader.skip(Long.MAX_VALUE);
int lines = lineNumberReader.getLineNumber();
lineNumberReader.close();

Read the file through and count the number of newline characters. An easy way to read a file in Java, one line at a time, is the java.util.Scanner class.

This is about as efficient as it can get, buffered binary read, no string conversion,
FileInputStream stream = new FileInputStream("/tmp/test.txt");
byte[] buffer = new byte[8192];
int count = 0;
int n;
while ((n = stream.read(buffer)) > 0) {
for (int i = 0; i < n; i++) {
if (buffer[i] == '\n') count++;
}
}
stream.close();
System.out.println("Number of lines: " + count);

Do You need exact number of lines or only its approximation? I happen to process large files in parallel and often I don't need to know exact count of lines - I then revert to sampling. Split the file into ten 1MB chunks and count lines in each chunk, then multiply it by 10 and You'll receive pretty good approximation of line count.

All previous answers suggest to read though the whole file and count the amount of newlines you find while doing this. You commented some as "not effective" but thats the only way you can do that. A "line" is nothing else as a simple character inside the file. And to count that character you must have a look at every single character within the file.
I'm sorry, but you have no choice. :-)

This solution is about 3.6× faster than the top rated answer when tested on a file with 13.8 million lines. It simply reads the bytes into a buffer and counts the \n characters. You could play with the buffer size, but on my machine, anything above 8KB didn't make the code faster.
private int countLines(File file) throws IOException {
int lines = 0;
FileInputStream fis = new FileInputStream(file);
byte[] buffer = new byte[BUFFER_SIZE]; // BUFFER_SIZE = 8 * 1024
int read;
while ((read = fis.read(buffer)) != -1) {
for (int i = 0; i < read; i++) {
if (buffer[i] == '\n') lines++;
}
}
fis.close();
return lines;
}

If the already posted answers aren't fast enough you'll probably have to look for a solution specific to your particular problem.
For example if these text files are logs that are only appended to and you regularly need to know the number of lines in them you could create an index. This index would contain the number of lines in the file, when the file was last modified and how large the file was then. This would allow you to recalculate the number of lines in the file by skipping over all the lines you had already seen and just reading the new lines.

Old post, but I have a solution that could be usefull for next people.
Why not just use file length to know what is the progression? Of course, lines has to be almost the same size, but it works very well for big files:
public static void main(String[] args) throws IOException {
File file = new File("yourfilehere");
double fileSize = file.length();
System.out.println("=======> File size = " + fileSize);
InputStream inputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "iso-8859-1");
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
int totalRead = 0;
try {
while (bufferedReader.ready()) {
String line = bufferedReader.readLine();
// LINE PROCESSING HERE
totalRead += line.length() + 1; // we add +1 byte for the newline char.
System.out.println("Progress ===> " + ((totalRead / fileSize) * 100) + " %");
}
} finally {
bufferedReader.close();
}
}
It allows to see the progression without doing any full read on the file. I know it depends on lot of elements, but I hope it will be usefull :).
[Edition]
Here is a version with estimated time. I put some SYSO to show progress and estimation. I see that you have a good time estimation errors after you have treated enough line (I try with 10M lines, and after 1% of the treatment, the time estimation was exact at 95%).
I know, some values has to be set in variable. This code is quickly written but has be usefull for me. Hope it will be for you too :).
long startProcessLine = System.currentTimeMillis();
int totalRead = 0;
long progressTime = 0;
double percent = 0;
int i = 0;
int j = 0;
int fullEstimation = 0;
try {
while (bufferedReader.ready()) {
String line = bufferedReader.readLine();
totalRead += line.length() + 1;
progressTime = System.currentTimeMillis() - startProcessLine;
percent = (double) totalRead / fileSize * 100;
if ((percent > 1) && i % 10000 == 0) {
int estimation = (int) ((progressTime / percent) * (100 - percent));
fullEstimation += progressTime + estimation;
j++;
System.out.print("Progress ===> " + percent + " %");
System.out.print(" - current progress : " + (progressTime) + " milliseconds");
System.out.print(" - Will be finished in ===> " + estimation + " milliseconds");
System.out.println(" - estimated full time => " + (progressTime + estimation));
}
i++;
}
} finally {
bufferedReader.close();
}
System.out.println("Ended in " + (progressTime) + " seconds");
System.out.println("Estimative average ===> " + (fullEstimation / j));
System.out.println("Difference: " + ((((double) 100 / (double) progressTime)) * (progressTime - (fullEstimation / j))) + "%");
Feel free to improve this code if you think it's a good solution.

Quick and dirty, but it does the job:
import java.io.*;
public class Counter {
public final static void main(String[] args) throws IOException {
if (args.length > 0) {
File file = new File(args[0]);
System.out.println(countLines(file));
}
}
public final static int countLines(File file) throws IOException {
ProcessBuilder builder = new ProcessBuilder("wc", "-l", file.getAbsolutePath());
Process process = builder.start();
InputStream in = process.getInputStream();
LineNumberReader reader = new LineNumberReader(new InputStreamReader(in));
String line = reader.readLine();
if (line != null) {
return Integer.parseInt(line.trim().split(" ")[0]);
} else {
return -1;
}
}
}

Read the file line by line and increment a counter for each line until you have read the entire file.

Try the unix "wc" command. I don't mean use it, I mean download the source and see how they do it. It's probably in c, but you can easily port the behavior to java. The problem with making your own is to account for the ending cr/lf problem.

The buffered reader is overkill
Reader r = new FileReader("f.txt");
int count = 0;
int nextchar = 0;
while (nextchar != -1){
nextchar = r.read();
if (nextchar == Character.getNumericValue('\n') ){
count++;
}
}
My search for a simple example has createde one thats actually quite poor. calling read() repeadedly for a single character is less than optimal. see here for examples and measurements.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Storing a large binary file - java

Are there any ways to store a large binary file like 50 MB in the ten files with 5 MB? thanks are there any special classes for doing this?

Related

How to read a File character-by-character in reverse without running out-of-memory?

How to get the new content from the file

How to read bytes from a file, whereas the result byte[] is exactly as long

extracting contents of ZipFile entries when read from byte[] (Java)

How can I get the count of line in a file in an efficient way? [duplicate]

Categories

Resources