File splitting loss of data - java

I wrote a program for splitting and joining files. When I break a file into small pieces, I find that the combined size of the pieces is not equal to the size of the original; approximately 30-50 bytes of data are lost, and the rejoined file doesn't run correctly.
e.g. a file ABC has been broken into 2 parts, ABC1 and ABC2, but the problem is that
sizeof(ABC) is not equal to sizeof(ABC1) + sizeof(ABC2). By sizeof(ABC) I mean the size Windows reports, i.e. in the file's Properties dialog.
My code is:
for (int i = 0; i < no_of_parts; i++)
{
    copied_data = 0; // counts the number of bytes written to the current part
    fos = new FileOutputStream(jTextField2.getText() + "\\part" + i); // e.g. <dir>\part0, <dir>\part1, ...
    bouts = new BufferedOutputStream(fos);
    while ((b = bins.read()) != -1)
    {
        bouts.write(b);
        copied_data++;
        if (copied_data == each_part_size_in_byte)
            break;
    }
}

What about closing your output stream? It will flush the buffer and free the file descriptor you use. Call bouts.close().
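A minimal sketch of what that fix could look like, reusing the variable names from your question (bins, no_of_parts, each_part_size_in_byte and the jTextField2 path are assumed to be set up as in your code); try-with-resources flushes and closes each part's stream automatically, even if an exception is thrown:

for (int i = 0; i < no_of_parts; i++) {
    // try-with-resources closes (and therefore flushes) the stream for each part,
    // so the bytes still sitting in the BufferedOutputStream are not lost
    try (BufferedOutputStream bouts = new BufferedOutputStream(
            new FileOutputStream(jTextField2.getText() + "\\part" + i))) {
        int copied_data = 0; // bytes written to the current part
        int b;
        while ((b = bins.read()) != -1) {
            bouts.write(b);
            copied_data++;
            if (copied_data == each_part_size_in_byte) {
                break;
            }
        }
    }
}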

When you create a file, space for it is allocated on disk in fixed-size blocks (clusters) rather than byte by byte. So when you divide the file in two, each part occupies a whole number of blocks, which may be more than the actual number of bytes written to it. This is why the Windows Properties dialog distinguishes "Size" (the exact byte count) from "Size on disk" (the space taken by the allocated blocks); compare the former when checking your split.
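Either way, it is easier to compare the exact byte counts programmatically than to read them off the Properties dialog. A small check along these lines (the file paths are placeholders for your original file and its parts) tells you exactly how many bytes went missing:

import java.io.File;

public class SizeCheck {
    public static void main(String[] args) {
        // placeholder paths; substitute the real original file and its parts
        File original = new File("ABC");
        File part0 = new File("ABC.part0");
        File part1 = new File("ABC.part1");

        long originalSize = original.length();                 // exact size in bytes
        long combinedSize = part0.length() + part1.length();

        System.out.println("original: " + originalSize
                + " bytes, parts: " + combinedSize
                + " bytes, missing: " + (originalSize - combinedSize));
    }
}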

Related

Reading large file in bytes by chunks with dynamic buffer size

I'm trying to read a large file by chunks and save them in an ArrayList of bytes.
My code, in short, looks like this:
public ArrayList<byte[]> packets = new ArrayList<>();

FileInputStream fis = new FileInputStream("random_text.txt");
byte[] buffer = new byte[512];
while (fis.read(buffer) > 0) {
    packets.add(buffer);
}
fis.close();
But it has a behavior that I don't know how to solve. For example, if a file contains only the words "hello world", that chunk does not need to be 512 bytes long. In fact, I want each chunk to be at most 512 bytes, not for every chunk to be exactly that size.
First of all, what you are doing is probably a bad idea. Storing a file's contents in memory like this is liable to be a waste of heap space ... and can lead to OutOfMemoryError exceptions and / or a requirement for an excessively large heap if you process large (enough) input files.
The second problem is that your code is wrong. You are repeatedly reading the data into the same byte array. Each time you do, it overwrites what was there before. So you will end up with a list containing lots of references to a single byte array ... containing just the last chunk of data that you read.
To solve the problem that you asked about¹, you will need to copy the chunk that you read into a new (smaller) byte array.
Something like this:
public ArrayList<byte[]> packets = new ArrayList<>();

try (FileInputStream fis = new FileInputStream("random_text.txt")) {
    byte[] buffer = new byte[512];
    int len;
    while ((len = fis.read(buffer)) > 0) {
        packets.add(Arrays.copyOf(buffer, len));   // copy only the bytes actually read
    }
}
Note that this also deals with the second problem I mentioned, and fixes a potential resource leak by using try-with-resources syntax to manage the closure of the input stream.
A final issue: If this is really a text file that you are reading, you probably should be using a Reader to read it, and char[] or String to hold it.
But even if you do that there are some awkward edge cases if your text contains Unicode codepoints that are not in code plane 0. For example, emojis. The edge cases will involve code points that are represented as a surrogate pair AND the pair being split on a chunk boundary. Reading and storing the text as lines would avoid that.
¹ - The issue here is not the "wasted" space. Unless you are reading and caching a large number of small files, any space wastage due to "short" chunks will be unimportant. The important issue is knowing which bytes in each byte[] are actually valid data.
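For completeness, here is a minimal sketch of the line-based alternative (the file name is the same placeholder used in the question). Reading whole lines through a Reader sidesteps the surrogate-pair problem, because a line is never split in the middle of a code point:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LineChunks {
    public static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);   // each element is one complete line of text
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readLines("random_text.txt").size() + " lines read");
    }
}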

Is Sequence input stream faster than file input stream

I want to know which one is faster. Sequence input stream or file input stream.
Here is my sample program
FileInputStream fileInput = new FileInputStream("C:\\Eclipse\\File Output Stream.txt");
FileInputStream fileInput1 = new FileInputStream("C:\\Eclipse\\Buffer Output Stream.txt");
SequenceInputStream sequence = new SequenceInputStream(fileInput, fileInput1);
FileOutputStream outputSequenceStream = new FileOutputStream("C:\\Eclipse\\File OutputSequence Stream.txt");

int i = 0;
byte[] b = new byte[10];
long start = System.currentTimeMillis();
//System.out.println(start);
while ((i = sequence.read()) != -1) {
    //outputSequenceStream.write(i);
    System.out.println(Integer.toBinaryString(i) + " " + i + " " + (char) i);
}
System.out.println(System.currentTimeMillis() - start);
System.out.println("success");

System.out.println("Reading one file after another using file input");
FileInputStream fileInput2 = new FileInputStream("C:\\Eclipse\\File Output Stream.txt");
FileInputStream fileInput3 = new FileInputStream("C:\\Eclipse\\Buffer Output Stream.txt");
start = System.currentTimeMillis();

/* Reading first file */
while ((i = fileInput2.read()) != -1) {
    System.out.println((char) i);
}

/* Reading second file */
while ((i = fileInput3.read()) != -1) {
    System.out.println((char) i);
}
System.out.println(System.currentTimeMillis() - start);
System.out.println("Success");
The plain FileInputStream version gives me a smaller number than the SequenceInputStream version. So does that mean SequenceInputStream is slower than FileInputStream? If so, why do we use a SequenceInputStream at all; wouldn't it be better to just use FileInputStream?
The javadoc is pretty clear about the purpose of that class:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
It is nothing but an abstraction that allows you to easily "concatenate" multiple input sources.
It shouldn't affect performance at all. In that sense, the "real" answer here is to learn how to properly benchmark Java code. See here for starters.
On top of that, you are also forgetting about the operating system. In order to really measure IO performance, you should be using different files (to avoid the OS reading things into memory first, and all subsequent reads coming from memory!). You would also have to use files with maybe 100 MB of data, not a handful of bytes.
In other words: your numbers are simply meaningless, it is therefore not possible to draw any conclusions from them.
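To illustrate what the class is actually for (concatenation, not speed), here is a minimal sketch that joins two files into one; the file names are placeholders, and the streams are wrapped in BufferedInputStream / BufferedOutputStream, because the unbuffered byte-at-a-time read() calls are what really dominate the timings above:

import java.io.*;

public class ConcatFiles {
    public static void main(String[] args) throws IOException {
        // placeholder paths; reuse whatever files you are testing with
        try (InputStream in = new BufferedInputStream(new SequenceInputStream(
                     new FileInputStream("part1.txt"),
                     new FileInputStream("part2.txt")));
             OutputStream out = new BufferedOutputStream(
                     new FileOutputStream("combined.txt"))) {
            int b;
            while ((b = in.read()) != -1) {   // reads part1.txt to its end, then part2.txt
                out.write(b);
            }
        }
    }
}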

How to speed up reading of a large OBJ (text) file?

I am using an OBJ Loader library that I found on the 'net and I want to look into speeding it up.
It works by reading an .OBJ file line by line (its basically a text file with lines of numbers)
I have a 12 MB OBJ file that equates to approximately 400,000 lines. Suffice to say, it takes forever to read it line by line.
Is there a way to speed it up? It uses a BufferedReader to read the file (which is stored in my assets folder)
Here is the link to the library: click me
Just an idea: you could first get the size of the file using the File class, after getting the file:
File file = new File(sdcard,"sample.txt");
long size = file.length();
The size returned is in bytes. Divide the file size into a sizable number of chunks, e.g. 5, 10, 20, etc., recording the byte range for each chunk, then (a rough sketch follows below):
1. Create a byte array the same size as each chunk created above.
2. "Assign" each chunk to a separate worker thread, which reads its chunk into its corresponding array using the read(buffer, offset, length) method, i.e. read "length" bytes of its chunk into the array "buffer" starting at array index "offset".
3. Convert the bytes into characters.
4. Concatenate all the arrays to get the final, complete file contents.
Insert checks on the chunk boundaries so that no thread overlaps its neighbours. Again, this is just an idea; hopefully it will work when actually implemented.
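A rough, hedged sketch of that idea follows. Every class and variable name here is my own (nothing comes from the linked library); it uses FileChannel.read(ByteBuffer, position) so each worker thread can read its own byte range independently, and it assumes the file is plain ASCII (true for OBJ), so splitting on arbitrary byte boundaries is safe:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ChunkedReader {

    // Reads the whole file as text using 'threads' workers, each handling one byte range.
    public static String readInChunks(String path, int threads)
            throws IOException, InterruptedException, ExecutionException {
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long size = channel.size();
            long chunk = (size + threads - 1) / threads;           // ceiling division
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<byte[]>> parts = new ArrayList<>();

            for (int t = 0; t < threads; t++) {
                final long start = t * chunk;
                final int length = (int) Math.min(chunk, size - start);
                if (length <= 0) break;                            // file smaller than thread count
                parts.add(pool.submit(() -> {
                    ByteBuffer buffer = ByteBuffer.allocate(length);
                    long pos = start;
                    while (buffer.hasRemaining()) {
                        int read = channel.read(buffer, pos);      // positional read, no shared cursor
                        if (read < 0) break;
                        pos += read;
                    }
                    return buffer.array();
                }));
            }
            pool.shutdown();

            StringBuilder sb = new StringBuilder((int) size);
            for (Future<byte[]> part : parts) {
                sb.append(new String(part.get(), StandardCharsets.US_ASCII));
            }
            return sb.toString();
        }
    }
}

Whether this actually beats a single pass with a large BufferedReader depends heavily on the storage device; on a single spinning disk the extra seeking can cancel out the gain, so it is worth measuring both.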

Reading big files and performing some operations in java

First of all I would try to explain what I need to do.
I need to read a file (whose size could be anywhere from 1 byte to 2 GB); 2 GB is the maximum because I use a MappedByteBuffer for fast reading. Maybe later I will try to read the file in chunks in order to handle files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I put into a StringBuilder, and then I put this StringBuilder into an ArrayList.
However I also need to do the following:
The user can specify blockSize, which is the number of chars I have to read into the StringBuilder (basically the number of file bytes converted to chars).
Once I have collected the user-defined char count, I create a copy of the StringBuilder and put it into the ArrayList.
These steps are performed for every char read. The problem is with the StringBuilders: if the file is big (more than roughly 500 MB), I get an OutOfMemoryError.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at java.lang.StringBuilder.<init>(StringBuilder.java:106)
at borrows.wheeler.ReadFile.readFile(ReadFile.java:43)
Java Result: 1
I post my code, maybe someone could suggest improvements to this code or suggest some alternatives.
public class ReadFile {

    //matrix block size
    public int blockSize = 100;
    public int charCounter = 0;

    public ArrayList readFile(File file) throws FileNotFoundException, IOException {
        FileChannel fc = new FileInputStream(file).getChannel();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());
        ArrayList characters = new ArrayList();
        int counter = 0;
        StringBuilder sb = new StringBuilder(); //blockSize-1

        while (mbb.hasRemaining()) {
            char charAscii = (char) mbb.get();
            counter++;
            charCounter++;

            if (counter == blockSize) {
                sb.append(charAscii);
                characters.add(new StringBuilder(sb)); //new StringBuilder(sb)
                sb.delete(0, sb.length());
                counter = 0;
            } else {
                sb.append(charAscii);
            }

            if (!mbb.hasRemaining()) {
                characters.add(sb);
            }
        }
        fc.close();
        return characters;
    }
}
EDIT:
I am doing the Burrows-Wheeler transformation. There I have to read the file and then, according to the block size, create as many matrices as needed. I believe the wiki explains it better than I can:
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
If you load large files, it's not entirely surprising that you run out of memory.
How much memory do you have? Are you on a 64-bit system with 64-bit Java? How much heap memory have you allocated (e.g using -Xmx setting)?
Bear in mind that you will need at least twice as much memory as the filesize, because Java uses Unicode UTF-16, which uses at least 2 bytes for each character, but your input is one byte per character. So to load a 2GB file you will need at least 4GB allocated to the heap just for storing this text data.
Also, you need to sort out the logic in your code - you do the same sb.append(charAscii) in both the if and the else branches, and you test !mbb.hasRemaining() in every iteration of a while (mbb.hasRemaining()) loop.
As I asked in your previous question, do you need to store StringBuilders, or would the resulting Strings be OK? Storing Strings would save space because a StringBuilder allocates memory in big chunks (I think it doubles in size every time it runs out of space!), so it may waste a lot.
If you do have to use StringBuilders then pre-sizing them to the value of blockSize would make the code more memory-efficient (and faster).
I try to use MappedByteBuffer for fast reading. Maybe later I will try to read the file in chunks in order to read files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I put into a StringBuilder, and then I put this StringBuilder into an ArrayList.
This sounds more like a problem than a solution. I suggest to you that the file already is ASCII, or character data; that it could be read pretty efficiently using a BufferedReader; and that it can be processed one line at a time.
So do that. You won't get even double the speed by using a MappedByteBuffer, and everything you're doing including the MappedByteBuffer is consuming memory on a truly heroic scale.
If the file isn't such that it can be processed line by line, or record by record, there is something badly wrong upstream.
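If the block-of-N-chars structure really is required (rather than lines), here is a hedged sketch of the leaner approach: read the file through a Reader in blockSize-sized char chunks and store plain Strings rather than StringBuilders. The path and charset here are assumptions; the logic mirrors the original method without the MappedByteBuffer:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BlockReader {

    // Reads the file in blocks of 'blockSize' chars; the last block may be shorter.
    public static List<String> readBlocks(String path, int blockSize) throws IOException {
        List<String> blocks = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.US_ASCII))) {
            char[] buffer = new char[blockSize];
            int filled = 0;
            int read;
            while ((read = reader.read(buffer, filled, blockSize - filled)) != -1) {
                filled += read;
                if (filled == blockSize) {
                    blocks.add(new String(buffer, 0, filled));  // one full block
                    filled = 0;
                }
            }
            if (filled > 0) {
                blocks.add(new String(buffer, 0, filled));      // trailing partial block
            }
        }
        return blocks;
    }
}

Note that this still keeps the whole file in memory as Strings, so it does not remove the fundamental memory ceiling; if memory is the limiting factor, process each block as soon as it is produced instead of accumulating them in a list.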

Writing huge string data to file in java, optimization options

I have a chat-like desktop Java Swing app where I keep receiving String data. Over time the String variable keeps growing larger and larger.
1) Is it a wise idea to keep the large variable in memory and only save it to disk when logging is finished?
2) If not, should I save every time I get a new string (each about 30-40 characters long)?
How should I go about optimizing such a design?
I would use a writer that buffers, such as PrintWriter constructed from a file name (it wraps a BufferedWriter internally). This will buffer the data for you and only write to the OS every 8 KB (every 8192 characters). If you want to write more often you can call flush() or use a smaller buffer.
PrintWriter pw = new PrintWriter("my.log");
// will actually write to the OS about 5 times (1000 * 40 / 8192)
for (int i = 0; i < 1000; i++) {
    pw.printf("%39d%n", i); // a 39-character number plus a newline: 40 characters per line
}
pw.flush();
or you can use
pw.println(lineOfText);
BTW: If you want to know what a really huge file looks like ;) This example writes an 8 TB file http://vanillajava.blogspot.com/2011/12/using-memory-mapped-file-for-huge.html
Perhaps you should use a StringBuilder. Append each new message to it, and at the end convert it to a string.
For example,
StringBuilder sb = new StringBuilder();
// Do your code that continuously adds new messages/strings.
sb.append(new_string);
// Then once you are done...
String result = sb.toString();
If you instead kept some string, say String message, and did message += new_string every time you got a new message, it would eat up much more memory, because each += builds a brand-new String.
As suggested by Viruzzo, only save so much, then discard the earlier strings at some point. Don't hold on to every message forever.
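One hedged way to combine both suggestions (buffer in memory, but don't hold everything forever): append incoming messages to a StringBuilder and flush it to an append-mode log file once it passes some threshold. The class, threshold, file handling and method names below are all assumptions for illustration, not a prescribed design:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class ChatLog {
    private static final int FLUSH_THRESHOLD = 8 * 1024;   // flush roughly every 8 KB of text

    private final StringBuilder pending = new StringBuilder();
    private final PrintWriter out;

    public ChatLog(String path) throws IOException {
        // FileWriter in append mode, so restarting the app does not truncate the log
        this.out = new PrintWriter(new FileWriter(path, true));
    }

    public synchronized void add(String message) {
        pending.append(message).append(System.lineSeparator());
        if (pending.length() >= FLUSH_THRESHOLD) {
            flushToDisk();
        }
    }

    public synchronized void close() {
        flushToDisk();
        out.close();
    }

    private void flushToDisk() {
        out.print(pending);          // write the accumulated text
        out.flush();
        pending.setLength(0);        // and discard it from memory
    }
}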
