I have a database dump program that writes out flat files of a table in a very specific format. I now need to test this against our old program and confirm the produced files are identical. Doing this manually is painful, so I need to write some unit tests.
I need to compare the contents of two files byte by byte and find the first difference. The issue is they have all manner of crazy bytes, with CR/LF/nulls etc. littered throughout.
Here is a screenshot of the two files from SciTE to give you an idea:
http://imageshack.us/photo/my-images/840/screenshot1xvt.png/
What is the best strategy for confirming each byte corresponds?
Apache Commons IO has a FileUtils.contentEquals(File file1, File file2) method that seems to do what you want. Pros:
Looks efficient -- reads the file contents using a buffered stream, doesn't even open the files if the lengths are different.
Convenient.
Con:
Won't give you details about where the differences are. It sounds from your comment like you want this.
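For reference, basic usage of contentEquals looks something like this (the file names here are just placeholders):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class CompareDumps {
    public static void main(String[] args) throws IOException {
        // returns true only if both files have exactly the same bytes
        boolean identical = FileUtils.contentEquals(new File("old_dump.dat"),
                                                    new File("new_dump.dat"));
        System.out.println(identical ? "Files match" : "Files differ");
    }
}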
I would say your best bet is to just download the source code, see what they're doing, and then enhance it to print out the line numbers. The hard part will be figuring out which line you're on. By reading at the byte level, you will have to explicitly check for \r, \n, or \r\n and then increment your own "line number" counter. I also don't know what kind of i18n issues (if any) you'll run into.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class DominicFile {

    static boolean equalfiles(File f1, File f2) throws IOException {
        byte[] b1 = getBytesFromFile(f1);
        byte[] b2 = getBytesFromFile(f2);
        if (b1.length != b2.length) return false;
        for (int i = 0; i < b1.length; i++) {
            if (b1[i] != b2[i]) return false;
        }
        return true;
    }

    // returns the index (0 indexed) of the first difference, or -1 if identical
    // fails for files 2G or more due to limitations of "int"... use long if needed
    static int firstDiffBetween(File f1, File f2) throws IOException {
        byte[] b1 = getBytesFromFile(f1);
        byte[] b2 = getBytesFromFile(f2);
        int shortest = b1.length;
        if (b2.length < shortest) shortest = b2.length;
        for (int i = 0; i < shortest; i++) {
            if (b1[i] != b2[i]) return i;
        }
        // if one file is a prefix of the other, the first difference is at the end of the shorter file
        return (b1.length == b2.length) ? -1 : shortest;
    }

    // Returns the contents of the file in a byte array.
    // shamelessly stolen from http://www.exampledepot.com/egs/java.io/file2bytearray.html
    public static byte[] getBytesFromFile(File file) throws IOException {
        // Get the size of the file
        long length = file.length();

        // You cannot create an array using a long type; it needs to be an int.
        // Before casting, check that the file is not larger than Integer.MAX_VALUE.
        if (length > Integer.MAX_VALUE) {
            throw new IOException("File is too large: " + file.getName());
        }

        // Create the byte array to hold the data
        byte[] bytes = new byte[(int) length];

        InputStream is = new FileInputStream(file);
        try {
            // Read in the bytes
            int offset = 0;
            int numRead = 0;
            while (offset < bytes.length
                    && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
                offset += numRead;
            }

            // Ensure all the bytes have been read in
            if (offset < bytes.length) {
                throw new IOException("Could not completely read file " + file.getName());
            }
        } finally {
            // Close the input stream
            is.close();
        }
        return bytes;
    }
}
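If you also want to know which line the first difference falls on, as suggested above, here is a rough sketch of a helper that could sit alongside the methods above. It simply counts '\n' bytes up to the diff index, so it assumes newline-delimited lines and does not try to handle any i18n/multi-byte encoding issues:

    // Sketch only: 1-based line number containing the first differing byte, or -1 if identical.
    static int lineOfFirstDiff(File f1, File f2) throws IOException {
        int diff = firstDiffBetween(f1, f2);
        if (diff < 0) return -1;
        byte[] b1 = getBytesFromFile(f1);
        int line = 1;
        for (int i = 0; i < diff; i++) {
            if (b1[i] == '\n') line++;
        }
        return line;
    }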
Why not do an MD5 checksum, like the one described here?
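A minimal sketch of that idea using only the standard library (note that a checksum can only tell you whether the files differ, not where; the file names are placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Compare {

    static byte[] md5Of(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new DigestInputStream(new FileInputStream(path), md);
        try {
            // reading through the DigestInputStream updates the digest as a side effect
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // nothing to do with the bytes themselves
            }
        } finally {
            in.close();
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        boolean same = MessageDigest.isEqual(md5Of("file1.dat"), md5Of("file2.dat"));
        System.out.println(same ? "Checksums match" : "Checksums differ");
    }
}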
I'm using RandomAccessFile for writing segments. Now I want to read some file segments, but I'm having problems with how the read ends.
For example, I want to read one file page (each page contains 512 bytes).
var totalRead = 0;
var readingByte = 0;
val bytesToRead = 512; // Each file page - 512 bytes
var randomAccessFile = new RandomAccessFile(dbmsFile, "rw");
randomAccessFile.seek(pageId * PAGE_SIZE); // Start reading from chosen page (by pageId)
var stringRepresentation = new StringBuilder("");
while (totalRead < bytesToRead) {
readingByte = randomAccessFile.read();
totalRead += readingByte;
stringRepresentation.append((char) readingByte);
}
But this approach is not right, because it actually reads only a small part of the page, not the full page. 512 bytes is around 41 file records, and since I'm parsing it symbol by symbol it cannot be correct. How can I do it better?
Your code is adding the value of the byte to totalRead rather than incrementing it by 1, so it will count to 512 much faster than expected and miss a section of data. The loop should also check for and exit when randomAccessFile.read() returns -1 (EOF):
while (totalRead < bytesToRead && (readingByte = randomAccessFile.read()) != -1) {
totalRead++;
stringRepresentation.append((char) readingByte);
}
Note that this code may not handle all byte-to-char conversions correctly, as it simply casts each byte to a char.
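If the goal is simply to read one full 512-byte page, a more direct sketch (reusing the variable names from the question, and assuming the requested page actually exists in the file) is to pull the whole page into a byte array with readFully and decode it in one go:

byte[] page = new byte[PAGE_SIZE]; // PAGE_SIZE == 512
randomAccessFile.seek(pageId * PAGE_SIZE);
randomAccessFile.readFully(page); // throws EOFException if fewer than 512 bytes remain
// casting a value 0..255 to char is effectively ISO-8859-1, so decode explicitly with that charset
String stringRepresentation = new String(page, java.nio.charset.StandardCharsets.ISO_8859_1);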
I just finished coding a Huffman compression/decompression program. The compression part of it seems to work fine but I am having a little bit of a problem with the decompression. I am quite new to programming and this is my first time doing any sort of byte manipulation/file handling so I am aware that my solution is probably awful :D.
For the most part my decompression method works as intended but sometimes it drops data after decompression (aka my decompressed file is smaller than my original file).
Also, whenever I try to decompress a file that isn't a plain text file (for example a .jpg), the decompression returns a completely empty file (0 bytes); the compression handles these other types of files just fine, though.
Decompression method:
public static void decompress(File file){
try {
BitFileReader bfr = new BitFileReader(file);
int[] charFreqs = new int[256];
TreeMap<String, Integer> decodeMap = new TreeMap<String, Integer>();
File nF = new File(file.getName() + "_decomp");
nF.createNewFile();
BitFileWriter bfw = new BitFileWriter(nF);
DataInputStream data = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
int uniqueBytes;
int counter = 0;
int byteCount = 0;
uniqueBytes = data.readUnsignedByte();
// Read frequency table
while (counter < uniqueBytes){
int index = data.readUnsignedByte();
int freq = data.readInt();
charFreqs[index] = freq;
counter++;
}
// build tree
Tree tree = buildTree(charFreqs);
// build TreeMap
fillDecodeMap(tree, new StringBuffer(), decodeMap);
// Skip BitFileReader position to actual compressed code
bfr.skip(uniqueBytes*5);
// Get total number of compressed bytes
for(int i=0; i<charFreqs.length; i++){
if(charFreqs[i] > 0){
byteCount += charFreqs[i];
}
}
// Decompress data and write
counter = 0;
StringBuffer code = new StringBuffer();
while(bfr.hasNextBit() && counter < byteCount){
code.append(""+bfr.nextBit());
if(decodeMap.containsKey(code.toString())){
bfw.writeByte(decodeMap.get(code.toString()));
code.setLength(0);
counter++;
}
}
bfw.close();
bfr.close();
data.close();
System.out.println("Decompression successful!");
}
catch (IOException e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
File f = new File("test");
compress(f);
f = new File("test_comp");
decompress(f);
}
}
When I compress the file I save the "character" (byte) values and the frequencies of each unique "character" + the compressed bytes in the same file (all in binary form). I then use this saved info to fill my charFreqs array in my decompress() method and then use that array to build my tree. The formatting of the saved structure looks like this:
<n><value 1><frequency>...<value n><frequency>[the compressed bytes]
(without the <> of course) where n is the number of unique bytes/characters I have in my original text (AKA my leaf values).
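For reference, a header in that layout could be written with a DataOutputStream along these lines. This is only a sketch (outFile is a placeholder, and this is not your actual compress() code), but it matches the readUnsignedByte()/readInt() calls and the uniqueBytes*5 skip in decompress(); note that the leading count is a single byte, which can only represent values 0..255:

DataOutputStream header = new DataOutputStream(new FileOutputStream(outFile));
int uniqueBytes = 0;
for (int freq : charFreqs) {
    if (freq > 0) uniqueBytes++;
}
header.writeByte(uniqueBytes);             // <n>, one byte
for (int value = 0; value < charFreqs.length; value++) {
    if (charFreqs[value] > 0) {
        header.writeByte(value);           // <value>, one byte
        header.writeInt(charFreqs[value]); // <frequency>, four bytes
    }
}
// ... the compressed bit stream follows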
I have tested my code a bit and the bytes seem to get dropped somewhere in the while() loop at the bottom of my decompress method (charFreqs[] and the tree seem to retain all the original byte values).
EDIT: Upon request I have now shortened my post a bit in an attempt to make it less cluttered and more "straight to the point".
EDIT 2: I fixed it (but not fully)! The fault was in my BitFileWriter and not in my decompress method. My decompression still does not function properly though. Whenever I try to decompress something that isn't a plain text file (for example a .jpg) it returns an empty "decompressed" file (0 bytes in size). I have no idea what is causing this...
I am having trouble using the NIO MappedByteBuffer function to read very large seismic files. The format my program reads is called SEGY and consists of seismic data samples as well as meta data regarding, among other items, the numeric ID and XY coordinates of the seismic data.
The structure of the format is fairly fixed with a 240 byte header followed by a fixed number of data samples making up each seismic trace. The number of samples per trace can vary from file to file but usually is around 1000 to 2000.
Samples can be written as single bytes, 16 or 32 bit integers, or either IBM or IEEE float. The data in each trace header can likewise be in any of the above formats. To further confuse the issue SEGY files can be in big or little endian byte order.
The files can range in size from 3600 bytes up to several terabytes.
My application is a SEGY editor and viewer. For many of the functions it performs I must read only one or two variables, say long ints from each trace header.
At present I am reading from a RandomAccessFile into a byte buffer, then extracting the needed variables from a view buffer. This works but is painfully slow for very large files.
I have written a new file handler using a mapped byte buffer that breaks the file into 5000 trace MappedByteBuffers. This works well and is very fast until my system runs low on memory and then it slows to a crawl and I am forced to reboot just to make my Mac useable again.
For some reason the memory from the buffers is never released, even after my program is finished. I need to either do a purge or reboot.
This is my code. Any suggestions would be most appreciated.
package MyFileHandler;
import java.io.*;
import java.nio.*;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
public class MyFileHandler
{
/*
A buffered file IO class that keeps NTRACES traces in memory for reading and writing.
the buffers start and end at trace boundaries and the buffers are sequential
i.e 1-20000,20001-40000, etc
The last, or perhaps only buffer will contain less than NTRACES up to the last trace
The arrays BufferOffsets and BufferLengths contain the start and length for all the
buffers required to read and write to the file
*/
private static int NTRACES = 5000;
private boolean HighByte;
private long FileSize;
private int BytesPerTrace;
private FileChannel FileChnl;
private MappedByteBuffer Buffer;
private long BufferOffset;
private int BufferLength;
private long[] BufferOffsets;
private int[] BufferLengths;
private RandomAccessFile Raf;
private int BufferIndex;
private ArrayList Maps;
public MyFileHandler(RandomAccessFile raf, int bpt)
{
try
{
HighByte = true;
// allocate a filechannel to the file
FileChnl = raf.getChannel();
FileSize = FileChnl.size();
BytesPerTrace = bpt;
SetUpBuffers();
BufferIndex = 0;
GetNewBuffer(0);
} catch (IOException ioe)
{
ioe.printStackTrace();
}
}
private void SetUpBuffers()
{
// get number of traces in entire file
int ntr = (int) ((FileSize - 3600) / BytesPerTrace);
int nbuffs = ntr / NTRACES;
// add one to nbuffs unless filesize is multiple of NTRACES
if (Math.IEEEremainder(ntr, NTRACES) != 0)
{
nbuffs++;
}
BufferOffsets = new long[nbuffs];
BufferLengths = new int[nbuffs];
// BuffOffset are in bytes, not trace numbers
//get the offsets and lengths of each buffer
for (int i = 0; i < nbuffs; i++)
{
if (i == 0)
{
// first buffer contains EBCDIC header 3200 bytes and binary header 400 bytes
BufferOffsets[i] = 0;
BufferLengths[i] = 3600 + (Math.min(ntr, NTRACES) * BytesPerTrace);
} else
{
BufferOffsets[i] = BufferOffsets[i - 1] + BufferLengths[i - 1];
BufferLengths[i] = (int) (Math.min(FileSize - BufferOffsets[i], NTRACES * BytesPerTrace));
}
}
GetMaps();
}
private void GetMaps()
{
// map the file to list of MappedByteBuffer
Maps = new ArrayList(BufferOffsets.length);
try
{
for(int i=0;i<BufferOffsets.length;i++)
{
MappedByteBuffer map = FileChnl.map(FileChannel.MapMode.READ_WRITE, BufferOffsets[i], BufferLengths[i]);
SetByteOrder(map);
Maps.add(map);
}
} catch (IOException ioe)
{
ioe.printStackTrace();
}
}
private void GetNewBuffer(long offset)
{
if (Buffer == null || offset < BufferOffset || offset >= BufferOffset + BufferLength)
{
BufferIndex = GetBufferIndex(offset);
BufferOffset = BufferOffsets[BufferIndex];
BufferLength = BufferLengths[BufferIndex];
Buffer = (MappedByteBuffer)Maps.get(BufferIndex);
}
}
private int GetBufferIndex(long offset)
{
int indx = 0;
for (int i = 0; i < BufferOffsets.length; i++)
{
if (offset >= BufferOffsets[i] && offset < BufferOffsets[i]+BufferLengths[i])
{
indx = i;
break;
}
}
return indx;
}
private void SetByteOrder(MappedByteBuffer ByteBuff)
{
if (HighByte)
{
ByteBuff.order(ByteOrder.BIG_ENDIAN);
} else
{
ByteBuff.order(ByteOrder.LITTLE_ENDIAN);
}
}
// public methods to read, (get) or write (put) an array of types, byte, short, int, or float.
// for sake of brevity only showing get and put for ints
public void Get(int[] buff, long offset)
{
GetNewBuffer(offset);
Buffer.position((int) (offset - BufferOffset));
Buffer.asIntBuffer().get(buff);
}
public void Put(int[] buff, long offset)
{
GetNewBuffer(offset);
Buffer.position((int) (offset - BufferOffset));
Buffer.asIntBuffer().put(buff);
}
public void HighByteOrder(boolean hb)
{
// all byte swapping is done by the buffer class
// set all allocated buffers to same byte order
HighByte = hb;
}
public int GetBuffSize()
{
return BufferLength;
}
public void Close()
{
try
{
FileChnl.close();
} catch (Exception e)
{
e.printStackTrace();
}
}
}
You are mapping the entire file into memory, via a possibly large number of MappedByteBuffers, and as you are keeping them all in a list they are never released. It is pointless. You may as well map the entire file with a single MappedByteBuffer, or the minimum number you need to get around the address-space limitation. There is no benefit in using more of them than you need.
But I would only map the segment of the file that is currently being viewed/edited, and release it when the user moves to another segment.
I'm surprised that MappedByteBuffer is found to be so much faster. Last time I tested, reads via mapped byte buffers were only 20% faster than RandomAccessFile, and writes not at all. I'd like to see the RandomAccessFile code, as it seems there is probably something wrong with it that could easily be fixed.
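A rough sketch of the "map only the segment currently being viewed" approach, reusing the fields from your class (the field and method names added here are illustrative, not part of your code). Keeping only one mapping live means older mappings become unreachable and can eventually be reclaimed once their buffers are garbage collected; there is no way to force that from standard Java, but at least you stop accumulating mappings for the whole file:

private MappedByteBuffer currentMap;
private long currentMapStart = -1;
private long currentMapLength = 0;

// Remap only when the requested range falls outside the currently mapped segment.
private void ensureMapped(long offset, long minLength) throws IOException
{
    if (currentMap == null || offset < currentMapStart
            || offset + minLength > currentMapStart + currentMapLength)
    {
        long length = Math.min(FileSize - offset, (long) NTRACES * BytesPerTrace);
        currentMap = FileChnl.map(FileChannel.MapMode.READ_WRITE, offset, length);
        SetByteOrder(currentMap);
        currentMapStart = offset;
        currentMapLength = length;
    }
}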
I need to read a file that is in ASCII and convert it into hex before applying some functions (searching for a specific character).
To do this, I read the file, convert it to hex and write it into a new file. Then I open my new hex file and apply my functions.
My issue is that it takes way too much time to read and convert it (approx. 8 seconds for a 9 MB file).
My reading method is:
public static void convertToHex2(PrintStream out, File file) throws IOException {
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
int value = 0;
StringBuilder sbHex = new StringBuilder();
StringBuilder sbResult = new StringBuilder();
while ((value = bis.read()) != -1) {
sbHex.append(String.format("%02X ", value));
}
sbResult.append(sbHex);
out.print(sbResult);
bis.close();
}
Do you have any suggestions to make it faster ?
Did you measure where your actual bottleneck is? You seem to read a very small amount of data in your loop and process it each time. You might as well read larger chunks of data and process those, e.g. using DataInputStream or whatever. That way you would benefit more from the optimized reads of your OS, file system, their caches, etc.
Additionally, you fill sbHex and then append it to sbResult just to print it somewhere. That looks like an unnecessary copy to me, because sbResult will always be empty in your case, and with sbHex you already have a StringBuilder for your PrintStream.
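A sketch of the chunked-read idea; the method name and the 64 KB buffer size are arbitrary, and it mirrors the signature from the question (the per-byte formatting cost is addressed in the next answer):

public static void convertToHexChunked(PrintStream out, File file) throws IOException {
    InputStream in = new FileInputStream(file);
    try {
        byte[] chunk = new byte[64 * 1024];      // read large blocks instead of single bytes
        StringBuilder sb = new StringBuilder();
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                sb.append(String.format("%02X ", chunk[i] & 0xFF));
            }
        }
        out.print(sb);
    } finally {
        in.close();
    }
}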
Try this:
static String[] xx = new String[256];
static {
for( int i = 0; i < 256; ++i ){
xx[i] = String.format("%02X ", i);
}
}
and use it:
sbHex.append(xx[value]);
Formatting is a heavy operation: it not only does the conversion, it also has to parse the format string.
I have a series of objects stored within a file concatenated as below:
sizeOfFile1 || file1 || sizeOfFile2 || file2 ...
The sizes of the files are serialized Long objects, and the files are just their raw bytes.
I am trying to extract the files from the input file. Below is my code:
FileInputStream fileInputStream = new FileInputStream("C:\\Test.tst");
ObjectInputStream objectInputStream = new ObjectInputStream(fileInputStream);
while (fileInputStream.available() > 0)
{
long size = (long) objectInputStream.readObject();
FileOutputStream fileOutputStream = new FileOutputStream("C:\\" + size + ".tst");
BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
int chunkSize = 256;
final byte[] temp = new byte[chunkSize];
int finalChunkSize = (int) (size % chunkSize);
final byte[] finalTemp = new byte[finalChunkSize];
while(fileInputStream.available() > 0 && size > 0)
{
if (fileInputStream.available() > finalChunkSize)
{
int i = fileInputStream.read(temp);
bufferedOutputStream.write(temp, 0, i);
size = size - i;
}
else
{
int i = fileInputStream.read(finalTemp);
bufferedOutputStream.write(finalTemp, 0, i);
size = 0;
}
}
bufferedOutputStream.close();
}
fileOutputStream.close();
My code fails after it reads the first sizeOfFile; it just reads the rest of the input file into one file when there are multiple files stored.
Can anyone see the issue here?
Regards.
Wrap it in a DataInputStream and use readFully(byte[]).
But I question the design. Serialization and random access do not mix. It sounds like you should be using a database.
NB you are misusing available(). See the method's Javadoc page. It is never correct to use it as a count of the total number of bytes in the stream. There are few if any correct uses of available(), and this isn't one of them.
you could try NIO instead...
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
This maps only SIZE bytes of the file.
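To actually copy bytes out of the mapped region, something like this fragment would follow (assuming SIZE fits in an int):

byte[] dest = new byte[SIZE];
roBuf.get(dest); // copies SIZE bytes starting at the buffer's current position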
This is using DataInput to read the longs. In this particular case I am not using readFully(), as a segment might be too large to keep in memory:
DataInputStream in = new DataInputStream(new FileInputStream(file));
byte[] buf = new byte[64*1024];
while (true) {
    OutputStream out = ...;
    long size;
    try { size = in.readLong(); } catch (EOFException e) { break; }
    while (size > 0) {
        int len = (int) Math.min(size, buf.length);
        len = in.read(buf, 0, len);
        if (len < 0) break; // unexpected end of stream
        out.write(buf, 0, len);
        size -= len;
    }
    out.close();
}
Save yourself a lot of trouble by doing one of these things:
Switch to using Avro; trust me, you would be crazy not to. It's easy to learn, and will accommodate schema changes. Using ObjectXXXStream is one of the worst ideas ever: as soon as you change your schema, your old files are garbage.
or use Thrift
or use Hibernate (but this is probably not a great option, hibernate takes a lot of time to learn, and takes a lot of configuration)
If you really refuse to switch to Avro, I recommend reading up on Apache Commons IO's IOUtils class. It has a method to copy from one input stream to another, saving you a lot of headaches. Unfortunately what you want to do is a little more complicated: you want the size prefixing each file. You might be able to use a combination of SequenceInputStream objects to do that.
There is also GzipOutputStream and ZipOutputStream, but I think those require some other jars added to your classpath too.
I'm not going to write an example because I honestly think you should just learn avro or thrift and use that.