Java - Compare InputStreams of two identical files


I am creating a JUnit test that compares a generated file with a benchmark file stored in the resources folder under the src folder in Eclipse.
Code
public class CompareFileTest
{
private static final String TEST_FILENAME = "/resources/CompareFile_Test_Output.xls";
@Test
public void testCompare()
{
InputStream outputFileInputStream = null;
BufferedInputStream bufferedInputStream = null;
File excelOne = new File(StandingsCreationHelper.directoryPath + "CompareFile_Test_Input1.xls");
File excelTwo = new File(StandingsCreationHelper.directoryPath + "CompareFile_Test_Input1.xls");
File excelThree = new File(StandingsCreationHelper.directoryPath + "CompareFile_Test_Output.xls");
CompareFile compareFile = new CompareFile(excelOne, excelTwo, excelThree);
// The result of the comparison is stored in the excelThree file
compareFile.compare();
try
{
outputFileInputStream = new FileInputStream(excelThree);
bufferedInputStream = new BufferedInputStream(outputFileInputStream);
assertTrue(IOUtils.contentEquals(CompareFileTest.class.getResourceAsStream(TEST_FILENAME), bufferedInputStream));
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
However, I get an AssertionError without any details. Since I just created the benchmark file from the compare file operation, both files should be identical.
Thanks in advance!
EDIT: After slim's comments, I used a file diff tool and found that both files are different, although, since they are copies, I am not sure how that happened. Maybe there is a timestamp or something?

IOUtils.contentEquals() does not claim to give you any more information than a boolean "matches or does not match", so you cannot hope to get extra information from that.
If your aim is just to get to the bottom of why these two files are different, you might step away from Java and use other tools to compare the files. For example https://superuser.com/questions/125376/how-do-i-compare-binary-files-in-linux
If your aim is for your JUnit tests to give you more information when the files do not match (for example, the exception could say Expected files to match, but byte 5678 differs [0xAE] vs [0xAF]), you will need to use something other than IOUtils.contentEquals(), either by rolling your own or by hunting for something appropriate in Comparing text files w/ Junit
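A "roll your own" comparison of the kind described above could look roughly like this. It is only a sketch: the class name StreamDiff and the method firstMismatch are made up for illustration, and the caller is assumed to wrap both streams in BufferedInputStream, since the loop reads one byte at a time.

```java
import java.io.IOException;
import java.io.InputStream;

final class StreamDiff {

    /**
     * Returns the offset of the first differing byte, or -1 if both
     * streams have identical content. A stream that ends early counts
     * as a difference at the offset where it ended.
     */
    static long firstMismatch(InputStream expected, InputStream actual) throws IOException {
        long offset = 0;
        while (true) {
            int e = expected.read();
            int a = actual.read();
            if (e != a) {
                return offset; // differing byte, or one stream ended first
            }
            if (e == -1) {
                return -1; // both streams ended at the same point
            }
            offset++;
        }
    }
}
```

In a test, the returned offset can go into the assertion message, so a failure says where the files diverge instead of only that they do.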

I had a similar issue.
I was using the JUnit Assertions class and, as far as I could tell, the object references were being compared rather than the actual file contents.
Instead of comparing the InputStream objects, I read them into byte arrays and compared those. I am no specialist, but I dare claim that if the byte arrays are identical, the underlying InputStreams and their files are very likely equal.
Like this:
// assertArrayEquals compares the array contents; assertEquals on arrays compares references
Assertions.assertArrayEquals(
this.getClass().getResourceAsStream("some_image_or_other_file.ext").readAllBytes(),
someObject.getSomeObjectInputStream().readAllBytes());
Not sure that this will scale though for larger files. Certainly not OK for complex diffs, but it does the trick for an assertion.
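For larger files, one option that avoids loading both files into memory is Files.mismatch, which is available from Java 12. The sketch below assumes both sides exist as files on disk (not only as streams); the class name FileMismatch and the method sameContent are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileMismatch {

    /** Returns true when both files have identical bytes (requires Java 12 or later). */
    static boolean sameContent(Path expected, Path actual) throws IOException {
        // Files.mismatch returns the offset of the first differing byte, or -1 if there is none.
        return Files.mismatch(expected, actual) == -1L;
    }
}
```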

Related

Multiple file reading loop and distinguishing between .pdf and .doc files

I am writing a Java program in Eclipse that scans resumes for keywords and picks the most suitable resume, as well as showing the keywords found in each one. The resumes can be in doc or pdf format.
I have successfully implemented code that reads pdf files and doc files separately (using Apache's PDFBox and POI jar packages and importing the libraries for the required methods), displays the keywords and rates each resume by the number of keywords found.
Now there are two issues I am stuck on:
(1) I need to distinguish between a pdf file and a doc file within the program. An if statement would do, but I don't know how to write the code that detects whether a file has a .pdf or .doc extension. (I intend to build an application to select the resumes, and the program then has to decide whether to run the doc reading block or the pdf reading block.)
(2) I intend to run the program on a list of resumes, so I need a loop that runs the keyword scan for each resume. I can't think of a way to do this: even if the files were named 'resume1', 'resume2' and so on, the loop variable can't be used inside the file location like 'C:/Resumes_Folder/Resume[i]', since that is the path.
Any help would be appreciated!

You can use a FileFilter to read only one type or the other, then respond accordingly. File.listFiles(FileFilter) gives you an array containing only files of the desired type.
The second requirement is confusing to me. I think you would be well served by creating a class that encapsulates the data and behavior that you want for a parsed Resume. Write a factory class that takes in an InputStream and produces a Resume with the data you need inside.
You are making a classic mistake: You are embedding all the logic in a main method. This will make it harder to test your code.
All problem solving consists of breaking big problems into smaller ones, solving the small problems, and assembling them to finally solve the big problem.
I would recommend that you decompose this problem into smaller classes. For example, don't worry about looping over a directory's worth of files until you can read and parse an individual PDF and DOC file.
Create an interface:
public interface ResumeParser {
Resume parse(InputStream is) throws IOException;
}
Write separate implementations for PDF and Word documents.
Create a factory to give you the appropriate ResumeParser based on file type:
public class ResumeParserFactory {
public ResumeParser create(String fileType) {
if (fileType.contains(".pdf")) {
return new PdfResumeParser();
} else if (fileType.contains(".doc")) {
return new WordResumeParser();
} else {
throw new IllegalArgumentException("Unknown document type: " + fileType);
}
}
}
Be sure to write unit tests as you go. You should know how to use JUnit.
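For the first part of the question (detecting the extension), a small helper can pull out whatever follows the last dot. This is a sketch under assumptions: FileExtensions and extensionOf are hypothetical names, and the argument is expected to be a bare file name rather than a full path. Comparing the exact extension also avoids the contains() check matching names such as "notes.pdf.txt".

```java
import java.util.Locale;

final class FileExtensions {

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, a leading dot (hidden file) or a trailing dot all mean "no extension".
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

A factory like the one above could then switch on the returned string ("pdf" or "doc") instead of searching the whole name.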

An alternative to using a FileFilter is a DirectoryStream, because Files::newDirectoryStream makes it easy to specify the relevant file endings:
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.{doc,pdf}")) {
for (Path entry: stream) {
// process files here
}
} catch (DirectoryIteratorException ex) {
// I/O error encountered during the iteration; the cause is an IOException
throw ex.getCause();
}

You can do something basic like:
// Put the path to the folder containing all the resumes here
File f = new File("C:\\");
ArrayList<String> names = new ArrayList<>
(Arrays.asList(Objects.requireNonNull(f.list())));
for (String fileName : names) {
if (fileName.length() > 3) {
String type = fileName.substring(fileName.length() - 3);
if (type.equalsIgnoreCase("doc")) {
// doc file logic here
} else if (type.equalsIgnoreCase("pdf")) {
// pdf file logic here
}
}
}
But as DuffyMo's answer says, you can also use a FileFilter (it's definitely a better option than my quick code).
Hope it helps.

Create a text file if it doesn't exist and append to it if it does using Java BufferedWriter

This is probably ridiculously simple for gun Java programmers, yet the fact that I (a relative newbie to Java) couldn't find a simple, straightforward example of how to do it means that I'm going to use the self-answer option to hopefully prevent others going through similar frustration.
I needed to output error information to a simple text file. These actions would be infrequent and small (and sometimes not needed at all) so there is no point keeping a stream open for the file; the file is opened, written to and closed in the one action.
Unlike other "append" questions that I've come across, this one requires that the file be created on the first call to the method in that run of the Java application. The file will not exist before that.
The original code was:
Path pathOfLog = Paths.get(gsOutputPathUsed + gsOutputFileName);
Charset charSetOfLog = Charset.forName("US-ASCII");
bwOfLog = Files.newBufferedWriter(pathOfLog, charSetOfLog);
bwOfLog.append(stringToWrite, 0, stringToWrite.length());
iReturn = stringToWrite.length();
bwOfLog.newLine();
bwOfLog.close();
The variables starting with gs are pre-populated string variables showing the output location, and stringToWrite is an argument which is passed in.
So the .append method should be enough to show that I wanted to append content, right?
But it isn't; each time the procedure was called the file was left containing only the string of the most recent call.

The answer is that you also need to specify open options when calling the newBufferedWriter method. What gets you is the default arguments as specified in the documentation:
If no options are present then this method works as if the CREATE,
TRUNCATE_EXISTING, and WRITE options are present.
Specifically, it's TRUNCATE_EXISTING that causes the problem:
If the file already exists and it is opened for WRITE access, then its
length is truncated to 0.
The solution, then, is to change
bwOfLog = Files.newBufferedWriter(pathOfLog, charSetOfLog);
to
bwOfLog = Files.newBufferedWriter(pathOfLog, charSetOfLog, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
Probably obvious to long time Java coders, less so to new ones. Hopefully this will help someone avoid a bit of head banging.

You can also try this :
Path path = Paths.get("C:\\Users", "textfile.txt");
String text = "\nHello how are you ?";
try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8, StandardOpenOption.APPEND, StandardOpenOption.CREATE)) {
writer.write(text);
} catch (IOException e) {
e.printStackTrace();
}

Comparing Files dependent on operative system. JUnit

I have a simple JUnit test which checks that two files have the same content. It works perfectly fine on my Unix laptop.
Here is the test:
boolean response = false;
try {
File got = File.createTempFile("got-", ".csv");
String outputPath = got.getAbsolutePath();
testedObject.createCsvFile(outputPath);
got = new File(outputPath);
String expectedFilePath = getClass().getClassLoader().getResource("expected.csv").getFile();
File expected = new File(expectedFilePath);
response = FileUtils.contentEquals(got, expected); // Here it is the key
} catch (IOException e) {
// Nothing to do Yay!
}
Assert.assertTrue(response);
It works: if I compare both files manually, for example with the diff command, they are exactly the same.
My teammate codes on a Windows laptop, and when he ran the test it failed, so we started debugging.
Visually, both files are the same; a human reviewer cannot spot any difference. But when we ran the following in a Cygwin terminal:
diff expected.csv got.csv
it reported that every line was different, and the test fails.
What is the problem? Is it the operating system? If so, is there a way to compare file contents that does not depend on the operating system?

My guess is that this is most likely due to the line ending, which is \n on Unix-like systems and \r\n on Windows.
A common way to check whether two files have the same content is to hash both of them (e.g. with SHA-1) and check whether the hashes match. Note that the hashes will also differ if the line endings differ.

This behaviour can be attributed to the line separator being different on the two operating systems.
If you want the code to be platform independent, you should pick up the value at runtime using
System.getProperty("line.separator");
You might also want to have a look at the character encoding of both files.
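If the goal is a comparison that ignores which line separator each file uses, one option is to compare the files line by line instead of byte by byte. The following is a sketch, not the approach from the question: LineCompare and sameLines are illustrative names, and UTF-8 is an assumed encoding for both files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LineCompare {

    /** Compares two text files line by line, ignoring the line terminator used. */
    static boolean sameLines(Path expected, Path actual) throws IOException {
        // readAllLines treats \n, \r and \r\n as line terminators and strips them.
        List<String> expectedLines = Files.readAllLines(expected, StandardCharsets.UTF_8);
        List<String> actualLines = Files.readAllLines(actual, StandardCharsets.UTF_8);
        return expectedLines.equals(actualLines);
    }
}
```

This reads both files fully into memory, which is fine for small CSV fixtures but not for very large files.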

This answer can help you: Java Apache FileUtils readFileToString and writeStringToFile problems. The question's author is talking about PDF file, but this answer makes sense for your question.

Java - find matching pairs from list

background:
I need to load test a process on a server that I am working with. What I am doing is I am creating a bunch of files on client side and will upload them to server. The server is monitoring for new files (in input dir, file names are unique) and once there is a new file it processes it, once done, it creates a response file with same name but different extension to output dir. If the processing fails, it puts the incoming file to error dir. I am using the inotifywait to monitor the changes on server, which outputs:
10:48:47 /path/to/in/ CREATE ABCD.infile1
10:48:55 /path/to/out/ CREATE ABCD.outfile1
or
10:49:11 /path/to/in/ CREATE ASDF.infile1
10:49:19 /path/to/err/ CREATE ASDF.infile1
problem:
I need to parse the list of all results (planning to implement this in Java): I take the infile and match it with the same file name (found in either ERR or OUT), calculate the time taken and indicate whether it was a success or not. My idea is to create 3 lists (in, out, err) and parse them, something like (in pseudo-code)
inList
outList
errList
for item : inList
if outlist.contains(item) parse;
else if errList.contains(item) parse;
else error;
question:
Is this efficient? Or is there a better way to approach this situation? You might think that this code is executed just once, so why bother, but I would really like to know how to handle this properly.

The solution with lists is problematic, as you will have to keep them synchronized with the state of the drive and reload them every time. What is more, at some point you will reach the capacity limit for files stored in a single location.
One alternative is to use the I/O API to check whether a path exists.
Another is to introduce an intermediate database that stores the keys and the physical path of each file.
If I were you, I would start with the I/O API and design a simple interface that can be replaced later if the solution turns out to be inefficient.
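As a separate option, if the inotifywait output from the question is already collected as text, the matching can be done in one pass with a map keyed by the base file name, which avoids the repeated contains() scans over lists. This is only a sketch of that idea, not the approach of the answer above. EventMatcher and match are hypothetical names, and it assumes that every line has the four-field shape shown in the question, that paths contain no spaces, and that no run crosses midnight.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EventMatcher {

    /** Pairs each event in the in directory with the later out/err event of the same base name. */
    static List<String> match(List<String> lines) {
        Map<String, LocalTime> started = new HashMap<>();
        List<String> results = new ArrayList<>();
        for (String line : lines) {
            // Expected shape: "HH:mm:ss /path/to/dir/ CREATE name.ext"
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 4) {
                continue; // skip lines that do not look like an event
            }
            LocalTime time = LocalTime.parse(parts[0]);
            String dir = parts[1];
            String name = parts[3];
            int dot = name.lastIndexOf('.');
            String key = dot < 0 ? name : name.substring(0, dot);
            if (dir.endsWith("/in/")) {
                started.put(key, time);
            } else if (dir.endsWith("/out/") || dir.endsWith("/err/")) {
                LocalTime start = started.remove(key);
                if (start != null) {
                    long seconds = Duration.between(start, time).getSeconds();
                    String status = dir.endsWith("/out/") ? "OK" : "ERROR";
                    results.add(key + " " + status + " " + seconds + "s");
                }
            }
        }
        return results;
    }
}
```

Each lookup is a constant-time map operation, so the whole pass is linear in the number of log lines. Input files that never produce an out or err event are simply not reported by this sketch.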

You can use the UserDefinedFileAttributeView concept.
Create your own file attribute, say "Result", and set its value for the files in the IN dir: if the file is moved to the OUT dir, "Result"="Success", and if it is moved to the ERR dir, "Result"="Error".
I tried the code below; I hope it helps.
public static void main(String[] args) {
try{
Path file = Paths.get("C:\\Users\\rohit\\Desktop\\imp docs\\Steps.txt");
UserDefinedFileAttributeView userView = Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
String attribName = "RESULT";
String attribValue = "SUCCESS";
userView.write(attribName, Charset.defaultCharset().encode(attribValue));
List<String> attribList = userView.list();
for (String s : attribList) {
ByteBuffer buf = ByteBuffer.allocate(userView.size(s));
userView.read(s, buf);
buf.flip();
String value = Charset.defaultCharset().decode(buf).toString();
if("SUCCESS".equals(value)){
System.out.print(String.format("User defined attribute: %s", s));
System.out.println(String.format("; value: %s", value));
}
}
}
catch(Exception e){
e.printStackTrace();
}
}
You can do this for every file placed in IN dir.

Java, Linux: how to detect whether two java.io.Files refer to the same physical file

I'm looking for an efficient way to detect whether two java.io.Files refer to the same physical file. According to the docs, File.equals() should do the job:
Tests this abstract pathname for
equality with the given object.
Returns true if and only if the
argument is not null and is an
abstract pathname that denotes the
same file or directory as this
abstract pathname.
However, given a FAT32 partition (actually a TrueCrypt container) which is mounted at /media/truecrypt1:
new File("/media/truecrypt1/File").equals(new File("/media/truecrypt1/file")) == false
Would you say that this conforms to the specification? And in this case, how to work around that problem?
Update: Thanks to commenters, for Java 7 I've found java.nio.file.Files.isSameFile() which works for me.

The answer in @Joachim's comment is normally correct. The way to determine if two File objects refer to the same OS file is to use getCanonicalFile() or getCanonicalPath(). The javadoc says this:
"A canonical pathname is both absolute and unique. [...] Every pathname that denotes an existing file or directory has a unique canonical form."
So the following should work:
File f1 = new File("/media/truecrypt1/File"); // different capitalization ...
File f2 = new File("/media/truecrypt1/file"); // ... but same OS file (on Windows)
if (f1.getCanonicalPath().equals(f2.getCanonicalPath())) {
System.out.println("Files are equal ... no kittens need to die.");
}
However, it would appear that you are viewing a FAT32 file system mounted on UNIX / Linux. AFAIK, Java does not know that this is happening, and is just applying the generic UNIX / Linux rules for file names ... which give the wrong answer in this scenario.
If this is what is really happening, I don't think there is a reliable solution in pure Java 6. However,
You could do some hairy JNI stuff; e.g. get the file descriptor numbers and then in native code, use the fstat(2) system call to get hold of the two files' device and inode numbers and comparing those.
Java 7 java.nio.file.Path.equals(Object) looks like it might give the right answer if you call toRealPath() on the paths first to resolve symlinks. (It is a little unclear from the javadoc whether each mounted filesystem on Linux will correspond to a distinct FileSystem object.)
The Java 7 tutorials have this section on seeing if two Path objects are for the same file ... which recommends using java.nio.file.Files.isSameFile(Path, Path)
Would you say that this conforms to the specification?
No and yes.
No in the sense that the getCanonicalPath() method is not returning the same value for each existing OS file ... which is what you'd expect from reading the javadoc.
Yes in the technical sense that the Java codebase (not the javadoc) is the ultimate specification ... both in theory and in practice.

you could try to obtain an exclusive write lock on the file, and see if that fails:
// file1 and file2 are the two java.io.File objects being compared
boolean isSame = false;
// Open in append mode so that the existing contents are not truncated
try (FileOutputStream out1 = new FileOutputStream(file1, true);
     FileOutputStream out2 = new FileOutputStream(file2, true);
     FileLock lock1 = out1.getChannel().tryLock()) {
    try (FileLock lock2 = out2.getChannel().tryLock()) {
        // The second lock attempt did not overlap, so these are treated as different files
        isSame = false;
    } catch (OverlappingFileLockException e) {
        // This JVM already holds a lock on the same file through the first channel
        isSame = true;
    }
} catch (IOException e) {
    isSame = false;
}
System.out.println(file1 + " and " + file2 + " are " + (isSame ? "" : "not ") + "the same");
This is not always guaranteed to be correct tho - because another process could potentially have obtained the lock, and thus fail for you. But at least this doesn't require you to shell out to an external process.

There's a chance the same file has two paths (e.g. over the network \\localhost\file and \\127.0.0.1\file would refer to the same file with a different path).
I would go with comparing digests of both files to determine whether they are identical or not. Something like this
public static void main(String args[]) {
try {
File f1 = new File("\\\\79.129.94.116\\share\\bots\\triplon_bots.jar");
File f2 = new File("\\\\triplon\\share\\bots\\triplon_bots.jar");
System.out.println(f1.getCanonicalPath().equals(f2.getCanonicalPath()));
System.out.println(computeDigestOfFile(f1).equals(computeDigestOfFile(f2)));
}
catch(Exception e) {
e.printStackTrace();
}
}
private static String computeDigestOfFile(File f) throws Exception {
MessageDigest md = MessageDigest.getInstance("MD5");
// DigestInputStream feeds every byte it reads into md, so no explicit md.update() is needed
try (InputStream is = new DigestInputStream(new FileInputStream(f), md)) {
byte[] buffer = new byte[1024];
while (is.read(buffer) != -1) {
// nothing to do here: reading already updates the digest
}
}
return new BigInteger(1, md.digest()).toString(16);
}
It outputs
false
true
This approach is of course much slower than any sort of path comparison, and its cost depends on the size of the files. Another possible side effect is that two files with identical content will be considered equal regardless of their locations.

The Files.isSameFile method was added for exactly this kind of usage - that is, you want to check if two non-equal paths locate the same file.
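A minimal sketch of that call, assuming Java 7 or later and that both files exist; the class name SameFileCheck and the method sameFile are illustrative only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class SameFileCheck {

    /** Returns true when both paths locate the same file, even if the paths themselves differ. */
    static boolean sameFile(Path a, Path b) throws IOException {
        // Both files must exist unless the two paths are already equal.
        return Files.isSameFile(a, b);
    }
}
```

A File can be converted with toPath(), so existing java.io.File code can call this without further changes.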

On *nix systems, case matters: file is not the same as File or fiLe.

The API doc of equals() says (right after your quote):
On UNIX systems, alphabetic case is
significant in comparing pathnames; on
Microsoft Windows systems it is not.

You can try Runtime.exec() of
ls -i /fullpath/File # extract the inode number.
df /fullpath/File # extract the "Mounted on" field.
If the mount point and the "inode" number are the same, they are the same file whether you have symbolic links or case-insensitive file systems.
Or even
bash test "file1" -ef "file2"
FILE1 and FILE2 have the same device and inode numbers

The traditional Unix way to test whether two filenames refer to the same underlying filesystem object is to stat them and test whether they have the same [dev,ino] pair.
That does assume no redundant mounts, however. If those are allowed, you have to go about it differently.
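In Java 7 and later, something close to the [dev,ino] test is available without native code through BasicFileAttributes.fileKey(). The sketch below assumes both files exist; FileKeys and sameFileKey are illustrative names. The javadoc allows fileKey() to return null when no such identifier is available, so the sketch falls back to Files.isSameFile in that case.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

final class FileKeys {

    /** Compares the file keys of two existing files; falls back to Files.isSameFile when no key is available. */
    static boolean sameFileKey(Path a, Path b) throws IOException {
        Object keyA = Files.readAttributes(a, BasicFileAttributes.class).fileKey();
        Object keyB = Files.readAttributes(b, BasicFileAttributes.class).fileKey();
        if (keyA == null || keyB == null) {
            // fileKey() may return null on platforms or file systems without such an identifier.
            return Files.isSameFile(a, b);
        }
        return keyA.equals(keyB);
    }
}
```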
