I have a Spring Boot app that runs from a WAR file, and I test it in different Operating Systems.
It reads a file from the "resources" directory (this is the path within the WAR: "/WEB-INF/classes/file-name") by performing the following steps:
1. Get the InputStream:
InputStream fileAsInputStream = new ClassPathResource("file-name").getInputStream();
2. Convert the InputStream to ByteArray:
byte[] fileAsByteArray = IOUtils.toByteArray(fileAsInputStream);
The issue is that the content of the resulting byte array differs between Operating Systems, causing inconsistencies in further operations. The reason is that the file contains newline characters ("\n" = 0x0A = 10 on Linux, and "\r\n" = 0x0D 0x0A = 13 10 on Windows). The differences across Operating Systems are explained, for example, in this thread: What are the differences between char literals '\n' and '\r' in Java?
This is an example where the Byte Arrays have different contents:
When app runs on Linux: [114, 115, 97, 10, 69, 110] => 10 is the "\n"
When app runs on Windows: [114, 115, 97, 13, 10, 69, 110] => 13 10 is the "\r\n"
So what is a single 10 on Linux corresponds to 13 10 on Windows. The file itself is always the same (it is a private key used for establishing the SFTP communication); only the Operating System from which the code is being run differs.
Is there a solution to obtain the same Byte Array, regardless of the Operating System from which the code is being run?
One working workaround would be to map the newline characters to a single consistent representation: iterate through the array and, for example, replace each 10 with 13 10 (or, going the other way, collapse each 13 10 to 10), as sketched below.
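A minimal sketch of that normalization, going in the CR LF to LF direction (the helper class and method names are made up for illustration):

import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of the original code: collapses every
// CR LF pair (13 10) into a single LF (10), so the resulting byte array
// is identical no matter which line endings the packaged file carries.
public class LineEndingNormalizer {

    public static byte[] normalize(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        for (int i = 0; i < raw.length; i++) {
            // Drop a CR only when it is immediately followed by an LF.
            if (raw[i] == 13 && i + 1 < raw.length && raw[i + 1] == 10) {
                continue;
            }
            out.write(raw[i]);
        }
        return out.toByteArray();
    }
}

It would be applied right after step 2, e.g. byte[] normalized = LineEndingNormalizer.normalize(fileAsByteArray);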
Another solution that was tried is using StandardCharsets.UTF_8, but the elements of the arrays are still different:
IOUtils.toByteArray(new BufferedReader(new InputStreamReader(new ClassPathResource("file-name").getInputStream())), StandardCharsets.UTF_8)
I cannot understand how the System.in.read() method works.
There is such a code:
public static void main(String[] args) throws IOException {
    while (true) {
        Integer x = System.in.read();
        System.out.println(Integer.toString(x, 2));
    }
}
I know that the System.in.read() method reads from the input stream ONE BYTE at a time.
So when I enter 'A' (U+0041, one byte is used to store the char), the program output is:
1000001 (U+0041)
1010 (NL) - it works as expected.
But when I enter 'Я' (U+042F, two bytes are used to store the char), the output is:
11010000 (byte1)
10101111 (byte2)
1010 (byte3 - NL)
The real code for the letter 'Я' (U+042F) is 10000101111.
Why is 11010000 10101111 (byte1 + byte2) not the binary code for the letter 'Я' (U+042F)?
This will depend on the external process that is sending data to System.in. It could be a command shell, an IDE, or another process.
In the typical case of a command shell, the shell will have a character encoding configured. (chcp on Windows, locale charmap on Linux.)
The character encoding determines how a graphical character, or glyph, is coded as a number. For example, a Windows machine might use the "Windows-1251" code page and encode "Я" as one byte (0xDF). Or it could use UTF-8 and encode "Я" as two bytes (0xD0 0xAF), or UTF-16 and use two different bytes (0x04 0x2F).
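As a quick illustration (a sketch, not from the original answer; it assumes the JRE ships the windows-1251 charset, which standard Oracle/OpenJDK runtimes normally do):

import java.nio.charset.Charset;
import java.util.Arrays;

public class EncodeYa {
    public static void main(String[] args) {
        String ya = "\u042F"; // 'Я'
        // Expected output (signed Java bytes):
        // windows-1251 -> [-33]       i.e. 0xDF
        // UTF-8        -> [-48, -81]  i.e. 0xD0 0xAF
        // UTF-16BE     -> [4, 47]     i.e. 0x04 0x2F
        for (String cs : new String[] {"windows-1251", "UTF-8", "UTF-16BE"}) {
            System.out.println(cs + " -> " + Arrays.toString(ya.getBytes(Charset.forName(cs))));
        }
    }
}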
Your results show that the process sending data to your Java program is using UTF-8 as the encoding: the payload bits of 11010000 10101111 (the x bits in the 110xxxxx 10xxxxxx pattern) concatenate to 10000101111, which is exactly U+042F.
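If the goal is to see whole characters rather than raw bytes, the stream can be wrapped in a reader with an explicit charset. A small sketch, assuming the console really does send UTF-8:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadChars {
    public static void main(String[] args) throws IOException {
        // Decode the incoming bytes as UTF-8 so each read() returns a full
        // character (UTF-16 code unit) instead of a single raw byte.
        Reader in = new InputStreamReader(System.in, StandardCharsets.UTF_8);
        int c;
        while ((c = in.read()) != -1) {
            // For 'Я' this prints 10000101111, i.e. U+042F.
            System.out.println(Integer.toString(c, 2));
        }
    }
}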
We are implementing a feature to support non-printable UTF-8 characters in our database. Our system stores them in the database and retrieves them. We collect input in the form of Base64, convert it into a byte array and store it in the database. During retrieval, the database gives us the byte array and we convert it to Base64 again.
During the retrieval process (after the database gives us the byte array), all the attributes are converted to string arrays, later converted back to byte arrays, and finally converted to Base64 again to give back to the user.
The below piece of code compiles and works properly on our Windows JDK (Java 8). But when it is run in the SuSe Linux environment, we see strange characters.
public class Tewst {
    public static void main(String[] args) {
        byte[] attributeValues;
        String utfString;
        attributeValues = new byte[]{-86, -70, -54, -38, -6};
        if (attributeValues != null) {
            utfString = new String(attributeValues);
            System.out.println("The string is " + utfString);
        }
    }
}
The output given is
"The string is ªºÊÚú"
Now when the same file is run on SuSe Linux distribution, it gives me:
"The string is �����"
We are using Java 8 on both Windows and Linux. What is the problem that it doesn't execute properly on Linux?
We have also tried utfString = new String(attributeValues, "UTF-8");. It didn't help in any way. What are we missing?
The characters ªºÊÚú are Unicode 00AA 00BA 00CA 00DA 00FA.
In character set ISO-8859-1, that is bytes AA BA CA DA FA.
In decimal, that would be {-86, -70, -54, -38, -6}, as you have in your code.
So, your string is encoded in ISO-8859-1, not UTF-8, which is also why it doesn't work on Linux: the default charset there is UTF-8, while on your Windows machine it is ISO-8859-1 (or the closely related windows-1252).
Never use new String(byte[]), unless you're absolutely sure you want the default character set of the JVM, whatever that might be.
Change code to new String(attributeValues, StandardCharsets.ISO_8859_1).
And of course, in the reverse operation, use str.getBytes(StandardCharsets.ISO_8859_1).
Then it should work consistently across platforms, since the code is no longer using platform defaults.
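A minimal sketch of the whole round trip with the charset pinned down (class and variable names are illustrative, not from the original code):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        byte[] attributeValues = {-86, -70, -54, -38, -6};

        // Decode and re-encode with an explicit charset; never rely on the
        // JVM default, which differs between the Windows and Linux boxes.
        String asText = new String(attributeValues, StandardCharsets.ISO_8859_1);
        byte[] roundTripped = asText.getBytes(StandardCharsets.ISO_8859_1);

        // The String content is now the same on every platform, and the
        // original bytes survive the round trip unchanged.
        System.out.println("Same bytes: " + Arrays.equals(attributeValues, roundTripped));
        System.out.println("Base64    : " + Base64.getEncoder().encodeToString(roundTripped));
    }
}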
I have a Java program that is almost working perfectly. I'm developing on a Mac and pushing to Linux for production. When the Mac searches the file system and inserts new file names into the database it works great. However, when I push to the Linux box and do the search/insert, it sees some characters as different, e.g. Béla Fleck. They look identical to me in the database and on the Mac AND Linux file systems. In fact, the Mac and Linux boxes have NFS mounts to a third system (Linux) where the files reside.
I've dumped the bytes and can see how Linux and the Mac see the string from the file system: Béla Fleck.
linux:
utf8bytes[0] = 0x42
utf8bytes[1] = 0x65
utf8bytes[2] = 0xcc
utf8bytes[3] = 0x81
utf8bytes[4] = 0x6c
utf8bytes[5] = 0x61
utf8bytes[6] = 0x20
utf8bytes[7] = 0x46
utf8bytes[8] = 0x6c
utf8bytes[9] = 0x65
utf8bytes[10] = 0x63
utf8bytes[11] = 0x6b
linux says LANG=en_US.UTF-8
mac:
utf8Bytes[0] = 0x42
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xa9
utf8Bytes[3] = 0x6c
utf8Bytes[4] = 0x61
utf8Bytes[5] = 0x20
utf8Bytes[6] = 0x46
utf8Bytes[7] = 0x6c
utf8Bytes[8] = 0x65
utf8Bytes[9] = 0x63
utf8Bytes[10] = 0x6b
mac says LANG=en_US.UTF-8
I tried this, still no joy:
java -Dfile.encoding=UTF-8
I'm using java nio file to get the directory:
java.nio.file.Path path = Paths.get("test");
then walking the path with
Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
and then, since this is a subdir in the test path:
file.getParent().getName(1).toString()
Anyone have any ideas on what is glitching here and how I can fix this?
Thanks.
Some searching revealed that OS X always decomposes file names:
https://apple.stackexchange.com/a/84038
https://stackoverflow.com/a/6153713/1831987
This suggests to me that you may have accidentally switched the outputs: the first byte array is decomposed, so I’m guessing it was taken from a Mac, whereas the second one is from Linux.
In any event, if you want them to be identical for all systems, you can do the decomposition yourself:
String name = file.getParent().getName(1).toString();
name = Normalizer.normalize(name, Normalizer.Form.NFD);
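A small self-contained check of what that buys you, using the name from the question (the composed and decomposed literals are written with Unicode escapes so the difference stays visible):

import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed   = "B\u00e9la Fleck";   // é as a single code point (NFC)
        String decomposed = "Be\u0301la Fleck";  // e + combining acute accent (NFD)

        System.out.println(composed.equals(decomposed));                         // false
        System.out.println(composed.getBytes(StandardCharsets.UTF_8).length);    // 11 bytes, like one dump above
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length);  // 12 bytes, like the other dump

        // Normalizing both to the same form makes them compare equal,
        // so do this before storing in or querying the database.
        String a = Normalizer.normalize(composed, Normalizer.Form.NFD);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFD);
        System.out.println(a.equals(b));                                         // true
    }
}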
(Not really an answer, just more discussion.)
Those appear to be UTF-8 byte sequences, but formed in different ways.
c3a9 is é -- this is normally how one would enter an accented letter.
However, it is possible to use a pair of code points:
65cc91 is ȇ, formed as the combination of e and a COMBINING INVERTED BREVE, whereas c3aa is the single precomposed character ê.
Some COLLATIONs can compensate for the differences, but it is up to the application to combine them at key-stroke time.
SELECT CAST(UNHEX('65cc91') AS CHAR) =
CAST(UNHEX('c3aa') AS CHAR) COLLATE utf8_unicode_520_ci; --> 1
We have code for creating folders from within the database (a Java / PL/SQL combination).
CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "rtp_java_utils" AS
import java.io.*;

public class rtp_java_utils extends Object
{
    public static int mkdir (String path) {
        File myFile = new File (path);
        if (myFile.mkdir()) return 1; else return 0;
    }
}

CREATE OR REPLACE FUNCTION mkdir (p_path IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA
NAME 'rtp_java_utils.mkdir (java.lang.String) return java.lang.int';
Recently we started using an Oracle 12c database and now we are having problems with folders containing special characters in their names (for example "š", "č" or "đ"). Folders are created but without the special characters; instead some character combinations are shown. For example, with the parameter "d:/test/testing š č đ" the created folder is "test Å¡ ÄŤ Ä‘". So 2 characters are used for every special character.
Database is version 12C (12.1.0.2.0) with NLS_CHARACTERSET AL32UTF8.
File system is NTFS with UTF-8 encoding. OS is Windows server 2012.
Some Java system properties are:
- file.encoding UTF-8
- sun.io.unicode.encoding UnicodeLittle
Interestingly, we get the same result (the same wrong folder name) regardless of whether the NTFS file system encoding is UTF-8 or WINDOWS-1250. We tried both options.
Our guess is that Java is doing some implicit conversion of the given folder name from one character set to another, and the final result is a wrong folder name. But everything we tried (explicit folder name conversions in PL/SQL or Java, system parameter changes in Java...) hasn't worked.
On our previous database everything was working great (database version 10.1.0.5.0 with the EE8MSWIN1250 character set, NTFS file system with windows-1250 encoding).
Does anyone have an idea what is wrong?
Thank You!
Somewhere in your (not posted) code the bytes of the UTF-8 string are used to construct a string with ISO-8859-1 encoding.
See the following snippet:
String s = "test š č đ";
System.out.println("UTF8 : "
+ Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
System.out.println("ISO8895: "
+ Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1)));
Files.createDirectories(Paths.get("c:/temp/" + s));
byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
Files.createDirectories(Paths.get("c:/temp/"
+ new String(utf8Bytes, StandardCharsets.ISO_8859_1)));
}
output
UTF8   : [116, 101, 115, 116, 32, -59, -95, 32, -60, -115, 32, -60, -111]
ISO8859: [116, 101, 115, 116, 32, 63, 32, 63, 32, 63]
and this will create in c:/temp the directories
test š č đ
test Å¡ Ä Ä
Try to print the bytes of the string you pass to your stored function.
I have a file which is encoded as ISO-8859-1, and contains characters such as ô.
I am reading this file with Java code, something like:
File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }
    String s = new String(buffer, 0, byteCount, "ISO-8859-1");
    System.out.println(s);
}
However the ô character is always garbled, usually printing as a ?.
I have read around the subject (and learnt a little on the way) e.g.
http://www.joelonsoftware.com/articles/Unicode.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
http://www.ingrid.org/java/i18n/utf-16/
but still cannot get this working.
Interestingly this works on my local PC (XP) but not on my Linux box.
I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:
System.out.println(java.nio.charset.Charset.availableCharsets());
I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.
I recommend that to check for the first, you examine the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with
System.out.println((int) s.charAt(index));
In both cases the result should be 244 decimal; 0xf4 hex.
See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).
In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.
EDIT: Here's a really easy way to prove whether or not the console will work:
System.out.println("Here's the character: \u00f4");
Parsing the file as fixed-size blocks of bytes is not good: what if some character has a byte representation that straddles two blocks? Use an InputStreamReader with the appropriate character encoding instead:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myfile.csv"), "ISO-8859-1"));

char[] buffer = new char[4096]; // character (not byte) buffer
while (true) {
    int charCount = br.read(buffer, 0, buffer.length);
    if (charCount == -1) break; // reached end-of-stream
    String s = String.valueOf(buffer, 0, charCount);
    // alternatively, we can append to a StringBuilder
    System.out.println(s);
}
Btw, remember to check that the Unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.
As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).
Consider this code:
public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }
    // write default charset
    System.out.println(Charset.defaultCharset());
    // dump bytes to stdout
    System.out.write(data);
    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}
By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:
UTF-8
?ô
If I switch the terminal's encoding to ISO 8859-1, this is printed:
UTF-8
ôÃ´
In both cases, the same bytes are being emitted by the Java program:
5554 462d 380a f4c3 b40a
The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
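(A sketch, not part of the original answer: if the string itself is correct and only the console rendering is off, one workaround is to make System.out encode in whatever charset the terminal actually uses; ISO-8859-1 below is just an example value.)

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace System.out with a PrintStream that encodes using the
        // terminal's charset instead of the JVM default.
        PrintStream out = new PrintStream(
                new FileOutputStream(FileDescriptor.out), true, "ISO-8859-1");
        System.setOut(out);
        System.out.println("Here's the character: \u00f4");
    }
}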
If you can, try to run your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but the output is garbled after the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks is the encoding of your output and the character encoding of your terminal/console on Linux.
Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.
Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
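A compact version of the "read the complete data first, decode once" approach mentioned above, assuming the file is small enough to hold in memory and a Java 7+ JDK:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWholeFile {
    public static void main(String[] args) throws IOException {
        // Read every byte first, then decode once, so no character can be
        // split across buffer boundaries.
        byte[] raw = Files.readAllBytes(Paths.get("myfile.csv"));
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(s);
    }
}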