I have a Java program that is almost working perfectly. I'm developing on a Mac and pushing to Linux for production. When the Mac searches the file system and inserts new file names into the database, it works great. However, when I push to the Linux box and do the same search/insert, it sees file names containing certain characters as different, e.g. Béla Fleck. They look identical to me in the database and on both the Mac and Linux file systems. In fact, the Mac and Linux boxes have NFS mounts to a third system (Linux) where the files actually reside.
I've dumped the bytes and can see how Linux and the Mac each see the string Béla Fleck as read from the file system.
Linux:
utf8bytes[0] = 0x42
utf8bytes[1] = 0x65
utf8bytes[2] = 0xcc
utf8bytes[3] = 0x81
utf8bytes[4] = 0x6c
utf8bytes[5] = 0x61
utf8bytes[6] = 0x20
utf8bytes[7] = 0x46
utf8bytes[8] = 0x6c
utf8bytes[9] = 0x65
utf8bytes[10] = 0x63
utf8bytes[11] = 0x6b
Linux says LANG=en_US.UTF-8
Mac:
utf8Bytes[0] = 0x42
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xa9
utf8Bytes[3] = 0x6c
utf8Bytes[4] = 0x61
utf8Bytes[5] = 0x20
utf8Bytes[6] = 0x46
utf8Bytes[7] = 0x6c
utf8Bytes[8] = 0x65
utf8Bytes[9] = 0x63
utf8Bytes[10] = 0x6b
Mac says LANG=en_US.UTF-8
I tried this, still no joy:
java -Dfile.encoding=UTF-8
I'm using java.nio.file to get the directory:
java.nio.file.Path path = Paths.get("test");
then walking the path with:
Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
and then, since this is a subdirectory of the test path:
file.getParent().getName(1).toString()
Anyone have any ideas on what is glitching here and how I can fix this?
Thanks.
Some searching revealed that OS X always decomposes file names:
https://apple.stackexchange.com/a/84038
https://stackoverflow.com/a/6153713/1831987
This suggests to me that you may have accidentally switched the outputs: the first byte array is decomposed, so I’m guessing it was taken from a Mac, whereas the second one is from Linux.
In any event, if you want them to be identical for all systems, you can do the decomposition yourself:
String name = file.getParent().getName(1).toString();
name = Normalizer.normalize(name, Normalizer.Form.NFD);
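For example, here's a minimal self-contained sketch (the names are hard-coded stand-ins for what the file walk returns) showing that the two forms only compare equal after normalization:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String decomposed = "Be\u0301la Fleck";   // 'e' + U+0301 COMBINING ACUTE ACCENT
        String precomposed = "B\u00E9la Fleck";   // precomposed 'é' (U+00E9)
        System.out.println(decomposed.equals(precomposed));  // false
        String a = Normalizer.normalize(decomposed, Normalizer.Form.NFD);
        String b = Normalizer.normalize(precomposed, Normalizer.Form.NFD);
        System.out.println(a.equals(b));                     // true: both decomposed now
    }
}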
(Not really an answer, just more discussion.)
Those appear to be UTF-8 characters, but formed in different ways.
C3 A9 is é -- this is the normal, precomposed way to encode an accented letter.
However, it is possible to use a pair of characters instead:
65 CC 91 is ȇ, formed as a combination of e and a COMBINING INVERTED BREVE; C3 AA is the single precomposed character ê.
Some COLLATIONs can compensate for the differences, but it is up to the application to combine them at key-stroke time.
SELECT CAST(UNHEX('65cc91') AS CHAR) =
CAST(UNHEX('c3aa') AS CHAR) COLLATE utf8_unicode_520_ci; --> 1
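Something similar can be reproduced on the Java side with java.text.Collator, which at PRIMARY strength ignores accent differences much like the accent-insensitive collation above. A sketch, using the byte sequences from this discussion:

import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        // 0x65 0xCC 0x91: 'e' + COMBINING INVERTED BREVE -> ȇ (decomposed)
        String inverted = new String(new byte[] {0x65, (byte) 0xCC, (byte) 0x91},
                                     StandardCharsets.UTF_8);
        // 0xC3 0xAA: the single precomposed character ê
        String circumflex = new String(new byte[] {(byte) 0xC3, (byte) 0xAA},
                                       StandardCharsets.UTF_8);
        System.out.println(inverted.equals(circumflex));  // false: different characters
        Collator collator = Collator.getInstance(Locale.US);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(Collator.PRIMARY);           // accents are secondary differences
        System.out.println(collator.compare(inverted, circumflex) == 0);  // true
    }
}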
I have a small piece of code in which I am checking the code point for the character Ü.
Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));
I am getting different value for codepoint when I run this code on MacOS x and Windows 10, see the output below.
Output on MacOS
en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220
Output on Windows
en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195
I checked the code page for windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set; there the code point for Ü is 220.
For String glyph = "Ü";, why do I get code point 195 on Windows? As I understand it, glyph should have been rendered properly and the code point should have been 220, since it is defined in Windows-1252.
If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then glyph is rendered correctly and codepoint value is 220.
Is this the correct and efficient way to standardize behavior of String on any OS irrespective of locale and charset?
195 is 0xC3 in hex.
In UTF-8, Ü is encoded as bytes 0xC3 0x9C.
System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but clearly your Java file is actually encoded in UTF-8. The fact that println() is outputting glyph ?? (note 2 ?, meaning 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.
glyph should contain a single char whose value is 0x00DC, not 2 chars whose values are 0x00C3 0x009C. codePointAt(0) is returning 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being decoded as if it were encoded in Windows-1252 instead, so the 2 bytes 0xC3 0x9C get decoded as characters 0x00C3 0x009C instead of as character 0x00DC.
You need to specify the actual file encoding when running Java, e.g.:
java -Dfile.encoding=UTF-8 ...
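Note that if the string literal was already mis-decoded when the .java file was compiled, the runtime flag alone cannot repair it; the compiler needs the same hint (the file name here is assumed):
javac -encoding UTF-8 YourClass.java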
We are implementing a feature to support non-printable UTF-8 characters in our database. Our system stores them in the database and retrieves them. We collect input in the form of Base64, convert it into a byte array, and store it in the database. During retrieval, the database gives us the byte array and we convert it to Base64 again.
During the retrieval process (after the database gives us the byte array), all the attributes are converted to strings, then converted back to byte arrays, and finally to Base64 again to give back to the user.
The piece of code below compiles and works properly with our Windows JDK (Java 8). But when it runs in the SuSE Linux environment, we see strange characters.
public class Tewst {
    public static void main(String[] args) {
        byte[] attributeValues;
        String utfString;
        attributeValues = new byte[]{-86, -70, -54, -38, -6};
        if (attributeValues != null) {
            utfString = new String(attributeValues);
            System.out.println("The string is " + utfString);
        }
    }
}
The output given is
"The string is ªºÊÚú"
Now when the same file is run on the SuSE Linux distribution, it gives me:
"The string is �����"
We are using Java 8 on both Windows and Linux. Why doesn't it execute properly on Linux?
We have also tried utfString = new String(attributeValues, "UTF-8");. It didn't help in any way. What are we missing?
The characters ªºÊÚú are Unicode 00AA 00BA 00CA 00DA 00FA.
In character set ISO-8859-1, that is bytes AA BA CA DA FA.
In decimal, that would be {-86, -70, -54, -38, -6}, as you have in your code.
So, your string is encoded in ISO-8859-1, not UTF-8. That is also why it doesn't work on Linux: there the JVM's default charset is UTF-8, while on your Windows box it is windows-1252 (which agrees with ISO-8859-1 for these byte values).
Never use new String(byte[]), unless you're absolutely sure you want the default character set of the JVM, whatever that might be.
Change code to new String(attributeValues, StandardCharsets.ISO_8859_1).
And of course, in the reverse operation, use str.getBytes(StandardCharsets.ISO_8859_1).
Then it should work consistently across platforms, since the code no longer relies on platform defaults.
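A minimal sketch of the fixed round trip (the class name is made up; the byte values are the ones from the question):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] attributeValues = {-86, -70, -54, -38, -6};
        // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the
        // same value, so bytes -> String -> bytes is lossless with it
        String s = new String(attributeValues, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(attributeValues, back));      // true
        System.out.println(Base64.getEncoder().encodeToString(back));  // qrrK2vo=
    }
}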
This post gives a solution to retrieve the list of running processes under Windows. In essence it does:
String cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
Process p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream());
BufferedReader input = new BufferedReader(isr);
then reads the input.
It looks and works great, but I was wondering: is there a possibility that the charset used by tasklist might not be the default charset, and that this call could fail?
For example this other question about a different executable shows that it could cause some issues.
If that is the case, is there a way to determine what the appropriate charset would be?
We can break this into two parts:
The Windows part
From Java you're executing a Windows command, externally to the JVM in "Windows land". When the Java Runtime class executes a Windows command, it uses the console DLL, so to Windows it appears as if the command is running in a console.
Q: When I run C:\windows\system32\tasklist.exe in a console, what is the character encoding ("code page" in windows terminology) of the result?
windows "chcp" command with no argument gives the active code page number for the console (e.g. 850 for Multilingual-Latin-1, 1252 for Latin-1). See Windows Microsoft Code Pages, Windows OEM Code Pages, Windows ISO Code Pages
The default system code page is originally setup according to your system locale (type systeminfo to see this or Control Panel-> Region and Language).
the windows OS/.NET function getACP() also gives this info
The Java part:
How do I decode a Java byte stream from the Windows code page "x" (e.g. 850 or 1252)?
The full mapping between Windows code page numbers and equivalent Java charset names can be derived from Code Page Identifiers (Windows).
However, in practice, adding one of the following prefixes achieves the mapping:
"" (none) for ISO, "IBM" or "x-IBM" for OEM, "windows-" or "x-windows-" for Microsoft/Windows.
E.g. ISO-8859-1 or IBM850 or windows-1252.
Full Solution:
// Requires java.io.*, java.nio.charset.Charset and java.util.Scanner,
// inside a method that declares throws IOException.
String cmd = System.getenv("windir") + "\\system32\\" + "chcp.com";
Process p = Runtime.getRuntime().exec(cmd);
// Use the default charset here - we only want the digits, which are plain
// ASCII in any code page; skip the text preceding ":"
String windowsCodePage = new Scanner(
        new InputStreamReader(p.getInputStream())).skip(".*:").next();
Charset charset = null;
String[] charsetPrefixes =
        new String[] {"", "windows-", "x-windows-", "IBM", "x-IBM"};
for (String charsetPrefix : charsetPrefixes) {
    try {
        charset = Charset.forName(charsetPrefix + windowsCodePage);
        break;
    } catch (Throwable t) {
        // unknown charset name - try the next prefix
    }
}
// If no match was found, fall back to the default charset
if (charset == null) charset = Charset.defaultCharset();
cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream(), charset);
BufferedReader input = new BufferedReader(isr);
// Debugging output
System.out.println("matched codepage " + windowsCodePage + " to charset name:" +
        charset.name() + " displayName:" + charset.displayName());
String line;
while ((line = input.readLine()) != null) {
    System.out.println(line);
}
Thanks for the Q! - was fun.
Actually, the charset used by tasklist is generally not the system default: console programs write in the OEM code page, while the JVM's default charset follows the ANSI code page.
On the other hand, it's quite safe to use the default as long as the output is limited to ASCII. Usually executable modules have only ASCII characters in their names.
So to get correct Strings, you have to map the (ANSI) Windows code page to the corresponding OEM code page, and pass the latter as the charset to InputStreamReader.
It seems there's no comprehensive mapping between these encodings, but the following mapping can be used:
Map<String, String> ansi2oem = new HashMap<String, String>();
ansi2oem.put("windows-1250", "IBM852");
ansi2oem.put("windows-1251", "IBM866");
ansi2oem.put("windows-1252", "IBM850");
ansi2oem.put("windows-1253", "IBM869");
Charset charset = Charset.defaultCharset();
String streamCharset = ansi2oem.get(charset.name());
if (streamCharset == null) {
    // no OEM counterpart known - fall back to the default charset's name
    streamCharset = charset.name();
}
InputStreamReader isr = new InputStreamReader(p.getInputStream(),
                                              streamCharset);
This approach worked for me with windows-1251 and IBM866 pair.
To get the current OEM encoding used by Windows, you can use the GetOEMCP function. The return value depends on the Language for non-Unicode programs setting on the Administrative tab of the Region and Language control panel. A reboot is required to apply a change.
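If you would rather ask Windows directly from Java instead of shelling out, GetOEMCP (and its ANSI counterpart GetACP) can be reached through a small hand-written JNA binding. A sketch, assuming JNA is on the classpath; MyKernel32 is my own name for the interface:

import com.sun.jna.Native;
import com.sun.jna.win32.StdCallLibrary;

public class OemCodePage {
    // Minimal hand-written binding; kernel32 uses the stdcall convention,
    // and GetOEMCP/GetACP take no arguments and return the code page number.
    public interface MyKernel32 extends StdCallLibrary {
        MyKernel32 INSTANCE = (MyKernel32) Native.loadLibrary("kernel32", MyKernel32.class);
        int GetOEMCP();
        int GetACP();
    }

    public static void main(String[] args) {
        System.out.println("OEM code page:  " + MyKernel32.INSTANCE.GetOEMCP());
        System.out.println("ANSI code page: " + MyKernel32.INSTANCE.GetACP());
    }
}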
There are two kinds of encodings on Windows: ANSI and OEM.
The former is used by non-Unicode applications running in GUI mode.
The latter is used by Console applications. Console applications cannot display characters that cannot be represented in the current OEM encoding.
Since tasklist is console mode application, its output is always in the current OEM encoding.
For English systems, the pair is usually Windows-1252 and CP850.
As I am in Russia, my system has the following encodings: Windows-1251 and CP866.
If I capture output of tasklist into a file, the file can't display Cyrillic characters correctly:
I get ЏаЁўҐв instead of Привет (Hi!) when viewed in Notepad.
And µTorrent is displayed as зTorrent.
You cannot change the encoding used by tasklist.
However it's possible to change the output encoding of cmd. If you pass /u switch to it, it will output everything in UTF-16 encoding.
cmd /c echo Hi>echo.txt
The size of echo.txt is 4 bytes: two bytes for Hi and two bytes for new line (\r and \n).
cmd /u /c echo Hi>echo.txt
Now the size of echo.txt is 8 bytes: each character is represented with two bytes.
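Note that /u only changes the output of cmd's built-in commands (echo, dir, ...); an external program like tasklist still writes in its own encoding. So, for built-ins only, you can decode the stream as UTF-16LE from Java. A minimal sketch:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CmdUnicodeDemo {
    public static void main(String[] args) throws Exception {
        // /u makes cmd's internal commands write UTF-16LE to the pipe
        Process p = Runtime.getRuntime().exec(new String[] {"cmd", "/u", "/c", "echo", "Hi"});
        BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_16LE));
        for (String line; (line = r.readLine()) != null; ) {
            System.out.println(line);
        }
    }
}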
Why not use the Windows API via JNA, instead of spawning processes? Like this:
import com.sun.jna.platform.win32.Kernel32;
import com.sun.jna.platform.win32.Tlhelp32;
import com.sun.jna.platform.win32.WinDef;
import com.sun.jna.platform.win32.WinNT;
import com.sun.jna.win32.W32APIOptions;
import com.sun.jna.Native;
public class ListProcesses {
public static void main(String[] args) {
Kernel32 kernel32 = (Kernel32) Native.loadLibrary(Kernel32.class, W32APIOptions.UNICODE_OPTIONS);
Tlhelp32.PROCESSENTRY32.ByReference processEntry = new Tlhelp32.PROCESSENTRY32.ByReference();
WinNT.HANDLE snapshot = kernel32.CreateToolhelp32Snapshot(Tlhelp32.TH32CS_SNAPPROCESS, new WinDef.DWORD(0));
try {
while (kernel32.Process32Next(snapshot, processEntry)) {
System.out.println(processEntry.th32ProcessID + "\t" + Native.toString(processEntry.szExeFile));
}
}
finally {
kernel32.CloseHandle(snapshot);
}
}
}
I posted a similar answer elsewhere.
There is a much better way to check the running processes, or even to run an OS command from Java: Process and ProcessBuilder.
As for the Charset, you can always query the OS for the supported charsets, and obtain an Encoder or Decoder according to your needs.
[Edit]
Let's break it down: there's no way of knowing in advance which encoding the bytes of a given output are in, so your only choice is to get those bytes, fix the byte ordering as necessary (if you're ever in an environment where a process can hand you an array of bytes in a different ordering, use ByteBuffer to deal with that), and try the multiple CharsetDecoders supported until one decodes the bytes to reasonable output.
It is overkill, and it requires you to guess that a given output could be in UTF-8, UTF-16 or some other encoding. But at least you can decode the given output using one of the possible Charsets, and then try to use the processed output for your needs.
Since we're talking about a process run by the same OS in which the JVM itself is running, it is quite possible that your output will be in one of the Charset encodings returned by the availableCharsets() method.
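For instance, listing everything the JVM can decode is nearly a one-liner (trivial sketch):

import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // availableCharsets() returns a sorted map of canonical name -> Charset
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}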
I used the following code to get the path
Path errorFilePath = FileSystems.getDefault().getPath(errorFile);
When I try to move a file using the File NIO, I get the error below:
java.nio.file.InvalidPathException: Illegal char <:> at index 2: \C:\Sample\sample.txt
I also tried using URL.encode(errorFile) which results in the same error.
You need to convert the resource you found to a URI. That works on all platforms and protects you from possible errors with paths. You shouldn't have to worry about what the full path looks like, or whether it starts with '\' or some other symbol; if you find yourself thinking about such details, something is wrong.
ClassLoader classloader = Thread.currentThread().getContextClassLoader();
String platformIndependentPath = Paths.get(classloader.getResource(errorFile).toURI()).toString();
The path \C:\Sample\sample.txt must not have a leading \. It should be just C:\Sample\sample.txt
To make it work on both Windows and Linux/OS X, consider doing this:
String osAppropriatePath = System.getProperty( "os.name" ).contains( "indow" ) ? filePath.substring(1) : filePath;
If you want to worry about performance I'd store System.getProperty( "os.name" ).contains( "indow" ) as a constant like
private static final boolean IS_WINDOWS = System.getProperty( "os.name" ).contains( "indow" );
and then use:
String osAppropriatePath = IS_WINDOWS ? filePath.substring(1) : filePath;
To be sure to get the right path on Windows or Linux on any drive letter, you could do something like this:
path = path.replaceFirst("^/(.:/)", "$1");
That says: If the beginning of the string is a slash, then a character, then a colon and another slash, replace it with the character, the colon, and the slash (leaving the leading slash off).
If you're on Linux, you shouldn't end up with a colon in your path, and there won't be a match. If you are on Windows, this should work for any drive letter.
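A quick check of that regex (the paths are made up):

public class PathFix {
    public static void main(String[] args) {
        // "^/(.:/)" matches a leading slash followed by "<drive>:/",
        // and "$1" keeps only the "<drive>:/" part
        System.out.println("/C:/Sample/sample.txt".replaceFirst("^/(.:/)", "$1"));
        // prints C:/Sample/sample.txt
        // A Linux path has no drive-letter colon, so it passes through untouched:
        System.out.println("/home/user/sample.txt".replaceFirst("^/(.:/)", "$1"));
        // prints /home/user/sample.txt
    }
}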
Another way to get rid of the leading separator is to create a new File and convert it to a string:
new File(Platform.getInstallLocation().getURL().getFile()).toString()
Try using C:\\Sample\\sample.txt.
Note the double backslashes: because the backslash is a Java String escape character, you must type two of them to represent a single, "real" backslash.
Or:
Java allows either type of slash to be used on any platform and translates it appropriately. This means that you could type C:/Sample/sample.txt
and it will find the same file on Windows. However, we still have the "root" of the path as a problem.
The easiest solution for dealing with files on multiple platforms is to always use relative path names, like Sample/sample.txt.
Normal Windows Environment
Disclaimer: I haven't tested this in a normal Windows environment.
"\\C:\\" needs to be "C:\\"
final Path errorFilePath = Paths.get(FileSystems.getDefault().getPath(errorFile).toString().replace("\\C:\\","C:\\"));
Linux-Like Windows Environment
My Windows box has a Linux-Like environment so I had to change "/C:/" to be "C:\\".
This code was tested to work on a Linux-Like Windows Environment:
final Path errorFilePath = Paths.get(FileSystems.getDefault().getPath(errorFile).toString().replace("/C:/","C:\\"));
Depending on how you are going to use the Path object, you may be able to avoid using Path at all:
// works with normal files but on a deployed JAR gives "java.nio.file.InvalidPathException: Illegal char <:> "
URL urlIcon = MyGui.class.getResource("myIcon.png");
Path pathIcon = new File(urlIcon.getPath()).toPath();
byte bytesIcon[] = Files.readAllBytes(pathIcon);
// works with normal files and with files inside JAR:
InputStream in = MyGui.class.getClassLoader().getResourceAsStream("myIcon.png");
ByteArrayOutputStream buf = new ByteArrayOutputStream();
byte[] chunk = new byte[8192];
for (int n; (n = in.read(chunk)) != -1; ) buf.write(chunk, 0, n); // read the stream fully
byte[] bytesIcon = buf.toByteArray();
Do you know of a library or some other way in Java to generate a tar archive with file names in the proper Windows national code page (for example Cp1250)?
I tried the Java Tar library; example code:
final TarEntry entry = new TarEntry( files[i] );
String filename = files[i].getPath().replaceAll( baseDir, "" );
entry.setName( new String( filename.getBytes(), "Cp1250" ) );
out.putNextEntry( entry );
...
It doesn't work. National characters are broken when I extract the tar on Windows.
I've also found a strange thing: under Linux, Polish national characters are shown correctly only when I use ISO-8859-1:
entry.setName( new String( filename.getBytes(), "ISO-8859-1" ) );
despite the fact that the proper Polish code page is ISO-8859-2, which doesn't work either.
I've also tried Cp852 for Windows, with no effect.
I know the limitations of tar format, but changing it is not an option.
Thanks for any suggestions.
Officially, TAR doesn't support non-ASCII in headers. However, I was able to use UTF-8 encoded filenames on Linux.
You could try this:
String filename = files[i].getName();
byte[] bytes = filename.getBytes("Cp1250");
entry.setName(new String(bytes, "ISO-8859-1"));
out.putNextEntry( entry );
This at least preserves the Cp1250 bytes in the TAR headers.
tar doesn't allow for non-ASCII values in its headers. If you try a different encoding, the result is probably up to what the target platform decides to do with those byte values. It kind of sounds like your target platform's tar program is interpreting the bytes as ISO-8859-1, which is why that 'works'.
Have a look at extended attributes? http://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5&manpath=FreeBSD+8-current
I am no expert here but this seems to be the only official way to put any non-ASCII values in a tar file header.
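If switching libraries is an option, Apache Commons Compress lets you choose the header encoding explicitly, and can emit PAX extended headers, which carry the full name in UTF-8, for names that don't fit. A sketch, assuming commons-compress is on the classpath; the file name is a made-up Polish example:

import java.io.File;
import java.io.FileOutputStream;
import java.nio.file.Files;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

public class TarWithEncoding {
    public static void main(String[] args) throws Exception {
        File source = new File("Zażółć.txt");
        try (TarArchiveOutputStream out =
                 new TarArchiveOutputStream(new FileOutputStream("out.tar"), "Cp1250")) {
            // For names that cannot be expressed in the chosen encoding,
            // fall back to a PAX extended header (name stored as UTF-8)
            out.setAddPaxHeadersForNonAsciiNames(true);
            TarArchiveEntry entry = new TarArchiveEntry(source, source.getName());
            out.putArchiveEntry(entry);
            Files.copy(source.toPath(), out);
            out.closeArchiveEntry();
        }
    }
}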