Create folder with unicode characters from Oracle database using Java

We have code for creating folders from within the database (a Java / PL/SQL combination).
CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "rtp_java_utils" AS
import java.io.*;

public class rtp_java_utils extends Object
{
    public static int mkdir (String path) {
        File myFile = new File (path);
        if (myFile.mkdir()) return 1; else return 0;
    }
}
CREATE OR REPLACE FUNCTION mkdir (p_path IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA
NAME 'rtp_java_utils.mkdir (java.lang.String) return int';
Recently we started using an Oracle 12c database, and now we are having problems with folders that contain special characters in their names (for example "š", "č" or "đ"). The folders are created, but without the special characters - instead, two-character combinations appear in their place. For example, with the parameter "d:/test/testing š č đ" the created folder is "test š ÄŤ Ä‘". So two characters are used for every special character.
The database is version 12c (12.1.0.2.0) with NLS_CHARACTERSET AL32UTF8.
The file system is NTFS with UTF-8 encoding. The OS is Windows Server 2012.
Some Java system properties are:
- file.encoding UTF-8
- sun.io.unicode.encoding UnicodeLittle
Interestingly, we get the same result (the same wrong folder name) regardless of whether the NTFS file system encoding is UTF-8 or WINDOWS-1250. We tried both options.
Our guess is that Java is doing some implicit conversion of the given folder name from one character set to another, and the final result is a wrong folder name. But nothing we tried (explicit folder name conversions in PL/SQL or Java, system parameter changes in Java...) has worked.
On our previous database everything worked great (database version 10.1.0.5.0 with the EE8MSWIN1250 character set, an NTFS file system with windows-1250 encoding).
Does anyone have an idea, what is wrong?
Thank You!

Somewhere in your (not posted) code, the bytes of the UTF-8 string are used to construct a string with ISO-8859-1 encoding.
See the following snippet:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        String s = "test š č đ";
        System.out.println("UTF8   : "
                + Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("ISO8859: "
                + Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1)));
        // creates the correctly named directory
        Files.createDirectories(Paths.get("c:/temp/" + s));
        // simulates the bug: UTF-8 bytes decoded as ISO-8859-1
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        Files.createDirectories(Paths.get("c:/temp/"
                + new String(utf8Bytes, StandardCharsets.ISO_8859_1)));
    }
}
output
UTF8   : [116, 101, 115, 116, 32, -59, -95, 32, -60, -115, 32, -60, -111]
ISO8859: [116, 101, 115, 116, 32, 63, 32, 63, 32, 63]
and this will create in c:/temp the directories
test š č đ
test Å¡ Ä Ä
Try to print the bytes of the string you pass to your stored function.
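If that is indeed what happens, one possible repair is to reverse the bad decode before the directory is created. A minimal sketch, assuming the name was built from UTF-8 bytes mis-decoded as ISO-8859-1 (the helper name is illustrative):
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encode with the charset that was wrongly used
// for decoding, then decode with the charset the bytes were actually in.
static String fixMojibake(String garbled) {
    byte[] originalUtf8 = garbled.getBytes(StandardCharsets.ISO_8859_1);
    return new String(originalUtf8, StandardCharsets.UTF_8);
}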

Related

Read file as the same Byte Array across operating systems

I have a Spring Boot app that runs from a WAR file, and I test it in different Operating Systems.
It reads a file from the "resources" directory (this is the path within the WAR: "/WEB-INF/classes/file-name") by performing the following steps:
1. Get the InputStream:
InputStream fileAsInputStream = new ClassPathResource("file-name").getInputStream();
2. Convert the InputStream to ByteArray:
byte[] fileAsByteArray = IOUtils.toByteArray(fileAsInputStream);
The issue is that the content of the obtained byte array differs between operating systems, causing inconsistencies in further operations. The reason is that the file contains newline characters ("\n" = 0x0A = 10 on Linux, and "\r\n" = 0x0D 0x0A = 13 10 on Windows). The differences across operating systems are explained, for example, in this thread: What are the differences between char literals '\n' and '\r' in Java?
This is an example where the Byte Arrays have different contents:
When app runs on Linux: [114, 115, 97, 10, 69, 110] => 10 is the "\n"
When app runs on Windows: [114, 115, 97, 13, 10, 69, 110] => 13 10 is the "\r\n"
So between these two OSs, the 10 corresponds to the 13 10. The file is always the same (it is a private key used for establishing the SFTP communication); only the operating system from which the code is run differs.
Is there a solution to obtain the same Byte Array, regardless of the Operating System from which the code is being run?
One working workaround would be to map the newline characters to the same bytes: iterate through the array and replace each 10 (for example) with 13 10, or normalize in the other direction, as sketched below.
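A minimal sketch of that normalization, assuming the file is text (do not apply this to binary content):
import java.nio.charset.StandardCharsets;

// Normalize all line endings to "\n" so the resulting bytes are
// identical regardless of how the file was built or checked out.
static byte[] normalizeLineEndings(byte[] raw) {
    String text = new String(raw, StandardCharsets.UTF_8);
    return text.replace("\r\n", "\n").getBytes(StandardCharsets.UTF_8);
}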
Another solution that was tried is using StandardCharsets.UTF_8, but the elements of the arrays are still different:
IOUtils.toByteArray(new BufferedReader(new InputStreamReader(new ClassPathResource("file-name").getInputStream())), StandardCharsets.UTF_8)

Non Printable characters of UTF-8 - SUSE Linux Java doesn't support

We are implementing a feature to support non-printable UTF-8 characters in our database. Our system stores them in the database and retrieves them. We collect input in the form of Base64, convert it into a byte array, and store it in the database. During retrieval, the database gives us the byte array and we convert it back to Base64.
During the retrieval process (after the DB gives us the byte array), all the attributes are converted to string arrays, later converted back to byte arrays, and finally converted to Base64 again to give back to the user.
The piece of code below compiles and works properly on our Windows JDK (Java 8). But when it is run in the SuSE Linux environment, we see strange characters.
public class Tewst {
    public static void main(String[] args) {
        byte[] attributeValues;
        String utfString;
        attributeValues = new byte[]{-86, -70, -54, -38, -6};
        if (attributeValues != null) {
            utfString = new String(attributeValues);
            System.out.println("The string is " + utfString);
        }
    }
}
The output given is
"The string is ªºÊÚú"
Now when the same file is run on SuSe Linux distribution, it gives me:
"The string is �����"
We are using Java 8 on both Windows and Linux. What is the problem that makes it misbehave on Linux?
We have also tried utfString = new String(attributeValues, "UTF-8");. It didn't help in any way. What are we missing?
The characters ªºÊÚú are Unicode 00AA 00BA 00CA 00DA 00FA.
In character set ISO-8859-1, that is bytes AA BA CA DA FA.
In decimal, that would be {-86, -70, -54, -38, -6}, as you have in your code.
So your string is encoded in ISO-8859-1, not UTF-8, which is also why it doesn't work on Linux: Linux defaults to UTF-8, while your Windows machine defaults to ISO-8859-1.
Never use new String(byte[]), unless you're absolutely sure you want the default character set of the JVM, whatever that might be.
Change code to new String(attributeValues, StandardCharsets.ISO_8859_1).
And of course, in the reverse operation, use str.getBytes(StandardCharsets.ISO_8859_1).
Then it should work consistently on various platforms, since the code is no longer using platform defaults.
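Putting both directions together in a self-contained sketch (class name illustrative):
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        byte[] attributeValues = {-86, -70, -54, -38, -6};
        // Decode with an explicit charset instead of the platform default.
        String utfString = new String(attributeValues, StandardCharsets.ISO_8859_1);
        System.out.println("The string is " + utfString); // ªºÊÚú on any platform
        // Reverse operation: encode with the same explicit charset.
        byte[] roundTripped = utfString.getBytes(StandardCharsets.ISO_8859_1);
        // roundTripped is now byte-for-byte equal to attributeValues.
    }
}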

Java convert string hex values to byte[], recreating this obj-c functionality

I am making an app that communicates with a specific Bluetooth Low Energy device. It requires a specific handshake, and this is all working perfectly in Objective-C for iOS; however, I am having trouble recreating this functionality in Java.
Any thoughts greatly appreciated!
WORKING Objective-C code:
uint8_t bytes[] = {0x04,0x08,0x0F,0x66,0x99,0x41,0x52,0x43,0x55,0xAA};
NSData *data = [NSData dataWithBytes:bytes length:sizeof(bytes)];
[_btDevice writeValue:data forCharacteristic:_dataCommsCharacteristic type:CBCharacteristicWriteWithResponse];
So far, for Android, I have the following as an equivalent:
byte[] handshake = {0x04,0x08,0x0F,0x66,(byte)0x99,0x41,0x52,0x43,0x55,(byte)0xAA};
characteristic.setValue(handshake);
boolean writeStatus = gatt.writeCharacteristic(characteristic);
Log.d(TAG,"Handshake sent: " + writeStatus);
As mentioned, iOS works great, but the equivalent in Java gets no response from the device, leading me to think that the data being sent is wrong or not recognised.
UPDATE
So, after plenty of wrestling with this, I have a little more insight into what is going on (I think!).
As Scary Wombat mentioned below, the maximum value of a byte is 127, so the two values 0x99 and 0xAA in the array are of course out of this range.
The below is where I am at with the values:
byte bytes[] = {0x04,0x08,0x0F,0x66,(byte)0x99,0x41,0x52,0x43,0x55,(byte)0xAA};
Log.d(TAG, Arrays.toString(bytes));
Produces
[4, 8, 15, 102, -103, 65, 82, 67, 85, -86]
However the expected values need to be
[4, 8, 15, 102, 153, 65, 82, 67, 85, 170]
I have tried casting these troublesome bytes explicitly and have also tried the below:
byte bytes[] = {0x04,0x08,0x0F,0x66,(byte)(0x99 & 0xFF),0x41,0x52,0x43,0x55,(byte)(0xAA & 0xFF)};
However the resulting values in the array are always the same.
Please help!!! :)
UPDATE 2
After a day of digging, it appears that although the values log incorrectly, the values perceived by the Bluetooth device SHOULD still be correct, so I have modified this question and am continuing over here.
Why are you not doing it the same way as for C?
In this code
String handshakeString = "0x04,0x08,0x0F,0x66,0x99,0x41,0x52,0x43,0x55,0xAA";
byte[] value = handshakeString.getBytes();
this is just making a text String where the first char is '0' and the second is 'x', etc.
try
byte arr[] = {0x04,0x08,0x0F,0x66,0x99,0x41,0x52,0x43,0x55,0xAA};
edit
You may need to reconsider values such as 0x99, since in Java byte values are, per the Javadocs:
It has a minimum value of -128 and a maximum value of 127 (inclusive).
See Can we make unsigned byte in Java
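To see that the negative numbers in the log are only Java's signed rendering of the same bits, mask each byte with 0xFF; a minimal sketch (class name illustrative):
import java.util.Arrays;

public class UnsignedView {
    public static void main(String[] args) {
        byte[] bytes = {0x04, 0x08, 0x0F, 0x66, (byte) 0x99,
                        0x41, 0x52, 0x43, 0x55, (byte) 0xAA};
        // Signed view, as Arrays.toString prints it: ..., -103, ..., -86
        System.out.println(Arrays.toString(bytes));
        // Unsigned view: mask with 0xFF to widen without sign extension
        int[] unsigned = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            unsigned[i] = bytes[i] & 0xFF;
        }
        System.out.println(Arrays.toString(unsigned)); // ..., 153, ..., 170
    }
}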
String handshakeString = "0x04,0x08,0x0F,0x66,0x99,0x41,0x52,0x43,0x55,0xAA";
byte[] value = handshakeString.getBytes();
will also include the commas, and so creates too many bytes and will not yield the same bytes as in your Objective-C code.
Try to use a byte[] directly.
byte[] value = new byte[]{0x04, 0x08, 0x0F, 0x66, (byte) 0x99, 0x41, 0x52, 0x43, 0x55, (byte) 0xAA};

Chinese character 数 encodes into too many bytes

I'm trying to encode some Chinese characters using the GB18030 code page in Java, and I ran into this character 数, which translates to "number" in Google Translate.
The issue is, it's turning into 10 bytes (!) when encoded:
81 30 81 34 81 30 83 31 ca fd
import java.math.BigInteger;
import java.nio.charset.Charset;

public class Test3
{
    public static void main(String[] args)
    {
        String s = new String("数");
        System.out.println("source file: " + String.format("%x ",
                new BigInteger(1, s.getBytes(Charset.forName("GB18030")))));
    }
}
When I try to decode that using GB18030, ? characters appear beside the Chinese character (??数). When I try to decode only "CA FD", the last two bytes from above, it correctly decodes to the character.
Google Translate notes the above character is Simplified Chinese. My source file is saved in UTF-8.
I thought GB18030 has a max of 4 bytes per character? Is there any particular reason this character behaves so strangely? (I'm not Chinese, BTW)
The most likely things are either:
1. There's an issue with the encoding of your source file, or
2. You have "invisible" characters prior to the 数 in it.
You can check both of those by completely deleting the string literal on this line:
String s = new String("数");
so it looks like this (note I removed the quotes as well as the character):
String s = new String();
and then adding back "\u6570" to get this:
String s = new String("\u6570");
and seeing if your output changes (as 数 is Unicode code point U+6570, that escape sequence should produce the same character). If the output changes, either there's an encoding problem or you had invisible characters in the string prior to the character. You can probably differentiate the two cases by then adding back just that character (via copy and paste from this page rather than from your previous source code). If the problem reappears, it's an encoding issue. If not, you had hidden characters.
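A quick way to run that check is to print every code point in the string; a minimal sketch:
// Anything printed besides U+6570 indicates invisible characters
// or a source-file encoding problem.
String s = "数";
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));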

Working with special characters derived from a filename in a zip file

This question concerns a Tomcat 7 web application, which is connected to a MySQL (5.5.16) database.
When I open a zip file that has filenames encoded in the windows-1252 charset, the characters seem to be interpreted correctly by Java:
ZipFile zf = new ZipFile( zipFile, Charset.forName( "windows-1252" ) );
Enumeration<? extends ZipEntry> entries = zf.entries();
while( entries.hasMoreElements() ) {
    ZipEntry ze = entries.nextElement();
    if( ! ze.isDirectory() ) {
        String name = ze.getName();
        System.out.println( name ); // prints correct filenames, e.g. café.pdf
    }
}
Omitting the Charset object in the ZipFile constructor would cause an exception.
The filenames in the zip file are printed correctly to standard output, including diacritics.
But, when I subsequently try to store the filename in a database, the e-acute is replaced with a question mark (as seen with the mysql console client).
I had no problems inserting special characters from the web application into MySQL before.
When I execute an INSERT with é in Java source code:
statement.executeUpdate( "insert into files (filename) values ('café.pdf')" );
the é shows up well in MySQL.
Also, my log file shows a comma instead of é: caf‚.pdf
Does anyone know what could be happening here?
As you mentioned in the comments section, the incoming data (the zipped files' names) can be in different character sets. This is going to be an issue for you, because you are using a MySQL+JDBC link, which imposes a lot of limitations (like one character set per column in MySQL, and only one character set per connection in JDBC).
Therefore, I would recommend you switch the character sets (look for variables like character_set_server and character_set_connection) on the MySQL side to UTF8, because that will enable you to transfer and store almost any character you may receive. See here on how to properly set up your MySQL server. Note that setting up the MySQL server might be challenging, so don't hesitate to PM for additional help. JDBC will automatically adjust to the server's character_set_connection variable, so you don't have to change anything in your Java application.
The one thing you WILL have to change in your application is that you will have to convert all incoming data to UTF8 in order to send and store it on the MySQL server.
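A minimal sketch of the Java side under that setup, assuming a plain JDBC connection (names illustrative); the driver converts the bound string to the connection character set:
import java.sql.Connection;
import java.sql.PreparedStatement;

// Hypothetical: bind the filename as a parameter instead of concatenating
// it into the SQL; with character_set_connection = utf8 the JDBC driver
// transfers the characters correctly.
PreparedStatement ps = connection.prepareStatement(
        "insert into files (filename) values (?)");
ps.setString(1, name);
ps.executeUpdate();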
Good luck.
In the table where you store the data, make sure you use the correct collation to be able to store the e-acute character.
The issue is resolved. This post suggested that the encoding of filenames in a zip file might not be windows-1252 but rather IBM437. Changing the Charset from:
ZipFile zf = new ZipFile( zipFile, Charset.forName( "windows-1252" ) );
to
ZipFile zf = new ZipFile( zipFile, Charset.forName( "IBM437" ) );
gave the desired result: when saving the acquired filename in MySQL, it was stored correctly with é.
What went wrong?
Printing out the filenames contained in the zip file to standard output with
System.out.println( name );
made me wrongly assume that the filenames in the zip file were interpreted correctly: when I used the windows-1252 encoding to open the zip file, the filename was printed to standard output nicely, with its diacritic: café.pdf. Using other character encodings, different symbols appeared instead of the é.
But when printing the Unicode value of the é character with the help of this answer, I was able to see that when opening the zip file with windows-1252 encoding, the actual Unicode value was NOT \u00e9 (latin small letter e with acute) but \u201a (single low-9 quotation mark). When I opened the ZipFile with the IBM437 charset, the correct Unicode value DID appear.
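A minimal sketch of that check, printing each character's Unicode value (reusing the name variable from the snippet above):
for (char c : name.toCharArray()) {
    // With IBM437 the é prints as \u00e9; with windows-1252 it was \u201a.
    System.out.printf("%c = \\u%04x%n", c, (int) c);
}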
Of course, when printing a String to standard output with a PrintStream, that PrintStream is also associated with a certain character encoding. From the PrintStream Javadoc:
All characters printed by a PrintStream are converted into bytes using the platform's default character encoding.
I am working on Windows XP.
When I created a new PrintStream
out = new PrintStream( System.out, true, "IBM437" );
everything made sense: opening the zip file with IBM437 character encoding, and using the new PrintStream, é was printed correctly.
There Ain't No Such Thing As Plain Text.
