In my previous question, I pointed out the following problem:
byte[] a = new byte[] { 0x00, 0x00, 0x00, 0x25 };
String stringValue = new String(a, "273");
int firstResult = ByteBuffer.wrap(a).getInt();
System.out.println("First Integer Result: "+firstResult);
byte[] valueEncoded = Arrays.copyOfRange(stringValue.getBytes("273"), 0, 4);
System.out.println("Hex valueEncoded:");
for (byte b : valueEncoded){ System.out.format("0x%x ", b); }
int finalResult = ByteBuffer.wrap(valueEncoded).getInt();
System.out.print(System.lineSeparator());
System.out.println("Final Integer Result: "+finalResult);
This results in:
First Integer Result: 37
Hex valueEncoded:
0x0 0x0 0x0 0x15
Final Integer Result: 21
My expected result is 37, but the finalResult is 21. I got an answer explaining why this is the case:
bytes to string: EBCDIC 0x25 -> UTF-16 0x000A
string to bytes: UTF-16 0x000A -> EBCDIC 0x15
0x000A is the standard line terminator on many systems, but it is generally output as "move to beginning of next line". Converting to IBM 273 is 'probably' done because the text is destined for output on a device that uses that code page, and perhaps such devices want NL rather than LF for starting a new line.
My question is:
Can I somehow code a workaround in order to get the expected result while still using Strings (UTF-16) and decoding/encoding back and forth?
As mentioned in my previous post, this is needed because we persist Strings to a DB2 table which will then be read by an IBM machine.
I dug through several Stack Overflow questions but could not find a solution to my problem:
Encoding strangeness with Cp500 (LF & NEL)
That answer recommends the system property ibm.swapLF, but this is not an option for me since I use OpenJDK 1.8.0_292-b10 and not an IBM JDK. Besides that, my code page is not supported by this property.
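One workaround idea (a sketch, not a verified fix, and it assumes the LF/NL swap is the only asymmetry in this charset's encoder): post-process the encoded bytes and swap 0x15 and 0x25 back before persisting:
// Hedged sketch: undo the encoder's LF/NL swap at the byte level.
// Assumes 0x15 <-> 0x25 is the only asymmetric mapping in code page 273.
byte[] encoded = stringValue.getBytes("273");
for (int i = 0; i < encoded.length; i++) {
    if (encoded[i] == 0x15) {
        encoded[i] = 0x25; // NL back to LF
    } else if (encoded[i] == 0x25) {
        encoded[i] = 0x15; // LF back to NL
    }
}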
EDIT: I added some logging; I hope it helps.
Related
I open a file with Notepad, write "ą" in it, save and close.
I try to read this file in two ways.
First:
InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
int result = inputStream.read();
System.out.println(result);
System.out.println((char) result);
196
Ä
Second:
InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
Reader reader = new InputStreamReader(inputStream);
int result = reader.read();
System.out.println(result);
System.out.println((char) result);
261
ą
Questions:
1) Why is this letter saved as 196 in binary mode, and not as 261?
2) In which encoding is this letter saved as 196?
I am trying to understand why there is a difference.
UTF-8 encodes values in the range U+0080 - U+07FF as two bytes of the form 110xxxxx 10xxxxxx (more at wiki). So there are only 11 bits (xxxxx xxxxxx) available for the value.
ą is the code point U+0105, where 0105 is a hexadecimal value (261 in decimal). In binary it can be represented as
01 05 (hex)
00000001 00000101 (bin)
xxx xxxxxxxx <- values for U+0080 - U+07FF range encode only those bits
001 00000101 <- which means `x` will be replaced by only this part
So UTF-8 encoding applies the 110xxxxx 10xxxxxx mask, which means it combines
110xxxxx 10xxxxxx
00100 000101
into (two bytes):
11000100 10000101
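The same bit arithmetic can be written out in Java (a small sketch to verify the numbers above):
int cp = 0x0105;                          // code point of 'ą' (261 decimal)
byte b1 = (byte) (0xC0 | (cp >> 6));      // 110xxxxx -> 0xC4 (196)
byte b2 = (byte) (0x80 | (cp & 0x3F));    // 10xxxxxx -> 0x85 (133)
System.out.printf("%02X %02X%n", b1 & 0xFF, b2 & 0xFF); // prints: C4 85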
Now, InputStream reads data as raw bytes. So the first time you call inputStream.read(); you get 11000100, which is 196 in decimal. Calling inputStream.read(); a second time would return 10000101, which is 133 in decimal.
Readers were introduced in Java 1.1 so we could avoid this kind of mess in our code. Instead, we can specify which encoding a Reader should use (or let it use the default one) to get properly decoded values, in this case 00000001 00000101 (without the mask), which is 0105 in hexadecimal form and 261 in decimal form.
In short:
use Readers (with a properly specified encoding) if you want to read data as text (see the sketch after this list),
use Streams if you want to read data as raw bytes.
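A minimal sketch of the text case with the encoding spelled out explicitly (assuming the file really was saved as UTF-8):
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

try (Reader reader = new InputStreamReader(
        Files.newInputStream(Paths.get("file.txt")), StandardCharsets.UTF_8)) {
    int result = reader.read();
    System.out.println(result);        // 261
    System.out.println((char) result); // ą
}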
You read the same bytes with two different encodings here; you can check which encoding a reader uses via InputStreamReader::getEncoding.
String s = "ą";
// s.getBytes() uses the platform default charset (assumed UTF-8 here)
char iso_8859_1 = new String(s.getBytes(), "iso-8859-1").charAt(0);
char utf_8 = new String(s.getBytes(), "utf-8").charAt(0);
System.out.println((int) iso_8859_1 + " " + iso_8859_1);
System.out.println((int) utf_8 + " " + utf_8);
The output is
196 Ä
261 ą
Try using a reader with UTF-8 encoding, which matches the encoding used to write the file from Notepad:
// Files.newBufferedReader uses UTF-8 encoding by default
BufferedReader in = Files.newBufferedReader(Paths.get("file.txt"));
String str;
if ((str = in.readLine()) != null) {
    System.out.println(str);
}
in.close();
I don't have an exact/reproducible answer for why you are seeing that output, but if you read with the wrong encoding, you won't necessarily get back what you saved. For example, if the single character ą is encoded as two bytes but you read them as ASCII, you may get back two characters, which would not match your original file.
You are getting the decimal values of Latin letters.
You need to save the file with the UTF-8 encoding standard, and make sure you read it back with the same standard.
0x0105 261 LATIN SMALL LETTER A WITH OGONEK ą
0x00C4 196 LATIN CAPITAL LETTER A WITH DIAERESIS Ä
Refer to: https://www.ssec.wisc.edu/~tomw/java/unicode.html
I made a little project that converts a hexadecimal string into an ASCII string. When I convert the value, I send it to a client, but my client doesn't recognise the value.
I searched for why, and I saw that when I convert the ASCII string back to hexadecimal, I get a slightly different value back. So I think something goes wrong when I send the data, but I don't know how to fix my problem.
I also tried converting the hex first to decimal and then to ASCII, and I also tried the more naive way of just sending a command, for example like this:
char p = 3;
char d = 4;
bw3.write(p + "" + d + "");
So this is the code I have now:
ServerSocket welcomeSocket2 = new ServerSocket(9999);
Socket socket2 = welcomeSocket2.accept();
OutputStream os3 = socket2.getOutputStream();
OutputStreamWriter osw3 = new OutputStreamWriter(os3);
BufferedWriter bw3 = new BufferedWriter(osw3);
String hex4 = "00383700177a0102081c4200000000000001a999c338030201000a080000000000000000184802000007080444544235508001000002080104";
StringBuilder output4 = new StringBuilder();
for (int i = 0; i < hex4.length(); i += 2) {
    String str4 = hex4.substring(i, i + 2);
    int outputdecimal = Integer.parseInt(str4, 16);
    char hexchar = (char) outputdecimal;
    System.out.println(str4);
    output4.append(hexchar);
}
bw3.write(output4.toString());
bw3.flush();
What I also noticed is that when I send a command that is only 4 or 10 bytes long, everything works fine and I receive my converted ASCII code correctly. The command that I now want to send is 58 bytes long.
ASCII is not capable of representing all possible data expressed in hexadecimal.
Therefore, as long as you try to convert your hex to ASCII, nothing you try will ever work.
Your hexadecimal string contains pure binary, opaque data. ASCII is what you use to represent text. Some binary data happens to consist of ASCII and can therefore be represented in ASCII; all other data will always end up wrong when you try to convert it from hexadecimal to ASCII, simply because ASCII is, by definition, unable to represent it.
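If the goal is just to get those 58 bytes to the client intact, one approach (a sketch, assuming the client expects the raw binary payload) is to skip character conversion entirely and write the decoded bytes straight to the OutputStream:
// Hedged sketch: send the hex payload as raw bytes instead of characters.
byte[] payload = new byte[hex4.length() / 2];
for (int i = 0; i < payload.length; i++) {
    payload[i] = (byte) Integer.parseInt(hex4.substring(2 * i, 2 * i + 2), 16);
}
os3.write(payload); // use the raw OutputStream, bypassing the Writer
os3.flush();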
How can a UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted in Java?
I have tried something like:
Character.toCodePoint((char) Integer.parseInt("D0", 16), (char) Integer.parseInt("93", 16));
but it does not convert to a valid code point.
That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).
Update
Here is a slightly more direct version of decoding the string than the one provided by "sstan" in another answer. Of course both versions are fine, so use whichever makes you more comfortable, or write your own version.
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j += 3) {
    assert src.charAt(j) == '=';
    bytes[i] = (byte) (Character.digit(src.charAt(j + 1), 16) << 4 |
                       Character.digit(src.charAt(j + 2), 16));
}
String str = new String(bytes, StandardCharsets.UTF_8);
System.out.println(str);
Output
Газета
In UTF-8, a single character is not always encoded with the same number of bytes. Depending on the character, it may require 1, 2, 3, or even 4 bytes to be encoded. Therefore, it's definitely not a trivial matter to try to map UTF-8 bytes yourself to a Java char, which uses UTF-16 encoding, where each char is encoded using 2 bytes. Not to mention that, depending on the character (code point > 0xffff), you may also have to worry about dealing with surrogate pairs, which is just one more complication that you can easily get wrong.
All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.
Here is some sample code that shows one way this can be achieved:
public static void main(String[] args) throws Exception {
    String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
    // Parse string into hex string tokens.
    String[] tokens = Arrays.stream(src.split("="))
                            .filter(s -> s.length() != 0)
                            .toArray(String[]::new);
    // Convert the hex string representations to a byte array.
    byte[] utf8bytes = new byte[tokens.length];
    for (int i = 0; i < utf8bytes.length; i++) {
        utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
    }
    // Convert UTF-8 bytes to Java String.
    String str = new String(utf8bytes, StandardCharsets.UTF_8);
    // Display string + individual Unicode code points.
    System.out.println(str);
    str.codePoints().forEach(System.out::println);
}
Output:
Газета
1043
1072
1079
1077
1090
1072
I am receiving text via USB communication on Android in the form of extended ASCII characters, like
String receivedText = "5286T11ɬ ªË ¦¿¯¾ ¯¾ ɬ ¨¬°:A011605286 ª¿ª ¾®:12:45 ¸Í®°:(9619441121)ª¿ª:-, ®¹¿¦Í°¾ ¡ ®¹¿¦Í°¾ ª¨À, ¾¦¿µ²À ¸Í, ¾¦¿µ²À ªÂ°Íµ °¿®¾°Í͸:- ¡Í°Éª:-, ¬¾¹°, ¸¾¤¾Í°Â¼ ªÂ°Íµ~";
Now, these characters represent a string in Hindi, but I cannot figure out how to convert this received string into the equivalent Hindi text.
Does anyone know how to convert this into the equivalent Hindi text using Java?
The following is the piece of code I am using to convert a byte array to a hex string:
public String byteArrayToByteString(byte[] arayValue, int size) {
    if (arayValue == null || arayValue.length <= 0)
        return null;
    String[] pseudo = { "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                        "A", "B", "C", "D", "E", "F" };
    StringBuffer out = new StringBuffer();
    byte ch;
    int i = 0;
    while (i < size) {
        ch = (byte) (arayValue[i] & 0xF0); // strip off the high nibble
        ch = (byte) (ch >>> 4);            // shift the bits down
        ch = (byte) (ch & 0x0F);           // mask again: needed if the high-order bit was set,
                                           // since >>> promotes the byte to int
        out.append(pseudo[(int) ch]);      // convert the nibble to a String character
        ch = (byte) (arayValue[i] & 0x0F); // strip off the low nibble
        out.append(pseudo[(int) ch]);      // convert the nibble to a String character
        i++;
    }
    return new String(out);
}
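For reference, the same conversion can be written more compactly with String.format (a sketch producing the same output; Formatter treats a negative Byte as unsigned for the X conversion):
public String byteArrayToByteString(byte[] arayValue, int size) {
    if (arayValue == null || arayValue.length <= 0) return null;
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < size; i++) {
        out.append(String.format("%02X", arayValue[i])); // two uppercase hex digits per byte
    }
    return out.toString();
}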
Let me know if this helps in finding solution
EDIT:
It's a UTF-16 encoding, and the characters in the receivedText string are the extended ASCII form of Hindi characters.
New Edit
I have new characters
String value = "?®Á?Ƕ ¡??°¿¯¾";
which says मुकेश ("Mukesh") and "dangaria" in Hindi. Google Translate does not translate "dangaria", so I cannot provide its Hindi version.
I talked to the person who is doing the encoding. He said that he removed the high byte from the input before encoding, i.e. if \u0905 represents अ in Hindi, then he removed \u09 from the input and converted the remaining 05 into extended hexadecimal form.
So the new input string I provided is encoded as explained above, i.e. \u09 has been removed and the rest is converted into extended ASCII and then sent to the device over USB.
Let me know if this explanation helps you find a solution.
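If that description is accurate, a sketch of the reverse transformation would simply OR the stripped 0x09 high byte back in. This is an assumption based on the explanation above, not a tested decoder:
// Hedged sketch: restore the stripped 0x09 high byte.
// Assumes every non-ASCII char in 'value' was originally U+09xx (Devanagari).
StringBuilder hindi = new StringBuilder();
for (char c : value.toCharArray()) {
    if (c > 0x7F) {
        hindi.append((char) (0x0900 | (c & 0xFF))); // e.g. 0x05 -> U+0905 'अ'
    } else {
        hindi.append(c); // plain ASCII (digits, punctuation) passes through
    }
}
System.out.println(hindi);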
I've been playing around with this a bit and have an idea of what you might need to do. It looks like the value for receivedText that you have in your posting is encoded in windows-1252 for some reason, probably from pasting it into this post. Providing the raw byte values would be better, to avoid any encoding errors. Regardless, I was able to turn that String into the following Unicode Devanagari characters:
5286T11फए ऋभ इडऒठ ऒठ फए उएओ:A011605286 ऋडऋ ठऍ:12:45 चयऍओ:(9619441121)ऋडऋ:-, ऍछडइयओठ ँ ऍछडइयओठ ऋउढ, ठइडगऑढ चय, ठइडगऑढ ऋतओयग ओडऍठओययच:- ँयओफऋ:-, एठछओ, चठअठयओतञ ऋतओयग~
With the following code:
final String receivedText = "5286T11ɬ ªË ¦¿¯¾ ¯¾ ɬ ¨¬°:A011605286 ª¿ª ¾®:12:45 ¸Í®°:(9619441121)ª¿ª:-, ®¹¿¦Í°¾ ¡ ®¹¿¦Í°¾ ª¨À, ¾¦¿µ²À ¸Í, ¾¦¿µ²À ªÂ°Íµ °¿®¾°Í͸:- ¡Í°Éª:-, ¬¾¹°, ¸¾¤¾Í°Â¼ ªÂ°Íµ~";
final Charset fromCharset = Charset.forName("x-ISCII91");
final CharBuffer decoded = fromCharset.decode(ByteBuffer.wrap(receivedText.getBytes("windows-1252")));
final Charset toCharset = Charset.forName("UTF-16");
final byte[] encoded = toCharset.encode(decoded).array();
System.out.println(new String(encoded, toCharset.displayName()));
Whether or not those are the expected characters is something you would need to tell me :)
Also, I'm not sure if the x-ISCII91 character encoding is available in Android.
Generally, for a byte array that you know to be a string value, you can use the following.
Assuming byte[] someBytes:
String stringFromBytes = new String(someBytes, "UTF-16");
You may replace "UTF-16" with the approprate charset, which you can find after some experimentation. This link detailing java's supported character encodings may be of help.
From the details you have provided I would suggest considering the following:
If you're reading a file from a USB drive, Android might have existing frameworks that will help you do this in a more standard way.
If you most certainly need to read in and manipulate the bytes from the USB port directly, make sure that you are familiar with the API/protocol of the data you are reading. It may be that some of the bytes are control messages or something similar that cannot be converted to strings, and you will need to identify exactly where in the byte stream the string begins (and ends).
hindi = new String(receivedText.getBytes(), "UTF-16");
But this does not really look like Hindi... are you sure it is encoded as UTF-16?
Edit:
String charset = "UTF-8";
hindi = new String(hindi.getBytes(Charset.forName(charset)), "UTF-16");
Replace UTF-8 with the actual charset that resulted in your loooong String.
I have dirty data. Sometimes it contains characters like this. I use this data to make queries like
WHERE a.address IN ('mydatahere')
For this character I get
org.hibernate.exception.GenericJDBCException: Illegal mix of collations (utf8_bin,IMPLICIT), (utf8mb4_general_ci,COERCIBLE), (utf8mb4_general_ci,COERCIBLE) for operation ' IN '
How can I filter out characters like this? I use Java.
Thanks.
When I had a problem like this, I used a Perl script to ensure that the data is converted to valid UTF-8, using code like this:
use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}
This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, Unicode replacement character).
If you run this script on good UTF-8 input, output should be identical to input.
If you have data in database, it makes sense to use DBI to scan your table(s) and scrub all data using this approach to make sure that everything is valid UTF-8.
This is a Perl one-liner version of the same script:
perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt
EDIT: Added Java-only solution.
This is an example how to do this in Java:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UtfFix {
    public static void main(String[] args) throws InterruptedException, CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });
        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed);
        // this prints: Пр?вет (the malformed byte was replaced with U+FFFD)
    }
}
You can encode and then decode it to/from UTF-8:
String label = "look into my eyes 〠.〠";
Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();
System.out.println(label);
output:
look into my eyes ?.?
edit: I think this might only work on Java 6.
You can filter surrogate characters with this regex:
String str = "𠀀"; //U+20000, represented by 2 chars in java (UTF-16 surrogate pair)
str = str.replaceAll( "([\\ud800-\\udbff\\udc00-\\udfff])", "");
System.out.println(str.length()); //0
Once you convert the byte array to a String in Java, you get a UTF-16 encoded string (Java strings are always UTF-16 internally). The proper solution to get rid of non-UTF-8 characters is the following code:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(values[i].replaceAll(
        //"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented out because of capital letters
        "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
        "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
        "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
        , ""));
}
Or, if you want to validate whether some string contains non-UTF-8 characters, you can use Pattern.matches, like:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(Pattern.matches(
        ".*(" +
        //"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented out because of capital letters
        "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
        "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
        "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
        + ").*"
        , values[i]));
}
To make a whole web app UTF-8 compatible, read here:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
You can check your pattern here.
The same in PHP here.
Maybe this will help someone, as it helped me.
public static String removeBadChars(String s) {
    if (s == null) return null;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        // skip both halves of a surrogate pair, not just the high surrogate,
        // so no orphaned low surrogates are left behind
        if (Character.isSurrogate(s.charAt(i))) continue;
        sb.append(s.charAt(i));
    }
    return sb.toString();
}
In PHP, I approach this by only allowing printable data. This really helps in cleaning the data for the DB.
It's pre-processing, though, and sometimes you don't have that luxury.
$str = preg_replace('/[[:^print:]]/', '', $str);
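A rough Java analogue of that pre-processing step (a sketch; the Unicode categories below approximate "non-printable" and are an assumption, not an exact port of PHP's [[:^print:]]):
// Hedged sketch: drop control (Cc), format (Cf), private-use (Co) and
// unassigned (Cn) characters from the dirty input string 'str'.
String cleaned = str.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");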