Encode/decode hex to utf-8 string - java

I am working on a web application that accepts all UTF-8 characters, including Greek characters. The following are strings that I want to convert to hex.
These strings, in different languages, do not work with my current code:
ЫЙБПАРО Εγκυκλοπαίδεια éaös Größe Größe
The following are the hex conversions produced by the JavaScript function shown below:
42b41941141f41042041e 3953b33ba3c53ba3bb3bf3c03b13af3b43b53b93b1 e961f673 4772c3192c2b6c3192c217865 4772f6df65
JavaScript function that converts the above strings to hex:
function encode(string) {
var str= "";
var length = string.length;
for (var i = 0; i < length; i++){
str+= string.charCodeAt(i).toString(16);
}
return str;
}
The conversion itself raises no error, but on the Java side I am unable to parse such a string. I used the following Java code to convert the hex back:
public String HexToString(String hex){
StringBuilder finalString = new StringBuilder();
StringBuilder tempString = new StringBuilder();
for( int i=0; i<hex.length()-1; i+=2 ){
String output = hex.substring(i, (i + 2));
int decimal = Integer.parseInt(output, 16);
finalString.append((char)decimal);
tempString.append(decimal);
}
return finalString.toString();
}
It throws a parse exception while parsing the hex strings above.
Please suggest a solution.

JavaScript works with 16-bit Unicode code units, so charCodeAt may return any number between 0 and 65535. Encoded as hex, each value becomes a string of 1 to 4 characters, and if you simply concatenate these, the other party has no way to tell where one character ends and the next begins.
You can work around this by adding delimiters to your encoded string:
function encode(string) {
return string.split("").map(function(c) {
return c.charCodeAt(0).toString(16);
}).join('-');
}
alert(encode('größe Εγκυκλοπαίδεια 维'))
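On the Java side, a string produced by the delimited encoder above can be split on the same delimiter and each piece parsed as one UTF-16 code unit. The following is a minimal sketch under that assumption; the class name DelimitedHexDecoder and the method name decode are made up for illustration.

```java
final class DelimitedHexDecoder {

    // Decodes input such as "67-72-f6-df-65", where every piece between
    // the dashes is the hex value of one UTF-16 code unit (1 to 4 digits).
    static String decode(String encoded) {
        if (encoded.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (String part : encoded.split("-")) {
            // parseInt throws NumberFormatException on a malformed piece
            sb.append((char) Integer.parseInt(part, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Code units of "größe" as the JavaScript encoder would emit them
        System.out.println(decode("67-72-f6-df-65"));
    }
}
```

Because the JavaScript side emits one value per code unit, characters outside the Basic Multilingual Plane arrive as two surrogate values, and casting each back to char restores the pair.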


Convert Unicode to UTF-8

My question may already have been answered on Stack Overflow, but I can't find it.
My problem is simple: I request data via an API, and the returned data contains Unicode escape sequences, for example:
"SpecialOffer":[{"title":"Offre Vente Priv\u00e9e 1 jour 2019 2020"}]
I need to convert the "\u00e9" to "é", so that "Priv\u00e9e" becomes "Privée".
I can't use "replaceAll", because I cannot know in advance all the characters that will appear.
I tried this:
byte[] utf8 = reponse.getBytes("UTF-8");
String string = new String(utf8, "UTF-8");
But the string still contains "\u00e9".
I also tried this:
byte[] utf8 = reponse.getBytes(StandardCharsets.UTF_8);
String string = new String(utf8, StandardCharsets.UTF_8);
And this:
string = string.replace("\\\\", "\\");
byte[] utf8Bytes = null;
String convertedString = null;
utf8Bytes = string.getBytes("UTF8"); // or StandardCharsets.UTF_8, "UTF-8", "UTF_8"
convertedString = new String(utf8Bytes, "UTF8"); // or StandardCharsets.UTF_8, "UTF-8", "UTF_8"
System.out.println(convertedString);
return convertedString;
But it doesn't work either.
I tested other methods, but I deleted everything that didn't work, so I can't show them here.
I am sure there is a very simple method, but I am probably not searching with the right vocabulary. Can you help me please?
I wish you a very good day, and thank you very much in advance.
The String.getBytes method requires a valid Charset [1].
According to the javadoc [2], the charsets that every Java implementation must support are:
US-ASCII
ISO-8859-1
UTF-8
UTF-16BE
UTF-16LE
UTF-16
So you need to use UTF-8 in the getBytes method.
[1] https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset-
[2] https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html
You can use a small JSON library:
String jsonstring = "{\"SpecialOffer\":[{\"title\":\"Offre Vente Priv\\u00e9e 1 jour 2019 2020\"}]}";
JsonValue json = JsonParser.parse(jsonstring);
String value = json.asObject()
.first("SpecialOffer").asArray().get(0)
.asObject().first("title").asStringLiteral().stringValue();
System.out.println(" result: " + value);
or
String text = "Offre Vente Priv\\u00e9e 1 jour 2019 2020";
System.out.println(" result: " + JsonEscaper.unescape(text));
The problem, which I had not noticed, is that the API did not return "\u00e9" as a Unicode character but "\\u00e9" as a literal character sequence: a backslash, a "u" and four hex digits!
So I have to rebuild the characters from the escape sequences myself, and then everything works fine:
int i=0, len=s.length();
char c;
StringBuffer sb = new StringBuffer(len);
while (i < len) {
c = s.charAt(i++);
if (c == '\\') {
if (i < len) {
c = s.charAt(i++);
if (c == 'u') {
// TODO: check that 4 more chars exist and are all hex digits
c = (char) Integer.parseInt(s.substring(i, i+4), 16);
i += 4;
} // add other cases here as desired...
}
} // fall through: \ escapes itself, quotes any character but u
sb.append(c);
}
return sb.toString();
I found this solution here:
Java: How to create unicode from string "\u00C3" etc
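The fragment above leaves the bounds and hex-digit check as a TODO. A self-contained variant with that check filled in could look like the sketch below. The class and method names are hypothetical, and, unlike the fragment, this version leaves every backslash that does not start a valid escape untouched instead of dropping it.

```java
final class UnicodeUnescaper {

    // Replaces each literal backslash-u-plus-four-hex-digits sequence with
    // the UTF-16 code unit it names; all other text is copied as is.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            // After the backslash we need 'u' plus four hex digits.
            if (c == '\\' && i + 5 <= s.length()
                    && s.charAt(i) == 'u' && isHex(s, i + 1, i + 5)) {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 5; // skip the 'u' and the four digits
                continue;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(unescape("Offre Vente Priv\\u00e9e 1 jour 2019 2020"));
    }
}
```

If the whole response is JSON, parsing it with a JSON library is usually the more robust route, since the parser decodes these escapes as part of reading string values.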

Need help in converting EBCDIC to Hexadecimal

I am writing a Hive UDF to convert EBCDIC characters to hexadecimal.
The EBCDIC characters are stored in a Hive table. Currently I am able to convert them, but a few characters are ignored during conversion.
Example:
This is the EBCDIC value stored in table:
AGNSAñA¦ûÃÃÂõÂjÂq  à ()
Converted hexadecimal:
c1c7d5e2000a5cd4f6ef99187d07067203a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
What I want as output:
c1c7d5e200010a5cd4f6ef99187d0706720103a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
The following EBCDIC characters are not converted:
01 - start of heading
10 - escape
15 - new line
Below is the code I have tried so far:
public class EbcdicToHex extends UDF {
public String evaluate(String edata) throws UnsupportedEncodingException {
byte[] ebcdiResult = getEBCDICRawData(edata);
String hexResult = getHexData(ebcdiResult);
return hexResult;
}
public byte[] getEBCDICRawData (String edata) throws UnsupportedEncodingException {
byte[] result = null;
String ebcdic_encoding = "IBM-037";
result = edata.getBytes(ebcdic_encoding);
return result;
}
public String getHexData(byte[] result){
String output = asHex(result);
return output;
}
public static String asHex(byte[] buf) {
char[] HEX_CHARS = "0123456789abcdef".toCharArray();
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
}
During conversion, a few EBCDIC characters are ignored. How can I get them converted to hexadecimal as well?
I think the problem lies elsewhere. I created a small test case that builds a String from the 3 bytes you say are ignored, and in my output they are converted correctly:
private void run(String[] args) throws Exception {
byte[] bytes = new byte[] {0x01, 0x10, 0x15};
String str = new String(bytes, "IBM-037");
byte[] result = getEBCDICRawData(str);
for(byte b : result) {
System.out.print(Integer.toString(( b & 0xff ) + 0x100, 16).substring(1) + " ");
}
System.out.println();
System.out.println(evaluate(str));
}
Output:
01 10 15
011015
Based on this, both your getEBCDICRawData and evaluate methods seem to work correctly, which makes me believe your String value is already incorrect to start with. Could it be that the String is already missing those characters? Or, as a long shot, maybe the charset is incorrect? There are different EBCDIC charsets, so the String may have been composed using a different one, although I doubt this would make much difference for the 01, 10 and 15 bytes.
As a final remark, probably unrelated to your problem: I usually prefer to use the encode/decode methods on the Charset object for such conversions:
String charset = "IBM-037";
Charset cs = Charset.forName(charset);
ByteBuffer bb = cs.encode(str);
CharBuffer cb = cs.decode(bb);

Java: How to convert String of ASCII to String of characters?

I am sending a message from one device to another using an MQTT client/broker. The message is exchanged (sent and received) between the two devices as a String successfully.
However, on the MQTT broker (i.e. the server) the message characters are received as ASCII numbers within a string.
For example if I send:
"This is a test"
On the broker it shows:
"84,104,105,115,32,105,115,32,97,32,116,101,115,116,10"
Using Java, I need a way to convert this string of ASCII codes back to a string on the server for further processing.
How can I do that? Thanks.
Convert the string to a byte[] and create a new String from the byte[]:
String str = "84,104,105,115,32,105,115,32,97,32,116,101,115,116,10";
String[] chars = str.split(",");
byte[] bytes = new byte[chars.length];
for (int i = 0; i < chars.length; i++) {
bytes[i] = Byte.parseByte(chars[i]);
}
return new String(bytes, java.nio.charset.StandardCharsets.US_ASCII); // explicit charset instead of the platform default
You can split the string using StringTokenizer with a comma as the delimiter, then iterate over the tokens and use Character.toString((char) i) on each of them.
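The StringTokenizer approach could be sketched as follows. The class name AsciiCodeDecoder is made up for illustration, and the sketch assumes every token is a valid decimal character code.

```java
import java.util.StringTokenizer;

final class AsciiCodeDecoder {

    // Turns "84,104,105" into "Thi" by reading each comma-separated
    // decimal number as one character code.
    static String decode(String codes) {
        StringBuilder sb = new StringBuilder();
        StringTokenizer tokens = new StringTokenizer(codes, ",");
        while (tokens.hasMoreTokens()) {
            int code = Integer.parseInt(tokens.nextToken().trim());
            sb.append((char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The trailing 10 is a line feed, so print() already ends the line.
        System.out.print(decode("84,104,105,115,32,105,115,32,97,32,116,101,115,116,10"));
    }
}
```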
You can also use a stream for this, if you can use Java 8:
String str = Stream.of("84,104,105,115,32,105,115,32,97,32,116,101,115,116,10".split(","))
.map(ch -> (char) Integer.valueOf(ch).intValue())
.collect(StringBuilder::new,
StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
System.out.println(str); // This is a test

byte array to Hindi Unicode Value

Hi, I have a small function that prints a byte array as Hindi text, which is stored as Unicode. My function looks like this:
public static void byteArrayToPrintableHindi(byte[] iData) {
String value = "";
String unicode = "\\u";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < iData.length; i++) {
if (i % 2 == 0) {
value = value.concat(unicode.concat(String.format("%02X", iData[i])));
sb.append(String.format("%02X", iData[i]));
} else {
value = value.concat(String.format("%02X", iData[i]));
}
}
System.out.println("value = "+value);
System.out.println("\u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F");
}
and the output is
value = \u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F
चुड़ामणि
I am expecting the value to print
चुड़ामणि
I don't know why it is not printing the desired output.
You're misunderstanding how \uXXXX escape codes work. When the Java compiler reads your source code, it interprets those escape codes and translates them to Unicode characters. You cannot at runtime build a string that consists of \uXXXX codes and expect Java to automatically translate that into Unicode characters - that's not how it works. It only works with literal \uXXXX codes in your source code.
You can simply do this:
public static void byteArrayToPrintableHindi(byte[] iData) throws UnsupportedEncodingException {
String value = new String(iData, "UTF-16");
System.out.println("value = "+value);
System.out.println("\u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F");
}
assuming that the data is UTF-16-encoded.
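Judging from the loop in the question, the byte array holds two bytes per character with the high byte first and no byte order mark. Under that assumption the byte order can be stated explicitly with StandardCharsets.UTF_16BE, which also avoids the checked UnsupportedEncodingException. The class name HindiBytes and the sample array below are illustrative only.

```java
import java.nio.charset.StandardCharsets;

final class HindiBytes {

    // Interprets the bytes as big-endian UTF-16 code units.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // The same code units as the literal printed in the question
        byte[] data = {
            0x09, 0x1A, 0x09, 0x41, 0x09, 0x21, 0x09, 0x3C,
            0x09, 0x3E, 0x09, 0x2E, 0x09, 0x23, 0x09, 0x3F
        };
        System.out.println("value = " + decode(data));
    }
}
```

Whether the text shows up correctly on the console still depends on the console's own encoding and font.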

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f)

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f) using Java?
EDIT:
I guess the question is not very clear... Basically what I want is this:
Given string s="blalbla" I want to get string "\uXXXX\uYYYY"
You will need to extract each code point/unit from the String and encode it yourself. The following works for all Strings even if the individual linguistic characters within the String are composed of digraphs or ligatures.
public String getUnicodeEscapes(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char codeUnit = aString.charAt(ctr);
String hexString = Integer.toHexString(codeUnit);
String padAmount = "0000".substring(hexString.length());
buffer.append("\\u");
buffer.append(padAmount);
buffer.append(hexString);
}
return buffer.toString();
}
else
{
return null;
}
}
The above produces output as dictated by the Java Language Specification on Unicode escapes, i.e. it produces output of the form \uxxxx for each UTF-16 code unit. It addresses supplementary characters by producing a pair of code units represented as \uxxxx\uyyyy.
The originally posted code has been modified to produce Unicode codepoints in the format U+FFFFF:
public String getUnicodeCodepoints(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char ch = aString.charAt(ctr);
if (Character.isLowSurrogate(ch))
{
continue;
}
else
{
int codePoint = aString.codePointAt(ctr);
String hexString = Integer.toHexString(codePoint);
String zeroPad = Character.isHighSurrogate(ch) ? "00000" : "0000";
String padAmount = zeroPad.substring(hexString.length());
buffer.append(" U+");
buffer.append(padAmount);
buffer.append(hexString);
}
}
return buffer.toString();
}
else
{
return null;
}
}
The gruntwork is done by the String.codePointAt() method, which returns the Unicode codepoint at a particular index. For a String composed of combining characters, String.length() is not the number of visible characters but the number of UTF-16 code units. For example, क and ् combine to form क् in Devanagari, and the above function will rightfully return U+0915 U+094d without any fuss, as String.length() returns 2 for the combined character. Strings with supplementary characters are reported with a single codepoint per character: 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 (the page will not display this String literal correctly, but you can copy this just fine; it should be Javascript but written using the supplementary character set for Mathematical alphanumeric symbols) will return U+1d4a5 U+1d4b6 U+1d4cb U+1d4b6 U+1d4c8 U+1d4b8 U+1d4c7 U+1d4be U+1d4c5 U+1d4c9.
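On Java 8 and later, the same listing can be written with String.codePoints(), which already pairs up surrogates. The sketch below is an alternative to the method above, not the original answer's code, and the class name CodePointLister is hypothetical. Because it pads with a format string, it also covers code points with six hex digits.

```java
import java.util.stream.Collectors;

final class CodePointLister {

    // Lists every Unicode code point of s in the form U+xxxx,
    // padded to at least four hex digits and separated by spaces.
    static String list(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04x", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // KA followed by VIRAMA, written as escapes to keep the source ASCII
        System.out.println(list("\u0915\u094d"));
    }
}
```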
public static void main(String[] args) {
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f"));
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
System.out.println(s);
} catch (CharacterCodingException e) {
e.printStackTrace();
}
}
I'm not aware of a built-in solution, so:
StringBuilder builder = new StringBuilder();
for(int i=0; i<yourString.length(); i++) {
builder.append(String.format("\\u%04x", yourString.charAt(i)));
}
String encoded = builder.toString();
Edit: sorry, I thought you wanted to get the String encoded to \uXXXX expressions ...
You didn't say which encoding you are after, but based on the tag I'm assuming you want the UTF-8 encoding. Here's how:
byte[] utf8 =
"\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f".getBytes("UTF-8");
You can then write a simple loop to output the bytes in utf8 in hexadecimal or decimal ... or do something else with them.
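Such a loop might look like the following sketch. The class name Utf8HexDump and the space-separated output format are choices made here for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8HexDump {

    // Returns the UTF-8 bytes of text as lowercase hex pairs separated by spaces.
    static String toHex(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(utf8.length * 3);
        for (byte b : utf8) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative byte values print as 80..ff
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f"));
    }
}
```

Using the StandardCharsets constant rather than the "UTF-8" string avoids having to handle UnsupportedEncodingException.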
System.out.println ("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f");
works like a charm for me:
Служебная
