byte array to Hindi Unicode Value - java

Hi, I have a small function that prints a byte array as Hindi text, which is stored as Unicode. My function looks like this:
public static void byteArrayToPrintableHindi(byte[] iData) {
String value = "";
String unicode = "\\u";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < iData.length; i++) {
if (i % 2 == 0) {
value = value.concat(unicode.concat(String.format("%02X", iData[i])));
sb.append(String.format("%02X", iData[i]));
} else {
value = value.concat(String.format("%02X", iData[i]));
}
}
System.out.println("value = "+value);
System.out.println("\u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F");
}
and the output is
value = \u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F
चुड़ामणि
I am expecting the value to print
चुड़ामणि
I don't know why it is not printing the desired output.

You're misunderstanding how \uXXXX escape codes work. When the Java compiler reads your source code, it interprets those escape codes and translates them to Unicode characters. You cannot at runtime build a string that consists of \uXXXX codes and expect Java to automatically translate that into Unicode characters - that's not how it works. It only works with literal \uXXXX codes in your source code.
You can simply do this:
public static void byteArrayToPrintableHindi(byte[] iData) throws UnsupportedEncodingException {
String value = new String(iData, "UTF-16");
System.out.println("value = "+value);
System.out.println("\u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F");
}
assuming that the data is UTF-16-encoded.
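To make the difference concrete, here is a minimal sketch that decodes the bytes directly. It assumes the input is big-endian UTF-16 without a byte order mark; the class name and the sample bytes are made up for illustration and are not from the question. Passing a StandardCharsets constant instead of a charset name also avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

class HindiBytesDemo {
    // Decodes the bytes as big-endian UTF-16 code units.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Assumed sample input: U+091A U+0941 U+0921 U+093C as big-endian byte pairs.
        byte[] data = {0x09, 0x1A, 0x09, 0x41, 0x09, 0x21, 0x09, 0x3C};
        System.out.println("value = " + decode(data));
    }
}
```

If the data might be little-endian or start with a byte order mark, the charset would need to be chosen accordingly.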

Related

Convert Unicode to UTF-8

My question may already have been answered on Stack Overflow, but I can't find it.
My problem is simple: I request data via an API, and the returned data contains Unicode escape sequences, for example:
"SpecialOffer":[{"title":"Offre Vente Priv\u00e9e 1 jour 2019 2020"}]
I need to convert the "\u00e9" to "é".
I can't use "replaceAll", because I cannot know in advance all the characters that will appear.
I tried this:
byte[] utf8 = reponse.getBytes("UTF-8");
String string = new String(utf8, "UTF-8");
But the string still contains "\u00e9".
I also tried this:
byte[] utf8 = reponse.getBytes(StandardCharsets.UTF_8);
String string = new String(utf8, StandardCharsets.UTF_8);
I also tried this:
string = string.replace("\\\\", "\\");
byte[] utf8Bytes = null;
String convertedString = null;
utf8Bytes = string.getBytes("UTF8"); // also tried StandardCharsets.UTF_8, "UTF-8" and "UTF_8"
convertedString = new String(utf8Bytes, "UTF8"); // also tried StandardCharsets.UTF_8, "UTF-8" and "UTF_8"
System.out.println(convertedString);
return convertedString;
But it doesn't work either.
I tested other methods, but I deleted everything that didn't work, so I can't show them to you here.
I am sure there is a very simple method, but I am probably not searching with the right vocabulary. Can you help me, please?
I wish you a very good day, and thank you very much in advance.
The String.getBytes method requires a valid Charset [1].
According to the Javadoc [2], the charsets that every Java implementation must support are:
US-ASCII
ISO-8859-1
UTF-8
UTF-16BE
UTF-16LE
UTF-16
So you need to use UTF-8 in the getBytes method.
[1] https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset-
[2] https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html
You can use a small JSON library:
String jsonstring = "{\"SpecialOffer\":[{\"title\":\"Offre Vente Priv\\u00e9e 1 jour 2019 2020\"}]}";
JsonValue json = JsonParser.parse(jsonstring);
String value = json.asObject()
.first("SpecialOffer").asArray().get(0)
.asObject().first("title").asStringLiteral().stringValue();
System.out.println(" result: " + value);
or
String text = "Offre Vente Priv\\u00e9e 1 jour 2019 2020";
System.out.println(" result: " + JsonEscaper.unescape(text));
The problem, which I had not seen, is that the API did not return "\u00e9e" but "\\u00e9e": a literal character sequence, not a Unicode character!
So I have to rebuild all the Unicode characters myself, and now everything works fine:
int i=0, len=s.length();
char c;
StringBuffer sb = new StringBuffer(len);
while (i < len) {
c = s.charAt(i++);
if (c == '\\') {
if (i < len) {
c = s.charAt(i++);
if (c == 'u') {
// TODO: check that 4 more chars exist and are all hex digits
c = (char) Integer.parseInt(s.substring(i, i+4), 16);
i += 4;
} // add other cases here as desired...
}
} // fall through: \ escapes itself, quotes any character but u
sb.append(c);
}
return sb.toString();
I found this solution here:
Java: How to create unicode from string "\u00C3" etc
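For comparison, here is a regex-based sketch of the same idea that also takes care of the TODO above by matching only a backslash, a u and exactly four hex digits. The class and method names are made up for illustration. It deliberately ignores other escapes (such as an escaped backslash), so when the input is real JSON a JSON parser remains the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescaper {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded backslash or dollar sign from
            // being read as replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash makes this a literal escape sequence in the string.
        System.out.println(unescape("Offre Vente Priv\\u00e9e 1 jour 2019 2020"));
    }
}
```

Sequences that do not have four hex digits after the u are left untouched rather than causing an exception.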

hadoop mapper input deal with hex values

I have a list of tweets as input on HDFS, and I am trying to perform a map-reduce task. This is my mapper implementation:
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
try {
String[] fields = value.toString().split("\t");
StringBuilder sb = new StringBuilder();
for (int i = 1; i < fields.length; i++) {
if (i > 1) {
sb.append("\t");
}
sb.append(fields[i]);
}
tid.set(fields[0]);
content.set(sb.toString());
context.write(tid, content);
} catch(DecoderException e) {
e.printStackTrace();
}
}
As you can see, I tried to split the input by "\t", but the input (value.toString()) looks like this when I print it out:
2014\x091880284777\x09argento_un\x090\x090\x09RT #topmusic619: #RETWEET THIS!!!!!\x5CnFOLLOW ME &amp; EVERYONE ELSE THAT RETWEETS THIS FOR 35+ FOLLOWERS\x5Cn#TeamFollowBack #Follow2BeFollowed #TajF\xE2\x80\xA6
here is another example:
2014\x0934447260\x09RBEKP\x090\x090\x09\xE2\x80\x9C#LENEsipper: Wild lmfaooo RT #Yerrp08: L**o some n***a nutt up while gettin twerked
I noticed that \x09 should be a tab character (ASCII 0x09 is tab), so I tried to use Apache's Hex class:
String tmp = value.toString();
byte[] bytes = Hex.decodeHex(tmp.toCharArray());
But the decodeHex function returns null.
This is weird, since some of the characters are in hex while others are not. How can I decode them?
Edit:
Also note that besides tab, emojis are also encoded as hex values.
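One possible approach, sketched below, is to treat every \xNN sequence as one raw byte, copy everything else through unchanged, and decode the resulting bytes as UTF-8. This assumes the escaped bytes really are UTF-8 (which fits \xE2\x80\xA6, the ellipsis character); the class name is hypothetical. Hex.decodeHex is meant for input that consists solely of hex digits, so it does not fit this mixed format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class HexEscapeDecoder {
    static String decode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\x", i) && i + 4 <= s.length()
                    && isHex(s.charAt(i + 2)) && isHex(s.charAt(i + 3))) {
                // A \xNN escape stands for one raw byte.
                out.write(Integer.parseInt(s.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Copy any other code point through as its UTF-8 bytes.
                int cp = s.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    public static void main(String[] args) {
        String line = "2014\\x09RBEKP\\x09\\xE2\\x80\\x9Chello";
        // After decoding, the fields are separated by real tab characters.
        System.out.println(decode(line).split("\t").length);
    }
}
```

In the mapper, the decoding would run on value.toString() before the split on "\t". Note that \x5Cn decodes to a backslash followed by n, not to a newline.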

Encode/decode hex to utf-8 string

I am working on a web application that accepts all UTF-8 characters, including Greek characters. The following are the strings that I want to convert to hex.
These strings, in different languages, do not work with my current code:
ЫЙБПАРО Εγκυκλοπαίδεια éaös Größe Größe
These are the hex conversions produced by the JavaScript function shown below:
42b41941141f41042041e 3953b33ba3c53ba3bb3bf3c03b13af3b43b53b93b1 e961f673 4772c3192c2b6c3192c217865 4772f6df65
JavaScript function to convert the above strings to hex:
function encode(string) {
var str= "";
var length = string.length;
for (var i = 0; i < length; i++){
str+= string.charCodeAt(i).toString(16);
}
return str;
}
The conversion itself gives no error, but on the Java side I am unable to parse such a string. I used the following Java code to convert the hex:
public String HexToString(String hex){
StringBuilder finalString = new StringBuilder();
StringBuilder tempString = new StringBuilder();
for( int i=0; i<hex.length()-1; i+=2 ){
String output = hex.substring(i, (i + 2));
int decimal = Integer.parseInt(output, 16);
finalString.append((char)decimal);
tempString.append(decimal);
}
return finalString.toString();
}
It throws a parse exception while parsing the above hex string.
Please suggest a solution.
JavaScript works with 16-bit UTF-16 code units, so charCodeAt can return any number between 0 and 65535. When you encode that to hex you get strings of 1 to 4 characters, and if you simply concatenate them, the other party has no way to tell where one character ends and the next begins.
You can work around this by adding delimiters to your encoded string:
function encode(string) {
return string.split("").map(function(c) {
return c.charCodeAt(0).toString(16);
}).join('-');
}
alert(encode('größe Εγκυκλοπαίδεια 维'))
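On the Java side, the delimited string can then be split and parsed one code unit at a time. This is a minimal sketch that assumes the '-' delimiter used by the encode function above; the class name is made up for illustration.

```java
class DelimitedHexDecoder {
    static String decode(String encoded) {
        if (encoded.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (String part : encoded.split("-")) {
            // Each part is one UTF-16 code unit written in hex.
            sb.append((char) Integer.parseInt(part, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("67-72-f6-df-65"));
    }
}
```

Because both sides work in UTF-16 code units, surrogate pairs arrive as two parts and recombine in the resulting String.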

How to parse UTF-8 representation to String in Java?

Given the following code:
String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
String result = convertToEffectiveString(tmp); // result contain now "hello\n"
Does the JDK already provide a class for doing this?
Is there a library that does this (preferably available through Maven)?
I have tried with ByteArrayOutputStream with no success.
This works, but only with ASCII. If you use Unicode characters outside the ASCII range, you will have problems, because each character is stuffed into a single byte, whereas UTF-8 needs more than one byte for such characters. The cast below is safe only because the input is guaranteed to be essentially ASCII (as you mention in your comments), so no value overflows one byte.
package sample;
import java.io.UnsupportedEncodingException;
public class UnicodeSample {
public static final int HEXADECIMAL = 16;
public static void main(String[] args) {
try {
String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
String arr[] = str.replaceAll("\\\\u"," ").trim().split(" ");
byte[] utf8 = new byte[arr.length];
int index=0;
for (String ch : arr) {
utf8[index++] = (byte)Integer.parseInt(ch,HEXADECIMAL);
}
String newStr = new String(utf8, "UTF-8");
System.out.println(newStr);
}
catch (UnsupportedEncodingException e) {
// handle the UTF-8 conversion exception
}
}
}
Here is another solution that tries to lift the ASCII-only restriction by splitting each code value into its bytes instead of truncating it to one byte. Be aware that the raw bytes of a code value are not its UTF-8 encoding, so values above 0x7F will not decode to the intended characters this way. Thanks to deceze for the questions. You made me think more about the problem and solution.
package sample;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
public class UnicodeSample {
public static final int HEXADECIMAL = 16;
public static void main(String[] args) {
try {
String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";
ArrayList<Byte> arrList = new ArrayList<Byte>();
String codes[] = str.replaceAll("\\\\u"," ").trim().split(" ");
for (String c : codes) {
int code = Integer.parseInt(c,HEXADECIMAL);
byte[] bytes = intToByteArray(code);
for (byte b : bytes) {
if (b != 0) arrList.add(b);
}
}
byte[] utf8 = new byte[arrList.size()];
for (int i=0; i<arrList.size(); i++) utf8[i] = arrList.get(i);
str = new String(utf8, "UTF-8");
System.out.println(str);
}
catch (UnsupportedEncodingException e) {
// handle the exception when
}
}
// Takes a 4-byte integer and extracts each byte
public static final byte[] intToByteArray(int value) {
return new byte[] {
(byte) (value >>> 24),
(byte) (value >>> 16),
(byte) (value >>> 8),
(byte) (value)
};
}
}
Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?
If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes:
String result = "\u0068\u0065\u006c\u006c\u006f\u000a";
If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.
I'm sure there must be a better way, but using just the JDK:
public static String handleEscapes(final String s)
{
final java.util.Properties props = new java.util.Properties();
props.setProperty("foo", s);
final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
try
{
props.store(baos, null);
final String tmp = baos.toString().replace("\\\\", "\\");
props.load(new java.io.StringReader(tmp));
}
catch(final java.io.IOException ioe) // shouldn't happen
{ throw new RuntimeException(ioe); }
return props.getProperty("foo");
}
This uses java.util.Properties.load(java.io.Reader) to process the backslash escapes (after first using java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties file, and then using replace("\\\\", "\\") to reverse the backslash-escaping of the original backslashes).
(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f)

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f) using Java?
EDIT:
I guess the question is not very clear... Basically what I want is this:
Given string s="blalbla" I want to get string "\uXXXX\uYYYY"
You will need to extract each code point/unit from the String and encode it yourself. The following works for all Strings even if the individual linguistic characters within the String are composed of digraphs or ligatures.
public String getUnicodeEscapes(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char codeUnit = aString.charAt(ctr);
String hexString = Integer.toHexString(codeUnit);
String padAmount = "0000".substring(hexString.length());
buffer.append("\\u");
buffer.append(padAmount);
buffer.append(hexString);
}
return buffer.toString();
}
else
{
return null;
}
}
The above produces output as dictated by the Java Language Specification on Unicode escapes, i.e. it produces output of the form \uxxxx for each UTF-16 code unit. It addresses supplementary characters by producing a pair of code units represented as \uxxxx\uyyyy.
The originally posted code has been modified to produce Unicode codepoints in the format U+FFFFF:
public String getUnicodeCodepoints(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char ch = aString.charAt(ctr);
if (Character.isLowSurrogate(ch))
{
continue;
}
else
{
int codePoint = aString.codePointAt(ctr);
String hexString = Integer.toHexString(codePoint);
String zeroPad = Character.isHighSurrogate(ch) ? "00000" : "0000";
String padAmount = zeroPad.substring(hexString.length());
buffer.append(" U+");
buffer.append(padAmount);
buffer.append(hexString);
}
}
return buffer.toString();
}
else
{
return null;
}
}
The grunt work is done by the String.codePointAt() method, which returns the Unicode code point at a particular index. For a String composed of combining characters, String.length() is not the number of visible characters but the number of UTF-16 code units. For example, क and ् combine to form क् in Devanagari, and the above function will correctly return U+0915 U+094d without any fuss, as String.length() returns 2 for the combined character. Strings with supplementary characters yield a single code point for each character: 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 (the page will not display this String literal correctly, but you can copy this just fine; it should be Javascript but written using the supplementary character set for Mathematical alphanumeric symbols) will return U+1d4a5 U+1d4b6 U+1d4cb U+1d4b6 U+1d4c8 U+1d4b8 U+1d4c7 U+1d4be U+1d4c5 U+1d4c9.
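On Java 8 or later, the same listing can be sketched more compactly with String.codePoints(), which already merges surrogate pairs into single code points. The class name is made up for illustration, and unlike the function above this version prints uppercase hex digits.

```java
import java.util.stream.Collectors;

class CodePointLister {
    static String toCodePoints(String s) {
        return s.codePoints()
                // Pad to at least four hex digits, as in the U+XXXX notation.
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // KA followed by VIRAMA: two code points, one visible cluster.
        System.out.println(toCodePoints("\u0915\u094d"));
    }
}
```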
public static void main(String[] args) {
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f"));
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
System.out.println(s);
} catch (CharacterCodingException e) {
e.printStackTrace();
}
}
I'm not aware of a built-in solution, so:
StringBuilder builder = new StringBuilder();
for(int i=0; i<yourString.length(); i++) {
builder.append(String.format("\\u%04x", yourString.charAt(i)));
}
String encoded = builder.toString();
Edit: sorry, I thought you wanted to get the String encoded as \uXXXX expressions ...
You didn't say which encoding you are after, but based on the tag I'm assuming you want the UTF-8 encoding. Here's how:
byte[] utf8 =
"\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f".getBytes("UTF-8");
You can then write a simple loop to output the bytes in utf8 in hexadecimal or decimal ... or do something else with them.
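Such a loop could look like the sketch below, which prints each UTF-8 byte as two hex digits. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8HexDump {
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes are not sign-extended.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f"));
    }
}
```

Each Cyrillic letter in this string takes two bytes in UTF-8, so the output has four hex digits per letter.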
System.out.println ("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f");
works like a charm for me:
Служебная
