Python encoded utf-8 string \xc4\x91 in Java - java

How can I get a proper Java string from the Python-created string 'Oslobo\xc4\x91enja'?
How do I decode it? I think I've tried everything and looked everywhere; I've been stuck on this problem for 2 days. Please help!
Here is the Python web service method that returns the JSON, which the Java client parses with Google Gson.
def list_of_suggestions(entry):
    """Returns list of suggestions from auto-complete search"""
    input = entry.encode('utf-8')
    json_result = { 'suggestions': [] }
    resp = urllib2.urlopen('https://maps.googleapis.com/maps/api/place/autocomplete/json?input=' + urllib2.quote(input) + '&location=45.268605,19.852924&radius=3000&components=country:rs&sensor=false&key=blahblahblahblah')
    # make json object from response
    json_resp = json.loads(resp.read())
    if json_resp['status'] == u'OK':
        for pred in json_resp['predictions']:
            if pred['description'].find('Novi Sad') != -1 or pred['description'].find(u'Нови Сад') != -1:
                obj = {}
                obj['name'] = pred['description'].encode('utf-8').encode('string-escape')
                obj['reference'] = pred['reference'].encode('utf-8').encode('string-escape')
                json_result['suggestions'].append(obj)
    return str(json_result)
Here is the solution on the Java client:
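private String python2JavaStr(String pythonStr) throws UnsupportedEncodingException {
    int charValue;
    // The escaped Python string is expected to contain ASCII characters only,
    // so byte indexes and character indexes line up.
    byte[] bytes = pythonStr.getBytes("UTF-8");
    ByteBuffer decodedBytes = ByteBuffer.allocate(bytes.length);
    for (int i = 0; i < bytes.length; i++) {
        // Require a complete \xNN sequence before reading ahead.
        if (bytes[i] == '\\' && i + 3 < bytes.length && bytes[i + 1] == 'x') {
            // \xc4 => c4 => 196
            charValue = Integer.parseInt(pythonStr.substring(i + 2, i + 4), 16);
            decodedBytes.put((byte) charValue);
            i += 3;
        } else {
            decodedBytes.put(bytes[i]);
        }
    }
    // Decode only the bytes actually written, not the unused tail of the buffer.
    return new String(decodedBytes.array(), 0, decodedBytes.position(), "UTF-8");
}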

You are returning the str() representation of the Python data structure, which is not JSON.
Return an actual JSON response instead; leave the values as Unicode:
if json_resp['status'] == u'OK':
    for pred in json_resp['predictions']:
        desc = pred['description']
        if u'Novi Sad' in desc or u'Нови Сад' in desc:
            obj = {
                'name': pred['description'],
                'reference': pred['reference']
            }
            json_result['suggestions'].append(obj)
return json.dumps(json_result)
Now Java does not have to interpret Python escape codes, and can parse valid JSON instead.

Python's string-escape encoding writes each non-ASCII byte of the UTF-8 encoded string as a \xVV value, where VV is the hex value of that byte. This is very different from Java Unicode escapes, which use a single \uVVVV per UTF-16 code unit, where VVVV is the hex value of that code unit.
Consider:
\xc4\x91
In decimal, those hex values are:
196 145
then (in Java):
byte[] bytes = { (byte) 196, (byte) 145 };
System.out.println("result: " + new String(bytes, "UTF-8"));
prints:
result: đ
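To make the difference concrete, here is a minimal sketch comparing the two notations for the same character. The class name EscapeComparison and the helper fromUtf8 are made up for this illustration; only the byte values come from the example above.

```java
import java.nio.charset.StandardCharsets;

class EscapeComparison {
    // Decodes raw UTF-8 bytes (what Python's \xVV escapes describe) into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \xc4\x91 in the Python notation: two UTF-8 bytes.
        byte[] pythonBytes = { (byte) 0xC4, (byte) 0x91 };
        String decoded = fromUtf8(pythonBytes);

        // The same character written as a Java escape is one UTF-16 code unit, U+0111.
        System.out.println(decoded.equals("\u0111")); // true
        System.out.println(decoded.length());         // 1
    }
}
```

So two \xVV bytes on the Python side collapse into one char on the Java side once they are decoded as UTF-8.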

Related

Convert Unicode to UTF-8

My question may already have been answered on Stack Overflow, but I can't find it.
My problem is simple: I request data via an API, and the data returned contains Unicode escapes, for example:
"SpecialOffer":[{"title":"Offre Vente Priv\u00e9e 1 jour 2019 2020"}]
I need to convert "\u00e9" to "é".
I can't use "replaceAll", because I cannot know in advance all the characters that will appear.
I tried this:
byte[] utf8 = reponse.getBytes("UTF-8");
String string = new String(utf8, "UTF-8");
But the string still contains "\u00e9e".
I also tried this:
byte[] utf8 = reponse.getBytes(StandardCharsets.UTF_8);
String string = new String(utf8, StandardCharsets.UTF_8);
And this:
string = string.replace("\\\\", "\\");
byte[] utf8Bytes = null;
String convertedString = null;
utf8Bytes = string.getBytes("UTF8"); // or StandardCharsets.UTF_8, "UTF-8", "UTF_8"
convertedString = new String(utf8Bytes, "UTF8"); // or StandardCharsets.UTF_8, "UTF-8", "UTF_8"
System.out.println(convertedString);
return convertedString;
But it doesn't work either.
I tested other methods, but I deleted everything that didn't work, so I can't show them here.
I am sure there is a very simple method, but I am probably not searching the internet with the right vocabulary. Can you help me, please?
I wish you a very good day, and thank you very much in advance.
The String.getBytes method requires a valid Charset [1].
According to the Javadoc [2], the charsets that every Java implementation must support are:
US-ASCII
ISO-8859-1
UTF-8
UTF-16BE
UTF-16LE
UTF-16
So you need to use UTF-8 in the getBytes method.
[1] https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset-
[2] https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html
You can use a small JSON library:
String jsonstring = "{\"SpecialOffer\":[{\"title\":\"Offre Vente Priv\\u00e9e 1 jour 2019 2020\"}]}";
JsonValue json = JsonParser.parse(jsonstring);
String value = json.asObject()
.first("SpecialOffer").asArray().get(0)
.asObject().first("title").asStringLiteral().stringValue();
System.out.println(" result: " + value);
or
String text = "Offre Vente Priv\\u00e9e 1 jour 2019 2020";
System.out.println(" result: " + JsonEscaper.unescape(text));
The problem, which I had not seen, is that the API did not return "\u00e9e" as a Unicode character but "\\u00e9e" as a literal character sequence (a backslash followed by u00e9e)!
So I have to rebuild all the Unicode characters myself, and then everything works fine!
int i=0, len=s.length();
char c;
StringBuffer sb = new StringBuffer(len);
while (i < len) {
    c = s.charAt(i++);
    if (c == '\\') {
        if (i < len) {
            c = s.charAt(i++);
            if (c == 'u') {
                // TODO: check that 4 more chars exist and are all hex digits
                c = (char) Integer.parseInt(s.substring(i, i+4), 16);
                i += 4;
            } // add other cases here as desired...
        }
    } // fall through: \ escapes itself, quotes any character but u
    sb.append(c);
}
return sb.toString();
I found this solution here:
Java: How to create unicode from string "\u00C3" etc
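The TODO in the loop above (checking that four hex digits follow \u) can be handled as in the following sketch. The class name UnicodeUnescaper and the helper isHex4 are hypothetical. Unlike the snippet above, this version leaves malformed or unrelated backslash sequences untouched instead of dropping the backslash; that is a design choice of this sketch, not part of the quoted solution.

```java
class UnicodeUnescaper {
    // Replaces every well-formed literal \\uXXXX sequence with the character it names.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length() && s.charAt(i) == 'u' && isHex4(s, i + 1)) {
                // s.charAt(i) is 'u'; the four hex digits start right after it.
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 5;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // True when four hex digits are available starting at index start.
    static boolean isHex4(String s, int start) {
        if (start + 4 > s.length()) {
            return false;
        }
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The Java literal below contains a real backslash followed by u00e9.
        System.out.println(unescape("Offre Vente Priv\\u00e9e 1 jour 2019 2020"));
    }
}
```

Surrogate pairs written as two consecutive \uXXXX escapes also work, because each escape is appended as one UTF-16 code unit.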

Need help in converting EBCDIC to Hexadecimal

I am writing a Hive UDF to convert EBCDIC characters to hexadecimal.
The EBCDIC characters are stored in a Hive table. Currently I am able to convert them, but a few characters are ignored during conversion.
Example:
This is the EBCDIC value stored in table:
AGNSAñA¦ûÃÃÂõÂjÂq  à ()
Converted hexadecimal:
c1c7d5e2000a5cd4f6ef99187d07067203a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
What I want as output:
c1c7d5e200010a5cd4f6ef99187d0706720103a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
The conversion ignores the EBCDIC characters below:
01 - start of heading
10 - an escape
15 - new line
Below is the code I have tried so far:
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.hive.ql.exec.UDF;

public class EbcdicToHex extends UDF {
    public String evaluate(String edata) throws UnsupportedEncodingException {
        byte[] ebcdiResult = getEBCDICRawData(edata);
        String hexResult = getHexData(ebcdiResult);
        return hexResult;
    }
    public byte[] getEBCDICRawData (String edata) throws UnsupportedEncodingException {
        byte[] result = null;
        String ebcdic_encoding = "IBM-037";
        result = edata.getBytes(ebcdic_encoding);
        return result;
    }
    public String getHexData(byte[] result){
        String output = asHex(result);
        return output;
    }
    public static String asHex(byte[] buf) {
        char[] HEX_CHARS = "0123456789abcdef".toCharArray();
        char[] chars = new char[2 * buf.length];
        for (int i = 0; i < buf.length; ++i) {
            chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
            chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
        }
        return new String(chars);
    }
}
During conversion, it ignores a few EBCDIC characters. How can I make it convert them to hexadecimal as well?
I think the problem lies elsewhere. I created a small test case that builds a String from the 3 bytes you say are ignored, and in my output they do seem to be converted correctly:
private void run(String[] args) throws Exception {
    byte[] bytes = new byte[] {0x01, 0x10, 0x15};
    String str = new String(bytes, "IBM-037");
    byte[] result = getEBCDICRawData(str);
    for(byte b : result) {
        System.out.print(Integer.toString(( b & 0xff ) + 0x100, 16).substring(1) + " ");
    }
    System.out.println();
    System.out.println(evaluate(str));
}
Output:
01 10 15
011015
Based on this it seems both your getEBCDICRawData and evaluate method seem to be working correctly and makes me believe your String value may already be incorrect to start with. Could it be the String is already missing those characters? Or perhaps a long shot, but maybe the charset is incorrect? There are different EBCDIC charsets, so maybe the String is composed using a different one? Although I doubt this would make much difference for the 01, 10 and 15 bytes.
As a final remark, but probably unrelated to your problem, I usually prefer to use the encode/decode functions on the charset object to do such conversions:
String charset = "IBM-037";
Charset cs = Charset.forName(charset);
ByteBuffer bb = cs.encode(str);
CharBuffer cb = cs.decode(bb);
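As a rough sketch of how the Charset encode/decode calls can be combined with a hex dump, the example below decodes raw bytes to characters, encodes them back, and prints the result as hex. The class and method names (CharsetRoundTrip, roundTripHex, toHex) are invented for this illustration. Whether "IBM037" is available depends on the JDK's installed charsets, so the example checks Charset.isSupported first.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip {
    // Formats each byte as two lowercase hex digits.
    static String toHex(byte[] buf) {
        StringBuilder sb = new StringBuilder(2 * buf.length);
        for (byte b : buf) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decodes raw bytes to characters, encodes them back, and returns the hex of the result.
    static String roundTripHex(byte[] raw, Charset cs) {
        CharBuffer chars = cs.decode(ByteBuffer.wrap(raw));
        ByteBuffer encoded = cs.encode(chars);
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return toHex(out);
    }

    public static void main(String[] args) {
        byte[] raw = { 0x01, 0x10, 0x15 };
        // EBCDIC charsets are optional in a Java runtime, so check before using one.
        if (Charset.isSupported("IBM037")) {
            System.out.println(roundTripHex(raw, Charset.forName("IBM037")));
        } else {
            System.out.println("IBM037 is not available in this runtime");
        }
    }
}
```

If the printed hex differs from the input bytes, the charset does not round-trip those values, which would point to the decoding step as the place where data is lost.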

How to convert this hex string to unicode in java?

I get the result below from a web service:
"\\x52\\x50\\x1F\\x1F\\x44\\x46\\x57\\x47"
I need to get the string as Unicode characters, which I think would be:
"\u0052\u0050\u001F\u001F\u0044\u0046\u0057\u0047"
i.e. "RPDFWG" (the two \u001F characters are unprintable unit separators)
I cannot use replace("\\x", "\u00"); because the compiler says "\u00" is not a valid Unicode escape.
This code works for me:
try {
    String orig = "\\x52\\x50\\x1F\\x1F\\x44\\x46\\x57\\x47";
    byte[] bytes = new byte[orig.length() / 4];
    for (int i = 0; i < orig.length(); i += 4) {
        bytes[i / 4] = (byte) Integer.parseInt(orig.substring(i + 2, i + 4), 16);
    }
    System.out.println(new String(bytes, "UTF-8"));
}
catch (Exception e) {
    e.printStackTrace();
}
You might want to change the encoding to ISO-8859-1, or just plain ASCII -- I can't tell from your example what encoding is relevant here.

Encode/decode hex to utf-8 string

I'm working on a web application that accepts all UTF-8 characters, including Greek characters. The following are strings that I want to convert to hex.
These strings in different languages do not work with my current code:
ЫЙБПАРО Εγκυκλοπαίδεια éaös Größe Größe
These are the hex conversions produced by the JavaScript function below:
42b41941141f41042041e 3953b33ba3c53ba3bb3bf3c03b13af3b43b53b93b1 e961f673 4772c3192c2b6c3192c217865 4772f6df65
The JavaScript function used to convert the strings above to hex:
function encode(string) {
    var str= "";
    var length = string.length;
    for (var i = 0; i < length; i++){
        str+= string.charCodeAt(i).toString(16);
    }
    return str;
}
The conversion itself gives no error, but on the Java side I'm unable to parse such a string. I used the following Java code to convert the hex:
public String HexToString(String hex){
    StringBuilder finalString = new StringBuilder();
    StringBuilder tempString = new StringBuilder();
    for( int i=0; i<hex.length()-1; i+=2 ){
        String output = hex.substring(i, (i + 2));
        int decimal = Integer.parseInt(output, 16);
        finalString.append((char)decimal);
        tempString.append(decimal);
    }
    return finalString.toString();
}
It throws a parse exception while parsing the hex string above.
Please suggest a solution.
JavaScript strings are sequences of 16-bit UTF-16 code units, so charCodeAt may return any number between 0 and 65535. Encoding each value to hex yields between 1 and 4 digits, and if you simply concatenate them, the receiving side has no way to tell where one character ends and the next begins.
You can work around this by adding delimiters to your encoded string:
function encode(string) {
    return string.split("").map(function(c) {
        return c.charCodeAt(0).toString(16);
    }).join('-');
}
alert(encode('größe Εγκυκλοπαίδεια 维'))
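On the Java side, the dash-delimited output of that encode function could be decoded with something like the sketch below. The class name DelimitedHexDecoder is made up for this example, and it assumes the input uses '-' as the delimiter and contains only valid hex groups.

```java
class DelimitedHexDecoder {
    // Turns "67-72-f6-df-65" back into the original string, one UTF-16 code unit per group.
    static String decode(String encoded) {
        if (encoded.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (String part : encoded.split("-")) {
            // Each group is the hex value returned by charCodeAt on the JavaScript side.
            sb.append((char) Integer.parseInt(part, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("67-72-f6-df-65"));
    }
}
```

Because both JavaScript and Java strings are made of UTF-16 code units, characters outside the Basic Multilingual Plane arrive as two groups (a surrogate pair) and are reassembled correctly by appending them in order.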

How to convert Java String into byte[]?

Is there any way to convert a Java String to a byte[] (not the boxed Byte[])?
I'm trying this:
System.out.println(response.split("\r\n\r\n")[1]);
System.out.println("******");
System.out.println(response.split("\r\n\r\n")[1].getBytes().toString());
and I'm getting separate outputs. I am unable to display the 1st output as it is a gzip string.
<A Gzip String>
******
[B@38ee9f13
The second is an address. Am I doing anything wrong? I need the result in a byte[] to feed it to a gzip decompressor, which is as follows.
String decompressGZIP(byte[] gzip) throws IOException {
    java.util.zip.Inflater inf = new java.util.zip.Inflater();
    java.io.ByteArrayInputStream bytein = new java.io.ByteArrayInputStream(gzip);
    java.util.zip.GZIPInputStream gzin = new java.util.zip.GZIPInputStream(bytein);
    java.io.ByteArrayOutputStream byteout = new java.io.ByteArrayOutputStream();
    int res = 0;
    byte buf[] = new byte[1024];
    while (res >= 0) {
        res = gzin.read(buf, 0, buf.length);
        if (res > 0) {
            byteout.write(buf, 0, res);
        }
    }
    byte uncompressed[] = byteout.toByteArray();
    return (uncompressed.toString());
}
The object your method decompressGZIP() needs is a byte[].
So the basic, technical answer to the question you have asked is:
byte[] b = string.getBytes();
byte[] b = string.getBytes(Charset.forName("UTF-8"));
byte[] b = string.getBytes(StandardCharsets.UTF_8); // Java 7+ only
However, the problem you appear to be wrestling with is that this doesn't display very well. Calling toString() on an array just gives you the default Object.toString(), which is the class name followed by the object's hash code in hexadecimal. In your result [B@38ee9f13, the [B means byte[] and 38ee9f13 is the hash code (often loosely described as the memory address), separated by an @.
For display purposes you can use:
Arrays.toString(bytes);
But this will just display as a sequence of comma-separated integers, which may or may not be what you want.
To get a readable String back from a byte[], use:
String string = new String(bytes, charset); // bytes is a byte[], charset is a Charset
The Charset version is preferred because Java strings are sequences of UTF-16 code units, and the bytes you get when converting to a byte[] depend on the chosen charset. Without an explicit charset, the platform default is used, which can differ between machines.
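A small sketch of how the resulting bytes differ by charset, using only standard library calls. The class name CharsetBytesDemo and the helper describe are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetBytesDemo {
    // Returns the bytes of s in the given charset, formatted as a readable list.
    static String describe(String s, Charset cs) {
        return Arrays.toString(s.getBytes(cs));
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // the single character é
        System.out.println(describe(s, StandardCharsets.UTF_8));      // [-61, -87]
        System.out.println(describe(s, StandardCharsets.ISO_8859_1)); // [-23]
        System.out.println(describe(s, StandardCharsets.UTF_16BE));   // [0, -23]

        // Decoding with the same charset that produced the bytes restores the string.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(s)); // true
    }
}
```

The values are negative because Java bytes are signed; -61 and -87 are the bytes 0xC3 and 0xA9.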
String example = "Convert Java String";
byte[] bytes = example.getBytes();
Simply:
String abc="abcdefghight";
byte[] b = abc.getBytes();
Try using String.getBytes(). It returns a byte[] representing string data.
Example:
String data = "sample data";
byte[] byteData = data.getBytes();
You can use String.getBytes() which returns the byte[] array.
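You might want to try return new String(byteout.toByteArray(), Charset.forName("UTF-8")) in decompressGZIP(), since toByteArray() takes no arguments and the charset belongs to the String constructor.
I know I'm a little late to the party, but this works pretty neatly (our professor gave it to us). It converts a string of hex digits into bytes: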
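For reference, here is a minimal gzip round trip that keeps the compressed data as a byte[] throughout and only converts to a String after decompression. The class name GzipRoundTrip and its methods are hypothetical, and UTF-8 is assumed as the text encoding; the real response may use a different one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {
    // Compresses the UTF-8 bytes of text; the result is binary and should stay a byte[].
    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompresses gzip data and decodes the result as UTF-8 text.
    static String decompress(byte[] gzip) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzip))) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] packed = compress("Oslobo\u0111enja");
        System.out.println(decompress(packed));
    }
}
```

Note that passing compressed data through a String, as in response.split(...)[1].getBytes(), can corrupt it, because arbitrary binary bytes are not guaranteed to survive a charset decode and encode cycle. Reading the response body as bytes avoids that.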
public static byte[] asBytes (String s) {
    String tmp;
    byte[] b = new byte[s.length() / 2];
    int i;
    for (i = 0; i < s.length() / 2; i++) {
        tmp = s.substring(i * 2, i * 2 + 2);
        b[i] = (byte)(Integer.parseInt(tmp, 16) & 0xff);
    }
    return b; //return bytes
}
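I had to convert an int into its three decimal digits, one per byte (129 to 1, 2, 9):
byte[] data = new byte[4]; // only indices 1 to 3 are used below
int i1 = 129;
int i3 = (i1 / 100); // hundreds digit
i1 = i1 - i3*100;
int i2 = (i1 / 10); // tens digit
i1 = i1 - i2*10; // i1 now holds the units digit
data[1] = (byte) i1;
data[2] = (byte) i2;
data[3] = (byte) i3;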
It is not necessary to change the Java side, which passes a String parameter. You have to change the C code to receive a String rather than a pointer, and copy it inside the function:
bool DmgrGetVersion (String szVersion);
char NewszVersion [200];
strcpy (NewszVersion, szVersion.t_str ());
.t_str () applies to C++ Builder 2010.
