I am writing a Hive UDF to convert EBCDIC characters to hexadecimal.
The EBCDIC characters are stored in a Hive table. Currently I am able to convert them, but a few characters are ignored during conversion.
Example:
This is the EBCDIC value stored in table:
AGNSAñA¦ûÃÃÂõÂjÂq  à ()
Converted hexadecimal:
c1c7d5e2000a5cd4f6ef99187d07067203a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
What I want as output:
c1c7d5e200010a5cd4f6ef99187d0706720103a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
The following EBCDIC characters are not converted:
01 - start of heading
10 - escape
15 - new line
Below is the code I have tried so far:
public class EbcdicToHex extends UDF {
public String evaluate(String edata) throws UnsupportedEncodingException {
byte[] ebcdiResult = getEBCDICRawData(edata);
String hexResult = getHexData(ebcdiResult);
return hexResult;
}
public byte[] getEBCDICRawData (String edata) throws UnsupportedEncodingException {
byte[] result = null;
String ebcdic_encoding = "IBM-037";
result = edata.getBytes(ebcdic_encoding);
return result;
}
public String getHexData(byte[] result){
String output = asHex(result);
return output;
}
public static String asHex(byte[] buf) {
char[] HEX_CHARS = "0123456789abcdef".toCharArray();
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
}
During conversion, it's ignoring a few EBCDIC characters. How can I get them converted to hexadecimal as well?
I think the problem lies elsewhere. I created a small test case that builds a String from the 3 bytes you claim are ignored, and in my output they are converted correctly:
private void run(String[] args) throws Exception {
byte[] bytes = new byte[] {0x01, 0x10, 0x15};
String str = new String(bytes, "IBM-037");
byte[] result = getEBCDICRawData(str);
for(byte b : result) {
System.out.print(Integer.toString(( b & 0xff ) + 0x100, 16).substring(1) + " ");
}
System.out.println();
System.out.println(evaluate(str));
}
Output:
01 10 15
011015
Based on this, both your getEBCDICRawData and evaluate methods seem to be working correctly, which makes me believe your String value may already be incorrect to start with. Could it be that the String is already missing those characters? Or, perhaps a long shot, maybe the charset is incorrect? There are different EBCDIC charsets, so maybe the String was composed using a different one, although I doubt this would make much difference for the 01, 10 and 15 bytes.
As a final remark, but probably unrelated to your problem, I usually prefer to use the encode/decode functions on the charset object to do such conversions:
String charset = "IBM-037";
Charset cs = Charset.forName(charset);
ByteBuffer bb = cs.encode(str);
CharBuffer cb = cs.decode(bb);
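To make the encode/decode remark a bit more concrete, here is a minimal sketch that decodes raw bytes with a named charset, encodes the text back, and prints the result as hex. The class and method names are made up for this example, and it assumes your JDK ships the IBM037 charset (it falls back to ISO-8859-1 if not).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class CharsetRoundTrip {
    // Decodes raw bytes to text with the given charset, encodes the text back,
    // and returns the resulting bytes as lowercase hex.
    public static String roundTripHex(byte[] raw, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        CharBuffer cb = cs.decode(ByteBuffer.wrap(raw));
        ByteBuffer bb = cs.encode(cb);
        StringBuilder hex = new StringBuilder();
        while (bb.hasRemaining()) {
            hex.append(String.format("%02x", bb.get() & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x10, 0x15};
        // Assumption: IBM037 is available; otherwise use a charset every JVM provides.
        String name = Charset.isSupported("IBM037") ? "IBM037" : "ISO-8859-1";
        System.out.println(name + ": " + roundTripHex(raw, name));
    }
}
```

If the hex that comes back differs from the bytes you started with, the charset is not a lossless fit for your data, and the String should not be used as the carrier for the raw bytes.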
I am working on a web application that accepts all UTF-8 characters, including Greek characters. The following are strings that I want to convert to hex.
These strings, in different languages, do not work with my current code:
ЫЙБПАРО Εγκυκλοπαίδεια éaös Größe Größe
These are the hex conversions produced by the JavaScript function shown below:
42b41941141f41042041e 3953b33ba3c53ba3bb3bf3c03b13af3b43b53b93b1 e961f673 4772c3192c2b6c3192c217865 4772f6df65
JavaScript function that converts the above strings to hex:
function encode(string) {
var str= "";
var length = string.length;
for (var i = 0; i < length; i++){
str+= string.charCodeAt(i).toString(16);
}
return str;
}
The conversion itself does not raise any error, but on the Java side I am unable to parse such a string. I used the following Java code to convert the hex back:
public String HexToString(String hex){
StringBuilder finalString = new StringBuilder();
StringBuilder tempString = new StringBuilder();
for( int i=0; i<hex.length()-1; i+=2 ){
String output = hex.substring(i, (i + 2));
int decimal = Integer.parseInt(output, 16);
finalString.append((char)decimal);
tempString.append(decimal);
}
return finalString.toString();
}
It throws a parse exception while parsing the hex string above.
Please suggest a solution.
JavaScript works with 16-bit Unicode code units, so charCodeAt may return any number between 0 and 65535. When you encode that to hex you get strings of 1 to 4 characters, and if you simply concatenate these, there's no way for the other party to find out which characters have been encoded.
You can work around this by adding delimiters to your encoded string:
function encode(string) {
return string.split("").map(function(c) {
return c.charCodeAt(0).toString(16);
}).join('-');
}
alert(encode('größe Εγκυκλοπαίδεια 维'))
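On the Java side, a matching decoder could look roughly like this. It is only a sketch: it assumes the hyphen-delimited format produced by the encode function above, and the class name DelimitedHexDecoder is made up for the example.

```java
public class DelimitedHexDecoder {
    // Splits on '-' and turns every hex group back into one UTF-16 code unit.
    public static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        if (encoded.isEmpty()) {
            return "";
        }
        for (String part : encoded.split("-")) {
            // Integer.parseInt throws NumberFormatException on malformed groups
            out.append((char) Integer.parseInt(part, 16));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 67-72-f6-df-65 is what the JavaScript encoder yields for "größe"
        System.out.println(decode("67-72-f6-df-65"));
    }
}
```

Because charCodeAt returns UTF-16 code units and a Java char is also a UTF-16 code unit, casting each parsed value to char should reproduce the original string, including surrogate pairs.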
How do I get a proper Java string from the Python-created string 'Oslobo\xc4\x91enja'?
How do I decode it? I think I've tried everything and looked everywhere; I've been stuck on this problem for 2 days. Please help!
Here is the Python web service method that returns the JSON, which the Java client parses with Google Gson.
def list_of_suggestions(entry):
    input = entry.encode('utf-8')
    """Returns list of suggestions from auto-complete search"""
    json_result = { 'suggestions': [] }
    resp = urllib2.urlopen('https://maps.googleapis.com/maps/api/place/autocomplete/json?input=' + urllib2.quote(input) + '&location=45.268605,19.852924&radius=3000&components=country:rs&sensor=false&key=blahblahblahblah')
    # make json object from response
    json_resp = json.loads(resp.read())
    if json_resp['status'] == u'OK':
        for pred in json_resp['predictions']:
            if pred['description'].find('Novi Sad') != -1 or pred['description'].find(u'Нови Сад') != -1:
                obj = {}
                obj['name'] = pred['description'].encode('utf-8').encode('string-escape')
                obj['reference'] = pred['reference'].encode('utf-8').encode('string-escape')
                json_result['suggestions'].append(obj)
    return str(json_result)
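Here is the solution on the Java client:
private String python2JavaStr(String pythonStr) throws UnsupportedEncodingException {
    // The escaped string is expected to be plain ASCII, so byte index == char index
    byte[] bytes = pythonStr.getBytes("US-ASCII");
    ByteBuffer decodedBytes = ByteBuffer.allocate(bytes.length);
    for (int i = 0; i < bytes.length; i++) {
        // need room for the full \xVV sequence before reading ahead
        if (bytes[i] == '\\' && i + 3 < bytes.length && bytes[i + 1] == 'x') {
            // \xc4 => c4 => 196
            int charValue = Integer.parseInt(pythonStr.substring(i + 2, i + 4), 16);
            decodedBytes.put((byte) charValue);
            i += 3;
        } else {
            decodedBytes.put(bytes[i]);
        }
    }
    // Use only the bytes actually written; the rest of the buffer is zero padding
    return new String(decodedBytes.array(), 0, decodedBytes.position(), "UTF-8");
}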
You are returning the string version of the python data structure.
Return an actual JSON response instead; leave the values as Unicode:
if json_resp['status'] == u'OK':
    for pred in json_resp['predictions']:
        desc = pred['description']
        if u'Novi Sad' in desc or u'Нови Сад' in desc:
            obj = {
                'name': pred['description'],
                'reference': pred['reference']
            }
            json_result['suggestions'].append(obj)
return json.dumps(json_result)
Now Java does not have to interpret Python escape codes, and can parse valid JSON instead.
Python escapes non-ASCII characters in a byte string by writing each UTF-8 byte as a \xVV value, where VV is the hex value of the byte. This is very different from Java's Unicode escapes, which are a single \uVVVV per UTF-16 code unit, where VVVV is the hex value of that code unit.
Consider:
\xc4\x91
In decimal, those hex values are:
196 145
then (in Java):
byte[] bytes = { (byte) 196, (byte) 145 };
System.out.println("result: " + new String(bytes, "UTF-8"));
prints:
result: đ
How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f) using Java?
EDIT:
I guess the question is not very clear... Basically, what I want is this:
Given the string s="blalbla", I want to get the string "\uXXXX\uYYYY"
You will need to extract each code point/unit from the String and encode it yourself. The following works for all Strings even if the individual linguistic characters within the String are composed of digraphs or ligatures.
public String getUnicodeEscapes(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char codeUnit = aString.charAt(ctr);
String hexString = Integer.toHexString(codeUnit);
String padAmount = "0000".substring(hexString.length());
buffer.append("\\u");
buffer.append(padAmount);
buffer.append(hexString);
}
return buffer.toString();
}
else
{
return null;
}
}
The above produces output as dictated by the Java Language Specification on Unicode escapes, i.e. it produces output of the form \uxxxx for each UTF-16 code unit. It addresses supplementary characters by producing a pair of code units represented as \uxxxx\uyyyy.
The originally posted code has been modified to produce Unicode codepoints in the format U+FFFFF:
public String getUnicodeCodepoints(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char ch = aString.charAt(ctr);
if (Character.isLowSurrogate(ch))
{
continue;
}
else
{
int codePoint = aString.codePointAt(ctr);
String hexString = Integer.toHexString(codePoint);
// Pad to at least four hex digits; codepoints above U+FFFF already have five or six.
String padAmount = hexString.length() < 4 ? "0000".substring(hexString.length()) : "";
buffer.append(" U+");
buffer.append(padAmount);
buffer.append(hexString);
}
}
return buffer.toString();
}
else
{
return null;
}
}
The grunt work is done by the String.codePointAt() method, which returns the Unicode codepoint at a particular index. For a String instance containing combining characters, the length of the String will not be the number of visible characters but the number of UTF-16 code units. For example, क and ् combine to form क् in Devanagari, and the above function will rightfully return U+0915 U+094d without any fuss, as String.length() returns 2 for the combined character. For Strings with supplementary characters, each character yields a single codepoint: 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 (the page may not display this String literal correctly, but you can copy it just fine; it is "Javascript" written using the supplementary characters for Mathematical Alphanumeric Symbols) will return U+1d4a5 U+1d4b6 U+1d4cb U+1d4b6 U+1d4c8 U+1d4b8 U+1d4c7 U+1d4be U+1d4c5 U+1d4c9.
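On Java 8 or later, the same listing can probably be written more compactly with String.codePoints(). This is just a sketch of that alternative, not the code from the answer above; the class name CodePointDump is invented for the example.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Formats every Unicode code point as U+ followed by at least four hex digits.
    public static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04x", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Devanagari letter KA followed by the virama sign
        System.out.println(toCodePoints("\u0915\u094d"));
        // One supplementary character (U+1D4A5) written as a surrogate pair
        System.out.println(toCodePoints("\ud835\udca5"));
    }
}
```

codePoints() joins surrogate pairs into one value, so there is no need to skip low surrogates by hand.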
public static void main(String[] args) {
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f"));
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
System.out.println(s);
} catch (CharacterCodingException e) {
e.printStackTrace();
}
}
I'm not aware of a built-in solution, so:
StringBuilder builder = new StringBuilder();
for(int i=0; i<yourString.length(); i++) {
builder.append(String.format("\\u%04x", yourString.charAt(i)));
}
String encoded = builder.toString();
Edit: sorry, I thought you wanted to get the String encoded as \uXXXX expressions ...
You didn't say what encoding you are after, but based on the tag I'm assuming you want the UTF-8 encoding. Here's how:
byte[] utf8 =
"\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f".getBytes("UTF-8");
You can then write a simple loop to output the bytes in utf8 in hexadecimal or decimal ... or do something else with them.
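Such a loop might look like the following sketch. The class name Utf8HexDump is made up here, and StandardCharsets.UTF_8 is used instead of the "UTF-8" string only to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HexDump {
    // Returns the UTF-8 bytes of the input as a lowercase hex string.
    public static String toHex(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(utf8.length * 2);
        for (byte b : utf8) {
            // mask to 0..255 so negative bytes do not print as ffffffxx
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First two letters of the example string
        System.out.println(toHex("\u0421\u043b"));
    }
}
```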
System.out.println ("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f");
works like a charm for me:
Служебная
I am expecting
System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8"));
to output:
Hello%20World
(20 is ASCII Hex code for space)
However, what I get is:
Hello+World
Am I using the wrong method? What is the correct method I should be using?
This behaves as expected. The URLEncoder implements the HTML Specifications for how to encode URLs in HTML forms.
From the javadocs:
This class contains static methods for
converting a String to the
application/x-www-form-urlencoded MIME
format.
and from the HTML Specification:
application/x-www-form-urlencoded
Forms submitted with this content type
must be encoded as follows:
Control names and values are escaped. Space characters are replaced
by `+'
You will have to replace it, e.g.:
System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8").replace("+", "%20"));
A space is encoded as %20 in URLs, and as + in submitted form data (content type application/x-www-form-urlencoded). You need the former.
Using Guava:
dependencies {
compile 'com.google.guava:guava:23.0'
// or, for Android:
compile 'com.google.guava:guava:23.0-android'
}
You can use UrlEscapers:
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
Don't use String.replace; it would only fix the space. Use a library instead.
This class performs application/x-www-form-urlencoded-type encoding rather than percent-encoding, so replacing the space with + is correct behaviour.
From javadoc:
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
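The quoted rules can be checked with a few calls. This is a small sketch (the FormEncodingDemo class and its helper are invented for the example) that wraps URLEncoder.encode so the checked exception does not get in the way.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingDemo {
    // Thin wrapper around URLEncoder.encode using UTF-8.
    public static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a-b_c.d*e"));   // letters and . - * _ stay as they are
        System.out.println(formEncode("Hello World")); // the space becomes +
        System.out.println(formEncode("f&g"));         // & becomes %26
        System.out.println(formEncode("\u00e9"));      // two UTF-8 bytes, two %xy groups
    }
}
```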
Encode Query params
org.apache.commons.httpclient.util.URIUtil
URIUtil.encodeQuery(input);
OR if you want to escape chars within URI
// requires: import java.nio.charset.StandardCharsets;
public static String escapeURIPathParam(String input) {
    StringBuilder resultStr = new StringBuilder();
    // Work on the UTF-8 bytes so non-ASCII characters become valid %XX sequences
    for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
        int ch = b & 0xFF;
        if (isUnsafe(ch)) {
            resultStr.append('%');
            resultStr.append(toHex(ch / 16));
            resultStr.append(toHex(ch % 16));
        } else {
            resultStr.append((char) ch);
        }
    }
    return resultStr.toString();
}
private static char toHex(int ch) {
    return (char) (ch < 10 ? '0' + ch : 'A' + ch - 10);
}
private static boolean isUnsafe(int ch) {
    // every byte outside the ASCII range must be escaped
    if (ch >= 128)
        return true;
    return " %$&+,/:;=?#<>".indexOf(ch) >= 0;
}
Hello+World is how a browser will encode form data (application/x-www-form-urlencoded) for a GET request and this is the generally accepted form for the query part of a URI.
http://host/path/?message=Hello+World
If you sent this request to a Java servlet, the servlet would correctly decode the parameter value. Usually the only time there are issues here is if the encoding doesn't match.
Strictly speaking, there is no requirement in the HTTP or URI specs that the query part be encoded using application/x-www-form-urlencoded key-value pairs; the query part just needs to be in a form the web server accepts. In practice, this is unlikely to be an issue.
It would generally be incorrect to use this encoding for other parts of the URI (the path for example). In that case, you should use the encoding scheme as described in RFC 3986.
http://host/Hello%20World
If you want to encode URI path components, you can also use standard JDK functions, e.g.
public static String encodeURLPathComponent(String path) {
try {
return new URI(null, null, path, null).toASCIIString();
} catch (URISyntaxException e) {
// do some error handling
}
return "";
}
The URI class can also be used to encode different parts of or whole URIs.
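As a rough illustration of encoding several parts at once, the sketch below uses the seven-argument java.net.URI constructor, which quotes illegal characters in each component separately. The UriPartsDemo class and its build helper are hypothetical names for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriPartsDemo {
    // Builds a URI from unencoded parts; spaces in path and query become %20.
    public static String build(String scheme, String host, String path, String query) {
        try {
            // userInfo and fragment are left out (null); port -1 means "no port"
            return new URI(scheme, null, host, -1, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http", "host", "/Hello World", "message=Hello World"));
    }
}
```

Keep in mind that this constructor only escapes characters that are illegal in the given component. Characters such as & or + inside a query value are legal there and are left untouched, so values containing them still need to be encoded beforehand.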
I've just been struggling with this on Android too and stumbled upon Uri.encode(String, String). Although it is specific to Android (android.net.Uri), it might be useful to some.
static String encode(String s, String allow)
https://developer.android.com/reference/android/net/Uri.html#encode(java.lang.String, java.lang.String)
The other answers present either a manual string replacement, URLEncoder (which actually encodes for the HTML form format), Apache's abandoned URIUtil, or Guava's UrlEscapers. The last one is fine, except it doesn't provide a decoder.
Apache Commons Codec provides URLCodec, which both encodes and decodes. Note that it implements the www-form-urlencoded scheme, so it also writes a space as + rather than %20.
String encoded = new URLCodec().encode(str);
String decoded = new URLCodec().decode(encoded);
If you are already using Spring, you can also opt to use its UriUtils class as well.
Although this question is quite old, here is a quick response:
Spring provides UriUtils. With it you can specify how to encode and which part of a URI the value belongs to, e.g.
encodePathSegment
encodePort
encodeFragment
encodeUriVariables
....
I use them because we are already using Spring, i.e. no additional library is required!
If you are using jetty then org.eclipse.jetty.util.URIUtil will solve the issue.
String encoded_string = URIUtil.encodePath(not_encoded_string).toString();
This worked for me
// encode returns a String; depending on the Tomcat version it may also take a charset argument
String encoded = new org.apache.catalina.util.URLEncoder().encode("MY URL");
It's not a one-liner, but you can use:
URL url = new URL("https://some-host.net/dav/files/selling_Rosetta Stone Case Study.png.aes");
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
System.out.println(uri.toString());
This will give you an output:
https://some-host.net/dav/files/selling_Rosetta%20Stone%20Case%20Study.png.aes
"+" is correct. If you really need %20, then replace the Plusses yourself afterwards.
Warning: This answer is heavily disputed (+8 vs. -6), so take this with a grain of salt.
I was already using Feign, so its UriUtils was available to me, but Spring's UriUtils was not.
<!-- https://mvnrepository.com/artifact/io.github.openfeign/feign-core -->
<dependency>
<groupId>io.github.openfeign</groupId>
<artifactId>feign-core</artifactId>
<version>11.8</version>
</dependency>
My Feign test code:
import feign.template.UriUtils;
System.out.println(UriUtils.encode("Hello World"));
Outputs:
Hello%20World
As the class suggests, it encodes URIs and not URLs but the OP asked about URIs and not URLs.
System.out.println(UriUtils.encode("https://some-host.net/dav/files/selling_Rosetta Stone Case Study.png.aes"));
Outputs:
https%3A%2F%2Fsome-host.net%2Fdav%2Ffiles%2Fselling_Rosetta%20Stone%20Case%20Study.png.aes
Try below approach:
Add a new dependency
<!-- https://mvnrepository.com/artifact/org.apache.tomcat/tomcat-catalina -->
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-catalina</artifactId>
<version>10.0.13</version>
</dependency>
Now do as follows:
String str = "Hello+World"; // For "Hello World", decoder is not required
// import java.net.URLDecoder;
String newURL = URLDecoder.decode(str, StandardCharsets.UTF_8);
// import org.apache.catalina.util.URLEncoder;
System.out.println(URLEncoder.DEFAULT.encode(newURL, StandardCharsets.UTF_8));
You'll get the output as:
Hello%20World
Check out the java.net.URI class.
Use MyUrlEncode.URLencoding(String url, String enc) to handle the problem:
public class MyUrlEncode {
static BitSet dontNeedEncoding = null;
static final int caseDiff = ('a' - 'A');
static {
dontNeedEncoding = new BitSet(256);
int i;
for (i = 'a'; i <= 'z'; i++) {
dontNeedEncoding.set(i);
}
for (i = 'A'; i <= 'Z'; i++) {
dontNeedEncoding.set(i);
}
for (i = '0'; i <= '9'; i++) {
dontNeedEncoding.set(i);
}
dontNeedEncoding.set('-');
dontNeedEncoding.set('_');
dontNeedEncoding.set('.');
dontNeedEncoding.set('*');
dontNeedEncoding.set('&');
dontNeedEncoding.set('=');
}
public static String char2Unicode(char c) {
if(dontNeedEncoding.get(c)) {
return String.valueOf(c);
}
StringBuffer resultBuffer = new StringBuffer();
resultBuffer.append("%");
char ch = Character.forDigit((c >> 4) & 0xF, 16);
if (Character.isLetter(ch)) {
ch -= caseDiff;
}
resultBuffer.append(ch);
ch = Character.forDigit(c & 0xF, 16);
if (Character.isLetter(ch)) {
ch -= caseDiff;
}
resultBuffer.append(ch);
return resultBuffer.toString();
}
private static String URLEncoding(String url,String enc) throws UnsupportedEncodingException {
StringBuffer stringBuffer = new StringBuffer();
if(!dontNeedEncoding.get('/')) {
dontNeedEncoding.set('/');
}
if(!dontNeedEncoding.get(':')) {
dontNeedEncoding.set(':');
}
byte [] buff = url.getBytes(enc);
for (int i = 0; i < buff.length; i++) {
stringBuffer.append(char2Unicode((char)buff[i]));
}
return stringBuffer.toString();
}
private static String URIEncoding(String uri , String enc) throws UnsupportedEncodingException { // encode the request parameters
StringBuffer stringBuffer = new StringBuffer();
if(dontNeedEncoding.get('/')) {
dontNeedEncoding.clear('/');
}
if(dontNeedEncoding.get(':')) {
dontNeedEncoding.clear(':');
}
byte [] buff = uri.getBytes(enc);
for (int i = 0; i < buff.length; i++) {
stringBuffer.append(char2Unicode((char)buff[i]));
}
return stringBuffer.toString();
}
public static String URLencoding(String url , String enc) throws UnsupportedEncodingException {
int index = url.indexOf('?');
StringBuffer result = new StringBuffer();
if(index == -1) {
result.append(URLEncoding(url, enc));
}else {
result.append(URLEncoding(url.substring(0 , index),enc));
result.append("?");
result.append(URIEncoding(url.substring(index+1),enc));
}
return result.toString();
}
}
Am I using the wrong method? What is the correct method I should be using?
Yes. According to its specification, java.net.URLEncoder.encode was not made for converting " " to "%20":
The space character " " is converted into a plus sign "+".
Even though this is not the right method for the job, you can adapt it: System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8").replaceAll("\\+", "%20"));
Use the character set "ISO-8859-1" for URLEncoder.