I am writing a Hive UDF to convert EBCDIC characters to hexadecimal.
The EBCDIC characters are stored in a Hive table. I am currently able to convert them, but a few characters are ignored during the conversion.
Example:
This is the EBCDIC value stored in table:
AGNSAñA¦ûÃÃÂõÂjÂq  à ()
Converted hexadecimal:
c1c7d5e2000a5cd4f6ef99187d07067203a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
What I want as output:
c1c7d5e200010a5cd4f6ef99187d0706720103a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
The following EBCDIC characters are not being converted:
01 - Start of heading
10 - Data link escape
15 - New line
Below is the code I have tried so far:
public class EbcdicToHex extends UDF {
public String evaluate(String edata) throws UnsupportedEncodingException {
byte[] ebcdiResult = getEBCDICRawData(edata);
String hexResult = getHexData(ebcdiResult);
return hexResult;
}
public byte[] getEBCDICRawData (String edata) throws UnsupportedEncodingException {
byte[] result = null;
String ebcdic_encoding = "IBM-037";
result = edata.getBytes(ebcdic_encoding);
return result;
}
public String getHexData(byte[] result){
String output = asHex(result);
return output;
}
public static String asHex(byte[] buf) {
char[] HEX_CHARS = "0123456789abcdef".toCharArray();
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
}
While converting, it ignores a few EBCDIC characters. How can I get them converted to hexadecimal as well?
I think the problem lies elsewhere. I created a small test case that builds a String from the 3 bytes you say are being ignored, and in my output they are converted correctly:
private void run(String[] args) throws Exception {
byte[] bytes = new byte[] {0x01, 0x10, 0x15};
String str = new String(bytes, "IBM-037");
byte[] result = getEBCDICRawData(str);
for(byte b : result) {
System.out.print(Integer.toString(( b & 0xff ) + 0x100, 16).substring(1) + " ");
}
System.out.println();
System.out.println(evaluate(str));
}
Output:
01 10 15
011015
Based on this, both your getEBCDICRawData and evaluate methods seem to work correctly, which makes me believe your String value is already incorrect to start with. Could it be that the String is already missing those characters? Or, perhaps a long shot, maybe the charset is incorrect? There are different EBCDIC charsets, so maybe the String was composed using a different one, although I doubt this would make much difference for the 01, 10 and 15 bytes.
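One way to check whether the String arriving in the UDF already lacks the control characters is to dump its UTF-16 code units before any charset conversion. The following is only a minimal sketch; the class name CodePointDump and the sample value are made up for illustration and are not part of the original UDF:

```java
class CodePointDump {
    // Returns every UTF-16 code unit of s as four hex digits, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: two letters, a control character (code 1), one more letter.
        String sample = "AG" + (char) 1 + "N";
        System.out.println(sample.length() + " chars: " + dump(sample));
    }
}
```

If the dump of the value read from the table shows no 0001 or 0010 entries, the characters were lost before evaluate() was called, and the conversion code is not the culprit.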
As a final remark, but probably unrelated to your problem, I usually prefer to use the encode/decode functions on the charset object to do such conversions:
String charset = "IBM-037";
Charset cs = Charset.forName(charset);
ByteBuffer bb = cs.encode(str);
CharBuffer cb = cs.decode(bb);
I have text with contents
12 13 14
The text has 8 spaces between values 12 and 13 and 13 and 14
My Java method receives the text as an InputStream argument, stores the contents in a byte array, and then converts each byte to a character.
public class FileUpload implements RequestStreamHandler{
String fileObjKeyName = "sample1.txt";
String bucketName="";
/**
* @param args
*/
@Override
public void handleRequest(InputStream inputStream, OutputStream outputStream, Context context) throws IOException {
LambdaLogger logger = context.getLogger();
byte[] bytes = IOUtils.toByteArray(inputStream);
StringBuilder sb = new StringBuilder();
StringBuilder sb1 = new StringBuilder();
sb.append("[ ");
sb1.append("[ ");
for (byte b : bytes) {
sb.append(b);
char ch = (char) b;
sb1.append(ch);
}
sb.append("]");
sb1.append("] ");
logger.log(sb.toString());
logger.log(sb1.toString());
}
}
The decimal representation of each byte is printed correctly, as below:
[ 4950323232323232323249513232323232323232324952]
However, when the bytes are converted to characters, only one space (decimal value 32) appears between the values; the remaining spaces in between are skipped.
[ 12 13 14]
Can anyone suggest the reason for this?
How do you convert the bytes to a String? The result should be the same. See the code below:
public static void main(String[] args) {
byte[] bytes = "12 13 14".getBytes();
System.out.println(Arrays.toString(bytes));
String str = new String(bytes,StandardCharsets.UTF_8);
System.out.println(str);
}
Your example shows that you're using AWS, for which you will often check the results and the produced logs online, with a tool that supports HTML.
And in HTML, when you write several consecutive spaces, they are displayed as only one.
Your String object, within Java, does contain the 8 spaces. But when you give it to a logger to be eventually displayed in a webpage, the spaces are collapsed and displayed as only one.
This is easy to prove: just add the following code at the end of your method:
String s = sb1.toString();
logger.log("s length: " + s.length());
for(int i = 0; i < s.length(); i++) {
logger.log("s[" + i + "]: " + s.charAt(i));
}
It demonstrates the length and exact content of the String. If you're not seeing that exact content when displaying the String, it is the fault of the tool that displays it.
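Another way to confirm this without logging one character per line is to make the spaces visible before logging. This is a small sketch under the assumption that the log viewer leaves non-space characters alone; the class name VisibleWhitespace and the choice of a middle dot as the marker are illustrative only:

```java
class VisibleWhitespace {
    // Replaces each space with a middle dot so that a viewer which collapses
    // whitespace can no longer hide how many spaces there were.
    static String reveal(String s) {
        return s.replace(' ', '\u00b7');
    }

    public static void main(String[] args) {
        String text = "12" + "        " + "13" + "        " + "14"; // 8 spaces between the values
        System.out.println(reveal(text));
        System.out.println("length: " + text.length());
    }
}
```

In the Lambda handler, the same call could wrap the value passed to logger.log(), for example logger.log(reveal(sb1.toString())).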
How to get proper Java string from Python created string 'Oslobo\xc4\x91enja'?
How do I decode it? I think I've tried everything and looked everywhere; I've been stuck on this problem for 2 days. Please help!
Here is the Python web service method that returns the JSON, which the Java client parses with Google Gson.
def list_of_suggestions(entry):
    """Returns list of suggestions from auto-complete search"""
    input = entry.encode('utf-8')
    json_result = { 'suggestions': [] }
    resp = urllib2.urlopen('https://maps.googleapis.com/maps/api/place/autocomplete/json?input=' + urllib2.quote(input) + '&location=45.268605,19.852924&radius=3000&components=country:rs&sensor=false&key=blahblahblahblah')
    # make json object from response
    json_resp = json.loads(resp.read())
    if json_resp['status'] == u'OK':
        for pred in json_resp['predictions']:
            if pred['description'].find('Novi Sad') != -1 or pred['description'].find(u'Нови Сад') != -1:
                obj = {}
                obj['name'] = pred['description'].encode('utf-8').encode('string-escape')
                obj['reference'] = pred['reference'].encode('utf-8').encode('string-escape')
                json_result['suggestions'].append(obj)
    return str(json_result)
Here is solution on Java client
private String python2JavaStr(String pythonStr) throws UnsupportedEncodingException {
    byte[] bytes = pythonStr.getBytes("UTF-8");
    ByteBuffer decodedBytes = ByteBuffer.allocate(bytes.length);
    for (int i = 0; i < bytes.length; i++) {
        // Only treat the backslash as an escape if a full \xVV sequence fits
        if (bytes[i] == '\\' && i + 3 < bytes.length && bytes[i + 1] == 'x') {
            // \xc4 => c4 => 196
            int charValue = Integer.parseInt(new String(bytes, i + 2, 2, "US-ASCII"), 16);
            decodedBytes.put((byte) charValue);
            i += 3;
        } else {
            decodedBytes.put(bytes[i]);
        }
    }
    // Decode only the bytes actually written, not the whole backing array
    return new String(decodedBytes.array(), 0, decodedBytes.position(), "UTF-8");
}
You are returning the string version of the python data structure.
Return an actual JSON response instead; leave the values as Unicode:
if json_resp['status'] == u'OK':
    for pred in json_resp['predictions']:
        desc = pred['description']
        if u'Novi Sad' in desc or u'Нови Сад' in desc:
            obj = {
                'name': pred['description'],
                'reference': pred['reference']
            }
            json_result['suggestions'].append(obj)
return json.dumps(json_result)
Now Java does not have to interpret Python escape codes, and can parse valid JSON instead.
Python's string-escape encoding writes each non-ASCII byte of the UTF-8 data as a \xVV escape, where VV is the hex value of the byte. This is very different from Java's Unicode escapes, which use one \uVVVV per UTF-16 code unit, where VVVV is the hex value of that code unit.
Consider:
\xc4\x91
In decimal, those hex values are:
196 145
then (in Java):
byte[] bytes = { (byte) 196, (byte) 145 };
System.out.println("result: " + new String(bytes, "UTF-8"));
prints:
result: đ
Is there any way to convert Java String to a byte[] (not the boxed Byte[])?
I am trying this:
System.out.println(response.split("\r\n\r\n")[1]);
System.out.println("******");
System.out.println(response.split("\r\n\r\n")[1].getBytes().toString());
and I'm getting two different outputs. I can't show the first output, as it is a gzip string.
<A Gzip String>
******
[B@38ee9f13
The second looks like an address. Is there anything I'm doing wrong? I need the result as a byte[] to feed it to a gzip decompressor, which is as follows.
String decompressGZIP(byte[] gzip) throws IOException {
java.util.zip.Inflater inf = new java.util.zip.Inflater();
java.io.ByteArrayInputStream bytein = new java.io.ByteArrayInputStream(gzip);
java.util.zip.GZIPInputStream gzin = new java.util.zip.GZIPInputStream(bytein);
java.io.ByteArrayOutputStream byteout = new java.io.ByteArrayOutputStream();
int res = 0;
byte buf[] = new byte[1024];
while (res >= 0) {
res = gzin.read(buf, 0, buf.length);
if (res > 0) {
byteout.write(buf, 0, res);
}
}
byte uncompressed[] = byteout.toByteArray();
return (uncompressed.toString());
}
The object your method decompressGZIP() needs is a byte[].
So the basic, technical answer to the question you have asked is:
byte[] b = string.getBytes();
byte[] b = string.getBytes(Charset.forName("UTF-8"));
byte[] b = string.getBytes(StandardCharsets.UTF_8); // Java 7+ only
However, the problem you appear to be wrestling with is that this doesn't display very well. Calling toString() will just give you the default Object.toString(), which is the class name followed by the object's hash code in hex. In your result [B@38ee9f13, the [B means byte[] and 38ee9f13 is the hash code, separated by an @.
For display purposes you can use:
Arrays.toString(bytes);
But this will just display as a sequence of comma-separated integers, which may or may not be what you want.
To get a readable String back from a byte[], use:
String string = new String(bytes, charset); // bytes is a byte[], charset is a Charset
The reason the Charset version is favoured, is that all String objects in Java are stored internally as UTF-16. When converting to a byte[] you will get a different breakdown of bytes for the given glyphs of that String, depending upon the chosen charset.
String example = "Convert Java String";
byte[] bytes = example.getBytes();
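To illustrate how the chosen charset changes the resulting bytes, here is a minimal sketch that encodes the same one-character String with three standard charsets. The class and method names are made up for this example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetBytes {
    // Encodes s with the given charset and renders the bytes as a readable list.
    static String bytesOf(String s, Charset cs) {
        return Arrays.toString(s.getBytes(cs));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent), written as an escape so the
        // encoding of the source file does not matter.
        String s = "\u00e9";
        System.out.println(bytesOf(s, StandardCharsets.UTF_8));      // [-61, -87]
        System.out.println(bytesOf(s, StandardCharsets.ISO_8859_1)); // [-23]
        System.out.println(bytesOf(s, StandardCharsets.UTF_16BE));   // [0, -23]
    }
}
```

The no-argument getBytes() uses the platform default charset, so its result can differ between machines; passing the charset explicitly avoids that.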
Simply:
String abc="abcdefghight";
byte[] b = abc.getBytes();
Try using String.getBytes(). It returns a byte[] representing string data.
Example:
String data = "sample data";
byte[] byteData = data.getBytes();
You can use String.getBytes() which returns the byte[] array.
You might want to try return new String(byteout.toByteArray(), Charset.forName("UTF-8"));
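Putting the pieces together, a decompression routine that returns readable text could look like the sketch below. It assumes the uncompressed payload is UTF-8 text, which the question does not state; the class name GzipRoundTrip and the compress helper exist only to make the example self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {
    // Helper for the demo only: gzip-compresses the UTF-8 bytes of text.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Decompresses gzip data and decodes the result as UTF-8 (an assumption).
    static String decompress(byte[] gzip) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzip))) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decompress(compress("sample data")));
    }
}
```

Note that the compressed bytes should reach the decompressor as a byte[] without ever passing through a String, because decoding arbitrary binary data as text can alter it.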
I know I'm a little late to the party, but this works pretty neatly for turning a hex string into bytes (our professor gave it to us):
public static byte[] asBytes (String s) {
String tmp;
byte[] b = new byte[s.length() / 2];
int i;
for (i = 0; i < s.length() / 2; i++) {
tmp = s.substring(i * 2, i * 2 + 2);
b[i] = (byte)(Integer.parseInt(tmp, 16) & 0xff);
}
return b; //return bytes
}
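I had to convert an int into its 3 decimal digits, one per byte (129 to 1, 2, 9):
byte[] data = new byte[4]; // index 0 is left unused, indices 1 to 3 hold the digits
int i1 = 129;
int i3 = i1 / 100; // hundreds digit: 1
i1 = i1 - i3 * 100;
int i2 = i1 / 10; // tens digit: 2
i1 = i1 - i2 * 10; // i1 now holds the units digit: 9
data[1] = (byte) i1;
data[2] = (byte) i2;
data[3] = (byte) i3;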
You do not need to change the String parameter on the Java side. Instead, change the C code to receive a String rather than a pointer, and copy it inside the function:
bool DmgrGetVersion(String szVersion);
char NewszVersion[200];
strcpy(NewszVersion, szVersion.t_str());
.t_str() is available in C++Builder 2010.
Given the following code:
String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
String result = convertToEffectiveString(tmp); // result contain now "hello\n"
Does the JDK already provide classes for doing this?
Is there a library that does this (preferably available through Maven)?
I have tried ByteArrayOutputStream with no success.
This works, but only with ASCII. If you use Unicode characters outside the ASCII range, you will have problems, because each escape value is stuffed into a single byte, while UTF-8 needs several bytes for such characters. The typecast below is safe because UTF-8 encodes every ASCII character in exactly one byte, provided the input is guaranteed to be basically ASCII (as you mention in your comments).
package sample;
import java.io.UnsupportedEncodingException;
public class UnicodeSample {
public static final int HEXADECIMAL = 16;
public static void main(String[] args) {
try {
String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
String arr[] = str.replaceAll("\\\\u"," ").trim().split(" ");
byte[] utf8 = new byte[arr.length];
int index=0;
for (String ch : arr) {
utf8[index++] = (byte)Integer.parseInt(ch,HEXADECIMAL);
}
String newStr = new String(utf8, "UTF-8");
System.out.println(newStr);
}
catch (UnsupportedEncodingException e) {
// handle the UTF-8 conversion exception
}
}
}
Here is another solution that fixes the issue of only working with ASCII characters. This will work with any unicode characters in the UTF-8 range instead of ASCII only in the first 8-bits of the range. Thanks to deceze for the questions. You made me think more about the problem and solution.
package sample;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
public class UnicodeSample {
public static final int HEXADECIMAL = 16;
public static void main(String[] args) {
try {
String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";
ArrayList<Byte> arrList = new ArrayList<Byte>();
String codes[] = str.replaceAll("\\\\u"," ").trim().split(" ");
for (String c : codes) {
int code = Integer.parseInt(c,HEXADECIMAL);
byte[] bytes = intToByteArray(code);
for (byte b : bytes) {
if (b != 0) arrList.add(b);
}
}
byte[] utf8 = new byte[arrList.size()];
for (int i=0; i<arrList.size(); i++) utf8[i] = arrList.get(i);
str = new String(utf8, "UTF-8");
System.out.println(str);
}
catch (UnsupportedEncodingException e) {
// handle the exception when
}
}
// Takes a 4 byte integer and and extracts each byte
public static final byte[] intToByteArray(int value) {
return new byte[] {
(byte) (value >>> 24),
(byte) (value >>> 16),
(byte) (value >>> 8),
(byte) (value)
};
}
}
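Since each of these escapes names one UTF-16 code unit rather than a sequence of UTF-8 bytes, another JDK-only option is to cast the parsed value straight to a char and skip the byte array entirely. The following is a sketch of that idea, not a full implementation of Java's escape rules; the class name UnicodeUnescape is made up, and other escapes such as a backslash followed by n are deliberately left untouched:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every four-digit escape with the UTF-16 code unit it names.
    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char unit = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded backslash or dollar sign literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(unit)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.print(unescape(tmp)); // prints hello followed by a newline
    }
}
```

Because the units are appended in order, two consecutive escapes that form a surrogate pair end up as one supplementary character in the result.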
Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?
If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes:
String result = "\u0068\u0065\u006c\u006c\u006f\u000a";
If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.
I'm sure there must be a better way, but using just the JDK:
public static String handleEscapes(final String s)
{
final java.util.Properties props = new java.util.Properties();
props.setProperty("foo", s);
final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
try
{
props.store(baos, null);
final String tmp = baos.toString().replace("\\\\", "\\");
props.load(new java.io.StringReader(tmp));
}
catch(final java.io.IOException ioe) // shouldn't happen
{ throw new RuntimeException(ioe); }
return props.getProperty("foo");
}
uses java.util.Properties.load(java.io.Reader) to process the backslash-escapes (after first using java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties-file, and then using replace("\\\\", "\\") to reverse the backslash-escaping of the original backslashes).
(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)
How to convert a Java String to an ASCII byte array?
Using the getBytes method, giving it the appropriate Charset (or Charset name).
Example:
String s = "Hello, there.";
byte[] b = s.getBytes(StandardCharsets.US_ASCII);
If more control is required (such as throwing an exception when a character outside 7-bit US-ASCII is encountered), then CharsetEncoder can be used:
private static byte[] strictStringToBytes(String s, Charset charset) throws CharacterCodingException {
ByteBuffer x = charset.newEncoder().onMalformedInput(CodingErrorAction.REPORT).encode(CharBuffer.wrap(s));
byte[] b = new byte[x.remaining()];
x.get(b);
return b;
}
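To see the strict behaviour in action, the sketch below wraps the same encoder calls in a check that reports whether a String is pure US-ASCII. The class and method names are hypothetical; the encoder API calls are the standard java.nio.charset ones:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictAsciiDemo {
    // Returns true if every character of s can be encoded as US-ASCII.
    static boolean isEncodableAsAscii(String s) {
        try {
            StandardCharsets.US_ASCII.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for characters that have no US-ASCII representation.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isEncodableAsAscii("Hello, there.")); // true
        System.out.println(isEncodableAsAscii("caf\u00e9"));     // false
    }
}
```

By contrast, s.getBytes(StandardCharsets.US_ASCII) never throws for such input; it silently substitutes a replacement byte.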
Before Java 7 it is possible to use: byte[] b = s.getBytes("US-ASCII");. The StandardCharsets class was introduced in Java 7; the getBytes(Charset) overload has been available since Java 6.
If you are a guava user there is a handy Charsets class:
String s = "Hello, world!";
byte[] b = s.getBytes(Charsets.US_ASCII);
Apart from not hard-coding an arbitrary charset name in your source code, it has a much bigger advantage: Charsets.US_ASCII is of Charset type (not String), so you avoid the checked UnsupportedEncodingException, which is thrown only from String.getBytes(String) and not from String.getBytes(Charset).
In Java 7 there is an equivalent StandardCharsets class.
There is only one character wrong in the code you tried:
Charset characterSet = Charset.forName("US-ASCII");
String string = "Wazzup";
byte[] bytes = String.getBytes(characterSet);
^
Notice the upper case "String". This tries to invoke a static method on the string class, which does not exist. Instead you need to invoke the method on your string instance:
byte[] bytes = string.getBytes(characterSet);
The problem with other proposed solutions is that they will either drop characters that cannot be directly mapped to ASCII, or replace them with a marker character like ?.
You might desire to have for example accented characters converted to that same character without the accent. There are a couple of tricks to do this (including building a static mapping table yourself or leveraging existing 'normalization' defined for unicode), but those methods are far from complete.
Your best bet is using the junidecode library, which cannot be complete either but incorporates a lot of experience in the most sane way of transliterating Unicode to ASCII.
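For reference, the Unicode normalization trick mentioned above can be sketched with the JDK's java.text.Normalizer. As noted, it is far from complete: it only helps with characters that decompose into a base letter plus combining marks. The class name StripAccents is made up for this example:

```java
import java.text.Normalizer;

class StripAccents {
    // Decomposes accented characters (NFD) and then removes the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // A with diaeresis and e with acute, written as escapes.
        System.out.println(stripAccents("\u00c4rger caf\u00e9")); // Arger cafe
        // d with stroke has no decomposition, so it passes through unchanged.
        System.out.println(stripAccents("\u0111").equals("\u0111")); // true
    }
}
```

Characters that survive this step, such as the d with stroke above, would still be replaced by ? when encoded as US-ASCII, which is where a transliteration library goes further.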
String s = "ASCII Text";
byte[] bytes = s.getBytes("US-ASCII");
If you happen to need this in Android and want to make it work with anything older than FroYo, you can also use EncodingUtils.getAsciiBytes():
byte[] bytes = EncodingUtils.getAsciiBytes("ASCII Text");
In my string I have Thai characters (TIS620 encoded) and German umlauts. The answer from agiles put me on the right path. Instead of .getBytes() I use now
int len = mString.length(); // Length of the string
byte[] dataset = new byte[len];
for (int i = 0; i < len; ++i) {
char c = mString.charAt(i);
dataset[i]= (byte) c;
}
Convert string to ascii values.
String test = "ABCD";
for ( int i = 0; i < test.length(); ++i ) {
char c = test.charAt( i );
int j = (int) c;
System.out.println(j);
}
I found the solution. Actually Base64 class is not available in Android. Link is given below for more information.
byte[] byteArray;
byteArray= json.getBytes(StandardCharsets.US_ASCII);
String encoded=Base64.encodeBytes(byteArray);
userLogin(encoded);
Here is the link for Base64 class: http://androidcodemonkey.blogspot.com/2010/03/how-to-base64-encode-decode-android.html
To convert String to ASCII byte array:
String s1 = "Hello World!";
byte[] byteArray = s1.getBytes(StandardCharsets.US_ASCII);
// Now byteArray is [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
To convert ASCII byte array to String:
String s2 = new String(byteArray, StandardCharsets.US_ASCII);
Try this:
/**
* @(#)demo1.java
*
*
* @author
* @version 1.00 2012/8/30
*/
import java.util.*;
public class demo1
{
Scanner s=new Scanner(System.in);
String str;
int key;
void getdata()
{
    System.out.println("please enter a string");
    str = s.next();
    System.out.println("please enter a key");
    key = s.nextInt();
}
void display()
{
    char a;
    int j;
    for (int i = 0; i < str.length(); ++i)
    {
        char c = str.charAt(i);
        j = (int) c + key; // shift the character code by the key
        a = (char) j;
        System.out.print(a);
    }
}
public static void main(String[] args)
{
    demo1 obj = new demo1();
    obj.getdata();
    obj.display();
}
}