How to parse UTF-8 representation to String in Java?


Given the following code:
String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
String result = convertToEffectiveString(tmp); // result now contains "hello\n"
Does the JDK already provide a class for doing this?
Is there a library that does this (preferably available through Maven)?
I have tried using ByteArrayOutputStream, with no success.

This works, but only with ASCII. If you use Unicode characters outside of the ASCII range, you will have problems, because each character is stuffed into a single byte, whereas UTF-8 may need several bytes per character. The cast below is safe only because you know the value will not overflow one byte when the input is guaranteed to be essentially ASCII (as you mention in your comments).
package sample;
import java.io.UnsupportedEncodingException;
public class UnicodeSample {
public static final int HEXADECIMAL = 16;
public static void main(String[] args) {
try {
String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
String arr[] = str.replaceAll("\\\\u"," ").trim().split(" ");
byte[] utf8 = new byte[arr.length];
int index=0;
for (String ch : arr) {
utf8[index++] = (byte)Integer.parseInt(ch,HEXADECIMAL);
}
String newStr = new String(utf8, "UTF-8");
System.out.println(newStr);
}
catch (UnsupportedEncodingException e) {
// handle the UTF-8 conversion exception
}
}
}
Here is another solution that fixes the ASCII-only limitation. Each \uXXXX escape names one UTF-16 code unit, which is exactly what a Java char holds, so the parsed value can be appended to a StringBuilder directly; no byte array or UTF-8 decoding step is needed. Thanks to deceze for the questions. You made me think more about the problem and solution.
package sample;
public class UnicodeSample {
    public static final int HEXADECIMAL = 16;
    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";
        // Drop each backslash-u prefix, leaving the hex codes separated by spaces
        String[] codes = str.replaceAll("\\\\u", " ").trim().split(" ");
        StringBuilder sb = new StringBuilder(codes.length);
        for (String c : codes) {
            // Each code is one UTF-16 code unit; a surrogate pair arrives as two escapes
            sb.append((char) Integer.parseInt(c, HEXADECIMAL));
        }
        System.out.println(sb);
    }
}

Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?
If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes:
String result = "\u0068\u0065\u006c\u006c\u006f\u000a";
If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.
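If pulling in a library is not an option, the \uXXXX sequences can also be decoded by hand. The following is a minimal sketch, not a full implementation of Java's escape rules: the class name UnicodeUnescaper and the method unescapeUnicode are hypothetical, and it only handles the \uXXXX form (no \n, \t, octal escapes, or repeated u characters). Unlike a split-based approach, it leaves any literal text between escapes untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescaper {
    // Matches a backslash, the letter u, and exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape with the UTF-16 code unit it names
    public static String unescapeUnicode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());                   // copy literal text
            out.append((char) Integer.parseInt(m.group(1), 16));  // decode one escape
            last = m.end();
        }
        out.append(input, last, input.length());                  // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.print(unescapeUnicode(tmp)); // prints "hello" followed by a newline
    }
}
```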

I'm sure there must be a better way, but using just the JDK:
public static String handleEscapes(final String s)
{
final java.util.Properties props = new java.util.Properties();
props.setProperty("foo", s);
final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
try
{
props.store(baos, null);
final String tmp = baos.toString().replace("\\\\", "\\");
props.load(new java.io.StringReader(tmp));
}
catch(final java.io.IOException ioe) // shouldn't happen
{ throw new RuntimeException(ioe); }
return props.getProperty("foo");
}
This uses java.util.Properties.load(java.io.Reader) to process the backslash escapes. It first uses java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties file, and then uses replace("\\\\", "\\") to reverse the backslash-escaping of the original backslashes.
(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)

Related

Need help in converting EBCDIC to Hexadecimal

I am writing a Hive UDF to convert EBCDIC characters to hexadecimal.
The EBCDIC characters are stored in a Hive table. Currently I am able to convert them, but a few characters are ignored during conversion.
Example:
This is the EBCDIC value stored in table:
AGNSAÃ±AÂ¦Ã»ÃÃÂÃÂµÂjÂqÂÂÂÂ Â Ã ()
Converted hexadecimal:
c1c7d5e2000a5cd4f6ef99187d07067203a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
What I want as output:
c1c7d5e200010a5cd4f6ef99187d0706720103a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
The following EBCDIC characters are not converted:
01 - start of heading
10 - an escape
15 - new line
Below is the code I have tried so far:
public class EbcdicToHex extends UDF {
public String evaluate(String edata) throws UnsupportedEncodingException {
byte[] ebcdiResult = getEBCDICRawData(edata);
String hexResult = getHexData(ebcdiResult);
return hexResult;
}
public byte[] getEBCDICRawData (String edata) throws UnsupportedEncodingException {
byte[] result = null;
String ebcdic_encoding = "IBM-037";
result = edata.getBytes(ebcdic_encoding);
return result;
}
public String getHexData(byte[] result){
String output = asHex(result);
return output;
}
public static String asHex(byte[] buf) {
char[] HEX_CHARS = "0123456789abcdef".toCharArray();
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
}
During conversion it ignores a few EBCDIC characters. How can I make it convert them to hexadecimal as well?

I think the problem lies elsewhere. I created a small test case in which I build a String from the 3 bytes you claim are ignored, and in my output they do seem to be converted correctly:
private void run(String[] args) throws Exception {
byte[] bytes = new byte[] {0x01, 0x10, 0x15};
String str = new String(bytes, "IBM-037");
byte[] result = getEBCDICRawData(str);
for(byte b : result) {
System.out.print(Integer.toString(( b & 0xff ) + 0x100, 16).substring(1) + " ");
}
System.out.println();
System.out.println(evaluate(str));
}
Output:
01 10 15
011015
Based on this, both your getEBCDICRawData and evaluate methods seem to be working correctly, which makes me believe your String value may already be incorrect to start with. Could it be that the String is already missing those characters? Or, perhaps a long shot, maybe the charset is incorrect? There are different EBCDIC charsets, so maybe the String was composed using a different one, although I doubt this would make much difference for the 01, 10 and 15 bytes.
As a final remark, but probably unrelated to your problem, I usually prefer to use the encode/decode functions on the charset object to do such conversions:
String charset = "IBM-037";
Charset cs = Charset.forName(charset);
ByteBuffer bb = cs.encode(str);
CharBuffer cb = cs.decode(bb);

How to get characters to a string in Java?

I have a text file which contains the "Captured Network Packets' Headers" as hexadecimal values like this...
FC-C8-97-62-88-5F-74-DE-2B-C8-C7-E5-08-00-45-00-00-28-4E-C4-40-00-80-06-BD-65-C0-A8-01-03-AD-C2-7F-38-C9-96-01-BB-F8-01-7F-5F-B6-8A-15-22-50-10-40-42-72-8C-00-00.
I need to convert them to decimal values. This is what I have so far:
InputStream input = new FileInputStream("data.txt");
OutputStream output = new FileOutputStream ("converteddata.txt");
int data = input.read();
while (data != -1)
{
char ch = (char) data;
output.write(ch);
data=input.read();
}
input.close();
output.close();
Now, my problem is: how do I get each hexadecimal string of 2 characters (such as "AD" or "5F") in order to convert it into a decimal value?
I know that C++ has the function fgetc(); I need a similar solution. Can anybody suggest a good way? (Sorry, I'm a beginner in Java but know C++ much better.)
Thanks in advance.

Try Long.parseLong("<hex string>", 16); to convert a hexadecimal string to a long value.

Try this:
String strHex = "FC-C8-97-62-88-5F-74-DE-2B-C8-C7-E5-08-00-45-00-00-28-4E-C4-40-00-80-06-BD-65-C0-A8-01-03-AD-C2-7F-38-C9-96-01-BB-F8-01-7F-5F-B6-8A-15-22-50-10-40-42-72-8C-00-00";
String[] hexParts = strHex.split("-");
for (String myStr : hexParts) {
// System.out.println(toHex(myStr));
System.out.println(toDecimal(myStr));
}
// Converts a hex string such as "AD" to its decimal value
public int toDecimal(String str){
return Integer.parseInt(str.trim(), 16 );
}
// Converts the bytes of a string to their hex representation
public String toHex(String arg) {
return String.format("%x", new BigInteger(1, arg.getBytes(/*YOUR_CHARSET?*/)));
}

Here is some sample code. Please optimize it for real-world use.
public static void main(String[] args) throws IOException {
OutputStream output = new FileOutputStream ("converteddata.txt");
BufferedReader br = new BufferedReader(new FileReader(new File("data.txt")));
String r = null;
while((r=br.readLine())!=null) {
String [] str = r.split("-");
for (String string : str) {
Long l = Long.parseLong(string.trim(), 16);
output.write(String.valueOf(l).getBytes());
output.write("\n".getBytes());
}
}
br.close();
output.close();
}

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f)

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f) using Java?
EDIT:
I guess the question is not very clear... Basically, what I want is this:
Given the string s="blalbla", I want to get the string "\uXXXX\uYYYY".

You will need to extract each code point/unit from the String and encode it yourself. The following works for all Strings even if the individual linguistic characters within the String are composed of digraphs or ligatures.
public String getUnicodeEscapes(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char codeUnit = aString.charAt(ctr);
String hexString = Integer.toHexString(codeUnit);
String padAmount = "0000".substring(hexString.length());
buffer.append("\\u");
buffer.append(padAmount);
buffer.append(hexString);
}
return buffer.toString();
}
else
{
return null;
}
}
The above produces output as dictated by the Java Language Specification on Unicode escapes, i.e. it produces output of the form \uxxxx for each UTF-16 code unit. It addresses supplementary characters by producing a pair of code units represented as \uxxxx\uyyyy.
The originally posted code has been modified to produce Unicode codepoints in the format U+FFFFF:
public String getUnicodeCodepoints(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char ch = aString.charAt(ctr);
if (Character.isLowSurrogate(ch))
{
continue;
}
else
{
int codePoint = aString.codePointAt(ctr);
String hexString = Integer.toHexString(codePoint);
// Pad to at least four hex digits; supplementary code points already have five or six
String padAmount = hexString.length() < 4 ? "0000".substring(hexString.length()) : "";
buffer.append(" U+");
buffer.append(padAmount);
buffer.append(hexString);
}
}
return buffer.toString();
}
else
{
return null;
}
}
The grunt work is done by the String.codePointAt() method, which returns the Unicode code point at a particular index. For a String composed of combining characters, String.length() is not the number of visible characters but the number of UTF-16 code units. For example, क and ् combine to form क् in Devanagari, and the above function will correctly return U+0915 U+094d, because String.length() returns 2 for the combined character. Supplementary characters are reported as a single code point each: 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 (the page may not display this String literal correctly, but you can copy it just fine; it is "Javascript" written using the supplementary Mathematical Alphanumeric Symbols) will return U+1d4a5 U+1d4b6 U+1d4cb U+1d4b6 U+1d4c8 U+1d4b8 U+1d4c7 U+1d4be U+1d4c5 U+1d4c9.
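On Java 8 or later, the same listing of code points can be sketched more compactly with String.codePoints(), which pairs up surrogates automatically. This is only an illustration: the class name CodePointDump and the method toCodePoints are hypothetical, and the output uses uppercase hex digits rather than the lowercase form shown above.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: formats every code point of the input as U+XXXX, space-separated
    public static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Devanagari KA followed by the combining VIRAMA: two code points
        System.out.println(toCodePoints("\u0915\u094d"));   // prints U+0915 U+094D
        // A supplementary character stored as a surrogate pair: one code point
        System.out.println(toCodePoints("\uD835\uDCA5"));   // prints U+1D4A5
    }
}
```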

public static void main(String[] args) {
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f"));
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
System.out.println(s);
} catch (CharacterCodingException e) {
e.printStackTrace();
}
}

I'm not aware of a built-in solution, so:
StringBuilder builder = new StringBuilder();
for(int i=0; i<yourString.length(); i++) {
builder.append(String.format("\\u%04x", yourString.charAt(i)));
}
String encoded = builder.toString();
Edit: sorry, I thought you wanted to get the String encoded to \uXXXX expressions ...

You didn't say which encoding you are after, but based on the tag I'm assuming you want the UTF-8 encoding. Here's how:
byte[] utf8 =
"\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f".getBytes("UTF-8");
You can then write a simple loop to output the bytes in utf8 in hexadecimal or decimal ... or do something else with them.
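The loop mentioned above could look roughly like the following sketch. The class name Utf8HexDump and the method utf8Hex are hypothetical; it uses StandardCharsets.UTF_8 (Java 7+) rather than the "UTF-8" name so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HexDump {
    // Hypothetical helper: renders each UTF-8 byte of the input as two lowercase hex digits
    public static String utf8Hex(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(utf8.length * 3);
        for (byte b : utf8) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The first two letters of the example word, two bytes each in UTF-8
        System.out.println(utf8Hex("\u0421\u043b")); // prints d0 a1 d0 bb
    }
}
```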

System.out.println ("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f");
works like a charm for me:
Служебная

How to convert a Java String to an ASCII byte array?

How to convert a Java String to an ASCII byte array?

Using the getBytes method, giving it the appropriate Charset (or Charset name).
Example:
String s = "Hello, there.";
byte[] b = s.getBytes(StandardCharsets.US_ASCII);
If more control is required (such as throwing an exception when a character outside 7-bit US-ASCII is encountered), a CharsetEncoder can be used:
private static byte[] strictStringToBytes(String s, Charset charset) throws CharacterCodingException {
ByteBuffer x = charset.newEncoder()
.onMalformedInput(CodingErrorAction.REPORT)
.onUnmappableCharacter(CodingErrorAction.REPORT)
.encode(CharBuffer.wrap(s));
byte[] b = new byte[x.remaining()];
x.get(b);
return b;
}
Before Java 7 it is possible to use: byte[] b = s.getBytes("US-ASCII");. The StandardCharsets class was introduced in Java 7; the getBytes(Charset) overload has been available since Java 6 and CharsetEncoder since Java 1.4.
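To make the difference between the two approaches concrete, here is a small sketch comparing them on a string that is not pure ASCII. The class and method names (AsciiEncodingDemo, lenient, strict) are made up for illustration. String.getBytes(Charset) silently substitutes the charset's replacement byte, which is '?' for US-ASCII, while an encoder configured to report errors throws instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AsciiEncodingDemo {
    // Lenient: unmappable characters are silently replaced with '?' (0x3F)
    public static byte[] lenient(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Strict: unmappable characters raise CharacterCodingException
    public static byte[] strict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // ends with e-acute, which is outside US-ASCII
        System.out.println(lenient(s)[3]); // prints 63, the code of '?'
        try {
            strict(s);
        } catch (CharacterCodingException e) {
            System.out.println("not pure ASCII: " + e);
        }
    }
}
```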

If you are a Guava user, there is a handy Charsets class:
String s = "Hello, world!";
byte[] b = s.getBytes(Charsets.US_ASCII);
Apart from not hard-coding an arbitrary charset name in your source code, it has a much bigger advantage: Charsets.US_ASCII is of type Charset (not String), so you avoid the checked UnsupportedEncodingException, which is thrown only from String.getBytes(String), not from String.getBytes(Charset).
In Java 7 there is the equivalent StandardCharsets class.

There is only one character wrong in the code you tried:
Charset characterSet = Charset.forName("US-ASCII");
String string = "Wazzup";
byte[] bytes = String.getBytes(characterSet);
^
Notice the upper case "String". This tries to invoke a static method on the string class, which does not exist. Instead you need to invoke the method on your string instance:
byte[] bytes = string.getBytes(characterSet);

The problem with other proposed solutions is that they will either drop characters that cannot be directly mapped to ASCII, or replace them with a marker character like ?.
You may, for example, want accented characters converted to the same character without the accent. There are a couple of tricks to do this (including building a static mapping table yourself or leveraging the existing normalization forms defined for Unicode), but those methods are far from complete.
Your best bet is using the junidecode library, which cannot be complete either but incorporates a lot of experience in the most sane way of transliterating Unicode to ASCII.
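For reference, the normalization trick mentioned above can be sketched with the JDK's java.text.Normalizer. The class name AccentStripper and the method stripAccents are hypothetical. The sketch also shows why this approach is incomplete: it only helps for characters that decompose into a base letter plus combining marks.

```java
import java.text.Normalizer;

public class AccentStripper {
    // Hypothetical helper: decomposes characters (NFD) and drops the combining marks
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("caf\u00e9 cr\u00e8me")); // prints cafe creme
        // Letters with no decomposition, such as the German sharp s, pass through unchanged
        System.out.println(stripAccents("stra\u00dfe"));
    }
}
```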

String s = "ASCII Text";
byte[] bytes = s.getBytes("US-ASCII");

If you happen to need this in Android and want to make it work with anything older than FroYo, you can also use EncodingUtils.getAsciiBytes():
byte[] bytes = EncodingUtils.getAsciiBytes("ASCII Text");

In my string I have Thai characters (TIS620 encoded) and German umlauts. The answer from agiles put me on the right path. Instead of .getBytes() I use now
int len = mString.length(); // Length of the string
byte[] dataset = new byte[len];
for (int i = 0; i < len; ++i) {
char c = mString.charAt(i);
dataset[i]= (byte) c;
}

Convert string to ascii values.
String test = "ABCD";
for ( int i = 0; i < test.length(); ++i ) {
char c = test.charAt( i );
int j = (int) c;
System.out.println(j);
}

I found the solution. Actually Base64 class is not available in Android. Link is given below for more information.
byte[] byteArray;
byteArray= json.getBytes(StandardCharsets.US_ASCII);
String encoded=Base64.encodeBytes(byteArray);
userLogin(encoded);
Here is the link for Base64 class: http://androidcodemonkey.blogspot.com/2010/03/how-to-base64-encode-decode-android.html

To convert String to ASCII byte array:
String s1 = "Hello World!";
byte[] byteArray = s1.getBytes(StandardCharsets.US_ASCII);
// Now byteArray is [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
To convert ASCII byte array to String:
String s2 = new String(byteArray, StandardCharsets.US_ASCII);

Try this:
/**
 * @(#)demo1.java
 *
 *
 * @author
 * @version 1.00 2012/8/30
 */
import java.util.*;
public class demo1
{
    Scanner s=new Scanner(System.in);
    String str;
    int key;
    void getdata()
    {
        System.out.println ("please enter a string");
        str=s.next();
        System.out.println ("please enter a key");
        key=s.nextInt();
    }
    // Shifts the code of every character by the key and prints the result
    void display()
    {
        char a;
        int j;
        for ( int i = 0; i < str.length(); ++i )
        {
            char c = str.charAt( i );
            j = (int) c + key;
            a= (char) j;
            System.out.print(a);
        }
    }
    public static void main(String[] args)
    {
        demo1 obj=new demo1();
        obj.getdata();
        obj.display();
    }
}

Decode base64Url in Java

https://web.archive.org/web/20110422225659/https://en.wikipedia.org/wiki/Base64#URL_applications
talks about base64Url - Decode
a modified Base64 for URL variant exists, where no padding '=' will be used, and the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_'
I created the following function:
public static String base64UrlDecode(String input) {
String result = null;
BASE64Decoder decoder = new BASE64Decoder();
try {
result = decoder.decodeBuffer(input.replace('-','+').replace('/','_')).toString();
}
catch (IOException e) {
System.out.println(e.getMessage());
}
return result;
}
It returns a very small set of characters that do not even resemble the expected result.
Any ideas?

Java8+
import java.util.Base64;
return Base64.getUrlEncoder().encodeToString(bytes);

Base64 encoding is part of the JDK since Java 8. URL safe encoding is also supported with java.util.Base64.getUrlEncoder(), and the "=" padding can be skipped by additionally using the java.util.Base64.Encoder.withoutPadding() method:
import java.nio.charset.StandardCharsets;
import java.util.Base64;
public String encode(String raw) {
return Base64.getUrlEncoder()
.withoutPadding()
.encodeToString(raw.getBytes(StandardCharsets.UTF_8));
}
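Since the question is about decoding, here is a sketch of the reverse direction using the same java.util.Base64 class (Java 8+). The class name Base64UrlDemo is made up; base64UrlDecode mirrors the method name from the question. It assumes the decoded bytes are UTF-8 text. The URL decoder accepts input with or without '=' padding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64UrlDemo {
    // Decodes URL-safe Base64 (with or without '=' padding) into a UTF-8 string
    public static String base64UrlDecode(String input) {
        byte[] decoded = Base64.getUrlDecoder().decode(input);
        // Build the String from the bytes; calling toString() on a byte[] would not do this
        return new String(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(base64UrlDecode("aGVsbG8"));  // prints hello
        // '-' and '_' take the place of '+' and '/' in the URL-safe alphabet
        byte[] raw = Base64.getUrlDecoder().decode("-_8");
        System.out.println(raw.length);                  // prints 2
    }
}
```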

Using Base64 from Apache Commons, which can be configured to be URL-safe, I created the following function:
import org.apache.commons.codec.binary.Base64;
public static String base64UrlDecode(String input) {
String result = null;
Base64 decoder = new Base64(true);
byte[] decodedBytes = decoder.decode(input);
result = new String(decodedBytes);
return result;
}
The constructor Base64(true) makes the decoding URL-safe.

In the Android SDK, there's a dedicated flag in the Base64 class: Base64.URL_SAFE, use it like so to decode to a String:
import android.util.Base64;
byte[] byteData = Base64.decode(body, Base64.URL_SAFE);
str = new String(byteData, "UTF-8");

Guava now has Base64 decoding built in.
https://google.github.io/guava/releases/17.0/api/docs/com/google/common/io/BaseEncoding.html

public static byte[] encodeUrlSafe(byte[] data) {
byte[] encode = Base64.encode(data);
for (int i = 0; i < encode.length; i++) {
if (encode[i] == '+') {
encode[i] = '-';
} else if (encode[i] == '/') {
encode[i] = '_';
}
}
return encode;
}
public static byte[] decodeUrlSafe(byte[] data) {
byte[] encode = Arrays.copyOf(data, data.length);
for (int i = 0; i < encode.length; i++) {
if (encode[i] == '-') {
encode[i] = '+';
} else if (encode[i] == '_') {
encode[i] = '/';
}
}
return Base64.decode(encode);
}

Right off the bat, it looks like your replace() is backwards; that method replaces the occurrences of the first character with the second, not the other way around.

@ufk's answer works, but you don't actually need to set the urlSafe flag when you're just decoding.
urlSafe is only applied to encode operations. Decoding seamlessly
handles both modes.
Also, there are some static helpers to make it shorter and more explicit:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;
public static String base64UrlDecode(String input) {
return StringUtils.newStringUtf8(Base64.decodeBase64(input));
}
Docs
newStringUtf8()
decodeBase64()

This class can help:
import android.util.Base64;
public class Encryptor {
public static String encode(String input) {
return Base64.encodeToString(input.getBytes(), Base64.URL_SAFE);
}
public static String decode(String encoded) {
return new String(Base64.decode(encoded.getBytes(), Base64.URL_SAFE));
}
}

I know the answer is already there, but still, if someone wants...
import java.util.Base64;
public class Base64BasicEncryptionExample {
public static void main(String[] args) {
// Getting encoder
Base64.Encoder encoder = Base64.getUrlEncoder();
// Encoding URL
String eStr = encoder.encodeToString
("http://www.javatpoint.com/javatutorial/".getBytes());
System.out.println("Encoded URL: "+eStr);
// Getting decoder
Base64.Decoder decoder = Base64.getUrlDecoder();
// Decoding URl
String dStr = new String(decoder.decode(eStr));
System.out.println("Decoded URL: "+dStr);
}
}
Took help from: https://www.javatpoint.com/java-base64-encode-decode

In Java, try the method Base64.encodeBase64URLSafeString() from the Commons Codec library for encoding.
