Convert UTF-8 to Shift-JIS - java

I have written some simple code to convert a UTF-8 string to Japanese (Shift-JIS) characters.
private static String convertUTF8ToShiftJ(String uft8Strg) {
String shftJStrg = null;
try {
byte[] b = uft8Strg.getBytes(UTF_8);
shftJStrg = new String(b, Charset.forName("SHIFT-JIS"));
logger.info("Converted to the string :" + shftJStrg);
} catch (Exception e) {
e.printStackTrace();
return uft8Strg;
}
return shftJStrg;
}
But the output is garbled:
convertUTF8ToShiftJ START !!
uft8Strg=*** abc000.sh ����started�
*** abc000.sh ��中�executing...�
*** abc000.sh ����ended��*
Does anybody have any idea where I made a mistake, or whether I need some additional logic? Any help would be appreciated!

Your String is already a String, so your method is "wrong". UTF-8 is an encoding of bytes (a byte[]), and such bytes can be decoded into a String in Java.
The signature should read:
private static byte[] convertUTF8ToShiftJ(byte[] utf8) {
If you want to convert a UTF-8 byte[] to a Shift-JIS byte[]:
private static byte[] convertUTF8ToShiftJ(byte[] utf8) {
// decode the UTF-8 bytes into a String, then encode that String as Shift-JIS
String s = new String(utf8, StandardCharsets.UTF_8);
return s.getBytes(Charset.forName("SHIFT-JIS"));
}
A String can be converted to a byte[] later with myString.getBytes(encoding).
Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for more detail.

It seems you have a conceptual misunderstanding about String encodings.
See for example Byte Encodings and Strings.
Converting a String from one encoding to another encoding doesn't make sense,
because String is a thing independent of encoding.
However, a String can be represented by byte arrays in various encodings
(like for example UTF-8 or Shift-JIS).
Therefore, it would make sense to convert a UTF-8 encoded byte array
to a Shift-JIS encoded byte array.
private static byte[] convertUTF8ToShiftJ(byte[] utf8Bytes) throws IllegalCharsetNameException {
String s = new String(utf8Bytes, StandardCharsets.UTF_8);
byte[] shftJBytes = s.getBytes(Charset.forName("SHIFT-JIS"));
return shftJBytes;
}
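To make the byte-level round trip concrete, here is a minimal sketch of the same idea as a runnable class. The class name and the sample text are made up for illustration, and it assumes the running JDK ships the Shift_JIS charset (full JDK distributions normally do, stripped-down runtimes may not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
final class Utf8ToShiftJisDemo {
    // "Shift_JIS" is the canonical Java name for this charset.
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Decode UTF-8 bytes into a String, then encode that String as Shift-JIS bytes.
    static byte[] convertUtf8ToShiftJis(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(SHIFT_JIS);
    }

    public static void main(String[] args) {
        // Three kanji written as Unicode escapes so the source file encoding does not matter.
        String original = "\u5b9f\u884c\u4e2d";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] sjis = convertUtf8ToShiftJis(utf8);

        System.out.println("UTF-8 length:     " + utf8.length);
        System.out.println("Shift-JIS length: " + sjis.length);
        // Decoding the Shift-JIS bytes with the matching charset restores the text.
        System.out.println("Round trip ok:    " + new String(sjis, SHIFT_JIS).equals(original));
    }
}
```

The point is that the String in the middle has no encoding of its own; only the byte arrays on either side do.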

Related

decode base64 utf-8 string java

I have this string
"=?UTF-8?B?VGLNBGNDQA==?="
that I need to decode into a standard Java String.
I wrote this quick and dirty main to get the String, but I'm having trouble:
String s = "=?UTF-8?B?VGLNBGNDQA==?=";
s = s.split("=\\?UTF-8\\?B\\?")[1].split("\\?=")[0];
System.out.println(s);
byte[] decoded = Base64.getDecoder().decode(s);
String x = new String(decoded, "UTF8");
System.out.println(decoded);
System.out.println(x);
It actually prints a strange string:
"Tb�cC#"
I do not know what text is behind the encoded string, but I assume my program works, since I can decode any other encoded string without problems, for example
"=?UTF-8?B?SGlfR3V5cyE="
That is "Hi_Guys!".
Should I assume that string is malformed?
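One way to check this rather than guess is to decode the Base64 payload and then decode the bytes with a strict UTF-8 decoder that reports errors instead of silently substituting the replacement character. The sketch below assumes Java 8+ and uses a made-up class and method name. Under those assumptions, the payload from the question is rejected as malformed UTF-8, while the "Hi_Guys!" payload decodes normally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the original question.
final class StrictUtf8Check {
    // Returns the decoded text, or null if the Base64 payload is not valid UTF-8.
    static String decodeStrict(String base64Payload) {
        byte[] raw = Base64.getDecoder().decode(base64Payload);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // The bytes are not a well-formed UTF-8 sequence.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeStrict("SGlfR3V5cyE="));  // payload of the working example
        System.out.println(decodeStrict("VGLNBGNDQA=="));  // payload of the problematic string
    }
}
```

new String(bytes, charset) never throws on bad input, which is why the original program prints a replacement character instead of failing.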

Java SE 8 Base64 class: encoding of byte[] parameter

Background: I'm working with the java.util.Base64 class that's new with Java 1.8.
The documentation specifies that encodeToString takes a byte array (there are some other options, but byte[] is the one I'm using). However, it doesn't specify how the byte array needs to be encoded. Here's my functional code:
import java.util.Base64;
import java.util.Base64.Encoder;
public class Test64 {
public static void main(String[] args){
try{
System.out.println(print64("This should be base64"));
} catch(Exception e) {
e.printStackTrace();
}
}
public static String print64(String test) throws Exception {
String test64 = "";
byte[] testBytes = test.getBytes("US-ASCII");
Base64.Encoder encoder64 = Base64.getUrlEncoder();
test64 = encoder64.encodeToString(testBytes);
return test64;
}
}
The question I have is whether the Base64 encodeToString will accept a byte[] with ANY encoding. I've tried US-ASCII and UTF-8, and those both work, but I'm hoping for a general conclusion.
Link to Javadoc for Base64.Encoder
The documentation does not specify an encoding, so any byte[] data will work. Base64 conversion is numerical, not character-oriented, so whoever interprets the Base64 number will have to know what it means. So as long as your documentation is clear how to interpret the bytes, you could use the Base64 string for any data serialization.
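As a small illustration of "whoever interprets the bytes has to know what they mean", the sketch below encodes the same one-character string with two different charsets. The encoder accepts both byte arrays, but the Base64 text differs, so the receiver has to be told which charset was used. The class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class, not part of the original question.
final class Base64CharsetDemo {
    // Turn text into bytes with the given charset, then Base64-encode those bytes.
    static String encode(String text, Charset charset) {
        return Base64.getUrlEncoder().encodeToString(text.getBytes(charset));
    }

    // Reverse the process; this only works if the same charset is used.
    static String decode(String base64, Charset charset) {
        return new String(Base64.getUrlDecoder().decode(base64), charset);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent
        String viaUtf8 = encode(text, StandardCharsets.UTF_8);
        String viaLatin1 = encode(text, StandardCharsets.ISO_8859_1);

        System.out.println(viaUtf8);
        System.out.println(viaLatin1);
        // Decoding with the matching charset restores the text.
        System.out.println(decode(viaUtf8, StandardCharsets.UTF_8).equals(text));
        // Decoding with the wrong charset does not.
        System.out.println(decode(viaUtf8, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```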

How can I check if a String is encodable in some encoding?

The following test fails for the Latin-1 conversion, because characters that cannot be encoded are replaced with a byte of value 63 (a question mark). These characters should really cause an exception instead ...
@Test
public void testEncoding() throws UnsupportedEncodingException {
final String czech = "Řízeček a šampáňo a žízeň";
// okay
final byte[] bytesInLatin2 = czech.getBytes("ISO8859-2");
// different bytes, but okay
final byte[] bytesInWin1250 = czech.getBytes("Windows-1250");
// different bytes, but okay
final byte[] bytesInUtf8 = czech.getBytes("UTF-8");
// nonsense; Ř,č,... are not in Latin1 code set!!!
final byte[] bytesInLatin1 = czech.getBytes("ISO8859-1");
System.out.println(Arrays.toString(bytesInLatin2));
System.out.println(Arrays.toString(bytesInWin1250));
System.out.println(Arrays.toString(bytesInUtf8));
System.out.println(Arrays.toString(bytesInLatin1));
System.out.flush();
final String latin2 = new String(bytesInLatin2, "ISO8859-2");
final String win1250 = new String(bytesInWin1250, "Windows-1250");
final String utf8 = new String(bytesInUtf8, "UTF-8");
final String latin1 = new String(bytesInLatin1, "ISO8859-1");
Assert.assertEquals("latin2", czech, latin2);
Assert.assertEquals("win1250", czech, win1250);
Assert.assertEquals("utf8", czech, utf8);
Assert.assertEquals("latin1", czech, latin1); // this test will fail!
}
There are many situations where data ends up corrupted because of this behaviour of Java. Is there a library that can check whether a String is encodable in a given encoding?
I suspect you're looking for CharsetEncoder.canEncode(CharSequence).
Charset latin2 = Charset.forName("ISO8859-2");
boolean validInLatin2 = latin2.newEncoder().canEncode(czech);
...
As an alternative to Jon Skeet's suggestion, you can also use the CharsetEncoder class to do the encoding directly (with the encode method), but first call the onMalformedInput and onUnmappableCharacter methods to specify what the encoder should do when it encounters bad input.
That way most of the time you're just doing a simple encode call, but if anything goes wrong you'll get an exception.
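A minimal sketch of that approach could look like the following. The class and method names are invented here; the encoder is configured with CodingErrorAction.REPORT, so an unencodable character surfaces as a CharacterCodingException instead of a silent question mark.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class StrictEncode {
    // Encode text in the given charset, failing loudly on characters it cannot represent.
    static byte[] encodeOrThrow(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer buffer = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        // First word of the Czech test string, written with Unicode escapes.
        String czech = "\u0158\u00edze\u010dek";
        try {
            System.out.println(encodeOrThrow(czech, StandardCharsets.UTF_8).length + " bytes in UTF-8");
            // Latin-1 has no mapping for two of the letters, so this call throws.
            encodeOrThrow(czech, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException e) {
            System.out.println("Not encodable: " + e);
        }
    }
}
```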

Why those calls to base64 classes return different results?

My code:
private static String convertToBase64(String string)
{
final byte[] encodeBase64 =
org.apache.commons.codec.binary.Base64.encodeBase64(string
.getBytes());
System.out.println(Hex.encodeHexString(encodeBase64));
final byte[] data = string.getBytes();
final String encoded =
javax.xml.bind.DatatypeConverter.printBase64Binary(data);
System.out.println(encoded);
return encoded;
}
Now I'm calling it with convertToBase64("stackoverflow"); and get the following result:
6333526859327476646d56795a6d787664773d3d
c3RhY2tvdmVyZmxvdw==
Why do I get different results?
I think Hex.encodeHexString encodes the bytes it is given as hex digits, while the second output is the Base64 text as a normal String.
From the API doc of Base64.encodeBase64():
byte[] containing Base64 characters in their UTF-8 representation.
So instead of
System.out.println(Hex.encodeHexString(encodeBase64));
you should write
System.out.println(new String(encodeBase64, "UTF-8"));
BTW: You should never use the String.getBytes() version without explicit encoding, because the result depends on the default platform encoding (for Windows this is usually "Cp1252" and Linux "UTF-8").
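To see that the two outputs are the same data in two notations, here is a small sketch that uses java.util.Base64 (Java 8+) in place of the commons-codec and JAXB classes from the question, with a hand-written hex helper. It passes an explicit charset to getBytes, as recommended above. The class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; java.util.Base64 is swapped in for the libraries used in the question.
final class Base64HexVsText {
    // Render each byte as two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Base64-encode the text, returning the Base64 characters as ASCII bytes.
    static byte[] base64Bytes(String text) {
        return Base64.getEncoder().encode(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] encoded = base64Bytes("stackoverflow");
        // Hex dump of the Base64 characters: what the first println in the question showed.
        System.out.println(toHex(encoded));
        // The same bytes read as text: what the second println showed.
        System.out.println(new String(encoded, StandardCharsets.US_ASCII));
    }
}
```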

converting byte[] to string

I have a byte[] with a length of 17 bytes. I want to convert it to a String and use that String in a later comparison, but the output I get is not in a format I can validate. I am using the method below for the conversion. I want the output to be a String that is easy to validate and that I can pass on for comparison.
byte[] byteArray = new byte[] {0,127,-1,-2,-54,123,12,110,89,0,0,0,0,0,0,0,0};
String value = new String(byteArray);
System.out.println(value);
Output : ���{nY
What encoding is it? You should define it explicitly:
new String(byteArray, Charset.forName("UTF-32")); //or whichever you use
Otherwise the result is unpredictable (from String.String(byte[]) constructor JavaDoc):
Constructs a new String by decoding the specified array of bytes using the platform's default charset
BTW I have just tried it with UTF-8, UTF-16 and UTF-32, and all of them produce bogus results. The long series of 0 bytes makes me believe that this isn't actually text. Where do you get this data from?
UPDATE: I have tried it with all character sets available on my machine:
for (Map.Entry<String, Charset> entry : Charset.availableCharsets().entrySet())
{
final String value = new String(byteArray, entry.getValue());
System.out.println(entry.getKey() + ": " + value);
}
and no encoding produces anything close to human-readable text... Your input is not text.
Use as follows:
byte[] byteArray = new byte[] {0,127,-1,-2,-54,123,12,110,89,0,0,0,0,0,0,0,0};
String value = Arrays.toString(byteArray);
System.out.println(value);
Your output will be
[0, 127, -1, -2, -54, 123, 12, 110, 89, 0, 0, 0, 0, 0, 0, 0, 0]
Is it actually encoded text? If so, specify the encoding.
However, the data you've got doesn't look like it's actually meant to be text. It just looks like arbitrary binary data to me. If it isn't really text, I'd recommend converting it to hex or base64, depending on requirements. There's a good public domain base64 encoder you can use.
String text = Base64.encodeBytes(byteArray);
And decoding:
byte[] data = Base64.decode(text);
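The two calls above belong to the third-party Base64 class that answer refers to. On Java 8 or later the same idea can be sketched with the built-in java.util.Base64, which is an assumption about your environment rather than what the answer used. The class and method names below are invented for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper using the JDK's own Base64 instead of the third-party class.
final class BinaryAsText {
    // Represent arbitrary bytes as a printable, comparable String.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recover the original bytes from that String.
    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] byteArray = new byte[] {0, 127, -1, -2, -54, 123, 12, 110, 89, 0, 0, 0, 0, 0, 0, 0, 0};
        String text = toBase64(byteArray);
        System.out.println(text);
        // Unlike new String(byteArray), this representation is lossless.
        System.out.println(Arrays.equals(byteArray, fromBase64(text)));
    }
}
```

Two byte arrays are equal exactly when their Base64 strings are equal, so the String can stand in for the array in a comparison.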
Not 100% sure if I get you right. Is this what you want?
String s = null;
StringBuffer buf = new StringBuffer("");
byte[] byteArray = new byte[] {0,127,-1,-2,-54,123,12,110,89,0,0,0,0,0,0,0,0};
for(byte b : byteArray) {
s = String.valueOf(b);
buf.append(s + ",");
}
String value = new String(buf);
System.out.println(value);
Maybe you should specify a charset:
String value = new String(byteArray, "UTF-8");
