Convert UTF-16 unicode characters to UTF-8 in Java

When I receive JSON, it contains \u003c and \u003e instead of < and >. I want to convert them back to UTF-8 in Java. Any help will be highly appreciated. Thanks.

try {
// Encode the string (UTF-16 internally) as UTF-8 bytes
String string = "\u003c";
byte[] utf8 = string.getBytes("UTF-8");
// Decode the UTF-8 bytes back into a string
string = new String(utf8, "UTF-8");
} catch (UnsupportedEncodingException e) {
// Cannot happen: every JVM is required to support UTF-8
}
Refer to http://www.exampledepot.com/egs/java.lang/unicodetoutf8.html

You can try converting the string into a byte array:
byte[] utfString = str.getBytes("UTF-8");
and then convert that back to a String object by specifying the UTF-8 encoding:
str = new String(utfString, "UTF-8");

You can also try this:
String s = "Hello World!";
String convertedInUTF8 = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
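Note that the round trips above return the same characters they started with, so they only matter if the escapes are still literal text in the received string. A JSON parser normally decodes \u003c for you while parsing. If the six-character sequences really have to be decoded by hand, a minimal sketch could look like this. The class and method names are made up for illustration, and the sketch deliberately ignores escaped backslashes, which a real JSON parser handles:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonUnescapeSketch {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape with the character it names.
    static String unescapeUnicode(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes make these literal escape sequences in the string.
        String raw = "\\u003cb\\u003ehi\\u003c/b\\u003e";
        System.out.println(unescapeUnicode(raw)); // <b>hi</b>
    }
}
```

In most cases it is simpler to let the JSON library parse the document and read the field values from the parsed result.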

Related

Converting byte array to String Java

I wish to convert a byte array to a String, but when I do, the String has 00 before every digit taken from the array.
I should have got the following result: 49443a3c3532333437342e313533373936313835323237382e303e
But I have the following:
Please help me: how can I get rid of the nulls?
I have tried the following way to convert (xxxxId is the byte array):
String xxxIdString = new String(Hex.encodeHex(xxxxId));
Thank you!
Try something like this:
String s = new String(bytes);
s = s.replace("\0", "");
It's also possible that the string will end after the first '\0' received. If that's the case, first iterate through the array, replace '\0' with something like '\n', and then do this:
String s = new String(bytes);
s = s.replace("\n", "");
EDIT:
Use this for a byte array:
String s = new String(bytes, StandardCharsets.UTF_8);
Use this for a char array:
String s = new String(chars);
Try the code below:
byte[] bytes = {...}
String str = new String(bytes, "UTF-8"); // for UTF-8 encoding
Please have a look here: How to convert byte array to string and vice versa?
To convert a byte array into a String correctly, explicitly create a String object from the byte array, using the same charset for encoding and decoding:
String example = "This is an example";
byte[] bytes = example.getBytes(StandardCharsets.UTF_8);
String s = new String(bytes, StandardCharsets.UTF_8);
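The 00 before every digit looks like what a UTF-16 encoding produces for ASCII text, although the question does not show how the array was built, so this is only a guess. The sketch below compares the hex form of the same short text under UTF-8 and UTF-16BE. toHex is a hypothetical helper standing in for the Hex.encodeHex call in the question:

```java
import java.nio.charset.StandardCharsets;

class HexCharsetSketch {
    // Hypothetical helper: two lowercase hex digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask with 0xff so negative bytes are not sign-extended.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String id = "ID:<";
        System.out.println(toHex(id.getBytes(StandardCharsets.UTF_8)));    // 49443a3c
        System.out.println(toHex(id.getBytes(StandardCharsets.UTF_16BE))); // 00490044003a003c
    }
}
```

If the array turns out to hold UTF-16 data, encoding the original text as UTF-8 (or decoding the array with the matching UTF-16 charset first) avoids the extra zero bytes without any string replacement.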

Java convert encoding

I have a string, taken from an XML tag, that contains mojibake:
<Applicant_Place_Born>&#208;&#156;&#208;&#190;&#209;&#129;&#208;&#186;&#208;&#178;&#208;&#176;</Applicant_Place_Born>
I know that exactly the same string but in correct encoding is:
<Applicant_Place_Born>Москва</Applicant_Place_Born>
I know this because I can convert it into the proper string using a Tcl utility:
# The original string
set s "&#208;&#156;&#208;&#190;&#209;&#129;&#208;&#186;&#208;&#178;&#208;&#176;"
# substituting the html escapes
set t "Ð\x9cоÑ\x81ква"
# decode from utf-8 into Unicode
encoding convertfrom utf-8 "Ð\x9cоÑ\x81ква"
Москва
I tried different variations of this:
System.out.println(new String(original.getBytes("UTF-8"), "CP1251"));
but I always got other mojibakes or question marks instead of characters.
Q: How can I do the same as Tcl does but using Java code?
EDIT:
I have tried @Joop Eggen's approach:
import java.nio.charset.StandardCharsets;
import org.apache.commons.lang3.StringEscapeUtils;
public class s {
static String s;
public static void main(String[] args) {
try {
System.setProperty("file.encoding", "CP1251");
System.out.println("JVM encoding: " + System.getProperty("file.encoding"));
s = "&#208;&#156;&#208;&#190;&#209;&#129;&#208;&#186;&#208;&#178;&#208;&#176;";
System.out.println("Original text: " + s);
s = StringEscapeUtils.unescapeHtml4(s);
byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
s = new String(b, "UTF-16BE");
System.out.println("Result: " + s);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
The converted string looked like something Chinese:
JVM encoding: CP1251
Original text: &#208;&#156;&#208;&#190;&#209;&#129;&#208;&#186;&#208;&#178;&#208;&#176;
Result: 킜킾톁킺킲킰
A String in Java should always be correct Unicode. In your case you seem to have UTF-16BE interpreted as some single-byte encoding.
A patch would be
String string = StringEscapeUtils.unescapeHtml4(s);
byte[] b = string.getBytes(StandardCharsets.ISO_8859_1);
string = new String(b, "UTF-16BE");
Now string should be a correct Unicode String.
System.out.println(string);
If the operating system is, for instance, in Cp1251, the Cyrillic text should be converted correctly.
The characters in s are actually bytes of UTF-16BE, I guess.
By getting the bytes of the string in a single-byte encoding, hopefully no conversion takes place.
Then make a String of the bytes as being in UTF-16BE, internally converted to Unicode (actually UTF-16BE too).
You were pretty close. However, getBytes encodes a string into bytes rather than decoding it, so the bytes have to be taken in a single-byte encoding and then decoded as UTF-8. What you want is something along the lines of
String string = "Ð\u009cÐ¾Ñ\u0081ÐºÐ²Ð°";
byte[] bytes = string.getBytes("ISO-8859-1");
System.out.println(new String(bytes, "UTF-8"));
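For comparison, here is a sketch that mirrors the three Tcl steps from the question: substitute the HTML escapes, treat each resulting character as one byte, and decode those bytes as UTF-8. It assumes the input carries decimal entities whose values are the UTF-8 bytes of the text, as the Tcl output Ð\x9c... suggests. The regex helper is a hypothetical stand-in for StringEscapeUtils.unescapeHtml4 so that the example runs without Commons Lang:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MojibakeRepairSketch {
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    // Hypothetical helper: handles decimal numeric entities only.
    static String unescapeDecimalEntities(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    static String repair(String escaped) {
        // Step 1: each entity becomes one char in the range U+0000..U+00FF.
        String oneCharPerByte = unescapeDecimalEntities(escaped);
        // Step 2: ISO-8859-1 maps those chars back to the identical byte values.
        byte[] utf8Bytes = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        // Step 3: the bytes were UTF-8 all along, so decode them as such.
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String escaped = "&#208;&#156;&#208;&#190;&#209;&#129;&#208;&#186;&#208;&#178;&#208;&#176;";
        // The expected text is written with escapes so the source file encoding does not matter.
        String expected = "\u041c\u043e\u0441\u043a\u0432\u0430";
        System.out.println(repair(escaped).equals(expected)); // true
    }
}
```

Whether the repaired text then prints correctly depends on the console encoding, which is a separate issue from the conversion itself.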

Convert String to bytes, and back again

I have a string cityName which I encoded into bytes as follows:
byte[] cityBytes = cityName.getBytes("UTF-8");
...and stored the bytes somewhere. When I retrieve those bytes, how can I decode them back into a string?
Use the String(byte[], Charset) or String(byte[], String) constructor.
byte[] rawBytes = /* whatevs */;
try
{
String decoded = new String(rawBytes, Charset.forName("UTF-8"));
// or
decoded = new String(rawBytes, "UTF-8");
// best, if you're using Java 7 (thanks to @ColinD):
decoded = new String(rawBytes, StandardCharsets.UTF_8);
}
catch (UnsupportedEncodingException e)
{
// see http://stackoverflow.com/a/6030187/139010
throw new AssertionError("UTF-8 not supported");
}
The String class has a few constructors that accept an array of bytes, including one that takes an array of bytes and a String representation of a charset and another that takes a Charset object. There are also constructors that take the offset and length of the String as arguments, if the String is only a small section of the byte array.
Like this:
String cityName = new String(cityBytes, "UTF-8");
String s = new String(cityBytes, "UTF-8");
Try this: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
String(byte[] bytes, String charsetName)
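To tie the answers above together, here is a small round-trip sketch, including the offset/length constructor mentioned earlier. The class name, helper methods, and sample city are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class RoundTripSketch {
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes only part of a larger buffer; offset and length count bytes, not chars.
    static String decodeSlice(byte[] buffer, int offset, int length) {
        return new String(buffer, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String cityName = "Z\u00fcrich"; // six chars, one of them non-ASCII
        byte[] cityBytes = encode(cityName);
        System.out.println(cityBytes.length);                   // 7
        System.out.println(decode(cityBytes).equals(cityName)); // true

        // The same bytes embedded after a five-byte ASCII prefix.
        byte[] record = encode("city=" + cityName);
        System.out.println(decodeSlice(record, 5, cityBytes.length).equals(cityName)); // true
    }
}
```

Using StandardCharsets.UTF_8 on both sides avoids the checked UnsupportedEncodingException and keeps the encode and decode charsets visibly in sync.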

How to read UTF-8 characters from the file as bytes?

I am not able to read UTF-8 characters from a file as bytes.
The UTF-8 characters are displayed as question marks (?) when I convert the bytes to characters.
The code snippet below shows how the file is read.
Please tell me how to read UTF-8 characters from a file,
and what the problem is with reading the byte array this way.
public static void getData() {
FormFile file = actionForm.getFile("UTF-8");
byte[] mybt;
try
{
byte[] fileContents = file.getFileData();
StringBuffer sb = new StringBuffer();
for(int i=0;i<fileContents.length;i++){
sb.append((char)fileContents[i]);
}
System.out.println(sb.toString());
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
}
Output: ??Docum??ents (the input file content is "ÞDocumÿents"; it contains some Spanish characters.)
This is the problem:
for(int i=0;i<fileContents.length;i++){
sb.append((char)fileContents[i]);
}
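You're converting each byte to a char just by casting it. That treats every byte as a separate character (much like ISO-Latin-1, except that Java sign-extends bytes above 0x7F), so multi-byte UTF-8 sequences are never decoded.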
To read text from an InputStream, you adapt it via InputStreamReader, specifying the character encoding.
The simplest way of reading the whole of a file into a string would be to use Guava:
String text = Files.toString(file, Charsets.UTF_8);
Or to convert a byte array:
String text = new String(fileContents, "UTF-8");
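As a rough illustration of the InputStreamReader route without Guava, the sketch below decodes a stream with an explicit charset. A ByteArrayInputStream stands in for the uploaded file so the example is self-contained, and the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderSketch {
    // Reads the whole stream, decoding bytes to chars with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "Docum\u00ffents";
        byte[] fileContents = original.getBytes(StandardCharsets.UTF_8);

        String text = readAll(new ByteArrayInputStream(fileContents), StandardCharsets.UTF_8);
        System.out.println(text.equals(original)); // true

        // Casting the first byte of the two-byte sequence yields an unrelated char.
        System.out.println(Integer.toHexString((char) fileContents[5])); // ffc3
    }
}
```

The last line shows why the cast in the question fails: the byte 0xC3 is negative in Java, so the cast produces U+FFC3 instead of taking part in decoding the two-byte sequence.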

Encoding in Java

I have a C# function which I want to translate into Java code. I have a problem here:
Encoding enc = Encoding.GetEncoding("Windows-1252");
bytZeichenBenutzer = enc.GetBytes(strBenutzer.Substring(intLoopCount, 1).ToCharArray());
How do I do that in Java? I can't find anything similar, only examples that work with UTF-8.
You can use the getBytes(String) or getBytes(Charset) methods:
String myString = getMyStringFromSomeWhere();
byte[] utf8Bytes = myString.getBytes("UTF-8");
// or
Charset myCharset = Charset.forName("Windows-1252");
byte[] windowsBytes = myString.getBytes(myCharset);
String s = "hhh";
try {
s.getBytes("Windows-1252");
} catch(UnsupportedEncodingException e) {
e.printStackTrace();
}
You can do:
byte[] a = "some string".getBytes("Windows-1252");
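Closer to the C# snippet, which encodes one character per loop iteration, a sketch could look like this. It assumes the JVM ships the windows-1252 charset (Charset.forName throws UnsupportedCharsetException otherwise). The class name, helper method, and sample text are made up for illustration:

```java
import java.nio.charset.Charset;

class Cp1252Sketch {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Rough counterpart of enc.GetBytes(str.Substring(i, 1).ToCharArray()).
    static byte[] encodeCharAt(String text, int index) {
        return text.substring(index, index + 1).getBytes(WINDOWS_1252);
    }

    public static void main(String[] args) {
        String strBenutzer = "J\u00e4ger\u20ac"; // sample text ending with the euro sign
        for (int intLoopCount = 0; intLoopCount < strBenutzer.length(); intLoopCount++) {
            byte[] bytZeichenBenutzer = encodeCharAt(strBenutzer, intLoopCount);
            // Prints 4a, e4, 67, 65, 72, 80, one value per line.
            System.out.printf("%02x%n", bytZeichenBenutzer[0] & 0xff);
        }
    }
}
```

The euro sign is a handy check: it encodes to the single byte 0x80 in Windows-1252, whereas ISO-8859-1 has no euro sign at all.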
