How to convert byte[] from ANSI to UTF-8

byte[] data;
ResultSet resultSet;
data = resultSet.getBytes("xml"); // the XML is stored as ANSI in the database (MS SQL)
I am trying to convert the XML to UTF-8.
Please help me figure this out.

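You might use the String constructor that takes an encoding, for example new String(data, "UTF-8"). Note that the encoding argument names the encoding the bytes are currently in, not the one you want to end up with.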
It's probably easiest to just call resultSet.getString("xml");
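If the goal is UTF-8 bytes rather than a String, one way is to decode with the source encoding and re-encode as UTF-8. A minimal sketch, assuming "ANSI" here means windows-1252 (that is a guess; substitute the code page your database actually uses). The class and method names are hypothetical:
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AnsiToUtf8 {
    // Assumption: the source bytes are windows-1252; change this to the real code page.
    private static final Charset ANSI = Charset.forName("windows-1252");

    static byte[] convert(byte[] ansiBytes) {
        // Decode with the encoding the bytes are in, then encode the text as UTF-8.
        String text = new String(ansiBytes, ANSI);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```
If the XML declaration inside the data names the old encoding, it would need to be updated as well, otherwise a parser will trust the declaration and misread the converted bytes.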

You probably should just feed the bytes directly to an XML parser. XML almost requires the encoding to be specified, and the parser will figure it out on its own.
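A minimal sketch of that approach with the JDK's built-in DOM parser, assuming the stored XML starts with a declaration that names its encoding. The class name is hypothetical:
```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlBytes {
    static Document parse(byte[] data) throws Exception {
        // The parser reads the encoding from the XML declaration in the bytes,
        // so no manual byte-to-String conversion happens here.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(data));
    }
}
```
Usage would look like XmlBytes.parse(resultSet.getBytes("xml")). If the data has no declaration, the parser falls back to UTF-8/UTF-16 detection, which would misread ANSI content.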

Related

safest way to read clob into xml parser

I'm getting an input stream from a Clob in Oracle 11 (using the Oracle 11 JDBC driver), and passing the input stream to an XML parser in Java:
java.sql.Clob clob = resultSet.getClob("myClob");
InputStream is = clob.getAsciiStream();
MyDom dom = MyDomParser.parse(is);
I'm wondering whether using a character stream would be safer, e.g. instead:
Reader r = clob.getCharacterStream();
MyDom dom = MyDomParser.parse(r);
My thinking is that getCharacterStream() might be doing some encoding that helps guarantee nice UTF-8 is returned. Not sure if there is any real difference between the two ways shown here of reading the clob.

Not much difference, but getCharacterStream is better for Unicode data, since getAsciiStream returns the value as an ASCII stream. See this link:
http://community.actian.com/wiki/Manipulating_SQL_CLOB_data_with_JDBC
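A minimal sketch of the character-stream route, using the JDK DOM parser as a stand-in for the asker's MyDomParser (which is not shown). The class name is hypothetical:
```java
import java.io.Reader;
import java.sql.Clob;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ClobXml {
    static Document parse(Clob clob) throws Exception {
        // The Reader delivers already-decoded characters, so the parser
        // does not have to guess a byte encoding.
        try (Reader reader = clob.getCharacterStream()) {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(reader));
        }
    }
}
```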

throw exception when string is not encoded in UTF-8

I've got a method where one of the input parameters is String xml. I just want to add a check on the encoding of that XML: if any character is in an encoding other than UTF-8, an error should be thrown.
Can you please tell me the easiest way to create and test it?
I've used something like this:
String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"));
Document doc = builder.parse(IOUtils.toInputStream(xml, "UTF-8"));
I added letters like Ľ,Š,Ť,Ž,ľ,š,ť,ž and saved the file as cp1250,
but got no error.
What am I doing wrong?

Java has no built-in way to detect which encoding a file is in. A file is just a sequence of bytes that can be interpreted however you like; nothing in the bytes themselves says what they mean. I recommend using this library (no, I didn't write it):
http://code.google.com/p/juniversalchardet/
Follow these instructions (copy pasted from that link):
How to use it
Construct an instance of org.mozilla.universalchardet.UniversalDetector.
Feed some data (typically several thousands bytes) to the detector by calling UniversalDetector.handleData().
Notify the detector of the end of data by calling UniversalDetector.dataEnd().
Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
Don't forget to call UniversalDetector.reset() before you reuse the detector instance.

String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"));
If this IOUtils is org.apache.commons.io.IOUtils then its Javadoc says
"Get the contents of an InputStream as a String using the default character encoding of the platform."
As you are saving as cp1250, I guess cp1250 is also your platform character encoding. What your code would be doing is
Read the file as a byte stream
Convert the byte stream to chars using cp1250 (platform encoding)
Transform the chars to Java internal representation (UTF-16)
Convert from UTF-16 to UTF-8
Create XML document
That will always work, because cp1250 really is your file encoding, UTF-16 has every character in cp1250, and UTF-8 has every character in UTF-16.
If you want to read the bytes as UTF-8 and avoid automatic conversions, you should use one of the two-parameter variants of IOUtils.toString():
public static String toString(InputStream input, Charset encoding)
public static String toString(InputStream input, String encoding)
So I would try:
// Helper import: I always forget if the constant is "UTF8" or "UTF-8"
import org.apache.commons.lang.CharEncoding;
String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"), CharEncoding.UTF_8);
Document doc = builder.parse(IOUtils.toInputStream(xml, CharEncoding.UTF_8));
The rule of thumb here is: NEVER do any byte-to-string / string-to-byte conversion without specifying the source / destination encoding.
A minor rule of thumb would be: Unless you need to use some other encoding, use UTF-8 everywhere.
Both of those rules of thumb are independent of your programming language of choice.
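Note that decoding with an explicit charset through new String or IOUtils replaces invalid bytes with U+FFFD instead of failing. To get an actual error, as the question asks, one option is a strict CharsetDecoder from the JDK. A minimal sketch, with a hypothetical class name:
```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    static String decode(byte[] bytes) throws CharacterCodingException {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }
}
```
This has to run on the raw bytes of the file, before anything turns them into a String. It only checks that the bytes are well-formed UTF-8; it cannot tell which other encoding they were written in.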

Base64 InputStream to String

I have been trying to read a file through an input stream and write its contents into a String. The file is plain text with some images and other files embedded in base64. I want to keep that encoding, that is, I want the String to contain something like:
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAoHBwgHBgoICAgLCgoLDhgQDg0NDh0VFhEYIx8lJCIf
IiEmKzcvJik0KSEiMEExNDk7Pj4+JS5ESUM8SDc9Pjv/2wBDAQoLCw4NDhwQEBw7KCIoOzs7Ozs7
I have been trying with Base64InputStream and other classes from packages such as org.apache.commons.codec, but I just cannot figure it out. Any kind of help would be really appreciated. Thanks in advance!
Edit
Piece of code using a reader:
BufferedReader br = new BufferedReader(new InputStreamReader(bodyPart.getInputStream()));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
br.close();
Getting as a result something like: .DIC;ÿÛC;("(;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;ÿÀ##"ÿÄ

Have you tried doing this:
final byte[] bytes64bytes = Base64.encodeBase64(IOUtils.toByteArray(is));
final String content = new String(bytes64bytes);
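On Java 8 or later the same re-encoding can be done without Commons Codec. A minimal sketch using java.util.Base64; the class name is hypothetical, and InputStream.readAllBytes() assumes Java 9 or later:
```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;

final class Base64Text {
    static String encode(InputStream in) throws IOException {
        // Read the raw (already decoded) bytes, then turn them back into base64 text.
        byte[] raw = in.readAllBytes();
        return Base64.getEncoder().encodeToString(raw);
    }
}
```
Base64.getMimeEncoder() could be used instead if the output should be wrapped at 76 characters per line, like the sample in the question.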

A text file containing some base64 data can be read with the charset of the rest of the file.
Base64 encoding is a means of encoding bytes in a limited set of characters that are represented identically in almost all character encodings, for example ASCII or UTF-8.
Base64 isn't a charset encoding, so you don't have to declare that a file contains base64-encoded data when reading it into a string.
So if your text file is generally UTF-8 (that's probable), you can read it without problem even if it contains a base64 encoded stream. Simply use a basic reader and don't use a Base64InputStream if you don't want to decode it.
When opening a file with a reader, you have to specify the encoding. If you don't know it, I suggest you test with the probable ones, like UTF-8, US-ASCII or ISO-8859-1.
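A minimal sketch of such a reader with the charset passed explicitly. The class and method names are hypothetical, and this only helps when the stream really carries the base64 text rather than already-decoded bytes:
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class TextBodyReader {
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Keep the line breaks so the base64 block stays as it was in the file.
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```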

If you have a normal InputStream object, then you can get a Base64-encoded stream from it directly by using the Base64InputStream constructor from the Apache Commons Codec library.

I found the solution, inspired by this post: "getting base64 content string of an image from a mimepart in Java".
I think it is kind of stupid to decode the base64 and then encode it again, but it is the only way I found to handle this issue. If someone could offer a better solution, it would be much appreciated.
Thanks

How to implement rawurldecode in Java?

I'd like to convert PHP code to Java, that is to decode a string stored as an encoded URI format.
That is, change
This%20is%20a%20%2Burl%2B%21
into
This is a +url+!
I've looked at java.net.URI, but there are no suitable examples, and it seems that anything to be decoded by it needs to be in a proper URI format. I'd like to convert a string that isn't in proper format, but contains HTML encoding.

java.net.URLDecoder.decode("This%20is%20a%20%2Burl%2B%21", "UTF-8");
UTF-8 is of course just an example. Use whatever your input encoding is.

You could use URLDecoder. It just decodes an x-www-form-urlencoded String.
String decodedString = URLDecoder.decode("This%20is%20a%20%2Burl%2B%21", "UTF-8"); // the one-argument overload is deprecated
System.out.println(decodedString);
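One difference from PHP's rawurldecode is worth noting: URLDecoder also turns a literal '+' into a space, which rawurldecode does not. A minimal sketch that protects '+' before decoding, assuming the input is UTF-8; the class and method names are hypothetical:
```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class RawUrlDecoder {
    static String rawUrlDecode(String encoded) {
        try {
            // Escape literal '+' first so URLDecoder does not turn it into a space.
            return URLDecoder.decode(encoded.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```
For the string in the question both approaches give the same result, because it contains no literal '+'.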

Issue encoding java->xls

This is not a pure Java question and can also be related to HTML.
I've written a java servlet that queries a database table and shows the
result as a html table. The user can also ask to receive the result as
an Excel sheet.
I'm creating the Excel sheet by printing the same HTML table, but with
the content type "application/vnd.ms-excel". The Excel file is
created fine.
The problem is that the tables may contain non-English data, so I want
to use UTF-8 encoding.
PrintWriter out = response.getWriter();
response.setContentType("application/vnd.ms-excel:ISO-8859-1");
//response.setContentType("application/vnd.ms-excel:UTF-8");
response.setHeader("cache-control", "no-cache");
response.setHeader("Content-Disposition", "attachment; filename=file.xls");
out.print(src);
out.flush();
The non-English characters (áéíóú) appear as garbage.
I also tried converting the String to bytes and back:
byte[] arrByte = src.getBytes("ISO-8859-1");
String result = new String(arrByte, "UTF-8");
But I am still getting garbage. What can I do?
Thanks
UPDATE: if I open the Excel file in Notepad++, the file encoding is shown as "UTF-8 without BOM". If I change the encoding to "UTF-8" and then open the file in Excel, the characters "áéíóú" look good.

Excel is a binary format, not a text format, so you should not need to set any encoding, since it simply doesn't apply. Whatever system you are using to build the Excel file (e.g. Apache POI) will take care of the encoding of text within the Excel file.
You should not try to convert the received bytes to a string; just store them in a byte array or write them out to a file.
EDIT: from the comment, it doesn't sound as if you are using a "real" binary excel file, but a tab delimited text file (CSV). In that case, make sure you use consistent encoding, e.g UTF-8 throughout.
Also, call setContentType before calling response.getWriter(); otherwise the charset you set is not applied to the writer.
See ServletResponse.getWriter()
EDIT: You can try writing the BOM. It's normally not required, but file format handling in Office is far from normal...
Java's UTF-8 encoder does not write a BOM for you, so you have to write it yourself. That means using the response's output stream rather than its writer, since you need to write raw bytes (the BOM). So you change your code to this:
response.setContentType("application/vnd.ms-excel; charset=UTF-8");
// set other headers also, "cache-control" etc.
OutputStream outputStream = response.getOutputStream();
outputStream.write(0xEF); // 1st byte of the UTF-8 BOM
outputStream.write(0xBB); // 2nd byte
outputStream.write(0xBF); // last byte of the BOM
// now wrap the stream in a PrintWriter to write the characters as UTF-8
PrintWriter out = new PrintWriter(new OutputStreamWriter(outputStream, "UTF-8"));
out.print(src);
out.flush(); // push buffered characters through to the response stream
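The same idea can be tried outside a servlet container. A minimal sketch that writes the BOM followed by UTF-8 text to any OutputStream; the class and method names are hypothetical:
```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class BomWriter {
    static void writeUtf8WithBom(OutputStream out, String text) throws IOException {
        // UTF-8 byte order mark: EF BB BF.
        out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        // Flush so the encoded characters reach the underlying stream.
        writer.flush();
    }
}
```
In the servlet it would be called as BomWriter.writeUtf8WithBom(response.getOutputStream(), src) after the headers are set.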

Do you get "garbage" when you print result to standard output?
Edit (code in code tags from the comment below):
response.setContentType("application/vnd.ms-excel; charset=UTF-8");

Try using the ServletResponse.setCharacterEncoding(java.lang.String charset) method.
response.setCharacterEncoding("UTF-8");

I had the same issue. I fixed it by using print() instead of write():
outputStream.print('\ufeff');
