Converting a String to UTF-8 and saving it to a file [duplicate] - java

This question already has answers here:
How to write a UTF-8 file with Java?
(10 answers)
Closed 7 years ago.
I have a String in Java that represents the content of an XML file (which I'm generating in another process). I have a problem with the encoding: the XML header declares UTF-8, but when I try to parse the file I get an error related to the encoding, exactly:
Invalid byte 2 of 4-byte UTF-8 sequence
So I opened the file with Notepad++ and it says it is ANSI encoded. I was thinking of converting the String to UTF-8 before saving it to the file, which I did with:
byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
But then, how do I save it to the file? I want the user to be able to open the XML file in any text editor, but now all I have is bytes. How do I save it?

The following should do it:
// Ensure that the stated encoding in the XML is UTF-8:
// $1______________________ $2_____ $3_
content = content.replaceFirst("(<\\?xml[^>]+encoding=\")([^\"]*)(\")",
"$1UTF-8$3");
byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
Files.write(Paths.get("... .xml"), encoded);
For editing you need a UTF-8 capable editor (JEdit, Notepad++) under Windows. Notepad++ should recognize the file; you can reload it with the right encoding.
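On Java 11+ you can also skip the manual byte conversion and let the JDK encode the String itself; a minimal sketch (the output file name is a placeholder):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Files.writeString encodes the CharSequence with the given charset, so no getBytes() call is needed
Files.writeString(Paths.get("output.xml"), content, StandardCharsets.UTF_8);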

Try Files.write(Paths.get("output.xml"), encoded);.

Related

Hexadecimal to Bytes in Java [duplicate]

This question already has answers here:
Convert a string representation of a hex dump to a byte array using Java?
(25 answers)
Closed 1 year ago.
I'm working on a Word file manipulator (DOCX format to be specific) and it is working fine, but at this phase I'm expected to take a file from SAP software. I receive the file as bytes that look something like 504B030414000600080000002100DFA4D26C5A0100002005000013000.
I try to use this code to read the received bytes, put them into an input stream, and open them with Apache POI's functions:
byte[] byteArr = "504B030414000600080000002100DFA4D26C5A01000020050000130008025B436F6E74656E745F54797065735D2E786D6C20A2040228A0000200000000000000".getBytes();
InputStream fis = new ByteArrayInputStream(byteArr);
return new XWPFDocument(OPCPackage.open(fis));
The last line throws an error saying the file isn't OOXML.
How to transform my received bytes to something relevant in Java?
getBytes() only encodes the characters of the String; because the string is a hexadecimal representation, you have to parse it with DatatypeConverter.parseHexBinary.
This question has more information, and even more options to choose from:
Convert a string representation of a hex dump to a byte array using Java?
Now, having said that, I have not been able to convert the hex string provided from your question into a good document.
Running this function:
try (final FileOutputStream fos = new FileOutputStream(new File("C:/", "Test Document.docx")))
{
    final byte[] b = DatatypeConverter.parseHexBinary(
        "504B030414000600080000002100DFA4D26C5A01000020050000130008025B436F6E74656E745F54797065735D2E786D6C20A2040228A0000200000000000000");
    fos.write(b);
}
... results in a file whose archive listing shows a [Content_Types].xml entry, which is promising (if you open other valid documents with 7-Zip you will see that entry in the archive). However, I cannot open this file with MS Office, LibreOffice, or 7-Zip.
If I had to guess, I would say this particular file has become corrupted, or parts of it have gone missing.
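For completeness, a minimal sketch of the conversion step itself, assuming the hex string is complete and well formed (note that javax.xml.bind.DatatypeConverter left the JDK together with JAXB in Java 11; on Java 17+ java.util.HexFormat does the same job):
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HexFormat;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// hexString is the dump received from SAP (hypothetical variable)
byte[] bytes = HexFormat.of().parseHex(hexString);
try (InputStream in = new ByteArrayInputStream(bytes)) {
    XWPFDocument doc = new XWPFDocument(in); // throws if the bytes are not a valid OOXML package
}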

Reading yenc encoded data into java

I have a text file that I encoded with a yenc encoder, and I intend to decode it with a Java yenc decoder. The yenc decoder isn't the problem.
I read the encoded file in and then print the lines on the console, but the characters get misrepresented. This:
q˜œ‹–74˜“›ŸJsnJJJJ
turns into:
q?˜?œ‹–74˜“›Ÿ?JsnJJJJ
This happens not only when I print it on the console, but also when I write it to a file, so the decoded file will not match the original file.
How can I solve this? I really have no idea.
I tried replacing the "?" characters, but that didn't have any effect.
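A minimal sketch of the usual fix, assuming the corruption comes from reading the encoded file through a Reader with a lossy charset: yenc output uses nearly the full byte range, so it has to be handled as raw bytes (or, if a String is unavoidable, as ISO-8859-1, which maps every byte value to a character one-to-one):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read the encoded file as raw bytes -- never through the platform default charset
byte[] encoded = Files.readAllBytes(Paths.get("encoded.yenc")); // file name is a placeholder

// If the decoder insists on a String, ISO-8859-1 round-trips all 256 byte values unchanged
String asText = new String(encoded, StandardCharsets.ISO_8859_1);
byte[] roundTripped = asText.getBytes(StandardCharsets.ISO_8859_1);
Files.write(Paths.get("copy.yenc"), roundTripped); // byte-for-byte identical to the input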

Java's UTF-8 encoding

I have this code:
BufferedWriter w = Files.newWriter(file, Charsets.UTF_8);
w.newLine();
StringBuilder sb = new StringBuilder();
sb.append("\"").append("éééé").append("\";");
w.write(sb.toString());
But it doesn't work; in the end my file doesn't have UTF-8 encoding. I tried to do this when writing:
w.write(new String(sb.toString().getBytes(Charsets.US_ASCII), "UTF8"));
It made question marks appear everywhere in the file...
I found that there was a bug regarding the recognition of the initial BOM character (http://bugs.java.com/view_bug.do?bug_id=4508058), so I tried using the BOMInputStream class. But bomIn.hasBOM() always returns false, so I guess my problem is not BOM related?
Do you know how I can make my file encoded in UTF-8? Was the problem solved in Java 8?
You're writing UTF-8 correctly in your first example (although you're redundantly creating a String from a String).
The problem is that the viewer or tool you're using to view the file doesn't read the file as UTF-8.
Don't mix in ASCII, that just converts all the non-ASCII bytes to question marks.
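If in doubt, you can check what actually ended up in the file by dumping its bytes; a minimal sketch (the file name is a placeholder; each é written as UTF-8 should show up as the two-byte sequence C3 A9, whereas ISO-8859-1 would produce a single E9):
import java.nio.file.Files;
import java.nio.file.Paths;

// Print the file content as hex so the encoding can be verified independently of any editor
byte[] bytes = Files.readAllBytes(Paths.get("out.txt"));
StringBuilder hex = new StringBuilder();
for (byte b : bytes) {
    hex.append(String.format("%02X ", b));
}
System.out.println(hex);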

Character encoding in csv

We have a requirement to pick data from an Oracle DB table and dump it into a CSV file and a plain pipe-separated text file, then give the user a link in the application so they can view the generated CSV/text files.
As a lot of parsing was involved, we wrote a Unix shell script and call it from our Struts/J2EE application.
Earlier we were losing the Chinese and Roman characters in the generated files, and the generated files had a us-ascii charset (checked using file -i). Later we set NLS_LANG=AMERICAN_AMERICA.AL32UTF8 and this gave us UTF-8 files.
But the characters were still gibberish, so we then tried the iconv command and converted the UTF-8 files to UTF-16LE:
iconv -f utf-8 -t utf-16le $recordFile > $tempFile
This works fine for the generated text file, but in the CSV the Chinese and Roman characters are still not correct. However, if we open the CSV in Notepad, add a newline by pressing Enter, save it, and then open it with MS Excel, all characters come out fine, including the Chinese and Roman ones, but now the text for each row is on a single line instead of being split into columns.
Not sure what's going on.
Java code
PrintWriter out = servletResponse.getWriter();
servletResponse.setContentType("application/vnd.ms-excel; charset=UTF-8");
servletResponse.setCharacterEncoding("UTF-8");
servletResponse.setHeader("Content-Disposition", "attachment; filename=" + fileName.toString());
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i = fileInputStream.read()) != -1) {
    out.write(i);
}
fileInputStream.close();
out.close();
Please let me know if I missed any details.
Thanks to everyone for taking the time to go through this.
I was able to solve it. First, as mentioned by Aaron, I removed the UTF-16LE encoding to avoid future issues and encoded the files as UTF-8. I changed the PrintWriter in the Java code to an OutputStream and was able to see the correct characters in my text file.
The CSV was still showing garbage. It turned out that we need to prepend EF BB BF at the beginning of the file, because BOM-aware software like MS Excel expects it. Changing the Java code as below did the trick for the CSV:
OutputStream out = servletResponse.getOutputStream();
out.write(239); // 0xEF
out.write(187); // 0xBB
out.write(191); // 0xBF
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i = fileInputStream.read()) != -1) {
    out.write(i);
}
fileInputStream.close();
out.flush();
out.close();
As always with Unicode problems, every single step of the transformation chain must work perfectly. If you make a mistake in one place, data will be silently corrupted. There is no easy way to figure out where it happens, you have to debug the code or write unit tests.
The Java code above only works if the file actually contains UTF-8 encoded data; it doesn't "magically" figure out what's in the file and converts it to UTF-8. So if the file already contains garbage, you just slap a "this is UTF-8" label on it but it's still garbage.
That means for you that you need to create test cases which take known test data and move that through every step of the chain: Inserting into database, reading from the database, writing to CSV, writing to the text file, reading those files and download to the user.
For each step, you need to write unit tests which take a known Unicode string like abc öäü, process it, and then check the result. To make it easier to type in Java code, use "abc \u00f6\u00e4\u00fc". You may also want to add spaces at the beginning and end of the string to see whether they are properly preserved or not.
file -i doesn't help you much here since it just makes a guess at what the file contains. There is no indicator (data or metadata) in a text file which says "this is UTF-8". UTF-16 supports a BOM header for this, but almost no one uses UTF-16, so many tools don't support it (properly).
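For example, a minimal round-trip test for the file-writing step (JUnit 5 assumed; the other steps of the chain would get similar tests):
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.jupiter.api.Test;

class EncodingRoundTripTest {
    @Test
    void writesAndReadsNonAsciiAsUtf8() throws Exception {
        String expected = " abc \u00f6\u00e4\u00fc ";          // known test data, with leading/trailing spaces
        Path file = Files.createTempFile("enc-test", ".txt");
        Files.write(file, expected.getBytes(StandardCharsets.UTF_8)); // the step under test
        String actual = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        assertEquals(expected, actual);                        // fails loudly if any step mangles the data
    }
}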

Issue encoding java->xls

This is not a pure Java question; it may also be related to HTML.
I've written a Java servlet that queries a database table and shows the result as an HTML table. The user can also ask to receive the result as an Excel sheet.
I'm creating the Excel sheet by printing the same HTML table, but with the content type "application/vnd.ms-excel". The Excel file is created fine.
The problem is that the tables may contain non-English data, so I want to use UTF-8 encoding.
PrintWriter out = response.getWriter();
response.setContentType("application/vnd.ms-excel:ISO-8859-1");
//response.setContentType("application/vnd.ms-excel:UTF-8");
response.setHeader("cache-control", "no-cache");
response.setHeader("Content-Disposition", "attachment; filename=file.xls");
out.print(src);
out.flush();
The non-English characters appear as garbage (áéíóú).
I also tried converting the String to bytes and back:
byte[] arrByte = src.getBytes("ISO-8859-1");
String result = new String(arrByte, "UTF-8");
But I'm still getting garbage. What can I do?
Thanks
UPDATE: if I open the Excel file in Notepad++, the file encoding shows as "UTF-8 without BOM". If I change the encoding to "UTF-8" and then open the file in Excel, the characters "áéíóú" look fine.
Excel is a binary format, not a text format, so you should not need to set any encoding, since it simply doesn't apply. Whatever system you are using to build the Excel file (e.g. Apache POI) will take care of the encoding of text within the Excel file.
You should not try to convert the received bytes to a String; just store them in a byte array or write them out to a file.
EDIT: from the comment, it doesn't sound as if you are using a "real" binary Excel file, but a tab-delimited text file (CSV). In that case, make sure you use a consistent encoding, e.g. UTF-8, throughout.
Also, call setContentType before calling response.getWriter().
See the Javadoc for ServletResponse.getWriter().
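A minimal sketch of that ordering, as a fragment of the servlet code above (header values are illustrative):
// Set the content type and charset BEFORE obtaining the writer;
// once getWriter() has been called, the encoding can no longer be changed.
response.setContentType("application/vnd.ms-excel; charset=UTF-8");
response.setHeader("Content-Disposition", "attachment; filename=file.xls");
PrintWriter out = response.getWriter(); // the writer is now created with UTF-8
out.print(src);
out.flush();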
EDIT: You can try writing the BOM. It's normally not required, but file format handling in Office is far from normal...
Java doesn't really have support for the BOM. You'll have to fake it. That means you need to use the response's OutputStream rather than its Writer, since you need to write raw bytes (the BOM). So you change your code to this:
response.setContentType("application/vnd.ms-excel; charset=UTF-8");
// set other headers also, "cache-control" etc..
OutputStream outputStream = response.getOutputStream();
outputStream.write(0xEF); // 1st byte of BOM
outputStream.write(0xBB);
outputStream.write(0xBF); // last byte of BOM
// now get a PrintWriter to stream the chars.
PrintWriter out = new PrintWriter(new OutputStreamWriter(outputStream,"UTF-8"));
out.print(src);
Do you get "garbage" when you print result to standard output?
Edit (code in code tags from the comment below):
response.setContentType("application/vnd.ms-excel; charset=UTF-8")
Try using the ServletResponse.setCharacterEncoding(java.lang.String charset) method.
response.setCharacterEncoding("UTF-8");
I had the same issue. I fixed it by using print() instead of write():
outputStream.print('\ufeff');
