Hexadecimal to Bytes in Java [duplicate]

This question already has answers here:
Convert a string representation of a hex dump to a byte array using Java?
(25 answers)
Closed 1 year ago.
I'm working on a Word file manipulator (DOCX format, to be specific) and it is working fine, but at this phase I'm expected to take a file from SAP software. I receive the file in the form of bytes that look something like 504B030414000600080000002100DFA4D26C5A0100002005000013000.
I then try to use this code to read the received bytes, put them in an input stream, and open them with Apache POI's functions:
byte[] byteArr = "504B030414000600080000002100DFA4D26C5A01000020050000130008025B436F6E74656E745F54797065735D2E786D6C20A2040228A0000200000000000000".getBytes();
InputStream fis = new ByteArrayInputStream(byteArr);
return new XWPFDocument(OPCPackage.open(fis));
The last line gives me an error saying the file isn't OOXML.
How do I transform my received bytes into something relevant in Java?

getBytes is for encoding the String's characters, not for parsing them. Because this string is hexadecimal, you will have to use DatatypeConverter.parseHexBinary instead.
This question has more information, and even more options to choose from:
Convert a string representation of a hex dump to a byte array using Java?
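By way of illustration, here is a minimal sketch of that approach (my assumption: javax.xml.bind is on the classpath, which is no longer the case by default since Java 11; on Java 17+, java.util.HexFormat.of().parseHex(...) is a built-in alternative):
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.bind.DatatypeConverter;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class HexToDocx {
    public static XWPFDocument fromHex(String hex) throws Exception {
        // parseHexBinary decodes the hex digits into the raw bytes they
        // represent; getBytes() would instead encode the characters
        // '5', '0', '4', 'B', ... themselves.
        byte[] raw = DatatypeConverter.parseHexBinary(hex);
        InputStream in = new ByteArrayInputStream(raw);
        return new XWPFDocument(OPCPackage.open(in));
    }
}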
Now, having said that, I have not been able to convert the hex string provided in your question into a valid document.
Running this code:
try (final FileOutputStream fos = new FileOutputStream(new File("C:/", "Test Document.docx")))
{
    final byte[] b = DatatypeConverter.parseHexBinary(
        "504B030414000600080000002100DFA4D26C5A01000020050000130008025B436F6E74656E745F54797065735D2E786D6C20A2040228A0000200000000000000");
    fos.write(b);
}
... results in the file below:
The [Content_Types].xml entry in there is promising (if you open other valid documents with 7-Zip, you will see the same entry in the archive). However, I cannot open this file with MS Office, LibreOffice, or 7-Zip.
If I had to guess, I would say this particular file has become corrupted, or parts of it have gone missing.

Related

How do I write Chinese characters in ZipEntry?

I want to export a string (Chinese text) to a CSV file inside a zip file. Where do I need to set the encoding to UTF-8? Or what approach should I take (based on the code below) to display Chinese characters in the exported CSV file?
This is the code I currently have.
ByteArrayOutputStream out = new ByteArrayOutputStream();
ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8);
try {
    ZipEntry entry = new ZipEntry("chinese.csv");
    zipOut.putNextEntry(entry);
    zipOut.write("类型".getBytes());
} catch (IOException e) {
    e.printStackTrace();
} finally {
    zipOut.close();
    out.close();
}
Instead of "类型", I get "ç±»åž‹" in the CSV file.
First, you definitely need to change zipOut.write("类型".getBytes()); to zipOut.write("类型".getBytes(StandardCharsets.UTF_8));
Also, when you open your resulting CSV file, the editor might not be aware that the content is encoded in UTF-8, so you may need to tell your editor that it is UTF-8. For instance, in Notepad, you can save your file with the "Save As" option and change the encoding to UTF-8. In other words, your issue might just be a display issue rather than an actual encoding problem.
There is an open-source Java library that has a utility that converts any String to a Unicode sequence and vice versa. This utility has helped me many times when diagnosing various charset-related issues. Here is a sample of what the code does:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is the Javadoc for the class StringUnicodeEncoderDecoder.
I tried your inputs and got this:
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("ç±»åž‹"));
And the output was:
\u7c7b\u578b
\u00e7\u00b1\u00bb\u00e5\u017e\u2039
So it looks like you did lose the information, and it is not just a display issue.
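To confirm, here is a sketch that reproduces the garbled output (assuming the platform default charset was windows-1252; any similar single-byte charset would give a comparable result):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // Encode the characters as UTF-8 (e7 b1 bb e5 9e 8b), then
        // misinterpret those six bytes as windows-1252:
        byte[] utf8 = "类型".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, Charset.forName("windows-1252")));
        // Prints: ç±»åž‹
    }
}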
The getBytes() method is one culprit: without an explicit charset, it uses the default character set of your machine. From the Java String documentation:
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
getBytes(String charsetName)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
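To make the difference concrete, here is a small sketch (the byte count for the no-argument call depends on the platform default, so the output on your machine may differ):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "类型";
        // Platform-dependent: uses Charset.defaultCharset(); a single-byte
        // default charset cannot represent these characters at all.
        byte[] platformBytes = s.getBytes();
        // Deterministic: always e7 b1 bb e5 9e 8b (6 bytes).
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Charset.defaultCharset() + ": " + platformBytes.length
                + " bytes, UTF-8: " + utf8Bytes.length + " bytes");
    }
}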
Furthermore, as #Slaw pointed out, make sure that you compile (javac -encoding <encoding>) your files with the same encoding the files are in:
-encoding Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.
A call to closeEntry() was missing in the OP, by the way. I stripped the snippet down to what I found necessary to achieve the desired functionality.
try (FileOutputStream fileOut = new FileOutputStream("out.zip");
     ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
    zipOut.putNextEntry(new ZipEntry("chinese.csv"));
    zipOut.write("类型".getBytes("UTF-8"));
    zipOut.closeEntry();
}
Finally, as #MichaelGantman pointed out, you might want to check what is in which encoding using a tool like a hex editor, also to rule out that the editor you view the result file in merely displays correct UTF-8 in a wrong way. "类" in UTF-8 is (hex) e7 b1 bb; in UTF-16 (Java's internal representation) it is 7c 7b.

Reading and Writing Text files with UTF-16LE encoding and Apache Commons IO

I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to be displayed correctly, or I would just use ASCII, which seems to work fine. The C# application can open files saved by either with no problem. The Java application reads files it saved perfectly, but there is a small problem that crops up when reading the files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files up with EditPad Lite and they appear to be identical, even when viewed with the encoding shown, and the encoding is UTF-16LE. I'm racking my brain on this; any help would be appreciated.
lines = FileUtils.readLines(file, "UTF-16LE");
Integer.parseInt(line[0])
I cannot see any difference between the file saved in C# and the one saved in Java.
Screen Shot of Data in EditPad Lite
if (lines.get(0).split("\\t")[0].length() == 2) {
    lines.set(0, lines.get(0).substring(1));
}
Your .NET code is probably writing a BOM. Compliant readers of Unicode strip off any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order:
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a Catch-22: if the source has a BOM, then you can read it as "UTF-16". If it doesn't, then you can read it as "UTF-16LE" or "UTF-16BE", provided you know which one it is.
So, either write it with a BOM and read it without specifying the byte order, or write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false));
[Java]
FileUtils.readLines(file, "UTF-16LE");
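If you cannot control the writer, a defensive sketch (mirroring the OP's own workaround above, but keying on the BOM character itself rather than the field length) is to read with an explicit byte order and strip a leading BOM if one shows up:
import java.io.File;
import java.util.List;
import org.apache.commons.io.FileUtils;

public class Utf16LeReader {
    static List<String> readLines(File file) throws Exception {
        List<String> lines = FileUtils.readLines(file, "UTF-16LE");
        // With the byte order given explicitly, a BOM written by the
        // producer survives as U+FEFF at the start of the first line.
        if (!lines.isEmpty() && lines.get(0).startsWith("\uFEFF")) {
            lines.set(0, lines.get(0).substring(1));
        }
        return lines;
    }
}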
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
BufferedReader br = new BufferedReader(fis);
String line = br.readLine();

Converting a String to UTF-8 and saving it in a file [duplicate]

This question already has answers here:
How to write a UTF-8 file with Java?
(10 answers)
Closed 7 years ago.
I have a String in Java that represents the content of an XML file (which I'm generating in another process). I have a problem with the encoding: the header of the XML declares UTF-8, but when I try to parse it I get an error related to the encoding, namely:
Invalid byte 2 of 4-byte UTF-8 sequence
So I opened the file with Notepad++ and it says it's ANSI-encoded. I was thinking of converting the String to UTF-8 before saving it to the file, which I did with:
byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
But then, how do I save it to the file? I want the user to be able to open the XML file in any text editor, but now I have bytes. How do I save them?
The following should do:
// Ensure that the stated encoding in the XML is UTF-8:
// $1______________________ $2_____ $3_
content = content.replaceFirst("(<\\?xml[^>]+encoding=\")([^\"]*)(\")",
"$1UTF-8$3");
byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
Files.write(Paths.get("... .xml"), encoded);
For editing, one needs a UTF-8 capable editor (JEdit, or Notepad++ under Windows).
Notepad++ should recognize the file; you can reload it with the right encoding.
Try Files.write(Paths.get("output.xml"), encoded);.
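For completeness, a self-contained sketch of the write step (output.xml is just a placeholder name):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XmlSaver {
    public static void save(String content) throws IOException {
        // Encode the String as UTF-8 and write the raw bytes;
        // Files.write opens and closes the file for us.
        byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
        Files.write(Paths.get("output.xml"), encoded);
    }
}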

Which charset should I use to decode this array of bytes in Java?

I am currently working with SOAP web services, and more precisely, retrieving a file sent within one.
It works manually:
In SoapUI, I receive this (truncated for readability):
JVBERi0xLjQKJeLjz9MKMTIgMCBvY [...]
dL0luZm8gMTggMCBSL1NpemUgMTk+PgpzdGFydHhyZWYKNjk5OQolJUVPRgo=
I can paste this string into Notepad++ and, after clicking on MIME Tools > Base64 Decode, it becomes a proper PDF file, as follows (truncated, just the header is shown):
%PDF-1.4 %xE2xE3xCFxD3LF 12 0 obj <>stream
The PDF file can then be read without any problem.
The problem is now to recover this data using Java.
I am receiving an array of bytes (the acopier variable in the example below) and using the following code to store it into a file.
I tried a couple of the numerous examples found on the net without any success.
I also tried UTF-8 and ISO-8859-1, amongst others.
OutputStreamWriter osw = null;
try {
    String filePath = "c:\\temp\\";
    filePath = filePath.concat("test.pdf");
    FileOutputStream fos = new FileOutputStream(filePath, false);
    osw = new OutputStreamWriter(fos, "UTF-8");
    osw.write("\uFEFF");
    osw.write(new String(acopier));
    osw.close();
    System.out.println("Success");
    fos.close();
} catch (Exception e) {
    System.out.println(e.getMessage());
    osw.close();
}
Unfortunately, the file can't be opened as a PDF file:
%PDF-1.4 %âãÏÓ 12 0 obj <>stream
When I tried to check what's within the array of bytes, the console showed me this (truncated):
% P D F
- 1 . 4
% ? ? ? ?
1 2 0
I presume that Windows, Notepad++, or SoapUI is doing something in the background to guess which charset to use, but I don't know for certain which way to go.
Can someone please explain how to do it from scratch in Java (meaning from the original array of bytes)?
Regards,
Pierre
Get the original (Base64) string data.
Use your preferred Base64 decoder to turn it into bytes (there are plenty of them for Java).
Write the bytes to a file. As bytes, not as character data (i.e. no Writer class).
Since in your example you're trying to write binary data as character data (and using the String constructor), I assume you're quite new to Java?
Your mistake was converting Base64 to binary data in Notepad++, then saving the result thinking that it would be valid binary data (which it almost definitely isn't; and even if it did work, that's not the road you want to go down).
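A minimal sketch of those three steps using java.util.Base64 (if your acopier byte array holds the Base64 text itself, new String(acopier, StandardCharsets.US_ASCII) would yield the input string; the output path is taken from the question):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class PdfSaver {
    public static void savePdf(String base64Payload) throws IOException {
        // Decode the Base64 text into the raw PDF bytes...
        byte[] pdfBytes = Base64.getDecoder().decode(base64Payload);
        // ...and write them as bytes: no Writer, no charset, no BOM.
        Files.write(Paths.get("c:\\temp\\test.pdf"), pdfBytes);
    }
}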

Issue encoding java->xls

This is not a pure Java question and can also be related to HTML.
I've written a Java servlet that queries a database table and shows the result as an HTML table. The user can also ask to receive the result as an Excel sheet.
I'm creating the Excel sheet by printing the same HTML table, but with the content type "application/vnd.ms-excel". The Excel file is created fine.
The problem is that the tables may contain non-English data, so I want to use UTF-8 encoding.
PrintWriter out = response.getWriter();
response.setContentType("application/vnd.ms-excel:ISO-8859-1");
//response.setContentType("application/vnd.ms-excel:UTF-8");
response.setHeader("cache-control", "no-cache");
response.setHeader("Content-Disposition", "attachment; filename=file.xls");
out.print(src);
out.flush();
The non-English characters appear as garbage (Ã¡Ã©Ã­Ã³Ãº instead of áéíóú).
I also tried converting from String to bytes:
byte[] arrByte = src.getBytes("ISO-8859-1");
String result = new String(arrByte, "UTF-8");
But I'm still getting garbage. What can I do?
Thanks
UPDATE: if I open the Excel file in Notepad++, the file encoding shows as "UTF-8 without BOM"; if I change the encoding to "UTF-8" and then open the file in Excel, the characters "áéíóú" look good.
Excel is a binary format, not a text format, so you should not need to set any encoding; it simply doesn't apply. Whatever system you are using to build the Excel file (e.g. Apache POI) will take care of the encoding of text within the Excel file.
You should not try to convert the received bytes to a string; just store them in a byte array or write them out to a file.
EDIT: from the comment, it doesn't sound as if you are using a "real" binary Excel file, but a tab-delimited text file (CSV). In that case, make sure you use a consistent encoding, e.g. UTF-8, throughout.
Also, before calling response.getWriter(), call setContentType first.
See HttpServletResponse.getPrintWriter()
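Combining the two points above, here is a sketch of the relevant servlet lines in the right order (response and src as in the question; the charset parameter replaces the OP's colon syntax):
// Set the content type and charset BEFORE obtaining the writer; the
// writer's encoding is fixed at the moment getWriter() is called.
response.setContentType("application/vnd.ms-excel; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
response.setHeader("cache-control", "no-cache");
response.setHeader("Content-Disposition", "attachment; filename=file.xls");
PrintWriter out = response.getWriter();
out.print(src);
out.flush();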
EDIT: You can try writing the BOM. It's normally not required, but file format handling in Office is far from normal...
Java doesn't really have support for the BOM; you'll have to fake it. That means you need to use the response's OutputStream rather than its Writer, since you need to write raw bytes (the BOM). So you change your code to this:
response.setContentType("application/vnd.ms-excel; charset=UTF-8");
// set other headers also, "cache-control" etc..
OutputStream outputStream = response.getOutputStream();
outputStream.write(0xEF); // 1st byte of BOM
outputStream.write(0xBB);
outputStream.write(0xBF); // last byte of BOM
// now get a PrintWriter to stream the chars.
PrintWriter out = new PrintWriter(new OutputStreamWriter(outputStream,"UTF-8"));
out.print(src);
Do you get "garbage" when you print result to standard output?
Edit (code in code tags from the comment below):
response.setContentType("application/vnd.ms-excel; charset=UTF-8")
Try using the ServletResponse.setCharacterEncoding(java.lang.String charset) method.
response.setCharacterEncoding("UTF-8");
I had the same issue. I fixed it by using print() instead of write():
outputStream.print('\ufeff');
