How to find Encoding of a file using Java [duplicate]

This question already has answers here:
Java : How to determine the correct charset encoding of a stream
(16 answers)
Closed 3 years ago.
I am trying to find the encoding of a file using a Java program, but it always reports UTF-8 as the output, even though the file is an ANSI file.
import java.io.FileInputStream;
import java.io.InputStreamReader;

// getEncoding() only reports the charset this reader was constructed with
// (the platform default here); it does not detect the file's encoding.
new InputStreamReader(new FileInputStream("FILE_NAME")).getEncoding();
I also looked at juniversalchardet, but the library is old and no longer looks properly supported:
https://code.google.com/archive/p/juniversalchardet/
There are so many answers that say we can find the encoding of a file, for example:
Java : How to determine the correct charset encoding of a stream
These solutions don't look reliable. According to Jörg W Mittag, we cannot determine the encoding of a file for sure.

I'm not sure about Scala, but have you already tried this library?
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// CharsetToolkit comes from the "guessencoding" library
public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
}
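
For what it's worth, the only thing that can be detected with certainty is a byte order mark at the start of the file; everything beyond that is statistical guessing. A minimal BOM-sniffing sketch (my own illustration, not from the linked answers):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public static Charset sniffBom(String fileName) throws IOException {
    try (InputStream in = Files.newInputStream(Paths.get(fileName))) {
        byte[] b = new byte[3];
        int n = in.read(b);
        // UTF-8 BOM: EF BB BF
        if (n >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 BOMs: FE FF (big-endian), FF FE (little-endian)
        if (n >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (n >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding cannot be determined for sure
    }
}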


Hexadecimal to Bytes in Java [duplicate]

This question already has answers here:
Convert a string representation of a hex dump to a byte array using Java?
(25 answers)
Closed 1 year ago.
I'm working on a Word file manipulator (DOCX format, to be specific) and it is working fine, but at this phase I'm expected to take a file from SAP software. I receive the file in the form of bytes that look something like 504B030414000600080000002100DFA4D26C5A0100002005000013000.
I try to use this code to read the received bytes, put them in an input stream, and open them with Apache POI:
byte[] byteArr = "504B030414000600080000002100DFA4D26C5A01000020050000130008025B436F6E74656E745F54797065735D2E786D6C20A2040228A0000200000000000000".getBytes();
InputStream fis = new ByteArrayInputStream(byteArr);
return new XWPFDocument(OPCPackage.open(fis));
The last line throws an error saying the file isn't OOXML.
How can I transform the received bytes into something relevant in Java?
getBytes() encodes the characters of the String itself, not the bytes they represent. Because your string is hexadecimal, you will have to use DatatypeConverter.parseHexBinary instead.
This question has more information, and even more options to choose from:
Convert a string representation of a hex dump to a byte array using Java?
Now, having said that, I have not been able to convert the hex string provided in your question into a good document.
Running this code:
import java.io.File;
import java.io.FileOutputStream;
import javax.xml.bind.DatatypeConverter;

try (final FileOutputStream fos = new FileOutputStream(new File("C:/", "Test Document.docx"))) {
    final byte[] b = DatatypeConverter.parseHexBinary(
            "504B030414000600080000002100DFA4D26C5A01000020050000130008025B436F6E74656E745F54797065735D2E786D6C20A2040228A0000200000000000000");
    fos.write(b);
}
... results in a file whose raw bytes show [Content_Types].xml near the start, which is promising (if you open other valid documents with 7-Zip you will see that entry in the archive). However, I cannot open this file with MS Office, LibreOffice, or 7-Zip.
If I had to guess, I would say this particular file has become corrupted, or parts of it have gone missing.
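
As an aside, javax.xml.bind.DatatypeConverter was removed from the JDK in Java 11. On Java 17 and later, java.util.HexFormat performs the same conversion (a minimal sketch; the hex shown is just the leading bytes of a ZIP file):

import java.util.HexFormat;

// parseHex requires an even number of hex digits
byte[] bytes = HexFormat.of().parseHex("504B030414000600");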

Java's UTF-8 encoding

I have this code:
BufferedWriter w = Files.newWriter(file, Charsets.UTF_8); // Guava's Files and Charsets
w.newLine();
StringBuilder sb = new StringBuilder();
sb.append("\"").append("éééé").append("\";");
w.write(sb.toString());
But it doesn't work: in the end my file doesn't have UTF-8 encoding. I tried to do this when writing:
w.write(new String(sb.toString().getBytes(Charsets.US_ASCII), "UTF8"));
It made question marks appear everywhere in the file...
I found that there was a bug regarding the recognition of the initial BOM character (http://bugs.java.com/view_bug.do?bug_id=4508058), so I tried using the BOMInputStream class. But bomIn.hasBOM() always returns false, so I guess my problem is not BOM-related, maybe?
Do you know how I can make my file encoded in UTF-8? Was the problem solved in Java 8?
You're writing UTF-8 correctly in your first example (although you're redundantly creating a String from a String).
The problem is that the viewer or tool you're using to view the file doesn't read the file as UTF-8.
Don't mix in ASCII; that just converts all the non-ASCII bytes to question marks.
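
If the underlying complaint is that Windows editors don't auto-detect the file as UTF-8, one option (an assumption about the setup, not part of the answer above) is to write a BOM yourself, since Java never writes one automatically:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

try (Writer w = new OutputStreamWriter(
        Files.newOutputStream(Paths.get("out.txt")), StandardCharsets.UTF_8)) {
    w.write('\uFEFF');    // BOM, encoded as the bytes EF BB BF
    w.write("\"éééé\";"); // editors such as Notepad then detect UTF-8
}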

Converting String to UTF-8 and save it in a file [duplicate]

This question already has answers here:
How to write a UTF-8 file with Java?
(10 answers)
Closed 7 years ago.
I have a String in Java; this string represents the content of an XML file (that I'm generating in another process). I have a problem with the encoding: in the header of the XML I have UTF-8, but when I try to parse the file I get an error related to the encoding, namely:
Invalid byte 2 of 4-byte UTF-8 sequence
So I opened the file with Notepad++ and it says the file has ANSI encoding. I was thinking of converting the String to UTF-8 before saving it to the file, which I did with:
byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
But then, how do I save it to the file? I want the user to be able to open the XML file in any text editor, but now I have bytes. How do I save them?
The following should do:
// Ensure that the stated encoding in the XML header is UTF-8:
// $1______________________ $2_____ $3_
content = content.replaceFirst("(<\\?xml[^>]+encoding=\")([^\"]*)(\")",
        "$1UTF-8$3");
byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
Files.write(Paths.get("... .xml"), encoded);
For editing you need a UTF-8 capable editor (JEdit, or Notepad++ under Windows).
Notepad++ should recognize the file; you can also reload it with the right encoding.
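
On Java 11 and later there is an even shorter route, Files.writeString, which writes a String as UTF-8 by default (a sketch reusing the question's content variable; the file name is illustrative):

import java.nio.file.Files;
import java.nio.file.Paths;

Files.writeString(Paths.get("output.xml"), content); // UTF-8 unless another charset is given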

java.* package for encoding decoding [duplicate]

This question already has answers here:
Decode Base64 data in Java
(21 answers)
Closed 9 years ago.
I have been using the com.sun.org.apache.xerces.internal.impl.dv.util.Base64 class for encoding and decoding strings. But I want to use a java.* package for encoding and decoding instead of the com.sun.apache.* package.
Can you please suggest an appropriate java.* package?
If you can wait until Java 8 is released - there will be a java.util.Base64 class.
In the meantime you should use the solution from Joachim Sauer's comment. (See Decode Base64 data in Java - second answer)
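For reference, the java.util.Base64 usage will look like this (a minimal sketch):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

String encoded = Base64.getEncoder()
        .encodeToString("hello".getBytes(StandardCharsets.UTF_8));
byte[] decoded = Base64.getDecoder().decode(encoded);
System.out.println(new String(decoded, StandardCharsets.UTF_8)); // hello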
Use the java packages:
java.net.URLDecoder
java.net.URLEncoder
And use them like this:
public static String decodeString(final String string) {
    try {
        return URLDecoder.decode(string, "UTF-8");
    } catch (final UnsupportedEncodingException e) {
        TLog.d(LOG, "Decoding Not Supported"); // TLog/LOG are the answerer's own logging helpers
    }
    return string;
}
What do you mean?
You want to use:
import java.*;
instead of:
import com.sun.apache.*;
?
That seems a little bit hard. One way to do it:
Download the source code of the com.sun.org.apache.xerces.internal.impl.dv.util.Base64 package.
Update the package name.
Re-package the source code.
Import the jar file again.
I don't think you should do this, though; there might be licensing issues.

Add non-ASCII file names to zip in Java

What is the best way to add non-ASCII file names to a zip file using Java, in such a way that the files can be properly read in both Windows and Linux?
Here is one attempt, adapted from https://truezip.dev.java.net/tutorial-6.html#Example, which works in Windows Vista but fails in Ubuntu Hardy. In Hardy the file name is shown as abc-ЖДФ.txt in file-roller.
import java.io.IOException;
import java.io.PrintStream;
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;

public class Main {
    public static void main(final String[] args) throws IOException {
        try {
            PrintStream ps = new PrintStream(new FileOutputStream(
                    "outer.zip/abc-åäö.txt"));
            try {
                ps.println("The characters åäö works here though.");
            } finally {
                ps.close();
            }
        } finally {
            File.umount();
        }
    }
}
Unlike java.util.zip, TrueZIP allows specifying the zip file encoding. Here's another sample, this time explicitly specifying the encoding. Neither IBM437, UTF-8, nor ISO-8859-1 works in Linux. IBM437 works in Windows.
import java.io.IOException;
import de.schlichtherle.io.FileOutputStream;
import de.schlichtherle.util.zip.ZipEntry;
import de.schlichtherle.util.zip.ZipOutputStream;

public class Main {
    public static void main(final String[] args) throws IOException {
        for (String encoding : new String[] { "IBM437", "UTF-8", "ISO-8859-1" }) {
            ZipOutputStream zipOutput = new ZipOutputStream(
                    new FileOutputStream(encoding + "-example.zip"), encoding);
            ZipEntry entry = new ZipEntry("abc-åäö.txt");
            zipOutput.putNextEntry(entry);
            zipOutput.closeEntry();
            zipOutput.close();
        }
    }
}
The encoding for the file entries in ZIP was originally specified as IBM Code Page 437, which makes many characters used in other languages impossible to represent.
The PKWARE specification refers to the problem and adds a bit, but that is a later addition (from 2007, thanks to Cheeso for clearing that up; see comments). If that bit is set, the filename entry has to be encoded in UTF-8. This extension is described in 'APPENDIX D - Language Encoding (EFS)', at the end of the linked document.
For Java it is a known bug to get into trouble with non-ASCII characters. See bug #4244499 and the high number of related bugs.
My colleague used URL-encoding of the filenames before storing them into the ZIP, and decoding after reading them, as a workaround (sketched below). If you control both storing and reading, that may be a workaround.
EDIT: At the bug someone suggests using the ZipOutputStream from Apache Ant as workaround. This implementation allows the specification of an encoding.
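
A sketch of that URL-encoding workaround (my illustration; it assumes Java 10+ for the Charset overloads, and both the writer and the reader must agree on the convention):

import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Before storing: yields an ASCII-only entry name such as "abc-%C3%A5%C3%A4%C3%B6.txt"
String stored = URLEncoder.encode("abc-åäö.txt", StandardCharsets.UTF_8);

// After reading the entry name back from the zip:
String original = URLDecoder.decode(stored, StandardCharsets.UTF_8);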
In Zip files, according to the spec owned by PKWare, the encoding of file names and file comments is IBM437. In 2007 PKWare extended the spec to also allow UTF-8. This says nothing about the encoding of the files contained within the zip. Only the encoding of the filenames.
I think all tools and libraries (Java and non-Java) support IBM437 (which is a superset of ASCII), and fewer tools and libraries support UTF-8. Some tools and libraries support other code pages. For example, if you zip something using WinRAR on a computer running in Shanghai, you will get the Big5 code page. This is not "allowed" by the zip spec, but it happens anyway.
The DotNetZip library for .NET does Unicode, but of course that doesn't help you if you are using Java!
Using the Java built-in support for ZIP, you will always get IBM437. If you want an archive with something other than IBM437, then use a third party library, or create a JAR.
Miracles indeed happen, and Sun/Oracle really did fix the long-lived bug/RFE:
It is now possible to specify the filename encoding when creating the zip file/stream (requires Java 7).
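
A minimal sketch of that Java 7 API, the ZipOutputStream constructor that takes a Charset (the file name and content here are illustrative):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

try (ZipOutputStream zos = new ZipOutputStream(
        new FileOutputStream("utf8-example.zip"), StandardCharsets.UTF_8)) {
    zos.putNextEntry(new ZipEntry("abc-åäö.txt")); // entry name stored as UTF-8
    zos.write("hello".getBytes(StandardCharsets.UTF_8));
    zos.closeEntry();
}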
You can still use the Apache Commons implementation of the zip stream: http://commons.apache.org/compress/apidocs/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.html#setEncoding%28java.lang.String%29
Calling setEncoding("UTF-8") on your stream should be enough.
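For example (a sketch with Commons Compress; the file and entry names are illustrative):

import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;

try (ZipArchiveOutputStream zos = new ZipArchiveOutputStream(new File("example.zip"))) {
    zos.setEncoding("UTF-8"); // encoding used for entry names
    zos.putArchiveEntry(new ZipArchiveEntry("abc-åäö.txt"));
    zos.write("hello".getBytes(StandardCharsets.UTF_8));
    zos.closeArchiveEntry();
}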
From a quick look at the TrueZIP manual, they recommend the JAR format:
"It uses UTF-8 for file name encoding and comments - unlike ZIP, which only uses IBM437."
This probably means that the API is using the java.util.zip package for its implementation; that documentation states that it is still using a ZIP format from 1996. Unicode support wasn't added to the PKWARE .ZIP File Format Specification until 2006.
Did it actually fail, or was it just a font issue (e.g. the font having different glyphs for those char codes)? I've seen similar issues in Windows where rendering "broke" because the font didn't support the charset, but the data was actually intact and correct.
Non-ASCII file names are not reliable across ZIP implementations and are best avoided. There is no provision for storing a charset setting in ZIP files; clients tend to guess with 'the current system codepage', which is unlikely to be what you want. Many combinations of client and codepage can result in inaccessible files.
Sorry!
