Friends
I have to use a BOM (Byte Order Mark) to make sure that downloaded CSV and Excel files in UTF-8 format are displayed properly.
My question is: can the BOM be applied to a FileOutputStream instead of an OutputStream, as below?
String targetfileName = reportXMLFileName.substring(0, reportXMLFileName.lastIndexOf(".") + 1);
targetfileName += "xlsx";
File tmpFile = new File((filePath != null ? filePath.trim() : "") + "/" + templateFile);
FileOutputStream out = new FileOutputStream((filePath != null ? filePath.trim() : "") + "/" + targetfileName);
/* Here is the example: write the UTF-8 BOM before anything else */
out.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF });
substitute(tmpFile, tmp, sheetRef.substring(1), out);
out.close();
(As asked in a comment.)
In the following, file may be a File, a (File)OutputStream, or a String (filename).
final String ENCODING = "UTF-8";
PrintWriter out = new PrintWriter(file, ENCODING);
try {
    out.print("\uFEFF"); // Write the BOM
    out.println("<?xml version=\"1.0\" encoding=\"" + ENCODING + "\"?>");
    ...
} finally {
    out.close();
}
Or, since Java 7:
try (PrintWriter out = new PrintWriter(f, ENCODING)) {
    out.print("\uFEFF"); // Write the BOM
    out.println("<?xml version=\"1.0\" encoding=\"" + ENCODING + "\"?>");
    ...
}
Working with text makes it more natural to use a Writer, as one then does not need to convert strings oneself with String.getBytes("UTF-8").
Yes. A BOM sits at the beginning of any data stream, whether over the network or in a file, so just write the BOM at the beginning of the file in the same manner, only to the FileOutputStream. In any case, remember that a FileOutputStream is a type of OutputStream.
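For illustration, a minimal sketch of exactly that (the file name and content are hypothetical):
// Write the UTF-8 BOM first, then the UTF-8 encoded content.
FileOutputStream out = new FileOutputStream("report.csv"); // hypothetical name
out.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }); // the UTF-8 BOM
out.write("col1,col2\n".getBytes("UTF-8"));
out.close();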
Related
Apparently, for Excel to open CSV files nicely, they should have a Byte Order Mark at the start. The CSV download is implemented by writing into the HttpServletResponse's output stream in the controller, as the data is generated during the request. I get an exception when I try to write the BOM bytes: java.io.CharConversionException: Not an ISO 8859-1 character: [] (even though the encoding I specified is UTF-8).
The controller's method in question:
@RequestMapping("/monthly/list")
public List<MonthlyDetailsItem> queryDetailsItems(
        MonthlyDetailsItemQuery query,
        @RequestParam(value = "format", required = false) String format,
        @RequestParam(value = "attachment", required = false, defaultValue = "false") Boolean attachment,
        HttpServletResponse response) throws Exception
{
    // load item list
    List<MonthlyDetailsItem> list = detailsSvc.queryMonthlyDetailsForList(query);
    // adjust format
    format = format != null ? format.toLowerCase() : "json";
    if (!Arrays.asList("json", "csv").contains(format)) format = "json";
    // modify common response headers
    response.setCharacterEncoding("UTF-8");
    if (attachment)
        response.setHeader("Content-Disposition", "attachment;filename=duomenys." + format);
    // build csv
    if ("csv".equals(format)) {
        response.setContentType("text/csv; charset=UTF-8");
        response.getOutputStream().print("\ufeff"); // throws the CharConversionException here
        response.getOutputStream().write(buildMonthlyDetailsItemCsv(list).getBytes("UTF-8"));
        return null;
    }
    return list;
}
I have just come across this same problem. The solution that works for me is to get the output stream from the response object and write to it as follows:
// first create an array for the Byte Order Mark
final byte[] bom = new byte[] { (byte) 239, (byte) 187, (byte) 191 }; // 0xEF 0xBB 0xBF
try (OutputStream os = response.getOutputStream()) {
    os.write(bom);
    final PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(data);
    w.flush();
    w.close();
} catch (IOException e) {
    // log it
}
So UTF-8 is specified on the OutputStreamWriter.
As an addendum to this, I should add that the same application needs to allow users to upload files, which may or may not have BOMs. This can be dealt with by using the class org.apache.commons.io.input.BOMInputStream, then using that to construct an org.apache.commons.csv.CSVParser.
The BOMInputStream includes a hasBOM() method to detect whether the file has a BOM.
One gotcha I first fell into was that the hasBOM() method reads (obviously!) from the underlying stream, so the way to deal with this is to first mark the stream, then, after the test, reset it if no BOM was found. The code I use for this looks like the following:
try (InputStream is = uploadFile.getInputStream();
     BufferedInputStream buffIs = new BufferedInputStream(is);
     BOMInputStream bomIn = new BOMInputStream(buffIs)) {
    buffIs.mark(LOOKAHEAD_LENGTH);
    // this should allow us to deal with CSVs with or without BOMs
    final boolean hasBOM = bomIn.hasBOM();
    final BufferedReader buffReadr = new BufferedReader(
            new InputStreamReader(hasBOM ? bomIn : buffIs, StandardCharsets.UTF_8));
    // if this stream does not have a BOM, then we must reset the stream, as the test
    // for a BOM will have consumed some bytes
    if (!hasBOM) {
        buffIs.reset();
    }
    // collect the validated entity details
    final CSVParser parser = CSVParser.parse(buffReadr,
            CSVFormat.DEFAULT.withFirstRecordAsHeader());
    // Do stuff with the parser
    ...
}
// Catch and clean up
Hope this helps someone.
It doesn't make much sense: the BOM is for UTF-16; there is no byte order with UTF-8. The encoding you've set with setCharacterEncoding is used for getWriter, not for getOutputStream.
UPDATE:
OK, try this:
if ("csv".equals(format)) {
response.setContentType("text/csv; charset=UTF-8");
PrintWriter out = response.getWriter();
out.print("\uFEFF");
out.print(buildMonthlyDetailsItemCsv(list));
return null;
}
I'm assuming that method buildMonthlyDetailsItemCsv returns a String.
I'm writing my byte array to a file:
PrintWriter pw = new PrintWriter(new FileOutputStream(fileOutput, true));
pw.write(new String(cryptogram, Charset.defaultCharset()));
pw.close();
Then, I am reading it from the file like this:
String cryptogramString = new String();
while (scPriv.hasNext()) {
    linePriv = scPriv.nextLine();
    cryptogramString += linePriv;
}
But I don't know how to make a byte[] from cryptogramString. I'm trying this:
byte[] b = cryptogramString.getBytes(Charset.defaultCharset());
System.out.println(Arrays.toString(b));
System.out.println(Arrays.toString(cryptogram));
But it doesn't return the same values. Does anyone have an idea how to make this right?
You should decide whether you are writing text or binary.
Encrypted data is always binary which means you shouldn't be using Reader/Writer/String classes.
try (FileOutputStream out = new FileOutputStream(filename)) {
    out.write(bytes);
}
to read back in
byte[] bytes = new byte[(int) new File(filename).length()];
try (DataInputStream in = new DataInputStream(new FileInputStream(filename))) {
    in.readFully(bytes); // a plain read() is not guaranteed to fill the whole array
}
I have a file that contains XML and then plain text, so I can't read the file as a whole.
You also can't write binary into a text file. You can encode it using Base64.
Storing base64 data in XML?
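For instance, a minimal sketch using java.util.Base64 (available since Java 8); cryptogram is the byte array from the question:
import java.util.Base64;

// Encode the binary cryptogram as Base64 text; the round trip is exact,
// unlike a charset-based new String(...) conversion.
String encoded = Base64.getEncoder().encodeToString(cryptogram);
byte[] decoded = Base64.getDecoder().decode(encoded);
// decoded is now identical to cryptogram, byte for byte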
I have a set of PDF files that contain central European characters such as č, Ď, Š, and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some characters are always converted incorrectly.
The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like that:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.
Well, aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (as it can encode all of Unicode and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
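For illustration, the writing part might then look something like this (a sketch reusing the question's newname and outputString variables):
Writer writer = null;
try {
    // UTF-8 can encode all of Unicode, unlike the platform default encoding
    writer = new OutputStreamWriter(new FileOutputStream(newname), "UTF-8");
    writer.write(outputString);
} finally {
    if (writer != null) {
        writer.close();
    }
}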
I need to write Java code that saves an HTML page to a .txt file.
The problem is that the special characters in UTF-8 are broken.
Words like "Hamamélis" are saved like this: "Hamam�lis".
The code I wrote is listed below:
URLConnection conn;
conn = site.openConnection();
conn.setReadTimeout(10000);
Charset charset = Charset.forName("UTF8");
BufferedReader in = new BufferedReader( new InputStreamReader( conn.getInputStream(), "UTF-8" ) );
buff = in.readLine();
And after:
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(Nome), "UTF-8"));
out.write(buff);
out.close();
Can anyone suggest a solution?
One possible error is omitting the hyphen from "UTF-8" in the 4th line of your first piece of code. See the Charset documentation.
Otherwise, the code seems correct. But of course we cannot test it directly, as we do not have your data.
For comparison, here is a little class I wrote. In a manner similar to your code, this class correctly writes your "Hamamélis" example's accented 'e' as the two octets expected in UTF-8 for a single (non-normalized) character: in hex, 'C3' and 'A9'.
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.BufferedWriter;
import java.io.IOException;

public class ReaderWriter {
    public static void main(String[] args) {
        try {
            String content = "Hamamélis. Written: " + new java.util.Date();
            File file = new File("some_text.txt");
            // Create file if not already existent.
            if (!file.exists()) {
                file.createNewFile();
            }
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream, "UTF-8");
            BufferedWriter bufferedWriter = new BufferedWriter(outputStreamWriter);
            bufferedWriter.write(content);
            bufferedWriter.close();
            System.out.println("ReaderWriter 'main' method is done. " + new java.util.Date());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
As icktoofay commented, you should dig deeper to discover exactly which octets are involved. Use a hex editor, like this "File Viewer" app I found today on the Mac App Store, to see the exact octets in your saved file.
If the octets are C3 and A9, then the problem is simply that the text editor you used to look at the file used the wrong character encoding. For example, you can open the text file in a web browser and use its menu commands to re-interpret the file as UTF-8.
If the octets are not C3 and A9, I would go further back and examine the input's octets.
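If you don't have a hex editor handy, a few lines of Java will dump the leading octets (a sketch; pass the file path as the first argument; the class name is arbitrary):
import java.io.FileInputStream;
import java.io.IOException;

public class HexHead {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream(args[0])) {
            byte[] buf = new byte[16];
            int n = in.read(buf); // the first few octets are enough to spot C3 A9 or a BOM
            for (int i = 0; i < n; i++) {
                System.out.printf("%02X ", buf[i]);
            }
            System.out.println();
        }
    }
}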
If you do not understand that text files in computers actually contain numbers (not text in the human sense), then take a break from coding to read this entertaining article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
I use the following code to save Chinese characters into a .txt file, but when I opened it with WordPad, I couldn't read it.
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i)); // write(int) keeps only the low 8 bits of each char
}
fos.close();
What can I do? I know that if I cut and paste Chinese characters into WordPad, I can save them into a .txt file. How do I do that in Java?
There are several factors at work here:
Text files have no intrinsic metadata describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular).
The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values; text files written in this format are not portable.
To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true; Raymond Chen explains). In theory, the BOM is there to tell you the endianness (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are also times when you don't want it on Windows, like when you're appending data to an existing file: fos = new FileOutputStream(FileName, Append);
Here is a method of reliably appending UTF-8 data to a file:
private static void writeUtf8ToFile(File file, boolean append, String data)
        throws IOException {
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
        OutputStream out = res.using(new FileOutputStream(file, append));
        Writer writer = res.using(new OutputStreamWriter(out, Charset.forName("UTF-8")));
        if (!skipBOM) {
            writer.write('\uFEFF');
        }
        writer.write(data);
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
}
Note: if the file already existed, you chose to append, and the existing data wasn't UTF-8 encoded, the only thing that code will create is a mess.
Here is the Closer type used in this code:
public class Closer implements Closeable {
    private Closeable closeable;

    public <T extends Closeable> T using(T t) {
        closeable = t;
        return t;
    }

    @Override
    public void close() throws IOException {
        if (closeable != null) {
            closeable.close();
        }
    }
}
This code makes a Windows-style best guess about how to read the file based on byte order marks:
private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
        Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encoding : UTF_ENCODINGS) {
        byte[] bom = "\uFEFF".getBytes(encoding);
        in.mark(bom.length);
        for (byte b : bom) {
            if ((0xFF & b) != in.read()) {
                in.reset();
                continue charsetLoop;
            }
        }
        return encoding;
    }
    return Charset.defaultCharset();
}

private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
        InputStream in = res.using(new FileInputStream(file));
        InputStream bin = res.using(new BufferedInputStream(in));
        Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
        StringBuilder out = new StringBuilder();
        for (int ch = reader.read(); ch != -1; ch = reader.read())
            out.append((char) ch);
        return out.toString();
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
}
(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration; one possible workaround is sketched below.)
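If your console itself understands UTF-8, a sketch of that workaround (not part of the code above) is to wrap System.out in a PrintStream with an explicit charset:
import java.io.PrintStream;

// autoflush on, charset given explicitly; throws UnsupportedEncodingException
// (an IOException) if the charset name is unknown
PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
utf8Out.println(readText(new File("chinese.txt")));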
If you can rely on the default character encoding being UTF-8 (or some other Unicode encoding), you may use the following:
Writer w = new FileWriter("test.txt");
w.append("上海");
w.close();
The safest way is to always explicitly specify the encoding:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.append("上海");
w.close();
P.S. You may use any Unicode characters in Java source code, even in method and variable names, if the -encoding parameter for javac is configured correctly. That makes the source code more readable than the escaped \uXXXX form. For example:
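A sketch (the class name is arbitrary; compile with javac -encoding UTF-8 Demo.java):
public class Demo {
    // a Unicode identifier; legal Java when the source encoding is declared
    static final String 上海 = "\u4E0A\u6D77";

    public static void main(String[] args) {
        System.out.println(上海);
    }
}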
Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
will not work if you're running under an operating system like Windows. Even setting the system property file.encoding to UTF-8 does not fix the issue. This is because Java fails to write a byte order mark (BOM) for the file. Even if you specify the encoding when writing out to a file, opening the same file in an application like WordPad will display the text as garbage, because it doesn't detect the BOM. I tried running the examples here on Windows (with a platform/container encoding of CP1252).
The following bug exists to describe the issue in Java:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
The solution for the time being is to write the byte order mark yourself, to ensure the file opens correctly in other applications (a sketch follows the links below). See this for more details on the BOM:
http://mindprod.com/jgloss/bom.html
and for a more correct solution see the following link:
http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
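In the spirit of that last link, here is a minimal sketch of writing the BOM yourself (assuming UTF-8, as elsewhere in this thread):
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.write('\uFEFF'); // the BOM; encoded as EF BB BF on disk
w.append("\u4E0A\u6D77");
w.close();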
Here's one way among many. Basically, we're just specifying that the conversion to UTF-8 be done before the bytes are output to the FileOutputStream:
String FileName = "output.txt";
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;

Writer writer = new OutputStreamWriter(new FileOutputStream(FileName, Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();
I manually verified this against the images at http://www.fileformat.info/info/unicode/char/ . In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
Try this:
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileName, Append), "UTF8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    out.write(Shanghai_StrBuf.charAt(i));
}
out.close();