Import csv issue with characters - java

When I import a CSV file that contains country names, some characters are not decoded correctly, and I get a ? mark instead of the character written in the CSV file.
Here are the countries that cause the problem: ÅLAND ISLANDS, SAINT BARTHÉLEMY, CÔTE D'IVOIRE, CURAÇAO.
Here is the code for importing the CSV file:
ICsvBeanReader beanReader = new CsvBeanReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8),
new CsvPreference.Builder(CsvPreference.STANDARD_PREFERENCE).useQuoteMode(new AlwaysQuoteMode()).build());
At first I used FileReader and had the problem with all of these countries. Then I changed to InputStreamReader with the UTF-8 charset, and the problem was almost solved. With UTF-8 I only have a problem reading "ÅLAND ISLANDS"; as a result I get "?LAND ISLANDS".
I've also tried the ISO_8859_1 and Windows-1252 charsets, but the problem with "ÅLAND ISLANDS" is always the same.
Does anyone know which charset I should use to solve this problem?

Java's FileReader doesn't handle the byte order mark (BOM). I suspect that's the issue.
Different versions handle it differently.
Wrap the input stream with the method below, which detects the encoding from the BOM. The BOMInputStream class it relies on is available in commons-io. If you don't have commons-io, copy the code from that library; it is only about 10 to 20 lines. Hope that works.
public static InputStreamReader getInputStreamReader(InputStream inputStream) throws IOException
{
BOMInputStream bOMInputStream = new BOMInputStream(inputStream, false, ByteOrderMark.UTF_8,
ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);
ByteOrderMark bom = bOMInputStream.getBOM();
String charsetName = bom == null ? "UTF-8" : bom.getCharsetName();
return new InputStreamReader(bOMInputStream, charsetName);
}
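If pulling in commons-io is not an option, the same idea can be sketched with the JDK alone. This is a minimal sketch, not the commons-io implementation: the class name Utf8BomSkipper and both helper methods are hypothetical, and it only recognizes the UTF-8 BOM (bytes EF BB BF), assuming everything after it is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of commons-io or Super CSV.
class Utf8BomSkipper {

    /** Returns a UTF-8 reader over the stream, consuming a leading EF BB BF if present. */
    static Reader utf8ReaderSkippingBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; short files may deliver fewer.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // No BOM: push the bytes back so the reader sees them as content.
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    /** Drains a reader into a String (meant for small inputs only). */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "\u00C5LAND ISLANDS".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[text.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(text, 0, withBom, 3, text.length);

        String decoded = readAll(utf8ReaderSkippingBom(new ByteArrayInputStream(withBom)));
        System.out.println(decoded.equals("\u00C5LAND ISLANDS")); // prints true
    }
}
```

The returned Reader could then be passed to CsvBeanReader in place of the InputStreamReader from the question.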

Related

Encoding for unicode and & characters

I am trying to save the string below to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But while printing the protomodel value it's displayed like:
STOXX®Europe 600 Food&BevNR ETF
I tried to encode the string as UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but neither worked. I'm getting this string by parsing the XML response from a server. Any ideas?
Ref: XML parser Skip invalid xml element with XmlStreamReader
Fixing the XML parsing is better than having to unescape everything afterwards. Here is a test case showing this:
public static void main(String[] args) throws Exception {
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty("javax.xml.stream.isCoalescing", true);
ReaderInputStream ris = new ReaderInputStream(new StringReader("<tag>STOXX®Europe 600 Food&amp;BevNR ETF</tag>"), StandardCharsets.UTF_8);
XMLStreamReader reader = factory.createXMLStreamReader(ris, "UTF-8");
StringBuilder sb = new StringBuilder();
while (reader.hasNext()) {
reader.next();
if (reader.hasText())
sb.append(reader.getText());
}
System.out.println(sb);
}
Output:
STOXX®Europe 600 Food&BevNR ETF
Actually, I found a protobuf method that solves this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString
As the text comes from XML use:
s = StringEscapeUtils.unescapeXml(s);
This is much better than unescaping HTML, which has hundreds of named entities (&...;).
The two garbage characters in place of the registered trademark sign (®) come from reading UTF-8 encoded text (which uses multiple bytes for special characters) as some single-byte encoding, probably Latin-1.
This wrong conversion might be repaired with another conversion, but it would be best to read the text as UTF-8 in the first place.
// Hack, just patching. Assumes Latin-1 encoding
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
Better to inspect the reading code and check whether an optional encoding argument went missing: InputStreamReader, OutputStreamWriter, new String, getBytes.
Your entire problem would be solved by using an XML reader too.
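The mis-decoding described above can be reproduced in a few lines. The sketch below is only an illustration (the class and method names are made up): it encodes a string as UTF-8, decodes the bytes as ISO-8859-1 to produce the extra garbage character, and then applies the same patch as the hack above to undo it.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how UTF-8 bytes read as Latin-1 turn one character into two.
class MojibakeDemo {

    /** Simulates the bug: UTF-8 bytes decoded with a single-byte charset. */
    static String decodeUtf8AsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses the bug, assuming the wrong charset really was ISO-8859-1. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "STOXX\u00AEEurope"; // \u00AE is the registered sign
        String garbled = decodeUtf8AsLatin1(original);

        // U+00AE is two bytes in UTF-8 (C2 AE), so the garbled text is one char longer.
        System.out.println(garbled.length() - original.length()); // prints 1
        System.out.println(repair(garbled).equals(original));     // prints true
    }
}
```

The round trip is lossless here because ISO-8859-1 maps every byte value to a character. If the wrong charset was something else, the repair may not recover the original text.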

read greek characters from xls file into java

I am trying to read an xls file in java and convert it to csv. The problem is that it contains greek characters. I have used various different methods with no success.
br = new BufferedReader(new InputStreamReader(
new FileInputStream(saveDir+"/"+fileName+".xls"), "UTF-8"));
FileWriter writer1 = new FileWriter(saveDir+"/A"+fileName+".csv");
byte[] bytes = thisLine.getBytes("UTF-8");
writer1.append(new String(bytes, "UTF-8"));
I tried that with different encodings, like UTF-16 and windows-1253, and of course also without using the bytes array. None worked. Any ideas?
Use "ISO-8859-7" instead of "UTF-8". It covers Latin and Greek. See documentation
InputStream in = new BufferedInputStream(new FileInputStream(new File(myfile)));
result = new Scanner(in,"ISO-8859-7").useDelimiter("\\A").next();
A byte order mark (BOM) should be written at the start of the CSV file.
Can you try this code?
PrintWriter writer1 = new PrintWriter(saveDir+"/A"+fileName+".csv", "UTF-8");
writer1.print('\ufeff');
....
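A slightly fuller version of the BOM suggestion might look like the sketch below. It assumes the CSV should be written as UTF-8 and that the consumer (a spreadsheet program, for example) uses the BOM to detect that; the class name, method name and temporary file are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: writes CSV lines as UTF-8, preceded by a byte order mark.
class BomCsvWriter {

    static void writeCsvWithBom(Path target, Iterable<String> lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            // U+FEFF is encoded as the three bytes EF BB BF in UTF-8.
            writer.write('\uFEFF');
            for (String line : lines) {
                writer.write(line);
                writer.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("greek", ".csv");
        // Greek letters alpha, beta, gamma as sample cell values.
        writeCsvWithBom(target, Arrays.asList("\u03B1,\u03B2,\u03B3"));

        byte[] bytes = Files.readAllBytes(target);
        System.out.printf("%02X %02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // prints EF BB BF
        Files.delete(target);
    }
}
```

Passing the charset explicitly matters here: if the writer fell back to a platform default that cannot encode U+FEFF, the BOM would not be written correctly.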

How do I get an FileInputStream from FileItem in java?

I am trying to avoid FileItem's getInputStream(), because the content ends up decoded with the wrong encoding; for that reason I need a FileInputStream instead. Is there any way to get a FileInputStream without using this method? Or can I convert my FileItem into a File?
if (this.strEncoding != null && !this.strEncoding.isEmpty()) {
br = new BufferedReader(new InputStreamReader(clsFile.getInputStream(), this.strEncoding));
}
else {
// br = ?????
}
You can try
FileItem#getString(encoding)
Returns the contents of the file item as a String, using the specified encoding.
You can use the write method here.
File file = new File("/path/to/file");
fileItem.write(file);
An InputStream is binary data, bytes. It must be converted to text by giving the encoding of those bytes.
Java uses Unicode internally to represent text in all scripts. For text it uses String/char/Reader/Writer.
For binary data it uses byte[], InputStream, OutputStream.
So you could use a bridging class, like InputStreamReader:
String encoding = "UTF-8"; // Or "Windows-1252" ...
BufferedReader in = new BufferedReader(
new InputStreamReader(fileItem.getInputStream(),
encoding));
Or if you read the bytes:
String s = new String(bytes, encoding);
The encoding is often an optional parameter (in which case an overloaded method without the encoding also exists).
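Applied to the empty else branch in the question, the bridging idea could be sketched as follows. The helper takes any InputStream, so it has no dependency on the upload library; the class name, method name and the UTF-8 fallback are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: always decodes with an explicit charset, never the platform default.
class EncodingAwareReader {

    static BufferedReader open(InputStream in, String encodingName) {
        Charset charset = (encodingName == null || encodingName.isEmpty())
                ? StandardCharsets.UTF_8            // assumed fallback when no encoding is configured
                : Charset.forName(encodingName);    // throws an unchecked exception for unknown names
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes standing in for the uploaded content: "K\u00F6ln" encoded as ISO-8859-1.
        byte[] latin1 = "K\u00F6ln".getBytes(StandardCharsets.ISO_8859_1);

        try (BufferedReader reader = open(new ByteArrayInputStream(latin1), "ISO-8859-1")) {
            System.out.println(reader.readLine().equals("K\u00F6ln")); // prints true
        }
    }
}
```

In the question's code, the stream argument would be the one returned by getInputStream(); the point is that the bytes themselves carry no encoding, so the choice has to be made when they are turned into text.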

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some of the characters are always converted incorrectly.
The strange thing is that the same character in the same text is correctly converted at some places and incorrectly at some others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter (new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
try{
is = new FileInputStream(f);
ContentHandler contenthandler = new BodyContentHandler(10*1024*1024);
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(is, contenthandler, metadata, new ParseContext());
String outputString = contenthandler.toString();
outputString = outputString.replace("\n", "\r\n");
System.err.println("Writing now file "+newname);
print.write(outputString);
}catch (Exception e) {
e.printStackTrace();
}
finally {
if (is != null) is.close();
print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.
Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (as it can encode all of Unicode, and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
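A minimal sketch of the writing half, kept separate from the PDF extraction, might look like this. It assumes Java 7 or later, where try-with-resources closes the writer as a finally block would; the class name, method name and sample text are hypothetical, and the Tika parsing step is deliberately left out.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper: writes already-extracted text to a file as UTF-8.
class Utf8TextFile {

    static void write(File target, String text) throws IOException {
        // try-with-resources closes the writer even if write() throws.
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("extracted", ".txt");
        target.deleteOnExit();

        // Stand-in for the extracted text: the three sample characters from the question.
        String extracted = "\u010D, \u010E, \u0160";
        write(target, extracted);

        String readBack = new String(Files.readAllBytes(target.toPath()), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(extracted)); // prints true
    }
}
```

If the characters survive this round trip but are still wrong in the real output, the damage is happening during extraction rather than during writing.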

How to save an HTML page with special chars (UTF-8) to a txt file

I need to write Java code that saves an HTML page to a txt file.
The problem is that the special characters in UTF-8 come out broken.
Words like "Hamamélis" are saved as "Hamam�lis".
The code I wrote is listed below:
URLConnection conn;
conn = site.openConnection();
conn.setReadTimeout(10000);
Charset charset = Charset.forName("UTF8");
BufferedReader in = new BufferedReader( new InputStreamReader( conn.getInputStream(), "UTF-8" ) );
buff = in.readLine();
And after:
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(Nome), "UTF-8"));
out.write(buff);
out.close();
Can anyone suggest a solution?
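One possible error is omitting the hyphen from "UTF-8" in the 4th line of your first piece of code, though Java also accepts "UTF8" as an alias and that charset variable is never used, so it is unlikely to be the cause. See the Charset documentation.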
Otherwise, code seems correct. But of course we cannot test it directly as we do not have your data.
For comparison, here is a little class I wrote. In a manner similar to your code, this class correctly writes your "Hamamélis" example's accented 'e' as the two octets expected in UTF-8 for a single (non-normalized) character: in hex 'C3' & 'A9'.
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.BufferedWriter;
import java.io.IOException;
public class ReaderWriter {
public static void main(String[] args) {
try {
String content = "Hamamélis. Written: " + new java.util.Date();
File file = new File("some_text.txt");
// Create file if not already existent.
if (!file.exists()) {
file.createNewFile();
}
FileOutputStream fileOutputStream = new FileOutputStream( file );
OutputStreamWriter outputStreamWriter = new OutputStreamWriter( fileOutputStream, "UTF-8" );
BufferedWriter bufferedWriter = new BufferedWriter( outputStreamWriter );
bufferedWriter.write( content );
bufferedWriter.close();
System.out.println("ReaderWriter 'main' method is done. " + new java.util.Date() );
} catch (IOException e) {
e.printStackTrace();
}
}
}
As icktoofay commented, you should dig deeper to discover exactly what octets are involved. Use a hex editor like this "File Viewer" app I found today on the Mac App Store to see the exact octets in your saved file.
If the octets are C3 & A9, then the problem is simply that the text editor you used to look at the file as text used the wrong character encoding. For example, you can open that text file in a web browser, and use its menu commands to re-interpret the file as UTF-8.
If the octets are not C3 & A9, I would go further back to examine the input's octets.
If you do not understand that text files in computers actually contain numbers (not text in the human sense), then take a break from coding to read this entertaining article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
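As an alternative to a hex editor, the octets can also be printed from Java. The sketch below is an assumption-laden illustration (the class name and the optional file argument are made up): it prints the bytes of the "Hamamélis" example in UTF-8 and in ISO-8859-1, and dumps a file if a path is given.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative helper for inspecting the exact octets of a string or a saved file.
class HexDump {

    /** Formats bytes as space-separated two-digit uppercase hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String word = "Hamam\u00E9lis";
        System.out.println(toHex(word.getBytes(StandardCharsets.UTF_8)));      // 48 61 6D 61 6D C3 A9 6C 69 73
        System.out.println(toHex(word.getBytes(StandardCharsets.ISO_8859_1))); // 48 61 6D 61 6D E9 6C 69 73

        // Optional: dump a saved file whose path is passed as the first argument.
        if (args.length > 0) {
            System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
        }
    }
}
```

Seeing C3 A9 in the saved file points to the viewer's encoding; seeing a lone E9 or a 3F (a literal question mark) suggests the text was already converted incorrectly before it was written.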
