Character corruption going from BufferedReader to BufferedWriter in Java

In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.
I encounter a known problem when the text contains a left-facing quotation mark. Text such as
mutations to particular “hotspot” regions
becomes
mutations to particular “hotspot�? regions
I have isolated the problem by writing a simple text-copy method:
public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            // Parsing would happen here
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        input.close();
        output.close();
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}
Can anybody offer some advice as how to correct this problem?
My solution:
InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "UTF-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        // pass ASCII through unchanged
        output.write(r);
    }
    else
    {
        // escape everything else as a numeric character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();
output.close();

The file being read is not in the same encoding (probably UTF-8) as the file being written (probably ISO-8859-1).
Try the following to generate a file with UTF-8 encoding:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream
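For completeness, here is a minimal sketch of the whole copy with both encodings spelled out (the file names are placeholders and the input is assumed to be UTF-8; adjust the input charset to whatever the source file actually uses):
import java.io.*;
import java.nio.charset.StandardCharsets;

public class CopyUtf8 {
    public static void main(String[] args) throws IOException {
        // Explicit charsets on both sides, so the platform default never enters the picture.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("input.html"), StandardCharsets.UTF_8));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("output.html"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
            }
        }
    }
}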

In addition to what Thierry-Dimitri Roy wrote, if you know the encoding you have to create your FileReader with a bit of extra work. From the docs:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

The Javadoc for FileReader says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:
FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);

Related

Java: read a TXT file but some content is mistaken [duplicate]

I'm reading a file through a FileReader; the file is UTF-8 encoded (with BOM). My problem is: I read the file and output a string, but sadly the BOM marker is output too. Why does this occur?
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    // attempted fix: re-encode and re-decode the line (a no-op at best)
    String text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}
Output after the first line:
?<style>
In Java, you have to consume the UTF-8 BOM manually if present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.
Take a look at this solution: Handle UTF8 file with BOM
The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.
tmp = tmp.replace("\uFEFF", "");
Also see this Guava bug report
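If you go the plain-string route, remember that a BOM can only appear at the very start of the file, so it is enough to strip it from the first line; a small sketch reusing the question's br and content variables:
boolean first = true;
String tmp;
while ((tmp = br.readLine()) != null) {
    if (first) {
        tmp = tmp.replace("\uFEFF", ""); // the BOM only ever occurs at the start of the file
        first = false;
    }
    content += tmp + System.getProperty("line.separator");
}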
Use the Apache Commons library.
Class: org.apache.commons.io.input.BOMInputStream
Example usage:
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader
} finally {
    inputStream.close();
}
Here's how I use the Apache BOMInputStream; it uses a try-with-resources block. The false argument tells the stream to exclude any of the listed BOMs from the data rather than pass it through (we use "BOM-less" text files for safety reasons, haha):
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
                false, ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE))))
{
    // use br here
} catch (Exception e) {
    e.printStackTrace();
}
Consider UnicodeReader from Google, which does all this work for you.
Charset utf8 = StandardCharsets.UTF_8; // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8.name())) {
....
}
Maven Dependency:
<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>
Use Apache Commons IO.
For example, let's take a look at my code (used for reading a text file with both Latin and Cyrillic characters) below:
String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));
BOMInputStream bomInputStream = new BOMInputStream(inputStream);
ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
List<String> ari = new ArrayList<>(); // collects each character as its own string
int data = reader.read();
while (data != -1) {
    char theChar = (char) data;
    data = reader.read();
    ari.add(Character.toString(theChar));
}
reader.close();
As a result we have an ArrayList named "ari" with all characters from the file "1.txt" except the BOM.
If somebody wants to do it with the standard library, this would be a way:
public static String cutBOM(String value) {
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    // note: getBytes() uses the platform default charset here
    String bom = String.format("%x", new BigInteger(1, value.substring(0, 3).getBytes()));
    if (bom.equals("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.substring(0, 4).equals("feff") || bom.substring(0, 4).equals("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}
It's mentioned here that this is usually a problem with files on Windows.
One possible solution would be running the file through a tool like dos2unix first.
The easiest way I found to bypass the BOM:
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // if present, remove the UTF-8 BOM character
    currentLine = currentLine.replace("\uFEFF", "");
}

Unable to read clear data from PDF file as the other language is not English

I am trying to copy some data from a PDF to a txt file; here is the code:
public void readPDFFile() throws IOException {
    InputStreamReader reader;
    OutputStreamWriter writer;
    FileInputStream inputstream;
    FileOutputStream outputStream;
    BufferedReader bufferedReader = null;
    BufferedWriter bufferedWriter = null;
    String str;
    File rfile = new File("C://Documents and Settings/Administrator/My Documents/EGDownloads/source.pdf");
    File wFile = new File("C://Documents and Settings/Administrator/My Documents/Folder/destination.txt");
    try {
        inputstream = new FileInputStream(rfile);
        outputStream = new FileOutputStream(wFile);
        reader = new InputStreamReader(inputstream, "UTF-8");
        writer = new OutputStreamWriter(outputStream, "UTF-8");
        bufferedReader = new BufferedReader(reader);
        bufferedWriter = new BufferedWriter(writer);
        while ((str = bufferedReader.readLine()) != null) {
            writer.write(str);
        }
    } catch (IOException es) {
        System.out.println(es.getMessage());
        es.printStackTrace(System.out);
    } finally {
        if (bufferedReader != null) {
            bufferedReader.close();
        }
        if (bufferedWriter != null)
            bufferedWriter.close();
    }
}
The expected output is supposed to be in another language, but all I am getting is random boxes; I tried both the UTF-16 and UTF-8 encodings.
I tried PDFBox, but it is still not working: all I get is English text and accents rather than the other language.
Note:
1. I'm not trying to print the data on the console, but to copy it from a PDF to a txt file.
2. The file contains non-English words.
Can anyone help me solve this, or share any link that might help?
Thanks.
The PDF format is a binary format. You must have a really unusual PDF, as all that I know of are compressed in some way. Use a proper library to read it, be it PDFBox or iText or another. Be aware that from some PDFs it's impossible to extract text; you can check with Acrobat: if Acrobat can't do it, nobody can.
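For reference, a minimal PDFBox 2.x sketch of proper extraction (the paths are shortened from the question's; whether the text survives extraction still depends on the PDF's fonts and encodings):
import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

try (PDDocument doc = PDDocument.load(new File("source.pdf"));
     Writer out = new OutputStreamWriter(
             new FileOutputStream("destination.txt"), StandardCharsets.UTF_8)) {
    // PDFTextStripper decodes the PDF's internal font encodings to Unicode text.
    out.write(new PDFTextStripper().getText(doc));
}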

java detect if file is UTF-8 or Ansi

In Java, is there a way to detect whether a file is ANSI or UTF-8? The problem I am having is that if someone creates a CSV file in Excel, it's UTF-8; if they create it using Notepad, it's ANSI.
I am wondering if I can detect the type of file and then handle it accordingly.
Thanks.
You could try something like this. It relies on Excel including a Byte Order Mark (BOM), which a quick search suggests it does, although I can't verify it, and on the fact that Java treats the BOM as a particular "character", \uFEFF.
FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line = br.readLine();
if (line.startsWith("\uFEFF")) {
    // it's UTF-8; throw away the BOM character and continue
    line = line.substring(1);
} else {
    // it's not UTF-8; reopen
    br.close(); // also closes fis
    fis = new FileInputStream(file); // reopen from the start
    br = new BufferedReader(new InputStreamReader(fis, "Cp1252"));
    line = br.readLine();
}
// now line contains the first line, and br.readLine() will get the next
Some more information on the UTF-8 Byte Order Mark and detection of encoding at http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

Convert file from Cp1252 to UTF-8 in Java

A user uploads a file with the character encoding Cp1252.
Since my MySQL table columns' collation is utf8_bin, I try to convert the file to UTF-8 before loading the data into the table with the LOAD DATA INFILE command.
Java source code:
OutputStream output = new FileOutputStream(destpath);
InputStream input = new FileInputStream(filepath);
BufferedReader reader = new BufferedReader(new InputStreamReader(input, "windows-1252"));
BufferedWriter writ = new BufferedWriter(new OutputStreamWriter(output, "UTF8"));
String in;
while ((in = reader.readLine()) != null) {
    writ.write(in);
    writ.newLine();
}
writ.flush();
writ.close();
reader.close();
It seems that the characters are not converted correctly: the converted file has � and box symbols in multiple places. How do I convert the file to UTF-8 correctly? Thanks.
One way of verifying the conversion process is to configure the charset decoder and encoder to bail out on errors instead of silently replacing the erroneous characters with special characters:
CharsetDecoder inDec = Charset.forName("windows-1252").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
CharsetEncoder outEnc = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try (FileInputStream is = new FileInputStream(filepath);
     BufferedReader reader = new BufferedReader(new InputStreamReader(is, inDec));
     FileOutputStream fw = new FileOutputStream(destpath);
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fw, outEnc))) {
    for (String in; (in = reader.readLine()) != null; ) {
        out.write(in);
        out.newLine();
    }
}
Note that the output encoder is configured for symmetry here; since UTF-8 can encode every Unicode character, it will never actually report an error, but the symmetric setup helps once you want to reuse the same code for other conversions.
Further, note that this won't help if the input file is in a different encoding whose bytes nevertheless decode to valid characters. One thing to consider is whether the input encoding "windows-1252" actually meant the system's default encoding (and whether the two are really the same). If in doubt, you may use Charset.defaultCharset() instead of Charset.forName("windows-1252") when the actually intended conversion is default → UTF-8.
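With REPORT configured, a wrong input charset surfaces as an exception from the read instead of silent � substitution, so you can catch and flag it; a sketch of the pattern:
try {
    // ... the copy loop from above ...
} catch (MalformedInputException | UnmappableCharacterException e) {
    // thrown by readLine() when the input bytes are not valid windows-1252
    System.err.println("Input is not valid windows-1252: " + e);
}
Both exception types live in java.nio.charset and extend IOException, so an existing catch (IOException e) would also cover them.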

Reading hebrew from text file with Java

I'm having trouble reading a UTF-8 encoded text file in Hebrew.
I read all Hebrew characters successfully except for two letters: 'מ' and 'א'.
Here is how I read it:
FileInputStream fstream = new FileInputStream(SCHOOLS_LIST_PATH);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
// Read the file line by line
while ((strLine = br.readLine()) != null) {
    if (strLine.contains("zevel")) {
        continue;
    }
    schools.add(getSchoolFromLine(strLine));
}
Any idea?
Thanks,
Tomer
You're using InputStreamReader without specifying the encoding, so it's using the default for your platform - which may well not be UTF-8.
Try:
new InputStreamReader(in, "UTF-8")
Note that it's not obvious why you're using DataInputStream here... just create an InputStreamReader around the FileInputStream.
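Putting it together, a minimal corrected version using the question's own identifiers (assuming the file really is UTF-8; on Java 7+, Files.newBufferedReader does the InputStreamReader wiring for you):
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

try (BufferedReader br = Files.newBufferedReader(
        Paths.get(SCHOOLS_LIST_PATH), StandardCharsets.UTF_8)) {
    String strLine;
    while ((strLine = br.readLine()) != null) {
        if (strLine.contains("zevel")) {
            continue; // skip junk lines, as in the original
        }
        schools.add(getSchoolFromLine(strLine));
    }
}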
