Java: read a TXT file but some content is mistaken [duplicate]

I'm reading a file through a FileReader; the file is UTF-8 encoded (with BOM). Now my problem is: when I read the file and output it as a string, the BOM marker is output too. Why does this occur?
fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}
Output after the first line:
?<style>

In Java, you have to manually consume the UTF-8 BOM if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.
Take a look at this solution: Handle UTF8 file with BOM
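If you prefer to stay with the standard library, a minimal sketch of consuming the BOM by hand might look like this (the mark/reset approach and the variable names are my own illustration, not taken from the linked solution):
BufferedReader r = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
r.mark(1);
if (r.read() != 0xFEFF) {
    r.reset(); // first character was not a BOM, so put it back
}
// read from r as usual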

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.
tmp = tmp.replace("\uFEFF", "");
Also see this Guava bug report
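Applied to the read loop from the question, a minimal sketch could look like this (the explicit UTF-8 reader and the StringBuilder are additions for illustration, not part of the original code):
StringBuilder content = new StringBuilder();
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
    String tmp;
    while ((tmp = br.readLine()) != null) {
        tmp = tmp.replace("\uFEFF", ""); // strip a BOM that survived decoding
        content.append(tmp).append(System.lineSeparator());
    }
}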

Use the Apache Commons IO library.
Class: org.apache.commons.io.input.BOMInputStream
Example usage:
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader
} finally {
    inputStream.close();
}

Here's how I use the Apache BOMInputStream; it uses a try-with-resources block. The false argument tells the stream to exclude (skip) any of the listed BOMs rather than pass them through (we use "BOM-less" text files for safety reasons, haha):
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
            false,
            ByteOrderMark.UTF_8,
            ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
            ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE))))
{
    // use br here
}
catch (Exception e)
{
    // handle the exception
}

Consider UnicodeReader from Google which does all this work for you.
Charset utf8 = StandardCharsets.UTF_8; // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8.name())) {
....
}
Maven Dependency:
<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>
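For line-by-line reading, a hedged sketch that simply wraps the same UnicodeReader in a BufferedReader (the constructor usage matches the snippet above; the loop is illustrative):
Charset utf8 = StandardCharsets.UTF_8; // fallback if no BOM is present
try (BufferedReader br = new BufferedReader(
        new UnicodeReader(new FileInputStream(file), utf8.name()))) {
    String line;
    while ((line = br.readLine()) != null) {
        // process line
    }
}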

Use Apache Commons IO.
For example, let's take a look at my code (used for reading a text file with both Latin and Cyrillic characters) below:
String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));
BOMInputStream bomInputStream = new BOMInputStream(inputStream);
ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {
char theChar = (char) data;
data = reader.read();
ari.add(Character.toString(theChar));
}
reader.close();
As a result, we have an ArrayList named "ari" with all the characters from the file "1.txt" except the BOM.

If somebody wants to do it with the standard library only, this would be a way:
public static String cutBOM(String value) {
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    String bom = String.format("%x", new BigInteger(1, value.substring(0, 3).getBytes()));
    if (bom.equals("efbbbf")) {
        // UTF-8
        return value.substring(3);
    } else if (bom.substring(0, 4).equals("feff") || bom.substring(0, 4).equals("fffe")) {
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    } else {
        return value;
    }
}

It's mentioned here that this is usually a problem with files on Windows.
One possible solution would be running the file through a tool like dos2unix first.

The easiest way I found to bypass the BOM:
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
while ((currentLine = br.readLine()) != null) {
    // remove the UTF-8 BOM, if present
    currentLine = currentLine.replace("\uFEFF", "");
}

Related

Problems with reading a text file in Java

I use a FileWriter to save a CSV file (text file).
Everything looks fine when I read it with a text editor like Sublime Text.
But when I read it with Java I get some garbled characters, no matter how I try to read it.
An example of the reading:
StringBuilder sb = new StringBuilder();
try {
    String ligne;
    BufferedReader fichier1 = new BufferedReader(new FileReader(nom_office));
    while ((ligne = fichier1.readLine()) != null) {
        sb.append(ligne);
    }
    fichier1.close();
} catch (Exception e) {
    e.printStackTrace();
}
//String totalité = new String(encoded, encoding);
String totalité = sb.toString();
The result of the following statements is:
System.out.println("##############");
System.out.println(totalité);
PK ! T��ep [Content_Types].xml �(�
�TKn�0�W�"o���EUU�,[$�L/�i"m�k�IO)�
...and so on.
Why isn't it the same result as in Sublime Text?
Your FileReader uses the default system encoding, which probably isn't UTF-8, and UTF-8 is what you need here. Try this:
BufferedReader br = new BufferedReader(new InputStreamReader(
new FileInputStream(file), "UTF-8"));
Also, your IDE's console needs to be configured to use UTF-8; that's really important!

Getting the same content while reading a file in Java

I have created a FileReader object.
public String getFileContent() {
    StringBuilder filecontent = new StringBuilder();
    FileReader fileReader = new FileReader("D:/myfile");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        filecontent.append(bufferedReader.readLine());
    }
    return filecontent.toString();
}
The problem I face is that the function always returns the same string even if the file content is changed.
Can anyone help?
Append the variable line, since that's the variable you store the result in. The problem is that you're calling readLine() twice. Like this:
while ((line = bufferedReader.readLine()) != null) {
    filecontent.append(line);
}
You received an answer already, but it seems that what you're doing could be achieved way better in Java 8:
Path filePath = Paths.get("D:/myfile");
byte[] fileBytes = Files.readAllBytes(filePath);
Charset fileEncoding = StandardCharsets.UTF_8;
String fileContents = new String(fileBytes, fileEncoding);
If you're not on Java 8, you can use Apache Commons IO's FileUtils to get the String:
File file = new File("D:/myfile");
String fileContents = FileUtils.readFileToString(file);
My advice here is leverage the JDK and existing libraries as much as possible. It leads to cleaner code that is easier to maintain.

Check line for unprintable characters while reading text file

My program must read text files - line by line.
The files are in UTF-8.
I am not sure that the files are correct - they may contain unprintable characters.
Is it possible to check for this without going down to the byte level?
Thanks.
Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (which is available in vaguely modern Java versions):
String line;
try (
    InputStream fis = new FileInputStream("the_file_name");
    InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
    BufferedReader br = new BufferedReader(isr);
) {
    while ((line = br.readLine()) != null) {
        // Deal with the line
    }
}
While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.
Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for (String line : lines) {
    System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...
If you want to check a string has unprintable characters you can use a regular expression
[^\p{Print}]
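A hedged sketch of applying that pattern per line with java.util.regex.Pattern (the helper name is illustrative; note that Java's POSIX \p{Print} class is ASCII-oriented unless UNICODE_CHARACTER_CLASS is set):
// Matches any character outside the POSIX "printable" class
private static final Pattern NON_PRINTABLE = Pattern.compile("[^\\p{Print}]");

static boolean containsUnprintable(String line) {
    return NON_PRINTABLE.matcher(line).find();
}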
How about the following:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
    // reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html
I can find the following ways to do it:
private static final String fileName = "C:/Input.txt";

public static void main(String[] args) throws IOException {
    // Option 1: stream the lines
    Stream<String> lines = Files.lines(Paths.get(fileName));
    lines.toArray(String[]::new);

    // Option 2: read all lines into a list
    List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
    readAllLines.forEach(s -> System.out.println(s));

    // Option 3: use a Scanner
    File file = new File(fileName);
    Scanner scanner = new Scanner(file);
    while (scanner.hasNext()) {
        System.out.println(scanner.next());
    }
}
The answer by @T.J. Crowder is for Java 6; in Java 7 the valid answer is the one by @McIntosh, though passing the charset by name (Charset.forName("UTF-8")) is discouraged in favour of StandardCharsets.UTF_8:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
        StandardCharsets.UTF_8);
for (String line : lines) { /* DO */ }
This reminds a lot of the Guava way posted by Skeet above, and of course the same caveats apply. That is, for big files (Java 7):
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}
If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.
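As one possible (and deliberately narrow) definition of "printable", a hedged per-character check might look like this; adjust it to whatever you consider acceptable:
static boolean isPrintable(char c) {
    // treat ISO control characters and undefined code points as unprintable
    return !Character.isISOControl(c) && Character.isDefined(c);
}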

Reading hebrew from text file with Java

I'm having trouble reading a UTF-8 encoded text file in Hebrew.
I read all Hebrew characters successfully, except for two letters: 'מ' and 'א'.
Here is how I read it:
FileInputStream fstream = new FileInputStream(SCHOOLS_LIST_PATH);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
// Read File Line By Line
while ((strLine = br.readLine()) != null) {
    if (strLine.contains("zevel")) {
        continue;
    }
    schools.add(getSchoolFromLine(strLine));
}
Any idea?
Thanks,
Tomer
You're using InputStreamReader without specifying the encoding, so it's using the default for your platform - which may well not be UTF-8.
Try:
new InputStreamReader(in, "UTF-8")
Note that it's not obvious why you're using DataInputStream here... just create an InputStreamReader around the FileInputStream.
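Put together, a hedged sketch of the corrected reading loop (SCHOOLS_LIST_PATH, schools and getSchoolFromLine come from the question; try-with-resources is an addition):
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(SCHOOLS_LIST_PATH), "UTF-8"))) {
    String strLine;
    while ((strLine = br.readLine()) != null) {
        if (strLine.contains("zevel")) {
            continue;
        }
        schools.add(getSchoolFromLine(strLine));
    }
}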

Character corruption going from BufferedReader to BufferedWriter in java

In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.
I encounter a known problem when the text contains a left-facing quotation mark. Text such as
mutations to particular “hotspot” regions
becomes
mutations to particular “hotspot�? regions
I have isolated the problem by writing a simple text copy method:
public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            // Parsing would happen here
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}
Can anybody offer some advice as to how to correct this problem?
★My solution
InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        output.write(r);
    }
    else
    {
        // escape anything outside plain ASCII as a numeric character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();
The file read is not in the same encoding (probably UTF-8) as the file written (probably ISO-8859-1).
Try the following to generate a file with UTF-8 encoding:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream
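Assuming the input file really is UTF-8, a minimal sketch of the copy loop with explicit encodings on both sides might look like this (try-with-resources and newLine() are additions for illustration, not part of the original method):
try (BufferedReader input = new BufferedReader(
         new InputStreamReader(new FileInputStream(myFile), "UTF-8"));
     BufferedWriter output = new BufferedWriter(
         new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"))) {
    String line;
    while ((line = input.readLine()) != null) {
        output.write(line);
        output.newLine();
    }
}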
In addition to what Thierry-Dimitri Roy wrote, if you know the encoding you have to do a bit of extra work, because a plain FileReader won't let you specify it. From the docs:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
The Javadoc for FileReader says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:
FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);
