Junk characters while reading text file in Java

I have a Java program which calls a Windows bat file that does some processing and generates an output file:
Process p = Runtime.getRuntime().exec("cmd /c "+filename);
Now I read the file with the following program (filexists() is a function which checks whether the file exists or not). The output file contains only a single line:
if (filexists("output.txt")) {
    String FileLine;
    FileInputStream fstream = new FileInputStream("output.txt");
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    FileLine = br.readLine();
    br.close(); // closing the reader also closes the underlying stream
}
The variable FileLine contains 3 junk characters at the start. I also checked a few other files in the program, and no file has this issue; the only difference is that this file is created via the Runtime call.
9087.
As you can see, three junk characters appear at the start of the output file. When the file is opened with Notepad++, I am not able to see those junk characters.
Please suggest a fix.

This is happening because you have not specified the file encoding when reading the file. The encoding goes on the InputStreamReader (a FileInputStream does not take one). Assuming your file is UTF-8 encoded, you need to do something like this:
new InputStreamReader(new FileInputStream("output.txt"), "UTF-8")
Change the encoding as per the encoding of your file. Note that even with the correct encoding, Java will not strip a UTF-8 BOM for you; see the next answer.

That looks like the byte order mark for UTF-8 encoding. See https://en.wikipedia.org/wiki/Byte_order_mark
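In practice that means the first thing the reader hands you is the invisible character \uFEFF. A minimal sketch for skipping it when present (assuming the file is UTF-8, as the BOM suggests):
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("output.txt"), "UTF-8"));
br.mark(1);                  // remember the start of the stream
if (br.read() != '\uFEFF') { // first char is not a BOM,
    br.reset();              // so rewind and keep it
}
String FileLine = br.readLine(); // the line, without the BOM
br.close();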

Maybe it's an issue with the file encoding, though I am not sure. Can you please try the following piece of code and see if it works for you:
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("output.txt"), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}

Related

How to read an ANSI-coded file with German (European) characters and display it correctly

I have a standard text file written with the Windows editor (or Word) and saved in ANSI format on the Android device. If I open and read this file and display it on my Android device, all characters are displayed correctly except the German umlauts äÄöÖüÜß. Instead of these characters, a white question mark inside a black diamond is shown. (I display them in a homescreen widget using remoteViews.setTextViewText(...).)
I googled for hours and found lots of hints on using UTF-8 encoding etc. But when I save the file in UTF-8, or in any format other than ANSI, I get an exception and can't read the file at all. An Android editor shows that the encoding of the file is correct both in ANSI and in UTF-8.
My program is too long to copy here, so I extracted the hopefully relevant part and put it below. Please help!
public class Test {
    static void readFile() {
        File file = new File(Environment.getExternalStorageDirectory().getAbsolutePath(), "birthday.txt");
        if (file.exists()) {
            try {
                FileReader fileReader = new FileReader(file);
                BufferedReader br = new BufferedReader(fileReader);
                //BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(directory+"birthday.txt"), "windows-1252"));
                String lineOfText;
                while ((lineOfText = br.readLine()) != null) {
                    //Output lineOfText via remoteViews.setTextViewText(WidgetOutput.getRef(linecounter).getIdWhat(), lineOfText);
                }
                br.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
The commented-out BufferedReader line looks like it should work. However, windows-1252 is the canonical name for the classes in java.nio. For the classes in java.io (like InputStreamReader), the canonical name is Cp1252. See the Supported Encodings documentation.
You may also wish to try ISO-8859-1 (nio) or ISO8859_1 (io).
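Applied to the question's code, a minimal sketch (assuming the file really is in the Windows-1252 "ANSI" encoding):
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "Cp1252"));
String lineOfText;
while ((lineOfText = br.readLine()) != null) {
    // the umlauts äÄöÖüÜß should now decode correctly
}
br.close();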

Java readline() skipping second line

I have a questions file that I'd like to read. While reading it, I want the code to tell the questions apart from the answers and print them; before each question there is a line of "#" characters. The code keeps skipping question one for some reason. What am I missing here?
Here is the code:
try {
    // Open the file that is the first
    // command line parameter
    FileInputStream fstream = new FileInputStream(path);
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String strLine;
    strLine = br.readLine();
    System.out.println(strLine);
    // Read File Line By Line
    while (strLine != null) {
        strLine = strLine.trim();
        if ((strLine.length() != 0) && (strLine.charAt(0) == '#' && strLine.charAt(1) == '#')) {
            strLine = br.readLine();
            System.out.println(strLine);
            //questions[q] = strLine;
        }
        strLine = br.readLine();
    }
    // Close the input stream
    fstream.close();
    // System.out.println(questions[0]);
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
I suspect that the file you read is in UTF-8 with a BOM.
The BOM is a code before the first character that helps to identify the proper encoding of text files.
The issue with the BOM is that it is invisible and disturbs the reading. A text file with a BOM is arguably no longer a plain text file. In particular, if you read the first line, the first character is no longer a #, but something different, because it is the character BOM+#.
Try to load the file with an explicit encoding specified. Java can handle the BOM in newer releases; I don't remember which exactly.
BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
Otherwise, take a decent text editor, like Notepad++, and change the encoding to UTF-8 without BOM, or to ANSI encoding (yuck).
Notice that whether or not you enter the if statement inside the while loop, you execute strLine = br.readLine(); at the end of the body, which overwrites the line you read when you initialized strLine.
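A sketch of one way to restructure the loop so that each line is read exactly once per iteration (keeping the question's "##" marker convention):
String strLine;
while ((strLine = br.readLine()) != null) {
    strLine = strLine.trim();
    if (strLine.startsWith("##")) {
        // the line following a marker line is a question
        String question = br.readLine();
        System.out.println(question);
    }
}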

Java: detect if a file is UTF-8 or ANSI

In Java, is there a way to detect if a file is ANSI or UTF-8? The problem I am having is that if someone creates a CSV file in Excel, it's UTF-8. If they create it using Notepad, it's ANSI.
I am wondering if I can detect the type of file and then handle it accordingly.
Thanks.
You could try something like this. It relies on Excel including a Byte Order Mark (BOM), which a quick search suggests it does (although I can't verify it), and on the fact that Java treats the BOM as the particular "character" \uFEFF.
FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line = br.readLine();
if (line.startsWith("\uFEFF")) {
    // it's UTF-8, throw away the BOM character and continue
    line = line.substring(1);
} else {
    // it's not UTF-8, reopen
    br.close(); // also closes fis
    fis = new FileInputStream(file); // reopen from the start
    br = new BufferedReader(new InputStreamReader(fis, "Cp1252"));
    line = br.readLine();
}
// now line contains the first line, and br.readLine() will get the next
Some more information on the UTF-8 Byte Order Mark and detection of encoding at http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

Check line for unprintable characters while reading text file

My program must read text files, line by line.
The files are in UTF-8.
I am not sure that the files are correct - they can contain unprintable characters.
Is it possible to check for them without going down to the byte level?
Thanks.
Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (which is available in vaguely modern Java versions):
String line;
try (
    InputStream fis = new FileInputStream("the_file_name");
    InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
    BufferedReader br = new BufferedReader(isr);
) {
    while ((line = br.readLine()) != null) {
        // Deal with the line
    }
}
While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.
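If you do need to stay line by line but want to keep Guava, there is also Files.readLines with a LineProcessor callback (com.google.common.io.LineProcessor), which hands you one line at a time instead of the whole file. A sketch; the collect-everything logic is a placeholder, not part of the original answer:
List<String> cleanLines = Files.readLines(new File("the_file_name"), Charsets.UTF_8,
        new LineProcessor<List<String>>() {
            private final List<String> result = new ArrayList<String>();

            @Override
            public boolean processLine(String line) throws IOException {
                result.add(line); // inspect or filter the line here
                return true;      // true = keep reading
            }

            @Override
            public List<String> getResult() {
                return result;
            }
        });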
Just found out that with Java NIO (java.nio.file.*) you can easily write:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for (String line : lines) {
    System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...
If you want to check whether a string has unprintable characters, you can use a regular expression:
[^\p{Print}]
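For example, a sketch applying it to each line read as above (the variable line comes from the reading loop; java.util.regex.Pattern is assumed imported):
Pattern unprintable = Pattern.compile("[^\\p{Print}]");
if (unprintable.matcher(line).find()) {
    // line contains at least one unprintable character
}
Note that \p{Print} matches ASCII printables only unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS (Java 7+), so the default form would flag legitimate non-ASCII text.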
How about the below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// readLine() returns null when there are no more lines
while ((line = br.readLine()) != null) {
    // reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html
I can find the following ways to do it.
private static final String fileName = "C:/Input.txt";

public static void main(String[] args) throws IOException {
    // 1. Stream the lines lazily (Java 8)
    Stream<String> lines = Files.lines(Paths.get(fileName));
    lines.toArray(String[]::new);

    // 2. Read all lines into a list in one go
    List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
    readAllLines.forEach(s -> System.out.println(s));

    // 3. Scan the file token by token
    File file = new File(fileName);
    Scanner scanner = new Scanner(file);
    while (scanner.hasNext()) {
        System.out.println(scanner.next());
    }
}
The answer by @T.J. Crowder is Java 6; in Java 7 the valid answer is the one by @McIntosh, though the use of Charset.forName("UTF-8") is discouraged in favor of StandardCharsets.UTF_8:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
        StandardCharsets.UTF_8);
for (String line : lines) { /* DO */ }
This reminds a lot of the Guava way posted by Skeet above, and of course the same caveats apply. That is, for big files (Java 7):
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}
If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. It is up to you to check every char of the file and see whether you consider it printable or not.
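For that per-character route, a common sketch (one possible definition of "printable"; treating control characters and the SPECIALS block, which contains e.g. the replacement character U+FFFD, as unprintable is my assumption, not part of the answer):
static boolean isPrintable(char c) {
    Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
    return !Character.isISOControl(c)                 // rules out control characters
            && block != null                          // rules out unassigned code points
            && block != Character.UnicodeBlock.SPECIALS; // rules out e.g. U+FFFD
}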

Reading Hebrew from a text file with Java

I'm having trouble reading a UTF-8 encoded text file in Hebrew.
I read all Hebrew characters successfully, except for two letters: 'מ' and 'א'.
Here is how I read it:
FileInputStream fstream = new FileInputStream(SCHOOLS_LIST_PATH);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
// Read File Line By Line
while ((strLine = br.readLine()) != null) {
    if (strLine.contains("zevel")) {
        continue;
    }
    schools.add(getSchoolFromLine(strLine));
}
Any idea?
Thanks,
Tomer
You're using InputStreamReader without specifying the encoding, so it's using the default for your platform - which may well not be UTF-8.
Try:
new InputStreamReader(in, "UTF-8")
Note that it's not obvious why you're using DataInputStream here... just create an InputStreamReader around the FileInputStream.
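Putting that together, a minimal sketch (reusing SCHOOLS_LIST_PATH, schools and getSchoolFromLine from the question):
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(SCHOOLS_LIST_PATH), "UTF-8"));
String strLine;
while ((strLine = br.readLine()) != null) {
    if (strLine.contains("zevel")) {
        continue;
    }
    schools.add(getSchoolFromLine(strLine));
}
br.close();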
