Symbols '' are showing when reading from text file - Java

Using the same project and text file as in Java.NullPointerException null (again), the program is outputting the data, but with  in front of it. To put you in the picture: this program is a telephone directory. Ignoring the first "code" block, look at the second "code" block on that link; that is the text file with the entries. The program outputs them as it should, but it shows  at the beginning of the entries read from the text file ONLY.
Any help as to how to remove it? I am using a BufferedReader with a FileReader in it.
Encoding of Text File: UTF-8
Using Java 7
Windows 7

Does the read-in text file use UTF-8 with a BOM? It looks like the BOM characters: ""
http://en.wikipedia.org/wiki/Byte_order_mark
Are you running Windows? Notepad++ should be able to convert the file. If you are on Linux using vi(m), you can use ":set nobomb".

I suppose your input file is encoded in UTF-8 with BOM.
You can either save your input file without a BOM, or handle this in Java.
You might think to just use an InputStreamReader with the appropriate encoding. Sadly, that alone doesn't help: Java assumes that a UTF-8 encoded file has no BOM, so you have to handle that case manually.
A quick hack would be to check if the first three bytes of your file are 0xEF, 0xBB, 0xBF, and if they are, ignore them.
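For example, a minimal sketch of that hack (the file name entries.txt is made up; a BufferedInputStream is used because it supports mark/reset):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class SkipBom {
    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream("entries.txt"));
        in.mark(3);
        byte[] bom = new byte[3];
        int n = in.read(bom);
        // rewind unless the first three bytes are the UTF-8 BOM EF BB BF
        if (n != 3 || bom[0] != (byte) 0xEF || bom[1] != (byte) 0xBB || bom[2] != (byte) 0xBF) {
            in.reset();
        }
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}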
For a more sophisticated example, have a look at the UnicodeBOMInputStream class in this answer.

Related

Fix mixed encoding in string

I have a file which contains the following string, as viewed in Notepad++:
Adοbe Dοcument Clοud
In the hex view, each ο shows as the byte pair CE BF.
If I read the file with Java, the string looks like this:
AdÎ¿be DÎ¿cument ClÎ¿ud
How can I get the same result in Java as with Notepad++?
Your file is encoded as UTF-8, and the CE BF byte pair is the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON' (U+03BF)).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
Adοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
You must set the encoding in the FileReader, like this (note that the FileReader(String, Charset) constructor only exists since Java 11):
new FileReader(fileName, StandardCharsets.UTF_8)
You must read the file in Java using the same encoding the file was written with.
If you are working with non-standard encodings, even trying to query the encoding with something like:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
can return the wrong value: getEncoding() only reports the charset the reader was opened with (here, the platform default), not the file's actual encoding.
There's a little library which handles detection of encodings a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some holes in detecting the proper encoding, but I've used it.
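For example, a minimal detection sketch with juniversalchardet (the file name mystery.txt is made up; the UniversalDetector calls follow the project's documented usage):

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectEncoding {
    public static void main(String[] args) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        try (FileInputStream fis = new FileInputStream("mystery.txt")) {
            byte[] buf = new byte[4096];
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread); // feed bytes until the detector is confident
            }
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset(); // null if detection failed
        System.out.println(encoding != null ? encoding : "encoding not detected");
    }
}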
And while using it I found out that many of these non-standard encodings can be read as UTF-16, like:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported the UTF-16 encoding for a long time; it's defined in the standard API as StandardCharsets.UTF_16. That character set covers lots of language-specific characters and emoji.

Unable to read any of file that contains specific character(s)

TL;DR
Why does reading in a file that contains – find no data when the file was saved with Notepad?
Problem:
Up to this point, I have been using just plain ol' Notepad (Version 6.1) to read/write text for testing/answering questions here.
Simple bit of code to read in the text files contents, and print them to the console:
Scanner sc = new Scanner(new File("myfile.txt"));
while (sc.hasNextLine()) {
    String text = sc.nextLine();
    System.out.println(text);
}
All is well, the lines print as expected.
Then, if I put in this exact character: –, anywhere in the text file, it will not read any of the file, and print nothing to the console.
I can of course use Notepad++ or other (better) text editors, and there is no issue, the text, including the dash character, will print as expected.
I can also specify UTF-8, using Notepad, and it will work fine:
File fileDir = new File("myfile.txt");
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(fileDir), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
On my original Notepad file, if I copy and paste the text (including the –) into Notepad++ and compare the two files with WinMerge, it tells me that the two dashes differ: the Notepad file stores different bytes for the – character than the Notepad++ file does.
Question:
Why, when this – is used in a text file saved with Notepad, does the code read nothing, with hasNextLine() reporting false? Should it not at least read the input up to the line that contains this specific character?
Steps to reproduce:
On Windows 7, right-click and create new Text Document.
Put any text in the file (without any special characters, as such)
Put in this character anywhere in the file: –
Run the first block of code above
Output: BUILD SUCCESSFUL (total time: 1 second), i.e. it doesn't print any of the text.
PS:
I know I asked a similar (well, it ended up being the same) question yesterday, but unfortunately, it seems I may not have explained myself well, or some of the viewers didn't fully read the question. Either way, I think I've explained it better here.
The issue is a difference of encoding: you have to read the file in the same encoding it was written in.
Your system Notepad probably uses the Windows-1252 (Cp1252) encoding. That encoding assigns printable characters to the byte range 128 - 159, and the dash lies within this range; in the otherwise equivalent ISO 8859-1, that range holds no printable characters, so those bytes are only meaningful in Cp1252.
Eclipse, when reading the Notepad file, assumes it has the ISO-8859-1 encoding (as the two are otherwise equivalent). But this character is not present in ISO-8859-1, hence the problem. If you read the file from Java specifying Cp1252, you should get your output.
This is also the reason why your code with UTF-8 works correctly, when the file in Notepad is saved as UTF-8.
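For example, a minimal sketch of reading such a Notepad ("ANSI") file with Cp1252 specified explicitly (the file name myfile.txt is taken from the question; everything else is illustrative):

import java.io.*;

public class ReadCp1252 {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("myfile.txt"), "Cp1252"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the dash now decodes instead of breaking the read
            }
        }
    }
}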
A buffered reader reads more than the current line, possibly up to the problematic bytes. CharsetDecoder.onMalformedInput then comes into play, and something more restrictive happens there than I would normally have expected.
Do you use a special JDK? Do you sweep exceptions under the carpet, e.g. with a lambda wrapping the above code? (Add a catch for Throwable.)
Is your platform encoding -Dfile.encoding=ISO-8859-1 instead of Cp1252?
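One way to check whether exceptions are being swept under the carpet: Scanner never throws the underlying IOException, but keeps it retrievable via ioException(). A small diagnostic sketch (assuming the same myfile.txt):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ScannerDiagnostic {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner sc = new Scanner(new File("myfile.txt"));
        while (sc.hasNextLine()) {
            System.out.println(sc.nextLine());
        }
        // if decoding failed, the swallowed exception shows up here instead of being thrown
        System.out.println("Underlying exception: " + sc.ioException());
        sc.close();
    }
}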

Invalid byte 2 of 2-byte UTF-8 sequence : How to find the character

I have a big text file on my Windows machine in UTF-8 encoding. Somehow one or more of the characters in this file are invalid UTF-8, giving the error "Invalid byte 2 of 2-byte UTF-8 sequence".
I am using Windows 7, and I want to find the invalid character. I guess there is a UNIX command for this, but is there any tool, utility, or regex (something which doesn't require writing a program) that can be used on Windows?
I can use Notepad++ or PSPad or a similar text editor, or if there is a Windows command, I can create a batch file. Please suggest.
Read the file byte by byte with a FileInputStream (a FileReader would already be decoding, which is exactly the step that fails). If the current byte looks like the first of a 2-byte UTF-8 sequence, read the next byte and check that it is a valid continuation byte. In the loop, count the bytes read so you can report the position and byte values. Note that new String(array, "UTF-8") is no help here, because it silently replaces malformed input instead of throwing; a CharsetDecoder configured to report errors will flag it.
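If you do end up writing a few lines of Java after all, here is a minimal sketch of that idea, letting a CharsetDecoder set to report errors do the validity checking (the file name big.txt is made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;
import java.nio.file.*;

public class FindBadUtf8 {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("big.txt"));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        CoderResult result = decoder.decode(in, out, true);
        if (result.isError()) {
            // in.position() is left at the first offending byte
            System.out.printf("Invalid sequence at byte offset %d: 0x%02X%n",
                    in.position(), bytes[in.position()] & 0xFF);
        } else {
            System.out.println("File is valid UTF-8");
        }
    }
}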
It's possible that your UTF-8 file has a Byte Order Mark at the start, which is often not recognised by Java Readers.
Open the file in Notepad++. If the file has a BOM, Notepad++ will report "UTF-8" rather than "UTF-8 w/o BOM".
You can either convert the file to UTF-8 without a BOM, or use something like https://stackoverflow.com/a/2905038/1554386 to strip the BOM.

How to clean a csv file from weird characters (e.g. SUB)?

I am uploading csv files to Teradata using JDBC. Everything used to be fine, until recently I came across a csv file that had some weird characters, and my code failed to upload it.
I opened the csv file in Notepad++, where the character displays as SUB. (Excel renders it differently; the screenshots are not preserved here.)
When I manually deleted those characters, everything went back to normal. I am curious: is there any way I could use Java to clean a csv file of all kinds of invalid characters?
The SUB character is ASCII 26 (hex 0x1A). Back when DEC-10s ruled the earth, this was typed as Ctrl-Z, and it was used to indicate the end of a file.
If it is indeed at the end of the file, and you read the file in using a Java InputStream (have a look at Read/convert an InputStream to a String), you can simply strip off that terminal Ctrl-Z.
It would be quite unusual (and a problem) to have a SUB inside the CSV data, unless it were representing a binary object.
You can try:
myString.replaceAll("\\p{C}", "?");
If you want to remove it:
myString.replaceAll("\\p{C}", "");
More here:
How can I replace non-printable Unicode characters in Java?
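Applied to a whole file, that could look like this minimal sketch (the file names input.csv and output.csv are made up, and the file is assumed to be UTF-8):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

public class CleanCsv {
    public static void main(String[] args) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            cleaned.add(line.replaceAll("\\p{C}", "")); // strip control/invisible characters
        }
        Files.write(Paths.get("output.csv"), cleaned, StandardCharsets.UTF_8);
    }
}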

Delete non utf8 characters in Eclipse using regex

Is there a way to do that in Eclipse? I have a lot of non-UTF-8 characters, like sch�ma or propri�t� (it's French :)). For now, I am deleting those characters by hand. How can I remove them?
Those characters are in the UTF-8 character set.
Either the text is encoded incorrectly or you have your file encoding set incorrectly in Eclipse.
Try right-clicking the file -> Properties, then check that the text file encoding is set to UTF-8; if it's not, select Other and change it to UTF-8.
I would write a little program that reads a file, removes all chars > 127, and writes the result back to the file.
[I would pass the file names as command line arguments]
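A minimal sketch of that little program (it reads the bytes as ISO-8859-1 so every byte maps to exactly one char, then drops everything above 127):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class StripNonAscii {
    public static void main(String[] args) throws IOException {
        for (String name : args) { // file names as command line arguments
            Path path = Paths.get(name);
            String text = new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1);
            String ascii = text.replaceAll("[^\\x00-\\x7F]", ""); // remove all chars > 127
            Files.write(path, ascii.getBytes(StandardCharsets.ISO_8859_1));
        }
    }
}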
