Delete non utf8 characters in Eclipse using regex

Is it possible to do this in Eclipse? I have a lot of non-UTF-8 characters, as in sch�ma or propri�t� (it's French :)). For now, I am deleting those characters by hand. How can I remove them?

Those characters are in the UTF-8 character set.
Either the text is encoded incorrectly or the file encoding is set incorrectly in Eclipse.
Try right-clicking the file -> Properties. Then check that the text file encoding is set to UTF-8; if it's not, select Other and change it to UTF-8.
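To illustrate why an encoding mismatch produces the � symbol, here is a minimal sketch. It assumes the file was saved as ISO-8859-1 and is being decoded as UTF-8, which is only one possible cause; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Encodes the text as ISO-8859-1, then (wrongly) decodes those bytes as UTF-8.
    // Bytes above 0x7F that do not form valid UTF-8 come back as U+FFFD, the replacement character.
    static String misdecode(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String shown = misdecode("sch\u00E9ma"); // "schéma"
        // Reports whether the accented letter was lost to the replacement character.
        System.out.println(shown.indexOf('\uFFFD') >= 0);
    }
}
```

If this is what happened, the accented letters are still intact in the file and only the displayed encoding is wrong, so deleting characters would lose data.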

I would write a little program that reads a file, removes every character with a code above 127 (that is, everything outside ASCII), and writes the result back to the file.
[I would pass the file names as command-line arguments.]
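A rough sketch of such a program might look like the following. The class and method names are invented for the example, and reading the file as ISO-8859-1 is an assumption made only because that charset maps every byte to exactly one char, so decoding can never fail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class StripNonAscii {
    // Keeps only characters in the ASCII range 0..127 and drops everything else.
    static String stripNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c <= 127) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is a file that gets rewritten in place.
        for (String name : args) {
            Path path = Paths.get(name);
            // Assumption: ISO-8859-1 is used so that every byte decodes to one char.
            String text = new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1);
            Files.write(path, stripNonAscii(text).getBytes(StandardCharsets.US_ASCII));
        }
    }
}
```

Note that this overwrites the files and throws the accented letters away entirely ("schéma" becomes "schma"), so it is worth trying on a copy first.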

Related

Get File encoding ASCII or EBCDIC with java

I have a file with the extension .b3c, and I want to know whether it is encoded in ASCII or EBCDIC, using Java. How can I achieve that?
Help is needed.
Thanks

Assuming the text file contains multiple lines of text, check for the newline character.
In ASCII, lines end with an LF / \n / 0x0a. Sure, on Windows there's also a CR, but we can ignore that part.
In EBCDIC, lines end with an NL / \025 / 0x15.
ASCII text files will not contain a 0x15 / NAK, and EBCDIC text files will not contain a 0x0a / SMM, so look for both:
If only one of them is found, you know the character set.
If both are found, the file is a binary file, and not a text file, so reject the file.
If neither is found, the file could have just one line of text, in which case further analysis might be needed. Hopefully that won't be the case here, so the simple test described so far should be enough.
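The check described above could be sketched like this. The class, enum, and method names are made up for the example, and the whole file is read into memory for simplicity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class AsciiOrEbcdic {
    enum Kind { ASCII, EBCDIC, BINARY, UNKNOWN }

    // Looks for the ASCII line feed (0x0A) and the EBCDIC new line (0x15).
    static Kind detect(byte[] data) {
        boolean sawLf = false; // 0x0A, LF in ASCII
        boolean sawNl = false; // 0x15, NL in EBCDIC
        for (byte b : data) {
            if (b == 0x0A) {
                sawLf = true;
            } else if (b == 0x15) {
                sawNl = true;
            }
        }
        if (sawLf && sawNl) return Kind.BINARY;  // both found: not a text file
        if (sawLf) return Kind.ASCII;
        if (sawNl) return Kind.EBCDIC;
        return Kind.UNKNOWN;                     // neither found: needs further analysis
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(detect(data));
    }
}
```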

Encoding of special properties in Eclipse

I have an application that uses Swedish in some Java and JSP pages.
The Swedish words are defined in the application.properties file, and those names are used in the application.
Application Screen:
The words defined in the properties file differ from the words I see in the JSP page.
button.search=Sök
I tried all content types in the settings. I still get this error, and because of the differing words my application does not work in Eclipse.
Could anyone please tell me what I need to change in Eclipse to make this application work?

From your screenshot it looks like your properties file is encoded in UTF-8,
thus ö is represented by 2 bytes.
But properties files must be encoded in ISO-8859-1 (optionally with \uXXXX escapes), not in UTF-8 or anything else.
Quoted from the Javadoc of class java.util.Properties:
The load(Reader) / store(Writer, String) methods load and store
properties from and to a character based stream in a simple
line-oriented format specified below. The load(InputStream) /
store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output
stream is encoded in ISO 8859-1 character encoding.
Characters that cannot be directly represented in this encoding
can be written using Unicode escapes [...]
That means, you should store your application.properties file
in ISO-8859-1 encoding. Or better yet, you should write
button.search=S\u00F6k
instead of
button.search=Sök
Using the \uXXXX escapes for all non-ASCII characters has the advantage
that you can store the file in UTF-8 or any ISO-8859-x, and you get the same
file content anyway.
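A small sketch of the difference, using Properties.load(InputStream) on in-memory bytes instead of a real file. The class and helper names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEncodingDemo {
    // Loads the given raw file bytes through load(InputStream), which decodes them as ISO 8859-1.
    static String loadSearchLabel(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("button.search");
    }

    // File content uses the \u00F6 escape, so it is pure ASCII on disk.
    static String escaped() {
        return loadSearchLabel("button.search=S\\u00F6k\n".getBytes(StandardCharsets.US_ASCII));
    }

    // File content holds a raw "ö" saved as UTF-8 (two bytes), read back as ISO 8859-1.
    static String rawUtf8() {
        return loadSearchLabel("button.search=S\u00F6k\n".getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(escaped().equals("S\u00F6k")); // escape decodes to the intended word
        System.out.println(rawUtf8().length());           // one char longer than the intended word
    }
}
```

The escaped variant comes back as the intended word, while the raw UTF-8 variant comes back with two wrong characters in place of the ö, which matches the kind of garbling described in the question.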

Just use Eclipse's Properties Editor. It saves a .properties file in the only allowed character encoding (ISO 8859-1) and \u escapes characters that are not in that character set.
It does have a hover display to show decoded codepoints but a view showing a table of name-value pairs would be nicer.

Changing the workspace encoding may help. Go to Window -> Preferences -> General -> Workspace and change the text file encoding to UTF-8.

Change your Eclipse default content type encoding:
Window > Preferences > General > Content Types, then set UTF-8 as the default encoding for all content types.

Invalid byte 2 of 2-byte UTF-8 sequence : How to find the character

I have a big text file on my Windows machine in UTF-8 encoding. Somehow one or more of the characters in this file are invalid UTF-8, giving the error "Invalid byte 2 of 2-byte UTF-8 sequence".
I am using Windows 7, and I want to find the invalid character. I guess there is a UNIX command for this, but is there any tool, utility, or regex (something that doesn't require writing a program) that can be used on Windows?
I can use Notepad++, PSPad, or a similar text editor, or if there is a Windows command, I can create a batch file. Please suggest.

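Create a FileInputStream to read the file byte by byte (a FileReader decodes to characters, so it cannot be used for this). If the current byte looks like the first of a 2-byte UTF-8 sequence, read the next byte and check that the pair decodes as UTF-8. Note that new String(array, "UTF-8") does not throw on malformed input; it substitutes U+FFFD, so either test for that character or use a CharsetDecoder configured to report errors. In the loop, count the bytes read so that you can report the position and byte values.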
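One way to sketch this without hand-parsing lead bytes is to let a CharsetDecoder report the first malformed sequence. This is a swapped-in technique rather than the exact loop described above; the class and method names are invented, and the file is read fully into memory, which may not suit a very large file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Validator {
    // Returns the byte offset of the first malformed UTF-8 sequence, or -1 if the data is valid.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // input position is left at the start of the bad sequence
            }
            if (result.isUnderflow()) {
                return -1;            // all input consumed without errors
            }
            out.clear();              // output buffer full: discard decoded chars and continue
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstInvalidOffset(data);
        if (offset < 0) {
            System.out.println("valid UTF-8");
        } else {
            System.out.printf("invalid byte 0x%02X at offset %d%n", data[offset] & 0xFF, offset);
        }
    }
}
```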

It's possible that your UTF-8 file starts with a Byte Order Mark, which Java's Readers often do not recognise.
Open the file in Notepad++. If the file has a BOM, Notepad++ will report "UTF-8" rather than "UTF-8 w/o BOM".
You can either convert the file to UTF-8 without BOM or use something like https://stackoverflow.com/a/2905038/1554386 to strip the BOM.

Symbols ' ï»¿ ' is showing when reading from text file

Using the same project and text file as here: Java.NullPointerException null (again), the program outputs the data, but with ï»¿ in front. To put you in the picture:
This program is a telephone directory. Ignoring the first "code" block, look at the second "code" block at that link; that is the text file with the entries. The program outputs them as it should, but it prints ï»¿ at the beginning of the entries read from the text file ONLY.
Any help on how to remove it? I am using a BufferedReader wrapping a FileReader.
Encoding of Text File: UTF-8
Using Java 7
Windows 7

Is the text file being read encoded as UTF-8 with a BOM? "ï»¿" looks like a BOM:
http://en.wikipedia.org/wiki/Byte_order_mark
Are you running Windows? Notepad++ should be able to convert the file. If you are using Linux or Vi(m), you can use ":set nobomb".

I suppose your input file is encoded in UTF-8 with BOM.
You can either save your input file without a BOM, or handle this in Java.
The obvious approach would be to use an InputStreamReader with the appropriate encoding. Sadly, that alone does not help: Java assumes that a UTF-8 encoded file has no BOM, so you have to handle that case manually.
A quick hack would be to check if the first three bytes of your file are 0xEF, 0xBB, 0xBF, and if they are, ignore them.
For a more sophisticated example, have a look at the UnicodeBOMInputStream class in this answer.
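The quick hack could look roughly like this. It is a sketch with made-up class and method names, it reads the whole file into memory, and it only handles the UTF-8 BOM, not the UTF-16 or UTF-32 ones.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class BomStripper {
    // True if the data starts with the three UTF-8 BOM bytes 0xEF, 0xBB, 0xBF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Decodes the data as UTF-8, ignoring a leading BOM if one is present.
    static String decodeUtf8SkippingBom(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.print(decodeUtf8SkippingBom(data));
    }
}
```

With a BufferedReader, an alternative along the same lines is to open the file explicitly as UTF-8 and drop a leading '\uFEFF' character from the first line read.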

JTextField to read chinese character

I need to read Chinese characters through Java input fields. I have installed Chinese
on my system and set the system locale to Chinese. I can type Chinese characters using
a US keyboard in Notepad. When I type in my application, I get only English characters. How can I make Java input fields accept Chinese characters? Please help me out.

How to make java input fields to read chinese character.
Perhaps it has something to do with the encoding of your text file. When I have trouble with encodings, I try to create the files in either Eclipse or OpenOffice, where I can specify the encoding. If you are on a Mac, you can also use TextEdit, or (even better) Bean.
Edit: I use UTF-8 encoding.
