(Intellij) "unclosed character literal" and "illegal character '\u00a7'" when compiling

I'm trying to assign the '§' character to a char (char c = '§'), but when I build, the build output reports "Unclosed character literal" and "Illegal character: '\u00a7'", followed by "Unclosed character literal" again. If I put the character into a String (String s = "§"), it compiles fine, but when I print it to the console, it prints the 'Â' character (which it shouldn't).
In another Java project, I can use the '§' character and compile normally, and it works as intended: printing it to the console shows nothing (which is normal, because it's used as an escape character for colouring the text). Neither that project nor the current one uses a BOM in IntelliJ, and both use UTF-8 encoding.
Does anyone know how to fix this? Thanks :)

How I could reproduce that error:
created a new Test.java with UTF-8 encoding (default)
added main with char c = '§'; and print it
run Test.main()
no errors, as expected.
So I tried:
created second file Test1.java and changed to ISO-8859-1 encoding
added main method with just a print command
run Test1.java
this time I got the reported error but for Test.java (still encoded as UTF-8)
Looks like IDEA uses the encoding of first file for the whole source code.
Solutions:
make sure all files are encoded with UTF-8; and/or
use javac -encoding UTF-8 ... as commented by saka1029
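One plausible explanation for both symptoms, assuming the UTF-8 source file is being read as ISO-8859-1: '§' is stored as the two bytes C2 A7, which then decode as the two characters 'Â' and '§'. A char literal holding two characters no longer closes where the compiler expects, while a String literal simply gains an 'Â'. The sketch below reproduces that mis-decoding at runtime; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class SectionSignMojibake {

    // Encode as UTF-8, then decode the same bytes as ISO-8859-1,
    // mimicking a compiler that reads a UTF-8 file with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The section sign is written as an escape so this file stays pure ASCII.
        String wrong = misdecode("\u00a7");
        System.out.println("length = " + wrong.length());
        for (char c : wrong.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The result has two characters, U+00C2 ('Â') followed by U+00A7 ('§'), which matches the stray 'Â' seen in the String case.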

Use Java's Character: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html
You can retrieve the character through charValue().
Also, make sure you're using the encoding scheme (eg UTF-16) that supports your character.

Related

ISO-8859-1 character encoding not working in Linux

I tried the code below on Windows and was able to decode the message, but the same code does not work when I run it on Linux.
String message ="Ã¶Ã¶Ã¶Ã¶Ã¶";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
I have verified that the default character set on the Linux platform is UTF-8 (Charset.defaultCharset().name()).
Please suggest how to do the same encoding on the Linux platform.

The explanation for this is, almost always, that somewhere bytes are turned into characters (or characters into bytes) at a point where the encoding is not explicitly specified, so it defaults to the platform default and gives different results depending on which platform you run it on.
Except that every place where you turn bytes into chars or chars into bytes in your snippet explicitly specifies an encoding.
Or does it?
String message ="Ã¶Ã¶Ã¶Ã¶Ã¶";
Ah, no, you forgot one place: javac itself.
You compile this code. That is where raw bytes (the compiler is looking at ManmohansSourceFile.java, which is a file, and a file isn't characters but a bunch of bytes) are converted into characters (because the Java compiler works on characters), and this is done using some encoding. If you don't use the -encoding switch when running javac (or, if maven or gradle is running javac, it passes an encoding that depends on your pom/gradle file), then the file is read using the system encoding, and whether the string actually contains the characters you intended is anyone's guess.
This is most likely the source of your problem.
The fix? Pick one:
Don't put non-ascii in your source files. Note that you can write the unicode symbol "Latin Capital Letter A with Tilde" as \u00C3 in your source file instead of as Ã. Then use \u00B6 for ¶.
String message ="\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
> ööööö
Ensure you specify the right -encoding switch when compiling. So, if your text editor (the one you use to type String message = "¶";) is configured as 'UTF-8', then run javac -encoding UTF-8 manMohansFile.java.

First of all, I'm not sure exactly what you are expecting...your use of the term "encode" is a bit confusing, but from your comments, it appears that with the input "Ã¶Ã¶Ã¶Ã¶Ã¶", you expect the output "ööööö".
On both Linux and OS X with Java 1.8, I do get that result.
I do not have a Windows machine to try this on.
As @Pshemo indicated, it is possible that your input, since it's hardcoded in the source code as a string, is being represented as UTF-8, not as ISO-8859-1. Actually, this is what I expected, and I was surprised that the code worked as you expected.
Try creating the input with String.encode(), encoding to ISO-8859-1.

Cannot compile Java file with non-ASCII character

Important:
I must use plain windows notepad only (neither IDE nor Notepad++ or any other text editors allowed).
So I have a simple class:
class Test{
public static void main(String[] args){
char c = 'қ';
System.out.println(c);
}
}
By default Notepad saves text files using ANSI encoding, but as you can see I have a non-ANSI character in my code. I can compile and run this code from the command prompt, but the output is ? instead of қ, which is to be expected. When I change the file's encoding to UTF-8, the compiler throws an error. I have read the article "Illegal Character when trying to compile java code", but it has no solution for my particular problem because, as I wrote above, I am not allowed to use any text editor other than Windows Notepad.
Thank you!

You probably need something like this:
char c = '\u039A';
I don't know the code of your 'k', but you may find it at https://www.ssec.wisc.edu/~tomw/java/unicode.html
Also, hopefully the Windows console has this character available for output.
P.S. The Windows console uses a particular code page. Try changing it in the console, for example:
REM change CHCP to UTF-8
CHCP 65001
CLS
Also remember that Windows console fonts differ; some of them can't draw specific symbols.
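For reference, the 'қ' in the question is U+049B (Cyrillic small letter ka with descender), so the escape form is '\u049B'. A minimal sketch, assuming the console has been switched to UTF-8 (for example with CHCP 65001) and uses a font that contains the glyph; the class name is hypothetical:

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class KaWithDescender {

    // Written as an escape so the source file stays pure ASCII and
    // compiles the same way regardless of the editor's encoding.
    static final char KA = '\u049B';

    // The UTF-8 bytes that get written for the character.
    static byte[] utf8Bytes() {
        return String.valueOf(KA).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap System.out so output is encoded as UTF-8 rather than the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(KA);
    }
}
```

Because the file contains only ASCII, it can be saved from Notepad in its default encoding and compiled without an -encoding switch.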

Yes, the problem is that javac is non-compliant in not accepting the BOM with UTF-8.
Use Notepad to save as Unicode (actually UTF-16LE).
Compile with
javac -encoding UTF-16 Test.java

Using Get-Content in PowerShell as Java input adds an extra character

I am practicing running a Java program from the command line on Windows 10. The program uses Scanner(System.in) to read input from a file and prints each string it reads. The PowerShell command is as follows:
Get-Content source.txt | java test.TestPrint
The content of the source.txt file is as follows:
:
a
2
!
And the TestPrint.java file is as follows:
package test;
import java.util.Scanner;
public class TestPrint {
public static void main(String[] args) {
// TODO Auto-generated method stub
Scanner in = new Scanner(System.in);
while(in.hasNextLine())
{
String str = in.nextLine();
if(str.equals("q")) break;
System.out.println( str );
}
}
}
Then a weird thing happened. The result is
?:
a
2
!
As you can see, it adds a question mark to the beginning of the first line. Then, when I change the character in the first line of source.txt from ":" to "a", the result is
a
a
2
!
It adds a space to the beginning of the first line.
I tested various characters and found a pattern: if the character's ASCII code is greater than that of "?" (63), such as "A" (65) or "[" (91), a space is added. If the character's code is less than or equal to that of "?", a question mark is added.

Could this be a Unicode issue (See: Java Unicode problems)? i.e. try specifying the type you want to read in:
Scanner in = new Scanner(System.in, "UTF-8");
EDIT:
Upon further research: in PowerShell 5.1 and earlier, the default code page is Windows-1252. PowerShell 6+ and the cross-platform versions have switched to UTF-8. So (from the comments) you may have to specify Windows-1252 encoding:
Scanner in = new Scanner(System.in, "Windows-1252");
To find out what encoding is being used, execute the following in PowerShell:
[System.Text.Encoding]::Default
And you should be able to see what encoding is being used (for me in PowerShell v 5.1 it was Windows-1252, for PowerShell 6 it was UTF-8).

There is no text but encoded text.
Every program reading a text file or stream must know and use the same character encoding that the writer used.
An adaptive default character encoding is a 90s solution to a 70s and 80s problem (approx). Today, it's usually best to avoid constructors and methods that use a default, and in PowerShell, add an encoding argument where needed to control input or output.
To prevent data loss, you can use the Unicode character set throughout. UTF-8 is the most common for files and streams. (PowerShell and Java use UTF-16 for text datatypes.)
But you need to start from what you know the character encoding of the text file is. If you don't know this metadata, that's data loss right there.
Unicode provides that if a file or stream is known to be Unicode, it can start with metadata called a BOM. The BOM indicates which specific Unicode character encoding is being used and what the byte order is (for character encodings with code units longer than a byte). [This provision doesn't solve any problem that I've seen and causes problems of its own.]
(A character encoding, at the abstract level, is a map between codepoints and code units and is therefore independent of byte order. In practice, a character encoding takes the additional step of serializing/deserializing code units to/from byte sequences. So, sometimes using or not using a BOM is included in the encoding's name or description. A BOM might also be referred to as a signature. Ergo, "UTF-8 with signature.")
As metadata, a BOM, if present, should be used if needed and always discarded when putting text into text datatypes. Unfortunately, Java's standard libraries don't discard the BOM. You can use popular libraries or a dozen or so lines of your own code to do this.
Again, start with knowing the character encoding of the text file and passing that metadata into the processing as an argument.
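As a rough illustration of those "dozen or so lines", here is one way to drop a leading BOM after decoding. It assumes the input is known to be UTF-8, so the BOM arrives as the single character U+FEFF; the class and method names are invented for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomStripper {

    // Remove a single leading U+FEFF if present; otherwise return the text unchanged.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the piped input is UTF-8. The charset is passed
        // explicitly instead of relying on the platform default.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        boolean first = true;
        while ((line = reader.readLine()) != null) {
            if (first) {
                line = stripBom(line); // a BOM can only appear at the very start
                first = false;
            }
            System.out.println(line);
        }
    }
}
```

If the writer used a different encoding (UTF-16, Windows-1252), the charset argument has to change accordingly; the helper itself stays the same.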

Ant compile: unclosed character literal

When I compile my web application using ant I get the following compiler message:
unclosed character literal
The line of offending code is:
protected char[] diacriticVowelsArray = { 'á', 'é', 'í', 'ó', 'ú' };
What does the compiler message mean?

Which encoding javac assumes for source files depends on the JDK: before JDK 18 it is the platform default encoding, and from JDK 18 onward it is UTF-8, unless -encoding says otherwise. Have you got your editor set up to save the source file in the encoding the compiler expects (ideally UTF-8)? If the two differ, the Java compiler will be confused (since you're using characters that are encoded differently in UTF-8 and other encodings) and be unable to decode your source.
It's also possible that your Java is set up to use a different encoding. In that case, try:
javac -encoding UTF8 YourSourceFile.java

Use UTF-8 encoding for the text files containing your Java sources.
or
Use '\uCODE' where CODE is the Unicode code point of á, é, etc. (for example, for 'á' you write '\u00E1').
You might need this:
http://www.fileformat.info/info/unicode/char/e1/index.htm
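Applied to the array from the question, the escape approach might look like the sketch below. The code points are the standard Latin-1 values for these letters; the wrapper class is hypothetical and the field is made static only so the example is easy to run.

```java
public class DiacriticVowels {

    // a-acute, e-acute, i-acute, o-acute, u-acute written as Unicode escapes,
    // so the file compiles identically under any ASCII-compatible -encoding.
    static final char[] diacriticVowelsArray =
            { '\u00E1', '\u00E9', '\u00ED', '\u00F3', '\u00FA' };

    public static void main(String[] args) {
        for (char c : diacriticVowelsArray) {
            // Print the numeric code point so the result does not depend on the console.
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```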

It worked for me to use " instead of the ' character.
Passing the -encoding UTF8 parameter to javac, as previously described, also worked.
This means that the compiler was not using UTF-8 encoding.

Get filename as UTF-8? (ä,ü,ö ... is always '?')

I have to read the names of some files and put them in a list as strings. It's not hard, but I have problems with some characters like ä, ö, ü: they always show up as '?' in my string.
What's the problem? Well, the encoding. OK, this should be easy... that's what I thought. So I tried to use calls like:
new String(insert.getBytes("UTF-8"))
or
new String(insert.getBytes("ISO-8859-1"), "UTF-8")
because most of the files are ISO-8859-1.
It's not helping. This is my code:
...
File[] fileList = dir.listFiles();
String insert;
for(File f : fileList) {
...
insert=f.getName().substring(0,f.getName().length()-4);
insert=insert.charAt(0)+insert.substring(1,insert.length()).toLowerCase().replaceFirst("([0-9]*(_s?(i)?(_dat)?)*$)", "").replaceFirst("_", " ");
...
System.out.println("test UTF8: " + new String(insert.getBytes("UTF-8"))); //not helping
System.out.println("test ISO , UTF8: " + new String(insert.getBytes("ISO-8859-1"), "UTF-8")); //not helping
...
names.add(insert);
}
At the end there are a lot of strings with '?' characters in my list.
How can I fix the problem? And what's the best way if the files are not only ISO-8859-1? (Let's say there are a lot of files with unknown encodings.)
Thank You!

Given the extended comments back and forth under the question, it now looks like this is either a font problem or (perhaps more likely) a filename encoding problem.
I asked Lissy to run the following commands to let us figure out what the problem is. If she is sure that the filenames contain "ä", but that character does not appear when she lists them with ls, then these commands will tell us whether this is a font or an encoding problem.
touch filenäme
ls filen*me
If this shows "filenäme" in the output of ls then we know the problem is with the creation/copy of the files onto this system. This could happen if the program which created the files didn't realize what the filesystem encoding was or was too stupid to do the right thing. The convmv program will probably be the best way to fix this.
convmv -f ENCODING -t utf8 -r .
The question is what is the proper encoding. Possibilities include UTF-16, cp850, or perhaps iso8859-1. convmv --list will show you the list of currently known (to your system) encodings. Since the listed command above only shows you what it might do, it is safe to run several times with different encodings until you find one which works for all files.
If this is a font problem, we'll have to look into that.

Unexpected question marks, splats, etc. in a String are a sign that something somewhere doesn't recognize a particular character when converting from one character set to another.
In your case, the problem could be occurring in a couple of places:
It could be occurring when your Java program is reading the file names from the directory (in the dir.listFiles() call).
It could be happening when you print the characters to the console stream.
In either case, the root cause is most likely a mismatch between what Java thinks the locale settings should be and the settings that the operating system and/or command shell are using.
As an experiment, try to list a directory containing the problematic file names from the command line. Do you see question marks or other splats there?
A second experiment to perform is to modify your Java program to dump one of the problem Strings as a sequence of numbers representing the character codes of each of its characters. Do you see the character code for an ASCII / Unicode '?'?
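The second experiment could be done with a small helper like the one below. It is only a sketch: the class name and the sample string are placeholders, and in the real program the argument would be the name returned by f.getName().

```java
import java.util.stream.Collectors;

public class CodePointDump {

    // Render each code point as U+XXXX so the output is unaffected by console encoding.
    static String dump(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Placeholder input: "filen", a-umlaut (as an escape), "me".
        System.out.println(dump("filen\u00E4me"));
    }
}
```

If the dump shows U+003F where the umlaut should be, the character was probably already replaced before it reached the string; if it shows U+00E4, the string is intact and the '?' is more likely introduced on the output side.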

The encoding of the content of the file name has nothing to do with the encoding of the file name itself.
You should get correct results from System.out.println(insert)
If you don't, it means that the shell has a different character encoding than the default character encoding for your system (this rarely happens; it would usually be the result of an explicit command to switch encodings in the shell).
If the file names are displayed correctly when you list the directory in the shell, I would expect them to be displayed correctly without specifying an encoding in your Java program.
If the shell is incapable of displaying the character (it is substituting the replacement character 0xFFFD (�) for these unprintable characters), there's nothing you can do from your Java application to change that. You need to change the terminal character encoding, install the right fonts, etc.; that is an operating system issue, not a Java issue.
At the same time, even if your terminal can't display the correct results, the Java program should be handling the character encodings correctly without your intervention.
The library behind the File API is figuring out the correct character encoding for your system and doing the necessary decoding into characters. Likewise, the database driver should negotiate with the database to determine the correct encoding, and do any necessary encoding into bytes on behalf of your application.

In a comment you wrote:
@mdrg: well, theres a Problem. I have to read the name of the files and then put them into a database. And there are a lot of '?' , that shouldnt be... – Lissy 27 mins ago
My guess is that the column you're inserting the filenames into specifies US-ASCII as the encoding and replaces characters outside that range with a replacement character, which in your case is the question mark.
So you have to find out the encoding for the column in your database table where you store the filenames. Various products have various syntaxes for retrieving that information.
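The same kind of replacement can be observed in plain Java, which may help confirm the diagnosis. This sketch only shows what String.getBytes does with a character the target charset cannot represent; it says nothing about how a particular database behaves. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LossyAsciiDemo {

    // Encode to US-ASCII and decode again; unmappable characters do not survive.
    static String throughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The a-umlaut is written as an escape to keep the source ASCII-only.
        System.out.println(throughAscii("filen\u00E4me")); // prints filen?me
    }
}
```

Once the character has been replaced this way, no later new String(..., "UTF-8") call can bring it back, which is why the conversions tried in the question have no effect.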

In Java 1.6 you can use System.console() instead of System.out.println() to display accented characters on the console.
public class Test {
public static void main(String args[]){
String s = "caractères français : à é \u00e9"; // Unicode for "é"
System.console().writer().println(s);
}
}
and the output is
C:\temp>java Test
caractères français : à é é
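One caveat: System.console() can return null when the JVM has no interactive console attached (for example when output is redirected, or when the program is started from some IDEs), in which case the call above throws a NullPointerException. A hedged sketch with a fallback; the class and method names are invented here:

```java
import java.io.Console;
import java.io.PrintWriter;

public class ConsoleOrStdout {

    // Prefer the console's writer; fall back to System.out when no console is attached.
    static PrintWriter writer() {
        Console console = System.console();
        if (console != null) {
            return console.writer();
        }
        return new PrintWriter(System.out, true);
    }

    public static void main(String[] args) {
        PrintWriter out = writer();
        // Accented letters are written as escapes so the source stays pure ASCII.
        out.println("caract\u00E8res fran\u00E7ais : \u00E0 \u00E9");
        out.flush();
    }
}
```

The fallback path uses the platform default encoding, so it may show the same problem as plain System.out.println(); it only avoids the crash.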
