How to change hyphen sign for question mark in java program?

I'm using these lines of code:
String string = "Some usefull information − don't know what happens with my output";
System.out.println(string);
String str2verify = driver.findElement(By.xpath("//someWellFormXpath")).getText();
Assert.assertEquals(str2verify , "Some usefull information − don't know what happens with my output");
This is what I get in my console, so the equals comparison fails.
Output:
Some usefull information ? don't know what happens with my output
expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]
java.lang.AssertionError: expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]

This is the process:
You write some text in an editor, which shows it to you as strings.
You save your file. Files are bytes, not characters, so your editor applies a charset encoding to do this. Which one? Your editor will know; you didn't mention which one you use, so I can't tell you.
Javac reads your file. Files are bytes, but javac needs characters, so javac applies a charset encoding to do this. Which one? "The platform default", unless you pass the -encoding parameter, or the tool that calls javac for you has a way to tell it which -encoding parameter to use.
Javac emits class files. These are byte based, so this doesn't require encoding.
Your JVM runs your class file. As part of running, a string is printed to standard out.
System.out refers to 'standard out'. On pretty much every OS this is a stream of bytes, meaning that when you send strings there, the JVM first encodes your string using some charset encoding, and then the bytes go to standard out.
Something is connected to the other end of standard out and sees these bytes. It converts the bytes back to a string, also using some encoding.
The characters are sent to the font rendering engine on your OS. Even if the character 'survived' all those conversions back and forth, it is possible your font doesn't have a glyph for it. Whatever the intended character is (an em dash, which is a dash about as wide as the letter 'm', an en dash, which is a bit shorter, or the Unicode minus sign), it is not the plain ASCII hyphen-minus from your keyboard, so it can get lost in any of these steps.
Count 'em up: that's about 6 conversions, and they all need to use the same charset encoding. So, check that your editor and javac agree on the charset encoding of your source file. Then check that the console showing the string agrees with standard out (which should be 'platform default', whatever that might be). Then check that the font you use has a glyph for the character. To take the platform default out of the picture on the output side, wrap System.out in a PrintStream with an explicit charset:
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
Then write to ps, not System.out - that's how you can explicitly force some charset to be used when writing to output.
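To see what the explicit charset changes, here is a minimal sketch that captures the bytes such a PrintStream would emit instead of sending them to the console. The class and method names are made up for illustration, and it assumes the character in the question is U+2212 (MINUS SIGN); substitute the code point of your own character.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitCharsetOutput {
    // Prints text through a PrintStream that uses the named charset and
    // returns the bytes that came out, as space-separated hex.
    static String printedBytesHex(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream ps = new PrintStream(sink, true, charsetName);
            ps.print(text);
            ps.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String minus = "\u2212"; // assumed character; written as an escape so the source stays ASCII
        System.out.println("UTF-8:    " + printedBytesHex(minus, "UTF-8"));    // E2 88 92
        System.out.println("US-ASCII: " + printedBytesHex(minus, "US-ASCII")); // 3F, which is '?'
    }
}
```

A charset that cannot represent the character silently substitutes '?' (byte 3F), which is the question mark that shows up in the console.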

It turns out that the character doesn't have a representation in the cp-1252 charset encoding, so in the end I had to switch all the files in my project to UTF-8 to be able to save it.
This encoding issue was a pain in the brain.
Thanks for all the suggestions friends.
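A rough way to check up front whether a charset can represent a given character is CharsetEncoder.canEncode. The sketch below uses a hypothetical helper name, and it assumes that cp-1252 is registered under the name windows-1252 in your JDK, so that lookup is guarded.

```java
import java.nio.charset.Charset;

class CanEncodeCheck {
    // True if every character of text has a representation in the named charset.
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String minus = "\u2212"; // assumed character from the question (MINUS SIGN)
        System.out.println("ISO-8859-1: " + canEncode("ISO-8859-1", minus));
        System.out.println("UTF-8:      " + canEncode("UTF-8", minus));
        // windows-1252 is an assumed charset name; not every runtime has to ship it
        if (Charset.isSupported("windows-1252")) {
            System.out.println("windows-1252: " + canEncode("windows-1252", minus));
        }
    }
}
```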

Related

ISO-8859-1 character encoding not working in Linux

I tried the code below on Windows and was able to decode the message, but when I tried the same code on Linux it did not work.
String message ="Ã¶Ã¶Ã¶Ã¶Ã¶";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
I have verified that the default character set on the Linux platform is UTF-8 (Charset.defaultCharset().name()).
Please suggest how to do the same decoding on the Linux platform.

The explanation for this is almost always that somewhere bytes are turned into characters, or characters into bytes, in a place where the encoding is not clearly specified. That place falls back to 'platform default', which causes different results depending on which platform you run it on.
Except, every place where you turn bytes to chars or chars to bytes in your snippet of code explicitly specifies the encoding.
Or does it?
String message ="Ã¶Ã¶Ã¶Ã¶Ã¶";
Ah, no, you forgot one place: javac itself.
You compile this code. That is where raw bytes (the compiler is looking at ManmohansSourceFile.java, which is a file, and a file isn't characters but a bunch of bytes) are converted into characters (because the java compiler works on characters), and this is done using some encoding. If you don't use the -encoding switch when running javac (or maven or gradle runs javac for you and passes an encoding that depends on your pom/gradle file), then the file is read using the system encoding, and whether the string actually contains the characters you intended is anyone's guess.
This is most likely the source of your problem.
The fix? Pick one:
Don't put non-ascii in your source files. Note that you can write the unicode symbol "Latin Capital Letter A with Tilde" as \u00C3 in your source file instead of as Ã. Then use \u00B6 for ¶.
String message ="\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
> ööööö
Ensure you specify the right -encoding switch when compiling. So, if your text editor (that you use to type String message = "¶";) is configured as 'UTF-8', then run javac -encoding UTF-8 manMohansFile.java.
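To make the mechanism concrete, here is a small sketch (class and method names are hypothetical) of how "ö" turns into "Ã¶" and back. Everything is written with escapes, so the result does not depend on how the source file is saved or compiled.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRoundTrip {
    // What you get when UTF-8 bytes are wrongly decoded as ISO-8859-1.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The repair used in the question: re-encode as ISO-8859-1, decode as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00F6";           // o with diaeresis
        String garbled = misdecode(original); // two chars: U+00C3 followed by U+00B6
        System.out.println(garbled.length());                 // 2
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The repair only works when the string really contains U+00C3 and U+00B6, which is exactly what the compiler's charset decides when the literal is typed as raw characters.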

First of all, I'm not sure exactly what you are expecting...your use of the term "encode" is a bit confusing, but from your comments, it appears that with the input "Ã¶Ã¶Ã¶Ã¶Ã¶", you expect the output "ööööö".
On both Linux and OS X with Java 1.8, I do get that result.
I do not have a Windows machine to try this on.
As @Pshemo indicated, it is possible that your input, since it's hardcoded in the source code as a string, is being represented as UTF-8, not as ISO-8859-1. Actually, this is what I expected, and I was surprised that the code worked as you expected.
Try creating the input explicitly from ISO-8859-1 bytes, for example with new String(bytes, "ISO-8859-1"), rather than from a literal in the source.

how to insert the ≠ sign into a string

What I want as an end result is this
System.out.println("This is the not equal to sign\n≠");
to appear (when run) as
This is the not equal to sign
≠
not to appear as
This is the not equal to sign
?
Is there any way to do this? I tried using windows character map, copied the symbol here, and in my code, but after changing encoding to UTF-8 and inserting it, it comes up as ? when run...
What can be done? Thanks in advance for answers to this utterly simple question

Set the character encoding to UTF-8 by passing this VM argument, provided your text editor already uses UTF-8 or supports this character:
-Dfile.encoding=UTF-8

As @Tobias Brandt says, you could use \u2260.
And by the way, @Crozin is also right about your console configuration.
Like this:
System.out.println("This is the not equal to sign \n\u2260");

There are five potential issues here:
1) In which charset encoding are you saving (from your editor) your Java source?
2) Which charset encoding does the Java compiler assume?
3) Which charset is your console?
4) Are you using some terminal with translation?
5) Does your console font include that particular character?
To get issues 1-2 right, you should use UTF-8 for both (editor and javac settings) or, more robustly, specify the Unicode char with an escape in pure ASCII text (Frakcool's answer).
For issue 3, try -Dfile.encoding=UTF-8 or see this answer. Issues 4-5 are outside the scope of your Java program. If you are unsure, just redirect the output to a file and look at it with a hex editor.
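For issue 3 it can help to print what the JVM itself believes about encodings. The sketch below is only a diagnostic, with a made-up class name. The property names native.encoding and stdout.encoding exist only on newer JDKs, so they are read with a fallback, and none of them is guaranteed to match what your terminal actually does.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingReport {
    // Collects the JVM's view of its encodings; absent properties show as "(not set)".
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("defaultCharset", Charset.defaultCharset().name());
        for (String key : new String[] {"file.encoding", "native.encoding", "stdout.encoding"}) {
            report.put(key, System.getProperty(key, "(not set)"));
        }
        // false when output is redirected to a file or pipe
        report.put("consoleAttached", String.valueOf(System.console() != null));
        return report;
    }

    public static void main(String[] args) {
        // Keys and values are plain ASCII, so this output survives any console charset.
        for (Map.Entry<String, String> e : collect().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```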

When you save the java file, make sure it is saved in the same charset as the one it was opened with.
In my Eclipse, when I save a file with special chars (such as \u2260) it asks me what charset I want to use.
Open your file in the terminal and inspect the content of the file.
Make sure it is the same char as the one in the editor you are using.

It seems that after Eclipse asked me if I want to change to UTF-8, it worked, only after I posted this.
Sorry for wasting your time

Get filename as UTF-8? (ä,ü,ö ... is always '?')

I have to read the names of some files and put them in a list as strings. It's not so hard; I just have problems with some characters like ä, ö, ü: they always show up as '?' in my string.
What's the problem? Well, the encoding. OK, this should be easy... that's what I thought. So I tried to use calls like:
new String(insert.getBytes("UTF-8"))
or
new String(insert.getBytes("ISO-8859-1"), "UTF-8")
because most of the files are ISO-8859-1.
It's not helping. This is my code:
...
File[] fileList = dir.listFiles();
String insert;
for(File f : fileList) {
...
insert=f.getName().substring(0,f.getName().length()-4);
insert=insert.charAt(0)+insert.substring(1,insert.length()).toLowerCase().replaceFirst("([0-9]*(_s?(i)?(_dat)?)*$)", "").replaceFirst("_", " ");
...
System.out.println("test UTF8: " + new String(insert.getBytes("UTF-8"))); //not helping
System.out.println("test ISO , UTF8: " + new String(insert.getBytes("ISO-8859-1"), "UTF-8")); //not helping
...
names.add(insert);
}
At the end there are a lot of strings with '?' characters in my list.
How can I fix the problem? And what's the best way if there are not only ISO-8859-1 files (let's say there are a lot of files with unknown encodings)?
Thank You!

Given the extended comments back and forth under the question, it now looks like this is either a font problem or (perhaps more likely) a filename encoding problem.
I asked Lissy to run the following commands so we can figure out what the problem is. If she is sure that the filenames contain "ä", but that character does not appear when she lists them with ls, then these commands will tell us whether this is a font problem or an encoding problem.
touch filenäme
ls filen*me
If this shows "filenäme" in the output of ls then we know the problem is with the creation/copy of the files onto this system. This could happen if the program which created the files didn't realize what the filesystem encoding was or was too stupid to do the right thing. The convmv program will probably be the best way to fix this.
convmv -f ENCODING -t utf8 -r .
The question is what is the proper encoding. Possibilities include UTF-16, cp850, or perhaps iso8859-1. convmv --list will show you the list of currently known (to your system) encodings. Since the listed command above only shows you what it might do, it is safe to run several times with different encodings until you find one which works for all files.
If this is a font problem, we'll have to look into that.

Unexpected question marks, splats, etc. in a String are a sign that something somewhere doesn't recognize a particular character when converting from one character set to another.
In your case, the problem could be occurring in a couple of places:
It could be occurring when your Java program is reading the file names from the directory (in the dir.listFiles() call).
It could be happening when you print the characters to the console stream.
In either case, the root cause is most likely a mismatch between what Java thinks the locale settings should be and the settings that the operating system and/or command shell are using.
As an experiment, try to list a directory containing the problematic file names from the command line. Do you see question marks or other splats there?
A second experiment to perform is to modify your Java program to dump one of the problem Strings as a sequence of numbers representing the character codes for each of the characters. Do you see the character code for an ASCII / Unicode '?'?
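The second experiment could look like the sketch below (the class and method names are hypothetical). The output is pure ASCII, so the console cannot distort it.

```java
class CodePointDump {
    // Renders each code point of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a problem string such as a file name from dir.listFiles()
        String name = "filen\u00E4me";
        System.out.println(dump(name));
    }
}
```

If the dump shows U+00E4 where the ä should be, the string is fine and the damage happens on output. If it shows U+003F ('?') or U+FFFD (the replacement character), the character was already lost before it reached the string.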

The encoding of the content of a file has nothing to do with the encoding of the file name itself.
You should get correct results from System.out.println(insert).
If you don't, it means that the shell has a different character encoding than the default character encoding for your system (this rarely happens; it would usually be the result of an explicit command to switch encodings in the shell).
If the file names are displayed correctly when you list the directory in the shell, I would expect them to be displayed correctly without specifying an encoding in your Java program.
If the shell is incapable of displaying the character (it is substituting the replacement character 0xFFFD (�) for these unprintable characters), there's nothing you can do from your Java application to change that. You need to change the terminal character encoding, install the right fonts, etc.; that is an operating system issue, not a Java issue.
At the same time, even if your terminal can't display the correct results, the Java program should be handling the character encodings correctly without your intervention.
The library behind the File API is figuring out the correct character encoding for your system and doing the necessary decoding into characters. Likewise, the database driver should negotiate with the database to determine the correct encoding, and do any necessary encoding into bytes on behalf of your application.

In a comment you wrote:
@mdrg: well, theres a Problem. I have to read the name of the files and then put them into a database. And there are a lot of '?' , that shouldnt be... – Lissy 27 mins ago
My guess is that the column you're inserting the filenames into specifies US-ASCII as the encoding and replaces characters outside that range with a replacement character, which in your case is the question mark.
So you have to find out the encoding for the column in your database table where you store the filenames. Various products have various syntaxes for retrieving that information.

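In Java 1.6 or later you can use System.console() instead of System.out.println() to display accented characters on the console.
public class Test {
    public static void main(String[] args) {
        String s = "caractères français : à é \u00e9"; // Unicode for "é"
        // System.console() returns null when no console is attached (e.g. output is redirected)
        if (System.console() != null) {
            System.console().writer().println(s);
        } else {
            System.out.println(s);
        }
    }
}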
and the output is
C:\temp>java Test
caractères français : à é é

Java Charset problem on linux

Problem: I have a string containing special characters which I convert to bytes and back. The conversion works properly on Windows, but on Linux the special character is not converted properly. The default charset on Linux is UTF-8, as seen with Charset.defaultCharset().displayName().
However, if I run on Linux with the option -Dfile.encoding=ISO-8859-1, it works properly.
How can I make it work using the UTF-8 default charset, without setting the -D option in the Unix environment?
Edit: I use jdk1.6.13
Edit: code snippet
Works with cs = "ISO-8859-1"; or cs = "UTF-8"; on Windows but not on Linux:
String x = "½";
System.out.println(x);
byte[] ba = x.getBytes(Charset.forName(cs));
for (byte b : ba) {
System.out.println(b);
}
String y = new String(ba, Charset.forName(cs));
System.out.println(y);
~regards
daed

Your characters are probably being corrupted by the compilation process and you're ending up with junk data in your class file.
if i run on linux with option -Dfile.encoding=ISO-8859-1 it works properly..
The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
In short, don't use -Dfile.encoding=...
String x = "½";
Since U+00bd (½) will be represented by different values in different encodings:
windows-1252 BD
UTF-8 C2 BD
ISO-8859-1 BD
...you need to tell your compiler what encoding your source file is encoded as:
javac -encoding ISO-8859-1 Foo.java
Now we get to this one:
System.out.println(x);
As a PrintStream, this will encode data to the system encoding prior to emitting the byte data. Like this:
System.out.write(x.getBytes(Charset.defaultCharset()));
That may or may not work as you expect on some platforms - the byte encoding must match the encoding the console is expecting for the characters to show up correctly.
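As a sanity check on the table above, here is a minimal sketch (hypothetical names) that prints the bytes of U+00BD per charset and runs the same bytes-and-back conversion as the question. The literal is written as an escape, so the compiler's source encoding cannot interfere. The windows-1252 lookup is guarded because that charset name is assumed rather than guaranteed on every runtime.

```java
import java.nio.charset.Charset;

class HalfSignBytes {
    // Bytes of text in the named charset, as space-separated hex.
    static String hex(String text, String charsetName) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    // The conversion from the question: chars to bytes and back with one charset.
    static boolean roundTrips(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs).equals(text);
    }

    public static void main(String[] args) {
        String half = "\u00BD"; // vulgar fraction one half
        for (String cs : new String[] {"windows-1252", "UTF-8", "ISO-8859-1"}) {
            if (Charset.isSupported(cs)) {
                System.out.println(cs + " " + hex(half, cs) + " roundTrips=" + roundTrips(half, cs));
            }
        }
    }
}
```

With a correct literal, the round trip succeeds for both charsets from the question, which points back at the compile step rather than at getBytes or new String.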

Your problem is a bit vague. You mentioned that -Dfile.encoding solved your Linux problem, but this is in fact only used to inform the Sun(!) JVM which encoding to use to manage filenames/pathnames on the local disk file system. And ... this doesn't fit the problem description you literally gave: "converting chars to bytes and back to chars failed". I don't see what -Dfile.encoding has to do with this. There must be more to the story. How did you conclude that it failed? Did you read/write those characters from/into a pathname/filename or so? Or were you maybe printing to stdout? Did stdout itself use the proper encoding?
That said, why would you like to convert the chars back and forth to/from bytes? I don't see any useful business purpose for this.
(Sorry, this didn't fit in a comment, but I will update this with the answer once you have given more info about the actual functional requirement.)
Update: as per the comments, you basically just need to configure the stdout/cmd so that it uses the proper encoding to display those characters. In Windows you can do that with the chcp command, but there's one major caveat: the standard fonts used in Windows cmd do not have the proper glyphs (the actual font pictures) for characters outside the ISO-8859 charsets. You can hack one thing or another in the registry to add proper fonts. No word about Linux as I don't use it extensively, but it looks like -Dfile.encoding is somehow the way to go. After all ... I think it's better to replace cmd with a cross-platform UI tool to display the characters the way you want, for example Swing.

You should make the conversion explicitly:
byte[] byteArray = "abcd".getBytes( "ISO-8859-1" );
new String( byteArray, "ISO-8859-1" );
EDIT:
It seems that the problem is the encoding of your java file. If it works on Windows, try compiling the source files on Linux with javac -encoding ISO-8859-1. This should solve your problem.

String.getBytes("ISO-8859-1") gives me 16-bit characters on OS X

Using Java 6 to get 8-bit characters from a String:
System.out.println(Arrays.toString("öä".getBytes("ISO-8859-1")));
gives me, on Linux: [-10, -28]
but on OS X I get: [63, 63, 63, -89]
I seem to get the same result when using the fancy new nio CharsetEncoder class. What am I doing wrong? Or is it Apple's fault? :)

I managed to reproduce this problem by saving the source file as UTF-8, then telling the compiler it was really MacRoman:
javac -encoding MacRoman Test.java
I would have thought javac would default to UTF-8 on OSX, but maybe not. Or maybe you're using an IDE and it's defaulting to MacRoman. Whatever the case, you have to make it use UTF-8 instead.
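The reproduction can also be simulated at runtime, without recompiling anything. In the sketch below (names are hypothetical) the literal is encoded with the charset the file was saved in, decoded with the charset the compiler was told to use, and the result is then encoded as ISO-8859-1 as in the question. The MacRoman charset is assumed to be registered as x-MacRoman and is not present in every JDK installation, so that call is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MisreadSource {
    // ISO-8859-1 bytes of a literal that was saved in one charset
    // but read by the compiler in another.
    static String latin1BytesAfterMisread(String literal, String savedAs, String compiledAs) {
        byte[] onDisk = literal.getBytes(Charset.forName(savedAs));
        String seenByCompiler = new String(onDisk, Charset.forName(compiledAs));
        return Arrays.toString(seenByCompiler.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String literal = "\u00F6\u00E4"; // the two letters from the question, as escapes
        System.out.println(latin1BytesAfterMisread(literal, "UTF-8", "UTF-8")); // [-10, -28]
        if (Charset.isSupported("x-MacRoman")) {
            // If the reproduction described above holds, this should match the OS X output
            System.out.println(latin1BytesAfterMisread(literal, "UTF-8", "x-MacRoman"));
        }
    }
}
```

Each 63 is a '?' that ISO-8859-1 substituted for a character it cannot represent, which is why the misread literal yields more bytes than the two expected.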

What is the encoding of the source file? 63 is the code for ? which means "character can't be converted to the specified encoding".
So my guess is that you copied the source file to the Mac and that the source file uses an encoding which the Mac java compiler doesn't expect. IIRC, OS X will expect the file to be UTF-8.

Your source file is producing "öä" by combining characters.
Look at this:
System.out.println(Arrays.toString("\u00F6\u00E4".getBytes("ISO-8859-1")));
This will print [-10, -28] like you expect (I don't like to print it this way, but I know it's not the point of your question), because here the Unicode code points are specified, carved in stone, and your text editor is not allowed to "play smart" by combining 'o' and 'a' with diacritic signs.
Typically, when you encounter such problems, you want to use two OS X Un*x commands to figure out what's going on under the hood: file and hexdump are very convenient in such cases.
You want to run them on your source file and you may want to run them on your class file.
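If a literal really does consist of a base letter plus a combining mark, java.text.Normalizer can compose it before encoding. This is a sketch with made-up names, not a claim about what the asker's editor actually stored.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

class CombiningVsPrecomposed {
    // ISO-8859-1 bytes of s, formatted like the output in the question.
    static String latin1Bytes(String s) {
        return Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Composes base letter + combining mark into one code point where possible (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "o\u0308a\u0308"; // 'o' and 'a', each followed by COMBINING DIAERESIS
        System.out.println(latin1Bytes(decomposed));          // [111, 63, 97, 63]
        System.out.println(latin1Bytes(compose(decomposed))); // [-10, -28]
    }
}
```

The combining mark U+0308 has no ISO-8859-1 representation, so without normalization it is replaced by 63 ('?').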

Maybe the character set for the source is not set (and thus different according to system locale)?
Can you run the same compiled class on both systems (not re-compile)?

Bear in mind that there's more than one way to represent characters. Mac OS X uses unicode by default, so your string literal may actually not be represented by two bytes. You need to make sure that you load the string from the appropriate incoming character set; for example, by specifying in the source a \u escape character.
