I have a byte[] containing the value -110 (and other negative values). When I convert it to a String I need it to display a ’ (right single quote). Currently, I am getting a question mark (?).
The ’ corresponds to the character #146 mentioned on this page, but I am now stuck on how to take -110 (or 146, i.e. -110 + 256) and turn it into a ’. I have also tried the charsets shown below. Any advice would be greatly appreciated.
byte[] b = {-110, 84};
System.out.println(new String(b, Charset.forName("Windows-1252"))); // displays ?T, but the desired output is ’T
System.out.println(new String(b, Charset.forName("UTF-8")));        // displays ?T, but the desired output is ’T
System.out.println(new String(b, Charset.forName("ISO-8859-1")));   // displays ?T, but the desired output is ’T
Thanks for the responses. As Jon Skeet points out in his reply, the Java program needed to read the input data as Windows-1252, and the Windows command line wasn't set to that code page either.
Setting the command-line code page to Windows-1252 was done by running:
chcp 1252
Starting the Java program with Windows-1252 as its default charset was done by adding the following JVM parameter:
-Dfile.encoding="Windows-1252"
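For reference, the same idea expressed in code (a minimal sketch; the class name and the explicit PrintStream wrapper are my own additions, not part of the original setup). Decoding the bytes as windows-1252 produces the ’, and encoding the console output explicitly removes the dependence on -Dfile.encoding, though the console code page still has to match (chcp 1252):

import java.io.PrintStream;
import java.nio.charset.Charset;

public class Cp1252Demo {
    public static void main(String[] args) throws Exception {
        // -110 is the signed view of the unsigned byte 146 (0x92),
        // which windows-1252 maps to U+2019, the right single quote.
        byte[] b = {-110, 84};

        // Decode with the charset the bytes were actually produced in...
        String s = new String(b, Charset.forName("windows-1252"));

        // ...and encode the console output explicitly rather than relying on file.encoding.
        PrintStream out = new PrintStream(System.out, true, "windows-1252");
        out.println(s); // ’T
    }
}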
I have the following code:
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File("C:/test.txt")), Charset.forName("windows-1252")));
writer.write(0x94); // written as question mark
writer.write(new char[] {(char)0x94}); // also written as question mark
writer.write(new char[] {'”'}); // written correctly as the right double quote character
writer.write(new String("”")); // written correctly as the right double quote character
writer.close();
My main question is: why are the first two writes not able to write the right double quote character (0x94 in windows-1252) correctly into the file?

I am on Windows (the system text encoding is windows-1252 according to PowerShell), the default charset of the JVM is windows-1252, and I am using Notepad++, which guesses ANSI as the encoding, so everything should be in sync. The stranger thing is: if I change the charset parameter to StandardCharsets.ISO_8859_1, the first two writes actually write the correct character into the file (even though the right double quote is not a valid ISO-8859-1 character!). I'm using OpenJDK 11.0.15.
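A note on why this happens, plus a minimal sketch (my own reading of the behaviour; the try-with-resources wrapper is my own choice): Writer.write(int) writes the character whose code is the given int, so 0x94 means the C1 control U+0094 rather than the windows-1252 byte 0x94. windows-1252 has no mapping for U+0094, so its encoder substitutes '?', while ISO-8859-1 maps U+0094 straight to byte 0x94, which Notepad++ then renders as ” when it guesses ANSI. Writing the character itself avoids the ambiguity:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class QuoteWrite {
    public static void main(String[] args) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream("C:/test.txt"), Charset.forName("windows-1252"))) {
            // write(int) takes a char value (a UTF-16 code unit), not a raw byte.
            writer.write('\u201D'); // RIGHT DOUBLE QUOTATION MARK -> byte 0x94 in windows-1252
            writer.write("\u201D"); // the same character, written as a String
        }
    }
}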
I have this code in java to take a PDF file and extract all the text:
File file = new File("C:/file.pdf");
PDDocument doc = PDDocument.load(file);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(doc);
System.out.println(content);
If we run the application on Windows, it works correctly and extracts all the text. However, when we move the app to the server, which runs Linux, the Spanish accents are converted into "strange" characters, e.g. "carÃ¡cter" (it should be "carácter"). I tried converting the String to bytes and then to UTF-8:
byte[] b = content.getBytes(Charset.forName("UTF-8"));
String text = new String(b);
System.out.println(text);
But it does not work: on Windows it keeps working fine, but on the Linux server it still shows the Spanish accents wrong, etc. I understand that if it works correctly in a Windows environment, it should work in a Linux environment too... Any idea what it could be or what I can do? Thank you.
Ã¡ is what you get when the UTF-8 encoded form of á is misinterpreted as Latin-1.
There are two possibilities for this to happen:
a bug in PDFTextStripper.getText() - Java strings are UTF-16 encoded, but getText() may be returning a string containing UTF-8 byte octets that have been expanded as-is to 16-bit Java chars, thus producing 2 chars 0x00C3 0x00A1 instead of 1 char 0x00E1 for á. Subsequently calling content.getBytes(UTF8) on such a malformed string would just give you more corrupted data.
To "fix" this kind of mistake, loop through the string copying its chars as-is to a byte[] array, and then decode that array as UTF-8:
byte[] b = new byte[content.length()];
for (int i = 0; i < content.length(); ++i) {
    b[i] = (byte) content.charAt(i); // copy each char's low byte as-is
}
String text = new String(b, "UTF-8");
System.out.println(text);
a configuration mismatch - PDFTextStripper.getText() may be returning a properly encoded UTF-16 string containing an á char as expected, but then System.out.println() outputs the UTF-8 encoded form of that string, and your terminal/console misinterprets the output as Latin-1 instead of as UTF-8.
In this case, the code you have shown is fine; you would just need to double-check your Java environment and terminal/console configuration to make sure they agree on the charset used for console output.
You need to check the actual char values in content to know which case is actually happening.
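For example, a small diagnostic sketch (mine, not part of the original answer), assuming content is the String returned by getText():

// Dump the first few chars of the extracted text to see which case applies.
int n = Math.min(content.length(), 20);
for (int i = 0; i < n; i++) {
    System.out.printf("U+%04X %c%n", (int) content.charAt(i), content.charAt(i));
}
// Malformed string (first case): 'á' shows up as the pair U+00C3 U+00A1.
// Console mismatch (second case): 'á' is a single U+00E1 and only the printed text looks wrong.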
I'm trying to use extended ASCII character 179 (it looks like a pipe).
Here is how I use it.
String cmd = "";
char pipe = (char) 179;
// cmd ="02|CO|0|101|03|0F""
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
Output
cmd 02³CO³0³101³03³0F
But the output is like this. I have read that extended ASCII characters are not displayed correctly.
Is my code correct and the character just isn't being displayed correctly, or is my code wrong?
I'm not concerned about showing this string to the user; I need to send it to a server.
EDIT
The vendor's API document states that we need to use ASCII 179 (which looks like a pipe). The server-side code expects 179 (part of extended ASCII) as the pipe/vertical line, so I cannot use 124 (the regular pipe).
EDIT 2
Here is the table for extended ASCII.
On the other hand, this table shows that ASCII 179 is "3". Why are there different interpretations of the same value, and which one should I consider?
EDIT 3
My default charset value is shown below (is this related to my problem?):
System.out.println("Default Charset=" + Charset.defaultCharset());
Default Charset=windows-1252
Thanks!
I have referred to
How to convert a char to a String?
How to print the extended ASCII code in java from integer value
Thanks
Use the code below.
String cmd = "";
char pipe = '\u2502';
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
System.out.println("int value: " + (int)pipe);
Output:
cmd 02│CO│0│101│03│0F
int value: 9474
I am using IntelliJ. This is the output I am getting.
Your code is correct; concatenating String values and char values does what one expects. It's the value of 179 that is wrong. You can google "unicode 179", and you'll find "Unicode Character 'SUPERSCRIPT THREE' (U+00B3)", as one might expect. And, you could simply say "char pipe = '|';" instead of using an integer. Or even better: String pipe = "|"; which also allows you the flexibility to use more than one character :)
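The two tables in the question disagree because "extended ASCII" depends on the code page: the old DOS code page 437 puts the box-drawing bar at byte 179, while Latin-1/windows-1252 puts ³ there, and (char) 179 in Java always means the Unicode code point U+00B3. A small sketch of the difference (the IBM437 charset normally ships with the JDK's extra charsets, but availability can vary):

import java.nio.charset.Charset;

public class Byte179 {
    public static void main(String[] args) {
        byte[] single = { (byte) 179 };
        // The same byte decodes to different characters under different code pages.
        System.out.println(new String(single, Charset.forName("IBM437")));       // │ (box-drawing bar)
        System.out.println(new String(single, Charset.forName("windows-1252"))); // ³ (superscript three)
        // In Java source, (char) 179 is always the Unicode code point U+00B3.
        System.out.println((char) 179);                                          // ³
    }
}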
In response to the new edits...
May I suggest that you fix this rather low-level problem not at the Java String level, but instead replace the byte encoding this character before sending the bytes to the server?
E.g. something like this (untested)
byte[] bytes = cmd.getBytes(); // cmd built with the plain ASCII '|', so the default charset is safe here
for (int i = 0; i < bytes.length; i++) {
    if (bytes[i] == '|') {
        bytes[i] = (byte) 179; // substitute the byte value the server expects
    }
}
// send command bytes to server
// don't forget endline bytes/chars or whatever the protocol might require. good luck :)
One of my data processing modules crashed while reading ANSI input. Looking at the string in question using a hex viewer, there was a mysterious 0xA0 byte at the end of it.
Turns out this is
Unicode Character 'NO-BREAK SPACE' (U+00A0).
I tried replacing that:
s = s.replace("\u00A0", "");
But it didn't work.
I then went and printed out what that character is using charAt and Java reports
65533
or 0xFFFD
(Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD))
Plugging that into the replace code, I finally got rid of it!
But why do I see an 0xA0 in the file, but Java reads it as 0xFFFD?
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"));
String line = r.readLine();
while (line != null) {
    // do stuff
    line = r.readLine();
}
U+FFFD is the "Unicode replacement character", which is generally used to represent "some binary data which couldn't be decoded correctly in the encoding you were using". (Sometimes ? is used for this instead, but U+FFFD is generally a better idea, as it's unambiguous.)
Its presence is usually a sign that you've tried to use the wrong encoding. You haven't specified which encoding you were using - or indeed how you were using it - but that's probably the problem. Check the encoding you're using and the encoding of the file. Be aware that "ANSI" isn't an encoding - there are lots of encodings which are known as ANSI encodings, and you'll need to pick the right one for your file.
How did you open the file?
If you use InputStreamReader(InputStream, Charset) you can specify the 'true' charset of the file you would like to open. If you do not specify the charset yourself, Java uses the default charset of your platform: on Unix this is often UTF-8, while on Windows it is often ISO-8859-1.
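For example (a sketch only, assuming the file really is windows-1252/"ANSI"; the path is hypothetical): opened with the right charset, the 0xA0 byte decodes to U+00A0 and the original replace works.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ReadAnsi {
    public static void main(String[] args) throws IOException {
        String path = "input.txt"; // hypothetical path
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream(path), Charset.forName("windows-1252")))) {
            String line;
            while ((line = r.readLine()) != null) {
                line = line.replace("\u00A0", ""); // matches now: byte 0xA0 decodes to U+00A0
                // do stuff with line
            }
        }
    }
}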
Why is the following displayed differently on Linux vs Windows?
System.out.println(new String("¿".getBytes("UTF-8"), "UTF-8"));
in Windows:
¿
in Linux:
Â¿
System.out.println() outputs the text in the system default encoding, but the console interprets that output according to its own encoding (or "codepage") setting. On your Windows machine the two encodings seem to match, but on the Linux box the output is apparently in UTF-8 while the console is decoding it as a single-byte encoding like ISO-8859-1. Or maybe, as Jon suggested, the source file is being saved as UTF-8 and javac is reading it as something else, a problem that can be avoided by using Unicode escapes.
When you need to output anything other than ASCII text, your best bet is to write it to a file using an appropriate encoding, then read the file with a text editor--consoles are too limited and too system-dependent. By the way, this bit of code:
new String("¿".getBytes("UTF-8"), "UTF-8")
...has no effect on the output. All that does is encode the contents of the string to a byte array and decode it again, reproducing the original string--an expensive no-op. If you want to output text in a particular encoding, you need to use an OutputStreamWriter, like so:
FileOutputStream fos = new FileOutputStream("out.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
osw.write("¿");
osw.close();
Not sure where the problem is exactly, but it's worth noting that
Â¿ (0xc2, 0xbf)
is the result of encoding with UTF-8
0xbf,
which is the Unicode codepoint for ¿.
So, it looks like in the Linux case, the output is not being displayed as UTF-8, but as a single-byte string.
Check what encoding your linux terminal has.
For gnome-terminal in ubuntu - go to the "Terminal" menu and select "Set Character Encoding".
For putty, Configuration -> Window -> Translation -> UTF-8 (and if that doesn't work, see this post).
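The mismatch described above can also be reproduced directly in code (a small sketch, not from the original answer; the character is written as a Unicode escape so the source encoding cannot interfere):

import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        String s = "\u00bf"; // ¿ (INVERTED QUESTION MARK)
        // UTF-8 encodes U+00BF as the two bytes 0xC2 0xBF...
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // ...and a terminal decoding those bytes as a single-byte charset shows two characters.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // Â¿
        System.out.println(new String(utf8, StandardCharsets.UTF_8));      // ¿
    }
}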
Run this code to help determine if it is a compiler or console issue:
public static void main(String[] args) throws Exception {
    String s = "¿";
    printHex(Charset.defaultCharset(), s);
    Charset utf8 = Charset.forName("UTF-8");
    printHex(utf8, s);
}

public static void printHex(Charset encoding, String s)
        throws UnsupportedEncodingException {
    System.out.print(encoding + "\t" + s + "\t");
    byte[] barr = s.getBytes(encoding);
    for (int i = 0; i < barr.length; i++) {
        int n = barr[i] & 0xFF;
        String hex = Integer.toHexString(n);
        if (hex.length() == 1) {
            System.out.print('0');
        }
        System.out.print(hex);
    }
    System.out.println();
}
If the encoded bytes for UTF-8 are different on each platform (it should be c2bf), it is a compiler issue.
If it is a compiler issue, replace "¿" with "\u00bf".
It's hard to know exactly which bytes your source code contains, or the string which getBytes() is being called on, due to your editor and compiler encodings.
Can you produce a short but complete program containing only ASCII (and the relevant \uxxxx escaping in the string) which still shows the problem?
I suspect the problem may well be with the console output on either Windows or Linux, but it would be good to get a reproducible program first.
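Something along these lines would do as such a program (a sketch only; the class name and the explicit UTF-8 PrintStream are my additions). It contains nothing but ASCII, writes the character as a \uxxxx escape, and prints the default charset alongside the output so compiler and console effects can be told apart:

import java.io.PrintStream;
import java.nio.charset.Charset;

public class Repro {
    public static void main(String[] args) throws Exception {
        String s = "\u00bf"; // ¿, escaped so the source file encoding cannot interfere
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default out: " + s);
        // The same string through an explicitly UTF-8 encoded stream, for comparison.
        new PrintStream(System.out, true, "UTF-8").println("utf-8 out: " + s);
    }
}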