Retaining special characters while reading from HTML in Java?

I am trying to read an HTML source file which contains German characters like ä ö ü ß €.
Reading it using JSoup:
citAttr.nextElementSibling().text()
Encoding the string with
unicodeEscaper.translate(citAttr.nextElementSibling().text())
org.apache.commons.lang3.text.translate.UnicodeEscaper
The issue is that after reading, the characters turn into �.
However, when reading a CSV file encoded as UTF-8, saving and retrieving the characters with the same unicodeEscaper works fine:
unicodeEscaper.translate(record.get(headerPosition.get(0)))
What is the issue with reading from HTML? I also tried the StringUtilEscaper methods, but the characters still turn into �.
private String getText(Part p) throws MessagingException, IOException {
    if (p.isMimeType("text/*")) {
        String s = (String) p.getContent();
        textIsHtml = p.isMimeType("text/html");
        return s;
    }
    // handling of multipart/* content omitted here
    return null;
}
This is how I am reading email that has HTML content.
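As a quick diagnostic, it can help to log the charset JavaMail itself reports for the part; for text/* parts, getContent() already returns a decoded Unicode String, so the German characters should still be intact at this point. A minimal sketch (the helper name is made up, not part of the question's code):

import javax.mail.MessagingException;
import javax.mail.Part;
import java.io.IOException;

// Sketch: log the charset declared in the part's Content-Type header.
// For text/* parts, getContent() returns an already-decoded Unicode String.
static void logPartCharset(Part p) throws MessagingException, IOException {
    if (p.isMimeType("text/*")) {
        System.out.println(p.getContentType());   // e.g. "text/html; charset=UTF-8"
        String html = (String) p.getContent();
        System.out.println(html.length() + " chars decoded");
    }
}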

I just answered a similar question today... I guess I can just type what I know about extended character sets (foreign-language characters), since that's one of the major facets of the software I write.
Java's internal Strings all use 16-bit chars (the primitive type char is a 16-bit value); a String is a sequence of UTF-16 code units. This means that Java (and Java Strings) have no problem representing the entire range of Unicode foreign-language alphabets.
JSoup, and just about any HTML tool written in Java, will hand downloaded web pages back to you as 16-bit characters - as Java Strings - just fine, without any problems. If these ranges display incorrectly, it is almost certainly not the download step, nor a JSoup or HttpURLConnection setting. When you save a web page into a String in Java, you have not lost those characters; you essentially get Unicode "for free."
HOWEVER: whenever a programmer saves such a String to a '.txt' or '.html' file and then views that file in a web browser, all they might see is that annoying question mark: �. This happens because you need to let the browser know that the '.html' file you saved from Java is not meant to be interpreted using the (much older, much smaller) 8-bit ASCII / Latin-1 range.
If you view an '.html' file in any web browser, or upload that file to Google Cloud Platform (or some other hosting site), you must do one of two things:
Include a <meta charset="UTF-8"> tag in the HTML page's <head> ... </head> section.
Or provide the setting in whatever hosting platform you use to identify the file as text/html; charset=UTF-8. In Google Cloud Platform Storage buckets there is a popup menu to assign this setting to any file.
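To make the first option concrete, here is a minimal sketch (the file name and sample content are made up, not taken from the question) that writes a Java String to an '.html' file so that both the bytes and the declared charset are UTF-8:

import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SaveHtmlUtf8 {
    public static void main(String[] args) throws Exception {
        String body = "<p>ä ö ü ß €</p>";
        // Declare the charset inside the page AND write the bytes in that same charset.
        String page = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head><body>"
                + body + "</body></html>";
        try (Writer out = Files.newBufferedWriter(Paths.get("out.html"), StandardCharsets.UTF_8)) {
            out.write(page);
        }
    }
}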

Related

Encoding a string doesn't work properly in Java

I am developing a JavaFX application. I need to create a TreeView programmatically, using Persian for its nodes' names.
The problem is that I see strange characters when I run the application. I have searched the web and similar SO questions, and wrote a function to do the encoding based on the answers to those questions:
public static String getUTF(String encodeString) {
    return new String(encodeString.getBytes(StandardCharsets.ISO_8859_1),
                      StandardCharsets.UTF_8);
}
And I use it to convert my string to build the TreeView:
CheckBoxTreeItem<String> userManagement =
new CheckBoxTreeItem<>(GlobalItems.getUTF("کاربران"));
This doesn't work properly for some characters; I still get strange results. If I don't use the encoding function at all, the output is garbled as well.
For hard-coded string literals you need to tell the javac compiler to use the same encoding as the Java source, say UTF-8. Check the IDE / build settings. You can \u-escape some Farsi symbols, for example
\u062f for Dal, د. If the escaped characters come through correctly, the compiler is using the wrong encoding for the source file.
A String always contains Unicode; no new String(...) re-conversion hack is needed.
When reading text from files, one needs to convert the bytes (byte[] / InputStream) to Java text (String / Reader), specifying the encoding of those bytes.
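A minimal sketch of both points, assuming a hypothetical UTF-8 file named labels.txt: the \u-escaped literal shows whether javac's source encoding is the problem, and the explicit charset removes the guesswork when reading bytes from a file:

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PersianCheck {
    public static void main(String[] args) throws Exception {
        // Source-level check: an escaped literal is immune to a wrong -encoding setting in javac.
        String dal = "\u062f";   // د (ARABIC LETTER DAL)
        System.out.println(dal);

        // File-level check: state explicitly which encoding the bytes are in (hypothetical file).
        try (BufferedReader in = Files.newBufferedReader(Paths.get("labels.txt"),
                                                         StandardCharsets.UTF_8)) {
            System.out.println(in.readLine());
        }
    }
}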

Character encoding issues?

We have a CLOB column in the DB. When we extract this CLOB and try to display it (plain text, not HTML), it prints some junk characters on the HTML screen. The character, when streamed directly to a file, looks like ” (not the usual double quote on a regular keyboard).
One more observation:
System.out.println("”".getBytes()[0]);
prints -108.
Why should a character byte be in the negative range? And is there any way to display the character correctly on an HTML screen?
Re: your final observation - Java bytes are always signed. To interpret them as unsigned, you can bitwise AND them with an int:
byte[] bytes = "”".getBytes("UTF-8");
for (byte b : bytes) {
    System.out.println(b & 0xFF);
}
which outputs:
226
128
157
Note that your string is actually three bytes long in UTF-8.
As pointed out in the comments, it depends on the encoding. For UTF-16 you get:
254
255
32
29
and for US-ASCII or ISO-8859-1 you get
63
which is a question-mark (i.e. "I dunno, some new-fangled character"). Note that:
The behavior of this method [getBytes()] when this string cannot be
encoded in the given charset is unspecified. The CharsetEncoder class
should be used when more control over the encoding process is
required.
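If you want a hard failure instead of silent question marks, here is a small sketch along the lines the javadoc suggests: with REPORT configured, encoding ” to US-ASCII throws rather than producing 63.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncode {
    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);   // fail instead of writing '?'
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap("”"));
            System.out.println("Encoded to " + bytes.remaining() + " bytes");
        } catch (CharacterCodingException e) {
            // U+201D has no US-ASCII representation, so we end up here
            System.out.println("Cannot encode: " + e);
        }
    }
}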
I think it is better to print the character's code point, like this:
System.out.println((int) '”');   // result is 8221
This link can help explain this unusual double quote (including its HTML code).
To answer your question about displaying the character correctly in an HTML document, you need to do one of two things: either set the encoding of the document, or entity-ize the non-ASCII characters.
To set the encoding you have two options.
Update your web server to send an appropriate charset argument in the Content-Type header. The correct header would be Content-Type: text/html; charset=UTF-8.
Add a <meta charset="UTF-8" /> tag to the head section of your page.
Keep in mind that Option 1 will take precedence over option 2. I.e. if you are already setting an incorrect charset in the header, you can't override it with a meta tag.
The other option is to entity-ize the non-ASCII characters. For the quote character in your question you could use &rdquo; or &#8221; or &#x201D;. The first is a user-friendly named entity, the second specifies the Unicode code point of the character in decimal, and the third specifies the code point in hex. All are valid and all will work.
Generally, if you are going to entity-ize dynamic content out of a database that contains unknown characters, you're best off just using the code-point versions of the entities, since you can easily write a method to convert any character above 127 to its appropriate code point.
One of the systems I currently work on actually ran into this issue: we took data from a UTF-8 source and had to serve HTML pages with no control over the Content-Type header. We ended up writing a custom Java Charset which could convert a stream of Java characters into an ASCII-encoded byte stream with all non-ASCII characters converted to entities, then wrapped the output stream in a Writer with that Charset and output everything as usual. There are a few gotchas in implementing a Charset correctly, but simply doing the encoding yourself is pretty straightforward; just be sure to handle surrogate pairs correctly.
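For the simpler "do the encoding yourself" route, here is a rough sketch of a helper (the method name is made up, and this is not the custom Charset described above) that converts anything above 127 to a decimal numeric character reference, using code points so that surrogate pairs come out as a single entity:

// Hypothetical helper: ASCII passes through, everything else becomes &#NNNN;.
public static String toAsciiWithEntities(String input) {
    StringBuilder out = new StringBuilder(input.length());
    int i = 0;
    while (i < input.length()) {
        int codePoint = input.codePointAt(i);      // combines surrogate pairs into one code point
        if (codePoint > 127) {
            out.append("&#").append(codePoint).append(';');
        } else {
            out.append((char) codePoint);
        }
        i += Character.charCount(codePoint);       // advance by 1 or 2 chars
    }
    return out.toString();
}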

String encoding doesn't output all characters

My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, it shows extra weird symbols like 'Â'.
I've tried converting the String to ASCII, but that messes up å, ä, ö, Ø, which I use.
I've tried food = food.replace("Â", ""); and indexOf(),
but the String won't find it, even though it's there in a hex editor.
So, in summary: when I use text.setText() (Android), the output looks fine with NO weird symbols, but when I save the text to a *.txt file I get about four 'Â' characters. I do not want ASCII, because I use other non-ASCII characters.
The 'Â' is displayed as whitespace on my Android device and in Notepad.
Thanks!
Have A great Weekend!
EDIT:
Solved it by removing all non-breaking spaces:
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
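A rough sketch of that idea (the class and method names are made up, and the Content-Type parsing here is deliberately simplistic; it ignores quoting and other edge cases):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CharsetAwareFetch {
    // Pick the decoder from the response header, falling back to UTF-8 if the server is silent.
    static BufferedReader openReader(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String contentType = conn.getContentType();   // e.g. "text/html; charset=ISO-8859-1"
        String charset = "UTF-8";                     // fallback
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = param.substring("charset=".length());
                }
            }
        }
        return new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));
    }
}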
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it only detects a file as UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting 'Â' characters. The Unicode NON-BREAKING SPACE character is U+00A0. When you encode it as UTF-8, you get the bytes C2 A0. But C2 in Latin-1 is CAPITAL A WITH CIRCUMFLEX ('Â'), and A0 in Latin-1 is NON-BREAKING SPACE. So the "confusion" is most likely that your program writes the *.txt file in UTF-8 while the tool reads it as Latin-1.
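The mix-up is easy to reproduce in a couple of lines (a sketch):

import java.nio.charset.StandardCharsets;

public class NbspDemo {
    public static void main(String[] args) {
        String nbsp = "\u00a0";                                // NON-BREAKING SPACE
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);   // C2 A0
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread);                           // "Â" followed by a non-breaking space
    }
}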

How to access file names with non-English characters

I am dealing with non-English filenames.
The problem is that my program cannot guarantee those directories and filenames are in English; if some filenames use Japanese or Chinese characters, they are displayed with characters like '?'.
Can anybody suggest what I need to do to access non-English file names?
The problem is that my program cannot guarantee those directories and filenames are in English. If a filename uses Japanese or Chinese characters, it will display characters like '?'.
The problem is apparently that "it" is using the wrong character set to display the filenames. The solution depends on whether "it" is your program (via a GUI), some other application, the command shell / terminal emulator, or the user's web browser. If you could provide more information, maybe I could offer some suggestions.
But turning the characters into underscores is most likely a bad solution. It is liable to lead to filename clashes, and those Chinese / Japanese / etc characters are most likely meaningful to the people who created the files.
By the way, the correct term for "english" letters is Latin.
EDIT
For your use-case, you don't need to store the PDF file using a filename that bears any relation to the supplied filename. I suggest that you try to solve the problem by using a filename consisting of Latin letters and digits generated from (say) System.currentTimeMillis(). If that fails, then your real problem has nothing to do with the filenames at all.
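For example (the naming scheme here is made up), something along these lines takes the supplied filename out of the equation entirely:

// Hypothetical ASCII-only, reasonably collision-resistant name for the stored PDF.
String storedName = "attachment-" + System.currentTimeMillis() + ".pdf";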
EDIT 2
You ask about the statement
if (fileName.startsWith("=?iso-8859"))
This seems to be trying to unpick a filename in MIME encoded-word format; see RFC 2047, Section 2.
Firstly, I think that code may be unnecessary. The javadoc is not specific, but I think the Part.getFileName() method should deal with decoding of the filename.
Second, if the decoding is necessary, then you are going about it the wrong way. The stuff after the charset cannot simply be treated as the value of the filename. Look at the RFC.
Third, if you do need to decode it yourself, you should use the relevant MimeUtility methods to decode "word" tokens ... like the filename (see the sketch after this list).
Fourthly, ISO-8859-1 is NOT a suitable encoding for characters in non-Latin character sets.
Finally, examine the raw email headers of the emails that you are trying to decode and look for the header line that starts
Content-Disposition: attachment; filename=...
If the filename looks like "=?iso-8859-1?...", and the filename is supposed to contain Japanese / Chinese / etc. characters, then the problem is in the client (or whatever) that constructed the email. The charset needs to be "utf-8" or one of the other multi-byte character sets.
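For the decoding itself, MimeUtility.decodeText handles RFC 2047 encoded-words; a small sketch (the sample value below is made up, not from the question):

import javax.mail.internet.MimeUtility;
import java.io.UnsupportedEncodingException;

public class DecodeFilename {
    public static void main(String[] args) {
        String raw = "=?iso-8859-1?Q?caf=E9.pdf?=";   // made-up sample encoded-word
        try {
            System.out.println(MimeUtility.decodeText(raw));   // prints "café.pdf"
        } catch (UnsupportedEncodingException e) {
            // the encoded-word names a charset this JVM doesn't know
            e.printStackTrace();
        }
    }
}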
Java uses Unicode natively - you don't need to replace special characters, as Unicode has no special characters - every code point is treated equally. Your replaceSpChars() may be the culprit here.

Unicode issue with an HTML title, question mark &#65533;

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/
When I use the apache.commons.lang StringEscapeUtils.escapeHtml method on the title element, I get the following:
Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010
However, when I display that in my web page with UTF-8 encoding, it just shows a question mark.
Using the following code:
String title = StringEscapeUtils.escapeHtml(myTitle);
If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output, which seems correct:
TITLE:
<title>Das hermetische Café: Rock & Wrestling 2010</title>
BECOMES (which is what I was expecting the escapeHtml method to do):
<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>
Any ideas? Thanks.
U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.
One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.
The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
This charset can also be passed to Java stream readers such as InputStreamReader, which has constructors that let you specify what kind of characters are entering the stream. I agree with the answer Erickson gave.
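Putting that answer into code, a minimal sketch that fetches the page with an ISO-8859-1 decoder so the "é" survives:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FetchWithCharset {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://kid37.blogger.de/stories/1670573/");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.ISO_8859_1))) {
            StringBuilder html = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                html.append(line).append('\n');
            }
            // The "é" is now intact, so StringEscapeUtils.escapeHtml(...) will produce
            // &eacute; rather than &#65533;.
        }
    }
}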
