I ran into strange behavior when parsing an HTML page that contains a Unicode character. Here is an example: git://gist.github.com/2995626.git.
What I did is:
File layout = new File(html_file);
Document doc = Jsoup.parse(layout, "UTF-8");
System.out.println(doc.toString());
What I expected was the HTML triangle character, but it comes out as "â–¼". Do you have any suggestions?
Thanks in advance.
Jsoup is perfectly capable of parsing HTML as UTF-8; in fact, UTF-8 is already its default character encoding. Your problem is caused elsewhere. Based on the information provided so far, I see two possible causes:
The HTML file was not originally saved as UTF-8 (or, one step earlier, it was not read as UTF-8).
The stdout (wherever System.out goes to) does not use UTF-8.
If you make sure that both are set correctly, your problem should disappear. If not, there is another possible cause which is not guessable from the information provided so far in your question. At the least, this blog should bring a lot of new insight: Unicode - How to get the characters right?
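To rule out the stdout cause, you can wrap System.out in a PrintStream that explicitly encodes as UTF-8. A minimal sketch (not Jsoup-specific; the triangle is just an example character):
import java.io.PrintStream;

public class Utf8Out {
    public static void main(String[] args) throws Exception {
        // force stdout to encode characters as UTF-8 instead of the platform default
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("\u25BC"); // BLACK DOWN-POINTING TRIANGLE
    }
}
Note that the console itself must also be configured to display UTF-8 (for example, the console encoding setting of your IDE).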
It is a problem caused by the Unicode escape sequences. Try the code below; the result will show you why your code is not working.
public static void main(String[] argv) {
    // note the escaped backslashes: the string contains the literal
    // text \u00e0 etc., as it would come back from the database
    String test = "Ch\\u00e0o bu\\u1ed5i s\\u00e1ng";
    System.out.println(unicode2String(test)); // prints: Chào buổi sáng
}
/**
 * Converts a string containing literal backslash-u escape sequences
 * into the corresponding decoded characters.
 */
public static String unicode2String(String unicode) {
    StringBuilder string = new StringBuilder();
    String[] hex = unicode.split("\\\\u");
    string.append(hex[0]);
    for (int i = 1; i < hex.length; i++) {
        // the first four characters after each escape are the hex code point
        int data = Integer.parseInt(hex[i].substring(0, 4), 16);
        string.append((char) data);
        // keep any ordinary text that follows the escape
        string.append(hex[i].substring(4));
    }
    return string.toString();
}
Maybe your code should be as follows:
System.out.println(unicode2String(doc.toString()));
I ran into the issue that my XML parser (VTD-XML) doesn't seem to be able to handle Unicode supplementary characters (please correct me if I'm already wrong here). It seems the parser only uses the lower 16 bits of such characters.
I cannot switch to another parser within the project I'm working on. I am parsing Medline abstracts (https://www.ncbi.nlm.nih.gov/pubmed), and it seems that documents containing supplementary characters have been added over the last year (e.g. https://www.ncbi.nlm.nih.gov/pubmed/?term=26855708, end of the results section).
As a quick and dirty fix I would just delete all characters above 0xFFFF from the documents (as sketched below). Obviously that will destroy some expressions in the document texts, so I'm not really happy with that solution.
Since I can't change the parser, I was wondering if there is some way to map supplementary characters to BMP characters that are likely to have a similar-looking glyph, where one exists.
Of course I welcome any other idea. It would even be fine to replace the supplementary characters with some kind of placeholder and then put the original characters back in afterwards, but this seems error-prone. Better ideas?
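For reference, the quick and dirty removal could look roughly like this (a minimal sketch; it walks the string by code point so surrogate pairs are handled as units):
// strip all supplementary characters (code points above 0xFFFF),
// which Java strings store as surrogate pairs
public static String stripSupplementary(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (cp <= 0xFFFF) {
            sb.appendCodePoint(cp);
        }
        i += Character.charCount(cp); // advance past both surrogate chars if needed
    }
    return sb.toString();
}
The same effect can be had on Java 7+ with s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "").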
Edit: Here is a (hopefully) minimal example of how this issue comes up with VTD-XML:
@Test
public void parseUnicodeBeyondBMP() throws NavException, FileNotFoundException, IOException, EncodingException, EOFException, EntityException, ParseException {
// character with code point 0x10400
String unicode = "<supplementary>\uD801\uDC00</supplementary>";
byte[] unicodeBytes = unicode.getBytes("UTF-8");
assertEquals(unicode, new String(unicodeBytes, "UTF-8"));
VTDGen vg = new VTDGen();
vg.setDoc(unicodeBytes);
vg.parse(false);
VTDNav vn = vg.getNav();
long fragment = vn.getContentFragment();
int offset = (int) fragment;
int length = (int) (fragment >> 32);
String originalBytePortion = new String(Arrays.copyOfRange(unicodeBytes, offset, offset+length), "UTF-8");
String vtdString = vn.toRawString(offset, length);
// this actually succeeds
assertEquals("\uD801\uDC00", originalBytePortion);
// this fails ;-( the returned character is Ѐ, codepoint 0x400, thus the high surrogate is missing
assertEquals("\uD801\uDC00", vtdString);
}
I was unable to insert Chinese characters into MySQL, so I thought of doing this. I have an Excel sheet containing Chinese characters, like 秀昭 and so on.
I converted them to Unicode escape representations like \uXXXX using the code below, which I got from SO, and then stored those in MySQL.
private static String escapeNonAscii(String str) {
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int cp = Character.codePointAt(str, i);
        int charCount = Character.charCount(cp);
        if (charCount > 1) {
            i += charCount - 1; // skip the low surrogate of the pair
            if (i >= str.length()) {
                throw new IllegalArgumentException("truncated unexpectedly");
            }
        }
        if (cp < 128) {
            retStr.appendCodePoint(cp); // plain ASCII is kept as-is
        } else {
            retStr.append(String.format("\\u%x", cp)); // everything else becomes a backslash-u escape
        }
    }
    return retStr.toString();
}
The values have been stored properly. So now I need to display them back. When I tried
System.out.println("\u8BF7\u5728\u6B64\u5904");
It gives me proper output like,
`请在此处`
But when I read from the DB and did something like
System.out.println(rs.getString(1).trim().toString() + " from DB");
It printed
`\u8BF7\u5728\u6B64\u5904`
What might be the problem? Have I missed anything? Please help.
Unicode escape sequences are only processed at compile time. To store and retrieve the data from a database, you only have to consider two things: make sure the data you read is decoded with the correct encoding, and make sure the correct encoding is set when printing the data.
If you read data on a Windows machine, it is possible you have to use one of the cp* encodings. Just use an InputStreamReader and set the charset. Now you have the data in the JVM; the internal representation is UTF-16. If you use a type 4 JDBC driver, you do not have to worry about encoding, except that your database needs an encoding capable of storing the data; UTF-8 or Unicode will do the trick. Consult your JDBC documentation for properties to set. Sometimes you have to set an encoding explicitly (jdbc:mysql://localhost:3306/?useUnicode=yes&characterEncoding=UTF-8).
When outputting the data, sometimes the output must have a specific encoding. Normally your JVM runs with the default system charset, but you may need another one, for example when rendering an HTML file.
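Since the database in your case ends up holding the literal text \u8BF7\u5728\u6B64\u5904 rather than the Chinese characters, the escapes have to be decoded again after reading. A minimal sketch, assuming Apache Commons Lang is on the classpath (its StringEscapeUtils.unescapeJava understands \uXXXX sequences):
import org.apache.commons.lang3.StringEscapeUtils;

public class UnescapeDemo {
    public static void main(String[] args) {
        // stands in for rs.getString(1).trim() from the question
        String stored = "\\u8BF7\\u5728\\u6B64\\u5904";
        String decoded = StringEscapeUtils.unescapeJava(stored);
        System.out.println(decoded + " from DB"); // prints: 请在此处 from DB
    }
}
Alternatively, store the Chinese characters directly and skip the escaping altogether once the connection and column encodings are correct.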
my first time posting!
The problem I'm having is that I'm using XPath and Tag-Soup to parse a webpage and read in the data. As these are news articles, sometimes they have links embedded in the content, and these are what is messing with my program.
The XPath I'm using is storyPath = "//html:article//html:p//text()"; where the page has a structure of:
<article ...>
<p>Some text from the story.</p>
<p>More of the story, which proves what a great story this is!</p>
<p>More of the story without links!</p>
</article>
My code relating to the xpath evaluation is this:
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);
LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
Node n = nL.item(i);
String tmp = n.toString();
tmp = tmp.replace("[#text:", "");
tmp = tmp.replace("]", "");
tmp = tmp.replaceAll("’", "'");
tmp = tmp.replaceAll("‘", "'");
tmp = tmp.replaceAll("–", "-");
tmp = tmp.replaceAll("¬", "");
tmp = tmp.trim();
story.add(tmp);
}
this.setStory(story);
...
private void setStory(LinkedList<String> story) {
String tmp = "";
for (String p : story) {
tmp = tmp + p + "\n\n";
}
this.story = tmp.trim();
}
The output this gives me is
Some text from the story.
More of the story, which proves
what a great story this is
!
More of the story without links!
Does anyone have a way of eliminating this error? Am I taking a wrong approach somewhere? (I understand I could well be with the setStory code, but I don't see another way.)
And without the tmp.replace() calls, all the results appear like [#text: what a great story this is] etc.
EDIT:
I am still having trouble, though possibly of a different kind. What is killing me here is again a link, but the way the BBC have their website, the link is on a separate line, so it is read in with the same problem as described before (note that problem was fixed with the example given). The section of code on the BBC page is:
<p> Former Queens Park Rangers trainee Sterling, who
<a href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a>
had not started a senior match for the Reds before this season.
</p>
which appears in my output as:
Former Queens Park Rangers trainee Sterling, who
moved to the Merseyside club in February 2010 aged 15,
had not started a senior match for the Reds before this season.
For the problem with your edit, where newlines in the HTML source code come out into your text document, you'll want to remove them before you print. Instead of System.out.print(text.trim()); do System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));
First find the paragraphs: storyPath = "//html:article//html:p". Then, for each paragraph, get out all the text with another XPath query, concatenate it without newlines, and put two newlines just at the end of the paragraph.
On another note, you shouldn't have to replaceAll("’", "'"). That is a sure sign that you are opening your file incorrectly. When you open your file you need to pass a Reader into Tag-Soup. You should initialize the Reader like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"), "Cp1252")); where you specify the correct character set for the file. A list of character sets is here: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html My guess is that it is Windows Latin-1.
The [#text: thing is simply the toString() representation of a DOM Text node. The toString() method is intended to be used when you want a string representation of the node for debugging purposes. Instead of toString() use getTextContent() which returns the actual text.
If you don't want the link content to appear on separate lines then you could remove the //text() from your XPath and just take the textContent of the element nodes directly (getTextContent() for an element returns the concatenation of all the descendant text nodes)
String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);
LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
Node n = nL.item(i);
story.add(n.getTextContent().trim());
}
The fact that you are having to manually fix up things like "’" suggests your HTML is actually encoded in UTF-8 but you're reading it with a single-byte character set such as Windows-1252. Rather than trying to fix it post hoc, you should work out how to read the data in the correct encoding in the first place.
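If the file really is UTF-8, reading it with an explicit UTF-8 Reader (instead of the platform default) avoids the mojibake entirely. A minimal sketch, with the file name as a placeholder:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// decode the bytes as UTF-8 so a right single quote arrives as one
// character instead of the garbled sequence "’"
Reader r = new BufferedReader(new InputStreamReader(
        new FileInputStream("myfilename.html"), "UTF-8"));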
so I have a large list of websites and I want to put them all in a String variable. I know I could go to each link individually and escape the //, but there are over a few hundred links. Is there a way to do a "block escape", so that everything in between the "block" is escaped? This is an example of what I want to save in the variable:
String links="http://website http://website http://website http://website http://website http://website"
Also can anyone think of any other problems I might run into while doing this?
I made it htp instead of http because I am not allowed to post "hyperlinks" according to Stack Overflow, as I am not at that level :p
Thanks so much
Edit: I am making this program because I have about 50 pages of a Word document filled with both emails and other text, and I want to filter out just the emails. I wrote the program to do this, which was very simple; now I just need to figure out a way to store the pages in a String variable for the program to run on.
Your question is not well written. Please improve it; in its current form it will be closed as "too vague".
Do you want to filter e-mails or websites? Your example is about websites, your text about e-mails. Since I don't know, and I decided to try to help you anyway, I decided to do both.
Here goes the code:
private static final Pattern EMAIL_REGEX =
Pattern.compile("[A-Za-z0-9](?:(?:[_\\.\\-]?[a-zA-Z0-9]+)*)@(?:[A-Za-z0-9]+)(?:(?:[\\.\\-]?[a-zA-Z0-9]+)*)\\.(?:[A-Za-z]{2,})");
private static final Pattern WEBSITE_REGEX =
Pattern.compile("http(?:s?)://[_#\\.\\-/\\?&=a-zA-Z0-9]*");
public static String readFileAsString(String fileName) throws IOException {
    File f = new File(fileName);
    byte[] b = new byte[(int) f.length()];
    DataInputStream is = null;
    try {
        is = new DataInputStream(new FileInputStream(f));
        is.readFully(b); // a single read() is not guaranteed to fill the buffer
        return new String(b, "UTF-8");
    } finally {
        if (is != null) is.close();
    }
}
public static List<String> filterEmails(String everything) {
List<String> list = new ArrayList<String>(8192);
Matcher m = EMAIL_REGEX.matcher(everything);
while (m.find()) {
list.add(m.group());
}
return list;
}
public static List<String> filterWebsites(String everything) {
List<String> list = new ArrayList<String>(8192);
Matcher m = WEBSITE_REGEX.matcher(everything);
while (m.find()) {
list.add(m.group());
}
return list;
}
To ensure that it works, first let's test the filterEmails and filterWebsites methods:
public static void main(String[] args) {
System.out.println(filterEmails("Orange, pizza whatever else joe@somewhere.com a lot of text here. Blahblah blah with Luke Skywalker (luke@starwars.com) hfkjdsh fhdsjf jdhf Paulo <aaa.aaa@bgf-ret.com.br>"));
System.out.println(filterWebsites("Orange, pizza whatever else joe@somewhere.com a lot of text here. Blahblah blah with Luke Skywalker (http://luke.starwars.com/force) hfkjdsh fhdsjf jdhf Paulo <https://darth.vader/blackside?sith=true&midclorians> And the http://www.somewhere.com as x."));
}
It outputs:
[joe@somewhere.com, luke@starwars.com, aaa.aaa@bgf-ret.com.br]
[http://luke.starwars.com/force, https://darth.vader/blackside?sith=true&midclorians, http://www.somewhere.com]
To test the readFileAsString method:
public static void main(String[] args) {
System.out.println(readFileAsString("C:\\The_Path_To_Your_File\\SomeFile.txt"));
}
If that file exists, its content will be printed.
If you don't like the fact that it returns List<String> instead of a String with items divided by spaces, this is simple to solve:
public static String collapse(List<String> list) {
StringBuilder sb = new StringBuilder(50 * list.size());
for (String s : list) {
sb.append(" ").append(s);
}
sb.delete(0, 1);
return sb.toString();
}
Sticking all together:
String fileName = ...;
String webSites = collapse(filterWebsites(readFileAsString(fileName)));
String emails = collapse(filterEmails(readFileAsString(fileName)));
I suggest that you save your Word document as plain text. Then you can use classes from the java.io package (such as Scanner) to read the text.
To avoid overwriting the String variable each time you read a line, you can use an array or an ArrayList. This is much better than holding all the web addresses in a single String, because you can easily access each address individually whenever you like; see the sketch below.
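A minimal sketch of that approach (the file name is a placeholder for your exported plain-text document):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ReadLinks {
    public static void main(String[] args) throws FileNotFoundException {
        List<String> links = new ArrayList<String>();
        Scanner sc = new Scanner(new File("links.txt"));
        while (sc.hasNextLine()) {
            links.add(sc.nextLine().trim()); // one address per line
        }
        sc.close();
        System.out.println(links.size() + " links read");
    }
}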
For your first problem, take all the text out of Word and put it in something that does regular expressions. Use regular expressions to quote each line and end each line with +, then edit the last line and change + to ;. Above the first line write String links =. Copy this new file into your Java source.
Here's an example using regexr.
To answer your second question (thinking of problems): there is an upper limit on a Java string literal, if I recall correctly 2^16 bytes in length.
Oh, and Perl was basically written for you to do this kind of thing (take 50 pages of text and separate out what is a URL and what is an email)... not to mention grep.
I'm not sure what kind of 'list of websites' you're referring to, but for, e.g., a comma-separated file of websites, you could read the entire file and use the String split function to get an array, or you could use a BufferedReader to read the file line by line and add to an ArrayList.
From there you can simply loop the array and append to a String, or if you need to:
do a "block escape", so everything in between the "block" is escaped
You can use a Regular Expression to extract parts of each String according to a pattern:
String oldString = "<someTag>I only want this part</someTag>";
String regExp = "(?i)(<someTag.*?>)(.+?)(</someTag>)";
String newString = oldString.replaceAll(regExp, "$2");
The above expression would remove the xml tags due to the "$2" which means you're interested in the second group of the expression, where groups are identified by round brackets ( ).
Using "$1$3" instead should then give you only the surrounding xml tags.
Another much simpler approach to removing certain "blocks" from a String is the String replace function, where to remove the block you could simply pass in an empty string as the new value.
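For example, a trivial sketch reusing the variables from above:
String oldString = "<someTag>I only want this part</someTag>";
// remove the surrounding tags by replacing them with nothing
String newString = oldString.replace("<someTag>", "").replace("</someTag>", "");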
I hope any of this helps; otherwise you could try to provide a full example with your input "list of websites" and the output you want.
I have to read a file (in an existing format not under my control) that contains an XML document and encoded data. This file unfortunately includes MQ-related data around it, including hex zeros (end-of-file markers).
So, using Java, how can I read this file, stripping or ignoring the "garbage" I don't need, to get at the XML and encoded data? I believe an acceptable solution is to just leave out the hex zeros (are there other values that will stop my reading?), since I don't need the MQ information (RFH header) anyway and the counts are meaningless for my purposes.
I have searched a lot and only found really heinously complicated "solutions". There must be a better way...
What worked was to pull out the XML documents - Groovy code:
public static final String REQUEST_XML = "<Request>";
public static final String REQUEST_END_XML = "</Request>";
/**
 * @param xmlMessage
 * @return 1-N EncodedRequests for those I contain
 */
private void extractRequests( String xmlMessage ) {
int start = xmlMessage.indexOf(REQUEST_XML);
int end = xmlMessage.indexOf(REQUEST_END_XML);
end += REQUEST_END_XML.length();
while( start >= 0 ) { //each <Request>
requests.add(new EncodedRequest(xmlMessage.substring(start,end)));
start = xmlMessage.indexOf(REQUEST_XML, end);
end = xmlMessage.indexOf(REQUEST_END_XML, start);
end += REQUEST_END_XML.length();
}
}
and then decode the base64 portion:
public String getDecodedContents() {
if( decodedContents == null ) {
byte[] decoded = Base64.decodeBase64(getEncodedContents().getBytes());
String newString = new String(decoded);
decodedContents = newString;
decodedContents = decodedContents.replace('\r','\t');
}
return decodedContents;
}
I've hit this issue before (well... something similar). Have a look at my FilterInputStream for a file filter that you should be able to modify to your needs.
Essentially it implements a push-back buffer that chucks away anything you don't want.
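As an illustration of the idea (a minimal sketch, not the linked class itself): a FilterInputStream that silently drops the hex-zero bytes so a downstream XML parser never sees them.
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ZeroSkippingInputStream extends FilterInputStream {
    public ZeroSkippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b == 0); // drop 0x00 bytes, pass everything else through
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // one byte at a time keeps the sketch simple; returning fewer bytes
        // than requested is allowed by the InputStream contract
        if (len == 0) {
            return 0;
        }
        int b = read();
        if (b == -1) {
            return -1;
        }
        buf[off] = (byte) b;
        return 1;
    }
}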