HtmlUnit asNormalizedText() returns empty string - java

I have this code:
HtmlPage rowPage = ...
String address1 = ((HtmlDivision)rowPage.getFirstByXPath("//div[contains(@class, 'client_address1')]")).asXml();
System.out.println("address1 = " + address1);
String address1_2 = ((HtmlDivision)rowPage.getFirstByXPath("//div[contains(@class, 'client_address1')]")).asNormalizedText();
System.out.println("address1_2 = " + address1_2);
and my output is:
address1 = <div class="client_address1 clientRow">
123 Somewhere ln
</div>
address1_2 =
I expect asNormalizedText() to return 123 Somewhere ln. What circumstances would cause asNormalizedText to return nothing?

A more specific XPath would help:
//div[contains(@class, 'client_address1')]/text()
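Note that with /text() the XPath returns a text node rather than an HtmlDivision, so the cast changes; a minimal sketch (assuming the same rowPage as in the question):
DomText textNode = rowPage.getFirstByXPath("//div[contains(@class, 'client_address1')]/text()");
System.out.println(textNode.getTextContent().trim()); // expected: 123 Somewhere ln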

What circumstances would cause asNormalizedText to return nothing?
From the javadoc:
Returns a normalized textual representation of this element that represents
what would be visible to the user if this page was shown in a web browser.
Please check the CSS class: maybe the style hides the text (e.g. display: none), in which case HtmlUnit treats it as not visible to the user.
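One way to check this (a sketch reusing rowPage from the question): getTextContent() returns the raw DOM text regardless of styling, while isDisplayed() reflects CSS visibility, so comparing the two shows whether a hiding style is the culprit.
HtmlDivision div = rowPage.getFirstByXPath("//div[contains(@class, 'client_address1')]");
System.out.println(div.getTextContent()); // raw DOM text, ignores CSS visibility
System.out.println(div.isDisplayed());    // false (e.g. display: none) would explain the empty asNormalizedText()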

Related

GWT - extract text in between two characters

In GWT I have a servlet that returns an image from the database to the client. I need to extract part of the string to properly show the image. What Chrome, Firefox, and IE return has a backslash in the src part, e.g. String s = "src=\"";, which is not visible in the string below. Maybe the backslash is escaping the quotes around the http string; I'm not sure.
What those 3 browsers return: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
The Edge browser doesn't have the backslash in the src, so my method to extract the image doesn't work in Edge.
What edge returns:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
Problem: I need to extract the string below.
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
either with src= or src=\
What I tried, which works with the browsers that return the escaped quote "src=\":
String s = "src=\"";
int index = returned.indexOf(s) + s.length();
image.setUrl(returned.substring(index, returned.indexOf("\"", index + 1)));
But it fails in Edge because Edge doesn't return the backslash.
I do not have access to Pattern and Matcher in GWT.
How can I extract
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
out of the returned string above, keeping in mind that the entityId number will change?
EDIT:
I need a generic way to extract out http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
when the string might look like either of these:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
The other 3 browsers return: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
public static void main(String[] args) {
    String toParse = "<img style=\"-webkit-user-select: none;\" src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
    String delimiter = "src=\"";
    int index = toParse.indexOf(delimiter) + delimiter.length();
    // Everything after src=" up to the next double quote
    System.out.println(toParse.substring(index).split("\"")[0]);
}
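That handles only the straight-quote case. Since the Edge string in the question uses curly quotes (U+201D) instead of \", a more generic GWT-safe sketch (a hypothetical helper; it assumes the URL ends at the first quote-like character and uses only String methods GWT emulates) could accept either style:
public static String extractUrl(String returned) {
    String marker = "src=";
    int start = returned.indexOf(marker) + marker.length() + 1; // skip the marker and the opening quote, straight or curly
    int end = start;
    while (end < returned.length()) {
        char c = returned.charAt(end);
        if (c == '"' || c == '\u201C' || c == '\u201D') break; // stop at a straight or curly closing quote
        end++;
    }
    return returned.substring(start, end);
}
Then image.setUrl(extractUrl(returned)); should work for both variants.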

No output for parsing google news content

For my code here, I want to get the Google News search titles & URLs.
It worked in the past. However, I don't know why it is not working now.
Did Google change its CSS structure or what?
Thanks
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    String google = "http://www.google.com/search?q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    String news = "&tbm=nws";
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset) + news)
            .userAgent(userAgent).get().select(".g>.r>.a");
    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}
If the question is "how do I get the code working again?", it would be difficult for anyone to know what the old page looked like unless they saved off a copy.
I broke down your select like this and it worked for me.
String string = google + URLEncoder.encode(search, charset) + news;
Document document = Jsoup.connect(string).userAgent(userAgent).get();
Elements links = document.select(".r>a");
The current page source looks like
<div class="g">
<table>
<tbody>
<tr>
<td valign="top" style="width:516px"><h3 class="r">Marlboro Ransomware Defeated in One Day</h3>
Results:
Title: Marlboro Ransomware Defeated in One Day
URL: https://www.bleepingcomputer.com/news/security/marlboro-ransomware-defeated-in-one-day/
Title: Stack Overflow puts a new spin on resumes for developers
URL: https://techcrunch.com/2016/10/11/stack-overflow-puts-a-new-spin-on-resumes-for-developers/
Edited - Time range
These URL parameters look awful.
Add the suffix &tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
Decoded (%3A is ':', %2C is ',', %2F is '/'), that is tbs=cdr:1,cd_min:5/30/2016,cd_max:6/30/2016.
The part "min%3A5%2F30%2F2016" contains your minimum date, 5/30/2016:
min%3A + (month of year) + %2F + (day of month) + %2F + (year)
And "max%3A6%2F30%2F2016" is your maximum date, 6/30/2016:
max%3A + (month of year) + %2F + (day of month) + %2F + (year)
Here's the full URL searching for Mindy Kaling between 05/30/2016 and 06/30/2016
https://www.google.com/search?tbm=nws&q=mindy%20kaling&tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
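To avoid hand-assembling the encoded value, a sketch (a hypothetical helper, not part of the original answer) that builds the suffix with URLEncoder and yields exactly the string above:
// Decoded form: cdr:1,cd_min:M/D/YYYY,cd_max:M/D/YYYY
static String dateRangeSuffix(String min, String max) throws UnsupportedEncodingException {
    return "&tbs=" + URLEncoder.encode("cdr:1,cd_min:" + min + ",cd_max:" + max, "UTF-8");
}
// dateRangeSuffix("5/30/2016", "6/30/2016") returns
// &tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016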
Below worked for me. Please note the pattern ".g .r>a": find elements with class g, then any descendant element with class r, then a tags that are direct children of it.
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset) + news)
        .userAgent(userAgent).get().select(".g .r>a");
From documentation:
.class: find elements by class name, e.g. .masthead
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
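A minimal demonstration of that difference, run against a stripped-down copy of the page source shown earlier (the table is what breaks the direct-child chain):
Document d = Jsoup.parse("<div class=g><table><tr><td><h3 class=r><a href=x>hit</a></h3></td></tr></table></div>");
System.out.println(d.select(".g>.r>a").size()); // 0: the table means .r is not a direct child of .g
System.out.println(d.select(".g .r>a").size()); // 1: the descendant combinator still matches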
Though the solution works, relying on it is probably not recommended unless this is for study purposes or temporary use. Shipping this as part of a product might lead to failure any time Google changes their page rendering.

Preserving the <br> tags when cleaning with Jsoup

For the input text:
<p>Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?
I run the following code:
Whitelist list = Whitelist.simpleText().addTags("br");
// Some other code...
// plaintext is the string shown above
retVal = Jsoup.clean(plaintext, StringUtils.EMPTY, list,
new Document.OutputSettings().prettyPrint(false));
I get the output:
Arbit string <b>of</b>
text. <em>What</em> to <strong>do</strong> with it?
I don't want Jsoup to convert the <br> tags to line breaks; I want to keep them as-is. How can I do that?
Try this:
Document doc2deal = Jsoup.parse(inputText);
doc2deal.select("br").append("br"); //or append("<br>")
This is not reproducible for me. Using Jsoup 1.8.3 and this code:
String html = "<p>Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?";
String cleaned = Jsoup.clean(html,
"",
Whitelist.simpleText().addTags("br"),
new Document.OutputSettings().prettyPrint(false));
System.out.println(cleaned);
I get the following output:
Arbit string <b>of</b><br><br>text. <em>What</em> to <strong>do</strong> with it?
Your problem must be somewhere else, I guess.
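For what it's worth, the prettyPrint(false) in that snippet is doing real work: with the default settings Jsoup pretty-prints the cleaned fragment and may re-wrap lines around tags like <br>. A quick side-by-side sketch (same imports as above; the exact pretty-printed layout may vary by Jsoup version):
String input = "<p>Arbit string <b>of</b><br><br>text.";
Whitelist list = Whitelist.simpleText().addTags("br");
System.out.println(Jsoup.clean(input, "", list)); // default: pretty-printed output
System.out.println(Jsoup.clean(input, "", list, new Document.OutputSettings().prettyPrint(false))); // layout preserved as-is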

java / jsoup - retrieve language

I use Jsoup to crawl content from specific websites.
Example, meta-tags:
String meta_description = doc.select("meta[name=description]").first().attr("content");
What I need to crawl as well is the language. What I do:
String meta_language = doc.select("http-equiv").first().attr("content");
But what is thrown:
java.lang.NullPointerException
Could anybody help me with this?
Greetings!
Try this:
String meta_language = doc.select("meta[http-equiv=content-language]").first().attr("content");
System.out.println("Meta language : " + meta_language);
However, if you have a list of content in your meta tag, then you can use this:
//get meta keyword content
String keywords = doc.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);

XMLParser is eating my whitespace

I am losing significant whitespace from a wiki page I am parsing and I'm thinking it's because of the parser. I have this in my Groovy script:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
slurper.keepWhitespace = true
inputStream.withStream{ doc = slurper.parse(it)
    println "originalContent = " + doc.'**'.find{ it.@id == 'editpageform' }.'**'.find { it.@name=='originalContent'}.@value
}
Where inputStream is initialized from a URL GET request to edit a confluence wiki page.
Later on in the withStream block, where I do this:
println "originalContent = " + doc.'**'.find{ it.#id == 'editpageform' }.'**'.find { it.#name=='originalContent'}.#value
I notice all the original content of the page is stripped of its newlines. I originally thought it was a server-side thing, but when I made the same request in my browser and viewed the source, I could see newlines in the "originalContent" hidden parameter. Is there an easy way to disable the whitespace normalization and preserve the contents of the field? The above was run against an internal Confluence wiki page but could most likely be reproduced when editing any arbitrary wiki page.
Updated above
I added a call to "slurper.keepWhitespace = true" in an attempt to preserve whitespace, but that still doesn't work. I'm thinking this method is intended for elements and not attributes? Is there a way to easily tweak flags on the underlying Java XMLParser? Is there a specific setting for whitespace in attribute values?
I first tried to reproduce this with some confluence page of my own, but there was no value attribute and no text content in the input node, so I created my own test html.
Now, I figured the tagsoup parser would need to be configured to preserve whitespace too; just setting this on the slurper won't help because the default is to ignore whitespace.
So I've done just that; the tagsoup feature ignorable-whitespace is documented, by the way (search for "whitespace" on the page).
Anyway, it doesn't work. Whitespace from attributes is preserved, as you can see from the example, and preserving text whitespace doesn't seem to work despite setting the extra feature. Maybe this is a bug in tagsoup or the XmlSlurper?
I suggest you have a closer look at your html too: is there really a value attribute present?
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
String html = """\
<html><head><title>test</title></head><body>
<p>
<form id="editpageform">
<p>
<input name="originalContent" value=" ">
</input>
</p>
</form>
</p>
</body></html>
"""
def inputStream = new ByteArrayInputStream(html.getBytes())
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature("http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace", true)
def slurper = new XmlSlurper(parser)
slurper.keepWhitespace = true
inputStream.withStream{ doc = slurper.parse(it)
    def parse = { doc.'**'.find{ it.@id == 'editpageform' }.'**'.find { it.@name=='originalContent'} }
    println "originalContent (name) = '${parse().@name}'"
    println "originalContent (value) = '${parse().@value}'"
    println "originalContent (text) = '${parse().text()}'"
}
It seems the newlines are not preserved in the value attribute. See below:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
String html = """\
<html><head><title>test</title></head><body>
<p>
<form id="editpageform">
<p>
<input name="originalContent" value="
">
</input>
</p>
</form>
</p>
</body></html>
"""
def inputStream = new ByteArrayInputStream(html.getBytes())
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature("http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace", true)
def slurper = new XmlSlurper(parser)
slurper.keepWhitespace = true
inputStream.withStream{ doc = slurper.parse(it)
    def parse = { doc.'**'.find{ it.@id == 'editpageform' }.'**'.find { it.@name=='originalContent'} }
    println "originalContent (name) = '${parse().@name}'"
    println "originalContent (value) = '${parse().@value}'"
    println "originalContent (text) = '${parse().text()}'"
    assert parse().@value.toString().contains('\n') : "Should contain a newline"
}
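This matches XML 1.0 attribute-value normalization: a conforming parser must replace line breaks inside attribute values with spaces before the application ever sees them, so no slurper or tagsoup flag can bring them back. A plain-Java demonstration (a sketch independent of tagsoup):
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;

public class AttrNormalization {
    public static void main(String[] args) throws Exception {
        String xml = "<root attr=\"line1\nline2\"/>";
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // The conforming parser has normalized the literal newline to a space
        System.out.println(doc.getDocumentElement().getAttribute("attr")); // line1 line2
    }
}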
