jsoup output in angularjs web - java

Document doc = Jsoup.connect("https://www.example.com/p/laptop-aksesoris").get();
Element element = doc.select("div.product-card a").first();
Elements elements = element.getElementsByAttribute("href");
for (Element e : elements) {
    System.out.println("url: " + e.attr("href"));
    System.out.println("text: " + e.text());
}
output:
url: {{model.url}}
text: {{model.name}} {{model.price}} {{e.title}} Grosir {{wp.quantity_min}} - {{wp.quantity_max}} ≥ {{wp.quantity_min}} {{wp.price}} PO
This site uses AngularJS (version 1) for its front end.
When I run the same code against a site that does not use AngularJS, it works well.
Questions:
What is happening here?
How can I fix it?

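AFAIK jsoup only parses the HTML that the server returns and builds a DOM from it. It does not execute JavaScript, so it never evaluates AngularJS bindings: placeholders such as {{model.url}} come back exactly as they appear in the page source. To get the rendered values you would need something that actually runs the page's JavaScript (for example a headless browser), or you could request the data from the endpoint the page itself calls.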
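Since jsoup hands back the unrendered template, it can help to detect that situation before trusting the scraped values. The following is a minimal sketch using only the JDK; the class name `TemplateCheck` and the method `looksUnrendered` are hypothetical helpers, not part of jsoup, and the check assumes the site uses the default `{{ ... }}` AngularJS delimiters.

```java
import java.util.regex.Pattern;

// Hypothetical helper: flags text that still contains AngularJS-style bindings.
class TemplateCheck {
    // Matches the default AngularJS interpolation markers, e.g. {{model.url}}.
    private static final Pattern BINDING = Pattern.compile("\\{\\{.+?\\}\\}");

    // Returns true if the scraped text still contains an unrendered binding.
    static boolean looksUnrendered(String scrapedText) {
        return scrapedText != null && BINDING.matcher(scrapedText).find();
    }

    public static void main(String[] args) {
        System.out.println(looksUnrendered("{{model.name}} {{model.price}}"));
        System.out.println(looksUnrendered("Laptop sleeve 15 inch"));
    }
}
```

If the check fires for a value such as `e.attr("href")`, the page most likely needs a JavaScript-capable client rather than a different selector.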

Related

XPath parsing failing with dom4j for text function

My input xml is
String xml= "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<disks-array>\n" +
"<array-item>\n" +
" <value>\n" +
"<scsi>\n" +
"<bus>0</bus>\n" +
"<unit>0</unit>\n" +
"</scsi>\n" +
"<backing>\n" +
"<vmdk_file>[909_TCUP_02] u999orcat017t/u999orcat017t.vmdk</vmdk_file>\n" +
"<type>VMDK_FILE</type>\n" +
"</backing>\n" +
"<label>Hard disk 1</label>\n" +
"<type>SCSI</type>\n" +
"<capacity>107374182400</capacity>\n" +
"</value>\n" +
"<key>2000</key>\n" +
"</array-item>\n" +
"</disks-array>"
and the XPath filter is
"//array-item[contains(./value/backing/vmdk_file/text(),'u999orcat017t/u999orcat017t.vmdk')]"
Here is my parsing and filtering code
Document doc = DocumentHelper.parseText(xml);
XPath xp = DocumentHelper.createXPath(xpathQuery);
// evaluate the xpath
Object xpResult = xp.evaluate(doc);
Ideally it should return the array-item elements whose value/backing/vmdk_file text contains the given string. However, it gives me an empty result.
I am using dom4j 1.6.1 and jaxen 1.1.1.
What is going wrong?
Finally, after debugging for many hours, I figured out the root cause of the incorrect parsing: the text value of the element is split across multiple text nodes instead of being stored as a single node, so text() only sees the first fragment.
This turns out to be a bug in the dom4j library which is still open:
https://github.com/dom4j/dom4j/issues/21
The fix is to call document.normalize() before evaluating the XPath, which merges the adjacent text nodes.
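An alternative that may sidestep the split-text-node problem is to pass the element itself to contains() instead of its text() children, because the string value of an element is the concatenation of all its descendant text. The sketch below illustrates that expression with the JDK's built-in javax.xml.xpath rather than dom4j and jaxen, so it does not reproduce the bug; the class name `VmdkFilterDemo`, the method `countMatches`, and the shortened XML are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath engine, not dom4j/jaxen.
class VmdkFilterDemo {
    // Shortened version of the input XML from the question.
    static final String XML =
        "<disks-array><array-item><value><backing>"
        + "<vmdk_file>[909_TCUP_02] u999orcat017t/u999orcat017t.vmdk</vmdk_file>"
        + "<type>VMDK_FILE</type></backing></value>"
        + "<key>2000</key></array-item></disks-array>";

    // Evaluates the expression against the XML and returns the number of matching nodes.
    static int countMatches(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // contains() receives the vmdk_file element, so its whole string value is compared.
        String expr = "//array-item[contains(value/backing/vmdk_file, "
            + "'u999orcat017t/u999orcat017t.vmdk')]";
        System.out.println("matches: " + countMatches(XML, expr));
    }
}
```

Whether this expression also behaves as expected under dom4j 1.6.1 with jaxen 1.1.1 should be verified in that environment; normalize() remains the fix reported above.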

Jsoup (basic .class) CSS selectors not working - html() shows elements exist, and less specific selector works

Sorry, this is kind of a long post. If you scroll down to the bottom edit, the problem will probably be clear.
A portion of the page I'm parsing:
Sort by popularity
New manga
<a class="bigChar" href="/Manga/Shinryaku-Ika-Musume">Shinryaku! Ika Musume</a>
Comedy
I specifically want to pull the 3rd <a> tag with class "bigChar". It's the only occurrence of the "bigChar" class on the page. This should be straightforward enough. I do the same thing on a page from another site, and it works totally fine.
Document doc = MCache.getDocument(url);
if (doc == null)
    return null;
Series series = new Series();
series.source = this.getSourceName();
series.imageURL = getImageURL(doc);
M.debug("selecting from " + doc.html() + " " + doc.select(".bigChar") + " with a " + doc.select("a") + " final " + doc.select("a.bigChar"));
series.title = doc.select(".bigChar").first().ownText();
Above is the segment of code I'm running.
For some reason, doc.select(".bigChar") is not getting anything, so doc.select(".bigChar").first() returns null and calling ownText() on it throws an NPE.
You can see my debug output line in the code above. Here is what it outputs:
selecting from <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
...<div>
<a class="bigChar" href="/Manga/Shinryaku-Ika-Musume">Shinryaku! Ika Musume</a>
<p> <span class="info">Genres:</span> <a href="/Genre/Comedy" class="dotUnder" title="A dramatic work that is light and often humorous or satirical in tone and that usually
....
with a Login
Register
...
<a class="bigChar" href="/Manga/Shinryaku-Ika-Musume">Shinryaku! Ika Musume</a>
...
final
To be clear: in the "selecting from" part, I output the document.html(). This includes the desired <a> tag with the bigChar class.
The "with a" part, I print the results of document.select("a"). These results include the <a> tag with the bigChar class.
The .bigChar result is printed right after the HTML, and the a.bigChar result is printed after "final". For whatever reason, neither selector matches anything, so calling .first() on the returned Elements gives null.
Does anyone know what's going on here? I know selectors can be tricky but I'm pretty sure I'm not making a mistake considering it's just selecting a single class... I'm especially confused by why the a selector includes the tag I want, even showing the same class, but .bigChar doesn't select it.
Edit: I tried adding some more debugging code:
Elements as = doc.select("a");
for (Element e : as) {
    if (e.classNames().size() > 0)
        M.debug(e.classNames());
    if (e.classNames().contains("bigChar")) {
        M.debug("found!");
    }
}
as = doc.select(".bigChar");
M.debug("now: " + as);
for (Element e : as) {
    if (e.classNames().size() > 0)
        M.debug(e.classNames());
}
The output:
[logo]
[bigChar]
found!
[dotUnder]
[dotUnder]
[dotUnder]
[dotUnder]
[dotUnder]
now:
So the bigChar is even seen as a class, but the .bigChar selector is not getting it...

How to fetch an URL containing non-ASCII characters (ą, ś ...) with Jsoup?

I am using jsoup to parse some Polish sites, but I have a problem with special characters like "ą" and "ś" in the URL itself. For example, example.com/kąt is read as example.com/k.
Every query without these special characters works perfectly.
I have tried Document doc = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url), but it does not work.
Any other tips?
You want to encode your URL before passing it to Jsoup.
SAMPLE CODE
String url = "http://sjp.pl/maść";
System.out.println("BEFORE " + url);
String encodedURL = URI.create(url).toASCIIString();
System.out.println("AFTER " + encodedURL);
System.out.println("Title: " + Jsoup.connect(encodedURL).get().title());
OUTPUT
BEFORE http://sjp.pl/maść
AFTER http://sjp.pl/ma%C5%9B%C4%87
Title: maść - Słownik SJP
French locale
Jsoup 1.8.3
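One caveat with URI.create is that it rejects characters that are illegal in a URI, such as spaces. A possible variation, sketched here with only the JDK, is to build the URI from its parts with the multi-argument constructor, which quotes illegal characters, and then call toASCIIString() to percent-encode the non-ASCII ones. The class name `UrlEncodingDemo` and the method `encode` are hypothetical, and the sketch assumes the path is supplied unencoded.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: turns an unencoded path into an ASCII-only URL string.
class UrlEncodingDemo {
    static String encode(String scheme, String host, String path) {
        try {
            // The multi-argument constructor quotes illegal characters (e.g. spaces);
            // toASCIIString() then percent-encodes non-ASCII characters as UTF-8.
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI for path: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only: \u015b is ś, \u0107 is ć.
        System.out.println(encode("http", "sjp.pl", "/ma\u015b\u0107"));
    }
}
```

The resulting string can be passed to Jsoup.connect in the same way as encodedURL in the sample above.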

java / jsoup - retrieve language

I use jsoup to crawl content from specific websites.
For example, meta tags:
String meta_description = doc.select("meta[name=description]").first().attr("content");
I also need to extract the page language, so I tried:
String meta_language = doc.select("http-equiv").first().attr("content");
But this throws:
java.lang.NullPointerException
Can anybody help with this?
Greetings!
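http-equiv is an attribute of the <meta> tag, not a tag name, so select("http-equiv") matches nothing and first() returns null. Use an attribute selector instead (this assumes the page declares its language with a content-language meta tag):
String meta_language = doc.select("meta[http-equiv=content-language]").attr("content");
System.out.println("Meta language : " + meta_language);
The same pattern works for other meta tags, for example the keyword list: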
//get meta keyword content
String keywords = doc.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);

Find anchors in string and wrap those with a header

Hi, I'm pretty new to Java and I can't figure out a nice solution for my problem. I would like to add a header, in this case an H2, around the anchors in the following string.
Let's say li.getContent().toString() contains the following:
TEST<div class="description">lorem ipsum</div>
With the following code:
for (Element li : liS) {
out.append("<div class=\"result\">\n" +
"\t\t\t<ul class=\"resultset\">\t\n" +
"\t\t\t\t<li><div class=\"item\"> " +
li.getContent().toString() + "</div></li>\n" +
"\t\t\t</ul>\n" + "\t\t</div>");
}
What I want is for li.getContent().toString() to come out like this, with the H2 header added:
<h2>TEST</h2><div class="description">lorem ipsum</div>
Is there some kind of wrap function where I can find the Anchor and wrap it with a header?
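One possible approach, sketched here with only the JDK, is a regular-expression replacement that wraps every `<a ...>...</a>` element in `<h2>` tags. The class name `AnchorWrapper`, the method `wrapAnchors`, and the `href` value in the usage are hypothetical; the sketch assumes the anchors are not nested and that the markup is simple enough for a regex, which is not true of HTML in general, where an HTML parser would be the more robust choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper: wraps each anchor element in an <h2> header.
class AnchorWrapper {
    // Case-insensitive, dot matches newlines; the lazy quantifier stops at the first </a>.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>.*?</a>");

    static String wrapAnchors(String html) {
        // $0 is the whole matched anchor element.
        return ANCHOR.matcher(html).replaceAll("<h2>$0</h2>");
    }

    public static void main(String[] args) {
        String content = "<a href=\"/test\">TEST</a><div class=\"description\">lorem ipsum</div>";
        System.out.println(wrapAnchors(content));
    }
}
```

In the loop above, wrapAnchors(li.getContent().toString()) could then be appended in place of the raw content.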
