jsoup - Capturing <h1> elements excluding those that have the value - java

I'm using a crawler to capture data from a website.
Now, I'm trying to select all of the <h1> elements and print them (for now). I noticed that some headers contain only `&nbsp;`, which makes the data look empty.
I want to exclude <h1>s whose value is `&nbsp;`.
Here's what I have tried:
`private static void getAllH1(String url, Element tCon) {
    // System.out.println("Url: " + url);
    Elements headers1 = tCon.getElementsByTag("h1");
    System.out.println("Url\t\tHeader");
    for (Element h1 : headers1) {
        // Use equals() for string comparison; != compares references and is always true here
        if (h1.text().length() != 0 && !h1.text().equals("\u00a0")) {
            System.out.println(url + "\t\t" + h1.text());
        }
    }
}`
EDIT: I saw in one of the threads here that jsoup reads `&nbsp;` as \u00a0, but it's still not working.
Here is an example output:
`
Url Header
http://www.url.com/index.asp Quick Links
http://www.url.com/index.asp What's New
http://www.url.com/index.asp  
http://www.url.com/index.asp What's Next
http://www.url.com/index.asp What's On
http://www.url.com/index.asp Key Rates
http://www.url.com/index.asp Public Advisories
`
Thank you in advance!

I found the answer from this link:
Element.text() doesn't normalize '&nbsp;' whitespace #529
So what I did: I updated my jsoup from jsoup-1.9.2 to jsoup-1.11.2.
Then, when I ran the code (same code; no alteration), it finally recognized the `&nbsp;`.
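With a current jsoup, a small predicate can drop those headers regardless of how the entity comes back. This is only a sketch on plain strings (the class and method names are mine); the jsoup call that produces the header text is left to the caller:

```java
public class HeaderFilter {
    // jsoup 1.11+ returns U+00A0 (non-breaking space) for &nbsp; from
    // Element.text(); treat a header as blank when it contains only
    // regular or non-breaking whitespace.
    static boolean isBlankHeader(String text) {
        return text.replace('\u00a0', ' ').trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isBlankHeader("\u00a0"));      // true
        System.out.println(isBlankHeader("Quick Links")); // false
    }
}
```

In the loop above you would then write `if (!HeaderFilter.isBlankHeader(h1.text())) { ... }` instead of comparing against "\u00a0" directly.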

Related

Jsoup (basic .class) CSS selectors not working - html() shows elements exist, and less specific selector works

Sorry, this is kind of a long post. If you scroll down to the bottom edit, the problem will probably be clear.
A portion of the page I'm parsing:
Sort by popularity
New manga
<a class="bigChar" href="/Manga/Shinryaku-Ika-Musume">Shinryaku! Ika Musume</a>
Comedy
I specifically want to pull the 3rd <a> tag with class "bigChar". It's the only occurrence of the "bigChar" class on the page. This should be straightforward enough. I do the same thing on a page from another site, and it works totally fine.
Document doc = MCache.getDocument(url);
if (doc == null)
return null;
Series series = new Series();
series.source = this.getSourceName();
series.imageURL = getImageURL(doc);
M.debug("selecting from " + doc.html() + " " + doc.select(".bigChar") + " with a " + doc.select("a") + " final " + doc.select("a.bigChar"));
series.title = doc.select(".bigChar").first().ownText();
Above is the segment of code I'm running.
For some reason, doc.select(".bigChar") is not matching anything, so calling .first() on it returns null and the subsequent .ownText() call throws an NPE.
You can see my debug output line in the code above. Here is what it outputs:
selecting from <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
...<div>
<a class="bigChar" href="/Manga/Shinryaku-Ika-Musume">Shinryaku! Ika Musume</a>
<p> <span class="info">Genres:</span> <a href="/Genre/Comedy" class="dotUnder" title="A dramatic work that is light and often humorous or satirical in tone and that usually
....
with a Login
Register
...
<a class="bigChar" href="/Manga/Shinryaku-Ika-Musume">Shinryaku! Ika Musume</a>
...
final
To be clear: in the "selecting from" part, I output the document.html(). This includes the desired <a> tag with the bigChar class.
The "with a" part, I print the results of document.select("a"). These results include the <a> tag with the bigChar class.
At the end, I print "final" followed by document.select(".bigChar"). For whatever reason, this doesn't select anything. I've also tried a.bigChar as the selector. Calling .first() on the Elements returned by both of those selectors gives null, since it doesn't seem to select anything.
Does anyone know what's going on here? I know selectors can be tricky, but I'm pretty sure I'm not making a mistake, considering it's just selecting a single class. I'm especially confused that the a selector includes the tag I want, even showing the same class, but .bigChar doesn't select it.
Edit: I tried adding some more debugging code:
Elements as = doc.select("a");
for (Element e : as) {
    if (e.classNames().size() > 0)
        M.debug(e.classNames());
    if (e.classNames().contains("bigChar")) {
        M.debug("found!");
    }
}
as = doc.select(".bigChar");
M.debug("now: " + as);
for (Element e : as) {
    if (e.classNames().size() > 0)
        M.debug(e.classNames());
}
The output:
[logo]
[bigChar]
found!
[dotUnder]
[dotUnder]
[dotUnder]
[dotUnder]
[dotUnder]
now:
So the bigChar is even seen as a class, but the .bigChar selector is not getting it...

No output for parsing google news content

For my code here, I want to get the Google News search titles & URLs.
It worked in the past; however, it is not working now and I don't know why.
Did Google change its CSS structure or something?
Thanks
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    String google = "http://www.google.com/search?q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    String news = "&tbm=nws";
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset) + news).userAgent(userAgent).get().select(".g>.r>.a");
    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}
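The redirect-unwrapping step inside that loop can be isolated into a small helper. This is a sketch of the same substring-and-decode logic (the class and method names are mine; the URL shape is the one noted in the code's comment):

```java
import java.net.URLDecoder;

public class GoogleRedirect {
    // Unwraps "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>":
    // take the text between the first '=' and the first '&',
    // then URL-decode it.
    static String extractTarget(String googleUrl) throws Exception {
        String raw = googleUrl.substring(googleUrl.indexOf('=') + 1,
                googleUrl.indexOf('&'));
        return URLDecoder.decode(raw, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // prints "https://example.com/page"
        System.out.println(extractTarget(
                "http://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fpage&sa=U&ei=abc"));
    }
}
```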
If the question is "how do I get the code working again?", it would be difficult for anyone to know what the old page looked like unless they saved off a copy.
I broke down your select like this, and it worked for me.
String string = google + URLEncoder.encode(search , charset) + news;
Document document = Jsoup.connect(string).userAgent(userAgent).get();
Elements links = document.select( ".r>a");
The current page source looks like
<div class="g">
<table>
<tbody>
<tr>
<td valign="top" style="width:516px"><h3 class="r">Marlboro Ransomware Defeated in One Day</h3>
Results:
Title: Marlboro Ransomware Defeated in One Day
URL: https://www.bleepingcomputer.com/news/security/marlboro-ransomware-defeated-in-one-day/
Title: Stack Overflow puts a new spin on resumes for developers
URL: https://techcrunch.com/2016/10/11/stack-overflow-puts-a-new-spin-on-resumes-for-developers/
Edited - Time range
These URL parameters look awful.
Add the suffix &tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
This part, "min%3A5%2F30%2F2016", contains your minimum date, 5/30/2016 (URL-decoded it reads min:5/30/2016, since %3A is ':' and %2F is '/'):
min%3A + (month) + %2F + (day of month) + %2F + (year)
And "max%3A6%2F30%2F2016" is your maximum date, 6/30/2016:
max%3A + (month) + %2F + (day of month) + %2F + (year)
Here's the full URL searching for Mindy Kaling between 05/30/2016 and 06/30/2016
https://www.google.com/search?tbm=nws&q=mindy%20kaling&tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
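A quick way to sanity-check that suffix is to build the decoded form and let java.net.URLEncoder produce the escapes. This is just a sketch (the class and method names are mine):

```java
import java.net.URLEncoder;

public class NewsDateRange {
    // Builds the &tbs= suffix for a custom date range.
    // Dates are month/day/year strings, as in the example above.
    static String dateRangeSuffix(String minDate, String maxDate) throws Exception {
        String tbs = "cdr:1,cd_min:" + minDate + ",cd_max:" + maxDate;
        // URLEncoder turns ':' into %3A, ',' into %2C, and '/' into %2F
        return "&tbs=" + URLEncoder.encode(tbs, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // prints &tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
        System.out.println(dateRangeSuffix("5/30/2016", "6/30/2016"));
    }
}
```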
Below worked for me. Please note the pattern ".g .r>a": find elements with class g, then all elements inside those with class r, then the a tags that are direct children of those.
Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news)
.userAgent(userAgent).get().select( ".g .r>a");
From documentation:
.class: find elements by class name, e.g. .masthead
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
Though the solution worked, I guess relying on it might not be recommended unless this is for study purposes or temporary use. Shipping this as part of a product might lead to failure any time Google changes their page rendering.

Find anchors in string and wrap those with a header

Hi, I'm pretty new to Java, and I can't figure out a nice solution to my problem. I would like to add a header, in this case an h2, around the anchors in the following string.
Let's say li.getContent().toString() contains the following:
TEST<div class="description">lorem ipsum</div>
With the following code:
for (Element li : liS) {
    out.append("<div class=\"result\">\n" +
            "\t\t\t<ul class=\"resultset\">\t\n" +
            "\t\t\t\t<li><div class=\"item\"> " +
            li.getContent().toString() + "</div></li>\n" +
            "\t\t\t</ul>\n" + "\t\t</div>");
}
What I want is for li.getContent().toString() to show like this, after adding the h2 headers:
<h2>TEST</h2><div class="description">lorem ipsum</div>
Is there some kind of wrap function where I can find the anchor and wrap it with a header?

Is there a limit on the number of comments to be extracted from Youtube?

I am trying to extract comments on some YouTube videos using the youtube-api with Java. Everything is going fine except that I am not able to extract all the comments when the video has a large number of them (it stops somewhere between 950 and 999). I am following a simple method of paging through the CommentFeed of the VideoEntry, getting the comments on each page, and then storing each comment in an ArrayList before writing them to an XML file. Here is my code for retrieving the comments:
int commentCount = 0;
CommentFeed commentFeed = service.getFeed(new URL(commentUrl), CommentFeed.class);
do {
    // Gets each comment in the current feed and prints to the console
    for (CommentEntry comment : commentFeed.getEntries()) {
        commentCount++;
        System.out.println("Comment " + commentCount + " plain text content: " + comment.getPlainTextContent());
    }
    // Checks if there is a next page of comment feeds
    if (commentFeed.getNextLink() != null) {
        commentFeed = service.getFeed(new URL(commentFeed.getNextLink().getHref()), CommentFeed.class);
    } else {
        commentFeed = null;
    }
} while (commentFeed != null);
My question is: Is there some limit on the number of comments that I could extract or am I doing something wrong?
Use/refer to this:
String commentUrl = videoEntry.getComments().getFeedLink().getHref();
CommentFeed commentFeed = service.getFeed(new URL(commentUrl), CommentFeed.class);
for (CommentEntry comment : commentFeed.getEntries()) {
    System.out.println(comment.getPlainTextContent());
}
source
The max number of results per iteration is 50 (it seems), as mentioned here,
and you can use start-index to retrieve multiple result sets, as mentioned here.
The Google Search API, as well as YouTube comments search, limits you to a max of 1000 results; you can't extract more than 1000.
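Given those two limits, the paging arithmetic can be sketched separately from the gdata calls. This is a generic illustration, not API code; the class and method names are mine, and the 50/1000 figures are the limits quoted above:

```java
import java.util.ArrayList;
import java.util.List;

public class FeedPaging {
    // With max-results capped at pageSize (50 here) and a hard ceiling of
    // 1000 results, these are the 1-based start-index values to request,
    // one request per page.
    static List<Integer> startIndices(int totalWanted, int pageSize) {
        List<Integer> starts = new ArrayList<>();
        for (int start = 1; start <= Math.min(totalWanted, 1000); start += pageSize) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // prints [1, 51, 101]
        System.out.println(startIndices(120, 50));
    }
}
```

Asking for 2000 comments at 50 per page would still stop at start-index 951, which is consistent with extraction halting just under 1000.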

How to get h2 Tag of a table using Jsoup

I need some help scraping a webpage with Jsoup. I want to parse player profiles from the hcfactions webpage and gather their kills and deaths. The problem I'm running into is that each profile page is dynamically created and will only have said tables if the player has kills or deaths. So in order to tell which table I'm parsing, I need to get the header text that's set before each table.
example web page: http://www.hcfactions.net/index.php?action=playerinfo&player=Djmaddox.
Below is a html segment from the web page I'm scraping:
<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>
<tr><td>Date</td><td>Reason</td><td>Details</td></tr><tr><td>Dec 11 5:27pm CST</td>.....
I have this code that pulls the tables and counts entries, but it won't pull the h2 tags for me to select.
public void getPlayerDetails(String name) {
    String data = "";
    Avatar temp = _db.getPlayer(name);
    playerUrl = "http://www.hcfactions.net/index.php?action=playersearch&player=" + name;
    try {
        // data = Jsoup.connect(url)
        //         .url(url).get().html();
        playerDoc = Jsoup.connect(playerUrl).get();
    } catch (IOException ex) {
        Logger.getLogger(JParser.class.getName()).log(Level.SEVERE, null, ex);
    }
    if (playerDoc.select("table").size() == 1) {
        return;
    } else if (playerDoc.select("table").size() >= 2) {
        for (int x = 1; x < playerDoc.select("table").size(); x++) {
            System.out.println("deaths");
            Element table = playerDoc.select("table").get(x);
            Iterator<Element> ite = table.select("tr").iterator();
            int count = 0;
            while (ite.hasNext()) {
                data = ite.next().text();
                count++;
            }
            if (count > 0) {
                temp.setDeaths(count - 1);
            }
        }
    }
}
The tag <h2> is in an invalid position; that's why JSoup cannot find it, I think. You have to extract it yourself with regular expressions. You can get the content of the <h2> with the following code:
String tableToString = "<table class='table-bordered'>"
        + "<h2 style='text-align:center'>Deaths</h2>"
        + "<tr><td>Date</td><td>Reason</td><td>Details</td></tr>"
        + "</table>";
String regex = "<h2.*>(.*)?</h2>";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(tableToString);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
You can init tableToString with table.toString() from your code.
As ka3ak says, the <h2> is mispositioned. But you don't have to abandon your parser and resort to regex for that. Assuming JSoup is a decent HTML parser (I've never used it myself), the <h2> element should end up as the element immediately preceding the <table> element. Get your select statement to look for it there.
Elements headers=playerDoc.select("div.span10.offset1 h2");
IMHO your selections seem a little bit overcomplicated, but maybe they have to be like that. Anyway, the snippet above will get you every h2 tag present in the proper container.
Later on, you can select the required tables like this: Elements tables=playerDoc.select("div.span10.offset1 table"); and apply the proper data digging to them. The headers will be in corresponding order to the tables, of course. I think my job is done here :)
