Parsing HTML from dynamic pages in Java

I've come to a halt.
For a school project we have to parse a large number of links formatted like http://us.imdb.com/M/title-exact?Desperado%20(1995). If you go to this link, you'll see that the page gets built dynamically.
How could I use jsoup.org or something similar to get the HTML into my procedures? I'm trying to parse some names out of these pages.
I tried this:
Document doc;
doc = (Document) Jsoup.connect(url).get();
System.out.println("text : " + doc.title());
but it returns 403.
Help:(

Are you sure that connect(String url) on its own initializes all the default request parameters before you get the result? If not, set them explicitly first.
Try this:
Document doc = Jsoup.connect("http://www.imdb.com/title/tt0112851/")
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.get();
String title = doc.title();
System.out.println("text : " + title);

Related

Getting the particular (pre-formatted) text (from a website) using JSoup

I'm new to JSoup, and I want to get the text written in this specific HTML tag:
<pre class="cg-msgbody cg-view-msgbody"><span class="cg-msgspan"><span>**the text I want to get is present here, how can I get it using JSoup?**</span></span></pre>
Any help would be appreciated.
Thanks!
String html = "<pre class=\"cg-msgbody cg-view-msgbody\">"
+ "<span class=\"cg-msgspan\">"
+ "<span>**the text I want to get is present here, "
+ "how can I get it using JSoup?**</span>"
+ "</span>"
+ "</pre>";
org.jsoup.nodes.Document document = Jsoup.parse(html);
// select the last (innermost) <span> and read its text
Element span = document.select("span").last();
System.out.println("Text: " + span.text());

Jsoup doesn't parse XML correctly, missing tags

I want to parse an XML text, but Jsoup seems to delete the <col> tags.
This is what happens:
Original:
<rowh> <col>DTC Code</col> <col>Description</col> </rowh>
Result:
<rowh> DTC Code Description
</rowh>
This is the code I am using to see the content.
Document jDoc = Jsoup.parse(contentXML);
Log.d("Original", contentXML);
Log.d("Document", jDoc.outerHtml());
I need to count how many <col> tags are inside each <rowh> tag, but it always returns 0. I am using Jsoup version 1.11.2.
This may help you:
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rowh><col>DTC Code</col><col>Description</col></rowh>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Elements e = doc.select("rowh");
String text = e.text();
Log.i("TAG1", text);
Output:
DTC Code Description
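To answer the actual question (counting the <col> tags): the XML parser keeps them, so a child selector on the same doc works. For the sample string above this logs 2:
int colCount = doc.select("rowh > col").size();
Log.i("TAG2", "col count: " + colCount); // 2 for the sample above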

How to parse HTML text and links with java and jsoup

I need to parse text from a webpage. The text is presented in this way:
nonClickableText= link1 link2 nonClickableText2= link1 link2
I want to be able to convert it all to a string in Java. The non-clickable text should remain as it is, while the clickable text should be replaced with its actual link.
So in java I would have this:
String parsedHTML = "nonClickableText= example.com example.com nonClickableText2= example3.com example4.com";
Here are some pictures: first second
What exactly are link1 and link2? According to your example
"... nonClickableText2= example3.com example4.com"
they can differ, so what would the source be besides the href?
Based on your images, the following code should give you everything you need to adapt the final string presentation. First we grab the <strong> block and then walk through its child nodes, using <a> children together with their preceding text nodes:
String htmlString = "<html><div><p><strong>\"notClickable1\"<a rel=\"nofollow\" target=\"_blank\" href=\"example1.com\">clickable</a>\"notClickable2\"<a rel=\"nofollow\" target=\"_blank\" href=\"example2.com\">clickable</a>\"notClickable3\"<a rel=\"nofollow\" target=\"_blank\" href=\"example3.com\">clickable</a></strong></p></div></html>";
Document doc = Jsoup.parse(htmlString); //can be replaced with Jsoup.connect("yourUrl").get();
String parsedHTML = "";
Element container = doc.select("div>p>strong").first();
for (Node node : container.childNodes()) {
    if (node.nodeName().equals("a") && node.previousSibling().nodeName().equals("#text")) {
        parsedHTML += node.previousSibling().toString().replaceAll("\"", "");
        parsedHTML += "= " + node.attr("href") + " ";
    }
}
parsedHTML = parsedHTML.trim();
System.out.println(parsedHTML);
Output:
notClickable1= example1.com notClickable2= example2.com notClickable3= example3.com
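A small variation on the same traversal, in case you prefer working with Jsoup's TextNode type instead of comparing node names and calling toString(); TextNode.text() returns the plain text directly (the StringBuilder is just a style choice, not required):
StringBuilder sb = new StringBuilder();
for (Node node : container.childNodes()) {
    if (node.nodeName().equals("a") && node.previousSibling() instanceof TextNode) {
        TextNode label = (TextNode) node.previousSibling();
        sb.append(label.text().replace("\"", ""))
          .append("= ").append(node.attr("href")).append(" ");
    }
}
System.out.println(sb.toString().trim());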

Jsoup selector on RSS <link> tag returns empty string with .text() method

I'm using Jsoup to parse an RSS feed in Java. I'm having problems getting a result when trying to select the first <link> element in the document.
When I use title.text() I get an expected result with this code:
Document doc = Jsoup.connect(BLOG_URL).get();
Element title = doc.select("rss channel title").first();
System.out.println(title.text()); // print the blog title...
However, link.text() doesn't work the same way:
Element link = doc.select("rss channel link").first();
System.out.println(link.text()); // prints empty string
When I inspect doc.select("rss channel link"), the Element link object is populated, but the .println() statement just prints an empty string.
What makes .select("rss channel link") so dang special that I can't figure out how to use it?
Edit: The RSS response begins like this:
<?xml version="1.0" encoding="UTF-8"?>
<rss>
<channel>
<title>The Blog Title</title>
<link>http://www.the.blog/category</link>
Your RSS feed is XML, not HTML. For this to work, you must tell Jsoup to use its XML parser. This will work:
String rss = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
+"<rss><channel>"
+ "<title>The Blog Title</title>"
+ "<link>http://www.the.blog/category</link>"
+"</channel></rss>";
Document doc = Jsoup.parse(rss, "", Parser.xmlParser());
Element link = doc.select("rss channel link").first();
System.out.println(link.text()); // now prints http://www.the.blog/category
Explanation:
In HTML, <link> is a void element that cannot have text content, so Jsoup's HTML parser closes the <link> in your RSS immediately and the URL ends up outside the element, which is why link.text() is empty. Jsoup's XML parser (Parser.xmlParser()) does not make that assumption.
try {
String xml = "<rss></rss><channel></channel><link>http://www.the.blog/category</link><title>The Blog Title</title>";
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
Element title = doc.select("title").first();
System.out.println(title.text());
Element link = doc.select("link").first();
System.out.println(link.text());
} catch (Exception e) {
e.printStackTrace();
}
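If you are fetching the feed over HTTP rather than parsing a string, recent Jsoup versions also let you set the parser on the connection itself (a sketch, assuming BLOG_URL is the feed URL from the question):
Document doc = Jsoup.connect(BLOG_URL)
        .parser(Parser.xmlParser())
        .get();
Element link = doc.select("rss > channel > link").first();
System.out.println(link.text()); // prints http://www.the.blog/category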

Extracting user details from facebook page

I am extracting details from a page that I administer. I tried using Jsoup to extract the links and then extract the user names from them, but it's not working. It only shows links other than the user links. I tried extracting names from this link
https://www.facebook.com/plugins/fan.php?connections=100&id=pageid
which works quite well, but it does not work for this link
https://www.facebook.com/browse/?type=page_fans&page_id=
Can anyone help me? Below is the code I tried.
doc = Jsoup.connect("https://www.facebook.com/browse/?type=page_fans&page_id=mypageid").get();
Elements els = doc.getElementsByClass("fsl fwb fcb");
Elements link = doc.select("a[href]");
for (Element ele : link) {
    System.out.println(ele.attr("href"));
}
Try this:
Document doc = Jsoup.connect("https://www.facebook.com/plugins/fan.php?connections=100&id=pageid").timeout(0).get();
Elements nameLinks = doc.getElementsByClass("link");
for (Element users : nameLinks) {
String name = users.attr("title");
String url = users.attr("href");
System.out.println(name + "-" + url);
}
It will give all the user names and URLs present on the first link defined in your question.
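For the second URL (https://www.facebook.com/browse/?type=page_fans&page_id=...), the page is only rendered for a logged-in user, so an anonymous Jsoup request gets a login page with no user links in it. Below is a rough sketch of passing session cookies; the cookie names and values are placeholders you would have to copy from your own authenticated browser session, the class names are the ones from your question and may change, and scraping logged-in Facebook pages may violate their terms of service.
Document doc = Jsoup.connect("https://www.facebook.com/browse/?type=page_fans&page_id=mypageid")
        .userAgent("Mozilla")
        .cookie("c_user", "YOUR_USER_ID")      // placeholder: value from your own session
        .cookie("xs", "YOUR_SESSION_TOKEN")    // placeholder: value from your own session
        .timeout(0)
        .get();
// class names taken from the question; verify them against the page you receive
for (Element name : doc.select(".fsl.fwb.fcb")) {
    System.out.println(name.text());
}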
