How to parse HTML text and links with java and jsoup - java

I need to parse text from a webpage. The text is presented in this way:
nonClickableText= link1 link2 nonClickableText2= link1 link2
I want to be able to convert all to a string in java. The non clickable text should remain like it is while the clickable text should be replaced with its actual link.
So in java I would have this:
String parsedHTML = "nonClickableText= example.com example.com nonClickableText2= example3.com example4.com";
Here are some pictures: first second

What exactly are link1 and link2? According to your example
"... nonClickableText2= example3.com example4.com"
they can be different, so what would be the source besides the href?
Based on your images, the following code should give you everything you need to build your final string. First we grab the <strong> block, then we go through its child nodes and pick each <a> child that is directly preceded by a text node:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

String htmlString = "<html><div><p><strong>\"notClickable1\"<a rel=\"nofollow\" target=\"_blank\" href=\"example1.com\">clickable</a>\"notClickable2\"<a rel=\"nofollow\" target=\"_blank\" href=\"example2.com\">clickable</a>\"notClickable3\"<a rel=\"nofollow\" target=\"_blank\" href=\"example3.com\">clickable</a></strong></p></div></html>";
Document doc = Jsoup.parse(htmlString); // can be replaced with Jsoup.connect("yourUrl").get();
StringBuilder parsed = new StringBuilder();
Element container = doc.select("div>p>strong").first();
for (Node node : container.childNodes()) {
    Node previous = node.previousSibling();
    // Only handle <a> elements that directly follow a text node (the first child has no previous sibling).
    if (node.nodeName().equals("a") && previous != null && previous.nodeName().equals("#text")) {
        parsed.append(previous.toString().replace("\"", ""));
        parsed.append("= ").append(node.attr("href")).append(" ");
    }
}
// trim() returns a new string, so its result has to be assigned.
String parsedHTML = parsed.toString().trim();
System.out.println(parsedHTML);
Output:
notClickable1= example1.com notClickable2= example2.com notClickable3= example3.com
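For a small, predictable fragment like the sample above, the same flattening can also be sketched without jsoup, using only java.util.regex. This is a rough sketch under assumptions, not a replacement for a real parser: it assumes every label sits directly in front of an <a> tag whose href is double-quoted, and the names LinkFlattener and flatten are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkFlattener {
    // Group 1: the text right before an <a> tag. Group 2: that tag's double-quoted href value.
    private static final Pattern LABEL_THEN_LINK =
            Pattern.compile("([^<>]+)<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>");

    // Hypothetical helper: turns label-plus-link pairs into "label= href" pairs.
    static String flatten(String html) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = LABEL_THEN_LINK.matcher(html);
        while (matcher.find()) {
            // Drop the literal quotes around the label, as the jsoup answer does.
            String label = matcher.group(1).replace("\"", "").trim();
            if (label.isEmpty()) {
                continue; // whitespace between tags is not a label
            }
            out.append(label).append("= ").append(matcher.group(2)).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<p><strong>\"notClickable1\"<a href=\"example1.com\">clickable</a>"
                + "\"notClickable2\"<a href=\"example2.com\">clickable</a></strong></p>";
        System.out.println(flatten(html));
    }
}
```

On irregular markup (single-quoted attributes, nested tags inside the label, comments) a regex like this will miss or mangle matches, so the jsoup version is the safer default.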

Related

Getting the particular (pre-formatted) text (from a website) using JSoup

I'm new to JSoup, and I want to get the text written in this specific HTML tag:
<pre class="cg-msgbody cg-view-msgbody"><span class="cg-msgspan"><span>**the text I want to get is present here, how can I get it using JSoup?**</span></span></pre>
Any help would be appreciated.
Thanks!
String html = "<pre class=\"cg-msgbody cg-view-msgbody\">"
+ "<span class=\"cg-msgspan\">"
+ "<span>**the text I want to get is present here, "
+ "how can I get it using JSoup?**</span>"
+ "</span>"
+ "</pre>";
org.jsoup.nodes.Document document = Jsoup.parse(html);
// the innermost <span> is the last one matched, and it holds the text
Element span = document.select("span").last();
System.out.println("Text: " + span.text());

How to get the anchor tag value and href value inside a header tag using selenium

My HTML page contains a lot of anchor tags, but I only need the href value and the link text of the anchors that sit inside a header tag of a div element. I'm using Selenium in Java to get the page source.
Part of the page's HTML looks like this:
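<a href="www.qq.com">qq</a>
<a href="www.ww.com">ww</a>
<a href="www.ee.com">ee</a>
<div class="hello">
<h2>
<a href="www.aa.com">aa</a>
<a href="www.ss.com">ss</a>
</h2>
</div>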
The Java code I'm using to retrieve the anchor tag values looks like this:
List<WebElement> list = driver.findElements(By.xpath("//*[@href]"));
for (WebElement e : list) {
    String link = e.getAttribute("href");
    System.out.println(e.getTagName() + "=" + link);
}
The output of the above code looks like this:
a=www.qq.com
a=www.ww.com
a=www.ee.com
a=www.aa.com
a=www.ss.com
But the output I need is:
a=www.aa.com , aa
a=www.ss.com , ss
I need to get all the anchor tag texts and href values inside the div with the class hello.
Try this: use getText() for the link text and narrow the XPath so it only matches anchors with an href inside the div with class hello. This assumes that div is the only one with that class name.
List<WebElement> list = driver.findElements(By.xpath("//div[@class='hello']//a[@href]"));
for (WebElement e : list) {
    String link = e.getAttribute("href");
    System.out.println(e.getTagName() + "=" + link + " , " + e.getText());
}
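If you want to try the XPath expression without starting a browser, one option is the JDK's own javax.xml.xpath. The sketch below is only an illustration: it assumes the markup is well-formed XML (real pages often are not), the sample markup is invented, and the names HelloLinks and linksInHelloDiv are hypothetical. Note that getTextContent() returns all text inside the node, while Selenium's getText() returns only the visible text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class HelloLinks {
    // Hypothetical helper: returns "a=<href> , <text>" for every anchor inside div.hello.
    static List<String> linksInHelloDiv(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Same expression as in the Selenium answer: anchors with an href under div.hello.
            NodeList anchors = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@class='hello']//a[@href]", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                result.add(a.getTagName() + "=" + a.getAttribute("href") + " , " + a.getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped so callers need no checked exceptions.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body><a href=\"www.qq.com\">qq</a>"
                + "<div class=\"hello\"><h2>"
                + "<a href=\"www.aa.com\">aa</a><a href=\"www.ss.com\">ss</a>"
                + "</h2></div></body>";
        for (String line : linksInHelloDiv(xhtml)) {
            System.out.println(line);
        }
    }
}
```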

jsoup clean includes unwanted carriage return

This is currently vexing me.
Jsoup is including an extra line break in the returned string if the string includes <br />
eg.
String html ="TEST<br />TEST";
Jsoup.clean(html, org.jsoup.safety.Whitelist.basic());
returns
TEST\n<br />TEST
Any advice on how to avoid the inclusion of the troublesome \n?
Have you tried .text() or .ownText() from the Element class?
//If you want the whole page
String url = "http://www.yourwebsite.com";
Document doc = Jsoup.connect(url).get();
System.out.println(doc.text());
//If you want some specific part of the page
Elements elems = doc.select("query");
for (Element element : elems) {
System.out.println(element.text() + "\n");
System.out.println(element.ownText() + "\n\n");
}
Suppose an element contains <p>Hello <b>there</b> now!</p>
The method text() would return Hello there now!
The method ownText() would return Hello now!
To put it simply: text() returns all the text inside the tag you selected, including the text of its children, while ownText() returns only the text that belongs to the tag itself and skips the text of its children.
The "query" argument in doc.select("query") is a CSS selector; replace it with whatever selector pattern matches the elements you want.
Cleaner cleaner = new Cleaner(WHITE_LIST);
Document clean = cleaner.clean(body);
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
clean.outputSettings(outputSettings);
return clean.body().html();
The key call is outputSettings.prettyPrint(false): it turns off the pretty-printer, which is what inserts the extra line break.

How to parse and return a list of links to separate strings[] or strings?

I have an HTML div formatted like this:
<div class="latest-media-images">
<div class="hdr-article">LATEST IMAGES</div>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg1" src="http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg2" src="http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg3" src="http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
</div>
I've been trying to think of different ways to do this.
I want to parse each URL into a separate string.
I was thinking of somehow parsing them into a list and then selecting each one by its position.
(If anyone wants to answer this, please feel free to.)
Or I could do something such as navigating to the div class:
Element latest_images = doc.select("div.latest-media-images");
Elements links = latest_images.getElementsByTag("img");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
I was thinking of this but haven't tried it out yet. I will when I get the chance.
But how would I parse each one into a separate string, or into a whole list, using this code (if it's correct)?
Feel free to leave suggestions or answers =) or let me know if the code I have above will do the trick.
Thanks,
coder-For-Life22
Here is a code sample that extracts all img URLs from your HTML using a regex:
//I used your html with some obfuscations to test some fringe cases.
final String HTML
= "<div class=\"latest-media-images\">\n"
+ "<div class=\"hdr-article\">LATEST IMAGES</div>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg1\" \n "
+ "src=\"http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg2\" src= \n"
+ "\"http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg3\" src "
+ "= \t \n "
+ "\"http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "</div>";
Pattern pattern = Pattern.compile ("<img[^>]*?src\\s*?=\\s*?\\\"([^\\\"]*?)\\\"");
Matcher matcher = pattern.matcher (HTML);
List<String> imgUrls = new ArrayList<String> ();
while (matcher.find ())
{
imgUrls.add (matcher.group (1));
}
for (String imgUrl : imgUrls) System.out.println (imgUrl);
The output is the same as Sahil Muthoo posted:
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
If by "using a link to get the html first" you mean that you have a URL, then the only change is that instead of using a hard-coded String you need to load the HTML first. For example, you can use the built-in java.net.URL class:
new URL ("http://some_address").openConnection ().getInputStream ();
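The stream returned by that call still has to be read into a String before the regex can run on it. A minimal sketch of that step follows, using only the JDK; the names HtmlLoader, readAll and fetch are made up for illustration, and UTF-8 is an assumption (a real page may declare a different charset).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HtmlLoader {
    // Reads the whole stream into memory and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: downloads the page at the given address as a String.
    static String fetch(String address) {
        try (InputStream in = new URL(address).openConnection().getInputStream()) {
            return readAll(in, StandardCharsets.UTF_8); // assumption: the page is UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The result of HtmlLoader.fetch("http://some_address") can then be passed to pattern.matcher(...) in place of the hard-coded HTML constant.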
Elements thumbs = doc.select("div.latest-media-images img.latestMediaThumb");
List<String> thumbLinks = new ArrayList<String>();
for(Element thumb : thumbs) {
thumbLinks.add(thumb.attr("src"));
}
for(String thumb : thumbLinks) {
System.out.println(thumb);
}
Output
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
You can parse the HTML into a DOM tree, extract all "img" nodes using XPath or a CSS selector, and then iterate through them to fill a list of links.
Your code doesn't quite do the trick, though.
The loop is written to work with "a" nodes (it reads href), while the code before it extracts img nodes.
There is also another way: you can extract the required data using a regex, which avoids building a DOM tree and so may run faster and use less memory, though it is less robust on irregular markup.

How to get text & Other tags between specific tags using Jericho HTML parser?

I have a HTML file which contains a specific tag, e.g. <TABLE cellspacing=0> and the end tag is </TABLE>. Now I want to get everything between those tags. I am using Jericho HTML parser in Java to parse the HTML. Is it possible to get the text & other tags between specific tags in Jericho parser?
For example:
<TABLE cellspacing=0>
<tr><td>HELLO</td>
<td>How are you</td></tr>
</TABLE>
Answer:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Once you have found the Element of your table, all you have to do is call getContent().toString(). Here's a quick example using your sample HTML:
Source source = new Source("<TABLE cellspacing=0>\n" +
" <tr><td>HELLO</td> \n" +
" <td>How are you</td></tr>\n" +
"</TABLE>");
Element table = source.getFirstElement();
String tableContent = table.getContent().toString();
System.out.println(tableContent);
Output:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Aby, this code walks through all the td elements and prints each one to the screen. Maybe it will help you.
List<Element> elementListTd = source.getAllElements(HTMLElementName.TD);
// Walk through every "td" element on the page.
// palavraCompara (the text to look for) and tabela (the result list) are assumed to be declared elsewhere.
for (Element element : elementListTd) {
    if (element.getAttributes() != null) {
        String td = element.getAllElements().toString();
        System.out.println("TD: " + td);
        System.out.println(element.getContent());
        // Plain text of the cell, with the markup stripped.
        String conteudoAtributo = element.getTextExtractor().toString();
        System.out.println(conteudoAtributo);
        if (td.contains(palavraCompara)) {
            tabela.add(conteudoAtributo);
        }
    }
}
