JSoup: parse text and links in sequence from an HTML file - Java

I am trying to extract the text and links from an HTML file. At the moment I can extract both easily using JSoup, but I can only do it separately.
Here is my code:
try {
    doc = Jsoup.parse(new File(input), "UTF-8");
    Elements paragraphs = doc.select("td.text");
    for (Element p : paragraphs) {
        // System.out.println(p.text() + "\r\n" + "***********************************************************" + "\r\n");
        getGui().setTextVers(p.text() + "\r\n" + "***********************************************************" + "\r\n");
    }
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        getGui().setTextVers("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
I have placed a .text class on the outermost td wherever there is text. What I would like to achieve is: when the program finds a td with the .text class, it checks it for any links and extracts them from that section in order. So you would have:
Text
Link
Text
Link
I tried putting an inner for-each loop inside the first for-each loop, but this only printed the full list of links for the page. Can anyone help?

Try
Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");
for (Element p : paragraphs) {
    System.out.println(p.text());
    // look for links only inside this td, not in the whole document
    Elements links = p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
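If you need the text and links strictly interleaved in document order (Text, Link, Text, Link), another option is to walk each td's child nodes instead of printing all of its text first and its links afterwards. A rough sketch along those lines; the class name and the file name input.html are placeholders:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.io.File;
import java.io.IOException;

public class InterleavedExtractor {
    public static void main(String[] args) throws IOException {
        // "input.html" is a placeholder for your actual file
        Document doc = Jsoup.parse(new File("input.html"), "UTF-8");
        for (Element td : doc.select("td.text")) {
            printInOrder(td);
        }
    }

    // Visit child nodes recursively so text and links come out in the
    // order they appear in the HTML.
    static void printInOrder(Node node) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                String text = ((TextNode) child).text().trim();
                if (!text.isEmpty()) {
                    System.out.println("Text: " + text);
                }
            } else if (child instanceof Element) {
                Element el = (Element) child;
                if (el.tagName().equals("a")) {
                    System.out.println("Link: " + el.text() + " > " + el.attr("href"));
                } else {
                    printInOrder(el); // descend into nested elements
                }
            }
        }
    }
}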

Related

How to scrape a specific URL from multiple URLs on a webpage in Java

I am doing data scraping for the first time. My assignment is to get a specific URL from a webpage that contains multiple links (help, click here, etc.). How can I get the specific URL and ignore the random links? On this page I only want to get the link whose text is "The SEC adopted changes to the exempt offering framework" and ignore the other links. How do I do that in Java? I was able to extract all the URLs, but I am not sure how to get the specific one. Below is my code:
while (rs.next()) {
    String Content = rs.getString("Content");
    doc = Jsoup.parse(Content);

    // email extract
    Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
    Matcher matcher = p.matcher(doc.text());
    Set<String> emails = new HashSet<String>();
    while (matcher.find()) {
        emails.add(matcher.group());
    }
    System.out.println(emails);

    // title extract
    String title = doc.title();
    System.out.println("Title: " + title);
}

Elements links = doc.select("a");
for (Element link : links) {
    String url = link.attr("href");
    System.out.println("\nlink :" + url);
    System.out.println("text: " + link.text());
}

System.out.println("Getting all the images");
Elements image = doc.getElementsByTag("img");
for (Element src : image) {
    System.out.println("src " + src.attr("abs:src"));
}
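One way to narrow this down to the single anchor you care about is to match on the link text, either with a plain contains() check or with Jsoup's :contains selector. A rough sketch, assuming doc already holds the parsed page and that the phrase below appears in the link text:
// Option 1: filter the anchors by their visible text
for (Element link : doc.select("a[href]")) {
    if (link.text().contains("exempt offering framework")) {
        System.out.println("matched link: " + link.attr("href"));
        System.out.println("matched text: " + link.text());
    }
}

// Option 2: let the selector do the filtering (:contains is case-insensitive)
Elements secLinks = doc.select("a:contains(exempt offering framework)");
for (Element link : secLinks) {
    System.out.println("matched link: " + link.attr("href"));
}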

How to get anchor tag href and anchor tag text inside a div using Selenium in Java

My HTML code consists of multiple divs. Inside each div is a list of anchor tags. I need to fetch the href values and text values of the anchor tags that are in the sub-container div. I'm using Selenium to get the HTML code of the webpage.
HTML code:
<body>
    <div id="main-container">
        <a href="www.one.com">One</a>
        <a href="www.two.com">Two</a>
        <a href="www.three.com">Three</a>
        <div id="sub-container">
            <a href="www.abc.com">Abc</a>
            <a href="www.xyz.com">Xyz</a>
            <a href="www.pqr.com">Pqr</a>
        </div>
    </div>
</body>
Java code:
List<WebElement> list = driver.findElements(By.xpath("//*[@href]"));
for (WebElement element : list) {
    String link = element.getAttribute("href");
    System.out.println(element.getTagName() + "=" + link);
}
Output:
a=www.one.com
a=www.two.com
a=www.three.com
a=www.abc.com
a=www.xyz.com
a=www.pqr.com
Output I need:
a=www.abc.com , Abc
a=www.xyz.com , Xyz
a=www.pqr.com , Pqr
Try this:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/*[@href]"));
for (WebElement element : list) {
    String link = element.getAttribute("href");
    System.out.println(element.getTagName() + "=" + link + ", " + element.getText());
}
You can use element.getText() to get the link text.
If you only want to select the links in the sub-container, you can adjust your XPath:
//*[@id="sub-container"]/a
Pretty simple, try as below:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/a"));
for (WebElement element : list) {
    String link = element.getAttribute("href");
    String text = element.getText();
    System.out.println(element.getTagName() + "=" + link + ", " + text);
}
If the id sub-container is unique, you can just use the line below:
driver.findElements(By.cssSelector("div#sub-container > a"));
Thanks.
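Putting the CSS selector together with the href/text printing from the answers above, a short sketch (assuming driver is already on the page):
List<WebElement> subLinks = driver.findElements(By.cssSelector("div#sub-container > a"));
for (WebElement link : subLinks) {
    // prints e.g. a=www.abc.com , Abc
    System.out.println(link.getTagName() + "=" + link.getAttribute("href") + ", " + link.getText());
}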

[Java] Get HTML link from webpage

I want to get the link shown in this picture using Java; there are a few more links on that webpage. I found this code on Stack Overflow, but I don't understand how to use it.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class weber {
    public static void main(String[] args) throws Exception {
        String url = "http://www.skyovnis.com/category/ufology/";
        Document doc = Jsoup.connect(url).get();

        /*String question = doc.select("#site-inner").text();
        System.out.println("Question: " + question);*/

        Elements anser = doc.select("#container .entry-title a");
        for (Element anse : anser) {
            System.out.println("Answer: " + anse.text());
        }
    }
}
The code is edited from the original I found, though. Please help.
For your URL the following code works fine.
public static void main(String[] args) {
    Document doc;
    try {
        // need http protocol
        doc = Jsoup.connect("http://www.skyovnis.com/category/ufology/").userAgent("Mozilla").get();

        // get page title
        String title = doc.title();
        System.out.println("title : " + title);

        // get all links (this is what you want)
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // get the value from the href attribute
            System.out.println("\nlink : " + link.attr("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The output was:
title : Ufology
link : http://www.shop.skyovnis.com/
text : Shop
link : http://www.shop.skyovnis.com/product-category/books/
text : Books
The following code filters the links by their text.
for (Element link : links) {
    if (link.text().contains("Arecibo Message")) { // find the link whose text contains the phrase
        System.out.println("here is the element you need");
        System.out.println("\nlink : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }
}
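If some of the hrefs on the page are relative, Jsoup can also resolve them against the page's base URI (available here because the document came from Jsoup.connect()), for example:
for (Element link : links) {
    // "abs:" asks Jsoup to resolve the attribute against the document's base URI
    System.out.println("absolute link : " + link.attr("abs:href"));
}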
It’s recommended to specify a “userAgent” in Jsoup, to avoid HTTP 403 error messages.
Document doc = Jsoup.connect("http://anyurl.com").userAgent("Mozilla").get();
"Onna malli mage yuthukama kala."
Reference:
https://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

How to use JSoup to get hyperlink href?

I have the following jsFiddle
http://jsfiddle.net/B5zvV/
I am trying to use JSoup to obtain the value of the hyperlink's href string on Line 238:
<a href="/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450">
Hence, the desired result would be to obtain a String with a value of:
/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450
Here's my code:
Document doc = Jsoup.connect("http://myapp.example.com/fizz.html").get();
Elements elems = doc.getElementsByAttributeValueContaining("href", "repositoryId");
When I run this, the value of elems is empty: why, and what do I need to do to get the desired String?
The getElementsByAttributeValueContaining() method will return multiple elements in this case, because many hrefs contain repositoryId. If you are specifically after line 238, that a is enclosed inside an li with the class item item-default. There is only one such li, and it contains two a tags; just take the first one:
String html = "<li class=\"item item-default\" data-item-id=\"28049450\" id=\"item-28049450\">"
+ "<a href=\"/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450\">"
+ "<h3 class=\"item-title\">MCAppRepo <span class=\"item-default-marker grey\">(default)</span></h3>"
+ "</a>"
+ "<a href=\"/chain/admin/config/confirmDeleteRepository.action?planKey=AB-CSD&repositoryId=28049450\" class=\"delete\" title=\"Remove repository\">"
+ "<span class=\"assistive\">Delete</span>"
+ "</a>"
+ "</li>";
Document doc = Jsoup.parse(html);
Elements elems = doc.select("li.item.item-default > a");
System.out.println(elems.first().attr("href"));
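The same selector also works directly against the fetched page instead of a hard-coded HTML string; a rough sketch reusing the placeholder URL from the question, with an attribute filter so only hrefs containing repositoryId are considered:
Document doc = Jsoup.connect("http://myapp.example.com/fizz.html").get();
Element repoLink = doc.selectFirst("li.item.item-default > a[href*=repositoryId]");
if (repoLink != null) {
    // prints /chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450
    System.out.println(repoLink.attr("href"));
}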

Search Function in HTML

How can I search for text in an HTMLDocument and then return the first and last index of that word/sentence, ignoring tags while searching?
Searching: stackoverflow
HTML: <p class="red">stack<b>overflow</b></p>
This should return indexes 15 and 31.
Just like in browsers when searching within webpages.
If you want to do that in Java, here is a rough example using Jsoup. Of course, you should work out the details so the code parses properly for any given HTML.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p class=\"red\">stack<b>overflow</b></p></body></html>";
String search = "stackoverflow";
Document doc = Jsoup.parse(html);
String pPlainText = doc.body().getElementsByTag("p").first().text(); // will return stackoverflow
if(search.matches(pPlainText)){
System.out.println("text found in html");
String pElementString = doc.body().html(); // this will return <p class="red">stack<b>overflow</b></p></body>
String firstWord = doc.body().getElementsByTag("p").first().ownText(); // "stack"
String secondWord = doc.body().getElementsByTag("p").first().children().first().ownText(); // "overflow"
//search the text in pElementString
int start = pElementString.indexOf(firstWord); // 15
int end = pElementString.lastIndexOf(secondWord) + secondWord.length(); // 31
System.out.println(start + " >> " + end);
}else{
System.out.println("cannot find searched text");
}
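The hard-coded firstWord/secondWord lookup above only fits this exact markup. A more general rough sketch (not from the answer above; it ignores entities, comments and attribute values containing angle brackets) scans the HTML once, skips over tags, and maps each visible character back to its offset in the HTML string:
import java.util.ArrayList;
import java.util.List;

public class HtmlSearch {

    public static void main(String[] args) {
        String html = "<p class=\"red\">stack<b>overflow</b></p>";
        int[] range = findInHtml(html, "stackoverflow");
        if (range != null) {
            System.out.println(range[0] + " >> " + range[1]); // prints 15 >> 31
        }
    }

    // Scan the HTML once, skipping tags, and remember where each visible
    // character lives in the original string; searching the visible text
    // then maps straight back to HTML offsets.
    static int[] findInHtml(String html, String search) {
        StringBuilder visible = new StringBuilder();
        List<Integer> htmlIndex = new ArrayList<>(); // htmlIndex.get(i) = HTML position of visible char i
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') { inTag = true; continue; }
            if (c == '>') { inTag = false; continue; }
            if (!inTag) {
                visible.append(c);
                htmlIndex.add(i);
            }
        }
        int start = visible.indexOf(search);
        if (start < 0) return null;
        return new int[]{ htmlIndex.get(start), htmlIndex.get(start + search.length() - 1) + 1 };
    }
}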
