So I am just trying out the Jsoup API and have a simple question. I have a string that I would like to keep intact, except when it is passed through my method: the method should strip out the elements that wrap the links. Right now I have:
public class jsTesting {
    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Elements select = Jsoup.parse(html).select("a");
        String linkHref = select.attr("href");
        System.out.println(linkHref);
    }
}
This returns only the first URL. I would like all of the URLs unwrapped, as well as the original string. Thanks in advance.
EDIT: SOLUTION:
Thanks a lot for the answer; I edited it only slightly to get the results I wanted. Here is the full solution I am using:
public class jsTesting {
    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            link.unwrap(); // remove the <a> element but keep its children in place
        }
        System.out.println(doc.text());
    }
}
Thanks again
Here's the corrected code:
public class jsTesting {
    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Elements links = Jsoup.parse(html).select("a[href]"); // <a> elements that have an href attribute
        for (Element link : links) {
            // Do whatever you want here
            System.out.println("Link Attr : " + link.attr("abs:href"));
            System.out.println("Link Text : " + link.text());
        }
    }
}
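If you also want the original string back with the wrapping <a> elements removed, here is a minimal sketch using Jsoup's unwrap(), which removes the matched elements but keeps their children in place:

Document doc = Jsoup.parse(html);
doc.select("a").unwrap(); // drops each <a> wrapper, leaving e.g. <b>example</b> behind
System.out.println(doc.body().html());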
I want to fetch only the HTML tags, along with their attributes, and remove the text.
Input String:
String html = "<p>An <br/><b></b> <b> example <a><p></b>this is the link </p>";
Output
<p><br></br><b></b><b><a><p></p></b></a></p>
Edit:
Most of the questions on Google or Stack Overflow deal with the opposite problem: removing the HTML and extracting only the text. I spent around three hours arriving at the solutions below, so I am posting them here as they may help others.
I hope this helps someone like me who is looking to remove only the text content from an HTML string.
String html = "<p>An <br/><b></b> <b> example <a><p></b>this is the link </p>";
Traverser traverser = new Traverser();
Document document = Jsoup.parse(html, "", Parser.xmlParser()); // you can use the HTML parser as well, which will add the <html>/<body> tags
document.traverse(traverser);
System.out.println(traverser.extractHtmlBuilder.toString());
Appending node.attributes() includes all of the element's attributes in the output.
public static class Traverser implements NodeVisitor {
    StringBuilder extractHtmlBuilder = new StringBuilder();

    @Override
    public void head(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            extractHtmlBuilder.append("<").append(node.nodeName()).append(node.attributes()).append(">");
        }
    }

    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            extractHtmlBuilder.append("</").append(node.nodeName()).append(">");
        }
    }
}
Another Solution:
Document document = Jsoup.parse(html, "", Parser.xmlParser());
for (Element element : document.select("*")) {
    if (!element.ownText().isEmpty()) {
        for (TextNode node : element.textNodes()) {
            node.remove();
        }
    }
}
System.out.println(document.toString());
With the following code I am able to get the desired text from the website, but I am unable to get the link associated with that text. I have tried several permutations and combinations of methods; at most, I get the entire outer HTML, as shown below:
<li class="list-item">
<h4><a class="bold" href="abacavir.htm">Abacavir </a> </h4>
Abacavir is an antiviral drug that is effective against the HIV-1 virus.</li>
Here is the code:
public static void main(String[] args) throws Exception {
    Map<String, String> drugLinks = new LinkedHashMap<String, String>();
    final int OK = 200;
    //String currentURL;
    //int page = 1;
    int status = OK;
    Connection.Response response = null;
    Document doc = null;
    String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
    //String keyword = "a";
    for (String keyword : keywords) {
        final String url = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=" + keyword;
        response = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .execute();
        status = response.statusCode();
        doc = response.parse();
        Element tds = doc.select("div.related-links.top-gray.col-list.clear-fix").first();
        Elements links = tds.select("li[class=list-item]");
        for (Element link : links) {
            System.out.println("generic::" + link.select("a[href]").text());
            System.out.println("link::" + link.attr("abs:a"));
        }
    }
}
Output
generic::Abacavir
link::
generic::Abacavir Sulfate and Lamivudine
link::
generic::Abacavir Sulfate, Lamivudine and Zidovudine
link::
generic::Abaloparatide
link::
generic::Abarelix
link::
How do I get the absolute links from the given HTML?
To get the link from the element, you can use:
link.select("a").attr("href")
However, that will only give you the relative link.
The full link will be:
"https://www.medindia.net/doctors/drug_information/" + link.select("a").attr("href")
<h2><span class="mw-headline" id="The_battle">The battle</span></h2>
<div class="thumb tright"></h2>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2>Second Title I want to stop collecting p tags after</h2>
I am learning Jsoup by trying to scrape all the p tags, arranged by title, from a Wikipedia page. I can scrape all the p tags between h2 tags with the help of this question:
extract unidentified html content from between two tags, using jsoup? regex?
by using
Elements elements = docx.select("span.mw-headline, h2 ~ p");
but I can't scrape them when there is a <div> between the h2 and the p tags. Here is the Wikipedia page I am working on:
https://simple.wikipedia.org/wiki/Battle_of_Hastings
How can I grab all the p tags where they are between two specific h2 tags?
Preferably ordered by id.
Try this option: Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
Sample code:
package jsoupex;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list the headlines and paragraphs from a URL.
 */
public class stackoverflw {
    public static void main(String[] args) throws IOException {
        //Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        //String url = "http://localhost/stov_wiki.html";
        String url = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
        //args[0];
        System.out.println("Fetching " + url + "...");
        Document doc = Jsoup.connect(url).get();
        Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
        for (Element elem : elements) {
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            }
            System.out.println(elem.text());
            if (elem.hasClass("mw-headline")) {
                System.out.println("************************");
            } else {
                System.out.println("");
            }
        }
    }
}
public static void main(String[] args) {
    String entity =
            "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>" +
            "<div class=\"thumb tright\"></h2>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<p>text I want</p>" +
            "<h2>Second Title I want to stop collecting p tags after</h2>";
    Document element = org.jsoup.Jsoup.parse(entity, "", Parser.xmlParser());
    element.outputSettings().prettyPrint(false);
    element.outputSettings().outline(false);
    List<TextNode> text = getAllTextNodes(element);
}

private static List<TextNode> getAllTextNodes(Element newElementValue) {
    List<TextNode> textNodes = new ArrayList<>();
    Elements elements = newElementValue.getAllElements();
    for (Element e : elements) {
        for (TextNode t : e.textNodes()) {
            textNodes.add(t);
        }
    }
    return textNodes;
}
I have this HTML block:
<ul class="list_attachments"><li>
<img src='pdf.png' alt='pdf'/> File1</li><li>
<img src='pdf.png' alt='pdf'/> File2</li>
</ul>
I would like to extract every "a href" row, in particular the site URL and the file name information.
So I tried this:
String[] fileName = new String[2];
String[] url = new String[2];
int i = 0;
attachments = document.select(".list_attachments");
for (Element attachment : attachments) {
    fileName[i] = attachment.text();
    url[i] = attachment.select("a").attr("href");
    i++;
}
But the result is:
String fileName = "File1 File2";
String url = "www.site1.com";
The problem is that there is only one attachment element instead of two as I expected.
How to solve this? Thanks.
.select(".list_attachments") selects only the <ul class="list_attachments"> element, so it returns an Elements collection containing just that single <ul>. I suspect that you wanted to select all the <a ...> elements inside .list_attachments and then take their text and href. In that case your code should look more like:
Elements anchors = document.select(".list_attachments a");
for (Element anchor : anchors) {
    fileName[i] = anchor.text();
    url[i] = anchor.attr("href");
    i++;
}
I extracted data from an HTML page and then parsed the anchor tags shown below. I have tried different approaches, such as extracting substrings, to get only the title and href attributes, but it's not working. Can anyone help me? My code and a small snippet of its output follow.
My code:
doc = Jsoup.connect("myurl").get();
Elements link = doc.select("a[href]");
String stringLink = null;
for (int i = 0; i < link.size(); i++) {
    stringLink = link.get(i).toString(); // print each matched element once
    System.out.println(stringLink);
}
Output
<a class="link" title="Waf Ad" href="https://www.facebook.com/waf.ad.54" data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186729_100007938933785_508764241_q.jpg" alt="Waf Ad" data-jsid="img" /></a>
<a class="link" title="Ana Ga" href="https://www.facebook.com/ata.ga.31392410" data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186901_100002334679352_162381693_q.jpg" alt="Ana Ga" data-jsid="img" /></a>
You can use the attr() method of the Element class to extract the value of an attribute.
For example:
String href = link.attr("href");
String title = link.attr("title");
See this page for more: Extract attributes, text, and HTML from elements
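For example, applying attr() to each matched Element (rather than to the whole Elements collection) prints one title/href pair per link. A minimal sketch:

Document doc = Jsoup.connect("myurl").get();
for (Element link : doc.select("a[href]")) {
    System.out.println(link.attr("title") + " -> " + link.attr("href"));
}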
To get the page title, you can use
Document doc = Jsoup.connect("myurl").get();
String title = doc.title();
To get the individual links from the different hrefs, you can use this:
Elements links = doc.select("a[href]");
for (Element ele : links) {
    System.out.println(ele.attr("href"));
}
The attr() method returns the value of the specified attribute on the given element.
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Solution {
    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        int testCases = Integer.parseInt(scan.nextLine());
        while (testCases-- > 0) {
            String line = scan.nextLine();
            boolean matchFound = false;
            Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
            Matcher m = r.matcher(line);
            while (m.find()) {
                System.out.println(m.group(2));
                matchFound = true;
            }
            if (!matchFound) {
                System.out.println("None");
            }
        }
    }
}
REGULAR EXPRESSION EXPLANATION:
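The pattern <(.+)>([^<]+)</\1> works as follows: <(.+)> captures the contents of the opening tag in group 1, ([^<]+) captures the enclosed text in group 2 (at least one character, with no nested tags allowed), and </\1> uses a backreference to require a closing tag that matches the captured opening tag. Only group 2, the text between the tags, is printed; if no match is found on a line, "None" is printed instead.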