Extracting multiple links within a specific class - java

<div class="customer"><a href='view.php?customer=1234' class=''></div>
<div class="customer"><a href='view.php?customer=1235' class=''></div>
<div class="customer"><a href='view.php?customer=1236' class=''></div>
In this example, would there be any way to extract the href links within the customer class, without having to re-parse the html?

Unless there is a better way, I think this works...
Elements links = doc.select("div.customer a[href]");
String absHref;
for (Element link : links) {
    absHref = link.attr("abs:href");
    System.out.println(absHref);
}

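// "customer" is a class name, not a tag name, so look it up by class
Elements customerElements = doc.getElementsByClass("customer");
for (Element customer : customerElements) {
    // first() returns null if the div contains no <a> element
    String link = customer.getElementsByTag("a").first().attr("href");
}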

You can do it with the code below.
Elements links = doc.select("div.customer a");
String linkUrl;
for (Element link : links) {
    linkUrl = link.attr("href");
    System.out.println(linkUrl);
}

Related

jsoup: parse data of certain tag which is just after a particular tag

I have been trying to parse certain information with jsoup in Java for the last 3 days -_-. This is my code:
Document document = Jsoup.connect(urlofpage).get();
Elements links = document.select(".contentBox");
for (Element link : links) {
    // String name = link.text();
    String title = link.select("h2").text();
    String content = link.select("p").text();
    System.out.println(title);
    System.out.println(content);
}
It fetches the data as directed, with the h2 and p text separated, but the problem is that I want the text of the <p> tags that come right after each <h2> tag, grouped per <h2>.
For example (HTML content):
<h2>main content</h2>
<div class="acx"><div>
<p>content</p>
<p>content 2</p>
<h2>content 2</h2>
<div class="acx"><div>
<p>new content od 2</p>
<p>new 2</p>
Now it should fetch like (in array):
array[0] = "content content 2",
array[1] = "new content od 2 new 2",
Any solutions?
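You can play with the "~" sibling selector, which matches elements that are preceded by a given sibling. For example (assuming the <div> tags in your sample are closed, so that the <p> tags are siblings of the <h2> tags):
link.select("h2 ~ p").get(0).text(); // returns "content"
link.select("h2 ~ p").get(1).text(); // returns "content 2", because every <p> that follows an <h2> matches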
Just use your initial approach to iterate over all the necessary tags within each selected .contentBox element:
Document document = Jsoup.connect(urlofpage).get();
Elements links = document.select(".contentBox");
for (Element link : links) {
    for (Element h2Tag : link.select("h2")) {
        System.out.println(h2Tag.text());
    }
    for (Element pTag : link.select("p")) {
        System.out.println(pTag.text());
    }
}

Convert String to arraylist using split

Is it possible to convert the String content below to an ArrayList using split, so that you get something like Point A?
<a class="postlink" href="http://test.site/i7xt1.htm">http://test.site/i7xt1.htm<br/>
</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://information.com/qokp076wulpw">http://information.com/qokp076wulpw<br/>
</a>
<br/>Additional:<br/>
<a class="postlink" href="http://additional.com/qokdsfsdwulpw">http://additional.com/qokdsfsdwulpw<br/>
</a>
Point A (desired arraylist content):
http://test.site/i7xt1.htm
Mirror:
http://information.com/qokp076wulpw
Additional:
http://additional.com/qokdsfsdwulpw
I am currently using the code below, but it doesn't produce the desired output ("Mirror", for instance, is added multiple times).
Document doc = Jsoup.parse(string);
Elements links = doc.select("a[href]");
for (Element link : links) {
    Node previousSibling = link.previousSibling();
    while (!(previousSibling.nodeName().equals("u") || previousSibling.nodeName().equals("#text"))) {
        previousSibling = previousSibling.previousSibling();
    }
    String identifier = previousSibling.toString();
    if (identifier.contains("Mirror")) {
        totalUrls.add("MIRROR(s):");
    }
    totalUrls.add(link.attr("href"));
}
Fix your links first. As cricket_007 mentioned, having proper HTML would make this a lot easier.
// \\s* also covers a line break between <br/> and </a>, as in the posted HTML
String html = yourHtml.replaceAll("<br/>\\s*</a>", "</a>"); // get rid of bad HTML
String[] lines = html.split("<br/>");
for (String str : lines) {
    String text = Jsoup.parse(str).text(); // tags stripped, leaving the URL or the label
    // ... you can go further here, e.g. check whether the piece has a link or not
}
Now that the errant <br> tags are out of the links, you can split the string on the <br> tags that remain and print out your html result. It's not pretty, but it should work.
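As a variation on the approach above that does not parse each piece with jsoup, here is a minimal sketch using only String.split and a regular expression. The names LinkListSplitter and toLines are made up for illustration, and the tag-stripping regex is an assumption that only holds for small, simple fragments like the one in the question; it is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;

final class LinkListSplitter {

    // Splits the fragment on <br/> and strips the remaining tags from each piece.
    static List<String> toLines(String html) {
        List<String> result = new ArrayList<>();
        for (String piece : html.split("<br\\s*/?>")) {
            // Crude tag removal: adequate for this fragment, not for arbitrary HTML.
            String text = piece.replaceAll("<[^>]*>", "").trim();
            if (!text.isEmpty()) {
                // Pieces that held only a stray </a> end up empty and are skipped.
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a class=\"postlink\" href=\"http://test.site/i7xt1.htm\">http://test.site/i7xt1.htm<br/>\n"
                + "</a>\n"
                + "<br/>Mirror:<br/>\n"
                + "<a class=\"postlink\" href=\"http://information.com/qokp076wulpw\">http://information.com/qokp076wulpw<br/>\n"
                + "</a>";
        toLines(html).forEach(System.out::println);
    }
}
```

Because empty pieces are skipped, the stray <br/> inside each link does not need to be repaired first in this variant.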

java jsoup parse how to parse html

Is there any possible way to parse
Huhi
in html:
Huhi
White
Angle
Output:
Huhi
White
Angle
Create your document and get all the a[href] links, iterate through these links and get the text they contain. Like so:
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    String text = link.text();
}
Just select the a elements, iterate over them, and print their text:
String html ="Huhi\n" +
        "White\n" +
        "Angle";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for (Element link : links) {
    System.out.println(link.text());
}
For further reference, see the jsoup selector-syntax documentation.

Parse the inner html tags using jSoup

I want to find the important links in a site using the Jsoup library. Suppose we have the following code:
<h1>This is important </h1>
While parsing, how can we tell that an a tag is inside the h1 tag?
You can do it this way:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements headlinesCat1 = doc.getElementsByTag("h1");
for (Element headline : headlinesCat1) {
    Elements importantLinks = headline.getElementsByTag("a");
    for (Element link : importantLinks) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println(linkHref);
    }
}
Taken from the JSoup Cookbook.
Use selector:
Elements elements = doc.select("h1 > a");

Jsoup get href within a class

I have this html code that I need to parse
<a class="sushi-restaurant" href="/greatSushi">Best Sushi in town</a>
I know there's a jsoup example that gets all the links in a page, e.g.
Elements links = doc.select("a[href]");
for (Element link : links) {
    print(" * a: <%s> (%s)", link.attr("abs:href"),
            trim(link.text(), 35));
}
but I need a piece of code that can return me the href for that specific class.
Thanks guys
You can select elements by class. This example finds elements with the class sushi-restaurant, then gets the absolute URL of the first result.
Make sure that when you parse the HTML, you specify the base URL (where the document was fetched from) to allow jsoup to determine what the absolute URL of a link is.
public static void main(String[] args) {
    String html = "<a class=\"sushi-restaurant\" href=\"/greatSushi\">Best Sushi in town</a>";
    Document doc = Jsoup.parse(html, "http://example.com/");
    // find all <a class="sushi-restaurant">...
    Elements links = doc.select("a.sushi-restaurant");
    Element link = links.first();
    // 'abs:' makes "/greatSushi" = "http://example.com/greatSushi":
    String url = link.attr("abs:href");
    System.out.println("url = " + url);
}
Shorter version:
String url = doc.select("a.sushi-restaurant").first().attr("abs:href");
Hope this helps!
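To illustrate why the base URL matters, here is a small sketch of relative-link resolution using java.net.URI from the standard library rather than jsoup. It is only an approximation of what the abs: prefix does with the document's base URI; the class name AbsoluteUrlSketch, the helper resolve, and the /shop/ base path are invented for this example.

```java
import java.net.URI;

final class AbsoluteUrlSketch {

    // Resolves href against baseUri using the standard URI resolution rules.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // A root-relative link replaces the whole path of the base URL.
        System.out.println(resolve("http://example.com/", "/greatSushi"));
        // A path-relative link is appended to the base URL's directory.
        System.out.println(resolve("http://example.com/shop/", "view.php?customer=1234"));
    }
}
```

Without a base URL there is nothing to resolve against, which is why a relative href cannot be turned into an absolute one when the document was parsed from a bare string.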
Elements links = doc.select("a");
for (Element link : links) {
    String attribute = link.attr("class");
    // compare against the class from the question
    if (attribute.equalsIgnoreCase("sushi-restaurant")) {
        System.out.println(link.attr("href")); // You probably need this
    }
}
