I am using jsoup to extract information from a web page. My code looks like this:
doc = Jsoup.connect(myurl).get();
Elements newsHeadlines = doc.select(".myclass");
If I do a System.out.println of newsHeadlines I obtain this:
<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-minutoPartido="93'"></span>
<span class="blado"></span>
<span class="blahave">
Oh yeah!<br/></span>
</span>
</span>
<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-health="97'"></span>
<span class="blado"></span>
<span class="blahave">
This is my world</span>
</span>
</span>
How can I save each block, like the one below, as a separate entry in an array?
<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-health="92'"></span>
<span class="blado"></span>
<span class="blahave">
This is my world</span>
</span>
</span>
Thank you so much
newsHeadlines is simply a List of Element objects, because Elements implements List.
So you can iterate over newsHeadlines the same way you iterate over any list.
for(Element element : newsHeadlines) {
System.out.println(element.toString());
}
If that is not what you need (I did not test the code), you can try Element.children().
This again gives you an Elements collection you can iterate over.
You could also wrap each comment in a div tag and use some Java 8 syntactic sugar to collect the Element instances in a List:
Elements elements = Jsoup.parse(markup).getAllElements().select(".myclass");
List<Element> comments = elements.stream().collect(Collectors.<Element>toList());
for(Element comment : comments) {
System.out.println(comment.html());
}
For the test I used Jsoup.parse instead of the connect method.
It prints:
<span class="cmtComentario"> <span class="blaicon">1</span>.......
<span class="cmtComentario"> <span class="blaicon">2</span>........
Test markup:
String markup = "" +
"<div class=\"myclass\">\n" +
"<span class=\"cmtComentario\">\n" +
"<span class=\"blaicon\">1</span>\n" +
"<span class=\"blacoment\"><span class=\"cmtHora\" data-hora=\"\"></span>\n" +
"<span class=\"blathing\" data-minutoPartido=\"93'\"></span>\n" +
"<span class=\"blado\"></span>\n" +
"<span class=\"blahave\">\n" +
"Oh yeah!<br/></span>\n" +
"</span>\n" +
"</span>\n" +
"</div>" +
"<div class=\"myclass\">\n" +
"<span class=\"cmtComentario\">\n" +
"<span class=\"blaicon\">2</span>\n" +
"<span class=\"blacoment\"><span class=\"cmtHora\" data-hora=\"\"></span>\n" +
"<span class=\"blathing\" data-health=\"97'\"></span>\n" +
"<span class=\"blado\"></span>\n" +
"<span class=\"blahave\">\n" +
"This is my world</span>\n" +
"</span>\n" +
"</span>" +
"</div>";
Hope it helps!
I have a tool that scans an API for changes. When it finds a change, it gets a String like:
word=\don\u2019t\ item-id=\"1086\">\n <span class=\
I want to extract the number from item-id, but there are multiple numbers in the response.
Is there a way to do this? (I also don't know whether the number will have 4 digits or just 1-2.)
So the regex should search for something like "NUMBERS\" and print it (in Java).
Based on your comment, it looks like you are receiving a JSON structure like this:
{
...
"data":{
"html":".. <a .. data-sku=\"XXX\"> ..",
...
}
...
}
and you are interested in the value of the data-sku attribute.
In that case, parse the JSON and traverse it to get the HTML string. You can use org.json.JSONObject for that (or any other parser you like):
String response = "{\"success\":1,\"data\":{\"html\":\"<div class=\\\"inner\\\">\\n <span class=\\\"title js-title-eligible\\\">Upgrade available<\\/span>\\n <span class=\\\"title js-title-warning\\\"><strong>WARNING :<\\/strong> You don\\u2019t own a <span class=\\\"js-from-ship\\\"><\\/span><\\/span>\\n <p class=\\\"explain js-title-eligible\\\">Buy this upgrade and it will be applicable to your <span class=\\\"js-from-ship\\\"><\\/span> from the My Hangar section.<\\/p>\\n <p class=\\\"explain js-title-warning\\\">You can buy this upgrade but it will only be applicable on a <span class=\\\"js-from-ship\\\"><\\/span>.<\\/p>\\n\\n <div class=\\\"price\\\"><strong class=\\\"final-price\\\">\\u20ac5<span class='super'>.41 <span class='currency'>EUR<\\/span><\\/span><\\/strong><div class=\\\"taxes js-taxes\\\">\\n <div class=\\\"taxes-details trans-02s\\\">\\n <div class=\\\"arrow\\\"><\\/div>\\n Tax Included: <br \\/>\\n <ul>\\n <li>VAT 19%<\\/li>\\n <\\/ul>\\n <\\/div>\\n<\\/div><\\/div>\\n\\n\\n <div>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"add-to-cart holosmallbtn trans-03s js-add-to-cart-ship ty-js-add-to-cart\\\" data-sku=\\\"1086\\\">\\n <span class=\\\"holosmallbtn-top abs-overlay trans-02s\\\">BUY NOW<\\/span>\\n <span class=\\\"holosmallbtn-bottom abs-overlay trans-02s\\\"><\\/span>\\n <\\/a>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"more-details\\\">View more details<\\/a>\\n <\\/div>\\n \\n <p class=\\\"explain info\\\">\\n Upgrades that you buy can be found in your <a href=\\\"\\/account\\/pledges\\\">Hangar section<\\/a>.<br \\/>\\n Click \\\"Apply Upgrade\\\" inside the Upgrade Pledge to pick where you want to apply it.\\n <\\/p>\\n <\\/div>\\n\\n\\n\\n\"},\"code\":\"OK\",\"msg\":\"OK\"}";
JSONObject jsonObject = new JSONObject(response);
String html = jsonObject.getJSONObject("data") //pick data:{...} object
.getString("html"); //from that object get value of html:"..."
Now that you have the HTML, you can parse it with an HTML parser (I am using jsoup):
Document doc = Jsoup.parse(html);
String dataSku = doc.select("a[data-sku]") //get "a" element with "data-sku" attribute
.attr("data-sku"); //value of that attribute
Output: 1086.
String string = "{\"success\":1,\"data\":{\"html\":\"<div class=\\\"inner\\\">\\n <span class=\\\"title js-title-eligible\\\">Upgrade available<\\/span>\\n <span class=\\\"title js-title-warning\\\"><strong>WARNING :<\\/strong> You don\\u2019t own a <span class=\\\"js-from-ship\\\"><\\/span><\\/span>\\n <p class=\\\"explain js-title-eligible\\\">Buy this upgrade and it will be applicable to your <span class=\\\"js-from-ship\\\"><\\/span> from the My Hangar section.<\\/p>\\n <p class=\\\"explain js-title-warning\\\">You can buy this upgrade but it will only be applicable on a <span class=\\\"js-from-ship\\\"><\\/span>.<\\/p>\\n\\n <div class=\\\"price\\\"><strong class=\\\"final-price\\\">\\u20ac5<span class='super'>.41 <span class='currency'>EUR<\\/span><\\/span><\\/strong><div class=\\\"taxes js-taxes\\\">\\n <div class=\\\"taxes-details trans-02s\\\">\\n <div class=\\\"arrow\\\"><\\/div>\\n Tax Included: <br \\/>\\n <ul>\\n <li>VAT 19%<\\/li>\\n <\\/ul>\\n <\\/div>\\n<\\/div><\\/div>\\n\\n\\n <div>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"add-to-cart holosmallbtn trans-03s js-add-to-cart-ship ty-js-add-to-cart\\\" data-sku=\"1086\\\">\\n <span class=\\\"holosmallbtn-top abs-overlay trans-02s\\\">BUY NOW<\\/span>\\n <span class=\\\"holosmallbtn-bottom abs-overlay trans-02s\\\"><\\/span>\\n <\\/a>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"more-details\\\">View more details<\\/a>\\n <\\/div>\\n \\n <p class=\\\"explain info\\\">\\n Upgrades that you buy can be found in your <a href=\\\"\\/account\\/pledges\\\">Hangar section<\\/a>.<br \\/>\\n Click \\\"Apply Upgrade\\\" inside the Upgrade Pledge to pick where you want to apply it.\\n <\\/p>\\n <\\/div>\\n\\n\\n\\n\"},\"code\":\"OK\",\"msg\":\"OK\"}";
String pattern="(?<=data-sku=)([\\\\]*\")(\\d+)";
Pattern p = Pattern.compile(pattern);
Matcher matcher = p.matcher(string);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.println(" End index: " + matcher.end() + " ");
System.out.println("number="+matcher.group(2));
}
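The regex approach above can also be wrapped in a small helper, sketched here under some assumptions: the class name AttributeNumberExtractor, the method firstNumber, and the sample string are all hypothetical and only illustrate the idea. The pattern allows any number of backslashes before the opening quote, so it works on both the escaped and the unescaped form of the attribute.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeNumberExtractor {

    // Hypothetical helper: returns the digits that follow attributeName="
    // (or attributeName=\" in an escaped response), if such a match exists.
    static Optional<String> firstNumber(String text, String attributeName) {
        // \\\\* in Java source is the regex \\* : zero or more literal backslashes.
        Pattern p = Pattern.compile(Pattern.quote(attributeName) + "=\\\\*\"(\\d+)");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample in the escaped style of the question.
        String sample = "item-id=\\\"1086\\\">\\n <span data-count=\\\"7\\\">";
        System.out.println(firstNumber(sample, "item-id").orElse("not found"));
    }
}
```

Because the attribute name is part of the pattern, other numbers in the response are ignored, and \d+ matches a number of any length.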
I am trying to parse an HTML file using Jsoup. Some of the text in the HTML is not wrapped in any tag.
<li class="inactive">
<span class="status label">inactive</span>
<a href="/officers/144662696" class="officer inactive" title="more info on MILLTOWN CORPORATE SERVICES">
MILLTOWN CORPORATE SERVICES
</a>
member,
<span class="status label">inactive</span>
<a href="/companies/us_wv/193180" class="company inactive revoked_(failure_to_file_annual_report)" title="More Free And Open Company Data On EASTBRIDGE L.L.C. (West Virginia (US), 193180)">
EASTBRIDGE L.L.C.
</a>
(West Virginia (US),
<span class="start_date">25 May 2000</span>-<span class="end_date"> 1 Aug 2002</span>)
</li>
I am able to read all the content inside a tag, but I am trying to get the values "(West Virginia (US)," and "member,".
Is there a way to get the text that sits outside the child elements but inside the li tag?
You are probably looking for something like Element#ownText.
This only gets the text of the current element and not a combined text of all children.
Element listItem = doc.select("li.inactive").first();
System.out.println(listItem.ownText()); // prints "member, (West Virginia (US), -)"
You can also use the preceding tags to reach the text nodes that are not wrapped in any tag. If I understand correctly, you want the text node that follows each a tag. Try something like:
String html = "<li class=\"inactive\"> \n"
+ " <span class=\"status label\">inactive</span> \n"
+ " <a href=\"/officers/144662696\" class=\"officer inactive\" title=\"more info on MILLTOWN CORPORATE SERVICES\">\n"
+ " MILLTOWN CORPORATE SERVICES\n"
+ " </a>\n"
+ " member, \n"
+ " <span class=\"status label\">inactive</span> \n"
+ " <a href=\"/companies/us_wv/193180\" class=\"company inactive revoked_(failure_to_file_annual_report)\" title=\"More Free And Open Company Data On EASTBRIDGE L.L.C. (West Virginia (US), 193180)\">\n"
+ " EASTBRIDGE L.L.C.\n"
+ " </a> \n"
+ " (West Virginia (US), \n"
+ " <span class=\"start_date\">25 May 2000</span>-<span class=\"end_date\"> 1 Aug 2002</span>) \n"
+ "</li>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for(Element e : links){
System.out.println(e.nextSibling().toString());
}
<p class="name">
<strong class="title displaynone"> :</strong>T-shritsT <span class="icon"></span></p>
<ul class="xans-element- xans-product xans-product-listitem">
<li class=" xans-record-"><strong class="title displaynone"><span style="font-size:12px;color:#555555;">price</span> :</strong> <span style="font-size:12px;color:#555555;"><s></s>$20</span></li>
In this markup, I want to get only the text "T-shrits" and the price "$20", without the ':' and the "price" label.
This is my code:
Elements goods = document.select("p.name > a");
for (Element e :goods) {
System.out.println("------------------------------------------");
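System.out.println("goods" + e.text());
}
Try this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // newer jsoup releases rename this class to Safelist

public class Test {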
public static void main(String[] args) {
String s="<p class=\"name\">\n" +
"<strong class=\"title displaynone\"> :</strong>T-shritsT <span class=\"icon\"></span></p>\n" +
"<ul class=\"xans-element- xans-product xans-product-listitem\">\n" +
"<li class=\" xans-record-\"><strong class=\"title displaynone\"><span style=\"font-size:12px;color:#555555;\">price</span> :</strong> <span style=\"font-size:12px;color:#555555;\"><s></s>$20</span></li>";
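Document document = Jsoup.parse(s);
// Drop the <strong> labels (":" and "price") so only the values remain
document.select("strong").remove();
Whitelist whitelist = Whitelist.basic();
// Clean the remaining markup, then reduce it to plain text
System.out.println(Jsoup.parse(Jsoup.clean(document.toString(), whitelist)).text());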
}
}
output:
T-shritsT $20
<div>
<div class = "main">
<div class ="content">
<div class="content_left">
<div class="alisveris_context_box">
<ul class = "sinema_list">
<li>
<a href="blabla/12" title="asd">
<img src="http://asd.jpg">
<span class ="cartoon">
Textaa
</span>
How can I get the href value (blabla/12 in the example) and the span value (Textaa in the example)?
Let's say your HTML is the following:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://example.com/"
link.attr("href") will hold your link.
The same approach works for your span: select it and read its text. Think for yourself :)
source: http://jsoup.org/cookbook/extracting-data/attributes-text-html
Elements elements = Jsoup.parse(html).select("div[class=main] div[class=content] div[class=content_left] div[class=alisveris_context_box] ul[class=sinema_list] li a");
String href = elements.first().attr("href");
String spanText = elements.first().select("span[class=cartoon]").first().text();
With Jsoup this is easy to do.
You can get the span value like this:
String st="<div> <div class = \"main\"> <div class =\"content\"> "
+ "<div class=\"content_left\"> <div class=\"alisveris_context_box\">"
+ " <ul class = \"sinema_list\"> <li> <a href=\"blabla/12\" title=\"asd\">"
+ "<img src=\"http://asd.jpg\"> <span class =\"cartoon\"> Textaa </span>";
String spanValue=Jsoup.parse(st).text();
and the href value with:
String href=Jsoup.parse(st).getElementsByTag("a").attr("href");
I'm having trouble clicking an element that I locate by its text, where the text is held in a variable. This is the markup of the page:
<div class="recommendedProfileList fl">
<h3>
<ul class="ctrlResearchProfiles">
<li>
<li>
<li>
<li>
<li>
<li>
<span class="profileBtn ctrlSelectDefProfile ctrlClickSubmit" data-value="143" data-form="formChooseProfile" data-profileid="143">Sales manager</span>
<span class="profileTooltip" style="display: none;">
<span class="arrow"/>
<span class="profileTooltipContent">
</span>
The variable is named profile. This is what I've tried, but it did not work:
WebDriverWait wait = new WebDriverWait(driver, 5);
wait.until(ExpectedConditions.elementToBeClickable(By.xpath("//*[text()=' + profile + ']")));
second:
driver.findElement(By.xpath("//*[text()=' + profile + ']"));
also:
driver.findElement(By.linkText("" +profile)).click();
Do you know how to click such element?
You are almost there buddy...
wait = new WebDriverWait(driver, 5);
wait.until(ExpectedConditions.elementToBeClickable(By.xpath("//*[text()='" + profile + "']")));
second:
driver.findElement(By.xpath("//*[text()='" + profile + "']"));
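What you missed were the double quotes that end the Java string literal around the variable. Without them, the text + profile + is sent as part of the XPath instead of the variable's value being concatenated in.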
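One caveat with plain concatenation is that it breaks if the text itself contains a single quote. A minimal sketch of a workaround follows, assuming XPath 1.0 (which has no escape character inside string literals). The class XPathText and its methods literal and byExactText are hypothetical names, not part of Selenium; the resulting string would be passed to By.xpath(...) exactly as in the code above.

```java
public class XPathText {

    // Hypothetical helper: wrap a value as an XPath 1.0 string literal.
    static String literal(String value) {
        if (!value.contains("'")) {
            return "'" + value + "'";
        }
        if (!value.contains("\"")) {
            return "\"" + value + "\"";
        }
        // Both quote types present: split on ' and glue the parts with concat().
        StringBuilder sb = new StringBuilder("concat(");
        String[] parts = value.split("'", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(", \"'\", ");
            }
            sb.append("'").append(parts[i]).append("'");
        }
        return sb.append(")").toString();
    }

    // Builds the same expression as the answer: //*[text()='...']
    static String byExactText(String text) {
        return "//*[text()=" + literal(text) + "]";
    }

    public static void main(String[] args) {
        String profile = "Sales manager";
        System.out.println(byExactText(profile));
    }
}
```

For the profile value from the question this produces the same XPath as the corrected concatenation, so it only matters when the text can contain quotes.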