Html parse with Jsoup - java

<div>
<div class = "main">
<div class ="content">
<div class="content_left">
<div class="alisveris_context_box">
<ul class = "sinema_list">
<li>
<a href="blabla/12" title="asd">
<img src="http://asd.jpg">
<span class ="cartoon">
Textaa
</span>
How can I get the href value (blabla/12 in the example) and span value (Textare in the example)?

Lets say your html is the follow.
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://example.com/"
link.attr("href") will have your link.
Same for your span. Think for yourself :)
source: http://jsoup.org/cookbook/extracting-data/attributes-text-html

Elements elements = Jsoup.parse(html).select("div[class=main] div[class=content] div[class=content_left] div[class=alisveris_context_box] ul[class=sinema_list] li a");
String href = elements.first().attr("href");
String spanText = elements.first().select("span[class=cartoon]").first().text();

Using Jsoup you can easily find out
You will get span value by this
String st="<div> <div class = \"main\"> <div class =\"content\"> "
+ "<div class=\"content_left\"> <div class=\"alisveris_context_box\">"
+ " <ul class = \"sinema_list\"> <li> <a href=\"blabla/12\" title=\"asd\">"
+ "<img src=\"http://asd.jpg\"> <span class =\"cartoon\"> Textaa </span>";
String spanValue=Jsoup.parse(st).text();
and href value by
String href=Jsoup.parse(st).getElementsByTag("a").attr("href");

Related

Java selenium - skip span element within div

Here's my html structure and I am trying to skip a span within a div to get div's text only (which is dynamic) for testing.
<div class="items">
<div class="payment void" id="payment-000000899799">
<div class="payment__details">
<div class="method">CASH <span class="label"><span>Payment Voided</span></span></div>
<div class="date">2/12/2021, 3:02:15 PM</div>
</div>
<div class="payment__details">
<div class="amount">$20.00</div>
<div class="ref">Ref ID: REF-ID-01</div>
</div>
</div>
<div class="payment sale" id="payment-000000899806">
<div class="payment__details">
<div class="method">CASH </div>
<div class="date">2/12/2021, 3:02:21 PM</div>
</div>
<div class="payment__details">
<div class="amount">$100.00</div>
<div class="ref">Ref ID: REF-ID-02</div>
</div>
</div>
</div>
In my step definition I've
List<List<String>> data = capturedData.raw();
WebElement paymentDetails = driver.findElement(By.xpath("(//*[#class='payment__details']/div[#class='method'])[" + data.get(1).get(0) + "]"));
String paymentType = paymentDetails.getText();
System.out.println(paymentType); //This prints CASH Payment Voided
But actually I want only 'CASH' which is a text of div and skip the text 'Payment Voided' of span. And data is coming from a feature file.
How do I get text of div only and skip the text span which is inside the same div?
You cannot skip it directly since it is part of element.
1st Approach
String fulltext = driver.findElement(By.xpath("//*[#class='payment__details']/div[#class='method'][1]")).getText();
String spantext = driver.findElement(By.xpath("//*[#class='payment__details']/div[#class='method'][1]/span")).getText();
//Now replace span text with ""
String output = fulltext.replace(span,"");
2nd Approach
WebElement parent = driver.findElement(By.xpath("//*[#class='payment__details']/div[#class='method'][1]"));
JavascriptExecutor js = (JavascriptExecutor) driver;
String output = (String) js.executeScript("return arguments[0].firstChild.textContent",parent);

how to display span class field

i am trying to display two "text text-pass" from html in chrome browser to my print console, apparently, it did not work, any advise please?
my browser html code
<a href="/abc/123" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
<a href="/abc/1234" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
My code
String 123= driver.findElement(By.xpath("//*[#id="js-app"]/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(123);
String 1234= driver.findElement(By.xpath("//*[#id="js-app"]/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(1234);
You can use .findElements to get multiple elements with the same pattern, it will return a list collection.
UPDATE
Refers to your comment, you need put the string into a list again and check with the Collection.contains() method:
List<String> results = new ArrayList<>();
List<WebElement> elements = driver.findElements(By.xpath("//div[#class='sidebar-text']//span"));
for(WebElement element: elements) {
String attr = element.getAttribute("class");
results.add(attr);
System.out.println(attr);
}
if(results.contains("text text-fail")) {
System.out.println("this is list contains 'text text-fail'");
}
Try this Code :
String pass = driver.findElement(By.xpath("//*[#class='sidebar-text']/span")).getAttribute("class");
System.out.println(pass);

extracting Text non recursively with Jsoup

this is the code I'm trying to run :
String html = "ZOLA <span class=\"tiny\">(1)</span>";
Document doc = Jsoup.parse(html); //connect to the page
Element element = doc.getAllElements().first(); //recive the names elements
System.out.println(element.text()); //prints "ZOLA (1)"
System.out.println(element.ownText()); // prints nothing
my goal is to extract only "ZOLA", without the text of the children node, but ownText prints nothing...
how should I do it?
The problem is that doc.getAllElements().first() returns
<html>
<head></head>
<body>
ZOLA <span class="tiny">(1)</span>
</body>
</html>
while you expect
ZOLA <span class="tiny">(1)</span>
The following should work for you:
String html = "ZOLA <span class=\"tiny\">(1)</span>";
Document doc = Jsoup.parse(html);
Elements links = doc.getElementsByTag("a");
System.out.println(links.get(0));
System.out.println(links.get(0).ownText());
Output:
ZOLA <span class="tiny">(1)</span>
ZOLA
You can use this:
String html = "ZOLA <span class=\"tiny\">(1)</span>";
Document doc = Jsoup.parse(html);
Element elementA = doc.selectFirst("a");
System.out.println(elementA.ownText()); // ZOLA

Regex to extract a Variable from a Java Request

so I have a tool that scans a API for changes. If he found a change, he get a String like:
word=\don\u2019t\ item-id=\"1086\">\n <span class=\
I want to extract the Number from item-id , however there are multiple Numbers in the response.
Is there a possible way to do so? (I also dont know if the Number will 4 digits or just 1-2)
so the Regex should search for something like "NUMBERS\" and print it. (for Java)
Based on your comment it looks like you are receiving JSON structure
{
...
"data":{
"html":".. <a .. data-sku=\"XXX\"> ..",
...
}
...
}
and you are interested in value of data-sku attribute.
In that case parse that JSON and traverse it to get HTML structure. You can use org.json.JSONObject for that (or other parser, pick one you like)
String response = "{\"success\":1,\"data\":{\"html\":\"<div class=\\\"inner\\\">\\n <span class=\\\"title js-title-eligible\\\">Upgrade available<\\/span>\\n <span class=\\\"title js-title-warning\\\"><strong>WARNING :<\\/strong> You don\\u2019t own a <span class=\\\"js-from-ship\\\"><\\/span><\\/span>\\n <p class=\\\"explain js-title-eligible\\\">Buy this upgrade and it will be applicable to your <span class=\\\"js-from-ship\\\"><\\/span> from the My Hangar section.<\\/p>\\n <p class=\\\"explain js-title-warning\\\">You can buy this upgrade but it will only be applicable on a <span class=\\\"js-from-ship\\\"><\\/span>.<\\/p>\\n\\n <div class=\\\"price\\\"><strong class=\\\"final-price\\\">\\u20ac5<span class='super'>.41 <span class='currency'>EUR<\\/span><\\/span><\\/strong><div class=\\\"taxes js-taxes\\\">\\n <div class=\\\"taxes-details trans-02s\\\">\\n <div class=\\\"arrow\\\"><\\/div>\\n Tax Included: <br \\/>\\n <ul>\\n <li>VAT 19%<\\/li>\\n <\\/ul>\\n <\\/div>\\n<\\/div><\\/div>\\n\\n\\n <div>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"add-to-cart holosmallbtn trans-03s js-add-to-cart-ship ty-js-add-to-cart\\\" data-sku=\\\"1086\\\">\\n <span class=\\\"holosmallbtn-top abs-overlay trans-02s\\\">BUY NOW<\\/span>\\n <span class=\\\"holosmallbtn-bottom abs-overlay trans-02s\\\"><\\/span>\\n <\\/a>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"more-details\\\">View more details<\\/a>\\n <\\/div>\\n \\n <p class=\\\"explain info\\\">\\n Upgrades that you buy can be found in your <a href=\\\"\\/account\\/pledges\\\">Hangar section<\\/a>.<br \\/>\\n Click \\\"Apply Upgrade\\\" inside the Upgrade Pledge to pick where you want to apply it.\\n <\\/p>\\n <\\/div>\\n\\n\\n\\n\"},\"code\":\"OK\",\"msg\":\"OK\"}";
JSONObject jsonObject = new JSONObject(response);
String html = jsonObject.getJSONObject("data") //pick data:{...} object
.getString("html"); //from that object get value of html:"..."
Now that you have html you can parse it with HTML parser (I am using jsoup)
Document doc = Jsoup.parse(html);
String dataSku = doc.select("a[data-sku]") //get "a" element with "data-sku" attribute
.attr("data-sku"); //value of that attribute
Output: 1086.
String string = "{\"success\":1,\"data\":{\"html\":\"<div class=\\\"inner\\\">\\n <span class=\\\"title js-title-eligible\\\">Upgrade available<\\/span>\\n <span class=\\\"title js-title-warning\\\"><strong>WARNING :<\\/strong> You don\\u2019t own a <span class=\\\"js-from-ship\\\"><\\/span><\\/span>\\n <p class=\\\"explain js-title-eligible\\\">Buy this upgrade and it will be applicable to your <span class=\\\"js-from-ship\\\"><\\/span> from the My Hangar section.<\\/p>\\n <p class=\\\"explain js-title-warning\\\">You can buy this upgrade but it will only be applicable on a <span class=\\\"js-from-ship\\\"><\\/span>.<\\/p>\\n\\n <div class=\\\"price\\\"><strong class=\\\"final-price\\\">\\u20ac5<span class='super'>.41 <span class='currency'>EUR<\\/span><\\/span><\\/strong><div class=\\\"taxes js-taxes\\\">\\n <div class=\\\"taxes-details trans-02s\\\">\\n <div class=\\\"arrow\\\"><\\/div>\\n Tax Included: <br \\/>\\n <ul>\\n <li>VAT 19%<\\/li>\\n <\\/ul>\\n <\\/div>\\n<\\/div><\\/div>\\n\\n\\n <div>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"add-to-cart holosmallbtn trans-03s js-add-to-cart-ship ty-js-add-to-cart\\\" data-sku=\"1086\\\">\\n <span class=\\\"holosmallbtn-top abs-overlay trans-02s\\\">BUY NOW<\\/span>\\n <span class=\\\"holosmallbtn-bottom abs-overlay trans-02s\\\"><\\/span>\\n <\\/a>\\n <a href=\\\"\\/pledge\\/Upgrades\\/Mustang-Alpha-To-Aurora-LN-Upgrade\\\" class=\\\"more-details\\\">View more details<\\/a>\\n <\\/div>\\n \\n <p class=\\\"explain info\\\">\\n Upgrades that you buy can be found in your <a href=\\\"\\/account\\/pledges\\\">Hangar section<\\/a>.<br \\/>\\n Click \\\"Apply Upgrade\\\" inside the Upgrade Pledge to pick where you want to apply it.\\n <\\/p>\\n <\\/div>\\n\\n\\n\\n\"},\"code\":\"OK\",\"msg\":\"OK\"}";
String pattern="(?<=data-sku=)([\\\\]*\")(\\d+)";
Pattern p = Pattern.compile(pattern);
Matcher matcher = p.matcher(string);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.println(" End index: " + matcher.end() + " ");
System.out.println("number="+matcher.group(2));
}

How do I correctly parse data using JSoup (java)

I want to parse the data out of this HTML (CompanyName, Location, jobDescription,...) using JSoup (java). I get stuck when trying to iterate the joblistings
The extract from the HTML is one of many "JOBLISTING" divs which I want to iterate and extract the Data out of it. I just can't handle how to iterate the specific div objects. Sorry for this noob question, but maybe someone can help me who already knows which function to use. Select?
<div class="between_listings"><!-- local.spacer --></div>
<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">
<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
<img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
</a>
</div>
<div class="job_info">
<div class="h3 job_title">
<a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
<span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
</a>
</div>
<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Delivery Hero Holding GmbH</span>
</div>
</div>
<div class="job_location_date">
<div class="job_location target-location">
<div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
<div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality"> Berlin</span>
</div>
<span class="location_actions">
<a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
<span class="location-icon"><!-- --></span>
<span class="location-label">Google Maps</span>
</a>
</span>
</div>
</div>
<div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>
<div class="job_actions">
</div>
</div>
<div class="between_listings"><!-- local.spacer --></div>
File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"); // Load file into extraction1 Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/"); Elements jobListingElements = ParseResult.select(".joblisting"); for (Element jobListingElement: jobListingElements) { jobListingElement.select(".companyName span[itemprop=\"name\"]"); // other element properties System.out.println(jobListingElements);
Java code:
File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements jobListingElements = ParseResult.select(".joblisting");
for (Element jobListingElement: jobListingElements) {
jobListingElement.select(".companyName span[itemprop=\"name\"]");
// other element properties
System.out.println(jobListingElements);
}
Thank you!
So you got your Jsoup document right? Than it seems pretty easy if the css class joblisting does not appear anywhere else.
Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
Elements elements = document.select(".joblisting");
for (Element element : elements) {
Elements jobTitleElement = element.select(".job_title span");
Elements companyNameElement = element.select(".company_name spanspan[itemprop=name]");
String companyName = companyNameElement.text();
String jobTitle = jobTitleElement.text();
System.out.println(companyName);
System.out.println(jobTitle);
}
I don't know why the attribute [itemprop*=\"name\"] selector does not find the span (Further reading: http://jsoup.org/cookbook/extracting-data/selector-syntax )
Got it: span[itemprop=name] without any quotes or escapes. Other attributes or values also should work to get a more specific selection.

Categories