Extract all visible text from html


I am trying to create a search function for Google Chrome: given a string, it should highlight every area of the page that contains that string. I am using Java.
To do this, I first need to extract all visible text. I have been analyzing HTML pages to figure out how to extract only the text, and I planned on using jsoup. I am not sure how to extract text from sections that look like the one below (a YouTube comment with a "read more" link and a "show less" link).
From this section, I want to extract "Not gonna lie, dat dog is ADORABLE" and either "Les mer" or "Vis mindre", depending on which of them is visible.
<div class="comment-renderer-text" tabindex="0" role="article">
<div class="comment-renderer-text-content">Not gonna lie, dat dog is ADORABLE</div>
<div class="comment-text-toggle hid">
<div class="comment-text-toggle-link read-more">
<button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
<span class="yt-uix-button-content">Les mer
</span>
</button>
</div>
<div class="comment-text-toggle-link show-less hid">
<button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
<span class="yt-uix-button-content">Vis mindre
</span>
</button>
</div>
</div>
</div>

I am going to assume that the HTML above has already been parsed into a Document named doc.
String text = doc.select("div.comment-renderer-text-content").first().text();
doc.select returns the Elements that match the given CSS selector. I then take the first match and get its text.
More can be read here: Jsoup Selector
Edit:
To get all the text in the page body rather than the text of one class, you can use:
String text = doc.body().text();
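One caveat with doc.body().text(): jsoup does not evaluate CSS, so text inside hidden elements is returned too. A minimal sketch of one workaround follows. It assumes that the "hid" class seen in the snippet is what the page uses to hide elements (the stylesheet is not shown in the question, so this is a guess) and that jsoup is on the classpath. The class and method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class VisibleText {
    // Returns the body text after dropping every element that carries the
    // "hid" class. ASSUMPTION: "hid" means hidden on this page; jsoup cannot
    // check real CSS visibility, so this is only a class-name heuristic.
    static String visibleText(String html) {
        Document doc = Jsoup.parse(html);
        doc.select(".hid").remove();   // remove hidden subtrees from the parsed copy
        return doc.body().text();      // whitespace-normalized text of what is left
    }

    public static void main(String[] args) {
        String html = "<div class=\"comment-renderer-text\">"
            + "<div class=\"comment-renderer-text-content\">Not gonna lie, dat dog is ADORABLE</div>"
            + "<div class=\"comment-text-toggle hid\"><span>Les mer</span></div>"
            + "</div>";
        System.out.println(visibleText(html)); // prints "Not gonna lie, dat dog is ADORABLE"
    }
}
```

If the page hides elements through inline styles or scripts instead of a class, this heuristic will miss them; a browser-driven tool would be needed for true visibility.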

Related

extracting attribute value of parent element using Selenium

Experienced with Java, pretty new to Selenium, locators, etc.
Buried deep in some HTML are several similar divisions:
<div tabgroup="topTabs__County Summary" sectiongroup class="field TextDescription tab">
<label for="request_48543">
<span class="label">Monument</span>
</label>
</div>
<div tabgroup="topTabs__County Summary" sectiongroup class="field DropDownList readonly tab">
<label for="request_48543">
<span class="label">Geolocation</span>
</label>
</div>
<div tabgroup="topTabs__County Summary" sectiongroup class="field SingleLineText tab">
<label for="request_48543">
<span class="label">Intersection</span>
</label>
</div>
I need some Selenium magic to find a label with a specific value, then backtrack to that label's division and extract the value of a given attribute from it. Drilling down seems fairly easy, but how does one "back up"?
For example, given "Geolocation" I'd like to extract "field DropDownList readonly tab"
I've tried things like
WebElement chill = m.findElement(By.xpath("../..//span[text='Geolocation']"));
to no avail

You reversed the order of the steps to the parent element, and text needs parentheses: text(). The XPath should be
"//span[text()='Geolocation']/../.."
Another option is to look for a <div> that has a descendant with the text "Geolocation":
"//div[.//span[text()='Geolocation']]"
This might return more results, depending on the HTML structure that is not shown in the question. In that case you can require a distinguishing attribute, for example tabgroup:
"//div[.//span[text()='Geolocation']][@tabgroup]"
This returns only <div> tags that have a tabgroup attribute.
To extract the data, call getAttribute("class") on the chill WebElement.
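The two expressions can be checked without a browser. The sketch below uses the JDK's built-in javax.xml.xpath instead of Selenium, purely to verify the XPath. The markup is a trimmed copy of the question's HTML wrapped in a made-up root element, and the bare sectiongroup attribute is left out because an XML parser rejects attributes without a value. The class and method names are illustrative only.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ParentClassLookup {
    // Trimmed, XML-safe copy of the question's markup (hypothetical <root> wrapper).
    static final String XML = "<root>"
        + "<div tabgroup=\"topTabs__County Summary\" class=\"field TextDescription tab\">"
        + "<label for=\"request_48543\"><span class=\"label\">Monument</span></label></div>"
        + "<div tabgroup=\"topTabs__County Summary\" class=\"field DropDownList readonly tab\">"
        + "<label for=\"request_48543\"><span class=\"label\">Geolocation</span></label></div>"
        + "</root>";

    // Evaluates an XPath expression against XML and returns the string value
    // of the first matching node.
    static String eval(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    // Climbs from the span to its grandparent div with the parent axis.
    // Assumes labelText contains no apostrophe.
    static String classViaParentAxis(String labelText) {
        return eval("//span[text()='" + labelText + "']/../../@class");
    }

    // Selects the div that contains a matching span and has a tabgroup attribute.
    static String classViaPredicate(String labelText) {
        return eval("//div[.//span[text()='" + labelText + "']][@tabgroup]/@class");
    }

    public static void main(String[] args) {
        System.out.println(classViaParentAxis("Geolocation")); // prints "field DropDownList readonly tab"
        System.out.println(classViaPredicate("Geolocation"));  // prints "field DropDownList readonly tab"
    }
}
```

In Selenium the expression has to select the element itself rather than the @class attribute, so the trailing /@class is dropped and getAttribute("class") is called on the returned WebElement.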

how to find the inner elements when all div class name and span class name are same using jsoup in java

<div class="xyOfqd">
<div class="hAyfc">
<div class="BgcNfc">Updated</div>
<span class="htlgb">
<div>
<span class="htlgb">July 14, 2018</span>
</div>
</span>
</div>
<div class="hAyfc">
<div class="BgcNfc">Size</div>
<span class="htlgb">
<div><span class="htlgb">3.9M</span></div>
</span>
</div>
</div>
I want all the text from the HTML above using jsoup in Java, like this:
Updated
July 14, 2018
Size
3.9M
"Updated" and "Size" are constant, but the date and 3.9M are dynamic values.
Basically, I am trying to scrape these values from the Google Play Store.

You have two issues here:
1. Finding the CSS selector for elements that share the same class name. This is the easier part, because each element still has a distinct selector. If you use your browser's developer tools, you will see that the selector for Updated is div.hAyfc:nth-child(1) > div:nth-child(1) and the selector for Size is div.hAyfc:nth-child(2) > div:nth-child(1).
2. Getting dynamic values. Jsoup does not execute JavaScript, so it cannot get values that are loaded dynamically. You can try to find the AJAX call that fetches those values and make the same request with Jsoup, or use another tool, such as PhantomJS.
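If the values are already present in the HTML that jsoup receives, the rows can also be read in a loop instead of one nth-child selector per row. This is only a sketch: it assumes jsoup is on the classpath, it relies on the class names hAyfc, BgcNfc and htlgb from the question (which look generated and may change), and the class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PlayStoreInfo {
    // Sample markup copied from the question (whitespace removed).
    static final String SAMPLE = "<div class=\"xyOfqd\">"
        + "<div class=\"hAyfc\"><div class=\"BgcNfc\">Updated</div>"
        + "<span class=\"htlgb\"><div><span class=\"htlgb\">July 14, 2018</span></div></span></div>"
        + "<div class=\"hAyfc\"><div class=\"BgcNfc\">Size</div>"
        + "<span class=\"htlgb\"><div><span class=\"htlgb\">3.9M</span></div></span></div>"
        + "</div>";

    // Maps each row's label ("Updated", "Size") to its value, in document order.
    static Map<String, String> labelValuePairs(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> pairs = new LinkedHashMap<>();
        for (Element row : doc.select("div.hAyfc")) {
            Element label = row.select("div.BgcNfc").first();
            // first() picks the outer span; its text() already includes the nested span.
            Element value = row.select("span.htlgb").first();
            if (label != null && value != null) {
                pairs.put(label.text(), value.text());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        labelValuePairs(SAMPLE).forEach((label, value) -> {
            System.out.println(label);
            System.out.println(value);
        });
    }
}
```

Taking first() on the value span matters here: calling text() on the whole Elements result would repeat each value, because the outer and inner span.htlgb both match.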

JSoup scrape HTML document by attribute value

I want to make a dynamic website and need some pictures from the internet. I decided to scrape them from Flickr and credit the owners on my website, but I am running into problems with the scraping. I'll post part of the HTML below, but if you want to check the source code yourself, here's the website: https://www.flickr.com/explore
HTML:
<div class="thumb ">
<span class="photo_container pc_ju">
<a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="rapidnofollow photo-click"><img id="photo_img_15586482942" src="https://c2.staticflickr.com/4/3945/15586482942_6a7154363f_z.jpg"width="508" height="339" alt="Lake District" class="pc_img " border="0"><div class="play"></div></a>
</span>
<div class="meta">
<div class="title"><a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="title">Lake District</a></div>
<div class="attribution-block">
<span class="attribution">
<span>by </span>
******<a data-track="owner" href="/photos/sheilarogers13" title="sheilarogers22" class="owner">sheilarogers22</a>******
</span>
</div>
<span class="inline-icons">
<a data-track="favorite" href="#" class="rapidnofollow fave-star-inline canfave" title="Add this photo to your favorites?"><img width="12" height="12" alt="[★]" src="https://s.yimg.com/pw/images/spaceball.gif" class="img"><span class="fave-count count">99+</span></a>
<a title="Comments" href="#" class="rapidnofollow comments-icon comments-inline-btn">
<img width="12" height="12" alt="Comments" src="https://s.yimg.com/pw/images/spaceball.gif">
<span class="comment-count count">57</span>
</a>
<img width="12" height="12" alt="" src="https://s.yimg.com/pw/images/spaceball.gif">
</span>
</div>
</div>
I want the line where I put asterisks, in order to be able to give credit to the authors of the pictures.
My code:
Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
The above code, however, gives me all 4 data-track elements in my div.meta, and I only want the one whose value is owner.
I checked the JSoup documentation and it says that attributes with values are found using [attr=value], but I can't seem to get it to work. I've tried:
.select("[data-track=owner]")
.select("[data-track='owner']")
but neither works. Thoughts?

Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
Elements ownerElements = new Elements();
for (Element element : pgElem) {
    // keep only the elements whose data-track value contains "owner"
    if (!element.getElementsByAttributeValueContaining("data-track", "owner").isEmpty()) {
        ownerElements.add(element);
    }
}
Actually, I just gave it another spin, and this works for me:
doc.select("div.thumb").select("div.meta").select("[data-track=owner]")

How do I determine radio button selection based on value using selenium webdriver?

I am trying to determine which of two radio buttons is selected and based on that select the other one. I'm using Java and selenium.
My HTML is:
<div class="row span-670px">
<h3>Turn on</h3>
<div class="field-row">
<div class="field-wrap radio-row clearfix ">
<input type="radio" name="choosePaymentModel" value="QUOTEHOLD" checked="checked" />
<label>
...
</label>
</div>
</div>
<div class="row last span-670px">
<h3>Turn off</h3>
<div class="field-row">
<div class="field-wrap radio-row clearfix ">
<input type="radio" name="choosePaymentModel" value="BASIC" />
<label>
...
</span>
</label>
</div>
</div>
The only thing that differs is the value attribute. The checked attribute changes depending on which button is checked, so the only clear way to tell the two apart is by value. I can't seem to find the proper syntax to grab the correct radio button. When I use the IDE, the element identifiers swap with each other depending on the selection, so nothing is ever unique.
Suggestions?

I had to use:
element = driver.findElement(By.xpath("//input[@name='choosePaymentModel' and @value='QUOTEHOLD']"));
and
element = driver.findElement(By.xpath("//input[@name='choosePaymentModel' and @value='BASIC']"));
to determine which was selected, but unfortunately the click methods did not work on them.
When playing with the IDE, I was lucky enough to find two separate, bizarre elements to click on, which were not in fact the elements that held the "isSelected" values.
In either case, it looks like I found the answer to my own problem.
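For what it's worth, the attribute predicates can be tried out against a static copy of the markup. The sketch below uses the JDK's javax.xml.xpath rather than Selenium, and the form wrapper and class name are made up for the example. It only reads the checked attribute as written in the source; in a live browser, isSelected() on the WebElement is the state to rely on, because the page can change the selection after it loads.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class RadioXPathCheck {
    // The two radio inputs from the question inside a hypothetical <form> wrapper.
    static final String MARKUP = "<form>"
        + "<input type=\"radio\" name=\"choosePaymentModel\" value=\"QUOTEHOLD\" checked=\"checked\"/>"
        + "<input type=\"radio\" name=\"choosePaymentModel\" value=\"BASIC\"/>"
        + "</form>";

    // Returns the string value of the first node matched by the expression.
    static String eval(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(MARKUP)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    // Value of the radio button that carries a checked attribute.
    static String checkedValue() {
        return eval("//input[@name='choosePaymentModel' and @checked]/@value");
    }

    // Value of the radio button without a checked attribute, i.e. the one to click next.
    static String uncheckedValue() {
        return eval("//input[@name='choosePaymentModel' and not(@checked)]/@value");
    }

    public static void main(String[] args) {
        System.out.println(checkedValue());   // prints "QUOTEHOLD"
        System.out.println(uncheckedValue()); // prints "BASIC"
    }
}
```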

// Concrete, OR, Browser, object and data are defined elsewhere in the poster's code (not shown).
// Split the value passed through the data sheet into the two locator keys for the radio buttons.
String[] tempValue = object.split(Concrete.VALUE_SPLIT);
String radioValue = Browser.driver.findElement(By.xpath(OR.getProperty(tempValue[0]) + data + OR.getProperty(tempValue[1]))).getAttribute("value");
System.out.println(radioValue);
boolean selected = Browser.driver.findElement(By.xpath("//input[@name='radio' and @value='" + radioValue + "']")).isSelected();
if (selected) {
    // do something here
}

Extracting href from a class within other div/id classes with jsoup

Hello, I am trying to extract the first href from within the "title" class in the following source (this is only part of the page; I am parsing the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all of them have given me a "null" value, for example:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.

The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title">, then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">.
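Since the question mentions getting null, here is a small sketch that guards against a missing match before reading the href. It assumes jsoup is on the classpath. The markup, the example.com base URI, and the class and method names are all made up for illustration. Note that absUrl can only resolve a relative href if the document has a base URI, which Jsoup.connect(...).get() sets automatically and Jsoup.parse(html, baseUri) takes as a parameter.

```java
import java.util.Optional;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstTitleLink {
    // Returns the absolute URL of the first <a class="title"> found inside a
    // <div class="title">, or an empty Optional when the page has no such link.
    static Optional<String> firstTitleHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("div.title a.title").first(); // null if nothing matched
        return link == null ? Optional.empty() : Optional.of(link.absUrl("href"));
    }

    public static void main(String[] args) {
        // Hypothetical markup and base URI, used only to show the call.
        String html = "<div class=\"data\"><div class=\"title\">"
            + "<a class=\"title titleHover\" href=\"/book/1\">Some title</a>"
            + "</div></div>";
        System.out.println(firstTitleHref(html, "http://example.com/").orElse("no link found"));
        // prints "http://example.com/book/1"
    }
}
```

If the Optional comes back empty on the real page, it is worth printing the fetched document to check whether the server returned the same markup that the browser shows.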
