Extract text from only some divs in the same class with jsoup - java

I would like to extract text from a specific <div> of a website using jsoup, but I'm not sure how.
The problem is that I want to get the text from a div that has class="name".
But there can be more <div>s with this class (and I don't want to get the text from those).
It looks like this in the HTML file:
.
.
<div class="name">
Some text I don't want
<span class="a">Tree</span>
</div>
.
.
<div class="name">Some text I do want</div>
.
.
So the only difference there is that the <div> I want the text from does not have <span> inside of it. But I have not found a way to use that as a key to extract the text in jsoup.
Is it possible?

Use jsoup's selector syntax. For instance, to select all divs with class="name", use:
Elements nameElements = doc.select("div.name");
Note that the text you "do" and "don't" want above sits in the same relative HTML location, and in fact I have no clue why you want one or the other. HTML and jsoup will see them the same way.
If you want to avoid elements containing span elements, then one way is to iterate through the elements obtained above and test by selector if they have span elements or not:
Elements nameElements = doc.select("div.name");
for (Element element : nameElements) {
    if (element.select("span").isEmpty()) {
        System.out.println("No span");
        System.out.println(element.text());
        System.out.println();
    } else {
        System.out.println("span");
        System.out.println(element.text());
        System.out.println();
    }
}

You can select all div elements with class="name", and then loop through them. Check if an element has child elements - if not, this is the div you want.

Related

How do I write a css-selector for text that is not inside any dom element [duplicate]

I am writing a JUnit test for a webpage, using Selenium, and I am trying to verify that the expected text exists within a page. The code of the webpage I am testing looks like this:
<div id="recipient_div_3" class="label_spacer">
<label class="nodisplay" for="Recipient_nickname"> recipient field: reqd info </label>
<span id="Recipient_nickname_div_2" class="required-field"> *</span>
Recipient:
</div>
I want to compare what is expected with what is on the page, so I want to use
Assert.assertTrue(). I know that to get everything from the div, I can do
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ");
but this will return "reqd info * Recipient:"
Is there any way to just get the text from the div ("Recipient") using cssSelector, without the other tags?
You can't do this with a CSS selector, because CSS selectors don't have a fine-grained enough approach to express "the text node contained in the DIV but not its other contents". You can do that with an XPath locator, though:
driver.findElement(By.xpath("//div[@id='recipient_div_3']/text()")).getText()
That XPath expression will identify just the single text node that is a direct child of the DIV, rather than all the text contained within it and its child nodes.
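To see what such an expression selects without a browser, here is a minimal sketch that evaluates it with the JDK's javax.xml.xpath against markup shaped like the snippet in the question. It assumes the markup is well-formed XML, and the class and method names (DirectTextXPath, directText) are illustrative only. Note that Selenium's findElement expects a locator that resolves to an element, so an expression ending in text() may be rejected there; this sketch only demonstrates the XPath itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirectTextXPath {
    // Returns the text nodes that are direct children of the div with the given id,
    // trimmed and joined with single spaces. The id is pasted into the expression
    // as-is, so it is assumed to be a trusted value without quote characters.
    static String directText(String xml, String id) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@id='" + id + "']/text()", doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                String piece = nodes.item(i).getNodeValue().trim();
                if (!piece.isEmpty()) {          // skip whitespace-only text nodes
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(piece);
                }
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id=\"recipient_div_3\" class=\"label_spacer\">"
                + "<label class=\"nodisplay\" for=\"Recipient_nickname\"> recipient field: reqd info </label>"
                + "<span id=\"Recipient_nickname_div_2\" class=\"required-field\"> *</span>"
                + " Recipient: </div>";
        System.out.println(directText(xml, "recipient_div_3")); // prints "Recipient:"
    }
}
```

The expression is evaluated as a node set rather than a string because the div may contain several direct text nodes, some of them whitespace only.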
I am not sure whether this is possible with a single CSS locator, but you can get the text of the div, then get the text of the div's child elements and remove it from the result. Something like this (the code wasn't checked):
String temp = "";
List<WebElement> tempElements = driver.findElements(By.cssSelector("div[id='recipient_div_3'] *"));
for (WebElement tempElement : tempElements) {
    temp += " " + tempElement.getText();
}
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ").replace(temp, "");
This is for the case where you want to avoid XPath. XPath lets you do it directly:
//div[@id='recipient_div_3']/text()
You could also get the text content of an element and remove the tags with a regular expression. Also note that you should use the reluctant quantifier:
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
String getTextContentWithoutTags(WebElement element) {
    return element.getText().replaceAll("<[^>]*?/>", "").trim();
}

How to select text in HTML tag without a tag around it (JSoup)

I would like to select the text inside the strong-tag but without the div under it...
Is there a possibility to do this with jsoup directly?
My try for the selection (doesn't work, selects the full content inside the strong-tag):
Elements selection = htmlDocument.select("strong").select("*:not(.dontwantthatclass)");
HTML:
<strong>
I want that text
<div class="dontwantthatclass">
</div>
</strong>
You are looking for the ownText() method.
String txt = htmlDocument.select("strong").first().ownText();
Have a look at the various methods jsoup has for dealing with this: https://jsoup.org/apidocs/org/jsoup/nodes/Element.html. You can use remove(), removeChild() etc.
One thing you can do is use regex.
Here is a sample regex that matches start and end tag also appended by </br> tag
https://www.debuggex.com/r/1gmcSdz9s3MSimVQ
So you can do it like
selection.replace(/<([^ >]+)[^>]*>.*?<\/\1>|<[^\/]+\/>/ig, "");
You can further modify this regex to match most of your cases.
Another thing you can do is process your variable further using JavaScript or VBScript:
Elements selection = htmlDocument.select("strong");
jQuery code:
var removeHTML = function(text, selector) {
    var wrapped = $("<div>" + text + "</div>");
    wrapped.find(selector).remove();
    return wrapped.html();
}
Alongside a regular expression, you can use jsoup's ownText() method to get the text and remove the unwanted string.
I guess you're using jQuery, so you could use "innerText" property on your "strong" element:
var selection = htmlDocument.select("strong")[0].innerText;
https://jsfiddle.net/scratch_cf/8ds4uwLL/
PS: If you want to wrap the retrieved text into a "strong" tag, I think you'll have to build a new element like $('<strong>retrievedText</strong>');

Java: How do I extract separated text from nested <div> in HTML?

For example:
<div>
this is first
<div>
second
</div>
</div>
I am working on natural language processing and I have to translate a website (without using Google Translate), so I have to extract the two sentences "this is first" and "second" separately and replace each with text in another language in its respective div. If I extract the text of the first div, I get "this is first second", and if I use recursion to dig deeper, I only get "second".
Please help me out!
EDIT
Using the ownText() method causes a problem with the following HTML code:
<div style="top:+0.2em; font-size:95%;">
the
<a href="/wiki/Free_content" title="Free content">
free
</a>
<a href="/wiki/Encyclopedia" title="Encyclopedia">
encyclopedia
</a>
that
<a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">
anyone can edit
</a>
.
</div>
It will print:
the that.
free
encyclopedia
anyone can edit
But it must be:
the
that
.
encyclopedia
anyone can edit
If i extract text for first it will show "this is first second"
Use ownText() instead of text() and you'll get only the text the element contains directly.
Here's an example:
final String html = "<div>\n"
+ " this is first\n"
+ " <div>\n"
+ " second\n"
+ " </div>\n"
+ "</div>";
Document doc = Jsoup.parse(html); // Get your Document from somewhere
Element first = doc.select("div").first(); // Select 1st element - take the first found
String firstText = first.ownText(); // Get own text
Element second = doc.select("div > div").first(); // Same as above, but with 2nd div
String secondText = second.ownText();
System.out.println("1st: " + firstText);
System.out.println("2nd: " + secondText);
You can use an XML parser in whatever language you are using. Here is one for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
It seems like you're using textContent on the divs to extract the content, which gets you the content of that element and all descendant elements. (Java: this would be the getTextContent method on the Element.)
Instead, examine the childNodes (Java: getChildNodes method on the Element). The nodes have a property "nodeType" (Java: getNodeType) which you can look at to work out whether the node is a text node (Java: Node.TEXT_NODE) or an element (Java: Node.ELEMENT_NODE). So, to take your example, you have a tree of nodes which looks like this:
div (Element)
    this is first (TextNode)
    div (Element)
        second (TextNode)
The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".
So loop over the nodes in the outer div, if the node is a text node, translate, otherwise recurse into the Element. Note that there are other kinds of nodes, Comments and the like, but for your purposes you can probably ignore those.
Assuming you're using the w3c DOM API
http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
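The walk described above can be sketched with the JDK's own DOM parser as follows. This is only an illustration under the assumption that the markup is well-formed XML (real-world HTML often is not and would need an HTML parser first); the class and method names (TextNodeWalker, collectText, walk) are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeWalker {
    // Collects every non-blank text node in document order, one list entry per node.
    static List<String> collectText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    private static void walk(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);       // a translator could call child.setNodeValue(...) here
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk(child, out);        // recurse into nested elements
            }
            // comments and other node types are ignored
        }
    }

    public static void main(String[] args) {
        String xml = "<div>\n this is first\n <div>\n second\n </div>\n</div>";
        for (String piece : collectText(xml)) {
            System.out.println(piece);
        }
    }
}
```

Because each text node is visited on its own, mixed content such as text interleaved with links is reported piece by piece in document order instead of being merged into one string.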
Elements divs = doc.getElementsByTag("div");
for (Element element : divs) {
    System.out.println(element.text());
}
This should work if you are using the jsoup HTML parser.

How do i get the text in jsoup link?

I am parsing an HTML page using jsoup. Here is what I have done so far:
doc = Jsoup.connect("http://www.marketimyilmazlar.com/index.php?route=product/category&path=141_77").get();
Element page_clips = doc.getElementById("page_clips");
Element page_clip_content = page_clips.getElementById("content");
Elements allProductPricesOnPage = page_clip_content.getElementsByClass("price");
Now, when I write:
allProductNamesOnPage.get(0);
it returns me the following:
<div class="name">
<a href="http://www.marketimyilmazlar.com/index.php?
route=product/product&path=141_77&product_id=4309"> here is the text</a>
</div>
What I want to do is get the "here is the text" part of that object. Can anyone help me with this?
Thanks
You might want to iterate over the Elements you have gathered and print their prices one by one:
Elements allProductPricesOnPage = page_clip_content
        .getElementsByClass("price");
for (Element el : allProductPricesOnPage) {
    System.out.println(el.text());
}
Gives,
19.99 TL KDV Dahil
9.99 TL KDV Dahil
14.99 TL KDV Dahil
What happens here is that you select an Elements collection, which implements Iterable (see javadoc here), giving you access to the individual Element objects within it.
Each of these Element objects which are repeating within your HTML have relevant information you want to extract.
If you want to extract only the text, you can call the text() method:
String text = allProductNamesOnPage.get(0).text();
This method gets the text of an Element and its combined children. So if you want to ensure that you are only extracting text from the a element, call text() on the first child element:
String text = allProductNamesOnPage.get(0).child(0).text();
See here: http://jsoup.org/cookbook/extracting-data/attributes-text-html

jsoup to get a particular id from a html file

I have a html file like
<div class="student">
<h4 id="Classnumber100" class="studentheading">
<a id="studentlink22" href="/grade8/greg">22. Greg</a>
</h4>
<div class="studentcategories">
<div class="studentneighborhoods">
</div>
</div>
</div>
I want to use JSOUP to get the url = /grade8/greg and "22. Greg".
I tried with selector
Elements listo = doc.select("h4 #studentlink22");
I am not able to get the values.
Actually, I want to select based on Classnumber100.
There are 300 records in the HTML page, and the only thing consistent across them is Classnumber100.
So I want my selector to select all the hrefs and text after Classnumber100.
How can I do that?
I tried
doc.select("class#studentheading"); and many other possibilities but they are not working
First of all, multiple elements should not share the same id, so each of these elements should not have the id Classnumber100. However, if this is the case, then you can still select them using the selector [id=Classnumber100].
If you're only interested in the a tags inside, then you can use [id=Classnumber100] > a.
Upon re-reading the question, it appears that the h4 tags you're interested in share the class attribute of studentheading. In which case you can use the class selector, ie
doc.select(".studentheading > a")
The select method looks for the HTML tag, here h4 and a, and then secondarily for the attributes if you tell it to do so. Have you looked at the jsoup site? The use of select is well described there for this situation.
e.g.
// code not tested
Elements listo = doc.select("h4[id=Classnumber100]").select("a");
String text = listo.text(); // for "22. Greg"
String path = listo.attr("href"); // for "/grade8/greg"
