Wrap contents of span - java

I have this span: <span style="font-weight:bold;">bold. </span> and a reference to it(the element) called span.
I want to wrap everything inside of my span element in some new tags, example <p> tags: <span style="font-weight:bold;"><p>bold. </p></span>
I know I can call span.wrap("<p></p>"), but this wraps the span and not the span's contents. When I try span.append("<p>"), the new tags are just created at the beginning of the contents, and the same happens with appendElement.
What is the best way to wrap the contents of a span/element and not the whole element?
Update: Elements also has wrap, but calling span.getAllElements() and then wrap on that gives the same result as span.wrap(), and span.children() has size 0 for this example.
Update 2: As a workaround I was able to get the content with span.html(), store that in a temporary String, add the desired tags around that content and then set the span's content to it via span.html(newContent). If there is not a better way I will just answer my own question.

To wrap the text node, use:
span.childNode(0).wrap("<p>");
Edit:
An example with various use cases:
String html = "<span style=\"font-weight:bold;\">bold.</span><span></span><span><a>text</a></span>";
Document parsedDoc = Jsoup.parse(html);
Elements selects = parsedDoc.select("span");
for (Element span : selects) {
    List<Node> childNodes = span.childNodes();
    // Only wrap when the span has a first child and that child is a leaf (e.g. a text node)
    if (childNodes.size() > 0 && span.childNode(0).childNodes().size() == 0) {
        span.childNode(0).wrap("<p>");
    }
}

Related

How do I write a css-selector for text that is not inside any dom element [duplicate]

I am writing a JUnit test for a webpage, using Selenium, and I am trying to verify that the expected text exists within a page. The code of the webpage I am testing looks like this:
<div id="recipient_div_3" class="label_spacer">
<label class="nodisplay" for="Recipient_nickname"> recipient field: reqd info </label>
<span id="Recipient_nickname_div_2" class="required-field"> *</span>
Recipient:
</div>
I want to compare what is expected with what is on the page, so I want to use
Assert.assertTrue(). I know that to get everything from the div, I can do
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ");
but this will return "reqd info * Recipient:"
Is there any way to just get the text from the div ("Recipient") using cssSelector, without the other tags?
You can't do this with a CSS selector, because CSS selectors don't have a fine-grained enough approach to express "the text node contained in the DIV but not its other contents". You can do that with an XPath locator, though:
driver.findElement(By.xpath("//div[@id='recipient_div_3']/text()")).getText()
That XPath expression will identify just the single text node that is a direct child of the DIV, rather than all the text contained within it and its child nodes.
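To see what the /text() step selects, here is a minimal sketch that evaluates the same expression with the JDK's own javax.xml.xpath API instead of Selenium. It assumes the markup is available as a well-formed XML string; the class name OwnTextXPath and the helper ownText are made up for this illustration and are not part of any library.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OwnTextXPath {
    // Hypothetical helper: joins the non-blank text nodes that are direct
    // children of the div with the given id.
    static String ownText(String xml, String id) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // text() matches only text nodes directly under the div,
            // not the text inside its child elements.
            String expr = "//div[@id='" + id + "']/text()";
            NodeList nodes = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                String part = nodes.item(i).getNodeValue().trim();
                if (!part.isEmpty()) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(part);
                }
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id=\"recipient_div_3\" class=\"label_spacer\">"
                + "<label class=\"nodisplay\" for=\"Recipient_nickname\"> recipient field: reqd info </label>"
                + "<span id=\"Recipient_nickname_div_2\" class=\"required-field\"> *</span>"
                + " Recipient: </div>";
        System.out.println(ownText(xml, "recipient_div_3")); // prints "Recipient:"
    }
}
```

The node set is used rather than a plain string result because a div can have several direct text nodes (including whitespace-only ones), and a string evaluation would return only the first.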
I am not sure whether it is possible with a single CSS locator, but you can get the text from the div, then get the text from the div's child elements and remove it. Something like this (the code wasn't checked):
String temp = "";
List<WebElement> tempElements = driver.findElements(By.cssSelector("div[id='recipient_div_3'] *"));
for (WebElement tempElement : tempElements) {
    temp += " " + tempElement.getText();
}
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ").replace(temp, "");
This is for the case where you want to avoid XPath. XPath lets you do it directly:
//div[@id='recipient_div_3']/text()
You could also get the text content of an element and remove the tags with a regexp. Also note that you should use the reluctant quantifier:
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
String getTextContentWithoutTags(WebElement element) {
    return element.getText().replaceAll("<[^>]*?/>", "").trim();
}

How to select text in HTML tag without a tag around it (JSoup)

I would like to select the text inside the strong-tag but without the div under it...
Is there a possibility to do this with jsoup directly?
My try for the selection (doesn't work, selects the full content inside the strong-tag):
Elements selection = htmlDocument.select("strong").select("*:not(.dontwantthatclass)");
HTML:
<strong>
I want that text
<div class="dontwantthatclass">
</div>
</strong>
You are looking for the ownText() method.
String txt = htmlDocument.select("strong").first().ownText();
Have a look at the various methods jsoup has for dealing with this: https://jsoup.org/apidocs/org/jsoup/nodes/Element.html. You can use remove(), removeChild(), etc.
One thing you can do is use regex.
Here is a sample regex that matches start and end tag also appended by </br> tag
https://www.debuggex.com/r/1gmcSdz9s3MSimVQ
So you can do it like
selection.replace(/<([^ >]+)[^>]*>.*?<\/\1>|<[^\/]+\/>/ig, "");
You can further modify this regex to match most of your cases.
Another thing you can do is further process your variable using JavaScript or VBScript:
Elements selection = htmlDocument.select("strong");
jQuery code here:
var removeHTML = function(text, selector) {
    var wrapped = $("<div>" + text + "</div>");
    wrapped.find(selector).remove();
    return wrapped.html();
}
With regular expression you can use ownText() methods of jsoup to get and remove unwanted string.
I guess you're using jQuery, so you could use "innerText" property on your "strong" element:
var selection = htmlDocument.select("strong")[0].innerText;
https://jsfiddle.net/scratch_cf/8ds4uwLL/
PS: If you want to wrap the retrieved text into a "strong" tag, I think you'll have to build a new element like $('<strong>retrievedText</strong>');

How do i get the text in jsoup link?

I am parsing an html page using jsoup. Here is what i did so far:
doc = Jsoup.connect("http://www.marketimyilmazlar.com/index.php?route=product/category&path=141_77").get();
Element page_clips = doc.getElementById("page_clips");
Element page_clip_content = page_clips.getElementById("content");
Elements allProductPricesOnPage = page_clip_content.getElementsByClass("price");
now, when i write:
allProductNamesOnPage.get(0);
it returns me the following:
<div class="name">
<a href="http://www.marketimyilmazlar.com/index.php?route=product/product&path=141_77&product_id=4309"> here is the text</a>
</div>
What I want to do is get the "here is the text" part of that object. Can anyone help me with this?
Thanks
You might want to iterate over the Elements you have gathered and print their prices one by one:
Elements allProductPricesOnPage = page_clip_content
        .getElementsByClass("price");
for (Element el : allProductPricesOnPage) {
    System.out.println(el.text());
}
Gives,
19.99 TL KDV Dahil
9.99 TL KDV Dahil
14.99 TL KDV Dahil
This works because Elements implements Iterable (see javadoc here), which gives you access to the individual Element objects within your collection.
Each of these Element objects, which repeat within your HTML, holds the relevant information you want to extract.
If you want to extract only the text, you can call the text() method:
String text = allProductNamesOnPage.get(0).text();
This method gets the text of an Element and its combined children. So if you want to ensure that you are only extracting text from the a element, call text() on the first child element:
String text = allProductNamesOnPage.get(0).child(0).text();
See here: http://jsoup.org/cookbook/extracting-data/attributes-text-html

Using W3C Document Object Model how to insert tag for particular text

For example, I have HTML content like this:
<div>go to the text from here.<br> from there <br> Go to the text</div>
In the above content, I want to wrap each occurrence of the word "text" in a span tag, as in the expected output below, using Java.
I'm using the org.w3c.dom package.
I tried the following, but without success:
Element e = doc.createElement("span");
String text = preElement.getTextContent();
if (text.indexOf("text") != -1) {
    e.setTextContent("text");
}
// Afterwards, how do I insert this into the document? How do I use the insertBefore method for the text in between?
Expected Output:
<div>go to the <span>text</span> from here.<br> from there <br> Go to the <span>text</span></div>
Please help.
You have to use the splitText method on your text node to split it into three nodes, isolating the word you need to wrap in your element. Then you only have to replace the text node you just isolated (use replaceChild) with the new element. There is no need to create a new text node, you can simply put the one you removed in the element you added.
Java implementation reference: http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Text.html#splitText%28int%29 http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html#replaceChild%28org.w3c.dom.Node,%20org.w3c.dom.Node%29.
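The steps above can be sketched as follows. This is only an illustration under a few assumptions: the input is parsed as well-formed XML (so <br/> instead of <br>), only direct text children of the given element are scanned, and the names WrapWord, parse and wrapWord are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class WrapWord {
    // Hypothetical helper: parses a well-formed XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Wraps every occurrence of `word` found in the direct text children
    // of `parent` in a new element named `tag`.
    static void wrapWord(Element parent, String word, String tag) {
        Document doc = parent.getOwnerDocument();
        Node child = parent.getFirstChild();
        while (child != null) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                Text text = (Text) child;
                int index = text.getData().indexOf(word);
                if (index >= 0) {
                    Text match = text.splitText(index);          // before | word + rest
                    Text rest = match.splitText(word.length());  // word | rest
                    Element wrapper = doc.createElement(tag);
                    parent.replaceChild(wrapper, match); // swap the isolated text node for the element
                    wrapper.appendChild(match);          // reuse the removed text node inside it
                    child = rest;                        // keep scanning the remainder
                    continue;
                }
            }
            child = child.getNextSibling();
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div>go to the text from here.<br/> from there <br/> Go to the text</div>");
        wrapWord(doc.getDocumentElement(), "text", "span");
        System.out.println(doc.getElementsByTagName("span").getLength()); // prints 2
    }
}
```

Two splitText calls are needed per match: the first cuts off everything before the word, the second cuts off everything after it, leaving the word alone in the middle node.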

How do I parse an HTML document with JSoup to get a list of links?

I am trying to parse http://www.craigslist.org/about/sites to build a set of text/links to load a program dynamically with this information. So far I have done this:
Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements elms = doc.select("div.colmask"); // gets 7 countries
Below this tag are the elements matched by doc.select("div.state_delimiter,ul") that I am trying to get. I set up my iterator, go into a while loop and call iterator.next().outerHtml();. I see all the tags for each country.
How can I step through each div.state_delimiter, pull that text, then go down until there is a </ul>, which marks the end of that state's individual counties/cities links/text?
I was playing around with this and can do it by setting outerHtml() to a String and then parsing the string manually, but I am sure there is an easier way. I have tried text() and also attr("div.state_delimiter"), but I think I am messing up the pattern/routine to do this properly. Could someone show me how to get the div.state_delimiter into a text field and then, from the <ul><li></li></ul> that follows, all the <li></li> under the <ul></ul> for each state? I am looking to grab the http:// link and the HTML that goes along with it as easily as possible.
The <ul> containing the cities is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to grab it from that div on. Here's a kickoff example:
Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements countries = document.select("div.colmask");
for (Element country : countries) {
    System.out.println("Country: " + country.select("h1.continent_header").text());
    Elements states = country.select("div.state_delimiter");
    for (Element state : states) {
        System.out.println("\tState: " + state.text());
        Elements cities = state.nextElementSibling().select("li");
        for (Element city : cities) {
            System.out.println("\t\tCity: " + city.text());
        }
    }
}
The doc.select("div.state_delimiter,ul") doesn't do what you want. It returns all <div class="state_delimiter"> and <ul> elements of the document. Parsing it manually with string functions makes no sense if you already have an HTML parser at hand.
