In jsoup, Element.children() seems to return all descendants of the Element, but I only want the Element's first-level (direct) children.
Which method can I use?
Element.children() returns direct children only. Since you get them bound to a tree, they have children too.
If you need the direct child elements without their subtrees, you have to create detached copies of them, as follows:
public static void main(String... args) {
Document document = Jsoup
.parse("<div><ul><li>11</li><li>22</li></ul><p>ppp<span>sp</span></p></div>");
Element div = document.select("div").first();
Elements divChildren = div.children();
Elements detachedDivChildren = new Elements();
for (Element elem : divChildren) {
Element detachedChild = new Element(Tag.valueOf(elem.tagName()),
elem.baseUri(), elem.attributes().clone());
detachedDivChildren.add(detachedChild);
}
System.out.println(divChildren.size());
for (Element elem : divChildren) {
System.out.println(elem.tagName());
}
System.out.println("\ndivChildren content: \n" + divChildren);
System.out.println("\ndetachedDivChildren content: \n"
+ detachedDivChildren);
}
Output
2
ul
p
divChildren content:
<ul>
<li>11</li>
<li>22</li>
</ul>
<p>ppp<span>sp</span></p>
detachedDivChildren content:
<ul></ul>
<p></p>
This should give you the desired list of direct descendants of the parent node:
Elements firstLevelChildElements = doc.select("parent-tag > *");
Alternatively, retrieve the parent element, get its first child via child(int index), and then retrieve that child's siblings via siblingElements().
This gives you the first-level children excluding the child you started from, so you have to add that child back yourself.
Elements firstLevelChildElements = doc.child(0).siblingElements();
You can also use Element.child(index) to pick a specific direct child by its index.
Here is how you can get the text of the first-level children:
Element addDetails = doc.select("div.container > div.main-content > div.clearfix > div.col_7.post-info > ul.no-bullet").first();
Elements divChildren = addDetails.children();
for (Element elem : divChildren) {
System.out.println(elem.text());
}
I have a web page that I have saved in an HtmlPage object. I applied an XPath expression and stored its result in a list.
List<?> items = null;
items = page.getByXPath("//div[contains(@class,'search-result-cards')]/div[contains(@class,'listContainer')]");
Now what I observed, is that when I iterate through these items, using HtmlElement, I get just the first line of the div tag which contains the class listContainer but not its child nodes. However, on using he.asXml() method, I get the complete information about the subnodes as well.
for(HtmlElement he : (List<HtmlElement>) items)
{
br.write("Printing just the element ::: "+he);
br.write(he.asXml());
}
Here, br is a BufferedWriter object which is being used to write the output to the file.
The issue is that I want all the information that he.asXml() returns to be available from the HtmlElement object itself. Is that possible? I tried casting a String directly to an HtmlElement, which didn't work. Can anyone please help?
Output
Printing just the element ::: HtmlDivision[<div class="listContainer" data-ptitle="3139847000" data-reactid="402">]
he.asXml() Output
<div class="listContainer" data-ptitle="3139847000" data-reactid="402">
<div class="imageContainer" data-reactid="403">
<div class="prodInfoContainer" data-reactid="406">
.
.
.
The dots indicate that these nodes keep going; the output is very large.
Let me know if any other information is needed that I may have not mentioned.
.toString() prints only the current DomElement, not the children.
You need to get the children, either by using XPath, something like:
List<HtmlElement> items = page.getByXPath("//div[contains(@class,'listContainer')]");
for (HtmlElement item : items) {
List<HtmlElement> children = item.getByXPath(".//div");
for (HtmlElement child : children) {
System.out.println(child);
}
}
Or
for (HtmlElement child : item.getHtmlElementDescendants()) {
System.out.println(child);
}
Here is my code:
Element current = doc.select("tr[class=row]").get(5);
for (Element td : current.children()) {
System.out.println(td.text());
}
How can I get an Element id in the loop?
Thanks!
In HTML id is a normal attribute, so you can simply call td.attr("id"):
Element current = doc.select("tr.row").get(5);
for (Element td : current.children()) {
System.out.println(td.attr("id"));
}
Note that there is also a selector for classes: tr.row.
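jsoup supports many of the CSS selectors, so this could be rewritten with a single selector. Note that :nth-of-type(6) picks the sixth tr among its siblings, so this matches get(5) only if the preceding tr siblings all have the class row: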
Elements elements = doc.select("tr.row:nth-of-type(6) > td");
for (Element element : elements) {
System.out.println(element.id());
}
I have an XML Document:
<entities xmlns="urn:yahoo:cap">
<entity score="0.988">
<text end="4" endchar="4" start="0" startchar="0">Messi</text>
<wiki_url>http://en.wikipedia.com/wiki/Lionel_Messi</wiki_url>
<types>
<type region="us">/person</type>
</types>
</entity>
</entities>
I have a TreeMap<String,String> data which stores the getTextContent() of both the "text" and "wiki_url" elements. Some entities will only have the "text" element (no "wiki_url"), so I need a way of finding out whether an entity has only the text element as a child or also a "wiki_url". I could use document.getElementsByTagName("text") and document.getElementsByTagName("wiki_url"), but then I would lose the relationship between the text and the URL.
I'm trying to get the amount of elements within the "entity" element by using:
NodeList entities = document.getElementsByTagName("entity"); //List of all the entity nodes
int nchild; //Number of children
System.out.println("Number of entities: "+ entities.getLength()); //Prints 1 as expected
nchild=entities.item(0).getChildNodes().getLength(); //Returns 7
However, as shown above, this returns 7, which I don't understand; surely it's 3, or 4 if you include the grandchild.
I was then going to use the number of children to cycle through them all, check whether getNodeName().equals("wiki_url"), and save the value to data if so.
Why am I getting 7 as the number of children when I can only count 3 children and 1 grandchild?
The whitespace following the > of <entity score="0.988"> also counts as a node; similarly, the end-of-line characters between the tags are parsed into text nodes. If you are interested in a particular node by name, add a helper method like the one below and call it wherever you need it.
Node getChild(final NodeList list, final String name)
{
for (int i = 0; i < list.getLength(); i++)
{
final Node node = list.item(i);
if (name.equals(node.getNodeName()))
{
return node;
}
}
return null;
}
and call
final NodeList childNodes = entities.item(0).getChildNodes();
final Node textNode = getChild(childNodes, "text");
final Node wikiUrlNode = getChild(childNodes, "wiki_url");
Normally, when working with DOM, come up with helper methods like the one above to simplify the main processing logic.
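To see where the 7 comes from, here is a minimal sketch that uses only the JDK's DOM parser. The class and method names (ChildCountDemo, firstEntity, countAllChildren, countElementChildren) are made up for illustration, and the XML is the question's document re-typed with one element per line, which is an assumption about the original whitespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ChildCountDemo {
    // Assumed layout: one element per line, indented, as in the question.
    static final String XML =
        "<entities xmlns=\"urn:yahoo:cap\">\n"
      + "  <entity score=\"0.988\">\n"
      + "    <text end=\"4\" endchar=\"4\" start=\"0\" startchar=\"0\">Messi</text>\n"
      + "    <wiki_url>http://en.wikipedia.com/wiki/Lionel_Messi</wiki_url>\n"
      + "    <types>\n"
      + "      <type region=\"us\">/person</type>\n"
      + "    </types>\n"
      + "  </entity>\n"
      + "</entities>";

    // Parses the sample document and returns its first <entity> node.
    static Node firstEntity() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("entity").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts every child node, including whitespace-only text nodes.
    static int countAllChildren(Node parent) {
        return parent.getChildNodes().getLength();
    }

    // Counts only the children that are elements.
    static int countElementChildren(Node parent) {
        int count = 0;
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Node entity = firstEntity();
        System.out.println("all children:     " + countAllChildren(entity));
        System.out.println("element children: " + countElementChildren(entity));
    }
}
```

With this layout the four whitespace runs around the three elements account for the difference between the two counts. Filtering on Node.ELEMENT_NODE is one way to skip them; the name-based getChild helper above is another.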
There is this element which has child elements, those child elements again have child elements and so on. I would like to get all elements that are descendants of the element. Thanks.
Try this one:
(Java)
List<WebElement> childs = rootWebElement.findElements(By.xpath(".//*"));
(C#)
IReadOnlyList<IWebElement> childs = rootWebElement.FindElements(By.XPath(".//*"));
Try this one
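List<WebElement> allDescendantsChilds = rootWebElement.findElements(By.xpath("//tr[@class='parent']//*"));
This gives you all descendant elements (not only the immediate children) of the parent tr. Note that because the expression starts with //, it is evaluated against the whole document, not relative to rootWebElement.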
Try this one:
List<WebElement> childs = rootWebElement.findElements(By.cssSelector("*"));
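The difference between the XPath forms used in these answers can be checked without a browser. The sketch below uses the JDK's javax.xml.xpath as a stand-in for Selenium, so it illustrates XPath semantics only, not WebDriver behaviour. The class name XPathAxesDemo, the count helper and the sample markup are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathAxesDemo {
    // Hypothetical sample markup: a div with two children and three grandchildren.
    static final String XML =
        "<div><ul><li>11</li><li>22</li></ul><p>ppp<span>sp</span></p></div>";

    // Evaluates the expression with the root <div> as context node
    // and returns how many nodes it selects.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node div = (Node) xpath.evaluate(
                "/div", new InputSource(new StringReader(XML)), XPathConstants.NODE);
            NodeList hits = (NodeList) xpath.evaluate(expression, div, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("./*  (direct children):     " + count("./*"));
        System.out.println(".//* (all descendants):     " + count(".//*"));
        System.out.println("//*  (ignores the context): " + count("//*"));
    }
}
```

The leading dot is what ties the search to the context element: ./* stays one level down, .//* walks the whole subtree, and //* starts from the document root and therefore also selects the div itself.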
I have an XML structure as follows:
<rurl modify="0" children="yes" index="8" name="R-URL">
<status>enabled</status>
<rurl-link priority="3">http</rurl-link>
<rurl-link priority="5">http://localhost:80</rurl-link>
<rurl-link priority="4">abc</rurl-link>
<rurl-link priority="3">b</rurl-link>
<rurl-link priority="2">a</rurl-link>
<rurl-link priority="1">newlinkkkkkkk</rurl-link>
</rurl>
Now I want to remove the child node whose text is equal to http. Currently I am using this code:
while(subchilditr.hasNext()){
Element subchild = (Element)subchilditr.next();
if (subchild.getText().equalsIgnoreCase(text)) {
message = subchild.getText();
update = "Success";
subchild.removeAttribute("priority");
subchild.removeContent();
}
}
But it does not completely remove the sub-element from the XML file. It leaves me with:
<rurl-link/>
Any suggestions?
You'll need to do this:
List<Element> elements = new ArrayList<Element>();
while (subchilditr.hasNext()) {
Element subchild = (Element) subchilditr.next();
if (subchild.getText().equalsIgnoreCase(text)) {
elements.add(subchild);
}
}
for (Element element : elements) {
element.getParent().removeContent(element);
}
If you try to remove an element inside of the loop you'll get a ConcurrentModificationException.
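The same failure mode can be reproduced with a plain java.util.ArrayList, used here as a stand-in for the element list in the question (the class SafeRemovalDemo and its methods are invented for this sketch). It also shows Iterator.remove() as an alternative to collecting first; whether the iterator in the question supports remove() is not confirmed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

class SafeRemovalDemo {
    // Sample data loosely based on the rurl-link texts in the question.
    static List<String> links() {
        return new ArrayList<>(Arrays.asList("http", "http://localhost:80", "abc", "b", "a"));
    }

    // Removes from the list while a for-each loop is iterating over it.
    // Returns true if that triggered a ConcurrentModificationException.
    static boolean removeInsideForEachFails() {
        List<String> links = links();
        try {
            for (String link : links) {
                if (link.equalsIgnoreCase("http")) {
                    links.remove(link); // structural change behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Removes through the iterator itself, which keeps it consistent.
    static List<String> removeWithIterator() {
        List<String> links = links();
        for (Iterator<String> it = links.iterator(); it.hasNext();) {
            if (it.next().equalsIgnoreCase("http")) {
                it.remove();
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println("for-each removal failed: " + removeInsideForEachFails());
        System.out.println("after iterator removal:  " + removeWithIterator());
    }
}
```

Collecting the matches first, as in the answer above, avoids the problem in the same way: the list is only modified after the iteration over it has finished.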
If you have the parent element rurl you can remove its children using the method removeChild or removeChildren.
With the standard org.w3c.dom API, use Node.removeChild():
http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Node.html#removeChild(org.w3c.dom.Node)
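For the removeChild() suggestion, a rough sketch with the JDK's org.w3c.dom classes could look like the following. It assumes the document is parsed with the standard DOM parser rather than the library used in the question, and the names RemoveChildDemo and removeLinksWithText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RemoveChildDemo {
    // The structure from the question, written on separate lines.
    static final String XML =
        "<rurl modify=\"0\" children=\"yes\" index=\"8\" name=\"R-URL\">\n"
      + "<status>enabled</status>\n"
      + "<rurl-link priority=\"3\">http</rurl-link>\n"
      + "<rurl-link priority=\"5\">http://localhost:80</rurl-link>\n"
      + "<rurl-link priority=\"4\">abc</rurl-link>\n"
      + "<rurl-link priority=\"3\">b</rurl-link>\n"
      + "<rurl-link priority=\"2\">a</rurl-link>\n"
      + "<rurl-link priority=\"1\">newlinkkkkkkk</rurl-link>\n"
      + "</rurl>";

    // Parses the sample, removes every <rurl-link> whose text equals the
    // given value (ignoring case) and returns the modified root element.
    static Element removeLinksWithText(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Element rurl = doc.getDocumentElement();
            NodeList links = rurl.getElementsByTagName("rurl-link");
            // The NodeList is live, so walk it backwards while removing.
            for (int i = links.getLength() - 1; i >= 0; i--) {
                Node link = links.item(i);
                if (link.getTextContent().equalsIgnoreCase(text)) {
                    link.getParentNode().removeChild(link);
                }
            }
            return rurl;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element rurl = removeLinksWithText("http");
        System.out.println("rurl-link elements left: "
            + rurl.getElementsByTagName("rurl-link").getLength());
    }
}
```

Removing the node from its parent drops the whole element, so no empty <rurl-link/> is left behind. Only the exact text http matches; http://localhost:80 stays.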