How can I get the first div after the h1 tag?
The HTML:
<h1> Shalom </h1>
<b> Tov </b>
<div> ddd </div> <!-- I need to take this div -->
My Java jsoup code:
Elements apresh = doc.select("h1 ~ div");
String csdsdsdf = apresh.html();
System.out.printf(csdsdsdf);
But it doesn't work. Can you help me?
From what you mentioned in the comments, I believe you are trying to extract the first of the elements matched by your selector "h1 ~ div".
You can use the following method from the Elements API:
public Element first(): Get the first matched element.
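As a rough sketch of that suggestion: first() returns null when the selector matches nothing, so a null check is worth adding. The class and method names below are made up for illustration, and a second div has been added to the sample HTML only to show that just the first match is returned.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FirstSiblingDiv {
    // Returns the text of the first <div> that follows an <h1> as a sibling,
    // or null if there is no such element.
    static String firstDivAfterH1(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("h1 ~ div").first(); // null when nothing matches
        return div == null ? null : div.text();
    }

    public static void main(String[] args) {
        String html = "<h1> Shalom </h1><b> Tov </b><div> ddd </div><div> eee </div>";
        System.out.println(firstDivAfterH1(html));
    }
}
```

text() is used here instead of html() so that surrounding whitespace is normalized and trimmed.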
I've found two ways to do this:
Document doc = Jsoup.parse("<h1> Shalom </h1>" +
"<b> Tov </b>" +
"<div> ddd </div>");
// 1 Select DIV which is after B which is after H1.
System.out.println(doc.select("h1 + b + div"));
// 2 More flexible solution which involves going one level up to parent
// and then selecting the first DIV.
System.out.println(doc.select("h1").first().parent().select("div").first());
I am working on a web scraper using Jsoup and want to pull a link out of a table.
This is what I'm looking at:
<ul class="inline-list indent">
<li>
::marker
Some Other Text
(Date & Time Stamp)
</li>
I want www.linkhere.com and Some Other Text. I have already figured out how to get Some Other Text, but I can't get www.linkhere.com.
This is what I tried:
Document results = Jsoup.connect(url).get();
String tTable = "li:nth-of-type(1)";
for (Element row : results.select("ul.indent.inline-list:nth-of-type(1)")) {
    Element link = results.select("ul.indent.inline-list:nth-of-type(1) > a").first();
    String tName = row.select(tTable).text();
    String articleLink = link.attr("href");
    System.out.println(tName);
    System.out.println(articleLink);
}
This gives me the error:
NullPointerException: Cannot invoke "org.jsoup.nodes.Element.attr(String)" because "link" is null
You're using this selector:
"ul.indent.inline-list:nth-of-type(1) > a"
The first part, ul.indent.inline-list:nth-of-type(1), selects the first <ul> element. The second part, > a, expects <a> to be a direct child of <ul>. That will not match what you want because there is an <li> element between them, so select() returns no elements and first() returns null. The solution is to use:
"ul.indent.inline-list:nth-of-type(1) > li > a"
or, if your intent was to match the first <li>, use:
"ul.indent.inline-list > li:nth-of-type(1) > a"
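To illustrate the difference between the two selectors, here is a small sketch. The question's HTML does not show the actual link, so the href below (https://example.com/article) and the class and method names are placeholders for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ListLinkSelector {
    // Hypothetical markup: the real link is not shown in the question.
    static final String SAMPLE =
            "<ul class=\"inline-list indent\">"
          + "<li><a href=\"https://example.com/article\">Some Other Text</a> (Date)</li>"
          + "</ul>";

    // Returns the href of the first element matched by the selector, or null if nothing matches.
    static String hrefOf(String html, String selector) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select(selector).first();
        return link == null ? null : link.attr("href");
    }

    public static void main(String[] args) {
        // <a> is not a direct child of <ul>, so this matches nothing.
        System.out.println(hrefOf(SAMPLE, "ul.indent.inline-list:nth-of-type(1) > a"));
        // Going through <li> matches the link.
        System.out.println(hrefOf(SAMPLE, "ul.indent.inline-list > li:nth-of-type(1) > a"));
    }
}
```

Checking the result of first() for null before calling attr() avoids the NullPointerException when the page layout changes.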
I would like to extract text from a specific <div> of a website using jsoup, but I'm not sure how.
The problem is that I want to get the text from a div that has class="name".
But there can be more <div>s with this class (and I don't want to get the text from those).
It looks like this in the HTML file:
.
.
<div class="name">
Some text I don't want
<span class="a">Tree</span>
</div>
.
.
<div class="name">Some text I do want</div>
.
.
So the only difference there is that the <div> I want the text from does not have <span> inside of it. But I have not found a way to use that as a key to extract the text in jsoup.
Is it possible?
Use jsoup's selector syntax. For instance, to select all divs with class="name", use:
Elements nameElements = doc.select("div.name");
Note that the text you "do" and "don't" want above sits in the same relative HTML location, and it isn't clear from the markup alone why you want one and not the other. HTML and jsoup will see them the same way.
If you want to avoid elements containing span elements, then one way is to iterate through the elements obtained above and test with a selector whether they contain span elements:
Elements nameElements = doc.select("div.name");
for (Element element : nameElements) {
    if (element.select("span").isEmpty()) {
        System.out.println("No span");
        System.out.println(element.text());
        System.out.println();
    } else {
        System.out.println("span");
        System.out.println(element.text());
        System.out.println();
    }
}
You can select all div elements with class="name", and then loop through them. Check if an element has child elements - if not, this is the div you want.
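A possible sketch of that child-element check, using the HTML from the question. The class and method names are invented for illustration; children() returns only element children, so a div containing nothing but text has an empty list.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LeafNameDivs {
    // Collects the text of every div.name that has no child elements.
    static List<String> textOfLeafNameDivs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element element : doc.select("div.name")) {
            if (element.children().isEmpty()) { // no <span> or any other child element
                result.add(element.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
                "<div class=\"name\">Some text I don't want<span class=\"a\">Tree</span></div>"
              + "<div class=\"name\">Some text I do want</div>";
        System.out.println(textOfLeafNameDivs(html));
    }
}
```

This is stricter than the span test above: it skips a div that contains any child element, not just a span.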
I would like to get the text (which is empty right now but will contain some text in the future, so printing an empty value is fine for now) from the second element with the class "109-top-dark-grey-block ng-binding". I tried tabIndex and nth-child, but neither works.
"
<div class="122-top-section-btm-half">
<div class="108-top-grey-m12x3"></div>
<div class="109-top-dark-grey-block ng-binding">ab ab xyz</div>
</div>
"
"
<div class="d122-top-section-btm-half">
<div class="108-top-grey-m12x4"></div>
<div class="109-top-dark-grey-block ng-binding"></div>
"
Update
To get the text of the second div block, nth-child should work. I tested the selector locally in Chrome DevTools.
So in your Java:
String elementText = driver.findElement(By.cssSelector(".d122-top-section-btm-half:nth-child(2) .ng-binding")).getText();
That should do the trick: per the CSS spec, nth-child is 1-indexed, not 0-indexed, so :nth-child(2) matches the second child.
Old Answer
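Based on the HTML snippet you provided, you could use a CSS selector. Note that a CSS class selector cannot start with an unescaped digit, so the 109-top-dark-grey-block class is matched with an attribute selector here:
String elementText = driver.findElement(By.cssSelector(".d122-top-section-btm-half [class~='109-top-dark-grey-block']")).getText();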
Or, if you are just after the element with ng-binding inside your first div, this would be cleaner:
String elementText = driver.findElement(By.cssSelector(".d122-top-section-btm-half .ng-binding")).getText();
Both return the element's text. It may be worth looking at a CSS selectors guide to learn more.
For example:
<div>
this is first
<div>
second
</div>
</div>
I am working on Natural Language Processing and I have to translate a website (without using Google Translate). For that, I have to extract the two sentences "this is first" and "second" separately, so that I can replace them with text in another language in their respective divs. If I extract the text for the first div, it will show "this is first second", and if I use recursion to dig deeper, it will only extract "second".
Help me out please!
EDIT
Using the ownText() method causes a problem with the following HTML code:
<div style="top:+0.2em; font-size:95%;">
the
<a href="/wiki/Free_content" title="Free content">
free
</a>
<a href="/wiki/Encyclopedia" title="Encyclopedia">
encyclopedia
</a>
that
<a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">
anyone can edit
</a>
.
</div>
It will print:
the that.
free
encyclopedia
anyone can edit
But it must be:
the
that
.
encyclopedia
anyone can edit
If i extract text for first it will show "this is first second"
Use ownText() instead of text() and you'll get only the text that the element contains directly.
Here's an example:
final String html = "<div>\n"
+ " this is first\n"
+ " <div>\n"
+ " second\n"
+ " </div>\n"
+ "</div>";
Document doc = Jsoup.parse(html); // Get your Document from somewhere
Element first = doc.select("div").first(); // Select 1st element - take the first found
String firstText = first.ownText(); // Get own text
Element second = doc.select("div > div").first(); // Same as above, but with 2nd div
String secondText = second.ownText();
System.out.println("1st: " + firstText);
System.out.println("2nd: " + secondText);
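For mixed content like the Wikipedia snippet in the question's edit, ownText() joins all of the element's direct text into one string. One way around that, sketched here under the assumption that each piece should be handled separately, is to walk the element's textNodes() instead. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class OwnTextPieces {
    // Returns each non-blank text node that is a direct child of the element,
    // as a separate trimmed string, in document order.
    static List<String> ownTextPieces(Element element) {
        List<String> pieces = new ArrayList<>();
        for (TextNode node : element.textNodes()) {
            String text = node.text().trim();
            if (!text.isEmpty()) { // skip whitespace between child elements
                pieces.add(text);
            }
        }
        return pieces;
    }

    static List<String> piecesOfFirstDiv(String html) {
        Document doc = Jsoup.parse(html);
        return ownTextPieces(doc.select("div").first());
    }

    public static void main(String[] args) {
        String html = "<div>the <a href=\"/wiki/Free_content\">free</a> "
                + "<a href=\"/wiki/Encyclopedia\">encyclopedia</a> that "
                + "<a href=\"/wiki/Wikipedia:Introduction\">anyone can edit</a>.</div>";
        System.out.println(piecesOfFirstDiv(html));
    }
}
```

Each TextNode can then be translated on its own, and the link texts can be handled the same way by calling ownTextPieces on each <a> element.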
You can use an XML parser in whatever language you are using. Here is one for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
It seems like you're using textContent on the divs to extract the content, which gets you the content of that element and of all descendant elements. (Java: this would be the getTextContent method on the Element.)
Instead, examine the childNodes (Java: the getChildNodes method on the Element). The nodes have a property "nodeType" (Java: getNodeType) which you can look at to work out whether the node is a Text Node (Java: Node.TEXT_NODE) or an Element (Java: Node.ELEMENT_NODE). So, to take your example, you have a tree of Nodes which looks like this:
div (Element)
    this is first (TextNode)
    div (Element)
        second (TextNode)
The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".
So loop over the nodes in the outer div: if the node is a text node, translate it; otherwise, recurse into the Element. Note that there are other kinds of nodes, such as Comments, but for your purposes you can probably ignore those.
Assuming you're using the w3c DOM API
http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
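The traversal described above could look roughly like this with the JDK's built-in W3C DOM parser. This is only a sketch: it assumes the input is well-formed XML (real-world HTML often is not), and the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeWalker {
    // Recursively collects the trimmed content of every non-blank text node, in document order.
    static void collectText(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text); // a translator would replace the node's text here
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collectText(child, out); // recurse into nested elements
            }
            // Other node types (comments etc.) are ignored.
        }
    }

    static List<String> textPieces(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collectText(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>\n this is first\n <div>\n second\n </div>\n</div>";
        System.out.println(textPieces(xml));
    }
}
```

To write a translation back, setNodeValue on the text node can be called at the point marked in the comment.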
Elements divs = doc.getElementsByTag("div");
for (Element element : divs) {
    System.out.println(element.text());
}
This should work if you are using the jsoup HTML parser.
I have an HTML file like this:
<div class="student">
<h4 id="Classnumber100" class="studentheading">
<a id="studentlink22" href="/grade8/greg">22. Greg</a>
</h4>
<div class="studentcategories">
<div class="studentneighborhoods">
</div>
</div>
</div>
I want to use jsoup to get the URL /grade8/greg and the text "22. Greg".
I tried this selector:
Elements listo = doc.select("h4 #studentlink22");
but I am not able to get the values.
Actually, I want to select based on Classnumber100.
There are 300 records in the HTML page, and the only consistent thing is Classnumber100.
So I want my selector to select all the hrefs and text after Classnumber100.
How can I do that?
I tried
doc.select("class#studentheading"); and many other possibilities, but none of them work.
First of all, multiple elements should not share the same id, so each of these elements should not have the id Classnumber100. However, if this is the case, then you can still select them using the selector [id=Classnumber100].
If you're only interested in the a tags inside, then you can use [id=Classnumber100] > a.
Upon re-reading the question, it appears that the h4 tags you're interested in share the class attribute studentheading. In that case you can use the class selector, i.e.
doc.select(".studentheading > a")
The select method looks for the HTML tag, here h4 and a, and then secondarily at the attributes if you tell it to do so. Have you gone to the jsoup site? The use of select is well described there for this situation.
E.g.
// code not tested
Elements listo = doc.select("h4[id=Classnumber100]").select("a");
String text = listo.text(); // for "22. Greg"
String path = listo.attr("href"); // for "/grade8/greg"
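With many records on the page, calling text() on the whole Elements collection joins the text of all matches, and attr() returns the value from the first match that has the attribute. A hedged sketch of looping over the matches instead follows; the second student record in the sample HTML and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class StudentLinks {
    // Returns one "href|text" entry per link found directly inside an h4.studentheading.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> entries = new ArrayList<>();
        for (Element a : doc.select("h4.studentheading > a")) {
            entries.add(a.attr("href") + "|" + a.text());
        }
        return entries;
    }

    public static void main(String[] args) {
        String html =
                "<div class=\"student\"><h4 id=\"Classnumber100\" class=\"studentheading\">"
              + "<a id=\"studentlink22\" href=\"/grade8/greg\">22. Greg</a></h4></div>"
              // Hypothetical second record, not from the question:
              + "<div class=\"student\"><h4 id=\"Classnumber100\" class=\"studentheading\">"
              + "<a id=\"studentlink23\" href=\"/grade8/anna\">23. Anna</a></h4></div>";
        for (String entry : extractLinks(html)) {
            System.out.println(entry);
        }
    }
}
```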