I am parsing an HTML page using jsoup. Here is what I have done so far:
doc = Jsoup.connect("http://www.marketimyilmazlar.com/index.php?route=product/category&path=141_77").get();
Element page_clips = doc.getElementById("page_clips");
Element page_clip_content = page_clips.getElementById("content");
Elements allProductPricesOnPage = page_clip_content.getElementsByClass("price");
Now, when I write:
allProductNamesOnPage.get(0);
it returns the following:
<div class="name">
<a href="http://www.marketimyilmazlar.com/index.php?route=product/product&path=141_77&product_id=4309"> here is the text</a>
</div>
What I want is to get the "here is the text" part of that object. Can anyone help me with this?
Thanks
You might want to iterate over the Elements you have gathered and print their prices one by one:
Elements allProductPricesOnPage = page_clip_content
        .getElementsByClass("price");
for (Element el : allProductPricesOnPage) {
    System.out.println(el.text());
}
This prints:
19.99 TL KDV Dahil
9.99 TL KDV Dahil
14.99 TL KDV Dahil
Here you are selecting an Elements collection, which implements Iterable (see the Javadoc), so you can loop over the individual Element objects it contains.
Each of these Element objects, which repeat within your HTML, holds the information you want to extract.
If you want to extract only the text, you can call the text() method:
String text = allProductNamesOnPage.get(0).text();
This method gets the text of an Element and its combined children. So if you want to ensure that you are only extracting text from the a element, call text() on the first child element:
String text = allProductNamesOnPage.get(0).child(0).text();
See here: http://jsoup.org/cookbook/extracting-data/attributes-text-html
Related
I am writing a JUnit test for a webpage, using Selenium, and I am trying to verify that the expected text exists within a page. The code of the webpage I am testing looks like this:
<div id="recipient_div_3" class="label_spacer">
<label class="nodisplay" for="Recipient_nickname"> recipient field: reqd info </label>
<span id="Recipient_nickname_div_2" class="required-field"> *</span>
Recipient:
</div>
I want to compare what is expected with what is on the page, so I want to use
Assert.assertTrue(). I know that to get everything from the div, I can do
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ");
but this will return "reqd info * Recipient:"
Is there any way to just get the text from the div ("Recipient") using cssSelector, without the other tags?
You can't do this with a CSS selector, because CSS selectors don't have a fine-grained enough approach to express "the text node contained in the DIV but not its other contents". You can do that with an XPath locator, though:
driver.findElement(By.xpath("//div[@id='recipient_div_3']/text()")).getText()
That XPath expression will identify just the single text node that is a direct child of the DIV, rather than all the text contained within it and its child nodes.
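To see what the text() step selects, here is a minimal sketch that evaluates the same expression with the JDK's built-in XPath engine against a well-formed copy of the markup from the question. It does not use Selenium: the class name DirectTextXPath, the helper directText, and the SAMPLE constant are made up for illustration, and the sketch says nothing about how a particular WebDriver handles a locator that resolves to a text node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirectTextXPath {
    // Well-formed copy of the markup from the question (assumption: it parses as XML).
    static final String SAMPLE =
            "<div id=\"recipient_div_3\" class=\"label_spacer\">\n"
          + "<label class=\"nodisplay\" for=\"Recipient_nickname\"> recipient field: reqd info </label>\n"
          + "<span id=\"Recipient_nickname_div_2\" class=\"required-field\"> *</span>\n"
          + "Recipient:\n"
          + "</div>";

    // Hypothetical helper: joins the text nodes matched by the expression
    // and collapses runs of whitespace.
    static String directText(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Only the text nodes that are direct children of the div are matched.
        System.out.println(directText(SAMPLE, "//div[@id='recipient_div_3']/text()")); // prints "Recipient:"
    }
}
```

The label and span text are skipped because text() matches only text nodes whose parent is the div itself.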
I am not sure whether it is possible with a single CSS locator, but you can get the text of the div, then get the text of the div's child nodes and subtract it. Something like this (the code wasn't checked):
String temp = "";
List<WebElement> tempElements = driver.findElements(By.cssSelector("div[id='recipient_div_3'] *"));
for (WebElement tempElement : tempElements) {
    temp += " " + tempElement.getText();
}
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ").replace(temp, "");
This is for the case where you want to avoid XPath. With XPath you can do it directly:
//div[@id='recipient_div_3']/text()
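The subtraction idea can be tried out with plain strings before wiring it to a browser. The sketch below is an assumption-laden stand-in: fullText plays the role of the div's getText() result and childTexts the getText() results of its descendants; the class SubtractChildText and the helper ownText are hypothetical names. It removes each child's text separately instead of one concatenated string, so the spacing between children does not have to match exactly.

```java
import java.util.Arrays;
import java.util.List;

class SubtractChildText {
    // Hypothetical helper: strips every non-empty child text from the full text.
    static String ownText(String fullText, List<String> childTexts) {
        String result = fullText;
        for (String child : childTexts) {
            String trimmed = child.trim();
            if (!trimmed.isEmpty()) {
                // Note: this removes every occurrence, so it is fragile when
                // a child's text also appears in the parent's own text.
                result = result.replace(trimmed, "");
            }
        }
        return result.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String full = "reqd info * Recipient:";
        List<String> children = Arrays.asList("reqd info", "*");
        System.out.println(ownText(full, children)); // prints "Recipient:"
    }
}
```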
You could also get the text content of an element and remove the tags with a regular expression. Note that you should use a reluctant quantifier:
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
String getTextContentWithoutTags(WebElement element) {
    // Reluctant match of anything between "<" and the next ">"
    return element.getText().replaceAll("<[^>]*?>", "").trim();
}
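The difference between a reluctant and a greedy quantifier is easy to see on a plain string. This is only a sketch on a hand-written snippet (the class StripTags and both helper names are invented for the example), and regular expressions are not a substitute for a real HTML parser.

```java
class StripTags {
    // Reluctant: ".*?" stops at the first ">" after each "<".
    static String stripReluctant(String html) {
        return html.replaceAll("<.*?>", "").trim();
    }

    // Greedy: ".*" runs to the last ">" on the line, swallowing the text in between.
    static String stripGreedy(String html) {
        return html.replaceAll("<.*>", "").trim();
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        System.out.println("[" + stripReluctant(html) + "]"); // prints "[bold and italic]"
        System.out.println("[" + stripGreedy(html) + "]");    // prints "[]"
    }
}
```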
I would like to extract text from a specific <div> of a website using jsoup, but I'm not sure how.
The problem is that I want to get the text from a div that has class="name".
But there can be more <div>s with this class (and I don't want to get the text from those).
It looks like this in the HTML file:
.
.
<div class="name">
Some text I don't want
<span class="a">Tree</span>
</div>
.
.
<div class="name">Some text I do want</div>
.
.
So the only difference there is that the <div> I want the text from does not have <span> inside of it. But I have not found a way to use that as a key to extract the text in jsoup.
Is it possible?
Use jsoup's selector syntax. For instance, to select all divs with class="name", use
Elements nameElements = doc.select("div.name");
Note that the text you "do" and "don't" want above sits in the same relative HTML location, so it is not clear from the markup alone why you want one and not the other. HTML and jsoup will see them the same way.
If you want to avoid elements containing span elements, then one way is to iterate through the elements obtained above and test by selector if they have span elements or not:
Elements nameElements = doc.select("div.name");
for (Element element : nameElements) {
    if (element.select("span").isEmpty()) {
        System.out.println("No span");
        System.out.println(element.text());
        System.out.println();
    } else {
        System.out.println("span");
        System.out.println(element.text());
        System.out.println();
    }
}
You can select all div elements with class="name", and then loop through them. Check if an element has child elements - if not, this is the div you want.
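The "no child elements" check can also be written as a single query. The sketch below deliberately swaps jsoup for the JDK's own DOM and XPath classes so that it runs without extra libraries; it assumes the snippet is well-formed XML, and the class LeafNameDivs, the helper leafNameTexts, and the wrapping <root> element are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LeafNameDivs {
    // Hypothetical helper: text of every div.name that has no child elements.
    static List<String> leafNameTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // not(*) is true only when the div has no element children.
            NodeList divs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//div[@class='name'][not(*)]", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                texts.add(divs.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the snippet", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<div class=\"name\">Some text I don't want<span class=\"a\">Tree</span></div>"
                + "<div class=\"name\">Some text I do want</div>"
                + "</root>";
        System.out.println(leafNameTexts(xml)); // prints "[Some text I do want]"
    }
}
```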
For example:
<div>
this is first
<div>
second
</div>
</div>
I am working on natural language processing and have to translate a website (without using Google Translate). To do that I have to extract the two sentences "this is first" and "second" separately, so that I can replace them with text in another language in their respective divs. If I extract the text of the first div, I get "this is first second", and if I use recursion to dig deeper, I only get "second".
Help me out please!
EDIT
Using the ownText() method causes a problem with the following HTML code:
<div style="top:+0.2em; font-size:95%;">
the
<a href="/wiki/Free_content" title="Free content">
free
</a>
<a href="/wiki/Encyclopedia" title="Encyclopedia">
encyclopedia
</a>
that
<a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">
anyone can edit
</a>
.
</div>
It will print:
the that.
free
encyclopedia
anyone can edit
But it must be:
the
that
.
encyclopedia
anyone can edit
If i extract text for first it will show "this is first second"
Use ownText() instead of text() and you'll get only the text the element contains directly.
Here's an example:
final String html = "<div>\n"
+ " this is first\n"
+ " <div>\n"
+ " second\n"
+ " </div>\n"
+ "</div>";
Document doc = Jsoup.parse(html); // Get your Document from somewhere
Element first = doc.select("div").first(); // Select 1st element - take the first found
String firstText = first.ownText(); // Get own text
Element second = doc.select("div > div").first(); // Same as above, but with 2nd div
String secondText = second.ownText();
System.out.println("1st: " + firstText);
System.out.println("2nd: " + secondText);
You can use an XML parser in whatever language you are using. Here is one for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
It seems like you're using textContent on the divs to extract the content, which gets you the content of that element and all descendant elements. (Java: this would be the getTextContent method on the Element.)
Instead, examine the childNodes (Java: the getChildNodes method on the Element). The nodes have a property "nodeType" (Java: getNodeType) which you can look at to work out whether the node is a text node (Java: Node.TEXT_NODE) or an element (Java: Node.ELEMENT_NODE). So, to take your example, you have a tree of nodes which looks like this...
div (Element)
    this is first (TextNode)
    div (Element)
        second (TextNode)
The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".
So loop over the nodes in the outer div, if the node is a text node, translate, otherwise recurse into the Element. Note that there are other kinds of nodes, Comments and the like, but for your purposes you can probably ignore those.
Assuming you're using the w3c DOM API
http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
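The traversal described above can be sketched with the w3c DOM API that ships with the JDK. This is a minimal illustration that assumes the page fragment is well-formed XML; the class TextNodeWalker and the helpers collect and textNodes are hypothetical names, and a real translator would replace the node's value where this sketch only records it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeWalker {
    // Visits children in document order: text nodes are recorded,
    // elements are recursed into, everything else (comments etc.) is ignored.
    static void collect(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text); // a translator could call child.setNodeValue(...) here
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(child, out);
            }
        }
    }

    static List<String> textNodes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the snippet", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>\n this is first\n <div>\n second\n </div>\n</div>";
        System.out.println(textNodes(xml)); // prints "[this is first, second]"
    }
}
```

Because each text node is handled on its own, text that is split around child elements (as in the "the ... that ... ." example from the question) comes out as separate pieces in document order rather than merged.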
Elements divs=doc.getElementsByTag("div");
for (Element element : divs) {
    System.out.println(element.text());
}
This should work if you are using jsoup HTML parser.
I have this span: <span style="font-weight:bold;">bold. </span> and a reference to it(the element) called span.
I want to wrap everything inside of my span element in some new tags, example <p> tags: <span style="font-weight:bold;"><p>bold. </p></span>
I know I can call span.wrap("<p></p>"), but this wraps the span and not the span's contents. When I try span.append("<p>"), the new tags are just created at the beginning of the contents, and the same happens with appendElement.
What is the best way to wrap the contents of a span/element and not the whole element?
Update: Elements also has wrap but calling span.getAllElements() and then wrap on that provides the same result as span.wrap() and span.children() is 0 for this example.
Update 2: As a workaround I was able to get the content with span.html(), store it in a temporary String, add the desired tags around it, and then set the span's content with span.html(newContent). If there is no better way, I will just answer my own question.
To wrap the text node, use:
span.childNode(0).wrap("<p>");
Edit:
an example with various use cases:
String html = "<span style=\"font-weight:bold;\">bold.</span><span></span><span><a>text</a></span>";
Document parsedDoc = Jsoup.parse(html);
Elements selects = parsedDoc.select("span");
for (Element span : selects) {
    List<Node> childNodes = span.childNodes();
    if (childNodes.size() > 0 && span.childNode(0).childNodes().size() == 0) {
        span.childNode(0).wrap("<p>");
    }
}
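What "wrap the contents, not the element" means at the tree level can be shown without jsoup. The following sketch uses the JDK's w3c DOM classes as a stand-in (so the method names differ from jsoup's), assumes the markup is well-formed XML, and introduces the hypothetical class WrapContents with helpers parseRoot and wrapContents: every child of the target is moved into a new wrapper element, which then becomes the target's only child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class WrapContents {
    // Hypothetical helper: parses the snippet and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the snippet", e);
        }
    }

    // Moves all children of target into a new element named tagName.
    static Element wrapContents(Element target, String tagName) {
        Element wrapper = target.getOwnerDocument().createElement(tagName);
        while (target.hasChildNodes()) {
            // appendChild detaches the node from its old parent first.
            wrapper.appendChild(target.getFirstChild());
        }
        target.appendChild(wrapper);
        return wrapper;
    }

    public static void main(String[] args) {
        Element span = parseRoot("<span style=\"font-weight:bold;\">bold. </span>");
        wrapContents(span, "p");
        System.out.println(span.getFirstChild().getNodeName());              // prints "p"
        System.out.println("[" + span.getFirstChild().getTextContent() + "]"); // prints "[bold. ]"
    }
}
```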
I am trying to parse http://www.craigslist.org/about/sites to build a set of text/links to load a program dynamically with this information. So far I have done this:
Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements elms = doc.select("div.colmask"); // gets 7 countries
Below this tag are the doc.select("div.state_delimiter,ul") tags I am trying to get. I set up my iterator, go into a while loop, and call iterator.next().outerHtml();. I see all the tags for each country.
How can I step through each div.state_delimiter, pull that text, then go down until there is a </ul>, which marks the end of that state's individual county/city links and text?
I was playing around with this and can do it by assigning outerHtml() to a String and then parsing the string manually, but I am sure there is an easier way. I have tried text() and also attr("div.state_delimiter"), but I think I am getting the pattern wrong. Could someone show me how to get the text of each div.state_delimiter, followed by all the <li></li> items under the <ul></ul> for that state? I am looking to grab the http:// links and the HTML that goes along with them as easily as possible.
The <ul> containing the cities is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to grab it from that div on. Here's a kickoff example:
Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements countries = document.select("div.colmask");
for (Element country : countries) {
    System.out.println("Country: " + country.select("h1.continent_header").text());
    Elements states = country.select("div.state_delimiter");
    for (Element state : states) {
        System.out.println("\tState: " + state.text());
        Elements cities = state.nextElementSibling().select("li");
        for (Element city : cities) {
            System.out.println("\t\tCity: " + city.text());
        }
    }
}
The doc.select("div.state_delimiter,ul") call doesn't do what you want. It returns all <div class="state_delimiter"> and <ul> elements of the document. Parsing the result manually with string functions makes no sense if you already have an HTML parser at hand.