JSoup Parsing issues

I am working on an Android application that parses a website, but I can't seem to get Jsoup to work.
I am trying to parse this html:
Here's a pic
My code just now is:
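Document doc = null;
try {
    doc = Jsoup.connect("URL").get();
    Elements tds = doc.select("table.tr>td");
    for (Element td : tds) {
        String tdText = td.text();
        System.out.println(tdText);
    }
} catch (IOException e) {
    // Jsoup.connect(...).get() throws IOException on network or HTTP errors
    e.printStackTrace();
}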
At the moment it does not print anything, but if I print 'doc' it returns the whole website.
I am trying to extract the following information:
Drower, E. S. (Ethel Stefana), Lady, b. 1879, with or without the &nbsp;.
But I can't seem to get it to work.
Thanks for your help!

Your selector is wrong: it picks td children of a table element with class tr, while you probably want the td cells in the tr rows of a table. I believe you could get at them just by using "td" as the selector.
However, that's a bit too generic, since it will pick every cell in the table. If the cell you need is always the third cell in each row of that table, you can refine the selector to pick only those: "td:eq(2)" (the index is zero-based). It is worth getting the knack of JSoup selectors and experimenting a little to see how far you can restrict the extracted data to just the elements you really need.
To obtain the text after the <script> element in the fourth cell you could use something along the following snippet:
Element td = doc.select("td:eq(3)").first();
System.out.println(td.text());
because, from a little experiment of mine, it seems that JavaScript code inside <script> tags is skipped when you ask for the text of an element that contains one.
You would use a for loop rather than first(), though, since there are as many fourth cells as there are rows in your document, and you have a lot of them.

Related

How to find xpath element partially using selenium with java?

I'm trying to partially match an element with XPath in my Selenium code.
My XPath is
//iron-pages[@id='pages']//span[.='s8718216']
What I want is to select any element whose text starts with s, the element just after span.
I tried this:
//iron-pages[starts-with(span[.='s'])
It doesn't work for me.
Can someone help me.

I think this XPath should work: //iron-pages[starts-with(text(),'s')]
Or, as a second try:
//iron-pages[starts-with(.,'s')] <- . instead of text() tests the element's whole string value, not only its first direct text node.
That string value is the concatenated text of the element and all of its descendants, so it also covers text that sits inside nested elements.
EDIT:
I just read your question again. You want to select the element right after span, so:
//iron-pages[@id='pages']//span[starts-with(text(),'s')] <- it will select span elements whose text starts with s.
If you want child elements you can use
//iron-pages[@id='pages']//span//*[starts-with(text(),'s')]
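To see how the exact-match and starts-with predicates differ, here is a minimal sketch that runs with the JDK's built-in javax.xml.xpath instead of Selenium. The markup in XML is made up for illustration (the real page is not shown in the question), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StartsWithDemo {
    // Hypothetical markup: three spans, two of which start with "s".
    static final String XML =
        "<iron-pages id='pages'>"
      + "<div><span>s8718216</span><span>t123</span></div>"
      + "<div><span>s42</span></div>"
      + "</iron-pages>";

    /** Evaluates the XPath against XML and returns the text of every matched node. */
    static List<String> select(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Exact match: only the span whose text is exactly s8718216.
        System.out.println(select("//iron-pages[@id='pages']//span[.='s8718216']"));
        // Prefix match: every span whose text starts with s.
        System.out.println(select("//iron-pages[@id='pages']//span[starts-with(text(),'s')]"));
    }
}
```

With Selenium the same expression strings would be passed to By.xpath(...); only the evaluation engine differs.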

Your XPath should be
//iron-pages[@id='pages']//span[starts-with(.,'s')]/following-sibling::i[1]
It will get the first i element that follows a span whose text starts with s.

Get the child of an HTML element by its XPath

I have a really simple question: I have a div in an HTML page that I can access by its XPath, which is: //div[2]/div/div[2]/div[2]/div[2].
I want an XPath that would give all the children of this div, but I can't find it.
I need to get the elements with the findElements method of Selenium, but what I've tested does not work.
My HTML code looks like this :
<div>
<input/>
<span/>
<p></p>
</div>
And the XPath I want to use like this :
//div[2]/div/div[2]/div[2]/div[2]/child
And my Java Selenium script like this :
List<WebElement> listElement = driver.findElements(By.xpath(xpath));
for(WebElement element : listElement) {
System.out.println(element.getAttribute("id"));
}
What XPath should I use to get the children of the div?
EDIT 1: I did use the * and it's working, but when I count the number of elements it prints 6. Does the * consider the children of its children as its own children?

If the div in the HTML fragment in your question is located at
//div[2]/div/div[2]/div[2]/div[2]
then the input child element would be here:
//div[2]/div/div[2]/div[2]/div[2]/input
and all of the children elements would be here:
//div[2]/div/div[2]/div[2]/div[2]/*
Update:
EDIT 1: I did use the * and it's working, but when I count the number
of elements it prints 6. Does the * consider the children of its
children as its own children?
No, div/* selects only the immediate children elements of the parent div.
If you're being surprised by a greater number of children than expected, it may be that the base XPath is selecting multiple elements, and you're then selecting the children of more than just the targeted div element.
Update 2:
If you cannot post an MCVE, and you're still puzzled about the number of children returned by
//div[2]/div/div[2]/div[2]/div[2]/*
try challenging yourself on the XPath you've provided as the base:
//div[2]/div/div[2]/div[2]/div[2]
First, try
//div[2]
Does it really select the single div as you expect?
Then try
//div[2]/div
Again, see if this one really selects the single div you expect.
Continue in this manner until you get to a place where the reality of the selected elements deviates from your expectations. There your answer will be obvious, or you'll have a more specific question to ask.
Without seeing your XML/HTML, this is about as good as the advice can get.
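The step-by-step check described above can be sketched with the JDK's javax.xml.xpath. The markup below is invented (the asker's page is not shown) and is built so that //div[2] matches one div per section, which is one plausible way to end up with 6 children instead of 3. ChildCountDemo and count are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildCountDemo {
    // Hypothetical page: two sections, each with a second div that has three children.
    static final String XML = "<body>"
        + "<section><div/><div><input/><span/><p/></div></section>"
        + "<section><div/><div><input/><span/><p/></div></section>"
        + "</body>";

    /** Returns how many nodes the XPath selects in XML. */
    static int count(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Check the base first, then its children, then the children of one match only.
        for (String xpath : new String[] {"//div[2]", "//div[2]/*", "(//div[2])[1]/*"}) {
            System.out.println(xpath + " -> " + count(xpath));
        }
    }
}
```

Here //div[2] means "every div that is the second div child of its parent", so it selects two elements and /* returns the children of both. Wrapping the base in parentheses and taking [1] restricts it to the first match.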

List<WebElement> childs = driver.findElements(By.xpath("//div[2]/div/div[2]/div[2]/div[2]/*"));
In that case Selenium searches for all child elements of the path //div[2]/div/div[2]/div[2]/div[2] and returns them in a List of WebElement objects.

Apache POI for docx Insert Text On Specific Page

I'm trying to make Table of Contents for my Word document docx.
Apache POI is still too buggy. The document.createTOC() does not produce anything unless placed at the end. Sometimes, it doesn't give the correct page numbers.
The document.enforceUpdateFields() doesn't do anything!
So I thought I'd write my own method that creates the Table of Contents. However, I will call it at the end, but I need it to be inserted at the beginning!
In other words, suppose my document at some point in my program has some text on the first page and the second page. And I haven't yet saved it; how do I insert at the beginning of first page?

I haven't tried this yet, but after you write the document, reload it and then try the following:
List<XWPFParagraph> paragraphs = document.getParagraphs();
XWPFRun run = paragraphs.get(0).insertNewRun(0); // first paragraph, 0 is the position
run.setText("your data here");

Get content of list of span elements with HTMLUnit and XPath

I want to get a list of values from an HTML document. I am using HTMLUnit.
There are many span elements with the class topic. I want to extract the content within the span tags:
<span class="topic">
Lean Startup
</span>
My code looks like this:
List<?> topics = (List)page.getByXPath("//span[@class='topic']/text()");
However whenever I try to iterate over the list I get a NoSuchElementException. Can anyone see an obvious mistake? Also links to good tutorials would be appreciated.

If you know you'll always have an <a> then just add it to the XPath and then get the text() from the a.
If you don't know whether there will always be an a in there, then I'd recommend using the .asText() method that HtmlElement and all of its subclasses have.
So first get each of the spans:
List<?> topics = (List)page.getByXPath("//span[@class='topic']");
And then, in the loop, get the text inside each of the spans:
topic.asText();

text() will only extract the text from that element, and that example you've given has no text component, only a child element.
Try this instead:
List<?> topics = (List)page.getByXPath("//span[@class='topic']");
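
The difference between text() and the element's own text can be shown without HtmlUnit. The sketch below uses the JDK's javax.xml.xpath, so it illustrates XPath semantics only, not HtmlUnit's API. It assumes, as both answers suggest, that each span wraps an <a>; the href values and the second topic are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TopicTextDemo {
    // Assumed structure: the visible text lives in a child <a>, not directly in the span.
    static final String XML = "<div>"
        + "<span class='topic'><a href='#'>Lean Startup</a></span>"
        + "<span class='topic'><a href='#'>Kanban</a></span>"
        + "</div>";

    /** Returns the trimmed text content of every node the XPath selects. */
    static List<String> select(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // The span has no direct text node here, so this selects nothing.
        System.out.println(select("//span[@class='topic']/text()"));
        // Selecting the span itself and reading its content includes the <a> text.
        System.out.println(select("//span[@class='topic']"));
        // Alternatively, step into the <a> explicitly.
        System.out.println(select("//span[@class='topic']/a/text()"));
    }
}
```

An empty result from the first expression is consistent with the error in the question if the calling code reads from the list without checking that it has elements.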

Getting element in WebDriver, not by xpath

I am using WebDriver for test automation on a site whose code is auto-generated, probably by GWT.
All ids are of the form "x-auto-4009", which is not a very reliable way of getting to elements.
I have a page with something like a form.
It's like
Label name | <---- Input ---->
Label name | <---- Input ---->
Label name | <---- Input ---->
Each new line is coded as a new table in the HTML.
Can you tell me the best way of getting to a specific input in a more generic way?
I wrote a method that takes a label name and then finds all elements by tag TR. Next it gets these rows and scans them for the label name. If it's found, I then find the elements with tag input within that row, and that is my goal. It works fine, but it takes some time to find all those elements and loop through them.
I don't want to use XPath or locate elements through those fragile ids.
Can you recommend any other way of doing that?
Thanks in advance,
regards.

If you can't use xpath, I think the only other way to do this would be to alter the code for the page.
It is possible to set these GWT ID values so that they have a clearer and more consistent value. See: https://developers.google.com/web-toolkit/doc/latest/DevGuideUiCss#widgets.
Less appealing is the option to give each input a class or name attribute by which you'd identify them, and then search with By.className() or By.name().

Why not xpath? That's the most direct way to do it. Your existing approach is the slower alternative.
Use this xpath selection: <label> with text containing the "name", then its parent <td>, then its parent <tr>, then the first <input> in that row:
WebElement input = driver.findElement(By.xpath(
"//label[contains(text(), '"+name+"')]/../..//input"
));
I have not tested this. Might have to adjust to your table structure, too.
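
A minimal sketch of the label-to-row-to-input walk, run with the JDK's javax.xml.xpath rather than WebDriver. The table layout, labels and ids are assumptions made up for the example (one table per form line, label and input in separate cells), and LabelInputDemo, inputIdFor and evalString are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class LabelInputDemo {
    // Assumed layout: each form line is its own table with a label cell and an input cell.
    static final String XML = "<form>"
        + "<table><tr><td><label>First name</label></td><td><input id='in1'/></td></tr></table>"
        + "<table><tr><td><label>Last name</label></td><td><input id='in2'/></td></tr></table>"
        + "<table><tr><td><label>Email</label></td><td><input id='in3'/></td></tr></table>"
        + "</form>";

    /** Evaluates the XPath against XML and returns its string value ("" if nothing matches). */
    static String evalString(String xpath) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(xpath, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    /** Finds the id of the first input in the row whose label contains labelName. */
    static String inputIdFor(String labelName) {
        // Naive concatenation: a label containing an apostrophe would break the expression.
        return evalString(
            "(//label[contains(text(), '" + labelName + "')]/../..//input)[1]/@id");
    }

    public static void main(String[] args) {
        System.out.println(inputIdFor("Last name"));
        // With a single slash before input, only inputs that are direct children
        // of the row would match; here the input sits inside a td, so nothing is found.
        System.out.println(
            evalString("//label[contains(text(), 'Last name')]/../../input/@id").isEmpty());
    }
}
```

In this layout the two parent steps land on the tr, and the descendant step (//input) is what reaches the input inside its own td.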
