htmlunit refering to element - java

I would like to programmatically issue a click on a web page and download the CSV file from the website. I'm trying to implement this logic with HtmlUnit library. The code that I'm trying to use:
HtmlPage historicalDataHtmlPage = webClient.getPage("https://finance.yahoo.com/quote/GOOG/history?p=GOOG");
HtmlPage downloadHtmlPage = historicalDataHtmlPage.getAnchorByText("Download Data").click();
HtmlUnit seems to have a problem finding this particular element (not sure why - i tested this solution on other websites and it seems to find anchors by text).
Can you please advise how can I fix this error or how can I refer to the Download Data element in any other way?
Thank you.

Please go through the links mentioned by #pvg and modify your question as per the guidelines.
Having said that, have you tried using getAnchors() which returns a list List<HtmlAnchor> and see what is the text you are getting inside the tag you want. Maybe there are other elements which do not match your text assumption of "Download Data".
Would have posted this as a comment but lack the reputation to do so.

Looks like the page starts with minimal content and adds you anchor later by doing some background requests.
Try to wait some seconds before you start searching for the anchor.
Also it is a good idea to use Page.asXML() to get an idea of the current status of the page (e.g. before and after the wait to see if something has changed.

Related

Can I obtain this input element with Selenium?

I'm new to Selenium.
I have been looking all the ways possible to resolve this problem (at this point I think it is just unsolvable) I have a web page (can't share) with this input:
/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td/div/iframe/#document/html/body/div[2]/div/iframe/#document/html/frameset/frame/#document/html/frameset/frameset/frameset/frame/#document/html/body/div/form/table/tbody/tr[2]/td[2]/input
(As you can see the structure has a mix of frames, iframes and framesets; some of them have id and other names some none)
The problem is that Selenium will not find the element by any method I
have tested
. First I tried with a simple driver.findElement(By.all of them)
After they didn't work I start looking on the web and I couldn't find why I couldn't get a handle of this.
It is the only input in the web page and has an id as attribute, so it should be easy.
This is the class where I have been trying to make it work, If you want to check my unsuccessful work I focused my last efforts on the attempt number 8.
As you can see I have been trying to obtain it by several ways, the frames really seemed an issue but it was not them, nor the waiting.
Is it really no way of getting this element? Is it one of those cases
where Selenium can't automate? Or it is that I'm missing something.
object IS visible and there is not even a scroll bar, the whole page fits in the screen perfectly, Xpath was one of the first choices I tested, didn't work
Thank you in advance.
I don't know if this is the problem, but is the element you trying to use visible or you need to scroll in order to view it? If it's not visible try using this to scroll a little bit :
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("window.scrollBy(0,1000)");
Another possible solution is to try use the absolute xpath instead of relative. There are several tools to find the absolute xpath of an element in an html page. Then you can use
driver.findElement(By.xpath(absoluteXpath));
After a lot of try I realized that selenium was switching to a different frame at the last switchTo. This is probably a bug, but I modified the way I was getting the last switch to to instead of
var wait = (new WebDriverWait(driver, secsToWait));
wait.until(ExpectedConditions.frameToBeAvailableAndSwitchToIt(frameNameOrId));
To
driver.switchTo().frame(driver.findElement(By.xpath("//frame[#"+attribute+"='"+frameNameOrId+"']")));
So it finally worked and obtained the input normally.

Outputting Search results using Jsoup in java

I'm trying to create a Java Program, where I can insert a String into a search bar and then record/print out the results.
This site is: http://maple.fm/khroa
I'm fairly new to JSoup and I've spent several hours just reading the html code regarding that page and have come across variables that could be used to insert the String that I need and get results, although I'm not sure how to exactly do that. Would someone be able to point me to the right direction?
I think you missed the point of JSOUP.
JSOUP can parse a page that is already loaded - it is not used to interact with a page (as you want). You could use Selenium to interact with the page (http://www.seleniumhq.org/) and then use JSOUP to parse the loaded page's source code.
In this case, the search results seem to be all loaded when the page load, and the Item Search function only filters the (already existing) results with Javascript.
There are no absolute links you could use to get results to a particular search.

Roblox Forum, Web Scraping App (Android) Questions

I've been waiting for an idea and I think I finally have one. I am going to attempt to make a Android App using Web Scraping that will allow me to navigate and use the forums on Roblox (.com if you really want to look it up) better than I can now. Not only are the forums pretty bad in general but they are even worse on my Android Device (Samsung Galaxy Player). Can anyone give me an pointers or advice? I'm not sure what libraries I should use... This is my first big attempt at coding :)
Oh, Obviously I would want to give it a feature to reply to posts but I'm not sure how login for that type of thing would work...
EDIT: I got the idea from this application: GooglePlay, Github
You should look up how to get the data from the website, and you should also make sure that you understand html. You also need a simple way to handle html.
Get the page (use the example in the question): Read data from webpage
A bit about html: http://www.w3schools.com/html/default.asp
Handle html: http://jsoup.org/cookbook/extracting-data/dom-navigation
To login you should do a post request with the login information to the standard login page, then you keep the cookie that were generated and pass it with your other requests.
Little about handle cookies: Java: Handling cookies when logging in with POST
Some things you also might want to think about:
Linear or branched view of posts in the forum?
Should you get a message if someone post a new post?
A own search function?
Signature?
You have to use JSOUP libaray of java ,you can easily parse the html data through this library. Example: In doc object you are getting complete web page
File input = new File(url);
Document doc = null;
doc = Jsoup.connect(url).get();
Elements headlinesCat1 = doc.select("div[class=abc");

How to programmatically submit a filled form and scrape the resulting page?

I want to retrieve a set of results, which consist of all results produced by (looping) all the options of one of the request-form fields.
I'm using Java language, and HtmlUnit API.
I have managed to do this looping form-fill using the URL to 'fill' the field's variables (I don't know if its the best method, and actually am quite worried it's one of the worst...But it's the one i could do with the knowledge i have).
But i'm having problems figuring out how to make the program submit the form in order to reach the result page, and on how to download (scrape) that page before moving to the next.
NOTES:
-If you have a better way of filling the 'request-form', that is welcome as well.
UPDATE:
This solves the issues when using HtmlUnit API (thank you, touti):
HtmlPage resultado = pageNow.getElementByName("buscar").click();
System.out.println(resultado.asText());
A better way than loading both the request and response pages is still hugely welcome tough!
you can simulate using Jquery the click on your submit input like this
$("#submit_id").trigger("click");

Extract All Images From HTML Using JAVA

I want to get the list of all Image urls from HTML source of a webpage(Both abosulte and relative urls). I used Jsoup to parse the HTML but its not giving all images. For example when I am parsing google.com HTML source its showing zero images..In google.com HTML source image links are in form..
"background:url(/intl/en_com/images/srpr/logo1w.png)
And in rediff.com the images links are in form..
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bappi-da-the-first-indian-in-grammy-jury/2684982","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/v3np2zgbla4vdccf.D.0.bappi.jpg","Bappi Da - the first Indian In Grammy jury","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:33)");
j = 1
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bebo-shahid-jab-they-met-again-/2681664","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/ra8p9eeig8zy5qvd.D.0.They-Met-Again.jpg","Bebo-Shahid : Jab they met again!","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:17)");
All images are not with in "img" tags..I also want to extract images which are not even with in "img" tags as shown in above HTML source.
How can I do this..?Please help me on this..
Thanks
This is going to be a bit difficult, I think. You basically need a library that will download a web page, construct the page's DOM and execute any javascript that may alter the DOM. After all that is done you have to extract all the possible images from the DOM. Another possible option is to intercept all calls by library to download resources, examine the URL and if the URL is an image record that URL.
My suggestion would be to start by playing with HtmlUnit(http://htmlunit.sourceforge.net/gettingStarted.html.) It does a good job of building the DOM. I'm not sure what types of hooks it has, for intercepting the methods that download resources. Of course if it doesn't provide you with the hooks you can always use AspectJ or simply modify the HtmlUnit source code. Good luck, this sounds like a reasonably interesting problem. You should post your solution, when you figure it out.
If you just want every image referred to in the page, can't you just scan the HTML and any linked javascript or CSS with a simple regex? How likely is it you'd get [-:_./%a-zA-Z0-9]*(.jpg|.png|.gif) in the HTML/JS/CSS that's not an image? I'd guess not very likely. And you should be allowing for broken links anyway.
Karthik's suggestion would be more correct, but I imagine it's more important to you to just get absolutely everything and filter out uninteresting images.

Categories