I am trying to download the contents of a site. The site is a Magento site where one can filter results by selecting properties in the sidebar; see zennioptical.com for a good example. Using zennioptical.com as an example, I need to download all the rectangular glasses, or all the plastic ones, etc.
So how do I send a request to the server to display only the rectangular frames, etc.?
Thanks so much
Your basic answer is that you need to do an HTTP GET request with the correct query params. I'm not totally sure how you are trying to do this based on your question, so here are two options.
If you are trying to do this from JavaScript, you can look at this question. It has a bunch of answers that show how to perform AJAX GETs with the built-in XMLHttpRequest or with jQuery.
If you are trying to download the page from a Java application, this really doesn't involve AJAX at all. You'll still need to do a GET request, but now you can look at this other question for some ideas.
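If you go the Java route, here is a minimal sketch using the standard HttpURLConnection (the URL is a placeholder; the next section shows how to find the real one):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PageDownloader {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: substitute the filter URL you discover below.
            URL url = new URL("http://www.zennioptical.com/?frm_shape%5B%5D=724");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // dump the response body
                }
            } finally {
                conn.disconnect();
            }
        }
    }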
Whether you are using JavaScript or Java, the hard part is going to be figuring out the right URLs to query. If you are trying to scrape someone else's site, you will have to see what URLs your browser requests when you filter the results. One of the easiest ways to see that info is in Firefox with the Web Console, found at Tools->Web Developer->Web Console. You could also download something like Wireshark, which is a good tool to have around, but probably overkill for what you need.
EDIT
For example, when I clicked the "rectangle frames" option at zenni optical, this is the query that fired off in the Web Console:
[16:34:06.976] GET http://www.zennioptical.com/?prescription_type=single&frm_shape%5B%5D=724&nav_cat_id=2&isAjax=true&makeAjaxSearch=true [HTTP/1.1 200 OK 2328ms]
You'll have to do a sufficient number of these to figure out how to generate the URLs to get the results you want.
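As a rough sketch (assuming the parameter names captured above stay stable), you could then generate those URLs programmatically:

    import java.net.URLEncoder;

    public class FilterUrlBuilder {
        public static void main(String[] args) throws Exception {
            // Parameter names taken from the Web Console capture above;
            // 724 is whatever shape id your own captures reveal.
            String url = "http://www.zennioptical.com/?prescription_type=single"
                    + "&" + URLEncoder.encode("frm_shape[]", "UTF-8") + "=724"
                    + "&nav_cat_id=2&isAjax=true&makeAjaxSearch=true";
            System.out.println(url);
        }
    }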
DISCLAIMER
If you are downloading someone else's data, it would be best to check with them first. The owner of the server may not appreciate what they might consider stealing their data/work. And then, depending on how you use the data you pull down, you could be venturing into all sorts of ethical issues... Then again, if you are downloading from your own site, go for it.
Related
I am in the process of writing a program centered around generating custom URLs for intelius.com and then extracting data from them with Selenium. I have observed some interesting behavior that I am unsure how to address.
My program creates URLs after the following pattern: https://intelius.com/people-search/LASTNAME/CITY-STATE, but I have found that attempting to access these constructed links consistently leads to a timeout error.
For example, http://intelius.com/people-search/Williams/Brooklyn-NY does not load the expected results page.
Digging around in the website's source, I have found what appears to be a link validator script — what exactly that means, I do not know — and am unsure how to proceed.
How exactly would I go about authenticating my queries without programming Selenium to manually input the data into the search textbox and press the submit button? Is my link-construction approach flawed in some blatantly obvious manner? I am a bit lost and would appreciate some direction. Thanks!
I think your problem is using http instead of https and omitting www from the URL. So this works:
https://www.intelius.com/people-search/Williams/Brooklyn-NY
The problem lies in the way the URL is being formed. You need to construct and pass the arguments the way the web application understands them. The following works:
https://www.intelius.com/people-search/William-Brooklyn/NY
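As a minimal sketch, building URLs in that form (the pattern is inferred from the working example above, so verify it against your own tests):

    public class InteliusUrlBuilder {
        // Pattern inferred from the working URL above: name and city share
        // one path segment, the state gets the next. Treat as an assumption.
        static String buildSearchUrl(String name, String city, String state) {
            return "https://www.intelius.com/people-search/"
                    + name + "-" + city + "/" + state;
        }

        public static void main(String[] args) {
            System.out.println(buildSearchUrl("William", "Brooklyn", "NY"));
            // -> https://www.intelius.com/people-search/William-Brooklyn/NY
        }
    }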
I'm trying to read in the HTML from a webpage and parse information from it using a URLConnection in Java. It works, but the page only loads part of the content; the rest is loaded as the user scrolls down the page. Is there any way for a Java program to trigger this? My program doesn't actually open the webpage in a browser, just a connection to the page. If it's relevant, I can add the URL I'm accessing.
I've been trying to find the answer, and found a few similar topics on here, most of them without answers. However, I eventually made my way to this topic, which sounds like what I need, but I looked at the URLs of the calls being made and they're not always the same, so I can't just hard-code them into the program. I looked at the topic it was supposedly a duplicate of, but that didn't seem to apply to my problem either, unless I misunderstood something. Is there any way to find these URLs each time the program runs, or any way to trick the connection into thinking I'm scrolling down the page? Or can I make a general "request" or "POST", as I've seen in some related topics, that will automatically call the appropriate URL? (An explanation of a "POST" would be appreciated as well.)
I am searching for a way to know when the user leaves the page without having saved their changes, and then show Wicket's modal (preferably, though a confirmation box would also work).
Additional info:
the solution should have minimal impact on the code, because I have about 30 pages that will need the behavior; actually, all my web pages extend from one called LayoutPage, something similar to this
I tried a pure JavaScript solution like in this question, but the application sends a lot of data via AJAX requests, so I couldn't determine a nice way to know if the data has been sent to the server.
Then I started to look at the source code of Wicket's Form class. It has a nice method called isSubmitted(); I could use it if I were able to know from Wicket when the user is about to quit the page.
I don't want to write a validation for each page in the system.
Simply hook the browser's onbeforeunload event into Wicket using https://cwiki.apache.org/WICKET/calling-wicket-from-javascript.html. In the callback you can then check the state of your form or page.
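A minimal sketch of that idea, assuming Wicket 6's header-item API (formIsDirty() is a placeholder you would implement client-side):

    import org.apache.wicket.Component;
    import org.apache.wicket.behavior.Behavior;
    import org.apache.wicket.markup.head.IHeaderResponse;
    import org.apache.wicket.markup.head.OnDomReadyHeaderItem;
    import org.apache.wicket.markup.html.WebPage;

    // Added once in LayoutPage, so all ~30 pages inherit the behavior.
    public abstract class LayoutPage extends WebPage {

        public LayoutPage() {
            add(new Behavior() {
                @Override
                public void renderHead(Component component, IHeaderResponse response) {
                    // formIsDirty() is a placeholder: set a client-side flag
                    // from your AJAX handlers, or check the form's state.
                    response.render(OnDomReadyHeaderItem.forScript(
                        "window.onbeforeunload = function() {"
                        + "  if (formIsDirty()) return 'You have unsaved changes.';"
                        + "};"));
                }
            });
        }
    }

Returning a string from onbeforeunload makes the browser show its own confirmation box; wiring it to a Wicket modal instead would require the AJAX callback described in the linked page.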
Although I've been programming for a few years, I've only really dabbled in the web side of things; it's been more application-based for computers up until now. I was wondering, in Java for example, what library-defined or self-defined function I would use to have a program launch a web browser to a certain site? Also, as an extension to this, how could I have it find a certain field on the website, like a search box for instance (if it wasn't the current target of the cursor), and then populate it with a string and submit it to the server? (Maybe this is a kind of find-by-ID scenario?)
Also, is there a way to control whether this is visible to the user? What I mean is, if I want to do something as a background task while the user carries on using the program, I will want the program to be submitting data to a webpage without the whole visual side of things that would interrupt the user.
This may be basic, but like I say, I've never tried my hand at it, so if someone could just provide some rough code outlines I'd really appreciate it.
Many thanks
I think Selenium might be what you are looking for.
Selenium allows you to start a web browser, navigate it to a certain website, and interact with it. There is also a Java API (and APIs for a lot of other languages, by the way) allowing you to control the launched browser from a Java application.
There is some tweaking to do, but you can also launch Selenium in the background, using a headless web browser.
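A minimal sketch with Selenium's Java API (the URL and the element id "q" are placeholders; inspect the target page for the real ones):

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class SearchBot {
        public static void main(String[] args) {
            WebDriver driver = new FirefoxDriver(); // launches a real browser
            try {
                driver.get("http://example.com"); // placeholder URL

                // Find the search box by id and submit a query.
                WebElement searchBox = driver.findElement(By.id("q"));
                searchBox.sendKeys("my query");
                searchBox.submit(); // submits the enclosing form
            } finally {
                driver.quit();
            }
        }
    }

For the invisible background case, swapping FirefoxDriver for HtmlUnitDriver gives you a headless browser the user never sees.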
As I understand it, you want to submit data to a server via the existing web interface?
In that case, you need to find out how the URL for the request is built, and then make an HTTP call using the corresponding URL.
I advise reading this if it involves a POST submit.
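A minimal sketch of a form POST with HttpURLConnection (the URL and field names are made up; copy the real ones from the form's action attribute and input names):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class FormPoster {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/submit"); // placeholder
            String body = "field1=" + URLEncoder.encode("value1", "UTF-8")
                    + "&field2=" + URLEncoder.encode("value2", "UTF-8");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // enables sending a request body
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded");

            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }
            System.out.println("Response code: " + conn.getResponseCode());
            conn.disconnect();
        }
    }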
I had to take a SurveyMonkey survey today, and the format was as follows: a question was asked, then after hitting the next button, the answer was displayed as "Answer: _" along with an explanation. For kicks, I'd like to make a program that could take this survey, answering any letter, then going to the next page and reading the answer, then going back and changing the answer to the correct one, then going two pages ahead and repeating.
I am familiar with Java and Python, but I'm not sure how to make them be able to "know" where the button is, and how to "read" text without unnecessary image recognition.
This is just a fun project, nothing serious, but I would appreciate any ideas to get me started.
Assuming that the text was just that (text rather than images), there are a few useful tools for you:
.Net WebControl - I've scripted this before from .Net. It has the advantage of keeping all of the JS on the page working. I know this isn't Java, but it is surprisingly easy to work with for this kind of task.
Selenium - It is primarily a web testing framework, but it would be easy to script it from Java to auto-submit forms.
TagSoup for Java - If the pages do not have significant JavaScript code that needs to run, there are many HTML parsers for Java that could potentially be used to develop a scraper; a minimal parsing sketch follows this list.
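As a rough sketch of that last option (TagSoup exposes a standard SAX parser; the handler below just dumps text nodes, and the URL is a placeholder):

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class TextExtractor {
        public static void main(String[] args) throws Exception {
            XMLReader reader = new Parser(); // TagSoup's lenient HTML parser
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Print every text node the parser encounters.
                    System.out.print(new String(ch, start, length));
                }
            });
            reader.parse(new InputSource(
                    new java.net.URL("http://example.com").openStream()));
        }
    }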
Would it be unrealistic to make it POST to the SurveyMonkey pages? You could then use some regexes to pull "Answer: _" out and look for that pattern in the original page. It would definitely be easier than trying to click things in a browser, etc. Basically, write a Java app (or Python, for that matter) that does HTTP POSTs to the survey pages in order, uses regexes to find the next page, and then uses a stack to keep track of the history.
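For the regex part, a small sketch (the pattern is a guess at the "Answer: _" format described in the question):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AnswerExtractor {
        public static void main(String[] args) {
            String html = "... Answer: B ..."; // stand-in for the fetched page

            // Guessed pattern for the "Answer: _" text; adjust to the real markup.
            Matcher m = Pattern.compile("Answer:\\s*([A-D])").matcher(html);
            if (m.find()) {
                System.out.println("Correct answer: " + m.group(1));
            }
        }
    }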
Edit: if this isn't clear, let me know and I'll clarify.
Edit 2: I completely forgot about HtmlUnit, my bad. It is a testing framework like the one suggested by jsight, but specifically for Java, and it functions very similarly to JUnit. However, because it is designed for testing web applications, it can be used to automate interactions with other sites.
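A minimal HtmlUnit sketch (WebClient and getPage are HtmlUnit's real API; the URL is a placeholder):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class SurveyBot {
        public static void main(String[] args) throws Exception {
            WebClient client = new WebClient(); // headless browser with JS support
            HtmlPage page = client.getPage("http://example.com/survey"); // placeholder
            System.out.println(page.asText()); // dump the page's visible text
            client.closeAllWindows(); // release browser resources
        }
    }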
You can do it using a simple image search. First, screenshot a unique part of the button and save it; this will be used as the reference for where to click the mouse. Then, while the application is actually running, take a screenshot of the entire screen, find the region matching the previously saved image, and click the mouse at the appropriate location based on where the button image was found.
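A rough sketch of that approach with java.awt.Robot (the template match below is the naive exact-pixel version; real screenshots usually need fuzzier matching, and "button.png" is your saved screenshot):

    import java.awt.Rectangle;
    import java.awt.Robot;
    import java.awt.Toolkit;
    import java.awt.event.InputEvent;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import javax.imageio.ImageIO;

    public class ButtonClicker {
        public static void main(String[] args) throws Exception {
            // "button.png" is the saved screenshot of a unique part of the button.
            BufferedImage template = ImageIO.read(new File("button.png"));

            Robot robot = new Robot();
            BufferedImage screen = robot.createScreenCapture(
                    new Rectangle(Toolkit.getDefaultToolkit().getScreenSize()));

            // Naive exact-match scan over every screen position.
            for (int y = 0; y <= screen.getHeight() - template.getHeight(); y++) {
                for (int x = 0; x <= screen.getWidth() - template.getWidth(); x++) {
                    if (matches(screen, template, x, y)) {
                        // Click the center of the matched region.
                        robot.mouseMove(x + template.getWidth() / 2,
                                        y + template.getHeight() / 2);
                        robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
                        robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
                        return;
                    }
                }
            }
            System.out.println("Button not found on screen.");
        }

        private static boolean matches(BufferedImage screen, BufferedImage t,
                                       int ox, int oy) {
            for (int y = 0; y < t.getHeight(); y++) {
                for (int x = 0; x < t.getWidth(); x++) {
                    if (screen.getRGB(ox + x, oy + y) != t.getRGB(x, y)) {
                        return false;
                    }
                }
            }
            return true;
        }
    }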