Scraping data from a JavaScript-generated webpage with R [closed]

The Dutch government publishes the subsidies it grants on a website:
http://www.hetlnvloket.nl/databank-eu-subsidiegegevens-2012#
However, it is not straightforward to get the data off the site. If you go there, choose 'Gemeenschappelijk Landbouw Beleid' (the Common Agricultural Policy, the EU subsidy scheme) and press 'zoek' ('zoek' means 'search') at the bottom of the page, you get a table of 100 entries. But I can't get it into R. It seems the page is generated by JavaScript after you press 'zoek'.
My questions are:
How do I scrape this from the website?
How do I get the other 900 pages? (There are 90K records in total.)
I asked the government to give me this data as XLS, but they refused, citing 'privacy reasons'. But this way nobody can check the data. I don't like that. ;-)

If you don't see the URL change, the request is usually made via AJAX, or via a POST request to the same page. In this case it is an AJAX POST request to a specific page with some parameters. To find out which page is requested and with which parameters, open your browser's developer console: right-click and choose 'Inspect Element' in most browsers, or hit F12. Go to the Network tab and click the search button, and you'll see a request pop up. Inspect it: you'll notice it is a request to /pls/feed/glb2012, and you can find the request parameters there too.
As for the question of 'how' to scrape this: use a programming language and your favorite scraping library. Suggesting a specific library is out of scope for Stack Overflow.
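If you want to replay that POST yourself, here is a minimal sketch in Java (rather than R) using only HttpURLConnection from the standard library. The parameter name below is a placeholder: copy the real parameter names and values from the request your browser sends, as shown in the Network tab.
```java
import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

public class SubsidyScraper {
    public static void main(String[] args) throws IOException {
        // Endpoint observed in the browser's Network tab.
        URL url = new URL("http://www.hetlnvloket.nl/pls/feed/glb2012");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        // Hypothetical parameter: substitute the real names/values from the
        // request payload your browser sends when you press 'zoek'.
        String params = "page=" + URLEncoder.encode("1", "UTF-8");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(params.getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw response; parse and save as CSV
            }
        }
    }
}
```
Loop this over the page parameter to collect all 900 pages, then load the resulting CSV into R.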

Use a tool better suited to scraping than R: for example, Scrapy or BeautifulSoup in Python, or Mojolicious or Web::Scraper in Perl. The idea is to scrape with a scraping tool, output the data as CSV (or something similarly standard), and then load that into R.
You need to figure out exactly what the browser-server communication is. The data is probably not at the URL you see when you visit the page; a quick capture in Wireshark, looking at the HTTP requests, will show you where it actually comes from.
It looks like, based on your level of experience (and, likely, not wanting to learn new tools just for this), you may want to have someone do it for you. Post it on Elance, and make sure whoever takes it has done a number of scraping projects; it should take a couple of hours at most.
If you do want to do it yourself, follow scraping tutorials and cookbook examples, but remember to check the actual communication in Wireshark as you go.

Related

Web Development: Dynamically linking to other related webpages [closed]

I've almost 'finished' my first website, which consists of an author's anthology: essentially hundreds of pages containing individual articles. It's built with HTML and CSS.
At the bottom of each page I want to link to related pages, displaying the title and an image. How can I auto-populate these boxes by fetching related pages from a database?
I'm just looking for someone to point me in the right direction; I will try to teach myself from there. I assume this involves some server-side scripting, or loading the data into an SQL database?
There are two approaches.
Let's compare and contrast!
1. Server Side:
You write code on the server (in PHP or Python or Java or whatever) to create HTML files programmatically, which will have the relevant links.
Pros:
You are in full control
Cons:
Resource-intensive (relatively speaking)
Longer initial loading time (waiting for the server to create a new page, per request)
It's the way it was always done.
2. Client Side:
You write code on the client (the browser) that receives just the data from the server (perhaps as JSON?) and figures out how to display it on its own, perhaps using Angular or React.
Pros:
Very light-weight on the server
HTML pages can be hosted cheaply (S3, DropBox, what have you)
Cons:
Content is fetched and analyzed on the fly, making the page feel slow if you're not careful
Bloats the front end, and is somewhat harder to grasp
As a sub-topic of the client side, there's a new hotness in town called Serverless: you don't write a back end at all, and you focus 100% on the front end.
If you really have to run computation outside the user's browser, you can use cloud functions (like AWS Lambda), but I don't think that's your use case.
For your use case, you can access a database straight from the front end, without needing any back end. See: Firebase.
You'll need some sort of server-side program: something that can query the database and then return the results, either through an API or by processing it all server-side and returning the HTML. Below are some frameworks that can help; a minimal sketch of the idea follows the list.
Java:
Play, Spring, Javalin, Dropwizard, etc.
Python:
Django
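To show the shape of that server-side piece without committing to a framework yet, here is a minimal sketch using only the JDK's built-in com.sun.net.httpserver.HttpServer. The endpoint, port, JSON shape, and the hard-coded in-memory map standing in for a real database query are all assumptions for illustration; the frameworks above give you this and much more.
```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class RelatedPagesServer {
    // Stand-in for a database: article id -> related pages as a JSON array.
    private static final Map<String, String> RELATED = Map.of(
            "article-1", "[{\"title\":\"Article 2\",\"image\":\"/img/2.jpg\"}]",
            "article-2", "[{\"title\":\"Article 1\",\"image\":\"/img/1.jpg\"}]");

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8000), 0);
        // e.g. GET /related?page=article-1 returns that page's related articles.
        server.createContext("/related", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // "page=article-1"
            String id = query == null ? "" : query.replace("page=", "");
            byte[] body = RELATED.getOrDefault(id, "[]").getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```
Each article page would then fetch /related?page=<its-id> and render the returned titles and images into the boxes at the bottom.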

HTTP 429 Too Many Requests when accessing a Reddit .json page only once using Java [closed]

I am getting an HTTP 429 Too Many Requests response when I attempt to access any Reddit page with a .json extension from Java.
I am using Java code found here without any modification (except to change the target URL). I am attempting to access URLs such as the following:
http://www.reddit.com/r/news.json
http://www.reddit.com/r/news/comments/3aqhk7/a_17yearold_invented_an_ingenious_way_to.json
I can access these pages just fine in a browser, but cannot access them programmatically, despite making a single request each time and waiting in between. Reddit returns this message when more than 30 requests are made in a minute, but I am making far fewer than that, and no one else on my network uses Reddit.
Is anyone familiar with this, and why I might be getting these errors? Would there be a better way to approach this in Java?
Make sure to use a custom User-Agent string; see the fourth bullet point of the API rules (a minimal sketch follows the quoted rules below):
Change your client's User-Agent string to something unique and descriptive, including the target platform, a unique application identifier, a version string, and your username as contact information, in the following format:
<platform>:<app ID>:<version string> (by /u/<reddit username>)
Example:
User-Agent: android:com.example.myredditapp:v1.2.3 (by /u/kemitche)
Many default User-Agents (like "Python/urllib" or "Java") are drastically limited (emphasis mine) to encourage unique and descriptive user-agent strings.
Including the version number and updating it as your build your application allows us to safely block old buggy/broken versions of your app.
NEVER lie about your user-agent. This includes spoofing popular browsers and spoofing other bots. We will ban liars with extreme prejudice.
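For completeness, here is a minimal Java sketch of such a request. The app identifier and username in the User-Agent are placeholders to replace with your own:
```java
import java.io.*;
import java.net.*;

public class RedditJsonFetcher {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.reddit.com/r/news.json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Unique, descriptive User-Agent per the API rules; the app id,
        // version, and username below are placeholders.
        conn.setRequestProperty("User-Agent",
                "java:com.example.mynewsreader:v0.1 (by /u/yourusername)");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON response
            }
        }
    }
}
```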

Ideas for apps that show news headlines (what is this feature called, and what should I research to implement it?) [closed]

My friend wants an Android app that will show him news from various selected websites about particular topics (health, electronics, etc.), rather than the main headline news. I assume the websites in question must provide some method for extracting their news for this to be possible, but I am not sure how to start writing this app. I'm fairly sure news websites have a way to hand out any and all news they publish, but like I said, my app should only get the news for selected topics (like health or entertainment).
What is this feature called? How should I go about starting? Any ideas for keywords to search for would be helpful. Thank you.
First and foremost, it sounds like you need a little more background information before you are going to be able to tackle something like this:
RSS feeds with Java Tutorial
Documentation on using RSS (additional resources, beyond the tutorial above, that may be of interest)
Another resource with code examples
Those resources are provided assuming you are tackling this project to get a better understanding of news feeds in the Java programming language. If your friend is really just looking for a news reader that filters results as you described, it would be easier and more practical for both parties to download a few existing apps and try them out.
An article on some of the best news readers for Android.
Hope this is helpful to you and your friend.
Dude, you should follow the steps below to get news into your app (a minimal parsing sketch follows the list):
Search the web for news sites that provide RSS feeds.
Select the news genre you want in the RSS feed menu: Top Stories, Movies, or whatever.
Selecting a genre opens an RSS feed page, which consists of news related to that genre.
On that page, right-click and view the page source.
The page source is XML; study its structure. All the news is inside that page source.
Now all you need to do is make HTTP calls from your app and do SAX parsing or XmlPullParser parsing, whichever you prefer, to extract the data from that feed.
Then display that data in your app in a list or whatever view you want.
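As a rough illustration of the parsing step, here is a minimal plain-Java sketch that fetches a feed and prints the item titles using the JDK's DOM parser (on Android you would typically use XmlPullParser instead). The feed URL is a placeholder:
```java
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssTitles {
    public static void main(String[] args) throws Exception {
        // Hypothetical feed URL; substitute the RSS feed of your chosen genre.
        URL feed = new URL("http://example.com/rss/health.xml");
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(feed.openStream());
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = item.getElementsByTagName("title").item(0).getTextContent();
            System.out.println(title); // feed these into your list view/adapter
        }
    }
}
```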
I hope this helps you buddy. :)

Interacting with an AJAX site from Java

I am trying to download the contents of a site. The site is a Magento site where one can filter results by selecting properties in the sidebar; see zennioptical.com for a good example.
Using zennioptical.com as the example, I need to download all the rectangular glasses, or all the plastic ones, etc.
So how do I send a request to the server to display only the rectangular frames?
Thanks so much.
Your basic answer is that you need to do an HTTP GET request with the correct query params. I'm not totally sure how you are trying to do this based on your question, so here are two options.
If you are trying to do this from JavaScript, you can look at this question. It has a bunch of answers that show how to perform AJAX GETs with the built-in XMLHttpRequest or with jQuery.
If you are trying to download the page from a Java application, this really doesn't involve AJAX at all. You'll still need to do a GET request, but now you can look at this other question for some ideas.
Whether you are using JavaScript or Java, the hard part is going to be figuring out the right URLs to query. If you are trying to scrape someone else's site, you will have to see what URLs your browser requests when you filter the results. One of the easiest ways to see that info in Firefox is the Web Console, found at Tools -> Web Developer -> Web Console. You could also download something like Wireshark, which is a good tool to have around but probably overkill for what you need.
EDIT
For example, when I clicked the "rectangle frames" option at Zenni Optical, this is the query that fired off in the Web Console:
[16:34:06.976] GET http://www.zennioptical.com/?prescription_type=single&frm_shape%5B%5D=724&nav_cat_id=2&isAjax=true&makeAjaxSearch=true [HTTP/1.1 200 OK 2328ms]
You'll have to do a sufficient number of these to figure out how to generate the URLs to get the results you want.
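If you are doing this from Java, a minimal sketch that replays that captured GET with HttpURLConnection might look like this. The query parameters are copied from the capture above and may have changed since:
```java
import java.io.*;
import java.net.*;

public class FrameFetcher {
    public static void main(String[] args) throws IOException {
        // Query string taken from the Web Console capture; the parameter
        // values (e.g. frm_shape ids) may need updating.
        URL url = new URL("http://www.zennioptical.com/"
                + "?prescription_type=single&frm_shape%5B%5D=724"
                + "&nav_cat_id=2&isAjax=true&makeAjaxSearch=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // HTML fragment with the filtered results
            }
        }
    }
}
```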
DISCLAIMER
If you are downloading someone else's data, it would be best to check with them first. The owner of the server may not appreciate what they might consider stealing their data/work. And depending on how you use the data you pull down, you could be venturing into all sorts of ethical issues... Then again, if you are downloading from your own site, go for it.

Script to take web survey for me

I had to take a SurveyMonkey survey today, and the format was as follows: a question was asked, and after hitting the next button the answer was displayed as "Answer: _" along with an explanation. For kicks, I'd like to make a program that could take this survey: answering any letter, going to the next page and reading the answer, going back and changing the answer to the correct one, then going two pages ahead and repeating.
I am familiar with Java and Python, but I'm not sure how to make them "know" where the button is, or how to "read" the text without unnecessary image recognition.
This is just a fun project, nothing serious, but I would appreciate any ideas to get me started.
Assuming that the text is just that (text rather than images), there are a few useful tools for you:
.NET WebControl: I've scripted this before from .NET. It has the advantage of keeping all of the JS on the page working. I know this isn't Java, but it is surprisingly easy to work with for this kind of task.
Selenium: it is primarily a web-testing framework, but it is easy to script from Java to auto-submit forms (see the sketch after this list).
TagSoup for Java: if the pages do not have significant JavaScript that needs to run, there are many HTML parsers for Java that could potentially be used to build a scraper.
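To give a rough idea of the Selenium option, here is a minimal Java sketch. The survey URL and the selectors are hypothetical, so you would need to inspect the actual pages to find the real ones:
```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SurveyWalker {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        // Hypothetical URL; use the real survey link.
        driver.get("https://www.surveymonkey.com/r/your-survey-id");
        // Pick the first radio-button answer, then advance.
        driver.findElement(By.cssSelector("input[type='radio']")).click();
        driver.findElement(By.xpath("//button[contains(., 'Next')]")).click();
        // Read the revealed answer text (hypothetical selector).
        String answer = driver.findElement(By.cssSelector(".answer")).getText();
        System.out.println(answer);
        driver.navigate().back(); // go back and select the correct answer, etc.
        driver.quit();
    }
}
```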
Would it be unrealistic to POST to the SurveyMonkey pages directly? You could then use some regexes to pull the "Answer: _" text out and look for that pattern in the original page. It would definitely be easier than trying to click things in a browser. Basically, write a Java (or Python, for that matter) app that does HTTP POSTs to the survey pages in order, uses regexes to find the next page, and keeps a stack to track the history; a regex sketch follows below.
Edit: if this isn't clear, let me know and I'll clarify.
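For the regex part, a minimal sketch; the exact pattern depends on the page's actual markup, so treat this as a starting point:
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnswerExtractor {
    public static void main(String[] args) {
        // 'html' would be the body of the HTTP response for the answer page.
        String html = "... Answer: B ... explanation ...";
        // Matches the "Answer: _" text described in the question;
        // adjust the character class to the survey's answer labels.
        Pattern p = Pattern.compile("Answer:\\s*([A-D])");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println("Correct answer: " + m.group(1));
        }
    }
}
```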
Edit 2: I completely forgot about HtmlUnit, my bad. It is a testing framework, as suggested by jsight, but specifically for Java, and it functions very similarly to JUnit. However, because it is designed for testing web applications, it can be used to automate interactions with other sites.
You can do it using a simple image search. First, take a screenshot of a unique part of the button and save it; this will be the reference for where to click the mouse. Then, while the application is actually running, take a screenshot of the entire screen, find the region matching the previously saved image, and click the mouse at the appropriate location based on where the button image was found. A sketch of this approach follows.
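Here is a minimal sketch of that idea with java.awt.Robot. It does a naive exact pixel match, which assumes the on-screen rendering is identical to the saved screenshot; real pages often need fuzzy matching:
```java
import java.awt.*;
import java.awt.event.InputEvent;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class ButtonClicker {
    public static void main(String[] args) throws Exception {
        Robot robot = new Robot();
        // "button.png" is the screenshot of the button saved beforehand.
        BufferedImage needle = ImageIO.read(new File("button.png"));
        Rectangle screen = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
        BufferedImage haystack = robot.createScreenCapture(screen);

        // Slide the needle over the screen capture until every pixel matches.
        outer:
        for (int x = 0; x <= haystack.getWidth() - needle.getWidth(); x++) {
            for (int y = 0; y <= haystack.getHeight() - needle.getHeight(); y++) {
                if (matchesAt(haystack, needle, x, y)) {
                    // Click the center of the matched region.
                    robot.mouseMove(x + needle.getWidth() / 2, y + needle.getHeight() / 2);
                    robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
                    robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
                    break outer;
                }
            }
        }
    }

    static boolean matchesAt(BufferedImage big, BufferedImage small, int ox, int oy) {
        for (int x = 0; x < small.getWidth(); x++)
            for (int y = 0; y < small.getHeight(); y++)
                if (big.getRGB(ox + x, oy + y) != small.getRGB(x, y)) return false;
        return true;
    }
}
```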
