I am trying to get article headlines from the NY Times.
But I think the HTML is generated by JavaScript, as it is only visible when I use 'Inspect Element' in Firefox.
How can I get to the articles? Probably, one of the ways is to emulate a browser but that seems like overkill.
I would prefer to do this in Java but Python is okay too. Your help is appreciated!
edit:
I tried using the API, but it returns a lot of bad URLs (page not found). Does anyone have any more ideas on how to get the URLs and headlines?
Selenium is probably what you're looking for; it's a browser automation framework.
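You can drive it from Python, but note that Selenium does not parse pages itself: it controls a real browser (Firefox, last time I heard), so JavaScript-generated content is rendered before you read it.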
You can get the python version here but there are other options.
You could try to use a browser without GUI like HtmlUnit. It has good JavaScript-support and you're able to read the contents of the page from your Java-program.
As an alternative solution to this particular problem, how about using the New York Times API? They provide JSONP for JavaScript support. Using the API is probably more future-proof if they ever change the site layout.
Related
I have everything set up to run a headless browser using Selenium in Java. What I can't figure out is how to extract elements from this ReactJS website (the site uses either ReactJS or plain JavaScript, I'm not sure which).
I feel like I may be approaching this wrong and/or missing some libraries that would help me along the way.
Any help greatly appreciated.
If you do Web Interface tests or scraping with Java + Selenium, I advise you to use the NoraUi Open Source Framework.
I cannot find any direct method like isDisplayed() in Jsoup Element.
I can check the input with type = "hidden" by using the following code.
"hidden".equalsIgnoreCase(elm.attr("type"))
But I also need to detect elements hidden via CSS, including elements that are hidden only because one of their ancestors is hidden.
Pshemo said it already in his comment: JSOUP is not a JavaScript interpreter. And JSOUP does not combine external CSS info into html. JSOUP just interprets html, and it is very good at this. Nothing much more but also nothing much less. You can also access the internet and load html pages with JSOUP, but that is really the limit of it.
About your problem: You should think hard about whether you really need to know if an element is visible or hidden. If you do, you probably need a testing framework that behaves like a browser. For Java there are very good bindings to Selenium WebDriver, which drives a real browser to load and test pages. You can also scrape the content with Selenium. I have had good experience using both: Selenium for accessing web content, then switching over to JSOUP for the actual scraping. In your case you can use the powerful WebDriver API directly to find out whether an element is hidden or not.
Selenium WebDriver is able to work with Firefox, Chrome and a bunch of other browsers. If you need a lightweight alternative you may use a headless browser. For that there is PhantomJS, which is excellently supported by Selenium, or HtmlUnit, which is even lighter and uses Rhino, a JavaScript interpreter written in Java.
You see, there are quite some options to choose from to achieve what you want. Just not JSOUP, although it is a great library.
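If only markup-level hiding matters (the hidden attribute, type="hidden", or an inline style), a static check is still possible without a browser. The following is a minimal sketch under that assumption. The Node class is a hypothetical stand-in for a parsed element, not the JSOUP API, and the check cannot see external stylesheets or anything set by scripts. It treats any hidden ancestor as hiding the whole subtree, which is exact for display:none but only an approximation for visibility:hidden.

```java
import java.util.Locale;

class HiddenCheck {
    /** Hypothetical stand-in for a parsed element; not part of the JSOUP API. */
    static final class Node {
        final String tag;
        final String type;
        final String style;
        final boolean hiddenAttr;
        final Node parent;

        Node(String tag, String type, String style, boolean hiddenAttr, Node parent) {
            this.tag = tag;
            this.type = type;
            this.style = style;
            this.hiddenAttr = hiddenAttr;
            this.parent = parent;
        }
    }

    /** True if this node alone is marked hidden by its attributes or inline style. */
    static boolean selfHidden(Node n) {
        if (n.hiddenAttr) {
            return true;
        }
        if ("input".equalsIgnoreCase(n.tag) && "hidden".equalsIgnoreCase(n.type)) {
            return true;
        }
        // Normalise case and whitespace so "Display : None" matches too.
        String style = n.style == null
                ? ""
                : n.style.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
        return style.contains("display:none") || style.contains("visibility:hidden");
    }

    /** True if the node or any of its ancestors is hidden (the "inherited" case). */
    static boolean isHidden(Node n) {
        for (Node cur = n; cur != null; cur = cur.parent) {
            if (selfHidden(cur)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node form = new Node("form", "", "display: none", false, null);
        Node input = new Node("input", "text", "", false, form);
        System.out.println(isHidden(input)); // true: the parent form has display:none
    }
}
```

With JSOUP the same walk could presumably be fed from Element.attr("style"), Element.hasAttr("hidden") and Element.parent(), but for anything driven by stylesheets or scripts the WebDriver route described above remains the reliable one.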
Our website UI is built with JavaScript and jQuery. Many tasks are performed by jQuery: the UI does a lot of filtering, sorting, searching and tabular display.
Now we want to make sure that the website works fine even if JavaScript is disabled in the browser.
We use Spring MVC.
Can anybody give me an idea of how we can achieve this?
Would any other technology do the job?
Would GWT be able to achieve this?
If your website is built using JavaScript technology itself, then unless you build it WITHOUT JavaScript, there is no way you can achieve this.
GWT is out because it basically compiles Java-ish code into JavaScript. With Spring MVC, you can rewrite your site to use only server-rendered views (JSP) and server-side actions (MVC controllers) for sorting, ...
What you were supposed to do with your particular requirement was to build a basic site that worked without JavaScript and then use JavaScript to make it much more whizzbang (pretty, with cool effects)! :-D But since you have already built the site, the only solution I can think of off the top of my head is to create a new basic site in plain HTML and make it the default site. From that basic site you can check whether JavaScript is enabled and then redirect the user to the whizzbang site (the JavaScript-enabled one) with a simple JavaScript redirect!
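As a rough illustration of moving sorting to the server, here is a minimal sketch of the kind of helper an MVC controller could call before rendering a JSP table. Everything in it is an assumption made for the example: the Product class, the "name" and "price" sort keys, and the idea that the key arrives as a request parameter (for instance from a plain link in a column header). It uses only the JDK, so the Spring wiring itself is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class ServerSideSort {
    /** Hypothetical row type; a real site would use its own model class. */
    static final class Product {
        final String name;
        final double price;

        Product(String name, double price) {
            this.name = name;
            this.price = price;
        }
    }

    // Whitelist of allowed sort keys, so a request parameter is never trusted blindly.
    private static final Map<String, Comparator<Product>> SORTS = Map.of(
            "name", Comparator.comparing((Product p) -> p.name, String.CASE_INSENSITIVE_ORDER),
            "price", Comparator.comparingDouble((Product p) -> p.price));

    /**
     * Returns a sorted copy of the rows. Unknown or missing keys fall back to "name".
     * A controller would pass in the values of the (assumed) sort and direction parameters.
     */
    static List<Product> sorted(List<Product> rows, String sortParam, boolean descending) {
        String key = sortParam == null ? "" : sortParam;
        Comparator<Product> cmp = SORTS.getOrDefault(key, SORTS.get("name"));
        if (descending) {
            cmp = cmp.reversed();
        }
        List<Product> copy = new ArrayList<>(rows); // leave the caller's list untouched
        copy.sort(cmp);
        return copy;
    }

    public static void main(String[] args) {
        List<Product> rows = List.of(new Product("pear", 2.0), new Product("Apple", 3.0));
        for (Product p : sorted(rows, "price", false)) {
            System.out.println(p.name + " " + p.price);
        }
    }
}
```

The column headers then become ordinary links that reload the page with a different sort key, which works with JavaScript disabled; the existing jQuery sorting can stay as an enhancement on top.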
Some sites can degrade gracefully from a faster, slicker version that uses JS, to another that does not. Unfortunately that does not seem the case with your site.
One strategy is to define a redirect in HTML (set to fire after 5-10 seconds) that points to needjs.html, a page that explains to the user that:
Sorry, we do not have the ability to provide this site without JS.
In JS, cancel the redirect.
The majority of sites nowadays presume JavaScript is available. If you want to check how your site behaves, turn off JavaScript in the browser. If you want content that appears only when JavaScript is disabled, put it in a <noscript> tag, but be aware that GoogleBot (SEO) runs without JavaScript and will hit this. How do you make the site function nicely without JavaScript? Build some ninja HTML and CSS and do all your work server-side. But again, since almost every site presumes JavaScript is enabled, users who disable it are already familiar with how broken they've made the web. It may be sufficient to add a <noscript> that includes a message saying this site requires JavaScript.
Is it possible to use Java to build a web browser like Internet Explorer that will open all the web pages and display all the contents?
The only valid answer to that question is:
Yes, it's possible to use Java to build a web browser.
However, a web browser is an exceptionally complex piece of software. Even Google, when building its Google Chrome browser, used existing technology to do it, rather than inventing their own browser from scratch.
If your goal is anything other than building and marketing your own browser, you may want to reconsider what exactly you want to accomplish, in order to find a more direct approach.
I advise you to take a look at the Lobo Browser project, an open-source java-written web browser. Take a look at the source and see how they did it.
Yes, it is possible. JWebPane is a work-in-progress port of WebKit. It is supposed to be included in JDK 7, but I wouldn't hold my breath.
JWebPane browser = new JWebPane();
new JFrame("Browser").add(browser);
browser.load(someURL);
Yes, it's possible, and here's what you would need to start looking at.
First, search for an HTML renderer in Java. An example would be JWebEngine. You can start by manually downloading HTML pages and verifying that you can view them.
Second, you need to handle the networking piece. Read a tutorial on sockets, or use an HTTP Client such as the Apache HTTPClient project.
Edit:
Just to add one more thought, you should be honest with yourself about why you would work on this project. If it's to rebuild IE, FF, that is unrealistic. However, what you might get out of it is learning what the major issues are with browser development, and that may be worthwhile.
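For the networking step mentioned above, here is a minimal sketch of fetching a page's HTML. Note that it swaps in the JDK's own java.net.http client (available since Java 11) instead of Apache HttpClient, purely to avoid an external dependency. The class name, the timeouts and the User-Agent string are assumptions made for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class PageFetcher {
    // One shared client; follows ordinary redirects like a browser would.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    /** Builds a GET request for the given URL (the User-Agent value is made up). */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", "SimpleJavaBrowser/0.1")
                .GET()
                .build();
    }

    /** Downloads the page and returns its body as text; fails on HTTP error codes. */
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                CLIENT.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 400) {
            throw new IOException("HTTP " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass a URL on the command line; the HTML is printed to stdout.
        if (args.length == 1) {
            System.out.println(fetchHtml(args[0]));
        }
    }
}
```

The returned string is what you would hand to the HTML renderer from the first step. Fetching is the easy half; rendering is where most of the work lies.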
Take a look at the JEditorPane class. It can be used to render HTML pages and could form the basis of a simple browser.
Yes. One of the projects in Java After Hours shows you how to build a simple web browser. It's not nearly as full-featured as IE or Firefox of course (it's only one chapter in the book), but it will show you how to get started.
The hardest thing will be the rendering component. Java7 will include JWebPane, that internally uses WebKit. Here you can find some screenshots.
I developed this browser for my college project; it may be helpful for you.
My Button is an open source Java web browser,
developed for school and college projects and for learning purposes.
Download the source code, extract the .zip file, and copy the “mybutton” folder from “parser\mybutton” to C:\
Import the project “omtMyButton” into Eclipse.
Requires Java 6.
Download the .exe and source code:
https://sourceforge.net/projects/omtmybutton/files/
My task is to create a simple web browser in Java.
So far it can only read HTML pages.
I'm using standard JEditorPane component to display webpages.
Now I was wondering whether you could explain how I can display at least some simple pages that contain CSS/JavaScript.
If you could point me to some useful links or appropriate examples I would be very happy.
Well, my advice would be to look at open source rendering engines such as Gecko - https://developer.mozilla.org/en/Gecko_FAQ
You can embed Gecko with Java using the JREX library - http://jrex.mozdev.org/
Starting from scratch with a problem like this is a very big task, and as your username is AmateurProgrammer, I wouldn't recommend it.
There already is some prior art in the Java browser segment.
Concerning JavaScript, you will have to use a JavaScript interpreter written in Java. A well-known one is Rhino (by Mozilla). Integrating it may prove to be an interesting challenge.
Concerning CSS, it seems the question has already been asked ...