I'm writing a program (in Java) that needs to extract links from webpages. I'm using htmlParser (http://htmlparser.sourceforge.net/), but I can only extract HTML links (defined with <a href="...">), and I don't know how to handle JavaScript code to extract links from it. Can you help me?
You can use Rhino together with a DOM environment written in JavaScript.
By the way, that DOM environment was written by John Resig.
HTML Parser from sourceforge is useful. I have used it to parse a whole bunch of HTML already. However, parsing JS is different. Cheers.
This is probably the most comprehensive tool out there: Rhino. Everything you want to do can be done with Rhino.
Related
Reading Jsoup's documentation, I couldn't tell whether Jsoup applies Tidy before parsing an HTML file.
If it does, is it possible to disable Tidy?
Do you know of other Java HTML5 parsers that don't tidy the page source?
Thanks.
This Oracle article might solve your problem. It's a native API, and it does what you want. Simple and effective.
How can I open a website and return some information from it in Java? For example, I want to go to http://xyz.com, enter my family name, and get back my national code.
You can use java.net.HttpURLConnection to connect to a website. For scraping information from the loaded website you can use a Java HTML Parser library (for example JSoup) to be able to traverse through the DOM and/or retrieve relevant pieces of information from the DOM.
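A minimal sketch of the HttpURLConnection part, using only the JDK (Java 10 or later for `URLEncoder.encode(String, Charset)`). The class name `PageFetcher`, the form field `familyName`, and the idea that the site accepts the name as a GET query parameter are all assumptions for illustration; the real site may expect a POST or different field names. The returned HTML would then be handed to a parser such as JSoup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class PageFetcher {

    // Encodes form fields as a URL query string, e.g. "familyName=Smith".
    static String encodeQuery(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Sends a GET request and returns the response body as text (assumes UTF-8).
    static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return in.lines().collect(Collectors.joining("\n"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: PageFetcher <base-url> <family-name>");
            return;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("familyName", args[1]); // hypothetical field name
        String html = fetch(args[0] + "?" + encodeQuery(fields));
        System.out.println(html.length() + " characters received");
    }
}
```

Keeping the query encoding separate from the network call makes it easy to check the request you are about to send before pointing it at a real site.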
With Selenium, a tool for testing web applications, you can do everything you describe. Check its documentation.
Here is an example in Java.
If that site returns information in XML format, then it's possible to do XML parsing to get the result you desire.
SAX is really handy in XML parsing in these cases.
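A minimal sketch of the SAX approach with the JDK's built-in parser, assuming the site returns XML shaped like the sample below. The class name `SaxExtractor` and the element names `person`, `name`, and `code` are made up for illustration, and the handler does not deal with an element nested inside another element of the same name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExtractor {

    // Returns the trimmed text of every element named `tag`, in document order.
    static List<String> textOf(String xml, String tag) {
        List<String> results = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null while inside a matching element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(tag)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so append rather than assign.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(tag) && current != null) {
                    results.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical response; a real site will use its own element names.
        String xml = "<people><person><name>Smith</name><code>12345</code></person></people>";
        System.out.println(textOf(xml, "code"));
    }
}
```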
I need to be able to manipulate the content of an XHTML file (modify some text within it), then write everything back to an XHTML file (possibly the same one) so that I can use it later. Is this possible with Jsoup, or do I need another library or extra code to do it? Thanks.
Yes, this is possible with Jsoup alone. You don't need any extra libraries. Just have a look at the Jsoup Cookbook. If it really is XHTML, you could even do it with any XML DOM implementation. Jsoup doesn't come with a formatter for the output, so you would just call toString on the modified Document with no further control, which may or may not be acceptable.
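For the "any XML DOM implementation" route, here is a rough sketch using only the JDK's DOM parser and an identity Transformer to write the result back out. It is not Jsoup's API. The class name `XhtmlEditor` and the sample markup are made up, and the sketch assumes well-formed input without a DOCTYPE (with one, the parser may try to fetch the DTD).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlEditor {

    // Replaces the text of every <tag> element and returns the serialized document.
    static String replaceText(String xhtml, String tag, String newText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                nodes.item(i).setTextContent(newText);
            }

            // An identity transform serializes the modified DOM back to text.
            Transformer out = TransformerFactory.newInstance().newTransformer();
            out.setOutputProperty(OutputKeys.METHOD, "xml");
            out.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            out.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document; in practice read it from and write it to a file.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><h1>Old title</h1><p>Body text</p></body></html>";
        System.out.println(replaceText(xhtml, "h1", "New title"));
    }
}
```

To work on files, the input string could come from `Files.readString` and the result could go to `Files.writeString`.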
XSLT might be your friend in this case. Give it a Google anyway, but it sounds like a good starting point.
I can read the HTML contents via http (for example, http://www.foo.com) using Java (with URL and BufferedReader classes). However, a couple of them contain JavaScript. My current app cannot process JavaScript.
What's the best way to read HTML content with JavaScript using Java?
I am open to using other languages if it is easier.
Thanks in advance for your help.
UPDATE - Clarification:
Some of the HTML content is generated dynamically using JavaScript. I can see the result (pure HTML, after the JavaScript has run) when viewing the pages in a browser.
On the other hand, when my Java app retrieves the HTML content, the page reports that my app has no JavaScript support.
Ideally, I want to be able to get the same result as on the browser using my Java app.
Thanks for everyone's response.
HtmlUnit has good JavaScript support, and it should parse the HTML (almost) the way a web browser does.
http://htmlunit.sourceforge.net/
http://htmlunit.sourceforge.net/javascript.html
Cobra (http://lobobrowser.org/cobra/getting-started.jsp) will fit your needs.
For just HTML parsing you can use HTMLParser (org.htmlparser). However, from the way you described your problem, it seems you need a browser, because executing JavaScript is totally different from just parsing HTML. Cheers.
Without a doubt, you need to use a Java HTML parser:
Java Open Source HTML Parsers
Which Html Parser is best?
HTML/XML Parser for Java
HTML PARSER in java [closed]
I have to extract some information from a web page, and reformat it for the user.
Since the web page is somewhat regular, I currently use HttpClient to retrieve the HTML as a string, and I extract substrings at given locations that contain the relevant data.
Anyhow I'm wondering if there is a better way, maybe an HTML-aware way. How would you do it?
Cheers
Ideally, you should use a real HTML-parser. I've used Jsoup successfully in the past on Android:
http://jsoup.org/
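To show what "HTML-aware" extraction looks like compared with cutting substrings, here is a small sketch that lists the `href` of every link. It deliberately uses the HTML parser bundled with the JDK (the Swing `ParserDelegator`) as a stand-in so that it runs without extra dependencies; it is not Jsoup's API, and Jsoup is generally more convenient and more tolerant of modern markup. The class name `LinkLister` and the sample HTML are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkLister {

    // Returns the href attribute of every <a> element, in document order.
    static List<String> links(String html) {
        List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // The final argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical page content; in practice this is the string you downloaded.
        String html = "<html><body><a href=\"http://a.example/\">A</a>"
                + "<p>Some text <a href=\"/b\">B</a></p></body></html>";
        for (String href : links(html)) {
            System.out.println(href);
        }
    }
}
```

The same callback style extends to other tags and to text, so the extraction no longer depends on exact character positions in the page.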
I personally like to use Jericho parser: http://jericho.htmlparser.net/docs/index.html
It is easy to use, has many examples on the project's page, and deals well with pure HTML (unclosed tags, etc.).
We've used HTTPUnit to do this in the past.
jsoup.org is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).