Writing an xhtml file after reading its content from Jsoup - java

I need to be able to manipulate the content of an XHTML file (modify some text within it), then write everything back to an XHTML file (it could be the same one) so I can use it later. Is this possible with Jsoup, or do I need another library/code to do it? Thanks.

Yes, this is possible with Jsoup alone; you don't need any extra libraries. Just have a look at the Jsoup Cookbook. If it really is XHTML, you could even do it with any XML DOM implementation. Jsoup doesn't come with a formatter for the output, so you would just toString the modified Document with no further control over the formatting, which may or may not be acceptable.
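A minimal sketch of that read/modify/write round trip with Jsoup; the file name and the <title> element being changed are just placeholders for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XhtmlRoundTrip {
    public static void main(String[] args) throws Exception {
        // Read the XHTML file and parse it with Jsoup's XML parser so the markup stays well-formed.
        String xhtml = new String(Files.readAllBytes(Paths.get("page.xhtml")), StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(xhtml, "", Parser.xmlParser());

        // Modify some text, e.g. replace the contents of the <title> element (hypothetical target).
        doc.select("title").first().text("Updated title");

        // Write everything back out (here to the same file).
        Files.write(Paths.get("page.xhtml"), doc.outerHtml().getBytes(StandardCharsets.UTF_8));
    }
}
```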

XSLT might be your friend in this case. Give it a Google anyway, but it sounds like a good starting point.
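If you do go the XSLT route, the JDK already ships with an XSLT processor; a minimal sketch, where the stylesheet and file names are purely illustrative:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class XsltRewrite {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet and apply it to the XHTML document.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("rewrite.xsl")));   // hypothetical stylesheet
        transformer.transform(new StreamSource(new File("page.xhtml")),
                new StreamResult(new File("page-out.xhtml")));
    }
}
```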

Related

easiest document format for displaying pages containing images and text in java

I want to be able to open documents containing a combination of one or two pictures and text from Java. The documents don't have to be pretty, but I need to be able to switch documents relatively quickly. I'm trying to figure out the easiest way to do this.
I can save the documents in whatever format is easiest for me, for instance HTML or PDF, but it must be reasonably easy to modify the documents or generate new ones. I don't care whether the document is displayed within a Java frame or by an external tool, so long as the tool is common enough to be installed on most operating systems and I can switch documents quickly and without too much hassle. This is an internal tool, so it doesn't have to work at professional-level quality.
Unfortunately, various company limitations make it a real hassle to get approval to use open source packages that haven't been pre-approved. So I can't do the obvious thing and grab an open source implementation of PDF or HTML reader for java.
So, any suggestions on the easiest format for my documents and how to read it?
You can use XHTML. Your document would then be a directory that contains the HTML document and the image files as-is. You do not need anything beyond the JDK to implement this, and any browser can be used to view such a document. Modification is easy too.
Note: by XHTML I mean HTML that can be parsed with a regular XML parser. I think it is the best choice for you.
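For example, here is a minimal sketch of reading such a document with nothing but the JDK's built-in XML parser; the file path and the <title> lookup are just placeholders for illustration:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.File;

public class XhtmlWithJdk {
    public static void main(String[] args) throws Exception {
        // Parse the XHTML document with the XML parser that ships with the JDK.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new File("docs/page.xhtml")); // hypothetical path

        // Read (or modify) content through the standard W3C DOM API.
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```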

Reading HTML+JavaScript using Java

I can read the HTML contents via http (for example, http://www.foo.com) using Java (with URL and BufferedReader classes). However, a couple of them contain JavaScript. My current app cannot process JavaScript.
What's the best way to read HTML content with JavaScript using Java?
I am open to using other languages if that is easier.
Thanks in advance for your help.
UPDATE - Clarification:
Some of the HTML content is generated dynamically using JavaScript. I can see the result (as plain HTML, after the JavaScript has been processed) when viewing the pages in a browser.
On the other hand, when my Java app retrieves the HTML, the result reflects the fact that my app has no JavaScript support.
Ideally, I want to be able to get the same result as on the browser using my Java app.
Thanks for everyone's response.
HtmlUnit has good JavaScript support and should parse the HTML (almost) the way a web browser does.
http://htmlunit.sourceforge.net/
http://htmlunit.sourceforge.net/javascript.html
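A minimal HtmlUnit sketch, assuming a recent HtmlUnit 2.x release and using http://www.foo.com from the question as the placeholder URL:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);

            // HtmlUnit downloads the page and runs its JavaScript before returning it.
            HtmlPage page = client.getPage("http://www.foo.com");

            // asXml() serialises the DOM after the scripts have run,
            // i.e. roughly what you see in the browser after JavaScript processing.
            System.out.println(page.asXml());
        }
    }
}
```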
Cobra (http://lobobrowser.org/cobra/getting-started.jsp) will fit your needs.
For just HTML parsing you can use HTMLParser (org.htmlparser). However, from the way you described your problem, it seems you need a browser, because executing JavaScript is totally different from just parsing it. Cheers.
Without a doubt, you need to use a Java HTML parser:
Java Open Source HTML Parsers
Which Html Parser is best?
HTML/XML Parser for Java
HTML PARSER in java [closed]

Parsing HTML from a web page

I have to extract some information from a web page, and reformat it for the user.
Since the web page is somewhat regular, I currently use HttpClient to retrieve the HTML as a string, and I extract substrings at given locations that hold the relevant data.
Anyhow I'm wondering if there is a better way, maybe an HTML-aware way. How would you do it?
Cheers
Ideally, you should use a real HTML parser. I've used Jsoup successfully in the past on Android:
http://jsoup.org/
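A minimal sketch of that approach; the URL and the CSS selector are only placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupScrape {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one step instead of slicing substrings by offset.
        Document doc = Jsoup.connect("http://www.example.com/page.html").get();

        // Select the relevant data with a CSS selector rather than string positions.
        Element heading = doc.selectFirst("h1");
        if (heading != null) {
            System.out.println(heading.text());
        }
    }
}
```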
I personally like to use Jericho parser: http://jericho.htmlparser.net/docs/index.html
It is easy to use, has plenty of examples on the project's page, and copes well with real-world HTML (unclosed tags, etc.).
We've used HTTPUnit to do this in the past.
jsoup.org is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).

What is the best way to screen scrape poorly formed XHTML pages for a java app

I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work for malformed XHTML, and regex is just a pain.
Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with the text of the links, or ask for all the bold text, etc.
Run the XHTML through something like JTidy, which should give you back valid XML.
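A minimal JTidy sketch, assuming the page has already been saved to a local file (the file name is hypothetical); the cleaned-up W3C DOM it returns can then be queried with XPath or the DOM API:

```java
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

import java.io.FileInputStream;
import java.io.InputStream;

public class TidyClean {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // ask JTidy to produce well-formed XHTML
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        try (InputStream in = new FileInputStream("scraped.html")) { // hypothetical file
            // parseDOM cleans the markup and returns a standard W3C DOM.
            Document doc = tidy.parseDOM(in, null);
            System.out.println(doc.getElementsByTagName("a").getLength() + " links found");
        }
    }
}
```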
You may want to look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a webpage and request all URLs of the page in exactly the manner you describe.
It was very easy to work with - it literally fires up a web browser and gives you back information in a nice form. IE support seemed best, but at least with Watir, Firefox was also supported.
I had some problems with JTidy back in the day. I think it was related to unclosed tags that made JTidy fail; I don't know if that's fixed now. I ended up using something that was a wrapper around TagSoup, although I don't remember the exact project name. There's also HTMLCleaner.
I've used http://htmlparser.sourceforge.net/. It can parse poorly formed HTML and makes data extraction quite easy.

How to parse javascript for links with java?

I'm writing a program (in Java) that needs to extract links from web pages. I'm using htmlParser (http://htmlparser.sourceforge.net/), but I'm only able to extract HTML links (defined with <a href="...">), and I don't know how to handle JavaScript code to extract links from... Can you help me?
You can use Rhino with a DOM environment written in JavaScript.
Incidentally, that DOM environment was written by John Resig.
HTML Parser from sourceforge is useful. I have used it to parse a whole bunch of HTML already. However, parsing JS is different. Cheers.
This is probably the most comprehensive tool out there: Rhino. Everything you want to do can be done with Rhino.
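A minimal Rhino sketch of evaluating script code from Java; the inline script string is a made-up stand-in for JavaScript found in a page:

```java
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoLinks {
    public static void main(String[] args) {
        // Hypothetical snippet standing in for script code scraped from a page.
        String js = "var links = ['http://a.example/', 'http://b.example/']; links.join(',');";

        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();
            // Evaluate the JavaScript and read the result back on the Java side.
            Object result = cx.evaluateString(scope, js, "inline", 1, null);
            for (String link : Context.toString(result).split(",")) {
                System.out.println(link);
            }
        } finally {
            Context.exit();
        }
    }
}
```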
