How to parse and modify an HTML file in Java

I am working on a project in which I need to read an HTML file, identify specific tags, modify the contents of those tags, and create a new HTML file. Is there a library that parses HTML tags and is capable of writing the tags back out to a new file?

Check out http://jsoup.org. It has a friendly DOM-like API, and for simple tasks you don't need to parse the HTML yourself.
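As a rough illustration of that workflow (the file names and the selector are made up for this example, but the calls are standard jsoup API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ModifyHtml {
    public static void main(String[] args) throws Exception {
        // Parse the existing file into a DOM-like Document.
        Document doc = Jsoup.parse(new File("input.html"), "UTF-8");

        // Find the tags you care about and change their contents.
        for (Element heading : doc.select("h1")) {
            heading.text("Updated: " + heading.text());
        }

        // Serialize the modified document to a new file.
        Files.write(Paths.get("output.html"),
                doc.outerHtml().getBytes(StandardCharsets.UTF_8));
    }
}
```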

If you want to modify a web page and return the modified content, I think the best way is to use an XSL transformation.
http://en.wikipedia.org/wiki/XSLT

There are many HTML parsers to choose from. You could use JTidy or NekoHTML, or check out TagSoup.
I usually prefer parsing XHTML with the standard Java XML parsers, but you can't do that for arbitrary HTML.

Look at http://java-source.net/open-source/html-parsers for a list of Java libraries that parse HTML files into Java objects that can be manipulated.
If the HTML files you are working with are well formed (XHTML), then you can also use Java's XML libraries to find particular tags and modify them. The I/O itself is handled by whichever library you use.
If you choose to parse the strings manually, you could use regular expressions to find particular tags and the Java I/O libraries to write out the new HTML documents. But this approach reinvents the wheel, so to speak, because you have to manage tag opening and closing yourself, and all of that is already handled by the existing libraries.
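To illustrate the XML-library route for well-formed XHTML, here is a minimal sketch using only the JDK (the file names and the tag being modified are placeholders):

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.File;

public class ModifyXhtml {
    public static void main(String[] args) throws Exception {
        // Parse the well-formed XHTML file into a W3C DOM.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("input.xhtml"));

        // Modify the tags you are interested in, e.g. all <title> elements.
        NodeList titles = doc.getElementsByTagName("title");
        for (int i = 0; i < titles.getLength(); i++) {
            Element title = (Element) titles.item(i);
            title.setTextContent("New title");
        }

        // Write the modified DOM out as a new file (identity transform).
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(new File("output.xhtml")));
    }
}
```

One caveat: if the XHTML carries a DOCTYPE pointing at the W3C DTDs, the parser may try to fetch them over the network unless you configure an EntityResolver or disable DTD loading.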

Related

Is there an official Java API to parse HTML?

I have HTML files to which I need to add some attributes on the HTML tags using Java, but for some (stupid) reason I can't use third-party libraries. So the question is: is there an official Java API to parse HTML files? If not, what other options do I have? I'm thinking of adding the attributes without parsing the files, but I'm guessing that may cause problems later.
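For what it's worth, the only HTML parser shipped with the JDK itself is the old Swing one in javax.swing.text.html; it only understands roughly HTML 3.2 and exposes a read-only callback model, so writing the modified file back out is still up to you. A minimal sketch (the file name and callback body are illustrative):

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

import java.io.FileReader;
import java.io.Reader;

public class JdkHtmlParse {
    public static void main(String[] args) throws Exception {
        // Callback invoked by the Swing parser for each piece of markup it sees.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                System.out.println("start tag at " + pos + ": " + tag);
            }
        };
        try (Reader reader = new FileReader("input.html")) {
            new ParserDelegator().parse(reader, callback, true);
        }
    }
}
```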

Convert PDF file to a single HTML file

I am trying to convert a PDF document to a single HTML file in java. Most of the converters online converts one PDF file to multiple HTML files. I want to convert the whole PDF to a single HTML file.
Any suggestions?
You could always write some code using the jsoup API to produce a single document that incorporates the body of each of the multiple HTML files. Combining styles and style sheets (CSS) might be a bit more tricky (especially if the original HTML uses 'id' attributes).
Though I find it hard to believe there is not a converter out there in which 'single document' is an option. I recommend searching further.
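A rough sketch of that merging idea with jsoup (the per-page file names are hypothetical; the calls are standard jsoup API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MergeHtml {
    public static void main(String[] args) throws Exception {
        // Start from an empty shell document that will collect the bodies.
        Document merged = Document.createShell("");
        merged.title("Combined document");

        // Hypothetical per-page files produced by the PDF-to-HTML converter.
        String[] parts = { "page-1.html", "page-2.html", "page-3.html" };
        for (String part : parts) {
            Document page = Jsoup.parse(new File(part), "UTF-8");
            // Copy each page's body content into the merged document.
            for (Element child : page.body().children()) {
                merged.body().appendChild(child.clone());
            }
        }

        Files.write(Paths.get("combined.html"),
                merged.outerHtml().getBytes(StandardCharsets.UTF_8));
    }
}
```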
I think it should be possible to parse your PDF document with iText and then generate your HTML file.
I must admit I haven't checked whether it is doable, though.
Have you looked at http://www.jpedal.org/html_index.php, which has an option to write to a single file?

Reading HTML+JavaScript using Java

I can read HTML content over HTTP (for example, from http://www.foo.com) using Java (with the URL and BufferedReader classes). However, a couple of the pages contain JavaScript, and my current app cannot process JavaScript.
What's the best way to read HTML content with JavaScript using Java?
I am open using other languages if it is easier.
Thanks in advance for your help.
UPDATE - Clarification:
Some of the HTML content is generated dynamically by JavaScript. I can see the result (as plain HTML, after the JavaScript has run) when viewing the pages in a browser.
When my Java app retrieves the HTML, however, the JavaScript is not executed, so that dynamically generated content is missing.
Ideally, I want to be able to get the same result as on the browser using my Java app.
Thanks for everyone's response.
HtmlUnit has good JavaScript support and should (almost) parse the HTML the way a web browser does.
http://htmlunit.sourceforge.net/
http://htmlunit.sourceforge.net/javascript.html
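A minimal sketch with HtmlUnit (the URL is the example one from the question; the package names below are from the classic 2.x releases, newer versions have moved to org.htmlunit):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchRenderedHtml {
    public static void main(String[] args) throws Exception {
        // WebClient is HtmlUnit's headless "browser"; JavaScript is enabled by default.
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("http://www.foo.com");
            // asXml() serializes the DOM *after* the scripts have run,
            // i.e. roughly what you would see in a real browser.
            System.out.println(page.asXml());
        }
    }
}
```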
Cobra (http://lobobrowser.org/cobra/getting-started.jsp) will fit your needs
For plain HTML parsing you can use HTMLParser (org.htmlparser). However, from the way you described your problem, it seems you need a browser, because executing JavaScript is totally different from just parsing it. Cheers.
Without a doubt you need to use a Java HTML parser; see:
Java Open Source HTML Parsers
Which Html Parser is best?
HTML/XML Parser for Java
HTML PARSER in java [closed]

Getting the DOM for an xhtml document in Java

I'm building a mini web browser that parses XHTML and Javascript using Javacc and Java, and I need to build the DOM. Is there any tool that can help me get the DOM and manipulate its nodes without having to build it manually as my browser parses the document?
Try using JDOM or dom4j, or read this question about XML parsers for Java.
If you want to handle HTML as found in the wild, try using JTidy, which will attempt to recover badly formatted HTML for you before rendering it to a DOM.
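For example, a minimal JTidy sketch (the file name is illustrative; the setters shown are part of the org.w3c.tidy.Tidy API):

```java
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

import java.io.FileInputStream;
import java.io.InputStream;

public class TidyToDom {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the usual console chatter
        tidy.setShowWarnings(false);
        tidy.setXHTML(true);          // clean the markup up to XHTML

        try (InputStream in = new FileInputStream("messy.html")) {
            // parseDOM repairs the markup and returns a standard W3C DOM
            // that you can walk and manipulate like any XML document.
            Document dom = tidy.parseDOM(in, null);
            System.out.println(dom.getElementsByTagName("a").getLength() + " links found");
        }
    }
}
```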
I'm not sure why you think you need JavaCC to parse an XHTML document. If it's truly valid XHTML, then it's valid XML, and that means that any XML DOM parser will be able to deliver a DOM that you can manipulate. Why not just use the DOM parser that's built into Java or Xerces from Apache or JDOM or DOM4J? Writing your own using JavaCC can be a valuable learning exercise, but I doubt that it'll be better than what you already have available to you.

generating HTML from XSL-FO using Java

I have some PDF files generated from XSL-FO documents, and I now need this content in HTML too. I am using FOP to create the PDF files, but it does not support HTML as an output format.
My question is this: is there a Java library of some sort that can create HTML files from XSL-FO documents, or can I do this by throwing XSLT at it? Can I somehow extend FOP to create this type of output for me?
If XSLT is the only way to go, is there a stylesheet already out there? (I imagine I am not the first dude wanting this.)
Thank you all!
You could use the FO2HTML stylesheet provided by RenderX to convert the XSL-FO into XHTML output. It converts <block> elements into <div>, <inline> into <span>, and so on.
I have used it, and it works great.
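Applying such a stylesheet from Java needs nothing beyond the built-in JAXP transformer; a minimal sketch (the stylesheet and file names stand in for whatever you download from RenderX):

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class FoToHtml {
    public static void main(String[] args) throws Exception {
        // Compile the FO-to-HTML stylesheet (e.g. RenderX's fo2html.xsl).
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("fo2html.xsl")));

        // Run the existing XSL-FO document through it to produce (X)HTML.
        transformer.transform(new StreamSource(new File("document.fo")),
                new StreamResult(new File("document.html")));
    }
}
```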
