Getting the DOM for an XHTML document in Java

I'm building a mini web browser that parses XHTML and JavaScript using JavaCC and Java, and I need to build the DOM. Is there any tool that can help me get the DOM and manipulate its nodes without having to build it manually as my browser parses the document?

Try using JDOM or Dom4J, or read this question about XML parsers for Java.
If you want to handle HTML as found in the wild, try using JTidy, which will attempt to repair badly formatted HTML for you before rendering it to a DOM.

I'm not sure why you think you need JavaCC to parse an XHTML document. If it's truly valid XHTML, then it's valid XML, and that means that any XML DOM parser will be able to deliver a DOM that you can manipulate. Why not just use the DOM parser that's built into Java or Xerces from Apache or JDOM or DOM4J? Writing your own using JavaCC can be a valuable learning exercise, but I doubt that it'll be better than what you already have available to you.
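To illustrate the "DOM parser that's built into Java" option, here is a minimal sketch using only the JDK's javax.xml.parsers classes. The class name XhtmlDomDemo and the sample markup are made up for illustration. The sample deliberately has no DOCTYPE; with a real XHTML DOCTYPE the parser may try to fetch the DTD, so treat this as a starting point rather than a complete solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlDomDemo {

    // Parses well-formed XHTML into a W3C DOM tree using the JDK's built-in parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Collects the text of every <p> element, in document order.
    static List<String> paragraphTexts(Document doc) {
        NodeList paragraphs = doc.getElementsByTagName("p");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            texts.add(paragraphs.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the question.
        Document doc = parse("<html><body><p>First</p><p>Second</p></body></html>");
        // Nodes can be changed in place once the tree exists.
        doc.getElementsByTagName("p").item(0).setTextContent("Changed");
        System.out.println(paragraphTexts(doc));
    }
}
```

The same Document object works with any other W3C DOM API, so nothing here depends on how the tree was produced.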

Related

Can you scrape a CSV using Jsoup?

I am looking for a Java tool to scrape a CSV from a website and then parse the data. Jsoup seems like a viable option. Is there a way to scrape a CSV file and then save the information to a database using Jsoup?
Or is it strictly for scraping HTML code? Thanks.
No, it ain't gonna work. Look at the Jsoup description:
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
What you are asking for is how to parse a CSV file in Java. This question might be helpful for you:
Fast CSV parsing
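For a rough idea of what CSV parsing involves once the file has been downloaded, here is a minimal sketch that splits a single record into fields. The class and method names (CsvLineDemo, splitLine) are made up for illustration. It handles quoted fields and doubled quotes but not records that span several lines, so a dedicated CSV library is probably the safer choice for real data.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineDemo {

    // Splits one CSV record into fields. Supports double-quoted fields and
    // doubled quotes ("") inside them; does not support multi-line records.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample record.
        System.out.println(splitLine("id,\"Doe, Jane\",42"));
    }
}
```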

Retrieving the DOM view of a webpage in Java not just Source

Is it possible to get the DOM view of a page, such as what you see in Chrome when you click Inspect Element, as opposed to View Source? I need to access this through Java, and currently I can only get the source.
Thanks guys.
IMHO you have to follow the links too... so once you have the source, you need to parse it. You can then insert the links' content (like CSS or script) in the original DOM.
HTML can be messy. In the past I've used TagSoup to parse HTML and generate XML in the form of a stream of SAX events, and then used JDOM to build an in-memory DOM-like tree version of the XML, which worked well. Then you can use other libraries like Saxon to execute XPath, XSLT or XQuery against that XML tree.
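As a rough illustration of the last step (running XPath against an XML tree), here is a sketch that uses the JDK's own javax.xml.xpath package instead of JDOM and Saxon. It assumes the markup has already been cleaned up into well-formed XML, for example by a tool like TagSoup. The class name XPathLinksDemo and the sample input are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathLinksDemo {

    // Returns the href value of every <a> element, in document order.
    static List<String> linkTargets(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate(
                    "//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                result.add(hrefs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, already well-formed input.
        String xml = "<html><body><a href=\"/a\">A</a>"
                + "<p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(linkTargets(xml));
    }
}
```

Note that this only queries the parsed source. It does not run scripts, so it will not match what a browser's inspector shows for pages that build their content with JavaScript.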

Writing an xhtml file after reading its content from Jsoup

I need to be able to manipulate the content of an XHTML file (modify some text within it), then write everything back to an XHTML file (possibly the same one) so that I can use it later. Is this possible with Jsoup, or do I need another library or extra code to do it? Thanks.
Yes, this is possible with Jsoup alone; you don't need any extra libraries. Have a look at the Jsoup Cookbook. If it is really XHTML, you could also do it with any XML DOM implementation. Jsoup gives you only limited control over output formatting (through Document.OutputSettings), so you would essentially call toString() on the modified Document, which may or may not be acceptable.
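For the "any XML DOM implementation" route, here is a minimal sketch using only JDK classes: parse, change the text of some elements, then serialize with an identity Transformer. The class and method names (XhtmlRewriteDemo, replaceText) are made up, and the input is assumed to be well-formed XHTML without a DOCTYPE.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlRewriteDemo {

    // Sets the text of every <tagName> element and returns the serialized document.
    // Assumes elements named tagName are not nested inside each other.
    static String replaceText(String xhtml, String tagName, String newText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            for (int i = 0; i < nodes.getLength(); i++) {
                nodes.item(i).setTextContent(newText);
            }
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not rewrite document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String xhtml = "<html><body><h1>Old</h1><p>Kept</p></body></html>";
        System.out.println(replaceText(xhtml, "h1", "New"));
    }
}
```

The returned string can then be saved with something like java.nio.file.Files.write, or the StreamResult can wrap a File directly.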
XSLT might be your friend in this case. Give it a Google anyway, but it sounds like a good starting point.

Reading HTML+JavaScript using Java

I can read the HTML contents via http (for example, http://www.foo.com) using Java (with URL and BufferedReader classes). However, a couple of them contain JavaScript. My current app cannot process JavaScript.
What's the best way to read HTML content with JavaScript using Java?
I am open to using other languages if it is easier.
Thanks in advance for your help.
UPDATE - Clarification:
Some of the HTML content is generated dynamically by JavaScript. I can see the result (pure HTML after the JavaScript has run) when viewing the pages in a browser.
By contrast, when my Java app retrieves the HTML, the page reports that JavaScript is not available in my app.
Ideally, I want to be able to get the same result as on the browser using my Java app.
Thanks for everyone's response.
HtmlUnit has good JavaScript support, and it should parse the HTML (almost) as a web browser would.
http://htmlunit.sourceforge.net/
http://htmlunit.sourceforge.net/javascript.html
Cobra (http://lobobrowser.org/cobra/getting-started.jsp) will fit your needs.
For just HTML parsing you can use HTMLParser (org.htmlparser). However, from the way you described your problem, it seems you need a browser, because executing JavaScript is quite different from just parsing HTML. Cheers.
Without a doubt, you need to use a Java HTML parser:
Java Open Source HTML Parsers
Which Html Parser is best?
HTML/XML Parser for Java
HTML PARSER in java [closed]

How to parse and modify HTML file in Java

I am doing a project wherein I need to read an HTML file and identify specific tags, modify the contents of the tag, and create a new HTML file. Is there a library that parses HTML tags and is capable of writing the tags back to a new file?
Check out http://jsoup.org. It has a friendly DOM-like API, and for simple tasks you don't need to parse the HTML yourself.
If you want to modify a web page and return the modified content, I think the best way is to use an XSL transformation.
http://en.wikipedia.org/wiki/XSLT
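As a sketch of what the XSLT route could look like, the following uses the JDK's javax.xml.transform API with an identity stylesheet plus one overriding template. The class name XsltModifyDemo, the stylesheet, and the replacement text are all made up for illustration, and the approach assumes the input is well-formed (XHTML rather than arbitrary HTML).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltModifyDemo {

    // Identity transform that copies everything, except that the text
    // inside <title> is replaced with a fixed string.
    private static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
            + "<xsl:template match=\"@*|node()\">"
            + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
            + "</xsl:template>"
            + "<xsl:template match=\"title/text()\">Updated title</xsl:template>"
            + "</xsl:stylesheet>";

    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String page = "<html><head><title>Old</title></head>"
                + "<body><p>Body text</p></body></html>";
        System.out.println(transform(page));
    }
}
```

The more specific template wins over the generic copy template, which is why only the title text changes.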
There are plenty of HTML parsers. You could use JTidy or NekoHTML, or check out TagSoup.
I usually prefer parsing XHTML with the standard Java XML parsers, but you can't do this for arbitrary HTML.
Look at http://java-source.net/open-source/html-parsers for a list of Java libraries that parse HTML files into Java objects that can be manipulated.
If the HTML files you are working with are well formed (XHTML), then you can also use the XML libraries in Java to find particular tags and modify them. The I/O itself should be handled by the particular libraries you are using.
If you choose to parse the strings manually, you could use regular expressions to find particular tags and use the Java I/O libraries to write new HTML documents to files. But this method reinvents the wheel, so to speak, because you have to manage tag opening and closing yourself, and all of that is already handled by existing libraries.
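To show what the manual regular-expression approach might look like, and why it is fragile, here is a minimal sketch. The names RegexTagDemo and replaceTagContent are made up. It assumes the target tags are not nested and that attribute values contain no '>' characters; real-world HTML breaks both assumptions often enough that a parser is usually the better choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTagDemo {

    // Replaces the content between each <tag ...> and </tag> pair.
    // Naive: no nesting support, and '>' inside attribute values will confuse it.
    static String replaceTagContent(String html, String tag, String replacement) {
        String name = Pattern.quote(tag);
        Pattern pattern = Pattern.compile(
                "(<" + name + "\\b[^>]*>)(.*?)(</" + name + "\\s*>)",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Keep the original opening and closing tags, swap only the content.
            String rebuilt = matcher.group(1) + replacement + matcher.group(3);
            matcher.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String html = "<h1>Old heading</h1><p class=\"intro\">Old text</p>";
        System.out.println(replaceTagContent(html, "p", "New text"));
    }
}
```

The result string can be written to a new file with the usual java.io or java.nio.file classes.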
