Is it possible to get the DOM view of a page (what you see in Chrome when you click Inspect Element), as opposed to View Source? I need to access this through Java, and currently I can only get the source.
Thanks guys.
IMHO you have to follow the links too... so once you have the source, you need to parse it. You can then insert the linked content (like CSS or scripts) into the original DOM.
HTML can be messy. In the past I've used TagSoup to parse HTML and generate XML in the form of a stream of SAX events, and then used JDOM to build an in-memory DOM-like tree version of the XML, which worked well. Then you can use other libraries like Saxon to execute xpath, xslt or xquery against that XML tree.
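For anyone wanting a concrete starting point, here is a minimal sketch of that pipeline, assuming TagSoup and JDOM 1.x are on the classpath; the HTML string is just a placeholder:

```java
import java.io.StringReader;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;

public class TagSoupToJdom {
    public static void main(String[] args) throws Exception {
        String messyHtml = "<html><body><p>Unclosed paragraph<p>Another one";

        // Tell JDOM's SAXBuilder to use TagSoup as its SAX driver; TagSoup
        // repairs the messy HTML and emits a stream of well-formed SAX events.
        SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
        Document doc = builder.build(new StringReader(messyHtml));

        // doc is now an ordinary JDOM tree that you can hand to Saxon
        // (or any other library) for XPath, XSLT or XQuery.
        System.out.println(doc.getRootElement().getName()); // prints "html"
    }
}
```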
I need to be able to manipulate the content of an XHTML file (modify some text within it), then write everything back to an XHTML file (it could be the same one) so I can use it later. Is this possible with Jsoup, or do I need another library/code to do it? Thanks.
Yes, this is possible with Jsoup alone; you don't need any extra libraries. Just have a look at the Jsoup Cookbook. If it is really XHTML you could even do it with any XML DOM implementation. Note that Jsoup doesn't come with a formatter for the output, so you would just toString() the modified Document with no further control, which may or may not be acceptable.
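As a rough illustration (not the only way to do it), here is a sketch using Jsoup's XML parser; the file names and the h1 selector are just placeholders:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class EditXhtml {
    public static void main(String[] args) throws Exception {
        // Read the XHTML file and parse it with the XML parser so the
        // structure is kept as-is (no HTML-style tag fixing).
        String xhtml = new String(Files.readAllBytes(Paths.get("page.xhtml")),
                StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(xhtml, "", Parser.xmlParser());

        // Modify some text: here every <h1> gets new content.
        for (Element h1 : doc.select("h1")) {
            h1.text("Updated heading");
        }

        // Write everything back out (same file or a new one).
        Files.write(Paths.get("page-modified.xhtml"),
                doc.outerHtml().getBytes(StandardCharsets.UTF_8));
    }
}
```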
XSLT might be your friend in this case. Give it a Google anyway; it sounds like a good starting point.
I'm building a mini web browser that parses XHTML and Javascript using Javacc and Java, and I need to build the DOM. Is there any tool that can help me get the DOM and manipulate its nodes without having to build it manually as my browser parses the document?
Try JDOM or Dom4J, or read this question about XML parsers for Java.
If you want to handle HTML as found in the wild, try using JTidy, which will attempt to recover badly formatted HTML for you before rendering it to a DOM.
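For example, a minimal sketch; the file name is hypothetical and the Tidy settings may need tuning:

```java
import java.io.FileInputStream;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class TidyToDom {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // keep the console output down
        tidy.setShowWarnings(false);

        try (FileInputStream in = new FileInputStream("messy.html")) {
            // parseDOM repairs the markup and returns a standard W3C DOM.
            Document dom = tidy.parseDOM(in, null);
            System.out.println(dom.getDocumentElement().getNodeName()); // "html"
        }
    }
}
```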
I'm not sure why you think you need JavaCC to parse an XHTML document. If it's truly valid XHTML, then it's valid XML, and that means that any XML DOM parser will be able to deliver a DOM that you can manipulate. Why not just use the DOM parser that's built into Java or Xerces from Apache or JDOM or DOM4J? Writing your own using JavaCC can be a valuable learning exercise, but I doubt that it'll be better than what you already have available to you.
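To make that concrete, here is a sketch using nothing but the DOM parser that ships with the JDK (the file name and element lookup are just placeholders):

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ParseXhtml {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XHTML elements live in a namespace

        DocumentBuilder builder = factory.newDocumentBuilder();

        // Any truly valid XHTML document is valid XML, so this just works.
        Document doc = builder.parse(new File("page.xhtml"));

        NodeList paragraphs = doc.getElementsByTagNameNS("*", "p");
        System.out.println("Paragraphs: " + paragraphs.getLength());
    }
}
```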
I am doing a project wherein I need to read an HTML file and identify specific tags, modify the contents of the tag, and create a new HTML file. Is there a library that parses HTML tags and is capable of writing the tags back to a new file?
Check out http://jsoup.org; it has a friendly DOM-like API, and for simple tasks you don't need to traverse the parse tree yourself.
If you want to modify a web page and return the modified content, I think the best way is to use an XSL transformation.
http://en.wikipedia.org/wiki/XSLT
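If you go that route, the Java side is small; here is a minimal sketch using the JAXP transformer (transform.xsl, input.xhtml and output.xhtml are hypothetical file names):

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltModify {
    public static void main(String[] args) throws Exception {
        // transform.xsl would typically be an identity transform plus a
        // template that rewrites the tags you want to change.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("transform.xsl"));

        // Apply the stylesheet to the input and write the modified page out.
        transformer.transform(new StreamSource("input.xhtml"),
                new StreamResult("output.xhtml"));
    }
}
```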
There are plenty of HTML parsers to choose from. You could use JTidy or NekoHTML, or check out TagSoup.
I usually prefer parsing XHTML with the standard Java XML parsers, but you can't do that with arbitrary HTML.
Look at http://java-source.net/open-source/html-parsers for a list of java libraries that parse html files into java objects that can be manipulated.
If the HTML files you are working with are well formed (XHTML), you can also use Java's XML libraries to find particular tags and modify them. The I/O itself should be handled by whichever library you use.
If you choose to parse the strings manually, you could use regular expressions to find particular tags and the java.io classes to write the files and create new HTML documents. But that approach reinvents the wheel: you have to manage tag opening and closing and all of the other things that pre-existing libraries already handle.
I'd like to fetch a web page including images, flash animations and other embedded objects. What's a straightforward way of achieving this?
Writing a web crawler in the Java programming language:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
Use an open source HTML Parser such as HTMLCleaner - http://java-source.net/open-source/html-parsers/htmlcleaner or CyberNekoHtml - http://java-source.net/open-source/html-parsers/nekohtml.
Once you have used a parser to create a representation of the DOM of the web page, you can then load/download images and other embedded objects that exist in the DOM by performing queries on the DOM and extracting relevant src attributes from the HTML elements.
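Roughly like this, as a sketch with HtmlCleaner (the page URL is a placeholder and there is no error handling); the same idea applies to <object> or <embed> elements:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class ImageFetcher {
    public static void main(String[] args) throws Exception {
        URL page = new URL("http://example.com/");

        // Clean the page and parse it into a DOM-like TagNode tree.
        TagNode root = new HtmlCleaner().clean(page);

        // Query the tree for <img> elements and download each src target.
        for (TagNode img : root.getElementsByName("img", true)) {
            String src = img.getAttributeByName("src");
            if (src == null || src.isEmpty()) continue;

            URL imageUrl = new URL(page, src); // resolves relative URLs
            String fileName = imageUrl.getPath()
                    .substring(imageUrl.getPath().lastIndexOf('/') + 1);
            try (InputStream in = imageUrl.openStream()) {
                Files.copy(in, Paths.get(fileName));
            }
        }
    }
}
```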
Try web-harvest.
Is XPath a better way to read a configuration file than DOM or SAX?
If yes:
Why doesn't log4j use XPathExpression to read its configuration file?
If no:
What method should I choose so that I do not have to modify the code if my configuration file changes?
Update:
#kdgregory
Normally you know which parameter you are looking for in a configuration file (often even the complete path to the node). Why not use XPathExpression in that case? Does it make processing slow because the parsing happens behind the scenes every time?
Why doesn't log4j use XPathExpression to read its configuration file?
Partly because XPath wasn't part of the JDK until 1.5, and Log4J's XML configuration predates that release.
But probably more because the Log4J configuration file is a simple hierarchical structure, and it's easy to traverse such a structure and set configuration options. If you look at the source code for org.apache.log4j.xml.DOMConfigurator, you'll see a very short dispatch loop that looks at the element name and hands it off to an object-specific parser.
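Not the actual Log4J source, but the shape of that kind of dispatch loop is roughly this (the element names and handler methods here are purely illustrative):

```java
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ConfigDispatcher {
    // Walk the root element's children and hand each one to a handler
    // chosen by tag name -- no XPath needed for a flat, known structure.
    public void configure(Element root) {
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) continue;

            Element child = (Element) node;
            switch (child.getTagName()) {
                case "appender": parseAppender(child); break;
                case "logger":   parseLogger(child);   break;
                case "root":     parseRoot(child);     break;
                default:         break; // ignore anything unknown
            }
        }
    }

    private void parseAppender(Element el) { /* configure an appender */ }
    private void parseLogger(Element el)   { /* configure a logger */ }
    private void parseRoot(Element el)     { /* configure the root logger */ }
}
```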
Is XPath a better way to read a configuration file than DOM or SAX?
XPath is not a replacement for DOM and SAX, it's an additional layer on top of them. Yes, you can pass any InputSource to XPathExpression, and you can create an InputSource from any InputStream, but then the parsing happens behind the scenes. And if you have to execute multiple XPath expressions over the same file, it makes a lot of sense to parse it once into a DOM.
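For example, with the JDK's own XPath support you parse once and then evaluate as many expressions as you need against the resulting DOM (the file name and paths below are hypothetical):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathOverDom {
    public static void main(String[] args) throws Exception {
        // Parse the file once into a DOM ...
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("config.xml");

        // ... then run any number of XPath expressions against that tree.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String level = xpath.evaluate("/config/logging/@level", doc);
        NodeList appenders = (NodeList) xpath.evaluate(
                "/config/logging/appender", doc, XPathConstants.NODESET);

        System.out.println("level=" + level
                + " appenders=" + appenders.getLength());
    }
}
```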
What method should I choose so that I do not have to modify the code if my configuration file changes?
Configuration files normally change because your code changes, not the other way around: you add a feature that you want to configure, and then write the code to configure it.
So, when should XPath be used? Where does it have an advantage?
One good use for XPath is when you need to extract specific pieces out of a file, particularly if the order in which you extract them does not correspond to the file's document order.
And finally: I strongly recommend using Apache Digester, or any of a number of Java->XML serialization libraries, rather than explicit XPath.
I find XPath useful when I only want a few pieces of information from a complicated document and don't want to bother dealing with the whole structure. For example, you can select elements in a web page with XPath expressions when using the Selenium testing framework.