How to remove the warnings in JTidy in Java

I am using the JTidy parser in Java.
URL url = new URL("http://www.yahoo.com"); // the protocol is required; without it new URL() throws MalformedURLException
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
InputStream in = conn.getInputStream();
doc = new Tidy().parseDOM(in, null);
When I run the last line, "doc = new Tidy().parseDOM(in, null);",
I get warnings such as the following:
Tidy (vers 4th August 2000) Parsing "InputStream"
line 140 column 5 - Warning: <table> lacks "summary" attribute
InputStream: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
InputStream: Document content looks like HTML 4.01 Transitional
1 warnings/errors were found!
These warnings are printed to the console automatically, but I don't want them
to be displayed on my console after running
doc = new Tidy().parseDOM(in, null);
How can I stop these warnings from appearing on the console?

Looking at the Documentation I found a few methods which may do what you want.
There is setShowErrors, setQuiet and setErrout. You may want to try the following:
Tidy tidy = new Tidy();
tidy.setShowErrors(0);
tidy.setQuiet(true);
tidy.setErrout(null);
doc = tidy.parseDOM(in, null);
One of them may be enough already; these were all the options I found. Note that this will simply hide the messages, not do anything about their causes. There is also setForceOutput to get the output even if errors were generated.

If you want to redirect the JTidy warnings to (say) a log4j logger, read this blog entry.
If you simply want them to go away (along with other console output), then use System.setOut() and/or System.setErr() to send the output to a file ... or a black hole.
For JTidy release 8 (or later), the Tidy.setMessageListener(TidyMessageListener) method deals with the messages more gracefully.
Alternatively, you could send a bug report to webmaster@yahoo.com. :-)
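The System.setOut()/System.setErr() suggestion above can be sketched with the JDK alone. The class and method names here (QuietConsole, runSilenced) are made up for illustration. One assumption to check against your JTidy version: Tidy may capture System.err when the object is constructed, so it is safest to create the Tidy instance inside the silenced task.

```java
import java.io.OutputStream;
import java.io.PrintStream;

public class QuietConsole {

    /** Runs the task with System.out and System.err discarded, then restores both. */
    public static void runSilenced(Runnable task) {
        PrintStream originalOut = System.out;
        PrintStream originalErr = System.err;
        PrintStream blackHole = new PrintStream(new OutputStream() {
            @Override
            public void write(int b) {
                // discard every byte
            }
        });
        System.setOut(blackHole);
        System.setErr(blackHole);
        try {
            task.run();
        } finally {
            // always restore the real console, even if the task throws
            System.setOut(originalOut);
            System.setErr(originalErr);
        }
    }

    public static void main(String[] args) {
        runSilenced(() -> System.err.println("this warning is swallowed"));
        System.out.println("console output is back");
    }
}
```

This silences everything written during the task, not only JTidy's messages, so it is a blunt instrument compared with the setter-based answers.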

Writer out = new NullWriter();
PrintWriter dummyOut = new PrintWriter(out);
tidy.setErrout(dummyOut);
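NullWriter is not a JDK class; the snippet above presumably relies on the one from Apache Commons IO (an assumption, since the answer does not say). A JDK-only sketch of the same idea, assuming Java 11 or later for Writer.nullWriter(); the class name DiscardingErrout is hypothetical:

```java
import java.io.PrintWriter;
import java.io.Writer;

public class DiscardingErrout {

    /** Returns a PrintWriter whose output is thrown away (Writer.nullWriter() needs Java 11+). */
    public static PrintWriter create() {
        return new PrintWriter(Writer.nullWriter());
    }

    public static void main(String[] args) {
        PrintWriter dummyOut = create();
        dummyOut.println("line 140 column 5 - Warning: ..."); // goes nowhere
        // With JTidy on the classpath, the writer would be handed over like this:
        // tidy.setErrout(dummyOut);
        System.out.println("had error: " + dummyOut.checkError());
    }
}
```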

Looking at the documentation I found another method that seems a bit nicer to me in this particular case: setShowWarnings(boolean). This method hides the warnings, but errors will still be reported.
For more info look here:
http://www.docjar.com/docs/api/org/w3c/tidy/Tidy.html#setShowWarnings(boolean)

I think this is the nicest solution, based on the answer of Joost:
Tidy tidy = new Tidy();
tidy.setShowErrors(0);
tidy.setShowWarnings(false);
tidy.setQuiet(true);
All three are necessary.

Related

Parsing data to DocumentHTML

Everything is okay when I read the data from the web page using InputStreamReader.
I have a problem parsing the data into an HTMLDocument.
The main reason is that the HTML contains some special characters that are used incorrectly.
There is a doubled & sign ("&&"), and I believe that is what causes the code to crash.
My code looks like this:
URL url = new URL(PageUrl);
URLConnection conn = url.openConnection();
// ... omitted ...
// parsing
HTMLDocument doc = (HTMLDocument)db.parse(conn.getInputStream());
Since I am making an Android application, I don't use the standard parsing functions, because the HTMLDocument object would be too large.
I found many existing examples of parsing HTML, such as using jsoup, but they are not what I want.
I want to write my own parsing code so that the HTMLDocument object stays small.
Why don't you use one of the HTML parsers already available for Java?
They have community support, so they are the best option.
Open Source HTML Parsers in Java

why can't I load this URL with JDOM? Browser spoofing?

I'm writing some code to load and parse HTML docs from the web.
I'm using JDOM like so:
SAXBuilder parser = new SAXBuilder();
Document document = (Document)parser.build("http://www.google.com");
Element rootNode = document.getRootElement();
/* and so on ...*/
It works fine like that. However, when I change the URL to some other web sites, like "http://www.kijiji.com", for example, the parser.build(...) line hangs.
Any idea why it hangs? I'm wondering if it might be because kijiji knows I'm not a "real" web browser. Perhaps I have to spoof my HTTP request so it looks like it's coming from IE or something like that?
Any ideas are useful, thanks!
Rob
I think a few things may be going on here. The first issue is that you cannot parse regular HTML with JDOM; HTML is not XML.
Secondly, when I run kijiji.com through JDOM I get an immediate HTTP_400 response.
When I parse google.com I get an immediate XML error about well-formedness.
If you happen to be parsing XHTML at some point, though, you will likely run into this problem: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/
XHTML has a doctype that references other doctypes, etc. These each take 30 seconds to load from w3c.org.
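One way to avoid the slow DTD downloads described above is to give the parser an EntityResolver that answers every external lookup with an empty document. The sketch below uses the JDK's DocumentBuilder rather than JDOM so that it runs without extra libraries; JDOM's SAXBuilder should accept the same kind of resolver through its setEntityResolver method, but check that against your JDOM version. The class name OfflineXhtmlParse is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class OfflineXhtmlParse {

    /** Parses well-formed XHTML without fetching the DTD named in its doctype. */
    public static Document parse(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer every external entity request (the DTD included) with empty content,
            // so nothing is downloaded from w3.org.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>Hello</p></body></html>";
        System.out.println(parse(xhtml).getDocumentElement().getNodeName());
    }
}
```

The trade-off: with the DTD replaced by nothing, named entities such as &nbsp; are no longer defined, so documents that use them will fail to parse unless a local copy of the DTD is supplied instead.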

Java getting source code from a website

Once again I have a problem where I can't find the source code because it's hidden or something. When my Java program indexes the page it finds everything but the info I need. I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I run the code below:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
I really have no idea how to get the info that I need...
This is probably because those tags are injected into the DOM dynamically by JavaScript and are not part of the initial HTML, which is all you can fetch with a URLConnection. They might even be created via AJAX. You will need a JavaScript interpreter on your server if you want to fetch them.
If they don't show up in the page source, they're likely being added dynamically by Javascript code. There's no way to get them from your server-side script short of including a javascript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
Try Using Jsoup.
Document doc = Jsoup.parse(new URL("http://..."), 10000); // URL left elided as in the original; 10000 is the timeout in milliseconds
System.out.print(doc.toString());
Assuming that the issue is that the "missing" content is being injected using javascript, the following SO Question is pertinent:
What's a good tool to screen-scrape with Javascript support?

Generating Output in JAVA

Can we generate an .html document using Java? Usually we get output in the command prompt when we run Java programs. I want to generate output in .html or .doc format. Is there a way to do it in Java?
For HTML
Just write data into .html file (they are simply text files with .html extension), using raw file io operation
For Example :
StringBuilder sb = new StringBuilder();
sb.append("<html>");
sb.append("<head>");
sb.append("<title>Title Of the page");
sb.append("</title>");
sb.append("</head>");
sb.append("<body> <b>Hello World</b>");
sb.append("</body>");
sb.append("</html>");
FileWriter fstream = new FileWriter("MyHtml.html");
BufferedWriter out = new BufferedWriter(fstream);
out.write(sb.toString());
out.close();
For word document
This thread answers it
HTML is simply plain text with a bunch of tags, as others have answered. My suggestion, if you are doing something that is more complex than just outputting a basic HTML snippet, is to use a template engine such as StringTemplate.
StringTemplate lets you create a text file (actually, a HTML file) that looks like this:
<html>
<head>
<title>Example</title>
</head>
<body>
<p>Hello $name$</p>
</body>
</html>
That is your template. Then in your Java code, you would fill in the $name$ placeholder like this and then output the resulting HTML page:
StringTemplate page = group.getInstanceOf("page");
page.setAttribute("name", "World");
System.out.println(page.toString());
This will print out the following result on your screen:
<html>
<head>
<title>Example</title>
</head>
<body>
<p>Hello World</p>
</body>
</html>
Of course, the above example Java code isn't the complete code, but it illustrates how to use a template that's still valid HTML (makes it easier to edit in a HTML editor) while keeping your Java code simple (by avoiding having a bunch of HTML tags in your System.out.println statements).
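To make the placeholder idea concrete without the library, here is a deliberately naive JDK-only sketch. It is not StringTemplate's API or implementation; DollarTemplate and fill are hypothetical names, and the method does no HTML escaping of the inserted values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DollarTemplate {

    /** Replaces each $key$ placeholder with its value; unknown placeholders are left untouched. */
    public static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("$" + entry.getKey() + "$", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("name", "World");
        System.out.println(fill("<p>Hello $name$</p>", values));
    }
}
```

A real template engine adds what this leaves out: loops, conditionals, escaping and loading templates from files.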
As for MS Office .doc format, that is more complex and you can look into Apache POI for that.
I have felt that need in the past and ended up developing a Java library, HtmlFlow (deployed at the Maven Central Repository), that provides a simple API to write HTML in a fluent style. Check it here: https://github.com/fmcarvalho/HtmlFlow.
You can use HtmlFlow with, or without, data binding, but here I present an example of binding the properties of a Task object into HTML elements. Consider a Task Java class with three properties: Title, Description and a Priority and then we can produce an HTML document for a Task object in the following way:
import htmlflow.HtmlView;
import model.Priority;
import model.Task;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
public class App {
    private static HtmlView<Task> taskDetailsView(){
        HtmlView<Task> taskView = new HtmlView<>();
        taskView
            .head()
            .title("Task Details")
            .linkCss("https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css");
        taskView
            .body().classAttr("container")
            .heading(1, "Task Details")
            .hr()
            .div()
            .text("Title: ").text(Task::getTitle)
            .br()
            .text("Description: ").text(Task::getDescription)
            .br()
            .text("Priority: ").text(Task::getPriority);
        return taskView;
    }
    public static void main(String [] args) throws IOException{
        HtmlView<Task> taskView = taskDetailsView();
        Task task = new Task("Special dinner", "Have dinner with someone!", Priority.Normal);
        try(PrintStream out = new PrintStream(new FileOutputStream("Task.html"))){
            taskView.setPrintStream(out).write(task);
            Runtime.getRuntime().exec("explorer Task.html");
        }
    }
}
Output is just output. What it means and how you use it is entirely up to you.
If you System.out.println("<p>Hello world!</p>"); you just produced HTML.
The .doc format is obviously a bit trickier, since it's not a simple matter of putting in tags, but there are libraries to get the job done. Google can suggest more than a few.
HTML is just plain text. Just write the HTML code to a file or standard out.
Word files are more complicated. Have a look at libraries such as Apache POI.
I don't know why you say this:
Usually we get output in the command prompt when we run Java programs.
I've been running some Java programs today, but they do not do anything with a command prompt. If you use System.out.println, yes, but most advanced programs have a little bit more for communication. Like an interface :)
What you want to do is look into file handling. Open (or create) a file, write content to that file, and close it. Then you have a file. You can write anything you want to that file, so obviously also something that would make it an HTML or a doc file. It's easy to find how-tos on file writing.
Check this:
try {
    BufferedWriter out = new BufferedWriter(new FileWriter("outfilename.html"));
    out.write("aString"); // Here you pass your output
    out.close();
} catch (IOException e) {
    e.printStackTrace(); // report write failures instead of silently swallowing them
}
You will need to import BufferedWriter, FileWriter and IOException, which are in java.io.
The "aString" should be a String variable that stores HTML code or doc XML.
Sure.
The general approach: You create the document in memory, namely in a StringBuilder and write the content of that builder to a file.
StringBuilder htmlBuilder = new StringBuilder();
htmlBuilder.append("<html><body>");
htmlBuilder.append("Hello world!");
htmlBuilder.append("</body></html>\n");
FileWriter writer = new FileWriter(System.getProperty("user.home") + "/hello.html");
writer.write(htmlBuilder.toString());
writer.close();
Put this in a main method, execute and you'll find a html file in your home directory
To generate an HTML document, you should write to a file. Since HTML is a text format, you would write to a text file. Doing this requires these classes
java.io.File - this represents locations in your file system
java.io.FileWriter - this establishes a connection from your program to a file
java.io.BufferedWriter - this enables buffered writing of text, which is much faster
java.io.IOException - one of these nasties is thrown if there is a problem writing to the file. It is a checked (vs. runtime) exception and you must handle it.
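A minimal sketch that puts those four classes together. The class name HtmlFileWriter and the file name hello.html are made up for illustration; wrapping the IOException in an UncheckedIOException is one possible way of handling it, not the only one.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;

public class HtmlFileWriter {

    /** Writes the given HTML text to the file, replacing any previous content. */
    public static void write(File target, String html) {
        // try-with-resources closes (and therefore flushes) the writer even if write() fails
        try (BufferedWriter out = new BufferedWriter(new FileWriter(target))) {
            out.write(html);
        } catch (IOException e) {
            // the checked exception has to be handled somewhere; here it is rethrown unchecked
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File target = new File("hello.html"); // hypothetical file name
        write(target, "<html><body><p>Hello world!</p></body></html>\n");
        System.out.println("Wrote " + target.getAbsolutePath());
    }
}
```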
The Head First Java book covers these classes very well and shows you how to use them. To use them you must first know about exception handling, which is also covered in Head First Java.
I hope this gets you started.
A very straightforward and reliable approach to creation of plain HTML may be based on a SAX handler and default XSLT transformer, the latter having intrinsic capability of HTML output:
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("myfile.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory saxFactory =
(SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler tHandler = saxFactory.newTransformerHandler();
tHandler.setResult(streamResult);
Transformer transformer = tHandler.getTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
writer.write("<!DOCTYPE html>\n");
writer.flush();
tHandler.startDocument();
tHandler.startElement("", "", "html", new AttributesImpl());
tHandler.startElement("", "", "head", new AttributesImpl());
tHandler.startElement("", "", "title", new AttributesImpl());
tHandler.characters("Hello".toCharArray(), 0, 5);
tHandler.endElement("", "", "title");
tHandler.endElement("", "", "head");
tHandler.startElement("", "", "body", new AttributesImpl());
tHandler.startElement("", "", "p", new AttributesImpl());
tHandler.characters("5 > 3".toCharArray(), 0, 5); // note '>' character
tHandler.endElement("", "", "p");
tHandler.endElement("", "", "body");
tHandler.endElement("", "", "html");
tHandler.endDocument();
writer.close();
Note that the XSLT transformer relieves you of the burden of escaping special characters like >, since it takes care of that itself.
It is also easy to wrap SAX methods like startElement() and characters() in something more convenient to one's taste.
It may also be worth noting that working without templates and without building the document in memory (e.g. as a DOM) gives you more freedom in terms of the resulting document size.
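As an illustration of the wrapping idea, here is one possible thin wrapper around TransformerHandler. HtmlSaxWriter and its open/text/close methods are hypothetical names, not an existing API. The sketch sets the output method before calling setResult, on the assumption that some implementations create their serializer at that point.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class HtmlSaxWriter {
    private static final AttributesImpl NO_ATTRIBUTES = new AttributesImpl();
    private final TransformerHandler handler;

    public HtmlSaxWriter(TransformerHandler handler) {
        this.handler = handler;
    }

    /** Emits a start tag without attributes. */
    public HtmlSaxWriter open(String tag) throws SAXException {
        handler.startElement("", "", tag, NO_ATTRIBUTES);
        return this;
    }

    /** Emits text; the serializer escapes characters such as < and &. */
    public HtmlSaxWriter text(String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
        return this;
    }

    /** Emits the matching end tag. */
    public HtmlSaxWriter close(String tag) throws SAXException {
        handler.endElement("", "", tag);
        return this;
    }

    /** Builds a one-paragraph HTML page and returns it as a string. */
    public static String render(String paragraphText) {
        StringWriter out = new StringWriter();
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.setResult(new StreamResult(out));
            handler.startDocument();
            new HtmlSaxWriter(handler)
                    .open("html").open("body").open("p")
                    .text(paragraphText)
                    .close("p").close("body").close("html");
            handler.endDocument();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("could not serialize HTML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("a < b & c"));
    }
}
```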
If you have some document-like (structured) data, I suggest using a DOM (document object model) and then converting it to the desired format (XML, HTML, doc, whatever). But if you just have some application output, you can easily wrap it in HTML. This does not have to happen within Java: you can also store your program's output in a plain text file and convert it to HTML later (add body, paragraphs, headers and other HTML elements).

How to use HTML Parser to get complete information about all tags in the HTML page

I am using HTML Parser to develop an application.
The code below does not retrieve the entire set of tags in the page.
Some tags are missed, along with their attributes and text content.
Please help me understand why this is happening, or suggest another way.
URL url = new URL("...");
PrintWriter pw=new PrintWriter(new FileWriter("HTMLElements.txt"));
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument)htmlKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
parser.parse(br, callback, true);
ElementIterator iterator = new ElementIterator(htmlDoc);
Element element;
while ((element = iterator.next()) != null)
{
    AttributeSet attributes = element.getAttributes();
    Enumeration e=attributes.getAttributeNames();
    pw.println("Element Name :"+element.getName());
    while(e.hasMoreElements())
    {
        Object key=e.nextElement();
        Object val=attributes.getAttribute(key);
        int startOffset = element.getStartOffset();
        int endOffset = element.getEndOffset();
        int length = endOffset - startOffset;
        String text=htmlDoc.getText(startOffset, length);
        pw.println("Key :"+key.toString()+" Value :"+val.toString()+"\r\n"+"Text :"+text+"\r\n");
    }
}
}
I am doing this fairly reliably with HTML Parser, (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.
General idea:
You first have to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. Example:
<span class="price"> $7.95</span>
if you are looking for this "price", then you are interested in span tags with class "price".
HTML Parser has a filter-by-attribute functionality.
filter = new HasAttributeFilter("class", "price");
When you parse using a filter, you get a list of Nodes on which you can use instanceof to determine whether they are of the type you are interested in. For span you'd do something like
if (node instanceof Span) // or any other supported element.
See list of supported tags here.
An example with HTML Parser to grab the meta tag that has description about a site:
Tag Sample :
<meta name="description" content="Amazon.com: frankenstein: Books"/>
Code:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;
public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        //<meta name="description" content="Some texte about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);
            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");
                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
As per the comments:
actually i want to extract information such as product name,price etc of all products listed in an online shopping site such as amazon.com How should i go about it???
Step 1: read their robots file. It's usually found at the root of the site, for example http://amazon.com/robots.txt. If the URL you're trying to access is covered by a Disallow for a User-Agent of *, then stop here. Contact them, explain to them in detail what you're trying to do and ask them for ways/alternatives/webservices which can provide you with the information you need. Otherwise you're violating the law and you risk getting blacklisted by the site and/or by your ISP, or worse. If the URL is not disallowed, proceed to step 2.
Step 2: check whether the site in question already has a public webservice available, which is much easier to use than parsing a whole HTML page. Using a webservice, you'll get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any webservices. If there is none, proceed to step 3.
Step 3: learn how HTML/CSS/JS work, learn how to work with webdeveloper tools like Firebug, learn how to interpret the HTML/CSS/JS source you see by rightclick > View Page Source. My bet that the site in question uses JS/Ajax to load/populate the information you'd like to gather. In that case, you'll need to use a HTML parser which is capable of parsing and executing JS as well (the one you're using namely doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve and if that is allowed and if there aren't more-easy-to-use webservices available.
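The check in Step 1 can be roughed out in code. The sketch below is a deliberately simplified reading of robots.txt: it only looks at Disallow lines in groups addressed to User-agent: * and treats each value as a plain path prefix, ignoring Allow lines, wildcards and crawl delays. RobotsCheck and its method names are hypothetical, and a real crawler should use a complete parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsCheck {

    /** Collects the Disallow path prefixes from groups that apply to User-agent: *. */
    public static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;      // does the current group address "*"?
        boolean lastWasAgent = false; // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                applies = lastWasAgent ? (applies || star) : star;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // an empty Disallow value means "nothing is disallowed"
                if (applies && field.equals("disallow") && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** True if the path starts with any Disallow prefix that applies to all user agents. */
    public static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: BadBot\nDisallow: /\n";
        System.out.println(isDisallowed(robots, "/private/report.html"));
        System.out.println(isDisallowed(robots, "/public/index.html"));
    }
}
```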
You seem to be using the Swing HTMLDocument. It may not be the smartest idea ever.
I believe you would get better results using, for example, NekoHTML.
Another simple library you can use is JTidy, which can clean up your HTML before parsing it.
Hope this helps.
http://sourceforge.net/projects/jtidy/
Ciao!
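If you want to stay with the JDK's Swing parser, one thing to try is implementing the ParserCallback yourself instead of reading the tags back out of an HTMLDocument. This rests on an assumption, not a confirmed diagnosis: that the tags go missing while the document model is built rather than in the parser itself. TagLister is a hypothetical class name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TagLister {

    /** Returns one entry per start tag or empty tag the Swing parser reports, with its attributes. */
    public static List<String> listTags(String html) {
        final List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(describe(tag, attrs));
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // empty elements such as <br>, and tags the parser does not recognise
                tags.add(describe(tag, attrs));
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    // Copies the tag name and attributes into a string right away,
    // since the parser may reuse the attribute set it passes in.
    private static String describe(HTML.Tag tag, MutableAttributeSet attrs) {
        StringBuilder sb = new StringBuilder(tag.toString());
        Enumeration<?> names = attrs.getAttributeNames();
        while (names.hasMoreElements()) {
            Object name = names.nextElement();
            sb.append(' ').append(name).append('=').append(attrs.getAttribute(name));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String tag : listTags("<html><body><p class=\"x\">Hi</p><br></body></html>")) {
            System.out.println(tag);
        }
    }
}
```

Like the other parsers discussed here, this only sees the HTML as delivered; content added later by JavaScript will not appear.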
