NekoHTML with multiple values in an attribute - java

I have a problem with NekoHTML.
I am trying to parse HTML pages in which some tags have a class attribute with multiple values, such as class = 'ui-dialog ui-widget ui-widget-content ui-corner-all ui-draggable'. The problem is that NekoHTML extracts only the first value of the class, so I end up with class = 'ui-dialog'.
The code I use is as follows, but I cannot understand why it behaves this way, or whether I can change it.
URL newURL = new URL(url);
Reader source = new BufferedReader(new InputStreamReader(newURL.openStream(), "UTF-8"));
doc = parseHTML(source);
EDIT
public Document parseHTML(Reader in) throws SAXException, IOException {
DOMParser parser = new DOMParser();
parser.setFeature("http://xml.org/sax/features/namespaces", false);
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true);
XMLDocumentFilter[] filters = { new Purifier() };
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);
parser.parse(new InputSource(in));
return parser.getDocument();
}
An example page is http://www.123premier.com/used-cars/2257/hyundai-matrix/#
Using the XPath expression //A[@class='addthis_button_facebook at300b'] finds nothing, but if I use Firebug in Firefox it gives me the correct node.
The code for the XPath expression is the following:
XPath xPath = XPathFactory.newInstance().newXPath();
DTMNodeList result = (DTMNodeList) xPath.evaluate("//A[@class='addthis_button_facebook at300b']", doc, XPathConstants.NODESET);
Thank you very much.
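One thing worth checking is whether the comparison on the class attribute is too strict. An exact test such as @class='a b' only matches when the whole attribute string is identical, including token order and spacing, and Firebug shows the live DOM after scripts have run, which may differ from the raw HTML that the parser downloads. The sketch below is only an illustration under assumptions: it uses the JDK's own XML parser on a small well-formed stand-in document instead of NekoHTML, and the sample markup and the helper names hasClass and countMatches are hypothetical. It shows the common XPath 1.0 idiom for matching a single class token, and it casts the result to the standard NodeList interface rather than an implementation class.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassTokenXPath {

    // Builds an XPath 1.0 expression that matches elements whose class
    // attribute contains the given token, whatever its position.
    // (Hypothetical helper, not part of any library.)
    static String hasClass(String element, String token) {
        return "//" + element
                + "[contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')]";
    }

    // Parses the XML string and returns how many nodes the expression selects.
    static int countMatches(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // NodeList is the portable result type for XPathConstants.NODESET.
        NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document (an assumption, not the real page).
        String xml = "<HTML><BODY>"
                + "<A class='addthis_button_facebook at300b'>share</A>"
                + "<A class='at300b  addthis_button_facebook'>reordered</A>"
                + "<A class='other'>no match</A>"
                + "</BODY></HTML>";

        // Exact string comparison: only the first anchor matches.
        System.out.println(countMatches(xml, "//A[@class='addthis_button_facebook at300b']"));

        // Token comparison: the first two anchors match.
        System.out.println(countMatches(xml, hasClass("A", "addthis_button_facebook")));
    }
}
```

If the token-based expression finds the node while the exact one does not, the attribute in the parsed document probably holds a different string than the one shown in the browser. Dumping the attribute value from the parsed DOM would confirm that.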

Related

Get xml attribute value from string [duplicate]

This question already has answers here:
In Java, how do I parse XML as a String instead of a file?
(6 answers)
Closed 9 years ago.
I'm trying to create a RESTful web service using a Java Servlet. The problem is that I have to send a request to a web server via the POST method. The content of this request is not a parameter but the body itself.
So I basically send from ruby something like this:
url = URI.parse(@host)
req = Net::HTTP::Post.new('/WebService/WebServiceServlet')
req['Content-Type'] = "text/xml"
# req.basic_auth 'account', 'password'
req.body = data
response = Net::HTTP.start(url.host, url.port){ |http| puts http.request(req).body }
Then I have to retrieve the body of this request in my servlet. I use the classic readline, so I have a string. The problem is when I have to parse it as XML:
private void useXML( final String soft, final PrintWriter out) throws ParserConfigurationException, SAXException, IOException, XPathExpressionException, FileNotFoundException {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true); // never forget this!
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(soft);
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr = xpath.compile("//software/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
out.println(nodes.item(i).getNodeValue());
}
}
The problem is that builder.parse() accepts parse(File f), parse(InputSource is) and parse(InputStream is).
Is there any way I can turn my XML string into an InputSource or something like that? I know it may be a dumb question, but Java is not my thing; I'm forced to use it and I'm not very skilled.
You can create an InputSource from a string by way of a StringReader:
Document doc = builder.parse(new InputSource(new StringReader(soft)));
With your string, use something like :
ByteArrayInputStream input =
new ByteArrayInputStream(yourString.getBytes(perhapsEncoding));
builder.parse(input);
ByteArrayInputStream is an InputStream.
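For completeness, here is a self-contained sketch of the StringReader approach combined with the XPath query from the question. The class name XmlStringParser, the method name softwareNames and the sample payload are assumptions made for illustration. Note that DocumentBuilder also has a parse(String) overload, but it treats the string as a URI to load rather than as XML content, which is why passing the XML text directly does not work. Going through a StringReader also avoids having to choose a byte encoding, since the string is already decoded characters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlStringParser {

    // Parses XML held in a String and returns the text of every <software> element.
    static List<String> softwareNames(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);

        // Wrap the String in a StringReader so the parser reads it as content,
        // not as a URI (which is what parse(String) would do).
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//software/text()", doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical request body.
        String body = "<request><software>foo</software><software>bar</software></request>";
        for (String name : softwareNames(body)) {
            System.out.println(name);
        }
    }
}
```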

DomParser from string XML getting null document

I'm trying to parse an XML string using DOMParser, but when I try to get the document it shows [#document: null] and it doesn't seem to contain the data from the XML I passed in.
The code is something like this:
Document doc = null;
DOMParser parser = new DOMParser();
logger.debug("Parsing");
InputSource IS = new InputSource(new StringReader(nameFile));
parser.parse(IS);
doc = parser.getDocument();
NodeList NL = doc.getElementsByTagName("element");
The problem starts at doc = parser.getDocument().
It returns [#document: null], so the NodeList can't find the element that I'm looking for.
My XML is quite big. It contains around 50K characters.
My question is: what could be causing this problem?
For your information, the same code works on OAS with JDK 1.4; I'm now moving the application to WebLogic 12c with JDK 1.6.
Thanks in advance.
UPDATED:
Sorry for not mentioning the data type of nameFile. nameFile is XML data in string format.
UPDATED2:
I've tried with a simple xml but no luck.
Example:
1st Example: this string is without any space ->
nameFile = "<?xml version='1.0'?><company><staff id='1001'><firstname>yong</firstname><lastname>mook kim</lastname><nickname>mkyong</nickname><salary>100000</salary></staff><staff id='2001'><firstname>low</firstname><lastname>yin fong</lastname><nickname>fong fong</nickname><salary>200000</salary></staff></company>";
2nd Example:
nameFile = "<message>Hello</message>";
Neither of these works. It always returns [#document: null].
I assume 'nameFile' in your code snippet is a string! The following works perfectly for me.
String nameFile= "<message>HELLO World</message>";
DOMParser parser = new DOMParser();
try {
parser.parse(new InputSource(new java.io.StringReader(nameFile)));
Document doc = parser.getDocument();
String message = doc.getDocumentElement().getTextContent();
System.out.println(message);
} catch (SAXException e) {
// handle SAXException
} catch (IOException e) {
// handle IOException
}
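It may also help to know that seeing [#document: null] does not by itself mean the parse failed. As far as I can tell, Xerces-style DOM implementations print a node as its node name followed by its node value, and for a Document node the DOM specification defines the name as "#document" and the value as null. The sketch below, which uses the JDK's DocumentBuilder rather than Xerces' DOMParser and a hypothetical class name, checks the document through its root element instead of through printing it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DocumentNodeDemo {

    // Parses XML held in a String into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<message>Hello</message>");

        // A Document node always has the name "#document" and a null value,
        // even when the parse succeeded and the tree is fully populated.
        System.out.println(doc.getNodeName());
        System.out.println(doc.getNodeValue());

        // Inspect the content through the root element instead.
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

If getDocumentElement() returns the expected root but getElementsByTagName still finds nothing, the cause is more likely the tag name being looked up (for example its case or namespace) than the parse itself.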

Convert html String to org.w3c.dom.Document in Java

To convert from HTML String to
org.w3c.dom.Document
I'm using
jtidy-r938.jar
here is my code:
public static Document getDoc(String html) {
Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setWraplen(Integer.MAX_VALUE);
// tidy.setPrintBodyOnly(true);
tidy.setXmlOut(false);
tidy.setShowErrors(0);
tidy.setShowWarnings(false);
// tidy.setForceOutput(true);
tidy.setQuiet(true);
Writer out = new StringWriter();
PrintWriter dummyOut = new PrintWriter(out);
tidy.setErrout(dummyOut);
tidy.setSmartIndent(true);
ByteArrayInputStream inputStream = new ByteArrayInputStream(html.getBytes());
Document doc = tidy.parseDOM(inputStream, null);
return doc;
}
But sometimes the library works incorrectly and some tags are lost.
Please suggest a good open-source library for this task.
Thanks very much!
You don't say why the library sometimes fails to give a good result.
Nevertheless, I regularly work with HTML files that I have to extract data from, and the main problem I run into is that some tags are invalid, for example because they are not closed.
The best solution I have found is the HtmlCleaner API.
It makes your HTML well formed.
Transforming it into a W3C Document or another strict format is then easier.
With HtmlCleaner, you could do something like:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);
The DomSerializer used here is the one from HtmlCleaner.

how to get only <html> data </html> from internet using java?

I'm using the following code to retrieve data from the internet, but I also get HTTP headers, which are useless to me.
URL url = new URL(webURL);
URLConnection conn = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
How can I get only the HTML data, without any headers?
Regards
Retrieving and parsing a document using TagSoup:
Parser p = new Parser();
SAX2DOM sax2dom = new SAX2DOM();
URL url = new URL("http://stackoverflow.com");
p.setContentHandler(sax2dom);
p.parse(new InputSource(new InputStreamReader(url.openStream())));
org.w3c.dom.Node doc = sax2dom.getDOM();
The TagSoup and SAX2DOM packages are:
import org.ccil.cowan.tagsoup.Parser;
import org.apache.xalan.xsltc.trax.SAX2DOM;
Writing the contents to System.out:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
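These come from javax.xml.transform and its subpackages: DOMSource is in javax.xml.transform.dom and StreamResult is in javax.xml.transform.stream, so importing javax.xml.transform.* alone is not enough.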
You are retrieving the correct data using URLConnection. However, if you want to read or access a particular HTML tag, you have to use an HTML parser. I suggest jsoup.
Example:
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("http://your_url/").get();
org.jsoup.nodes.Element head=doc.head(); // <head> tag content
org.jsoup.nodes.Element body=doc.body(); // <body> tag content
System.out.println(doc.text()); // Only text inside the <html>
You can parse the complete data to search for the string and accept only the data between the html tags.
Do you mean you want to translate HTML into text? If so, you can use org.htmlparser.*. Take a look at http://htmlparser.sourceforge.net/

Building a DOM Document with tagsoup

I cannot make TagSoup work. I'm using the code that follows, but when I print the Node returned by the parser (the line with System.err.println(doc);) , I always get "[#document: null]".
I don't know how to find the bug in this code or, whichever it is, the origin of the problem. Please help!
public final Document parseDOM(final File fileToParse) {
Parser p = new Parser();
SAX2DOM sax2dom = null;
org.w3c.dom.Node doc = null;
try {
URL url = new URL("http://stackoverflow.com/");
p.setFeature(Parser.namespacesFeature, false);
p.setFeature(Parser.namespacePrefixesFeature, false);
sax2dom = new SAX2DOM();
p.setContentHandler(sax2dom);
p.parse(new InputSource(new InputStreamReader(url.openStream())));
doc = sax2dom.getDOM();
System.err.println(doc);
} catch (Exception e) {
// TODO handle exception
e.printStackTrace();
}
return doc.getOwnerDocument();
}
From the documentation on getOwnerDocument:
When this node is a Document or a DocumentType which is not used with any Document yet, this is null.
Since getDOM in your case should return a Document, you could simply cast the return value or change the type of doc to Document.
Your parser is working, but you just can't print out a node like that. The easiest way to print out a node and all its children is to use an XML Serializer like this:
Writer out = new StringWriter();
XMLSerializer serializer = new XMLSerializer(out, new OutputFormat());
serializer.serialize(doc);
System.out.println(out.toString());
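XMLSerializer comes from Xerces, so it needs that library on the classpath. If you would rather stay with the JDK, an identity Transformer can do much the same job. The following is a minimal sketch under that assumption; the class name DomPrinter and its two helper methods are hypothetical, and the input is a small XML string rather than a TagSoup result, since any org.w3c.dom.Node can be passed to toXml in the same way.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomPrinter {

    // Serializes a DOM node and all of its children to a String.
    static String toXml(Node node) throws Exception {
        // A Transformer created without a stylesheet copies input to output.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Small helper so the example is runnable without TagSoup.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a><b>text</b></a>");
        // Prints the markup of the tree instead of the node's toString() form.
        System.out.println(toXml(doc));
    }
}
```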
