Convert HTML String to org.w3c.dom.Document in Java

To convert from an HTML String to org.w3c.dom.Document I'm using jtidy-r938.jar. Here is my code:
import java.io.ByteArrayInputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public static Document getDoc(String html) {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);
    // tidy.setPrintBodyOnly(true);
    tidy.setXmlOut(false);
    tidy.setShowErrors(0);
    tidy.setShowWarnings(false);
    // tidy.setForceOutput(true);
    tidy.setQuiet(true);
    Writer out = new StringWriter();
    PrintWriter dummyOut = new PrintWriter(out);
    tidy.setErrout(dummyOut);
    tidy.setSmartIndent(true);
    // Use an explicit charset so the bytes match the declared input encoding
    ByteArrayInputStream inputStream =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    Document doc = tidy.parseDOM(inputStream, null);
    return doc;
}
But sometimes the library works incorrectly and some tags are lost.
Please suggest a good open-source library for this task.
Thanks very much!

You don't say why the library sometimes gives a bad result.
Nevertheless, I work very regularly with HTML files that I must extract data from, and the main problem I encounter is that some tags are not valid, for example because they are not closed.
The best solution I have found is the HtmlCleaner API (htmlCleaner website).
It allows you to make your HTML file well formed.
Transforming it into a W3C Document or another strict format is then much easier.
With HtmlCleaner, you could do something like this:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);
See DomSerializer in the HtmlCleaner API for the details.
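Put together, a helper equivalent to the getDoc method above might look like this (a sketch assuming HtmlCleaner 2.x on the classpath; the class name is just for illustration):

HtmlToDom.java
import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

public class HtmlToDom {

    // Repairs the HTML (unclosed or invalid tags) and returns an org.w3c.dom.Document.
    public static Document getDoc(String html) throws ParserConfigurationException {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        TagNode node = cleaner.clean(html);
        return new DomSerializer(props).createDOM(node);
    }
}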

Related

How to mock original XML output to use in test class in Java?

I am writing JUnit test cases for an application, and again and again I have to build a dummy Document object, set the root element, and add every other element to match my original response so I can pass it to Mockito.when(m1()).thenReturn(respDoc). The code is something like this:
Document respDoc = new Document();
Element elem = new Element("RootElement");
respDoc.setRootElement(elem);
Element node1 = new Element("Nodes").addContent("FirstNode");
elem.addContent(node1);
// ... and so on
Sometimes the response XML is so big that it takes all of my time just to create this Document object. Is there any way I can just pass the whole XML and get the desired Document back?
Please let me know if there's any confusion.
Thanks in advance!
Assuming Document is from the JDOM library, you need the following dependency:
<dependency>
    <groupId>org.jdom</groupId>
    <artifactId>jdom2</artifactId>
    <version>2.0.6.1</version>
</dependency>
Then you parse it:
import org.jdom2.Document;
import org.jdom2.input.SAXBuilder;
String FILENAME = "src/main/resources/staff.xml";
SAXBuilder sax = new SAXBuilder();
Document doc = sax.build(new File(FILENAME));
Mockito.when(m1()).thenReturn(doc);
Reference: https://mkyong.com/java/how-to-read-xml-file-in-java-jdom-example/
Well the documentation states you can use an InputStream:
http://www.jdom.org/docs/apidocs/org/jdom2/input/SAXBuilder.html
Example:
String yourXmlString = "<RootElement><Nodes>FirstNode</Nodes></RootElement>";
InputStream stringStream = new ByteArrayInputStream(yourXmlString.getBytes("UTF-8"));
SAXBuilder sax = new SAXBuilder();
Document doc = sax.build(stringStream);
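Combining the two, a small test helper (a sketch; the name fromString and the sample XML are made up for illustration) lets you build the stub Document from a string and hand it straight to Mockito:

import java.io.IOException;
import java.io.StringReader;

import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;

public class TestXml {

    // Builds a JDOM Document directly from an XML string, handy for stubbing responses in tests.
    public static Document fromString(String xml) throws JDOMException, IOException {
        return new SAXBuilder().build(new StringReader(xml));
    }
}

Usage in the test would then be a single line, for example:
Mockito.when(m1()).thenReturn(TestXml.fromString("<RootElement><Nodes>FirstNode</Nodes></RootElement>"));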

Link XML and XSD using java

I'm trying to write the header for an XML file so it would be something like this:
<file xmlns="http://my_namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://my_namespace file.xsd">
However, I can't seem to find out how to do it using the Document class in Java. This is what I have:
public void exportToXML() {
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder;
    try {
        dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.newDocument();
        doc.setXmlStandalone(true);
        doc.createTextNode("<file xmlns=\"http://my_namespace\"\n" +
                "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n" +
                "xsi:schemaLocation=\"http://my_namespace file.xsd\">");
        Element mainRootElement = doc.createElement("MainRootElement");
        doc.appendChild(mainRootElement);
        for (int i = 0; i < tipoDadosParaExportar.length; i++) {
            mainRootElement.appendChild(criarFilhos(doc, tipoDadosParaExportar[i]));
        }
        Transformer tr = TransformerFactory.newInstance().newTransformer();
        tr.transform(new DOMSource(doc),
                new StreamResult(new FileOutputStream(filename)));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I tried writing it to the file using createTextNode, but it didn't work either; it only writes the XML version declaration before the elements.
I would appreciate it if you could help me. Have a nice day.
The createTextNode() method is only suitable for creating text nodes; it's not suitable for creating elements. You need to use createElement() for this. If you're doing this by building a tree, then you need to build nodes; you can't write lexical markup.
I'm not sure what MainRootElement is supposed to be; you've only given a fragment of your desired output so it's hard to tell.
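That said, if you stay with the standard DOM API, the declarations in the question's header are produced with the namespace-aware methods rather than with text nodes. A rough sketch (not part of the original answer; it reuses the dBuilder from the question, and the names come from the desired output):

Document doc = dBuilder.newDocument();

// Create the root element in the default namespace; the serializer emits xmlns="..." for it.
Element file = doc.createElementNS("http://my_namespace", "file");

// Declare the xsi prefix and add the xsi:schemaLocation attribute.
file.setAttributeNS("http://www.w3.org/2000/xmlns/",
        "xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance");
file.setAttributeNS("http://www.w3.org/2001/XMLSchema-instance",
        "xsi:schemaLocation", "http://my_namespace file.xsd");

doc.appendChild(file);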
Creating a DOM tree and then serializing it is a pretty laborious way of constructing an XML file. Using something like an XMLEventWriter is easier. But to be honest, I got frustrated by all the existing approaches and wrote a new library for the purpose as part of Saxon 10. It's called simply "Push", and looks something like this:
Processor proc = new Processor();
Serializer serializer = proc.newSerializer(new File(fileName));
Push push = proc.newPush(serializer);
Document doc = push.document(true);
doc.setDefaultNamespace("http://my_namespace");
Element root = doc.element("root")
.attribute(new QName("xsi", "http://www.w3.org/2001/XMLSchema-instance", "schemaLocation"),
"http://my_namespace file.xsd");
doc.close();

XML parsing of a .txt file printing nothing on the console [duplicate]

This question already has answers here:
How can I efficiently parse HTML with Java?
(3 answers)
Closed 6 years ago.
I want to parse a simple web site and scrape information from it.
I used to parse XML files with DocumentBuilderFactory; I tried to do the same thing for the HTML file, but it always gets into an infinite loop.
URL url = new URL("http://www.deneme.com");
URLConnection uc = url.openConnection();
InputStreamReader input = new InputStreamReader(uc.getInputStream());
BufferedReader in = new BufferedReader(input);
String inputLine;
FileWriter outFile = new FileWriter("orhancan");
PrintWriter out = new PrintWriter(outFile);
while ((inputLine = in.readLine()) != null) {
    out.println(inputLine);
}
in.close();
out.close();
File fXmlFile = new File("orhancan");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
NodeList prelist = doc.getElementsByTagName("body");
System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?
There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Or if you want the body:
Elements body = doc.select("body");
Or if you want all links:
Elements links = doc.select("body a");
You no longer need to get connections or handle streams. Simple. If you have ever used jQuery then it is very similar to that.
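For reference, a self-contained sketch along those lines (the URL comes from the question; everything else is illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Scraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one step; JSoup repairs invalid HTML for you.
        Document doc = Jsoup.connect("http://www.deneme.com").get();

        // Select every anchor inside the body and print its text and absolute URL.
        Elements links = doc.select("body a");
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}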
Definitely JSoup is the answer. ;-)
HTML is not always valid, well-formed XML. Try a dedicated HTML parser instead of an XML parser. There are a couple of different ones available:
http://java-source.net/open-source/html-parsers

CSS parser parsing string content

I am trying to use CSS Parser in a Java project to extract the CSS rules/DOM from a string of text input.
All the examples that I have come across take the CSS file as input. Is there a way to bypass the file reading and work with the string content of the CSS file directly?
The class that I am working on only gets the string content of the CSS file, and all the reading has already been taken care of.
Right now I have this, where 'cssfile' is the file path of the CSS file being parsed.
InputStream stream = oParser.getClass().getResourceAsStream(cssfile);
InputSource source = new InputSource(new InputStreamReader(stream));
CSSOMParser parser = new CSSOMParser();
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
System.out.println("Number of rules: " + ruleList.getLength());
Reference link
A workaround that I found was to create a Reader using a StringReader with the contents and set it as the character stream of the InputSource, but there should be a better way to do this:
InputSource inputSource = new InputSource();
Reader characterStream = new StringReader(cssContent);
inputSource.setCharacterStream(characterStream);
CSSStyleSheet stylesheet = cssParserObj.parseStyleSheet(inputSource, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
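For completeness, a self-contained sketch of the string-based approach (assuming the CSS Parser library is on the classpath; the sample stylesheet is made up):

import java.io.StringReader;

import org.w3c.css.sac.InputSource;
import org.w3c.dom.css.CSSRule;
import org.w3c.dom.css.CSSRuleList;
import org.w3c.dom.css.CSSStyleSheet;

import com.steadystate.css.parser.CSSOMParser;

public class CssFromString {
    public static void main(String[] args) throws Exception {
        String cssContent = "h1 { color: red; } p { margin: 0; }";   // sample content

        // Parse directly from the string, no file involved.
        CSSOMParser parser = new CSSOMParser();
        InputSource source = new InputSource(new StringReader(cssContent));
        CSSStyleSheet sheet = parser.parseStyleSheet(source, null, null);

        // Walk the resulting rule list.
        CSSRuleList rules = sheet.getCssRules();
        for (int i = 0; i < rules.getLength(); i++) {
            CSSRule rule = rules.item(i);
            System.out.println(rule.getCssText());
        }
    }
}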

