setting namespace true while parsing in dom4j - java

I am parsing an XML document using dom4j as below:
SAXReader reader = new SAXReader();
document = reader.read("C:/test.xml");
But it does not keep the namespaces that were there when I write the XML back out as below:
FileOutputStream fos = new FileOutputStream("c:/test.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(document);
writer.flush();
How can I do this using dom4j? I am using dom4j because it is easy to work with.

I don't agree. This snippet
System.out.println(new SAXReader()
.read(new ByteArrayInputStream("<a:c xmlns:a='foo'/>"
.getBytes(Charset.forName("utf-8")))).getRootElement()
.getNamespaceURI());
will print
foo
Your problem is that the SAXReader#read(String) method takes a system ID argument, not a file name. Instead, try feeding the reader a File, an InputStream, or a URL:
reader.read(new File("C:/test.xml"))
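For completeness, here is a minimal round-trip sketch along those lines; the file path is illustrative, and with the document read from a File the namespace declarations should survive the write:
import java.io.File;
import java.io.FileOutputStream;
import org.dom4j.Document;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;

public class Dom4jRoundTrip {
    public static void main(String[] args) throws Exception {
        // Read from a File, not a bare String (a String is treated as a system ID).
        SAXReader reader = new SAXReader();
        Document document = reader.read(new File("C:/test.xml"));

        // Write it back out pretty-printed; the element namespaces are kept.
        try (FileOutputStream fos = new FileOutputStream("C:/test.xml")) {
            XMLWriter writer = new XMLWriter(fos, OutputFormat.createPrettyPrint());
            writer.write(document);
            writer.flush();
        }
    }
}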

Related

CSS parser parsing string content

I am trying to use CSS Parser in a Java project to extract the CSS rules/DOM from a String of the text input.
All the examples that I have come across take the CSS file as input. Is there a way to bypass the file reading and work with the string content of the CSS file directly?
The class that I am working on receives only the string content of the CSS file; all the reading has already been taken care of.
Right now I have this, where 'cssfile' is the file path of the CSS file being parsed:
InputStream stream = oParser.getClass().getResourceAsStream(cssfile);
InputSource source = new InputSource(new InputStreamReader(stream));
CSSOMParser parser = new CSSOMParser();
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
System.out.println("Number of rules: " + ruleList.getLength());
Reference link
A workaround I found was to create a Reader using a StringReader over the contents and set it as the character stream of the InputSource. But there should be a better way to do this:
InputSource inputSource = new InputSource();
Reader characterStream = new StringReader(cssContent);
inputSource.setCharacterStream(characterStream);
CSSStyleSheet stylesheet = cssParserObj.parseStyleSheet(inputSource, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
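For what it's worth, SAC's InputSource also has a constructor that takes a Reader directly, so the workaround can be shortened a little. A minimal sketch; the imports assume the steadystate cssparser distribution of CSSOMParser, and cssContent stands in for your string:
import java.io.StringReader;
import org.w3c.css.sac.InputSource;
import org.w3c.dom.css.CSSRuleList;
import org.w3c.dom.css.CSSStyleSheet;
import com.steadystate.css.parser.CSSOMParser;

public class CssFromString {
    public static void main(String[] args) throws Exception {
        String cssContent = "h1 { color: red; } p { margin: 0; }";

        // The InputSource(Reader) constructor sets the character stream for you.
        InputSource source = new InputSource(new StringReader(cssContent));
        CSSOMParser parser = new CSSOMParser();
        CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);

        CSSRuleList ruleList = stylesheet.getCssRules();
        System.out.println("Number of rules: " + ruleList.getLength());
    }
}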

Which is the best way to write an XML Document to a file in Java?

I am trying to write an XML file. I was able to create the Document, and now I want to write it to a file with indent support. Currently my code looks like this.
Which is the better technique for parsing XML and writing it to a file?
public void writeXmlToFile(Document dom) throws IOException {
    OutputFormat format = new OutputFormat(dom);
    format.setIndenting(true);
    XMLSerializer serializer = new XMLSerializer(
            new FileOutputStream(new File("sample.xml")), format);
    serializer.serialize(dom);
}
or is using transformer a better approach.
public void writeXMLToFile(Document dom) throws TransformerException, IOException {
    TransformerFactory transFact = TransformerFactory.newInstance();
    Transformer trans = transFact.newTransformer();
    trans.setOutputProperty(OutputKeys.ENCODING, "utf-8");
    trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    trans.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
    Writer writer = new FileWriter(new File("sample.xml"));
    StreamResult result = new StreamResult(writer);
    DOMSource source = new DOMSource(dom);
    trans.transform(source, result);
    writer.close();
}
What is the difference between the two approaches? And which of these techniques provide better performance?
To answer your question, I would suggest a third way: the W3C DOM Load and Save API. The code is self-explanatory.
DOMImplementationLS ls = (DOMImplementationLS)
DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
// Gets a basic document from string.
LSInput input = ls.createLSInput();
String xml = "<bookstore city='shanghai'><a></a><b/></bookstore>";
InputStream istream = new ByteArrayInputStream(xml.getBytes("UTF-8"));
input.setByteStream(istream);
LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = parser.parse(input);
// Creates a LSSerializer object and saves to file.
LSSerializer serializer = ls.createLSSerializer();
serializer.getDomConfig().setParameter("format-pretty-print", true);
LSOutput output = ls.createLSOutput();
OutputStream ostream = new FileOutputStream("c:\\temp\\foo.xml");
output.setByteStream(ostream);
serializer.write(document, output);
Unlike XMLSerializer, which is more or less pre-standard, this approach is preferable because it is supported by all compliant implementations. Performance largely depends on the vendor implementation, though.
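If you already have a Document in memory, as in the question, only the save half of the API is needed. A trimmed sketch under the same assumptions; the method signature and output path are illustrative:
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

public final class DomLsWriter {
    public static void writeXmlToFile(Document dom, String path) throws Exception {
        DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");

        // Pretty-print and stream the existing DOM straight to the file.
        LSSerializer serializer = ls.createLSSerializer();
        serializer.getDomConfig().setParameter("format-pretty-print", true);

        LSOutput output = ls.createLSOutput();
        try (OutputStream ostream = new FileOutputStream(path)) {
            output.setByteStream(ostream);
            serializer.write(dom, output);
        }
    }
}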

Problems parsing a table inside an RTF file using Apache Tika

I'm trying to parse an RTF file using Apache Tika. Inside the file there is a table with several columns.
The problem is that the parser writes out the result without any information about which column each value was in.
What I'm doing right now is:
AutoDetectParser adp = new AutoDetectParser(tc);
Metadata metadata = new Metadata();
String mimeType = new Tika().detect(file);
metadata.set(Metadata.CONTENT_TYPE, mimeType);
BodyContentHandler handler = new BodyContentHandler();
InputStream fis = new FileInputStream(file);
adp.parse(fis, handler, metadata, new ParseContext());
fis.close();
System.out.println(handler.toString());
It works, but I also need the structural meta-information, such as which column a value came from.
Is there already a handler which outputs something like HTML reflecting the structure of the RTF file that was read?
I would suggest that rather than asking Tika for the plain text version and then wondering where all your nice HTML information has gone, you instead just ask Tika for the document as XHTML. You'll then be able to process that to find the information you want about your RTF file.
If you look at the Tika Examples or the Tika Unit Tests, you'll see this same pattern for an easy way to get the XHTML output:
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();
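For reference, here is a self-contained sketch of that pattern with the parser and input wired up; it assumes Tika's AutoDetectParser, and the RTF file name is illustrative:
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;

public class RtfToXhtml {
    public static void main(String[] args) throws Exception {
        // Serialize the SAX events Tika emits into an XHTML string.
        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.setResult(new StreamResult(sw));

        // Let Tika detect the content type and parse the file.
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream input = Files.newInputStream(Paths.get("table.rtf"))) {
            parser.parse(input, handler, metadata, new ParseContext());
        }

        // Inspect the XHTML to see what structure the parser emitted for the table.
        System.out.println(sw.toString());
    }
}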

How to write contents of Document Object to String in NekoHTML?

I am using NekoHTML to parse the contents of an HTML file.
Everything goes okay except for extracting the contents of the Document object to a String.
I've tried using
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
But nothing is returned.
The problem was in Oracle App Server 10.3.1.4: http://m-hewedy.blogspot.com/2011/04/oracle-application-server-overrides.html
Possible solution:
// Parse with NekoHTML
DOMParser parser = new DOMParser();
parser.parse(archivo);
// Serialize with Xerces
OutputFormat format = new OutputFormat(parser.getDocument());
format.setIndenting(true);
// To print the XML to the console:
// XMLSerializer serializer = new XMLSerializer(System.out, format);
// Save the XML into a String variable
OutputStream outputStream = new ByteArrayOutputStream();
XMLSerializer serializer = new XMLSerializer(outputStream, format);
// Serialize the document
serializer.serialize(parser.getDocument());
String xmlText = outputStream.toString();
System.out.println(xmlText);
// To write to a file instead, use a FileOutputStream rather than System.out:
// XMLSerializer serializer = new XMLSerializer(new FileOutputStream(new File("book.xml")), format);
URL: http://totheriver.com/learn/xml/xmltutorial.html#6.2
See "e) Serialize DOM to FileOutputStream" there to generate the XML file "book.xml".
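Since XMLSerializer is deprecated in recent Xerces releases, one alternative worth trying is to serialize the NekoHTML DOM with the standard LSSerializer instead. A minimal sketch; the input file name is illustrative:
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class NekoToString {
    public static void main(String[] args) throws Exception {
        // Parse the HTML file with NekoHTML into a DOM Document.
        DOMParser parser = new DOMParser();
        parser.parse("page.html");
        Document document = parser.getDocument();

        // Serialize the DOM back to a String with the DOM Load and Save API.
        DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        LSSerializer serializer = ls.createLSSerializer();
        String xmlText = serializer.writeToString(document);
        System.out.println(xmlText);
    }
}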

Java, XML DocumentBuilder - setting the encoding when parsing

I'm trying to save a tree (extends JTree) which holds an XML document to a DOM object, having changed its structure.
I have created a new document object, traversed the tree to retrieve the contents successfully (including the original encoding of the XML document), and now have a ByteArrayInputStream which has the tree contents (XML document) with the correct encoding.
The problem is when I parse the ByteArrayInputStream the encoding is changed to UTF-8 (in the XML document) automatically.
Is there a way to prevent this and use the correct encoding as provided in the ByteArrayInputStream?
It's also worth adding that I have already used the
transformer.setOutputProperty(OutputKeys.ENCODING, encoding) method to retrieve the right encoding.
Any help would be appreciated.
Here's an updated answer, since OutputFormat is deprecated:
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(document), new StreamResult(writer));
String output = writer.getBuffer().toString().replaceAll("\n|\r", "");
The second part will return the XML document as a String:
// Read XML
String xml = "xml"
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(xml)));
// Append formatting
OutputFormat format = new OutputFormat(document);
if (document.getXmlEncoding() != null) {
format.setEncoding(document.getXmlEncoding());
}
format.setLineWidth(100);
format.setIndenting(true);
format.setIndent(5);
Writer out = new StringWriter();
XMLSerializer serializer = new XMLSerializer(out, format);
serializer.serialize(document);
String result = out.toString();
I solved it after a lot of trial and error.
I was using
OutputFormat format = new OutputFormat(document);
but changed it to
OutputFormat format = new OutputFormat(d, encoding, true);
and this solved my problem.
Here encoding is whatever I set it to be, and true specifies whether indenting is enabled.
Note to self: read more carefully. I had looked at the Javadoc hours ago; if only I'd read it more carefully.
This worked for me and is very simple. No need for a transformer or output formatter:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(inputStream);
is.setEncoding("ISO-8859-1"); // set your encoding here
Document document = builder.parse(is);
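Putting the two halves together, here is a short sketch that parses with an explicit encoding and writes the document back out declaring the same encoding; the encoding and file names are illustrative:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EncodingRoundTrip {
    public static void main(String[] args) throws Exception {
        String encoding = "ISO-8859-1";

        // Parse, telling the parser which encoding the bytes use.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document;
        try (InputStream in = new FileInputStream("in.xml")) {
            InputSource is = new InputSource(in);
            is.setEncoding(encoding);
            document = builder.parse(is);
        }

        // Write back out, declaring the same encoding in the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        try (OutputStream out = new FileOutputStream("out.xml")) {
            transformer.transform(new DOMSource(document), new StreamResult(out));
        }
    }
}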
