JDOM, umaluts and UTF-8

JDOM, umaluts and UTF-8 - java

I have issue with using utf-8 with JDOM parser. My code won't write encoded in utf-8.
Here's code:
import java.io.FileWriter;
import java.io.IOException;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
public class Jdom2 {
public static void main(String[] args) throws JDOMException, IOException {
String textValue = null;
try {
SAXBuilder sax = new SAXBuilder();
Document doc = sax.build("path/to/file.xml");
Element rootNode = doc.getRootElement();
Element application = rootNode.getChild("application");
Element de_DE = application.getChild("de_DE");
textValue = de_DE.getText();
System.out.println(textValue);
application.getChild("de_DE").setText(textValue);
Format format = Format.getPrettyFormat();
format.setEncoding("UTF-8");
XMLOutputter xmlOutput = new XMLOutputter(format);
xmlOutput.output(doc, new FileWriter("path/to/file.xml"));
} catch (IOException io) {
io.printStackTrace();
} catch (JDOMException e) {
e.printStackTrace();
}
}
}
So line System.out.println(textValue); prints text "Mit freundlichen Grüßen", which is exactly what is read from xml file, but when it writing down, I get "Mit freundlichen Gr����"
Here's my xml file
<?xml version="1.0" encoding="UTF-8"?>
<applications>
<application id="default">
<de_DE>Mit freundlichen Grüßen</de_DE>
</application>
</applications>
How will I achieve writing "Mit freundlichen Grüßen" instead of "Mit freundlichen Gr����" ?
Thanks in advance!

This is probably the issue:
xmlOutput.output(doc, new FileWriter("path/to/file.xml"));
FileWriter always uses the platform default encoding. If that's not UTF-8, you've got problems.
I suggest you use a FileOutputStream instead, and let JDOM do the right thing. (Also, I strongly suspect you should keep a reference to the stream so you can close it in a finally block...)

I would totally go for #Jon Skeet answer, but if the code is in the 3rd party library that you cannot modify there are two other options:
Run your JVM with additional environment variable specifying file.encoding:
java -Dfile.encoding=UTF-8 … Jdom2
You can try to modify the Charset.defaultCharset which stores encoding used by FileWritter. Keep in mind that it's a very hacky solution that might break at some point and I would avoid it at all:
Field field = null;
for (Field f : Charset.class.getDeclaredFields()) {
if (f.getName().equals("defaultCharset")) {
field = f;
}
}
field.setAccessible(true);
Field modifiersField = Field.class.getDeclaredField("modifiers");
modifiersField.setAccessible(true);
modifiersField.setInt(field, field.getModifiers() & ~Modifier.FINAL);
field.set(null, StandardCharsets.UTF_8);
One last remark. Both solution might introduce side effects in other parts of your application because you are changing default encoding that might be used by some libraries.

Related

Converting XML not work UTF-8 XMLOutputter Java

I allready saw other questions about the same problem but i still get an error. Hier is the small part of code where i try to modify exosting xml files. But it modifies some characters in text.
import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import java.io.FileOutputStream;
import java.io.IOException;
public class ModyfyXml {
public static void main(String[] args) throws JDOMException, IOException {
try {
SAXBuilder sax = new SAXBuilder();
Document doc = sax.build("F:\\c\\test.xml");
XMLOutputter xmlOutput = new XMLOutputter();
Format format = Format.getPrettyFormat();
format.setEncoding("UTF-8");
xmlOutput.setFormat(format);
xmlOutput.output(doc, (new FileOutputStream("F:\\c\\test2.xml")));
}catch (IOException io) {
io.printStackTrace();
} catch (JDOMException e) {
e.printStackTrace();
}
}}
Hier a small xml file that i try to modify (in this case just copy)
<?xml version="1.0" encoding="utf-8"?><page>
䕶法喇嘛所居此處𡸁仲無妻室亦降神附體
</page>
After program start i get the following:
<?xml version="1.0" encoding="UTF-8"?>
<page>䕶法喇嘛所居此處𡸁仲無妻室亦降神附體</page>
Some chineese characters can't be right transformed

Dang I never noticed this bug in JDOM 2.
You will have the same results with any non-BMP character. You can try with the emoji mania of these latest years and see you get the same results.
It happens because of the escape strategy automatically set for UTF-whatever encodings. What it does is rather wrong.
That will be fixed if you replace the strategy with one that doesn't escape anything beside XML reserved chars:
format.setEscapeStrategy((c) -> false);

parsing an XML file with DOM error [duplicate]

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as —, > and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.
I'd like to use:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( stream );
I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here's a full example:
public class Main {
public static void main( String [] args ) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( new FileInputStream( "test.xml" ));
}
}
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text — invalid!</bar>
</foo>
Produces:
[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.
Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?
They key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

I would use a library like Jsoup for this purpose. I tested the following below and it works. I don't know if this helps. It can be located here: http://jsoup.org/download
public static void main(String args[]){
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
"<bar>Some text — invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
Result:
<bar>
Some text — invalid!
</bar>
Loading from a file can be found here:
http://jsoup.org/cookbook/input/load-document-from-file

Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and
invalidly -- contain HTML entities such as —
XML has only five predefined entities. The —, is not among them. It works only when used in plain HTML or in legacy JSP. So, SAX will not help. It can be done using StaX which has high level iterator based API. (Collected from this link)
Issue - 2: I found that I can override resolveEntity in
org.xml.sax.helpers.DefaultHandler, but how do I use this with the
higher-level API?
Streaming API for XML, called StaX, is an API for reading and writing XML Documents.
StaX is a Pull-Parsing model. Application can take the control over parsing the XML documents by pulling (taking) the events from the parser.
The core StaX API falls into two categories and they are listed below. They are
Cursor based API: It is low-level API. cursor-based API allows the application to process XML as a stream of tokens aka events
Iterator based API: The higher-level iterator-based API allows the application to process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
STaX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace internal entity references with their
replacement text and report them as characters
This can be set into an XmlInputFactory, which is then in turn used to construct an XmlEventReader or XmlStreamReader.
However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it to notreplace them.
You may try it. Hope it will solve your issue. For your case,
Main.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;
public class Main {
public static void main(String[] args) {
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(
XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader;
try {
reader = inputFactory
.createXMLEventReader(new FileInputStream("F://test.xml"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
}
}
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text — invalid!</bar>
</foo>
Output:
Entity Reference: nbsp
Entity Reference: mdash
Credit goes to #skaffman.
Related Link:
http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?
UPDATE:
Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them
with something else, for example) and still produce a Document at the
end of the process?
To create a new document using the StAX API, it is required to create an XMLStreamWriter that provides methods to produce XML opening and closing tags, attributes and character content.
There are 5 methods of XMLStreamWriter for document.
xmlsw.writeStartDocument(); - initialises an empty document to which
elements can be added
xmlsw.writeStartElement(String s) -creates a new element named s
xmlsw.writeAttribute(String name, String value)- adds the attribute
name with the corresponding value to the last element produced by a
call to writeStartElement. It is possible to add attributes as long
as no call to writeElementStart,writeCharacters or writeEndElement
has been done.
xmlsw.writeEndElement - close the last started element
xmlsw.writeCharacters(String s) - creates a new text node with
content s as content of the last started element.
A sample example is attached with it:
StAXExpand.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.Arrays;
public class StAXExpand {
static XMLStreamWriter xmlsw = null;
public static void main(String[] argv) {
try {
xmlsw = XMLOutputFactory.newInstance()
.createXMLStreamWriter(System.out);
CompactTokenizer tok = new CompactTokenizer(
new FileReader(argv[0]));
String rootName = "dummyRoot";
// ignore everything preceding the word before the first "["
while(!tok.nextToken().equals("[")){
rootName=tok.getToken();
}
// start creating new document
xmlsw.writeStartDocument();
ignorableSpacing(0);
xmlsw.writeStartElement(rootName);
expand(tok,3);
ignorableSpacing(0);
xmlsw.writeEndDocument();
xmlsw.flush();
xmlsw.close();
} catch (XMLStreamException e){
System.out.println(e.getMessage());
} catch (IOException ex) {
System.out.println("IOException"+ex);
ex.printStackTrace();
}
}
public static void expand(CompactTokenizer tok, int indent)
throws IOException,XMLStreamException {
tok.skip("[");
while(tok.getToken().equals("#")) {// add attributes
String attName = tok.nextToken();
tok.nextToken();
xmlsw.writeAttribute(attName,tok.skip("["));
tok.nextToken();
tok.skip("]");
}
boolean lastWasElement=true; // for controlling the output of newlines
while(!tok.getToken().equals("]")){ // process content
String s = tok.getToken().trim();
tok.nextToken();
if(tok.getToken().equals("[")){
if(lastWasElement)ignorableSpacing(indent);
xmlsw.writeStartElement(s);
expand(tok,indent+3);
lastWasElement=true;
} else {
xmlsw.writeCharacters(s);
lastWasElement=false;
}
}
tok.skip("]");
if(lastWasElement)ignorableSpacing(indent-3);
xmlsw.writeEndElement();
}
private static char[] blanks = "\n".toCharArray();
private static void ignorableSpacing(int nb)
throws XMLStreamException {
if(nb>blanks.length){// extend the length of space array
blanks = new char[nb+1];
blanks[0]='\n';
Arrays.fill(blanks,1,blanks.length,' ');
}
xmlsw.writeCharacters(blanks, 0, nb+1);
}
}
CompactTokenizer.java
import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;
public class CompactTokenizer {
private StreamTokenizer st;
CompactTokenizer(Reader r){
st = new StreamTokenizer(r);
st.resetSyntax(); // remove parsing of numbers...
st.wordChars('\u0000','\u00FF'); // everything is part of a word
// except the following...
st.ordinaryChar('\n');
st.ordinaryChar('[');
st.ordinaryChar(']');
st.ordinaryChar('#');
}
public String nextToken() throws IOException{
st.nextToken();
while(st.ttype=='\n'||
(st.ttype==StreamTokenizer.TT_WORD &&
st.sval.trim().length()==0))
st.nextToken();
return getToken();
}
public String getToken(){
return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
}
public String skip(String sym) throws IOException {
if(getToken().equals(sym))
return nextToken();
else
throw new IllegalArgumentException("skip: "+sym+" expected but"+
sym +" found ");
}
}
For more, you can follow the tutorial
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf

Another approach, since you're not using a rigid OXM approach anyway.
You might want to try using a less rigid parser such as JSoup?
This will stop immediate problems with invalid XML schemas etc, but it will just devolve the problem into your code.

Just to throw in a different approach to a solution:
You might envelope your input stream with a stream inplementation that replaces the entities by something legal.
While this is a hack for sure, it should be a quick and easy solution (or better say: workaround).
Not as elegant and clean as a xml framework internal solution, though.

I made yesterday something similar i need to add value from unziped XML in stream to database.
//import I'm not sure if all are necessary :)
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
//I didnt checked this code now because i'm in work for sure its work maybe
you will need to do little changes
InputSource is = new InputSource(new FileInputStream("test.xml"));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);
// lib which i use common-lang3.jar
//metod to parse
public static String parseToChar( String words){
String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
return decode;
}

Try this using org.apache.commons package :
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputStream in = new FileInputStream(xmlfile);
String unescapeHtml4 = IOUtils.toString(in);
CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())
);
unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);
InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);

Loading local chunks in DOM while parsing a large XML file in SAX (Java)

I've an xml file that I would avoid having to load all in memory.
As everyone know, for such a file I better have to use a SAX parser (which will go along the file and call for events if something relevant is found.)
My current problem is that I would like to process the file "by chunk" which means:
Parse the file and find a relevant tag (node)
Load this tag entirely in memory (like we would do it in DOM)
Do the process of this entity (that local chunk)
When I'm done with the chunk, release it and continue to 1. (until "end of file")
In a perfect world I'm searching some something like this:
// 1. Create a parser and set the file to load
IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to callback the right method once a node is found
p.setHandler(new Handler(){
// 4. The method callback by the parser when a relevant node is found
void aNodeIsFound(saxNode aNode)
{
// 5. Inflate the current node i.e. load it (and all its content) in memory
DomNode d = aNode.expand();
// 6. Do something with the inflated node (method to be defined somewhere)
doThingWithNode(d);
}
});
// 7. Start the parser
p.start();
I'm currently stuck on how to expand a "sax node" (understand me…) efficiently.
Is there any Java framework or library relevant to this kind of task?

UPDATE
You could also just use the javax.xml.xpath APIs:
package forum7998733;
import java.io.FileReader;
import javax.xml.xpath.*;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
public class XPathDemo {
public static void main(String[] args) throws Exception {
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
InputSource xml = new InputSource(new FileReader("BigFile.xml"));
Node result = (Node) xpath.evaluate("/path/to/relevant/nodes", xml, XPathConstants.NODE);
System.out.println(result);
}
}
Below is a sample of how it could be done with StAX.
input.xml
Below is some sample XML:
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
Demo
In this example a StAX XMLStreamReader is used to find the node that will be converted to a DOM. In this example we convert each statement fragment to a DOM, but your navigation algorithm could be more advanced.
package forum7998733;
import java.io.FileReader;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.*;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum7998733/input.xml"));
xsr.nextTag(); // Advance to statements element
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
DOMResult domResult = new DOMResult();
t.transform(new StAXSource(xsr), domResult);
DOMSource domSource = new DOMSource(domResult.getNode());
StreamResult streamResult = new StreamResult(System.out);
t.transform(domSource, streamResult);
}
}
}
Output
<?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="123">
...stuff...
</statement><?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="456">
...stuff...
</statement>

It could be done with SAX... But I think the newer StAX (Streaming API for XML) will serve your purpose better. You could create an XMLEventReader and use that to parse your file, detecting which nodes adhere to one of your criteria. For simple path-based selection (not really XPath, but some simple / delimited path) you'd need to maintain a path to your current node by adding entries to a String on new elements or cutting of entries on an end tag. A boolean flag can suffice to maintain whether you're currently in "relevant mode" or not.
As you obtain XMLEvents from your reader, you could copy the relevant ones over to an XMLEventWriter that you've created on some suitable placeholder, like a StringWriter or ByteArrayOutputStream. Once you've completed the copying for some XML extract that forms a "subdocument" of what you wish to build a DOM for, simply supply your placeholder to a DocumentBuilder in a suitable form.
The limitation here is that you're not harnessing all the power of the XPath language. If you wish to take stuff like node position into account, you'd have to foresee that in your own path. Perhaps someone knows of a good way of integrating a true XPath implementation into this.
StAX is really nice in that it gives you control over the parsing, rather than using some callback interface through a handler like SAX.
There's yet another alternative: using XSLT. An XSLT stylesheet is the ideal way to filter out only relevant stuff. You could transform your input once to obtain the required fragments and process those. Or run multiple stylesheets over the same input to get the desired extract each time. An even nicer (and more efficient) solution, however, would be the use of extension functions and/or extension elements.
Extension functions can be implemented in a way that's independent from the XSLT processor being used. They're fairly straightforward to use in Java and I know for a fact that you can use them to pass complete XML extracts to a method, because I've done so already. Might take some experimentation, but it's a powerful mechanism. A DOM extract (or node) is probably one of the accepted parameter types for such a method. That'd leave the document building up to the XSLT processor which is even easier.
Extension elements are also very useful, but I think they need to be used in an implementation-specific manner. If you're okay with tying yourself to a specific JAXP setup like Xerces + Xalan, they might be the answer.
When going for XSLT, you'll have all the advantages of a full XPath 1.0 implementation, plus the peace of mind that comes from knowing XSLT is in really good shape in Java. It limits the building of the input tree to those nodes that are needed at any time and is blazing fast because the processors tend to compile stylesheets into Java bytecode rather than interpreting them. It is possible that using compilation instead of interpretation loses the possibility of using extension elements, though. Not certain about that. Extension functions are still possible.
Whatever way you choose, there's so much out there for XML processing in Java that you'll find plenty of help in implementing this, should you have no luck in finding a ready-made solution. That'd be the most obvious thing, of course... No need to reinvent the wheel when someone did the hard work.
Good luck!
EDIT: because I'm actually not feeling depressed for once, here's a demo using the StAX solution I whipped up. It's certainly not the cleanest code, but it'll give you the basic idea:
package staxdom;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.Stack;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class DOMExtractor {
private final Set<String> paths;
private final XMLInputFactory inputFactory;
private final XMLOutputFactory outputFactory;
private final DocumentBuilderFactory docBuilderFactory;
private final Stack<QName> activeStack = new Stack<QName>();
private boolean active = false;
private String currentPath = "";
public DOMExtractor(final Set<String> paths) {
this.paths = Collections.unmodifiableSet(new HashSet<String>(paths));
inputFactory = XMLInputFactory.newFactory();
outputFactory = XMLOutputFactory.newFactory();
docBuilderFactory = DocumentBuilderFactory.newInstance();
}
public void parse(final InputStream input) throws XMLStreamException, ParserConfigurationException, SAXException, IOException {
final XMLEventReader reader = inputFactory.createXMLEventReader(input);
XMLEventWriter writer = null;
StringWriter buffer = null;
final DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();
XMLEvent currentEvent = reader.nextEvent();
do {
if(active)
writer.add(currentEvent);
if(currentEvent.isEndElement()) {
if(active) {
activeStack.pop();
if(activeStack.isEmpty()) {
writer.flush();
writer.close();
final Document doc;
final StringReader docReader = new StringReader(buffer.toString());
try {
doc = builder.parse(new InputSource(docReader));
} finally {
docReader.close();
}
//TODO: use doc
//Next bit is only for demo...
outputDoc(doc);
active = false;
writer = null;
buffer = null;
}
}
int index;
if((index = currentPath.lastIndexOf('/')) >= 0)
currentPath = currentPath.substring(0, index);
} else if(currentEvent.isStartElement()) {
final StartElement start = (StartElement)currentEvent;
final QName qName = start.getName();
final String local = qName.getLocalPart();
currentPath += "/" + local;
if(!active && paths.contains(currentPath)) {
active = true;
buffer = new StringWriter();
writer = outputFactory.createXMLEventWriter(buffer);
writer.add(currentEvent);
}
if(active)
activeStack.push(qName);
}
currentEvent = reader.nextEvent();
} while(!currentEvent.isEndDocument());
}
private void outputDoc(final Document doc) {
try {
final Transformer t = TransformerFactory.newInstance().newTransformer();
t.transform(new DOMSource(doc), new StreamResult(System.out));
System.out.println("");
System.out.println("");
} catch(TransformerException ex) {
ex.printStackTrace();
}
}
public static void main(String[] args) {
final Set<String> paths = new HashSet<String>();
paths.add("/root/one");
paths.add("/root/three/embedded");
final DOMExtractor me = new DOMExtractor(paths);
InputStream stream = null;
try {
stream = DOMExtractor.class.getResourceAsStream("sample.xml");
me.parse(stream);
} catch(final Exception e) {
e.printStackTrace();
} finally {
if(stream != null)
try {
stream.close();
} catch(IOException ex) {
ex.printStackTrace();
}
}
}
}
And the sample.xml file (should be in the same package):
<?xml version="1.0" encoding="UTF-8"?>
<root>
<one>
<two>this is text</two>
look, I can even handle mixed!
</one>
... not sure what to do with this, though
<two>
<willbeignored/>
</two>
<three>
<embedded>
<and><here><we><go>
Creative Commons Legal Code
Attribution 3.0 Unported
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR
DAMAGES RESULTING FROM ITS USE.
License
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
CONDITIONS.
</go></we></here></and>
</embedded>
</three>
</root>
EDIT 2: Just noticed in Blaise Doughan's answer that there's a StAXSource. That'll be even more efficient. Use that if you're going with StAX. Will eliminate the need to keep some buffer. StAX allows you to "peek" at the next event, so you can check if it's a start element with the right path without consuming it before passing it into the transformer .

ok thanks to your pieces of code, I finally end up with my solution:
Usage is quite intuitive:
try
{
/* CREATE THE PARSER */
XMLParser parser = new XMLParser();
/* CREATE THE FILTER (THIS IS A REGEX (X)PATH FILTER) */
XMLRegexFilter filter = new XMLRegexFilter("statements/statement");
/* CREATE THE HANDLER WHICH WILL BE CALLED WHEN A NODE IS FOUND */
XMLHandler handler = new XMLHandler()
{
public void nodeFound(StringBuilder node, XMLStackFilter withFilter)
{
// DO SOMETHING WITH THE FOUND XML NODE
System.out.println("Node found");
System.out.println(node.toString());
}
};
/* ATTACH THE FILTER WITH THE HANDLER */
parser.addFilterWithHandler(filter, handler);
/* SET THE FILE TO PARSE */
parser.setFilePath("/path/to/bigfile.xml");
/* RUN THE PARSER */
parser.parse();
}
catch (Exception ex)
{
ex.printStackTrace();
}
Note:
I made a XMLNodeFoundNotifier and a XMLStackFilter interface to easily integrate or build your own handler / filter.
Normally you should be able to parse very large files with this class. Only the returned nodes are actually loaded into memory.
You can enable attributes support in uncommenting the right part in the code, I disabled it for simplicity reasons.
You can use as many filters per handler as you need and conversely
All the of the code is here:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Stack;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.*;
/* IMPLEMENT THIS TO YOUR CLASS IN ORDER TO TO BE NOTIFIED WHEN A NODE IS FOUND*/
interface XMLNodeFoundNotifier {
abstract void nodeFound(StringBuilder node, XMLStackFilter withFilter);
}
/* A SMALL HANDER USEFULL FOR EXPLICIT CLASS DECLARATION */
abstract class XMLHandler implements XMLNodeFoundNotifier {
}
/* INTERFACE TO WRITE YOUR OWN FILTER BASED ON THE CURRENT NODES STACK (PATH)*/
interface XMLStackFilter {
abstract boolean isRelevant(Stack fullPath);
}
/* A VERY USEFULL FILTER USING REGEX AS THE PATH FILTER */
class XMLRegexFilter implements XMLStackFilter {
Pattern relevantExpression;
XMLRegexFilter(String filterRules) {
relevantExpression = Pattern.compile(filterRules);
}
/* HERE WE ARE ARE ASK TO TELL IF THE CURRENT STACK (LIST OF NODES) IS RELEVANT
* OR NOT ACCORDING TO WHAT WE WANT. RETURN TRUE IF THIS IS THE CASE */
#Override
public boolean isRelevant(Stack fullPath) {
/* A POSSIBLE CLEVER WAY COULD BE TO SERIALIZE THE WHOLE PATH (INCLUDING
* ATTRIBUTES) TO A STRING AND TO MATCH IT WITH A REGEX BEING THE FILTER
* FOR NOW StackToString DOES NOT SERIALIZE ATTRIBUTES */
String stackPath = XMLParser.StackToString(fullPath);
Matcher m = relevantExpression.matcher(stackPath);
return m.matches();
}
}
/* THE MAIN PARSER'S CLASS */
public class XMLParser {
HashMap<XMLStackFilter, XMLNodeFoundNotifier> filterHandler;
HashMap<Integer, Integer> feedingStreams;
Stack<HashMap> currentStack;
String filePath;
XMLParser() {
currentStack = new <HashMap>Stack();
filterHandler = new <XMLStackFilter, XMLNodeFoundNotifier> HashMap();
feedingStreams = new <Integer, Integer>HashMap();
}
public void addFilterWithHandler(XMLStackFilter f, XMLNodeFoundNotifier h) {
filterHandler.put(f, h);
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
/* CONVERT A STACK OF NODES TO A REGULAR PATH STRING. NOTE THAT PER DEFAULT
* I DID NOT ADDED THE ATTRIBUTES INTO THE PATH. UNCOMENT THE LINKS ABOVE TO
* DO SO
*/
public static String StackToString(Stack<HashMap> s) {
int k = s.size();
if (k == 0) {
return null;
}
StringBuilder out = new StringBuilder();
out.append(s.get(0).get("tag"));
for (int x = 1; x < k; ++x) {
HashMap node = s.get(x);
out.append('/').append(node.get("tag"));
/*
// UNCOMMENT THIS TO ADD THE ATTRIBUTES SUPPORT TO THE PATH
ArrayList <String[]>attributes = (ArrayList)node.get("attr");
if (attributes.size()>0)
{
out.append("[");
for (int i = 0 ; i<attributes.size(); i++)
{
String[]keyValuePair = attributes.get(i);
if (i>0) out.append(",");
out.append(keyValuePair[0]);
out.append("=\"");
out.append(keyValuePair[1]);
out.append("\"");
}
out.append("]");
}*/
}
return out.toString();
}
/*
* ONCE A NODE HAS BEEN SUCCESSFULLY FOUND, WE GET THE DELIMITERS OF THE FILE
* WE THEN RETRIEVE THE DATA FROM IT.
*/
private StringBuilder getChunk(int from, int to) throws Exception {
int length = to - from;
FileReader f = new FileReader(filePath);
BufferedReader br = new BufferedReader(f);
br.skip(from);
char[] readb = new char[length];
br.read(readb, 0, length);
StringBuilder b = new StringBuilder();
b.append(readb);
return b;
}
/* TRANSFORMS AN XSR NODE TO A HASHMAP NODE'S REPRESENTATION */
public HashMap XSRNode2HashMap(XMLStreamReader xsr) {
HashMap h = new HashMap();
ArrayList attributes = new ArrayList();
for (int i = 0; i < xsr.getAttributeCount(); i++) {
String[] s = new String[2];
s[0] = xsr.getAttributeName(i).toString();
s[1] = xsr.getAttributeValue(i);
attributes.add(s);
}
h.put("tag", xsr.getName());
h.put("attr", attributes);
return h;
}
public void parse() throws Exception {
FileReader f = new FileReader(filePath);
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(f);
Location previousLoc = xsr.getLocation();
while (xsr.hasNext()) {
switch (xsr.next()) {
case XMLStreamConstants.START_ELEMENT:
currentStack.add(XSRNode2HashMap(xsr));
for (XMLStackFilter filter : filterHandler.keySet()) {
if (filter.isRelevant(currentStack)) {
feedingStreams.put(currentStack.hashCode(), new Integer(previousLoc.getCharacterOffset()));
}
}
previousLoc = xsr.getLocation();
break;
case XMLStreamConstants.END_ELEMENT:
Integer stream = null;
if ((stream = feedingStreams.get(currentStack.hashCode())) != null) {
// FIND ALL THE FILTERS RELATED TO THIS FeedingStreem AND CALL THEIR HANDLER.
for (XMLStackFilter filter : filterHandler.keySet()) {
if (filter.isRelevant(currentStack)) {
XMLNodeFoundNotifier h = filterHandler.get(filter);
StringBuilder aChunk = getChunk(stream.intValue(), xsr.getLocation().getCharacterOffset());
h.nodeFound(aChunk, filter);
}
}
feedingStreams.remove(currentStack.hashCode());
}
previousLoc = xsr.getLocation();
currentStack.pop();
break;
default:
break;
}
}
}
}

A little while since i did SAX, but what you want to do is process each of the tags until you find the end tag for the group you want to process, then run your process, clear it out and look for the next start tag.

RTF to Plain Text in Java

How do you convert an RTF string to plain text in Java? The obvious answer is to use Swing's RTFEditorKit, and that seems to be the common answer around the Internet. However the write method that claims to return plain text isn't actually implemented... it's hard-coded to just throw an IOException in Java6.

I use Swing's RTFEditorKit in Java 6 like this:
RTFEditorKit rtfParser = new RTFEditorKit();
Document document = rtfParser.createDefaultDocument();
rtfParser.read(new ByteArrayInputStream(rtfBytes), document, 0);
String text = document.getText(0, document.getLength());
and thats working.

Try Apache Tika: http://tika.apache.org/0.9/formats.html#Rich_Text_Format

You might consider RTF Parser Kit as a lightweight alternative to the Swing RTFEditorKit. The line below shows plain text extraction from an RTF file. The RTF file is read from the input stream, the extracted text is written to the output stream.
new StreamTextConverter().convert(new RtfStreamSource(inputStream), outputStream, "UTF-8");
(full disclosure: I'm the author of RTF Parser Kit)

Here is the full code to parse & write RTF as a plain text
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;
public class rtfToJson {
public static void main(String[] args)throws IOException, BadLocationException {
// TODO Auto-generated method stub
RTFEditorKit rtf = new RTFEditorKit();
Document doc = rtf.createDefaultDocument();
FileInputStream fis = new FileInputStream("C:\\SampleINCData.rtf");
InputStreamReader i =new InputStreamReader(fis,"UTF-8");
rtf.read(i,doc,0);
// System.out.println(doc.getText(0,doc.getLength()));
String doc1 = doc.getText(0,doc.getLength());
try{
FileWriter fw=new FileWriter("B:\\Sample INC Data.txt");
fw.write(doc1);
fw.close();
}catch(Exception e)
{
System.out.println(e);
}
System.out.println("Success...");
}
}

Save data in an XML file

I have an application where in i need to save the data input by a user in a form in an XML file at a specified location and i need to perform this using Java . I am relatively very new to XML handling in java. I would like some suggestions as to how to start the task .
Any code snippets and links will be helpful ...
Thank You

There is very good framework JAXB for this also there is Simple
But I have used this XStream
Person joe = new Person("Joe", "Walnes");
joe.setPhone(new PhoneNumber(123, "1234-456"));
joe.setFax(new PhoneNumber(123, "9999-999"));
Now, to convert it to XML, all you have to do is make a simple call to XStream:
String xml = xstream.toXML(joe);
The resulting XML looks like this:
<person>
<firstname>Joe</firstname>
<lastname>Walnes</lastname>
<phone>
<code>123</code>
<number>1234-456</number>
</phone>
<fax>
<code>123</code>
<number>9999-999</number>
</fax>
</person>
Also See
JAXB
where-i-can-find-a-detailed-comparison-of-java-xml-frameworks

There are many open source libraries, but I would simply use JAXB, the standard. Although I have to say the XStream library suggested by other answerers looks very promising, too!

Consider using xstream (http://x-stream.github.io/). XStream's API is very simple:
YourObjectGraph yourData=buildYourData();
XStream xstream=new XStream();
String yourXML=xstream.toXml(yourData);
// do something with your shiny XML
Importing is just as easy:
YourObjectGraph yourData=(YourObjectGraph)xstream.fromXml(yourXml);

I would start by looking at the XStream library. It's very simple to convert POJOs (plain old java objects) to and from XML. I have a blog post detailing some of the gotchas.

You can also use the java.util.Properties to save and load properties as XML file
to save xml :
storeToXML(OutputStream os, String comment);
storeToXML(OutputStream os, String comment, String encoding);
to load xml :
loadFromXML(InputStream in)
here is an example :
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.InvalidPropertiesFormatException;
import java.util.Properties;
public class Main {
public static void main(String[] args) {
File file = new File(getPath());
if (!file.exists()) {
Properties p1 = new Properties();
p1.setProperty("A", "Amir Ali");
try {
writeXML(p1);
System.out.println("xml saved to " + getPath());
} catch (IOException e) {
e.printStackTrace();
}
}else {
try {
Properties p2 = readXML();
System.out.println(p2.getProperty("A"));
} catch (InvalidPropertiesFormatException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
public static void writeXML(Properties properties) throws IOException {
if (properties != null) {
OutputStream os = new FileOutputStream(getPath());
properties.storeToXML(os, null);
}
}
public static Properties readXML() throws InvalidPropertiesFormatException, IOException {
InputStream is = new FileInputStream(getPath());
Properties p = new Properties();
p.loadFromXML(is);
return p;
}
private static String getPath() {
return System.getProperty("user.home") + File.separator + "properties.xml";
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

JDOM, umaluts and UTF-8 - java

Related

Converting XML not work UTF-8 XMLOutputter Java

parsing an XML file with DOM error [duplicate]

Loading local chunks in DOM while parsing a large XML file in SAX (Java)

RTF to Plain Text in Java

Save data in an XML file

Categories

Resources