Does reading a text file in Java have a maximum line length? - java

I'm reading in an XML configuration file that I don't control the format of, and the data I need is in the last element. Unfortunately, that element is a base64 encoded serialised Java class (yes, I know) that is 31200 characters in length.
Some experimenting seems to show that not only can the Java XML/XPath libraries not see the value in this element (they silently return a blank string), but if I just read the file into a string and print it to the console, everything (even a closing element on the next line) gets printed except this one element.
Finally, if I manually go into the file and break the line into rows, Java can see the line, although this obviously breaks XML parsing and deserialisation. It also isn't practical as I want to make a tool that will work across many such files.
Is there some line length limit in Java that stops this working? Can I get around it with a third party library?
EDIT: here's the XML-related code:
FileInputStream fstream = new FileInputStream("path/to/xml/file.xml");
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document d = db.parse(fstream);
String s = XPathFactory.newInstance().newXPath().compile("//el1").evaluate(d);

For reading a large XML file, you can use a SAX parser.
In addition, the text delivered to the characters() callback of the SAX handler should be accumulated in a StringBuffer (or StringBuilder) rather than assigned to a String, because the parser may deliver a long value in several chunks.
You can check out the SAX parser here.
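To illustrate the point about characters(), here is a minimal sketch of such a handler. The element name el1 is taken from the question's XPath and the file path is the one from the question; both are just placeholders. The parser is free to deliver a long text value over several characters() calls, so the value must be appended, never overwritten:
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
public class LongValueHandler extends DefaultHandler {
private final StringBuilder value = new StringBuilder();
private boolean inTarget = false;
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) {
if ("el1".equals(qName)) {
inTarget = true;
value.setLength(0);
}
}
@Override
public void characters(char[] ch, int start, int length) {
// may be called several times for one element; append, never overwrite
if (inTarget) {
value.append(ch, start, length);
}
}
@Override
public void endElement(String uri, String localName, String qName) {
if ("el1".equals(qName)) {
inTarget = false;
System.out.println("value length: " + value.length());
}
}
public static void main(String[] args) throws Exception {
SAXParserFactory.newInstance().newSAXParser().parse(
new File("path/to/xml/file.xml"), new LongValueHandler());
}
}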

I wondered if it might be possible to do some pre-processing to the XML as you read it in.
I've been having a play to see if I could break down the long element into a list of sub-elements. Then this could be parsed and the sub-elements could be built back into a string. My testing threw up the fact that my initial guess of 4500 characters per sub element was still a bit high for my XML parsing to cope with, so I just arbitrarily picked 1000 and it seems to cope with that.
Anyway, this might help, it might not, but here's what I came up with:
private static final String ELEMENT_TO_BREAK_UP_OPEN = "<element>";
private static final String ELEMENT_TO_BREAK_UP_CLOSE = "</element>";
private static final String SUB_ELEMENT_OPEN = "<subelement>";
private static final String SUB_ELEMENT_CLOSE = "</subelement>";
private static final int SUB_ELEMENT_SIZE_LIMIT = 1000;
public static void main(final String[] args) {
try {
/* The XML currently looks like this:
*
* <root>
* <element> ... Super long input with 30000+ characters ... </element>
* </root>
*
*/
final File file = new File("src\\main\\java\\longxml\\test.xml");
final BufferedReader reader = new BufferedReader(new FileReader(file));
final StringBuffer buffer = new StringBuffer();
String line = reader.readLine();
while( line != null ) {
if( line.contains(ELEMENT_TO_BREAK_UP_OPEN) ) {
buffer.append(ELEMENT_TO_BREAK_UP_OPEN);
String substring = line.substring(ELEMENT_TO_BREAK_UP_OPEN.length(), (line.length() - ELEMENT_TO_BREAK_UP_CLOSE.length()) );
while( substring.length() > SUB_ELEMENT_SIZE_LIMIT ) {
buffer.append(SUB_ELEMENT_OPEN);
buffer.append( substring.substring(0, SUB_ELEMENT_SIZE_LIMIT) );
buffer.append(SUB_ELEMENT_CLOSE);
substring = substring.substring(SUB_ELEMENT_SIZE_LIMIT);
}
if( substring.length() > 0 ) {
buffer.append(SUB_ELEMENT_OPEN);
buffer.append(substring);
buffer.append(SUB_ELEMENT_CLOSE);
}
buffer.append(ELEMENT_TO_BREAK_UP_CLOSE);
}
else {
buffer.append(line);
}
line = reader.readLine();
}
reader.close();
/* The XML now looks something like this:
*
* <root>
* <element>
* <subElement> ... First Part of Data ... </subElement>
* <subElement> ... Second Part of Data ... </subElement>
* ... Multiple Other SubElements of Data ..
* <subElement> ... Final Part of Data ... </subElement>
* </element>
* </root>
*/
//This parses the xml with the new subElements in
final InputSource src = new InputSource(new StringReader(buffer.toString()));
final Node document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(src).getFirstChild();
//This gives us the first child (element) and then that element's children (subelements)
final NodeList childNodes = document.getFirstChild().getChildNodes();
//Then concatenate them back into a big string.
final StringBuilder finalElementValue = new StringBuilder();
for( int i = 0; i < childNodes.getLength(); i++ ) {
final Node node = childNodes.item(i);
finalElementValue.append( node.getFirstChild().getNodeValue() );
}
//At this point do whatever you need to do. Decode, Deserialize, etc...
System.out.println(finalElementValue.toString());
}
catch (final Exception e) {
e.printStackTrace();
}
}
There are a few issues with this in terms of its general application:
It does rely on the element you want to break up being uniquely identifiable. (But I'm guessing the logic to find the element can be improved quite a bit.)
It relies on knowing the format of the XML and hoping that doesn't change. (This only matters in the latter parsing section; you could potentially parse it better with XPath once it has been broken into subelements.)
Having said all of that, you do end up with a parsable XML string, which you can build your encoded string from, so this might help you on your way to a solution.

Related

Fastest way to read a large XML file in Java

I'm working on a Java project to optimize existing code. Currently I'm using BufferedReader/FileInputStream to read the content of an XML file into a String.
But my question is: is there a faster way to read XML content? Are SAX/DOM faster than BufferedReader/FileInputStream?
Need help regarding the above issue.
Thanks in advance.
I think that your code shown in the other question is faster than DOM-like parsers, which would definitely require more memory and likely some computation in order to reconstruct the document in full. You may want to profile the code though.
I also think that your code could be tidied up for streaming processing if you used the javax.xml.stream XMLStreamReader, which I have found quite helpful for many tasks. That class "...is designed to be the lowest level and most efficient way to read XML data", according to Oracle.
Here is the excerpt from my code where I parse StackOverflow users XML file distributed as a public data dump:
// the input file location
private static final String fileLocation = "/media/My Book/Stack/users.xml";
// the target elements
private static final String USERS_ELEMENT = "users";
private static final String ROW_ELEMENT = "row";
// get the XML file handler
//
FileInputStream fileInputStream = new FileInputStream(fileLocation);
XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(
fileInputStream);
// reading the data
//
while (xmlStreamReader.hasNext()) {
int eventCode = xmlStreamReader.next();
// this triggers _users records_ logic
//
if ((XMLStreamConstants.START_ELEMENT == eventCode)
&& xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {
// read and parse the user data rows
//
while (xmlStreamReader.hasNext()) {
eventCode = xmlStreamReader.next();
// this breaks _users record_ reading logic
//
if ((XMLStreamConstants.END_ELEMENT == eventCode)
&& xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {
break;
}
else {
if ((XMLStreamConstants.START_ELEMENT == eventCode)
&& xmlStreamReader.getLocalName().equalsIgnoreCase(ROW_ELEMENT)) {
// extract the user data
//
User user = new User();
int attributesCount = xmlStreamReader.getAttributeCount();
for (int i = 0; i < attributesCount; i++) {
user.setAttribute(xmlStreamReader.getAttributeLocalName(i),
xmlStreamReader.getAttributeValue(i));
}
// all other user record-related logic
//
}
}
}
}
}
That users file format is quite simple and similar to your Bank.xml file:
<users>
<row Id="1567200" Reputation="1" CreationDate="2012-07-31T23:57:57.770" DisplayName="XXX" EmailHash="XXX" LastAccessDate="2012-08-01T00:55:12.953" Views="0" UpVotes="0" DownVotes="0" />
...
</users>
There are different parser options available.
Consider using a streaming parser, i.e. either a push (SAX) or a pull (StAX) parser, because the DOM may become quite big.
It's not as if XML parsers are necessarily slow. Consider your web browser. It does XML parsing all the time, and tries really hard to be robust to syntax errors. Usually, memory is the bigger issue.

Creating an XML based on another XML in Java

I'd like to take an XML file, heavily structured and about half a gig in size, and create from it another XML file containing only selected elements of the original one.
1) How can I do that?
2) can it be done with DOM Parser? What is the size limit of the DOM parser?
Thanks!
If you have a very large source XML (like your 0.5 GB file), and wish to extract information from it, possibly creating a new XML, you might consider using an event-based parser which does not require loading the entire XML in memory. The simplest of these implementations is the SAX parser, which requires that you write an event listener which will capture events like document-start, element-start, element-end, etc, where you can inspect the data you are reading (the name of the element, the attributes, etc.) and decide if you are going to ignore it or do something with the data.
Search for a SAX tutorial using JAXP and you should find several examples. Another strategy which you might want to consider, depending on what you want to do, is StAX.
Here is a simple example using SAX to read data from a XML file and extract some information based on search criteria. It's a very simple example I use to teach SAX processing. I think it might help your understanding of how it works. The search criteria is hardwired and consists of names of movie directors to search in a giant XML with a movie selection generated from IMDB data.
XML Source example ("source.xml" ~300MB file)
<Movies>
...
<Movie>
<Imdb>tt1527186</Imdb>
<Title>Melancholia</Title>
<Director>Lars von Trier</Director>
<Year>2011</Year>
<Duration>136</Duration>
</Movie>
<Movie>
<Imdb>tt0060390</Imdb>
<Title>Fahrenheit 451</Title>
<Director>François Truffaut</Director>
<Year>1966</Year>
<Duration>112</Duration>
</Movie>
<Movie>
<Imdb>tt0062622</Imdb>
<Title>2001: A Space Odyssey</Title>
<Director>Stanley Kubrick</Director>
<Year>1968</Year>
<Duration>160</Duration>
</Movie>
...
</Movies>
Here is an example of an event handler. It selects the Movie elements by matching strings. I extended DefaultHandler and implemented startElement() (called when an opening tag is found), characters() (called when a block of characters is read), endElement() (called when an end tag is found) and endDocument() (called once, when the document is finished). Since the data that is read is not retained in memory, you have to save the data you are interested in yourself. I used some boolean flags and instance variables to save the current tag, current data, etc.
class ExtractMovieSaxHandler extends DefaultHandler {
// These are some parameters for the search which will select
// the subtrees (they will receive data when we set up the parser)
private String tagToMatch;
private String tagContents; // OR match
private boolean strict = false; // if strict matches will be exact
/**
* Sets criteria to select and copy Movie elements from source XML.
*
* @param tagToMatch Must contain text only
* @param tagContents Text contents of the tag
* @param strict If true, match must be exact
*/
public void setSearchCriteria(String tagToMatch, String tagContents, boolean strict) {
this.tagToMatch = tagToMatch;
this.tagContents = tagContents;
this.strict = strict;
}
// These are the temporary values we store as we parse the file
private String currentElement;
private StringBuilder contents = null; // if not null we are in Movie tag
private String currentData;
List<String> result = new ArrayList<String>(); // store resulting nodes here
private boolean skip = false;
...
These methods are the implementation of the ContentHandler. The first one detects that an element has started (start tag). We save the name of the tag (a child of Movie) in a variable, because it might be one we use in the search:
...
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
// Store the current element that started now
currentElement = qName;
// If this is a Movie tag, save the contents because we might need it
if (qName.equals("Movie")) {
contents = new StringBuilder();
}
}
...
This one is called every time a block of characters is read. We check whether those characters occur inside an element which interests us. If they do, we match the contents and save them if they match.
...
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
// if we discovered that we don't need this data, we skip it
if (skip || currentElement == null) {
return;
}
// If we are inside the tag we want to search, save the contents
currentData = new String(ch, start, length);
if (currentElement.equals(tagToMatch)) {
boolean discard = true;
if (strict) {
if (currentData.equals(tagContents)) { // exact match
discard = false;
}
} else {
if (currentData.toLowerCase().indexOf(tagContents.toLowerCase()) >= 0) { // matches occurrence of substring
discard = false;
}
}
if (discard) {
skip = true;
}
}
}
...
This is called when an end tag is found. We can now append it to the document we are building in memory if we wish.
...
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
// Rebuild the XML if it's a node we didn't skip
if (qName.equals("Movie")) {
if (!skip) {
result.add(contents.insert(0, "<Movie>").append("</Movie>").toString());
}
// reset the variables so we can check the next node
contents = null;
skip = false;
} else if (contents != null && !skip) {
contents.append("<").append(qName).append(">")
.append(currentData)
.append("</").append(qName).append(">");
}
currentElement = null;
}
...
Finally, this one is called when the document ends. I also used it to print the result at the end.
...
@Override
public void endDocument() throws SAXException {
StringBuilder resultFile = new StringBuilder();
resultFile.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
resultFile.append("<Movies>");
for (String childNode : result) {
resultFile.append(childNode.toString());
}
resultFile.append("</Movies>");
System.out.println("=== Resulting XML containing Movies where " + tagToMatch + " is one of " + tagContents + " ===");
System.out.println(resultFile.toString());
}
}
Here is a small Java application which loads that file, and uses an event handler to extract the data.
public class SAXReaderExample {
public static final String PATH = "src/main/resources"; // this is where I put the XML file
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
// Obtain XML Reader
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader reader = sp.getXMLReader();
// Instantiate SAX handler
ExtractMovieSaxHandler handler = new ExtractMovieSaxHandler();
// set search criteria
handler.setSearchCriteria("Director", "Kubrick", false);
// Register handler with XML reader
reader.setContentHandler(handler);
// Parse the XML
reader.parse(new InputSource(new FileInputStream(new File(PATH, "source.xml"))));
}
}
Here is the resulting file, after processing:
<?xml version="1.0" encoding="UTF-8"?>
<Movies>
<Movie>
<Imdb>tt0062622</Imdb>
<Title>2001: A Space Odyssey</Title>
<Director>Stanley Kubrick</Director>
<Year>1968</Year>
<Duration>160</Duration>
</Movie>
<Movie>
<Imdb>tt0066921</Imdb>
<Title>A Clockwork Orange</Title>
<Director>Stanley Kubrick</Director>
<Year>1972</Year>
<Duration>136</Duration>
</Movie>
<Movie>
<Imdb>tt0081505</Imdb>
<Title>The Shining</Title>
<Director>Stanley Kubrick</Director>
<Year>1980</Year>
<Duration>144</Duration>
</Movie>
...
</Movies>
Your scenario might be different, but this example shows a general solution which you can probably adapt to your problem. You can find more information in tutorials about SAX and JAXP.
500Mb is well within the limits of what can be achieved using XSLT. It depends a little bit on how much effort you want to expend to develop an optimum solution: i.e., which is more expensive, your time or the machine's time?
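For reference, a plain JAXP transformation driven by a stylesheet looks like the sketch below (extract.xslt, big-source.xml and extracted.xml are made-up names; the stylesheet itself is assumed to copy only the elements you want to keep). Note that a typical XSLT 1.0 processor still builds the source tree in memory, so the JVM needs enough heap for an input of this size:
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
public class XsltExtract {
public static void main(String[] args) throws Exception {
// compile the stylesheet, then run the source file through it
Transformer t = TransformerFactory.newInstance()
.newTransformer(new StreamSource(new File("extract.xslt")));
t.transform(new StreamSource(new File("big-source.xml")),
new StreamResult(new File("extracted.xml")));
}
}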

How to get proper string array when parsing CSV?

Using jcsv I'm trying to parse a CSV to a specified type. When I parse it, it says length of the data param is 1. This is incorrect. I tried removing line breaks, but it still says 1. Am I just missing something in plain sight?
This is my input (the csvString variable):
"Symbol","Last","Chg(%)","Vol",
INTC,23.90,1.06,28419200,
GE,26.83,0.19,22707700,
PFE,31.88,-0.03,17036200,
MRK,49.83,0.50,11565500,
T,35.41,0.37,11471300,
This is the Parser
public class BuySignalParser implements CSVEntryParser<BuySignal> {
@Override
public BuySignal parseEntry(String... data) {
// console says "Length 1"
System.out.println("Length " + data.length);
if (data.length != 4) {
throw new IllegalArgumentException("data is not a valid BuySignal record");
}
String symbol = data[0];
double last = Double.parseDouble(data[1]);
double change = Double.parseDouble(data[2]);
double volume = Double.parseDouble(data[3]);
return new BuySignal(symbol, last, change, volume);
}
}
And this is where I use the parser (right from the example)
CSVReader<BuySignal> cReader = new CSVReaderBuilder<BuySignal>(new StringReader( csvString)).entryParser(new BuySignalParser()).build();
List<BuySignal> signals = cReader.readAll();
jcsv allows different delimiter characters. The default is the semicolon; use CSVStrategy.UK_DEFAULT to make it use commas.
Also, you have four commas, and that usually indicates five values. You might want to remove the trailing delimiters from the end of each line.
I don't know how to make jcsv ignore the first line
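If I remember the jcsv API correctly, the builder accepts a strategy, and a custom CSVStrategy can also skip that first header line; treat the exact method name and constructor argument order below as assumptions and check the jcsv javadoc:
// assumed constructor order: delimiter, quote char, comment char, skipHeader, ignoreEmptyLines
CSVStrategy strategy = new CSVStrategy(',', '"', '#', true, true);
CSVReader<BuySignal> cReader = new CSVReaderBuilder<BuySignal>(new StringReader(csvString))
.strategy(strategy) // or CSVStrategy.UK_DEFAULT for commas without header skipping
.entryParser(new BuySignalParser())
.build();
List<BuySignal> signals = cReader.readAll();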
I typically use CSVHelper to parse CSV files, and while jcsv seems pretty good, here is how you would do it with CSVHelper:
Reader reader = new InputStreamReader(new FileInputStream("persons.csv"), "UTF-8");
//bring in the first line with the headers if you want them
List<String> firstRow = CSVHelper.parseLine(reader);
List<String> dataRow = CSVHelper.parseLine(reader);
while (dataRow!=null) {
// put your code here to construct your objects from the strings
dataRow = CSVHelper.parseLine(reader);
}
You shouldn't have commas at the end of lines. Generally there are cell delimiters (commas) and line delimiters (newlines). By placing commas at the end of the line it looks like the entire file is one long line.
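If you cannot change the producer of the file, one pragmatic workaround (just a sketch) is to strip the trailing delimiter from each line before handing the text to the parser:
String cleaned = line.replaceAll(",\\s*$", ""); // drop a trailing comma and any trailing spaces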

Reading Java Properties file without escaping values

My application needs to use a .properties file for configuration.
In the properties files, users are allowed to specify paths.
Problem
Properties files need values to be escaped, e.g.
dir = c:\\mydir
Needed
I need some way to accept a properties file where the values are not escaped, so that the users can specify:
dir = c:\mydir
Why not simply extend the Properties class to escape the backslashes as the file is loaded? A nice side effect of this is that through the rest of your program you can still use the original Properties class.
public class PropertiesEx extends Properties {
public void load(FileInputStream fis) throws IOException {
Scanner in = new Scanner(fis);
ByteArrayOutputStream out = new ByteArrayOutputStream();
while(in.hasNext()) {
out.write(in.nextLine().replace("\\","\\\\").getBytes());
out.write("\n".getBytes());
}
InputStream is = new ByteArrayInputStream(out.toByteArray());
super.load(is);
}
}
Using the new class is a simple as:
PropertiesEx p = new PropertiesEx();
p.load(new FileInputStream("C:\\temp\\demo.properties"));
p.list(System.out);
The escaping code could also be improved upon, but the general principle is there.
Two options:
use the XML properties format instead (see the sketch below)
write your own parser for a modified .properties format without escapes
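For the first option, the XML properties format needs no backslash escaping at all. A minimal sketch (config.xml is a made-up file name); the file would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="dir">c:\mydir</entry>
</properties>
and the loading side:
import java.io.FileInputStream;
import java.util.Properties;
public class XmlPropertiesDemo {
public static void main(String[] args) throws Exception {
Properties props = new Properties();
try (FileInputStream in = new FileInputStream("config.xml")) {
props.loadFromXML(in);
}
System.out.println(props.getProperty("dir")); // prints c:\mydir
}
}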
You can "preprocess" the file before loading the properties, for example:
public InputStream preprocessPropertiesFile(String myFile) throws IOException{
Scanner in = new Scanner(new FileReader(myFile));
ByteArrayOutputStream out = new ByteArrayOutputStream();
while(in.hasNext()) {
out.write(in.nextLine().replace("\\","\\\\").getBytes());
out.write("\n".getBytes()); // keep the line breaks, otherwise all properties end up on a single line
}
return new ByteArrayInputStream(out.toByteArray());
}
And your code could look this way
Properties properties = new Properties();
properties.load(preprocessPropertiesFile("path/myfile.properties"));
Doing this, your .properties file would look like you need, but you will have the properties values ready to use.
I know there should be better ways to manipulate files, but I hope this helps.
The right way would be to provide your users with a property file editor (or a plugin for their favorite text editor) which allows them entering the text as pure text, and would save the file in the property file format.
If you don't want this, you are effectively defining a new format for the same (or a subset of the) content model as the property files have.
Go the whole way and actually specify your format, and then think about a way to either
transform the format to the canonical one, and then use this for loading the files, or
parse this format and populate a Properties object from it.
Both of these approaches will only work directly if you actually can control your property object's creation, otherwise you will have to store the transformed format with your application.
So, let's see how we can define this. The content model of normal property files is simple:
A map of string keys to string values, both allowing arbitrary Java strings.
The escaping which you want to avoid serves just to allow arbitrary Java strings, and not just a subset of these.
An often sufficient subset would be:
A map of string keys (not containing any whitespace, : or =) to string values (not containing any leading or trailing white space or line breaks).
In your example dir = c:\mydir, the key would be dir and the value c:\mydir.
If we want our keys and values to contain any Unicode character (other than the forbidden ones mentioned), we should use UTF-8 (or UTF-16) as the storage encoding - since we have no way to escape characters outside of the storage encoding. Otherwise, US-ASCII or ISO-8859-1 (as normal property files) or any other encoding supported by Java would be enough, but make sure to include this in your specification of the content model (and make sure to read it this way).
Since we restricted our content model so that all "dangerous" characters are out of the way, we can now define the file format simply as this:
<simplepropertyfile> ::= (<line> <line break> )*
<line> ::= <comment> | <empty> | <key-value>
<comment> ::= <space>* "#" < any text excluding line breaks >
<key-value> ::= <space>* <key> <space>* "=" <space>* <value> <space>*
<empty> ::= <space>*
<key> ::= < any text excluding ':', '=' and whitespace >
<value> ::= < any text starting and ending not with whitespace,
not including line breaks >
<space> ::= < any whitespace, but not a line break >
<line break> ::= < one of "\n", "\r", and "\r\n" >
Every \ occurring in either key or value now is a real backslash, not anything which escapes something else.
Thus, for transforming it into the original format, we simply need to double it, like Grekz proposed, for example in a filtering reader:
public class DoubleBackslashFilter extends FilterReader {
private boolean bufferedBackslash = false;
public DoubleBackslashFilter(Reader org) {
super(org);
}
public int read() throws IOException {
if(bufferedBackslash) {
bufferedBackslash = false;
return '\\';
}
int c = super.read();
if(c == '\\')
bufferedBackslash = true;
return c;
}
public int read(char[] buf, int off, int len) throws IOException {
int read = 0;
if(bufferedBackslash) {
buf[off] = '\\';
read++;
off++;
len --;
bufferedBackslash = false;
}
if(len > 1) {
int step = super.read(buf, off, len/2);
for(int i = 0; i < step; i++) {
if(buf[off+i] == '\\') {
// shift everything from here one char to the right, duplicating the backslash
System.arraycopy(buf, off + i, buf, off + i + 1, step - i);
// adjust parameters
step++; i++;
}
}
read += step;
}
return read;
}
}
Then we would pass this Reader to our Properties object (or save the contents to a new file).
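Loading could then look something like this (a sketch; the file name is made up):
Properties props = new Properties();
try (Reader in = new DoubleBackslashFilter(new FileReader("unescaped.properties"))) {
props.load(in);
}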
Instead, we could simply parse this format ourselves.
public Properties parse(Reader in) throws IOException {
BufferedReader r = new BufferedReader(in);
Properties prop = new Properties();
Pattern keyValPattern = Pattern.compile("\\s*=\\s*");
String line;
while((line = r.readLine()) != null) {
line = line.trim(); // remove leading and trailing space
if(line.equals("") || line.startsWith("#")) {
continue; // ignore empty and comment lines
}
String[] kv = keyValPattern.split(line, 2);
// the pattern also grabs space around the separator.
if(kv.length < 2) {
// no key-value separator. TODO: Throw exception or simply ignore this line?
continue;
}
prop.setProperty(kv[0], kv[1]);
}
r.close();
return prop;
}
Again, using Properties.store() after this, we can export it in the original format.
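For instance (file names made up):
Properties props = parse(new FileReader("unescaped.properties"));
props.store(new FileWriter("escaped.properties"), "converted to standard escaping");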
Based on @Ian Harrigan's answer, here is a complete solution for reading and writing NetBeans properties files (and other escaping properties files) from and to ASCII text files:
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
/**
* This class allows handling of NetBeans properties files.
* It is based on the work at: http://stackoverflow.com/questions/6233532/reading-java-properties-file-without-escaping-values.
* It overrides both load methods in order to load a NetBeans property file, taking into account the \ characters that
* would otherwise be treated as escapes by the original Properties load methods.
* @author stephane
*/
public class NetbeansProperties extends Properties {
@Override
public synchronized void load(Reader reader) throws IOException {
BufferedReader bfr = new BufferedReader( reader );
ByteArrayOutputStream out = new ByteArrayOutputStream();
String readLine = null;
while( (readLine = bfr.readLine()) != null ) {
out.write(readLine.replace("\\","\\\\").getBytes());
out.write("\n".getBytes());
}//while
InputStream is = new ByteArrayInputStream(out.toByteArray());
super.load(is);
}//met
@Override
public void load(InputStream is) throws IOException {
load( new InputStreamReader( is ) );
}//met
@Override
public void store(Writer writer, String comments) throws IOException {
PrintWriter out = new PrintWriter( writer );
if( comments != null ) {
out.print( '#' );
out.println( comments );
}//if
List<String> listOrderedKey = new ArrayList<String>();
listOrderedKey.addAll( this.stringPropertyNames() );
Collections.sort(listOrderedKey );
for( String key : listOrderedKey ) {
String newValue = this.getProperty(key);
out.println( key+"="+newValue );
}//for
}//met
@Override
public void store(OutputStream out, String comments) throws IOException {
store( new OutputStreamWriter(out), comments );
}//met
}//class
You could try using Guava's Splitter: split on '=' and build a map from the resulting Iterable.
The disadvantage of this solution is that it does not support comments.
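A minimal sketch (assuming the whole file has already been read into a String named fileContents and contains no comment lines):
import com.google.common.base.Splitter;
import java.util.Map;
// splits each non-empty line on the first '=' and trims the surrounding whitespace
Map<String, String> props = Splitter.on('\n')
.trimResults()
.omitEmptyStrings()
.withKeyValueSeparator(Splitter.on('=').trimResults().limit(2))
.split(fileContents);
System.out.println(props.get("dir")); // c:\mydir, backslash intact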
@pdeva: one more solution
//Reads entire file in a String
//available in java1.5
Scanner scan = new Scanner(new File("C:/workspace/Test/src/myfile.properties"));
scan.useDelimiter("\\Z");
String content = scan.next();
//Use apache StringEscapeUtils.escapeJava() method to escape java characters
ByteArrayInputStream bi=new ByteArrayInputStream(StringEscapeUtils.escapeJava(content).getBytes());
//load properties file
Properties properties = new Properties();
properties.load(bi);
It's not an exact answer to your question, but a different solution that may be appropriate to your needs. In Java, you can use / as a path separator and it will work on Windows, Linux, and OS X alike. This is especially useful for relative paths.
In your example, you could use:
dir = c:/mydir

stax - get xml node as string

xml looks like so:
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
I'm using stax to process one "<statement>" at a time and I got that working. I need to get that entire statement node as a string so I can create "123.xml" and "456.xml" or maybe even load it into a database table indexed by account.
using this approach: http://www.devx.com/Java/Article/30298/1954
I'm looking to do something like this:
String statementXml = staxXmlReader.getNodeByName("statement");
//load statementXml into database
I had a similar task, and although the original question is more than a year old, I couldn't find a satisfying answer. The most interesting answer so far was Blaise Doughan's, but I couldn't get it running on the XML I am expecting (maybe some parameters for the underlying parser could change that?). Here is the XML, very simplified:
<many-many-tags>
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
</many-many-tags>
My solution:
public static String readElementBody(XMLEventReader eventReader)
throws XMLStreamException {
StringWriter buf = new StringWriter(1024);
int depth = 0;
while (eventReader.hasNext()) {
// peek event
XMLEvent xmlEvent = eventReader.peek();
if (xmlEvent.isStartElement()) {
++depth;
}
else if (xmlEvent.isEndElement()) {
--depth;
// reached END_ELEMENT tag?
// break loop, leave event in stream
if (depth < 0)
break;
}
// consume event
xmlEvent = eventReader.nextEvent();
// print out event
xmlEvent.writeAsEncodedUnicode(buf);
}
return buf.getBuffer().toString();
}
Usage example:
XMLEventReader eventReader = ...;
while (eventReader.hasNext()) {
XMLEvent xmlEvent = eventReader.nextEvent();
if (xmlEvent.isStartElement()) {
StartElement elem = xmlEvent.asStartElement();
String name = elem.getName().getLocalPart();
if ("DESCRIPTION".equals(name)) {
String xmlFragment = readElementBody(eventReader);
// do something with it...
System.out.println("'" + fragment + "'");
}
}
else if (xmlEvent.isEndElement()) {
// ...
}
}
Note that the extracted XML fragment will contain the complete extracted body content, including white space and comments. Filtering those on demand, or making the buffer size parametrizable have been left out for code brevity:
'
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
'
You can use StAX for this. You just need to advance the XMLStreamReader to the start element for statement. Check the account attribute to get the file name. Then use the javax.xml.transform APIs to transform the StAXSource to a StreamResult wrapping a File. This will advance the XMLStreamReader and then just repeat this process.
import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
xsr.nextTag(); // Advance to statements element
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
File file = new File("out" + xsr.getAttributeValue(null, "account") + ".xml");
t.transform(new StAXSource(xsr), new StreamResult(file));
}
}
}
StAX is a low-level access API, and it has neither lookups nor methods that access content recursively. But what are you actually trying to do? And why are you considering StAX?
Beyond using a tree model (DOM, XOM, JDOM, dom4j), which would work well with XPath, the best choice when dealing with data is usually a data binding library like JAXB. With it you can pass a StAX or SAX reader and ask it to bind the XML data into Java beans, so that instead of messing with XML you process Java objects. This is often more convenient, and it usually performs quite well.
The only trick with larger files is that you do not want to bind the whole thing at once, but rather bind each sub-tree (in your case, one 'statement' at a time).
This is most easily done by iterating with the StAX XMLStreamReader and then using JAXB to bind, as sketched below.
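A rough sketch of that pattern, using the statements XML from the question (Statement is a hypothetical JAXB-annotated class for one sub-tree, and the file name is made up):
import java.io.FileInputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
public class BindPerSubtree {
public static void main(String[] args) throws Exception {
Unmarshaller um = JAXBContext.newInstance(Statement.class).createUnmarshaller();
XMLStreamReader xsr = XMLInputFactory.newInstance()
.createXMLStreamReader(new FileInputStream("statements.xml"));
while (xsr.hasNext()) {
// advance until the reader sits on a <statement> start tag
if (xsr.next() == XMLStreamConstants.START_ELEMENT
&& "statement".equals(xsr.getLocalName())) {
// bind just this sub-tree; the reader is left positioned after its end tag
Statement stmt = um.unmarshal(xsr, Statement.class).getValue();
// process stmt (write "<account>.xml", insert into a database, ...)
}
}
xsr.close();
}
}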
I've been googling and this seems painfully difficult.
Given my XML, I think it might just be simpler to do something like this:
StringBuilder buffer = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
buffer.append(line);
if (line.trim().equals(STMT_END_TAG)) {
parse(buffer.toString());
buffer.delete(0, buffer.length());
}
}
private void parse(String statement) {
//saxParser.parse(new InputSource(new StringReader(statement)));
// do stuff
// save string
}
Why not just use xpath for this?
You could have a fairly simple xpath to get all 'statement' nodes.
Like so:
//statement
EDIT #1: If possible, take a look at dom4j. You could read the String and get all 'statement' nodes fairly simply.
EDIT #2: Using dom4j, this is how you would do it:
(from their cookbook)
String text = "your xml here";
Document document = DocumentHelper.parseText(text);
public void bar(Document document) {
List list = document.selectNodes( "//statement" );
// loop through node data
}
I had a similar problem and found a solution.
I used the solution proposed by @t0r0X, but it does not work well with the current implementation in Java 11: the method xmlEvent.writeAsEncodedUnicode creates an invalid string representation of the start element (in the StartElementEvent class) in the resulting XML fragment, so I had to modify it. With that change it seems to work well, which I could immediately verify by parsing the fragment with DOM and a JAXB unmarshaller into specific data containers.
In my case I had the huge structure
<Orders>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
...
</Orders>
in a file of several hundred megabytes (many repeated "SyncOrder" structures), so using DOM would have led to large memory consumption and slow evaluation. Therefore I used StAX to split the huge XML into smaller pieces, which I analyzed with DOM and with the JAXB elements generated from the XSD definition of the SyncOrder element (I had this infrastructure from a web service that uses the same structure, but that is not important).
In this code you can see where the XML fragment is created and could be used; I used it directly in further processing...
private static <T> List<T> unmarshallMultipleSyncOrderXmlData(
InputStream aOrdersXmlContainingSyncOrderItems,
Function<SyncOrderType, T> aConversionFunction) throws XMLStreamException, ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory locDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
locDocumentBuilderFactory.setNamespaceAware(true);
DocumentBuilder locDocBuilder = locDocumentBuilderFactory.newDocumentBuilder();
List<T> locResult = new ArrayList<>();
XMLInputFactory locFactory = XMLInputFactory.newFactory();
XMLEventReader locReader = locFactory.createXMLEventReader(aOrdersXmlContainingSyncOrderItems);
boolean locIsInSyncOrder = false;
QName locSyncOrderElementQName = null;
StringWriter locXmlTextBuffer = new StringWriter();
int locDepth = 0;
while (locReader.hasNext()) {
XMLEvent locEvent = locReader.nextEvent();
if (locEvent.isStartElement()) {
if (locDepth == 0 && Objects.equals(locEvent.asStartElement().getName().getLocalPart(), "Orders")) {
locDepth++;
} else {
if (locDepth <= 0)
throw new IllegalStateException("There has been passed invalid XML stream intot he function. "
+ "Expecting the element 'Orders' as the root alament of the document, but found was '"
+ locEvent.asStartElement().getName().getLocalPart() + "'.");
locDepth++;
if (locSyncOrderElementQName == null) {
/* First element after the "Orders" has passed, so we retrieve
* the name of the element with the namespace prefix: */
locSyncOrderElementQName = locEvent.asStartElement().getName();
}
if(Objects.equals(locEvent.asStartElement().getName(), locSyncOrderElementQName)) {
locIsInSyncOrder = true;
}
}
} else if (locEvent.isEndElement()) {
locDepth--;
if(locDepth == 1 && Objects.equals(locEvent.asEndElement().getName(), locSyncOrderElementQName)) {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
/* at this moment the call of locXmlTextBuffer.toString() gets the complete fragment
* of XML containing the valid SyncOrder element, but I have continued to other processing,
* which immediately validates the produced XML fragment and passes the values
* to communication object: */
Document locDocument = locDocBuilder.parse(new ByteArrayInputStream(locXmlTextBuffer.toString().getBytes()));
SyncOrderType locItem = unmarshallSyncOrderDomNodeToCo(locDocument);
locResult.add(aConversionFunction.apply(locItem));
locXmlTextBuffer = new StringWriter();
locIsInSyncOrder = false;
}
}
if (locIsInSyncOrder) {
if (locEvent.isStartElement()) {
/* here replaced the standard implementation of startElement's method writeAsEncodedUnicode: */
locXmlTextBuffer.write(startElementToString(locEvent.asStartElement()));
} else {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
}
}
}
return locResult;
}
private static String startElementToString(StartElement aStartElement) {
StringBuilder locStartElementBuffer = new StringBuilder();
// open element
locStartElementBuffer.append("<");
String locNameAsString = null;
if ("".equals(aStartElement.getName().getNamespaceURI())) {
locNameAsString = aStartElement.getName().getLocalPart();
} else if (aStartElement.getName().getPrefix() != null
&& !"".equals(aStartElement.getName().getPrefix())) {
locNameAsString = aStartElement.getName().getPrefix()
+ ":" + aStartElement.getName().getLocalPart();
} else {
locNameAsString = aStartElement.getName().getLocalPart();
}
locStartElementBuffer.append(locNameAsString);
// add any attributes
Iterator<Attribute> locAttributeIterator = aStartElement.getAttributes();
Attribute attr;
while (locAttributeIterator.hasNext()) {
attr = locAttributeIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(attr));
}
// add any namespaces
Iterator<Namespace> locNamespaceIterator = aStartElement.getNamespaces();
Namespace locNamespace;
while (locNamespaceIterator.hasNext()) {
locNamespace = locNamespaceIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(locNamespace));
}
// close start tag
locStartElementBuffer.append(">");
// return StartElement as a String
return locStartElementBuffer.toString();
}
private static String attributeToString(Attribute aAttr) {
if( aAttr.getName().getPrefix() != null && aAttr.getName().getPrefix().length() > 0 )
return aAttr.getName().getPrefix() + ":" + aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
else
return aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
}
public static SyncOrderType unmarshallSyncOrderDomNodeToCo(
Node aSyncOrderItemNode) {
Source locSource = new DOMSource(aSyncOrderItemNode);
Object locUnmarshalledObject = getMarshallerAndUnmarshaller().unmarshal(locSource);
SyncOrderType locCo = ((JAXBElement<SyncOrderType>) locUnmarshalledObject).getValue();
return locCo;
}
