I'm working on a Java project to optimize existing code. Currently I'm using a BufferedReader/FileInputStream to read the content of an XML file into a String in Java.
But my question is: is there any faster way to read XML content? Are SAX/DOM faster than BufferedReader/FileInputStream?
Need help regarding the above issue.
Thanks in advance.
I think that the code shown in your other question is faster than DOM-like parsers, which would definitely require more memory and likely some computation in order to reconstruct the document in full. You may want to profile the code, though.
I also think that your code can be tidied up for streaming processing if you use the javax.xml.stream XMLStreamReader, which I have found quite helpful for many tasks. According to Oracle, that class "is designed to be the lowest level and most efficient way to read XML data".
Here is an excerpt from my code where I parse the Stack Overflow users XML file distributed as part of the public data dump:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// the input file location
private static final String fileLocation = "/media/My Book/Stack/users.xml";
// the target elements
private static final String USERS_ELEMENT = "users";
private static final String ROW_ELEMENT = "row";

// get the XML file handler
FileInputStream fileInputStream = new FileInputStream(fileLocation);
XMLStreamReader xmlStreamReader =
        XMLInputFactory.newInstance().createXMLStreamReader(fileInputStream);

// reading the data
while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();
    // this triggers the _users records_ logic
    if ((XMLStreamConstants.START_ELEMENT == eventCode)
            && xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {
        // read and parse the user data rows
        while (xmlStreamReader.hasNext()) {
            eventCode = xmlStreamReader.next();
            // this breaks the _users record_ reading logic
            if ((XMLStreamConstants.END_ELEMENT == eventCode)
                    && xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {
                break;
            }
            else if ((XMLStreamConstants.START_ELEMENT == eventCode)
                    && xmlStreamReader.getLocalName().equalsIgnoreCase(ROW_ELEMENT)) {
                // extract the user data from the row's attributes
                User user = new User();
                int attributesCount = xmlStreamReader.getAttributeCount();
                for (int i = 0; i < attributesCount; i++) {
                    user.setAttribute(xmlStreamReader.getAttributeLocalName(i),
                            xmlStreamReader.getAttributeValue(i));
                }
                // all other user record-related logic
            }
        }
    }
}
That users file format is quite simple and similar to your Bank.xml file:
<users>
<row Id="1567200" Reputation="1" CreationDate="2012-07-31T23:57:57.770" DisplayName="XXX" EmailHash="XXX" LastAccessDate="2012-08-01T00:55:12.953" Views="0" UpVotes="0" DownVotes="0" />
...
</users>
There are different parser options available.
Consider using a streaming parser, because the DOM may become quite big; that means either a push parser (SAX) or a pull parser (StAX).
It's not as if XML parsers are necessarily slow. Consider your web browser. It does XML parsing all the time, and tries really hard to be robust to syntax errors. Usually, memory is the bigger issue.
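For contrast with the pull-style XMLStreamReader loop shown earlier, here is a minimal sketch of the push style, where the parser drives the loop and calls back into a handler you supply (the users.xml name and the row element are borrowed from the excerpt above):
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PushParseDemo {
    public static void main(String[] args) throws Exception {
        // the parser owns the read loop and "pushes" events into the handler
        SAXParserFactory.newInstance().newSAXParser().parse(
                new File("users.xml"),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                            String qName, Attributes attributes) {
                        if ("row".equals(qName)) {
                            // react to each user row as it streams past
                            System.out.println(attributes.getValue("Id"));
                        }
                    }
                });
    }
}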
I want to read an XML file in Java and then update certain elements in that file with new values. My file is > 200 MB and performance is important, so the DOM model cannot be used.
I feel that a StAX parser is the solution, but there is no decent literature on using Java StAX to read and then write XML back to the same file.
(For reference, I have been using the Java tutorial and this helpful tutorial to get what I have so far.)
I am using Java 7, but there don't seem to have been any updates to the XML parsing API since... a long time ago. So this probably isn't relevant.
Currently I have this:
public static String readValueFromXML(final File xmlFile, final String value)
        throws FileNotFoundException, XMLStreamException
{
    XMLEventReader reader =
            XMLInputFactory.newFactory().createXMLEventReader(new FileReader(xmlFile));
    String found = "";
    boolean read = false;
    while (reader.hasNext())
    {
        XMLEvent event = reader.nextEvent();
        if (event.isStartElement() &&
                event.asStartElement().getName().getLocalPart().equals(value))
        {
            read = true;
        }
        if (event.isCharacters() && read)
        {
            found = event.asCharacters().getData();
            break;
        }
    }
    return found;
}
which will read the XML file and return the value of the selected element. However, I have another method, updateXMLFile(final File xmlFile, final String value), which I want to use in conjunction with this.
So my question is threefold:
1. Is there a StAX implementation for editing XML?
2. Will XPath be any help? Can that be used without converting my file to a Document?
3. (More generally) Why doesn't Java have a better XML API?
There are two things you may want to look at. The first is to use JAXB to bind the XML to POJOs, which you can then have your way with and serialize back to XML when needed.
The second is a JDBC driver for XML; there are several available for a fee, though I'm not sure whether there are any open source ones. In my experience JAXB is the better choice. If the XML file is too large to handle efficiently with JAXB, I think you need to look at using a database as a replacement for the XML file.
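To make the JAXB route concrete, here is a minimal sketch; the Config class, its name field, and the config.xml path are hypothetical stand-ins for your actual document structure, not part of the question:
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbDemo {

    // hypothetical stand-in for the real document structure
    @XmlRootElement(name = "config")
    public static class Config {
        @XmlElement
        public String name;
    }

    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Config.class);
        // unmarshal the XML into a POJO graph...
        Config config = (Config) context.createUnmarshaller()
                .unmarshal(new File("config.xml"));
        // ...have your way with it...
        config.name = "updated";
        // ...and serialize the structure back to XML when needed
        Marshaller marshaller = context.createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        marshaller.marshal(config, new File("config.xml"));
    }
}
Note that JAXB materialises the whole document as objects, so for a 200 MB file it shares DOM's memory problem; that is where the database option comes in.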
This is my approach, which reads events from the file using StAX and writes them to another file. The values are updated as the loop passes over the correctly named elements.
public void read(String key, String value)
{
    try (FileReader fReader = new FileReader(inputFile);
         FileWriter fWriter = new FileWriter(outputFile))
    {
        XMLEventFactory factory = XMLEventFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(fReader);
        XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(fWriter);
        // must be declared outside the loop: the characters event to replace
        // arrives one iteration after the start-element event that sets the flag
        boolean update = false;
        while (reader.hasNext())
        {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(key))
            {
                update = true;
            }
            else if (event.isCharacters() && update)
            {
                // swap the original text for the new value
                event = factory.createCharacters(value);
                update = false;
            }
            writer.add(event);
        }
    }
    catch (XMLStreamException | FactoryConfigurationError | IOException e)
    {
        e.printStackTrace();
    }
}
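Since the question asks for the changes to end up in the same file, one hedged follow-up (assuming Java 7's java.nio.file is available) is to replace the input file with the freshly written output after read(...) returns:
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// after read(key, value) has completed successfully:
Files.move(outputFile.toPath(), inputFile.toPath(),
        StandardCopyOption.REPLACE_EXISTING);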
I have a text file in my Android app which consists of a JSON document. I need to read and parse that JSON. The file size is 21 MB. I am using the following code to read the file:
StringBuilder stringBuilder = new StringBuilder();
InputStream input = getAssets().open(filename);
int size = input.available();
byte[] buffer = new byte[size];
byte[] tempBuffer = new byte[1024];
int tempBufferIndex = 0;
for (int i = 0; i < size; i++) {
    if (i == 0) {
        tempBuffer[tempBufferIndex] = buffer[i];
    } else {
        int mod = 1024 % i;
        if (mod == 0) {
            input.read(tempBuffer);
            stringBuilder.append(new String(tempBuffer));
            tempBufferIndex = 0;
        }
        tempBuffer[tempBufferIndex] = buffer[i];
    }
}
input.close();
The size int is 20949874 in the real case. After the loop is done, the stringBuilder length is always 11264, even if I change the range of the for loop. I tried to make one String from the InputStream without using a loop, but it always gives me an OutOfMemoryError exception. I also get "Grow heap (frag case) to 26.668MB for 20949890-byte allocation" in my logs. I searched here and tried different solutions but could not make it work. Any idea how I should solve this issue? Thanks in advance.
For big JSON files you should use a streaming (SAX-style) parser and not a DOM-style one, for example JsonReader.
DOM (“Document Object Model”) loads the entire content into memory and permits the developer to query the data as they wish. SAX presents the data as a stream: the developer waits for their desired pieces of data to appear and saves only the parts they need. DOM is considered easier to use but SAX uses much less memory.
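As a minimal sketch of that streaming style with android.util.JsonReader (API level 11+), assuming the file is a top-level array of objects and that a hypothetical "name" field is all you need:
import android.content.Context;
import android.util.JsonReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// streams the asset token by token instead of loading a 21 MB String first
public static List<String> readNames(Context context, String filename)
        throws IOException {
    List<String> names = new ArrayList<String>();
    JsonReader reader = new JsonReader(
            new InputStreamReader(context.getAssets().open(filename), "UTF-8"));
    try {
        reader.beginArray();                        // assumes a top-level array
        while (reader.hasNext()) {
            reader.beginObject();
            while (reader.hasNext()) {
                if ("name".equals(reader.nextName())) {
                    names.add(reader.nextString()); // keep only what we need
                } else {
                    reader.skipValue();             // ignore everything else
                }
            }
            reader.endObject();
        }
        reader.endArray();
    } finally {
        reader.close();
    }
    return names;
}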
You can try to split the file into several parts, so that during processing the app hopefully doesn't run out of memory.
You should also consider using the "largeHeap" flag in your manifest (see http://developer.android.com/guide/topics/manifest/application-element.html).
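For example, the flag goes on the <application> element of AndroidManifest.xml (a minimal sketch; the label is a placeholder):
<!-- AndroidManifest.xml: ask for a larger heap for this process -->
<application
    android:label="MyApp"
    android:largeHeap="true">
    <!-- activities, services, etc. -->
</application>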
I don't know your file, but maybe if you use shorter JSON keys, you can reduce storage as well.
I'm reading in an XML configuration file whose format I don't control, and the data I need is in the last element. Unfortunately, that element is a base64-encoded serialised Java class (yes, I know) that is 31200 characters in length.
Some experimenting seems to show that not only can the Java XML/XPath libraries not see the value in this element (they silently set the value to a blank string), but if I just read the file into a string and print it out to the console, everything (even a closing element on the next line) gets printed, but not this one element.
Finally, if I manually go into the file and break the line into rows, Java can see the line, although this obviously breaks XML parsing and deserialisation. It also isn't practical, as I want to make a tool that will work across many such files.
Is there some line length limit in Java that stops this working? Can I get around it with a third-party library?
EDIT: here's the XML-related code:
FileInputStream fstream = new FileInputStream("path/to/xml/file.xml");
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document d = db.parse(fstream);
String s = XPathFactory.newInstance().newXPath().compile("//el1").evaluate(d);
For reading a large XML file, you can use a SAX parser.
In addition, the values delivered to the SAX parser's characters() callback should be accumulated in a StringBuffer (or StringBuilder) rather than by String concatenation, because a long text value may arrive in several chunks.
You can check out the SAX parser here.
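Here is a minimal sketch of that accumulation pattern; the element name el1 is borrowed from the question's XPath, and the rest of the handler is a hypothetical illustration:
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LongValueHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget = false;

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        if ("el1".equals(qName)) {
            inTarget = true;
            text.setLength(0); // reset for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // the parser may deliver one long text node in many small chunks
        if (inTarget) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("el1".equals(qName)) {
            inTarget = false;
            // the full 31200-character value is now assembled in 'text'
            System.out.println("value length: " + text.length());
        }
    }
}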
I wondered if it might be possible to do some pre-processing on the XML as you read it in.
I've been having a play to see if I could break down the long element into a list of sub-elements, which could then be parsed and the sub-elements built back into a string. My testing threw up the fact that my initial guess of 4500 characters per sub-element was still a bit high for my XML parsing to cope with, so I just arbitrarily picked 1000 and it seems to cope with that.
Anyway, this might help, it might not, but here's what I came up with:
private static final String ELEMENT_TO_BREAK_UP_OPEN = "<element>";
private static final String ELEMENT_TO_BREAK_UP_CLOSE = "</element>";
private static final String SUB_ELEMENT_OPEN = "<subelement>";
private static final String SUB_ELEMENT_CLOSE = "</subelement>";
private static final int SUB_ELEMENT_SIZE_LIMIT = 1000;

public static void main(final String[] args) {
    try {
        /* The XML currently looks like this:
         *
         * <root>
         *   <element> ... Super long input with 30000+ characters ... </element>
         * </root>
         */
        final File file = new File("src\\main\\java\\longxml\\test.xml");
        final BufferedReader reader = new BufferedReader(new FileReader(file));
        final StringBuilder buffer = new StringBuilder();
        String line = reader.readLine();
        while (line != null) {
            if (line.contains(ELEMENT_TO_BREAK_UP_OPEN)) {
                buffer.append(ELEMENT_TO_BREAK_UP_OPEN);
                String substring = line.substring(ELEMENT_TO_BREAK_UP_OPEN.length(),
                        line.length() - ELEMENT_TO_BREAK_UP_CLOSE.length());
                // slice the long value into sub-elements of at most the size limit
                while (substring.length() > SUB_ELEMENT_SIZE_LIMIT) {
                    buffer.append(SUB_ELEMENT_OPEN);
                    buffer.append(substring.substring(0, SUB_ELEMENT_SIZE_LIMIT));
                    buffer.append(SUB_ELEMENT_CLOSE);
                    substring = substring.substring(SUB_ELEMENT_SIZE_LIMIT);
                }
                if (substring.length() > 0) {
                    buffer.append(SUB_ELEMENT_OPEN);
                    buffer.append(substring);
                    buffer.append(SUB_ELEMENT_CLOSE);
                }
                buffer.append(ELEMENT_TO_BREAK_UP_CLOSE);
            }
            else {
                buffer.append(line);
            }
            line = reader.readLine();
        }
        reader.close();

        /* The XML now looks something like this:
         *
         * <root>
         *   <element>
         *     <subelement> ... First Part of Data ... </subelement>
         *     <subelement> ... Second Part of Data ... </subelement>
         *     ... Multiple Other Sub-Elements of Data ...
         *     <subelement> ... Final Part of Data ... </subelement>
         *   </element>
         * </root>
         */

        // this parses the XML with the new sub-elements in
        final InputSource src = new InputSource(new StringReader(buffer.toString()));
        final Node document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(src).getFirstChild();
        // this gives us the first child (element), then that element's children (sub-elements)
        final NodeList childNodes = document.getFirstChild().getChildNodes();
        // then concatenate them back into one big string
        final StringBuilder finalElementValue = new StringBuilder();
        for (int i = 0; i < childNodes.getLength(); i++) {
            final Node node = childNodes.item(i);
            finalElementValue.append(node.getFirstChild().getNodeValue());
        }
        // at this point do whatever you need to do: decode, deserialize, etc.
        System.out.println(finalElementValue.toString());
    }
    catch (final Exception e) {
        e.printStackTrace();
    }
}
There are a few issues with this in terms of its general application:
- It relies on the element you want to break up being uniquely identifiable. (But I'm guessing the logic to find the element can be improved quite a bit.)
- It relies on knowing the format of the XML and hoping that doesn't change. (This only affects the latter parsing section; you could potentially parse it better with XPath once it has been broken into sub-elements.)
Having said all of that, you do end up with a parsable XML string from which you can rebuild your encoded string, so this might help you on your way to a solution.
I need a way to change the ID3 tag version of MP3 files to some ID3v2.x programmatically, preferably using Java, though anything that works is better than nothing. Bonus points if it converts the existing tag so that existing data isn't destroyed, rather than creating a new tag entirely.
Edit: Jaudiotagger worked, thanks. Sadly I had to restrict it to MP3 files and only keep data from previous tags if they were ID3. I decided to convert the tag to ID3v2.3, since Windows Explorer can't handle v2.4, and it was a bit tricky since the program was a bit confused about whether to use the copy constructor or the conversion constructor.
MP3File mf;
try {
    mf = (MP3File) AudioFileIO.read(new File(pathToMp3File));
} catch (Exception e) {
    // don't swallow this silently; mf would stay null and cause an NPE below
    throw new IllegalStateException("Could not read " + pathToMp3File, e);
}
ID3v23Tag tag;
if (mf.hasID3v2Tag()) tag = new ID3v23Tag(mf.getID3v2TagAsv24());
else if (mf.hasID3v1Tag()) tag = new ID3v23Tag(mf.getID3v1Tag());
else tag = new ID3v23Tag();
My application must be able to read ID3v1 or ID3v1.1 tags, but shall only write ID3v2.3, so I needed a slightly longer piece of code:
AudioFile mf;
Tag mTagsInFile;
...
mf = ... // open audio file the usual way
...
mTagsInFile = mf.getTag();
if (mTagsInFile == null)
{
    // contrary to getTag(), getTagOrCreateAndSetDefault() ignores ID3v1 tags
    mTagsInFile = mf.getTagOrCreateAndSetDefault();
}
// MP3 ID3v1 and ID3v1.1 tags are suboptimal; convert to ID3v23
if (mf instanceof MP3File)
{
    MP3File mf3 = (MP3File) mf;
    if (mf3.hasID3v1Tag() && !mf3.hasID3v2Tag())
    {
        // convert the ID3v1 tag to ID3v23
        mTagsInFile = new ID3v23Tag(mf3.getID3v1Tag());
        mf3.setID3v1Tag(null);   // remove the v1 tag
        mf3.setTag(mTagsInFile); // add the v2 tag
    }
}
Basically we have to know that getTagOrCreateAndSetDefault() and similar methods unfortunately ignore ID3v1 tags, so we first have to call getTag(), and only if that returns null do we call the mentioned function.
Additionally, the code must also deal with FLAC and MP4, so we make sure to do our conversion only with MP3 files.
Finally, there is a bug in Jaudiotagger. You may replace this line
String genre = "(" + genreId + ") " + GenreTypes.getInstanceOf().getValueForId(genreId);
in ID3v24Tag.java with this one
String genre = GenreTypes.getInstanceOf().getValueForId(genreId);
Otherwise genre 12 from ID3v1 will become "(12) Other", which is later converted to "Other Other", and that is not what we would expect. Maybe someone has a more elegant solution.
You can use different libraries for this purpose, for example this or this.
I have an approximately 1 MB JSON file stored in my assets folder that I need to load in my app every time it runs. I find that the built-in JSON parser (org.json) parses the file very slowly, but once it's parsed, I can access and manipulate the data very quickly. I've counted as many as 7 or 8 seconds from the moment I tap the app to the moment Activity1 is brought up, but just a few milliseconds to go from Activity1 to Activity2, which depends on data processed from the data loaded in Activity1.
I'm reading the file into memory and parsing it using:
String jsonString = readFileToMemory(myFilename)
JSONArray array = new JSONArray(jsonString);
where readFileToMemory(String) looks like this:
private String readFileToMemory(String filename) {
    StringBuilder data = new StringBuilder();
    BufferedReader reader = null;
    try {
        InputStream stream = myContext.getAssets().open(filename);
        reader = new BufferedReader(new InputStreamReader(stream, "UTF-8"));
        char[] chunk = new char[512];
        int result;
        // append only the characters actually read; the final chunk is
        // usually shorter than 512, and read() returns -1 at end of stream
        while ((result = reader.read(chunk)) != -1) {
            data.append(chunk, 0, result);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            try { reader.close(); } catch (IOException ignored) {}
        }
    }
    return data.toString();
}
Does anyone have any suggestions on how I can speed up the initial loading and parsing of the data? Should I perhaps mask the whole process behind a loading screen?
JSONObject (the one from json.org) is the simplest API to use to parse JSON. However, it comes with a cost: performance. I have done extensive experiments with JSONObject, Gson and Jackson. It seems no matter what you do, JSONObject (and hence JSONArray) will be the slowest. Please switch to Jackson or Gson.
Here is the relative performance of the three:
(fastest) Jackson > Gson >> JSONObject (slowest)
Refer:
- Jackson
- Gson
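As a hedged illustration of the Gson route, parsing straight from the asset stream also removes the intermediate 1 MB String; the Item class here is a made-up stand-in for whatever your array actually contains, and myContext is the field from the question's code:
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.lang.reflect.Type;
import java.util.List;

// made-up element type matching the objects in the JSON array
class Item {
    String name;
    int value;
}

private List<Item> loadItems(String filename) throws IOException {
    Reader reader = new InputStreamReader(
            myContext.getAssets().open(filename), "UTF-8");
    try {
        Type listType = new TypeToken<List<Item>>() {}.getType();
        // Gson reads from the stream directly; no intermediate String is built
        return new Gson().fromJson(reader, listType);
    } finally {
        reader.close();
    }
}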
You should create an SQLite table to store the data and move it from JSON to SQL the first time the app runs. As an added benefit, this makes the data easier to search through and makes it possible for you to modify the data from within the app.
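A rough sketch of that one-time import, assuming the parsed data is a top-level JSONArray of objects; the items table and its name/value columns are made-up examples:
import android.content.ContentValues;
import android.database.sqlite.SQLiteDatabase;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

// one-time import: move the parsed JSON into an SQLite table
void importJson(SQLiteDatabase db, JSONArray array) throws JSONException {
    db.execSQL("CREATE TABLE IF NOT EXISTS items (name TEXT, value INTEGER)");
    db.beginTransaction(); // a single transaction makes bulk inserts much faster
    try {
        for (int i = 0; i < array.length(); i++) {
            JSONObject obj = array.getJSONObject(i);
            ContentValues row = new ContentValues();
            row.put("name", obj.getString("name")); // assumed field
            row.put("value", obj.optInt("value"));  // assumed field
            db.insert("items", null, row);
        }
        db.setTransactionSuccessful();
    } finally {
        db.endTransaction();
    }
}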