Java SAX parser progress monitoring

I'm writing a SAX parser in Java to parse a 2.5GB XML file of Wikipedia articles. Is there a way to monitor the progress of the parsing in Java?

Thanks to EJP's suggestion of ProgressMonitorInputStream, in the end I extended FilterInputStream so that a ChangeListener can be used to monitor the current read location in terms of bytes.
This gives you finer control, for example showing multiple progress bars for parallel reads of big XML files. Which is exactly what I did.
So, a simplified version of the monitorable stream:
/**
 * A class that monitors the read progress of an input stream.
 *
 * @author Hermia Yeung "Sheepy"
 * @since 2012-04-05 18:42
 */
public class MonitoredInputStream extends FilterInputStream {
   private volatile long mark = 0;
   private volatile long lastTriggeredLocation = 0;
   private volatile long location = 0;
   private final int threshold;
   private final List<ChangeListener> listeners = new ArrayList<>(4);

   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * @param in Underlying input stream, must be non-null (there is no public setter).
    * @param threshold Minimum position change (in bytes) to trigger a change event.
    */
   public MonitoredInputStream(InputStream in, int threshold) {
      super(in);
      this.threshold = threshold;
   }

   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * The default threshold is 16KB; a smaller threshold may hurt performance on larger streams.
    * @param in Underlying input stream, must be non-null (there is no public setter).
    */
   public MonitoredInputStream(InputStream in) {
      super(in);
      this.threshold = 1024 * 16;
   }

   public void addChangeListener(ChangeListener l) { if (!listeners.contains(l)) listeners.add(l); }
   public void removeChangeListener(ChangeListener l) { listeners.remove(l); }
   public long getProgress() { return location; }

   protected void triggerChanged(final long location) {
      if (threshold > 0 && Math.abs(location - lastTriggeredLocation) < threshold) return;
      lastTriggeredLocation = location;
      if (listeners.isEmpty()) return;
      try {
         final ChangeEvent evt = new ChangeEvent(this);
         for (ChangeListener l : listeners) l.stateChanged(evt);
      } catch (ConcurrentModificationException e) {
         triggerChanged(location); // List changed? Let's retry.
      }
   }

   @Override public int read() throws IOException {
      final int i = super.read();
      if (i != -1) triggerChanged(location++);
      return i;
   }

   @Override public int read(byte[] b, int off, int len) throws IOException {
      final int i = super.read(b, off, len);
      if (i > 0) triggerChanged(location += i);
      return i;
   }

   @Override public long skip(long n) throws IOException {
      final long i = super.skip(n);
      if (i > 0) triggerChanged(location += i);
      return i;
   }

   @Override public void mark(int readlimit) {
      super.mark(readlimit);
      mark = location;
   }

   @Override public void reset() throws IOException {
      super.reset();
      if (location != mark) triggerChanged(location = mark);
   }
}
It doesn't know - or care - how big the underlying stream is, so you need to get it some other way, such as from the file itself.
So, here goes the simplified sample usage:
try (
   MonitoredInputStream mis = new MonitoredInputStream(new FileInputStream(file), 65536 * 4)
) {
   // Set max progress and add a listener to monitor read progress
   progressBar.setMaxProgress((int) file.length()); // Swing thread or before display please
   mis.addChangeListener(new ChangeListener() { @Override public void stateChanged(ChangeEvent e) {
      SwingUtilities.invokeLater(new Runnable() { @Override public void run() {
         progressBar.setProgress((int) mis.getProgress()); // Promise me you WILL use MVC instead of this anonymous class mess!
      }});
   }});
   // Start parsing. The listener will call the Swing event thread to do the update.
   SAXParserFactory.newInstance().newSAXParser().parse(mis, this);
} catch (IOException | ParserConfigurationException | SAXException e) {
   e.printStackTrace();
} finally {
   progressBar.setVisible(false); // Again, please call this on the Swing event thread
}
In my case the progress bars rise nicely from left to right without abnormal jumps. Adjust the threshold for the best balance between performance and responsiveness: too small and reading can take more than twice as long on slow devices; too big and the progress won't be smooth.
Hope it helps. Feel free to edit if you find mistakes or typos, or vote up to send me some encouragement! :D

Use a javax.swing.ProgressMonitorInputStream.
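For reference, a minimal sketch of what that looks like; the file name is a placeholder and handler is assumed to be your DefaultHandler:
InputStream in = new ProgressMonitorInputStream(
      null,                        // parent component for the dialog
      "Parsing Wikipedia dump...", // message shown in the dialog
      new FileInputStream("enwiki.xml"));
// Swing pops up a progress dialog automatically once reading
// takes long enough to be worth reporting.
SAXParserFactory.newInstance().newSAXParser().parse(in, handler);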

You can get an estimate of the current line/column in your file by overriding the setDocumentLocator method of org.xml.sax.helpers.DefaultHandler (or the legacy HandlerBase). That method is called with an object from which you can get an approximation of the current line/column when needed.
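A rough sketch of that approach; the element handling is illustrative:
class LocatorAwareHandler extends DefaultHandler {
   private Locator locator;

   @Override
   public void setDocumentLocator(Locator locator) {
      // The parser calls this before startDocument(); keep the
      // reference and query it while parsing.
      this.locator = locator;
   }

   @Override
   public void startElement(String uri, String localName, String qName, Attributes attributes) {
      System.out.println("Now at line " + locator.getLineNumber()
            + ", column " + locator.getColumnNumber());
   }
}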
Edit: To the best of my knowledge, there is no standard way to get the absolute position. However, I am sure some SAX implementations do offer this kind of information.

Assuming you know how many articles you have, can't you just keep a counter in the handler? E.g.
public void startElement(String uri, String localName,
      String qName, Attributes attributes) throws SAXException {
   if (qName.equals("article")) {
      counter++;
   }
   ...
}
(I don't know whether you are actually parsing "article" elements; it's just an example.)
If you don't know the number of articles in advance, you will need to count them first. Then you can print the status as the number of tags read out of the total number of tags, say every 100 tags (counter % 100 == 0).
Or even have another thread monitor the progress. In that case you might want to synchronize access to the counter, though it's not strictly necessary since the count doesn't need to be perfectly accurate.
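For example (totalArticles here is assumed to come from the first counting pass):
if (qName.equals("article") && ++counter % 100 == 0) {
   System.out.println("Read " + counter + " of " + totalArticles + " articles");
}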
My 2 cents

I'd use the input stream position. Make your own trivial stream class that delegates/inherits from the "real" one and keeps track of bytes read. As you say, getting the total filesize is easy. I wouldn't worry about buffering, lookahead, etc. - for large files like these it's chickenfeed. On the other hand, I'd limit the position to "99%".
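A minimal sketch of that idea (it mirrors the MonitoredInputStream above, with the 99% cap added):
class CountingInputStream extends FilterInputStream {
   private long read = 0;
   private final long total;

   CountingInputStream(InputStream in, long total) {
      super(in);
      this.total = total;
   }

   @Override public int read() throws IOException {
      int b = super.read();
      if (b != -1) read++;
      return b;
   }

   @Override public int read(byte[] buf, int off, int len) throws IOException {
      int n = super.read(buf, off, len);
      if (n > 0) read += n;
      return n;
   }

   // Capped at 99% so buffering/lookahead never shows a premature 100%
   int percent() {
      return (int) Math.min(99, read * 100 / total);
   }
}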

Related

Race condition via use of Java's ExecutorService?

I have a class which loosely implements the HTTP Ranged-GET protocol. Essentially, some code invokes this class, telling it to download a file. The below MyClass is responsible for downloading the file in even-sized chunks (until the last chunk, which may be variable-length), and sending it to another service. When invoked, it sets the file size based on the Content-Range instance-length from the 1st HTTP response.
The class uses an ExecutorService with a thread pool size of 1 to control threading.
Below is the relevant implementation, with some hand-waving over the functions handling the GETs and PUTs.
class MyClass implements Runnable {
   private long start;
   private long chunkSize;
   private int chunkNumber;
   private int fileSize = 0;
   private static final int MAX_RETRIES = 3;
   public static final ExecutorService ES = Executors.newSingleThreadExecutor();

   public MyClass(long start, long chunkSize, int chunkNumber) {
      this.start = start;
      this.chunkSize = chunkSize;
      this.chunkNumber = chunkNumber;
   }

   public void run() {
      for (int i = 0; i < MAX_RETRIES; i++) {
         long end = start + chunkSize - 1; // inclusive so subtract 1
         // doHttpGet() is a private instance function...
         // if fileSize == 0 (i.e. first chunk downloaded), this will set the fileSize
         doHttpGet(start, end);
         // doHttpPost() is a private instance function;
         // it builds the POST from the GET message, which I'm not bothering to show here
         if (!doHttpPost()) {
            continue;
         } else {
            submitNextChunk(this);
            break;
         }
      }
   }

   // this is the function a client uses to invoke the class
   public static void submitWork(long newStartByte, long chunkSize, int chunkNumber) {
      MyClass mc = new MyClass(newStartByte, chunkSize, chunkNumber);
      if (ES.submit(mc) == null) {
         // log error
      }
   }

   // PROBLEM AREA?!?!
   private static void submitNextChunk(MyClass mc) {
      mc.chunkNumber++;
      mc.start += mc.chunkSize;
      // LOGGER.debug("start=" + mc.start + "\n" + "fileSize=" + mc.fileSize)
      if (mc.start < mc.fileSize) {
         if (ES.submit(mc) == null) {
            // log error
         }
      }
   }
}
And here is a snippet of the code which invokes MyClass.
long chunkSize = // some calculation
DownloadAction.submitWork(0L, chunkSize, 1);
This code has been working fine for a long time. However, I'm now noticing potentially non-deterministic behavior when the file to download is extremely small (e.g., < 50 bytes). What appears to be happening is that the submitNextChunk() function does not evaluate mc.start < mc.fileSize correctly. For example, if we set packetSize=100K and use a 50-byte file, then what I see -- via Wireshark -- is continuous HTTP GET requests asking for bytes 0-99999, 100000-199000, 200000-299000, ..., etc. (The code on the other end is also slightly broken, as it continues to give us the original 50 bytes rather than an HTTP out-of-range error code... but that's another story.)
My concern was that there is a subtle race condition:
If I put some logging in submitNextChunk() to print out start and fileSize, then I see only one log statement (start=100000, fileSize=100), and the function correctly evaluates the less-than expression to false. This makes sense, since that would be the first and only time submitNextChunk() is called, and since the expression evaluates to false, the function aborts.
I am concerned that this is somehow a threading issue, since without the debug statements that less-than expression clearly evaluates to true, which should not be occurring.

How do I keep track of parsing progress of large files in StAX?

I'm processing large (1TB) XML files using the StAX API. Let's assume we have a loop handling some elements:
XMLInputFactory fac = XMLInputFactory.newInstance();
XMLStreamReader reader = fac.createXMLStreamReader(new FileReader(inputFile));
while (true) {
   if (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
      // handle contents
   }
}
How do I keep track of overall progress within the large XML file? Fetching the offset from reader works fine for smaller files:
int offset = reader.getLocation().getCharacterOffset();
but being an int offset, it'll probably only work for files up to 2GB...
A simple FilterReader should work.
class ProgressCounter extends FilterReader {
   long progress = 0;

   @Override
   public long skip(long n) throws IOException {
      long skipped = super.skip(n);
      progress += skipped; // count what was actually skipped, not what was requested
      return skipped;
   }

   @Override
   public int read(char[] cbuf, int off, int len) throws IOException {
      int red = super.read(cbuf, off, len);
      if (red > 0) progress += red; // -1 means EOF, don't count it
      return red;
   }

   @Override
   public int read() throws IOException {
      int red = super.read();
      if (red != -1) progress++; // read() returns a char, not a count
      return red;
   }

   public ProgressCounter(Reader in) {
      super(in);
   }

   public long getProgress() {
      return progress;
   }
}
It seems that the StAX API can't give you a long offset.
As a workaround you could create a custom java.io.FilterReader class which overrides read() and read(char[] cbuf, int off, int len) to increment a long offset.
You would pass this reader to the XMLInputFactory.
The handler loop can then get the offset information directly from the reader.
You could also do this at the byte level using a FilterInputStream, counting the byte offset instead of the character offset. That would allow for an exact progress calculation given the file size.
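A sketch of wiring it up, reusing the ProgressCounter above (the byte-level variant is left out for brevity):
ProgressCounter counter = new ProgressCounter(new FileReader(inputFile));
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(counter);
while (reader.hasNext()) {
   reader.next();
   long charsRead = counter.getProgress(); // report progress as needed
}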

Sharing an object between multiple threads java

I would like to be able to run two methods at the same time that rely on the same global variable. The first method periodically updates the shared variable, but never finishes running. The second method keeps track of time. When time runs out, the second method returns the last result of the shared variable from the first method. Below is what I have so far, with commented-out pseudocode in the places where I need help.
package learning;

public class testmath {
   public static void main(String[] args) {
      long finishBy = 10000;
      int currentresult = 0;
      /*
       * run eversquare(0) in a separate thread / in parallel
       */
      int finalresult = manager(finishBy);
      System.out.println(finalresult);
   }

   public static int square(int x) {
      return x * x;
   }

   public static void eversquare(int x) {
      int newresult;
      while (2 == 2) {
         x += 1;
         newresult = square(x);
         /*
          * Store newresult as a global called currentresult
          */
      }
   }

   public static int manager(long finishBy) {
      while (System.currentTimeMillis() + 1000 < finishBy) {
         Thread.sleep(100);
      }
      /*
       * Access global called currentresult and create a local called currentresult
       */
      return currentresult;
   }
}
You only need to run one additional thread:
public class Main {
   /**
    * Delay in milliseconds until finished.
    */
   private static final long FINISH_BY = 10000;

   /**
    * Start with this number.
    */
   private static final int START_WITH = 1;

   /**
    * Delay between eversquare passes in milliseconds.
    */
   private static final long DELAY_BETWEEN_PASSES = 50;

   /**
    * Holds the current result. The "volatile" keyword tells the JVM that the
    * value could be changed by another thread, so don't cache it. Marking a
    * variable as volatile incurs a *serious* performance hit so don't use it
    * unless really necessary.
    */
   private static volatile int currentResult = 0;

   public static void main(String[] args) {
      // create a Thread to run "eversquare" in parallel
      Thread eversquareThread = new Thread(new Runnable() {
         @Override public void run() {
            eversquare(START_WITH, DELAY_BETWEEN_PASSES);
         }
      });

      // make the eversquare thread shut down when the "main" method exits
      // (otherwise the program would never finish, since the "eversquare" thread
      // would run forever due to its "while" loop)
      eversquareThread.setDaemon(true);

      // start the eversquare thread
      eversquareThread.start();

      // wait until the specified delay is up
      long currentTime = System.currentTimeMillis();
      final long stopTime = currentTime + FINISH_BY;
      while (currentTime < stopTime) {
         final long sleepTime = stopTime - currentTime;
         try {
            Thread.sleep(sleepTime);
         } catch (InterruptedException ex) {
            // in the unlikely event of an InterruptedException, do nothing since
            // the "while" loop will continue until done anyway
         }
         currentTime = System.currentTimeMillis();
      }
      System.out.println(currentResult);
   }

   /**
    * Increment the value and compute its square. Runs forever if left to its own
    * devices.
    *
    * @param startValue The value to start with.
    *
    * @param delay If you were to try to run this without any delay between passes,
    * it would max out the CPU and starve any other threads. This value is the wait
    * time between passes.
    */
   private static void eversquare(final int startValue, final long delay) {
      int currentValue = startValue;
      while (true) { // run forever (just use "true"; "2==2" looks silly)
         currentResult = square(currentValue); // store in the global "currentResult"
         currentValue++; // even shorter than "x += 1"
         if (delay > 0) {
            try { // need to handle the exception that "Thread.sleep()" can throw
               Thread.sleep(delay);
            } catch (InterruptedException ex) { // "Thread.sleep()" can throw this
               // just print to the console in the unlikely event of an
               // InterruptedException--things will continue fine
               ex.printStackTrace();
            }
         }
      }
   }

   private static int square(int x) {
      return x * x;
   }
}
I should also mention that the "volatile" keyword works for (most) primitives, since any JVM you'll see these days guarantees they will be modified atomically. This is not the case for objects, and you will need to use synchronized blocks and locks to ensure they are always "seen" in a consistent state.
Most people will also mention that you really should not use the synchronized keyword on the method itself, and instead synchronize on a specific "lock" object. And generally this lock object should not be visible outside your code. This helps prevent people from using your code incorrectly, getting themselves into trouble, and then trying to blame you. :)
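A minimal sketch of that private-lock idiom (the class and field names are illustrative):
class SharedResult {
   // lock object is private, so no outside code can synchronize on it
   private final Object lock = new Object();
   private int value;

   void set(int v) {
      synchronized (lock) { value = v; }
   }

   int get() {
      synchronized (lock) { return value; }
   }
}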

2 dimensional array changing during serialisation

I'm serialising and deserialising a large two-dimensional array of objects. Each object contains instructions for creating a BufferedImage - this is done to get around BufferedImage not being directly serializable itself.
The class being serialised is:
public final class MapTile extends TransientImage
{
   private static final long serialVersionUID = 0;
   private transient BufferedImage f;
   transient BufferedImage b;
   int along;
   int down;
   boolean flip = false;
   int rot = 0;

   public MapTile(World w, int a, int d)
   {
      // f = w.getMapTiles();
      along = a;
      down = d;
      assignImage();
   }

   public MapTile(World w, int a, int d, int r, boolean fl)
   {
      // f = w.getMapTiles();
      along = a;
      down = d;
      rot = r;
      flip = fl;
      assignImage();
   }

   public int getA()
   {
      return along;
   }

   public int getD()
   {
      return down;
   }

   @Override
   public void assignImage()
   {
      if (f == null)
      {
         f = World.mapTiles;
      }
      b = f.getSubimage(along, down, World.squareSize, World.squareSize);
      if (rot != 0)
      {
         b = SmallMap.rotateImage(b, rot);
      }
      if (flip)
      {
         b = SmallMap.flipImage(b);
      }
      super.setImage(b);
      f.flush();
      b.flush();
      f = null;
      b = null;
   }
}
which extends:
public abstract class TransientImage implements Serializable
{
   private transient BufferedImage image;

   public BufferedImage getImage()
   {
      return image;
   }

   public void setImage(BufferedImage i)
   {
      image = i;
   }

   public abstract void assignImage();

   private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException
   {
      in.defaultReadObject();
      assignImage();
   }
}
This will ultimately be part of a map - usually it is created randomly but certain areas must be the same each time, hence serialising them and reading the array back in. As I will never need to save the image during normal usage I am putting in the write code:
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("verticalroad.necro")))
{
   out.writeObject(mapArray);
}
catch (IOException e)
{
}
in the class that creates the map, the read code:
try
{
   FileInputStream door = new FileInputStream(new File(f.getPath() + "//verticalroad.necro"));
   ObjectInputStream reader = new ObjectInputStream(door);
   homeTiles = (MapTile[][]) reader.readObject();
}
catch (IOException | ClassNotFoundException e)
{
   System.out.println("Thrown an error: " + e.getMessage());
}
in the initialising class and commenting in and out as needed.
However, each time I run the program the contents of the two-dimensional array (mapArray in the write code, homeTiles in the read code) are different. Not only different from the one I (thought) I wrote, but also different each time the program is opened.
As can be seen, I'm printing out the toString to System.out, which reveals further oddities. As it's just a standard array, the toString isn't 100% helpful, but it seems to cycle between several distinct values. However, even when the toString gives the same value, the contents of the array as displayed are not the same.
An example of a toString is hometiles:[[Lriseofthenecromancer.MapTile;@7681720a. Looking at the documentation for Array.toString (here) it seems to be badly formed, lacking a trailing ]. I'm not sure if this is a clue to the issue or if it's simply that the array is very large (several thousand objects) and it's an issue of display space (I'm using NetBeans).
Any insight as to why this is changing would be appreciated. My working assumption is that it's serializing the array but not the contents. But I have no idea a) if that's the case and b) if it is, what to do about it.
EDIT: Looking into this a bit further, it seems that instance variables aren't being set immediately. Printing them out directly after the call to setImage() has them all at zero, printing them from the calling class has them where they should be.
The underlying problem was that I'm an idiot. The specific expression of this in this particular case was that I forgot that subclasses couldn't inherit private methods. As such, the assignImage call wasn't being made and the image wasn't being set up.
Sorry for wasting the time of anyone who looked at this. I feel quite embarrassed.

Multiple lines of text to a single map

I've been trying to use Hadoop to send N lines to a single mapping. I don't require the lines to be split already.
I've tried to use NLineInputFormat, however that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line].
I have tried to set the option, but it still takes N lines of input and sends them one line at a time to each map:
job.setInt("mapred.line.input.format.linespermap", 10);
I've found a mailing list recommending that I override LineRecordReader::next, however that is not that simple, as the internal data members are all private.
I've just checked the source for NLineInputFormat and it hard codes LineReader, so overriding will not help.
Also, btw I'm using Hadoop 0.18 for compatibility with the Amazon EC2 MapReduce.
You have to implement your own input format. You can then also define your own record reader.
Unfortunately you have to define a getSplits() method. In my opinion this will be harder than implementing the record reader: this method has to implement the logic to chunk the input data.
See the following excerpt from "Hadoop - The definitive guide" (a great book I would always recommend!):
Here’s the interface:
public interface InputFormat<K, V> {
   InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

   RecordReader<K, V> getRecordReader(InputSplit split,
                                      JobConf job,
                                      Reporter reporter) throws IOException;
}
The JobClient calls the getSplits() method, passing the desired number of map tasks as the numSplits argument. This number is treated as a hint, as InputFormat implementations are free to return a different number of splits to the number specified in numSplits. Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers. On a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function. A code snippet (based on the code in MapRunner) illustrates the idea:
K key = reader.createKey();
V value = reader.createValue();
while (reader.next(key, value)) {
   mapper.map(key, value, output, reporter);
}
I solved this problem recently by simply creating my own InputFormat that extends NLineInputFormat and implements a custom MultiLineRecordReader instead of the default LineReader.
I chose to extend NLineInputFormat because I wanted to have the same guarantee of having exactly N line(s) per split.
This record reader is taken almost as is from http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
The only things I modified are the property for maxLineLength, which now uses the new API, and the value of NLINESTOPROCESS, which is read from NLineInputFormat's getNumLinesPerSplit() instead of being hardcoded (for more flexibility).
Here is the result:
public class MultiLineInputFormat extends NLineInputFormat {
   @Override
   public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) {
      context.setStatus(genericSplit.toString());
      return new MultiLineRecordReader();
   }

   public static class MultiLineRecordReader extends RecordReader<LongWritable, Text> {
      private int NLINESTOPROCESS;
      private LineReader in;
      private LongWritable key;
      private Text value = new Text();
      private long start = 0;
      private long end = 0;
      private long pos = 0;
      private int maxLineLength;

      @Override
      public void close() throws IOException {
         if (in != null) {
            in.close();
         }
      }

      @Override
      public LongWritable getCurrentKey() throws IOException, InterruptedException {
         return key;
      }

      @Override
      public Text getCurrentValue() throws IOException, InterruptedException {
         return value;
      }

      @Override
      public float getProgress() throws IOException, InterruptedException {
         if (start == end) {
            return 0.0f;
         } else {
            return Math.min(1.0f, (pos - start) / (float) (end - start));
         }
      }

      @Override
      public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException, InterruptedException {
         NLINESTOPROCESS = getNumLinesPerSplit(context);
         FileSplit split = (FileSplit) genericSplit;
         final Path file = split.getPath();
         Configuration conf = context.getConfiguration();
         this.maxLineLength = conf.getInt("mapreduce.input.linerecordreader.line.maxlength", Integer.MAX_VALUE);
         FileSystem fs = file.getFileSystem(conf);
         start = split.getStart();
         end = start + split.getLength();
         boolean skipFirstLine = false;
         FSDataInputStream filein = fs.open(split.getPath());
         if (start != 0) {
            skipFirstLine = true;
            --start;
            filein.seek(start);
         }
         in = new LineReader(filein, conf);
         if (skipFirstLine) {
            start += in.readLine(new Text(), 0, (int) Math.min((long) Integer.MAX_VALUE, end - start));
         }
         this.pos = start;
      }

      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
         if (key == null) {
            key = new LongWritable();
         }
         key.set(pos);
         if (value == null) {
            value = new Text();
         }
         value.clear();
         final Text endline = new Text("\n");
         int newSize = 0;
         for (int i = 0; i < NLINESTOPROCESS; i++) {
            Text v = new Text();
            while (pos < end) {
               newSize = in.readLine(v, maxLineLength, Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
               value.append(v.getBytes(), 0, v.getLength());
               value.append(endline.getBytes(), 0, endline.getLength());
               if (newSize == 0) {
                  break;
               }
               pos += newSize;
               if (newSize < maxLineLength) {
                  break;
               }
            }
         }
         if (newSize == 0) {
            key = null;
            value = null;
            return false;
         } else {
            return true;
         }
      }
   }
}
I think that in your case you can follow the delegation pattern and implement a wrapper around LineRecordReader that overrides the necessary methods, i.e. next() (or nextKeyValue() in the new API), to set the value to a concatenation of N lines rather than one line.
I googled and found an example implementation of ParagraphRecordReader that uses LineRecordReader to read input data line by line (concatenating it) until it encounters either EOF or a blank line. It then returns a key-value pair where the value is a paragraph (instead of one line). Moreover, the ParagraphInputFormat for this ParagraphRecordReader is as simple as the standard TextInputFormat.
You can find the necessary links to this implementation and a few words about it in the following post: http://hadoop-mapreduce.blogspot.com/2011/03/little-more-complicated-recordreaders.html.
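For illustration, a rough sketch of that delegation pattern in the new API; the class and field names are mine, and nLines would come from your job configuration:
public class MultiLineDelegatingReader extends RecordReader<LongWritable, Text> {
   private final LineRecordReader delegate = new LineRecordReader();
   private final int nLines;
   private LongWritable key;
   private Text value;

   public MultiLineDelegatingReader(int nLines) { this.nLines = nLines; }

   @Override
   public void initialize(InputSplit split, TaskAttemptContext context)
         throws IOException, InterruptedException {
      delegate.initialize(split, context);
   }

   @Override
   public boolean nextKeyValue() throws IOException, InterruptedException {
      StringBuilder lines = new StringBuilder();
      key = null;
      // concatenate up to nLines lines from the wrapped reader
      for (int i = 0; i < nLines && delegate.nextKeyValue(); i++) {
         if (key == null) key = new LongWritable(delegate.getCurrentKey().get());
         lines.append(delegate.getCurrentValue()).append('\n');
      }
      if (key == null) return false; // wrapped reader is exhausted
      value = new Text(lines.toString());
      return true;
   }

   @Override public LongWritable getCurrentKey() { return key; }
   @Override public Text getCurrentValue() { return value; }
   @Override public float getProgress() throws IOException, InterruptedException { return delegate.getProgress(); }
   @Override public void close() throws IOException { delegate.close(); }
}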
Best
