I have a 35 GB XML file (yes, some organizations do that and I have no control over it) that I would like to SAX parse. I found an example here:
http://www.java2s.com/Code/Java/XML/SAXDemo.htm
of how to run a SAX parser and avoid loading everything into memory. However, I get an out-of-memory error almost immediately. Why does this happen, and how can I make this code scale to any XML file size?
Here is my code:
import org.apache.log4j.Logger;
import org.xml.sax.AttributeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
public class XMLSAXTools extends org.xml.sax.helpers.DefaultHandler {
/**
* Logging facility
*/
static Logger logger = Logger.getLogger(XMLSAXTools.class);
private String fileName = "C:/Data/hugefile.xml";
private int counter = 0;
/** The main method sets things up for parsing */
public void test() throws IOException, SAXException,
ParserConfigurationException {
// Create a JAXP "parser factory" for creating SAX parsers
javax.xml.parsers.SAXParserFactory spf = SAXParserFactory.newInstance();
// Configure the parser factory for the type of parsers we require
spf.setValidating(false); // No validation required
// Now use the parser factory to create a SAXParser object
// Note that SAXParser is a JAXP class, not a SAX class
javax.xml.parsers.SAXParser sp = spf.newSAXParser();
// Create a SAX input source for the file argument
org.xml.sax.InputSource input = new InputSource(new FileReader(fileName));
// Give the InputSource an absolute URL for the file, so that
// it can resolve relative URLs in a <!DOCTYPE> declaration, e.g.
input.setSystemId("file://" + new File(fileName).getAbsolutePath());
// Create an instance of this class; it defines all the handler methods
XMLSAXTools handler = new XMLSAXTools();
// Finally, tell the parser to parse the input and notify the handler
sp.parse(input, handler);
// Instead of using the SAXParser.parse() method, which is part of the
// JAXP API, we could also use the SAX1 API directly. Note the
// difference between the JAXP class javax.xml.parsers.SAXParser and
// the SAX1 class org.xml.sax.Parser
//
// org.xml.sax.Parser parser = sp.getParser(); // Get the SAX parser
// parser.setDocumentHandler(handler); // Set main handler
// parser.setErrorHandler(handler); // Set error handler
// parser.parse(input); // Parse!
}
StringBuffer accumulator = new StringBuffer(); // Accumulate parsed text
String servletName; // The name of the servlet
String servletClass; // The class name of the servlet
String servletId; // Value of id attribute of <servlet> tag
// When the parser encounters plain text (not XML elements), it calls
// this method, which accumulates them in a string buffer
public void characters(char[] buffer, int start, int length) {
accumulator.append(buffer, start, length);
}
// Every time the parser encounters the beginning of a new element, it
// calls this method, which resets the string buffer
public void startElement(String name, AttributeList attributes) {
accumulator.setLength(0); // Ready to accumulate new text
if (name.equals("item")) {
logger.info("item tag opened");
counter++;
}
}
// When the parser encounters the end of an element, it calls this method
public void endElement(String name) {
if (name.equals("item")) {
logger.info("item tag closed. Counter: " + counter);
}
}
/** This method is called when warnings occur */
public void warning(SAXParseException exception) {
System.err.println("WARNING: line " + exception.getLineNumber() + ": "
+ exception.getMessage());
}
/** This method is called when errors occur */
public void error(SAXParseException exception) {
System.err.println("ERROR: line " + exception.getLineNumber() + ": "
+ exception.getMessage());
}
/** This method is called when non-recoverable errors occur. */
public void fatalError(SAXParseException exception) throws SAXException {
System.err.println("FATAL: line " + exception.getLineNumber() + ": "
+ exception.getMessage());
throw (exception);
}
public static void main(String[] args){
XMLSAXTools t = new XMLSAXTools();
try {
t.test();
} catch (Exception e){
logger.error("Exception in XMLSAXTools: " + e.getMessage());
e.printStackTrace();
}
}
}
You are filling up your accumulator without ever emptying it - this is unlikely to be what you want. Note also that your startElement/endElement methods use the old SAX1 signatures (with AttributeList); a SAX2 parser driving a DefaultHandler never calls them, so your accumulator.setLength(0) reset never runs while characters() keeps appending.
Just using SAX is not sufficient to ensure you do not run out of memory - you still need to implement the code that finds, selects and processes what you do need from the XML and discards the rest.
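For illustration, a minimal sketch of the handler rewritten against the SAX2 callback signatures, so the callbacks actually fire and the buffer stays bounded (the per-item processing is a placeholder):
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ItemCountingHandler extends DefaultHandler {
    private final StringBuilder accumulator = new StringBuilder();
    private int counter = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        accumulator.setLength(0); // reset for the new element's text
        if ("item".equals(qName)) {
            counter++;
        }
    }

    @Override
    public void characters(char[] buffer, int start, int length) {
        accumulator.append(buffer, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(qName)) {
            // Consume accumulator.toString() here; it is reset on the next start
            // tag, so memory use no longer grows with file size.
            System.out.println("item " + counter + " closed");
        }
    }
}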
Here's a fairly simple parser that is designed to be run in a separate thread. It communicates with the calling thread via an ArrayBlockingQueue<String>, which is defined in an enclosing class.
The huge data files I have to deal with are essentially <Batch> ... a few thousand items ... </Batch>. This parser pulls each item out and presents them one at a time through the blocking queue. One day I will turn them into XOM Elements, but at the moment it uses Strings.
Notice how it clears down its temporary data fields when enqueue is called, to ensure we don't run out of memory:
private class Parser extends DefaultHandler {
// Track the depth of the xml - whenever we hit level 1 we add the accumulated xml to the queue.
private int level = 0;
// The current xml fragment.
private final StringBuilder xml = new StringBuilder();
// We've had a start tag but no data yet.
private boolean tagWithNoData = false;
/*
* Called when the starting of the Element is reached. For Example if we have Tag
* called <Title> ... </Title>, then this method is called when <Title> tag is
* Encountered while parsing the Current XML File. The AttributeList Parameter has
* the list of all Attributes declared for the Current Element in the XML File.
*/
@Override
public void startElement(final String uri, final String localName, final String name, final Attributes atrbts) throws SAXException {
checkForAbort();
// Have we got back to level 1 yet?
if (level == 1) {
// Emit any built ones.
try {
enqueue();
} catch (InterruptedException ex) {
Throwables.rethrow(ex);
}
}
// Add it on.
if (level > 0) {
// The name.
xml.append("<").append(name);
// The attributes.
for (int i = 0; i < atrbts.getLength(); i++) {
final String att = atrbts.getValue(i);
xml.append(" ").append(atrbts.getQName(i)).append("=\"").append(XML.to(att)).append("\"");
}
// Done.
xml.append(">");
// Remember we've not had any data yet.
tagWithNoData = true;
}
// Next element is a sub-element.
level += 1;
}
/*
* Called when the Ending of the current Element is reached. For example in the
* above explanation, this method is called when </Title> tag is reached
*/
@Override
public void endElement(final String uri, final String localName, final String name) throws SAXException {
checkForAbort();
if (level > 1) {
if (tagWithNoData) {
// No data. Make the > into a />
xml.insert(xml.length() - 1, "/");
// I've closed this one but the enclosing one has data (i.e. this one).
tagWithNoData = false;
} else {
// Had data, finish properly.
xml.append("</").append(name).append(">");
}
}
// Done with that level.
level -= 1;
if (level == 1) {
// Finished and at level 1.
try {
// Enqueue the results.
enqueue();
} catch (InterruptedException ex) {
Throwables.rethrow(ex);
}
}
}
/*
* Called when the data part is encountered.
*/
@Override
public void characters(final char buf[], final int offset, final int len) throws SAXException {
checkForAbort();
// I want it trimmed.
final String chs = new String(buf, offset, len).trim();
if (chs.length() > 0) {
// Grab that data.
xml.append(XML.to(chs));
tagWithNoData = false;
}
}
/*
* Called when the Parser starts parsing the Current XML File.
*/
@Override
public void startDocument() throws SAXException {
checkForAbort();
tagWithNoData = false;
}
/*
* Called when the Parser Completes parsing the Current XML File.
*/
@Override
public void endDocument() throws SAXException {
checkForAbort();
try {
// Enqueue the results.
enqueue();
} catch (InterruptedException ex) {
Throwables.rethrow(ex);
}
}
private void enqueue() throws InterruptedException, SAXException {
// We may have been closed while blocking on the queue.
checkForAbort();
final String x = xml.toString().trim();
if (x.length() > 0) {
// Add it to the queue.
queue.put(x);
// Clear out.
xml.setLength(0);
tagWithNoData = false;
}
// We may have been closed while blocking on the queue.
checkForAbort();
}
private void checkForAbort() throws XMLInnerDocumentIteratorAbortedException {
if (iteratorFinished) {
LOGGER.debug("Aborting!!!");
throw new XMLInnerDocumentIterator.XMLInnerDocumentIteratorAbortedException("Aborted!");
}
}
}
}
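On the consuming side, the enclosing class drains the queue one fragment at a time. A rough sketch of what that might look like (the field names and queue capacity here are assumptions, not the original code):
import java.util.concurrent.ArrayBlockingQueue;

public class XMLInnerDocumentIterator {
    private final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(64);

    // Take one <item> fragment at a time; only the fragment currently being
    // processed (plus the small queue buffer) is ever held in memory.
    public void consume() throws InterruptedException {
        while (true) {
            String itemXml = queue.take(); // blocks until the parser enqueues one
            handleItem(itemXml);
        }
    }

    private void handleItem(String itemXml) {
        // placeholder: parse the small fragment (e.g. into a XOM Element) and process it
    }
}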
I have a Java EE (desktop) application that has to process data files generated by multiple sources (up to 200 different sources). Each source periodically generates a data file with a unique name which also contains that source's unique ID.
I need to create a thread pool with 15 threads which will process and remove files with these constraints:
Multiple threads can't process files from the same source simultaneously.
Multiple files from the same source should be processed in order of their creation timestamps.
No synchronization with the file-generating sources is possible, which means the next file(s) may be generated by a source while its previous file is being processed or is scheduled for processing.
Processing should be multithreaded for performance reasons (single-threaded processing is not enough, so I'm planning to use 10-15 threads).
A file processing operation may be time-consuming: 3-15 seconds.
Any suggestions on architecture of such complex synchronization of the threads in pool are welcome.
P.S. Due to the limitation on simultaneous processing, the design I've used in simpler situations earlier (i.e. a single ArrayBlockingQueue) does not fit this case.
General idea:
You have a task-queue per source.
And you have a central queue which is effectively a queue of task-queues which is shared between all worker threads.
For each source you create a task-queue, and you stick these task-queues in a hashtable keyed on the unique ID. This way you get the guarantee that tasks from the same source are processed in order (requirement 2).
When a task is received, you look up (or create) the task-queue in the hashtable and add the task to it. If it was the first task added to that queue, you also add the queue to the central queue.
Then there are a bunch of worker threads that take task-queues from this central queue, take a single task from the task-queue they just took, and process that task. Once they are done with the task, they need to decide whether the task-queue needs to be reinserted into the central queue or not.
There are a few parts where things could easily go wrong:
You don't want to end up with a task-queue being inserted into the central-queue multiple times. That would violate your first requirement.
You don't want the task-queue not being reinserted into the central-queue even though a task is available.
So you need to take care of the appropriate synchronization, and it might be a bit more complex than you would initially think. But given that the tasks are long-running, I would start out with a regular mutex (could be per task-queue), e.g. synchronized or a Lock, and not worry about making it non-blocking. A compact sketch of the idea follows.
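Here is that sketch of the queue-of-queues design (all names are illustrative; abort/shutdown handling is omitted):
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

class SourceDispatcher {
    private final ConcurrentHashMap<String, Queue<Runnable>> perSource = new ConcurrentHashMap<>();
    private final BlockingQueue<Queue<Runnable>> central = new LinkedBlockingQueue<>();

    // Producer side: called for every incoming task.
    void submit(String sourceId, Runnable task) {
        Queue<Runnable> q = perSource.computeIfAbsent(sourceId, id -> new ArrayDeque<>());
        boolean wasEmpty;
        synchronized (q) {
            wasEmpty = q.isEmpty();
            q.add(task);
        }
        if (wasEmpty) {
            central.add(q); // empty -> non-empty transition: make the source runnable
        }
    }

    // Each worker thread runs this loop. A task-queue is in 'central' at most
    // once, so no two workers ever process the same source concurrently.
    void workerLoop() throws InterruptedException {
        while (true) {
            Queue<Runnable> q = central.take(); // claim a source
            Runnable task;
            synchronized (q) {
                task = q.peek();
            }
            task.run(); // run outside the lock: tasks are long-running
            synchronized (q) {
                q.remove();
                if (!q.isEmpty()) {
                    central.add(q); // reinsert exactly once if work remains
                }
            }
        }
    }
}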
So this is the skeleton/tester of the class I've created to solve my problem. See @pveentjer's answer for some details on what's going on here.
package org.zur.test;
import java.io.File;
import java.util.Comparator;
import java.util.HashMap;
import java.util.PriorityQueue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
public class PmFileProcessor {
private static final int THREADS_COUNT = 15;
ArrayBlockingQueue<File> scheduledFiles = new ArrayBlockingQueue<>(10000, false);
HashMap<String, PmFileProcessingJob> allJobs = new HashMap<>();
ScheduledExecutorService jobsExecutorService = Executors.newScheduledThreadPool(THREADS_COUNT);
public PmFileProcessor() {
super();
SourceJobsManager fc = new SourceJobsManager();
fc.setDaemon(true);
fc.start();
}
public void scheduleFile(File f) {
try {
scheduledFiles.add(f);
} catch (Exception e) {
// TODO: handle exception
}
}
/**
* Assigns files to file source processing job.
*
* @author
* <ul>
* <li>Zur13</li>
* </ul>
*
*/
public class SourceJobsManager extends Thread {
@Override
public void run() {
// assigns scheduled files to per-source jobs and schedules job for additional execution
while ( true ) {
try {
File f = scheduledFiles.take();
PmFileProcessingJob job = getSourceJob(f);
job.scheduleSourceFile(f);
jobsExecutorService.execute(job); // schedules job execution
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// TODO: check disk space periodically
}
}
/**
* Finds existing job for file source or creates a new one.
*
* @param f
* @return
*/
private PmFileProcessingJob getSourceJob(File f) {
// TODO: test code
String fname = f.getName();
String[] parts = fname.split("_");
String uid = parts[0];
PmFileProcessingJob res = allJobs.get(uid);
if ( res == null ) {
res = new PmFileProcessingJob(uid);
allJobs.put(uid, res);
}
return res;
}
}
/**
* Process first file from scheduledSourceFiles queue (i.e. each job execution processes a single file or
* reschedules itself for later execution if another thread already processes the file from the same source).
*
* @author
* <ul>
* <li>Zur13</li>
* </ul>
*
*/
public class PmFileProcessingJob implements Runnable {
public final String fileSourceUidString;
PriorityQueue<File> scheduledSourceFiles = new PriorityQueue<>(1000, new Comparator<File>() {
@Override
public int compare(File o1, File o2) {
// TODO Auto-generated method stub
return 0;
}
});
Semaphore onePassSemaphore = new Semaphore(1);
public PmFileProcessingJob(String fileSourceUid) {
super();
this.fileSourceUidString = fileSourceUid;
}
/**
* Schedules a file for processing by this job.
*/
public void scheduleSourceFile(File f) {
scheduledSourceFiles.add(f);
}
@Override
public void run() {
File f = null;
if ( scheduledSourceFiles.size() > 0 ) { // fail fast optimization 1
if ( onePassSemaphore.tryAcquire() ) { // fail fast optimization 2
try {
f = scheduledSourceFiles.poll();
if ( f != null ) {
// TODO: process the file
try {
System.err.println(f.getName() + "\t" + Thread.currentThread().getId());
Thread.sleep(1000);
return;
} catch (Exception e) {
// TODO: handle exception
return; // prevents reschedule loop for failing files
}
} else {
// scheduledSourceFiles queue is empty
return;
}
} finally {
onePassSemaphore.release();
}
}
if ( f == null && scheduledSourceFiles.size() > 0 ) {
// this thread did not process the scheduled file because another thread holds the critical section
// pass
// this thread should reschedule this Job to release this thread and try to process this job later
// with another thread
// reschedule the job with 4 seconds delay to prevent excess CPU usage
// System.err.println("RESCHEDULE");
jobsExecutorService.schedule(this, 3, TimeUnit.SECONDS);
}
}
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((this.fileSourceUidString == null) ? 0 : this.fileSourceUidString.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if ( this == obj )
return true;
if ( !(obj instanceof PmFileProcessingJob) )
return false;
PmFileProcessingJob other = (PmFileProcessingJob) obj;
if ( this.fileSourceUidString == null ) {
if ( other.fileSourceUidString != null )
return false;
} else if ( !this.fileSourceUidString.equals(other.fileSourceUidString) )
return false;
return true;
}
}
public static void main(String[] args) {
PmFileProcessor fp = new PmFileProcessor();
fp.unitTest();
}
private void unitTest() {
// TODO Auto-generated method stub
int filesCount = 1000;
for (int i = 0; i < filesCount; i++) {
int sourceUid = ThreadLocalRandom.current().nextInt(1, 30);
File f = new File(sourceUid + "_" + i);
scheduleFile(f);
}
Thread.yield();
try {
Thread.sleep(999000);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
I need to support the situation where a user submits an invalid XML file to me and I report back to them information about the error. Ideally the location of the error (line number and column number) and the nature of the error.
My sample code (see below) works well enough when there is a missing tag or similar error. In that case, I get an approximate location and a useful explanation. However my code fails spectacularly when the XML file contains non-UTF-8 characters. In this case, I get a useless error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I cannot find a way to determine the line number where the invalid character might be, nor the character itself. Is there a way to do this?
If, as one comment suggests, it may not be possible as we don't get to the parsing step, is there a way to process the XML file, not with a parser, but simply line-by-line, looking for and reporting non-UTF-8 characters?
Sample code follows. First a basic error handler:
public class XmlErrorHandler implements ErrorHandler {
@Override
public void warning(SAXParseException e) throws SAXException {
show("Warning", e); throw e;
}
@Override
public void error(SAXParseException e) throws SAXException {
show("Error", e); throw e;
}
@Override
public void fatalError(SAXParseException e) throws SAXException {
show("Fatal", e); throw e;
}
private void show(String type, SAXParseException e) {
System.out.println("Line " + e.getLineNumber() + " Column " + e.getColumnNumber());
System.out.println(type + ": " + e.getMessage());
}
}
And a trivial test program:
public class XmlTest {
public static void main(String[] args) {
try {
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser parser = spf.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setContentHandler(new DefaultHandler());
reader.setErrorHandler(new XmlErrorHandler());
InputSource is = new InputSource(args[0]);
reader.parse(is);
}
catch (SAXException e) { // Useful error case
System.err.println(e);
e.printStackTrace(System.err);
}
catch (Exception e) { // Useless error case arrives here
System.err.println(e);
e.printStackTrace();
}
}
}
Sample XML File (with non-UTF-8 smart quotes from (say) a Word document):
<?xml version="1.0" encoding="UTF-8"?>
<example>
<![CDATA[Text with <91>smart quotes<92>.]]>
</example>
I had some success with identifying where the issue in the XML file is using a couple of approaches.
Adapting the code from my question to use a home-grown ContentHandler with a Locator (see below) demonstrated that the XML was being processed up until the invalid character was encountered. In particular, the line number was being tracked. Preserving the line number allowed it to be retrieved from the ContentHandler when the problematic exception occurred.
At this point, I came up with two possibilities. The first was to re-run the processing with a different encoding on the InputStream, e.g. Windows-1252. Parsing completed without error in this instance and I was able to retrieve the characters on the line with the known issue. This allows for a reasonably useful error message to the user, i.e. the line number and the offending characters.
My second approach was to adapt the code from the top-rated answer to this SO question. This code allows you to find the first non-UTF-8 character in a byte stream. If you assume that 0x0A (linefeed) represents a new line in the XML (and this seems to work pretty well in practice), then the line number, column number and the invalid characters can be extracted easily enough for a precise error message.
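As an illustration of that second approach, here is a sketch that strictly decodes the raw bytes as UTF-8 and reports the first malformed sequence, counting 0x0A bytes as line breaks as described above (reading the whole file at once is an assumption that is fine for modestly sized uploads):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Checker {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(4096);
        CoderResult result = decoder.decode(in, out, true);
        while (result.isOverflow()) { // discard decoded chars; we only want the error position
            out.clear();
            result = decoder.decode(in, out, true);
        }
        if (result.isError()) {
            // On error, the buffer position is left at the start of the bad sequence.
            int line = 1, column = 1;
            for (int i = 0; i < in.position(); i++) {
                if (bytes[i] == 0x0A) { line++; column = 1; } else { column++; }
            }
            System.err.printf("Invalid byte 0x%02X at line %d, column %d%n",
                    bytes[in.position()] & 0xFF, line, column);
        } else {
            System.out.println("File is valid UTF-8");
        }
    }
}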
// Modified test program
public class XmlTest {
public static void main(String[] args) {
ErrorFinder errorFinder = new ErrorFinder(0); // Create our own content handler
InputSource is = new InputSource(args[0]);    // Declared here so the catch block below can reuse it
try {
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser parser = spf.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setContentHandler(errorFinder); // Use instead of the default handler
reader.setErrorHandler(new XmlErrorHandler());
reader.parse(is);
}
catch (SAXException e) { // Useful error case
System.err.println(e);
e.printStackTrace(System.err);
}
catch (Exception e) { // Useless error case arrives here
System.err.println(e);
e.printStackTrace();
// Option 1: repeat parsing (see above) with a new ErrorFinder initialised thus:
ErrorFinder ef2 = new ErrorFinder(errorFinder.getCurrentLineNumber());
is.setEncoding("Windows-1252");
}
}
}
// Content handler with irrelevant method implementations elided.
public class ErrorFinder implements ContentHandler {
private int lineNumber; // If non-zero, the line number to retrieve characters for.
private int currentLineNumber;
private char[] chars;
private Locator locator;
public ErrorFinder(int lineNumber) {
super();
this.lineNumber = lineNumber;
}
public void setDocumentLocator(Locator locator) {
this.locator = locator;
}
@Override
public void startDocument() throws SAXException {
currentLineNumber = locator.getLineNumber();
}
... // Skip other overridden methods as they have the same code as startDocument().
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
currentLineNumber = locator.getLineNumber();
if (currentLineNumber == lineNumber) {
char[] c = new char[length];
System.arraycopy(ch, start, c, 0, length);
chars = c;
}
}
public int getCurrentLineNumber() {
return currentLineNumber;
}
public char[] getChars() {
return chars;
}
}
The answer here seemed to be a valid solution before Java 8:
How to cancel Files.copy() in Java?
But now it doesn't work, because ExtendedCopyOption.INTERRUPTIBLE is private.
Basically, I need to download a file from some given URL and save it to my local file-system using Files.copy().
Currently, I am using a JavaFX Service because I need to show the progress in a ProgressBar.
However, I don't know how to stop the thread running Files.copy() if the operation takes too long.
Using Thread.stop() is certainly unwanted, and even Thread.interrupt() fails.
I also want the operation to terminate gracefully if the internet connection becomes unavailable.
To test the case when no internet connection is available, I'm removing my ethernet cable and putting it back after 3 seconds.
Unfortunately, Files.copy() returns only when I put back the ethernet cable, while I would like it to fail immediately.
As far as I can see, Files.copy() internally runs a loop, which prevents the thread from exiting.
Tester (downloading the OBS Studio exe):
/**
* @author GOXR3PLUS
*
*/
public class TestDownloader extends Application {
/**
* #param args
*/
public static void main(String[] args) {
launch(args);
}
@Override
public void start(Stage primaryStage) throws Exception {
// Block From exiting
Platform.setImplicitExit(false);
// Try to download the File from URL
new DownloadService().startDownload(
"https://github.com/jp9000/obs-studio/releases/download/17.0.2/OBS-Studio-17.0.2-Small-Installer.exe",
System.getProperty("user.home") + File.separator + "Desktop" + File.separator + "OBS-Studio-17.0.2-Small-Installer.exe");
}
}
DownloadService:
Using @sillyfly's comment about FileChannel and removing Files.copy() seems to work only when calling Thread.interrupt(), but it still does not exit when the internet is not available.
import java.io.File;
import java.net.URL;
import java.net.URLConnection;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;
import java.util.logging.Level;
import java.util.logging.Logger;
import javafx.concurrent.Service;
import javafx.concurrent.Task;
/**
* JavaFX Service which is Capable of Downloading Files from the Internet to the
* LocalHost
*
* @author GOXR3PLUS
*
*/
public class DownloadService extends Service<Boolean> {
// -----
private long totalBytes;
private boolean succeeded = false;
private volatile boolean stopThread;
// CopyThread
private Thread copyThread = null;
// ----
private String urlString;
private String destination;
/**
* The logger of the class
*/
private static final Logger LOGGER = Logger.getLogger(DownloadService.class.getName());
/**
* Constructor
*/
public DownloadService() {
setOnFailed(f -> System.out.println("Failed with value: " + super.getValue()+" , Copy Thread is Alive? "+copyThread.isAlive()));
setOnSucceeded(s -> System.out.println("Succeeded with value: " + super.getValue()+" , Copy Thread is Alive? "+copyThread.isAlive()));
setOnCancelled(c -> System.out.println("Cancelled with value: " + super.getValue()+" , Copy Thread is Alive? "+copyThread.isAlive()));
}
/**
* Start the Download Service
*
* @param urlString
* The source File URL
* @param destination
* The destination File
*/
public void startDownload(String urlString, String destination) {
if (!super.isRunning()) {
this.urlString = urlString;
this.destination = destination;
totalBytes = 0;
restart();
}
}
@Override
protected Task<Boolean> createTask() {
return new Task<Boolean>() {
@Override
protected Boolean call() throws Exception {
// Succeeded boolean
succeeded = true;
// URL and LocalFile
URL urlFile = new URL(java.net.URLDecoder.decode(urlString, "UTF-8"));
File destinationFile = new File(destination);
try {
// Open the connection and get totalBytes
URLConnection connection = urlFile.openConnection();
totalBytes = Long.parseLong(connection.getHeaderField("Content-Length"));
// --------------------- Copy the File to External Thread-----------
copyThread = new Thread(() -> {
// Start File Copy
try (FileChannel zip = FileChannel.open(destinationFile.toPath(), StandardOpenOption.CREATE,
StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
zip.transferFrom(Channels.newChannel(connection.getInputStream()), 0, Long.MAX_VALUE);
// Files.copy(dl.openStream(), fl.toPath(),StandardCopyOption.REPLACE_EXISTING)
} catch (Exception ex) {
stopThread = true;
LOGGER.log(Level.WARNING, "DownloadService failed", ex);
}
System.out.println("Copy Thread exited...");
});
// Set to Daemon
copyThread.setDaemon(true);
// Start the Thread
copyThread.start();
// -------------------- End of Copy the File to External Thread-------
// ---------------------------Check the %100 Progress--------------------
long outPutFileLength;
long previousLength = 0;
int failCounter = 0;
// While Loop
while ((outPutFileLength = destinationFile.length()) < totalBytes && !stopThread) {
// Check the previous length
if (previousLength != outPutFileLength) {
previousLength = outPutFileLength;
failCounter = 0;
} else
++failCounter;
// 2 Seconds passed without response
if (failCounter == 40 || stopThread)
break;
// Update Progress
super.updateProgress((outPutFileLength * 100) / totalBytes, 100);
System.out.println("Current Bytes:" + outPutFileLength + " ,|, TotalBytes:" + totalBytes
+ " ,|, Current Progress: " + (outPutFileLength * 100) / totalBytes + " %");
// Sleep
try {
Thread.sleep(50);
} catch (InterruptedException ex) {
LOGGER.log(Level.WARNING, "", ex);
}
}
// 2 Seconds passed without response
if (failCounter == 40)
succeeded = false;
// --------------------------End of Check the %100 Progress--------------------
} catch (Exception ex) {
succeeded = false;
// Stop the External Thread which is updating the %100
// progress
stopThread = true;
LOGGER.log(Level.WARNING, "DownloadService failed", ex);
}
//----------------------Finally------------------------------
System.out.println("Trying to interrupt[shoot with an assault rifle] the copy Thread");
// ---FORCE STOP COPY FILES
if (copyThread != null && copyThread.isAlive()) {
copyThread.interrupt();
System.out.println("Done an interrupt to the copy Thread");
// Run a Looping checking if the copyThread has stopped...
while (copyThread.isAlive()) {
System.out.println("Copy Thread is still Alive,refusing to die.");
Thread.sleep(50);
}
}
System.out.println("Download Service exited:[Value=" + succeeded + "] Copy Thread is Alive? "
+ (copyThread == null ? "" : copyThread.isAlive()));
//---------------------- End of Finally------------------------------
return succeeded;
}
};
}
}
Interesting questions:
1. What does java.lang.Thread.interrupt() do?
I strongly encourage you to use a FileChannel.
It has the transferFrom() method which returns immediately when the thread running it is interrupted.
(The Javadoc here says that it should raise a ClosedByInterruptException, but it doesn't.)
try (FileChannel channel = FileChannel.open(Paths.get(...), StandardOpenOption.CREATE,
StandardOpenOption.WRITE)) {
channel.transferFrom(Channels.newChannel(new URL(...).openStream()), 0, Long.MAX_VALUE);
}
It also has the potential to perform much better than its java.io alternative.
(However, it turns out that the implementation of Files.copy() may elect to delegate to this method instead of actually performing the copy by itself.)
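If you need both prompt cancellation and progress reporting, another option is a hand-rolled copy loop. This is my own sketch, not what Files.copy() does internally:
import java.io.InputStream;
import java.io.OutputStream;

public final class InterruptibleCopy {
    // Copy in small chunks, checking the interrupt flag between reads so the
    // worker thread can actually be cancelled; 'copied' can drive a progress bar.
    public static long copy(InputStream in, OutputStream out) throws Exception {
        byte[] buffer = new byte[8192];
        long copied = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException("copy cancelled");
            }
            out.write(buffer, 0, n);
            copied += n;
        }
        return copied;
    }
}
Note that read() itself can still block on a stalled connection, so pair this with connection/read timeouts so that a dead link surfaces as an exception.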
Here's an example of a reusable JavaFX Service that lets you fetch a resource from the internet and save it to your local file-system, with automatic graceful termination if the operation takes too long.
The service task (spawned by createTask()) is the user of the file-channel API.
A separate ScheduledExecutorService is used to handle the time constraint.
Always stick to the good practices for extending Service.
If you choose to use such a high-level method, you won't be able to track the progress of the task.
If the connection becomes unavailable, transferFrom() should eventually return without throwing an exception.
To start the service (may be done from any thread):
DownloadService downloadService = new DownloadService();
downloadService.setRemoteResourceLocation(new URL("http://speedtest.ftp.otenet.gr/files/test1Gb.db"));
downloadService.setPathToLocalResource(Paths.get("C:", "test1Gb.db"));
downloadService.start();
and then to cancel it (otherwise it will be automatically cancelled after the time expires):
downloadService.cancel();
Note that the same service can be reused, just be sure to reset it before starting again:
downloadService.reset();
Here is the DownloadService class:
public class DownloadService extends Service<Void> {
private static final long TIME_BUDGET = 2; // In seconds
private final ScheduledExecutorService watchdogService =
Executors.newSingleThreadScheduledExecutor(new ThreadFactory() {
private final ThreadFactory delegate = Executors.defaultThreadFactory();
@Override
public Thread newThread(Runnable r) {
Thread thread = delegate.newThread(r);
thread.setDaemon(true);
return thread;
}
});
private Future<?> watchdogThread;
private final ObjectProperty<URL> remoteResourceLocation = new SimpleObjectProperty<>();
private final ObjectProperty<Path> pathToLocalResource = new SimpleObjectProperty<>();
public final URL getRemoteResourceLocation() {
return remoteResourceLocation.get();
}
public final void setRemoteResourceLocation(URL remoteResourceLocation) {
this.remoteResourceLocation.set(remoteResourceLocation);
}
public ObjectProperty<URL> remoteResourceLocationProperty() {
return remoteResourceLocation;
}
public final Path getPathToLocalResource() {
return pathToLocalResource.get();
}
public final void setPathToLocalResource(Path pathToLocalResource) {
this.pathToLocalResource.set(pathToLocalResource);
}
public ObjectProperty<Path> pathToLocalResourceProperty() {
return pathToLocalResource;
}
@Override
protected Task<Void> createTask() {
final Path pathToLocalResource = getPathToLocalResource();
final URL remoteResourceLocation = getRemoteResourceLocation();
if (pathToLocalResource == null) {
throw new IllegalStateException("pathToLocalResource property value is null");
}
if (remoteResourceLocation == null) {
throw new IllegalStateException("remoteResourceLocation property value is null");
}
return new Task<Void>() {
@Override
protected Void call() throws IOException {
try (FileChannel channel = FileChannel.open(pathToLocalResource, StandardOpenOption.CREATE,
StandardOpenOption.WRITE)) {
channel.transferFrom(Channels.newChannel(remoteResourceLocation.openStream()), 0, Long.MAX_VALUE);
}
return null;
}
};
}
@Override
protected void running() {
watchdogThread = watchdogService.schedule(() -> {
Platform.runLater(() -> cancel());
}, TIME_BUDGET, TimeUnit.SECONDS);
}
@Override
protected void succeeded() {
watchdogThread.cancel(false);
}
@Override
protected void cancelled() {
watchdogThread.cancel(false);
}
@Override
protected void failed() {
watchdogThread.cancel(false);
}
}
There is one important aspect not covered by the other answers/comments, and that is a wrong assumption of yours:
What I want is it to fail immediately when no internet connection is there.
It is not that easy. The TCP stack/state machine is actually a pretty complicated thing, and depending on your context (OS type, TCP stack implementation, kernel parameters, ...), there can be situations where a network partition takes place and a sender doesn't notice for 15 or more minutes. Listen here for more details on that.
In other words: "just pulling the plug" is in no way equal to "immediately breaking" your existing TCP connection. And just for the record: you don't need to pull cables manually to simulate network outages. In a reasonable test setup, tools such as iptables (i.e. firewalls) can do that for you.
You seem to need an asynchronous/cancellable HTTP GET, which can be tough.
The problem is that if read() stalls waiting for more data (the cable is pulled), it won't quit until either the socket dies or new data comes in.
There are a few paths you could follow: tinkering with socket factories to set a good timeout, using an HTTP client with timeouts, and others.
I would have a look at Apache HttpComponents, which has non-blocking HTTP based on Java NIO sockets.
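For instance, a minimal sketch using the plain java.net timeouts (the URL and the 5-second limits are illustrative):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        URLConnection connection = new URL("https://example.com/file.bin").openConnection();
        connection.setConnectTimeout(5_000); // max time to establish the TCP connection
        connection.setReadTimeout(5_000);    // max pause between two successful reads
        try (InputStream in = connection.getInputStream()) {
            // read/copy as usual; a stalled connection now surfaces as a SocketTimeoutException
        }
    }
}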
New to Java. I would like to use the logger, but with a different file persistence scheme. Instead of rotating files and overwriting them, I would like the logs to be created in a time-based file system hierarchy, where each log file contains the logs of one minute. Example: if a log entry is generated on 2015-03-08 13:05, it is placed in log_05.txt under /home/myUser/logs/2015/03/08/13;
in other words, the file's full path would be /home/myUser/logs/2015/03/08/13/log_05.txt.
Any suggestions?
I ended up implementing a library. Tested on Linux & Windows. It provides the desired file persistence scheme, and allows asynchronous logging. Would appreciate comments.
package com.signin.ems;
/**
* The EMSLogger JAR wraps the Java Logger for two purposes:
* 1. Implement a custom file persistence scheme (other than a single file, or a rotating scheme).
* In particular, the scheme implemented is one minute files, placed in hourly directories.
* The file name format is <mm>.log (mm=00..59), and the directory name format is YYYYMMDD24HH.
*
* 2. Logging should be done asynchronously. For this, a dedicated thread is created. When a message is logged,
* the LogRecord is placed in a BlockingQueue instead of writing the LogRecord to file. The dedicated thread
* performs a blocking wait on the queue. Upon retrieving a LogRecord object, it writes the LogRecord to the
* proper file
*
*
*/
public class EMSLogger
{
private static final int m_iQueSize = 100000;
private static BlockingQueue<LogRecord> m_LogRecordQueue;
private static EMSLoggerThread m_EMSLoggerThread;
private static Thread m_thread;
private static final Logger m_instance = createInstance();
protected EMSLogger()
{
}
public static Logger getInstance() {
return m_instance;
}
private static Logger createInstance()
{
MyFileHandler fileHandler = null;
Logger LOGGER = null;
try
{
// initialize the Log queue
m_LogRecordQueue = new ArrayBlockingQueue<LogRecord>(m_iQueSize);
// get top level logger
LOGGER = Logger.getLogger("");
LOGGER.setLevel(Level.ALL);
// create our file handler
fileHandler = new MyFileHandler(m_LogRecordQueue);
fileHandler.setLevel(Level.ALL);
LOGGER.addHandler(fileHandler);
// create the logging thread
m_EMSLoggerThread = new EMSLoggerThread(m_LogRecordQueue, fileHandler);
m_thread = new Thread(m_EMSLoggerThread);
m_thread.start();
}
catch (IOException e)
{
e.printStackTrace();
}
return LOGGER;
}
public static void Terminate ()
{
m_thread.interrupt();
}
}
public class MyFileHandler extends FileHandler
{
private final BlockingQueue<LogRecord> m_queue;
private BufferedOutputStream m_BufferedOutputStream;
private String m_RootFolderName;
private String m_CurrentDirectoryName;
private String m_CurrentFileName;
private SimpleDateFormat m_SDfh;
private SimpleDateFormat m_SDfm;
public MyFileHandler (BlockingQueue<LogRecord> q) throws IOException, SecurityException
{
super ();
// use simple formatter. Do not use the default XML
super.setFormatter (new SimpleFormatter ());
// get root folder from which to create the log directory hierarchy
m_RootFolderName = System.getProperty ("user.home") + "/logs";
// Service can optionally set its name. All hourly directories will
// be created below the provided name. If no name is given, "Default"
// is used
String sName = System.getProperty ("EMS.ServiceName");
if (sName != null)
{
System.out.println ("EMS.ServiceName = " + sName);
}
else
{
sName = "Default";
System.out.println ("Using \"" + sName + "\" as service name");
}
m_RootFolderName += "/" + sName;
// make sure the root folder is created
new File (m_RootFolderName).mkdirs ();
// initialize format objects
m_SDfh = new SimpleDateFormat ("yyyyMMddHH");
m_SDfm = new SimpleDateFormat ("mm");
m_CurrentDirectoryName = "";
m_CurrentFileName = "";
m_BufferedOutputStream = null;
m_queue = q;
}
// post the record to the queue. Actual writing to the log is done in a dedicated thread
// note that placing in the queue is done without blocking while waiting for available space
@Override
public void publish (LogRecord record)
{
m_queue.offer (record);
}
// check if a new file needs to be created
private void SetCurrentFile ()
{
boolean bChangeFile = false;
Date d = new Date (System.currentTimeMillis());
String newDirectory = m_RootFolderName + "/" + m_SDfh.format(d);
String newFile = m_SDfm.format(d);
if (!newDirectory.equals(m_CurrentDirectoryName))
{
// need to create a new directory and a new file
m_CurrentDirectoryName = newDirectory;
new File(m_CurrentDirectoryName).mkdirs();
bChangeFile = true;
}
if (!newFile.equals(m_CurrentFileName))
{
// need to create a new file
m_CurrentFileName = newFile;
bChangeFile = true;
}
if (bChangeFile)
{
try
{
if (m_BufferedOutputStream != null)
{
m_BufferedOutputStream.close ();
}
System.out.println("Creating File: " + m_CurrentDirectoryName + "/" + m_CurrentFileName + ".log");
m_BufferedOutputStream = new BufferedOutputStream
(new FileOutputStream (m_CurrentDirectoryName + "/" + m_CurrentFileName + ".log", true),2048);
this.setOutputStream(m_BufferedOutputStream);
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
// method _publish is called from the dedicated thread
public void _publish(LogRecord record)
{
// check if a new file needs to be created
SetCurrentFile ();
super.publish(record);
}
}
class EMSLoggerThread implements Runnable
{
private final BlockingQueue<LogRecord> m_queue;
private final MyFileHandler m_MyFileHandler;
// Constructor
EMSLoggerThread(BlockingQueue<LogRecord> q, MyFileHandler fh)
{
m_queue = q;
m_MyFileHandler = fh;
}
public void run()
{
try
{
while (true)
{
m_MyFileHandler._publish(m_queue.take());
}
}
catch (InterruptedException ex)
{
// interrupt is the shutdown signal (see EMSLogger.Terminate()): fall through and exit the thread
}
}
}
This is covered in How to create the log file for each record in a specific format using java util logging framework. You have to modify those examples to create directories since the FileHandler will not create directories. If you are going to create an asynchronous handler, you should follow the advice in Using java.util.logger with a separate thread to write on file.
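A minimal sketch of that advice applied to the scheme asked for here, creating the directory before opening the handler (the paths and patterns follow the example in the question; class and logger names are illustrative):
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.FileHandler;
import java.util.logging.Logger;

public class MinuteFileLogging {
    public static void main(String[] args) throws Exception {
        Date now = new Date();
        // e.g. /home/myUser/logs/2015/03/08/13
        Path dir = Paths.get(System.getProperty("user.home"), "logs",
                new SimpleDateFormat("yyyy/MM/dd/HH").format(now));
        Files.createDirectories(dir); // FileHandler will NOT create missing directories
        // e.g. log_05.txt
        String file = dir.resolve("log_" + new SimpleDateFormat("mm").format(now) + ".txt").toString();
        Logger logger = Logger.getLogger("demo");
        logger.addHandler(new FileHandler(file, true));
        logger.info("hello");
    }
}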
Currently I'm trying to use a SAX parser, but about 3/4 of the way through the file it just completely freezes up. I have tried allocating more memory etc., but I'm not getting any improvement.
Is there any way to speed this up? A better method?
I stripped it to the bare bones, so I now have the following code, and when running it on the command line it still doesn't go as fast as I would like.
Running it with "java -Xms4096m -Xmx8192m -jar reader.jar" I get a GC overhead limit exceeded error around article 700,000.
Main:
public class Read {
public static void main(String[] args) {
ArrayList<Page> pages = XMLManager.getPages();
}
}
XMLManager
public class XMLManager {
public static ArrayList<Page> getPages() {
ArrayList<Page> pages = null;
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
SAXParser parser = factory.newSAXParser();
File file = new File("..\\enwiki-20140811-pages-articles.xml");
PageHandler pageHandler = new PageHandler();
parser.parse(file, pageHandler);
pages = pageHandler.getPages();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return pages;
}
}
PageHandler
public class PageHandler extends DefaultHandler{
private ArrayList<Page> pages = new ArrayList<>();
private Page page;
private StringBuilder stringBuilder;
private boolean idSet = false;
public PageHandler(){
super();
}
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
stringBuilder = new StringBuilder();
if (qName.equals("page")){
page = new Page();
idSet = false;
} else if (qName.equals("redirect")){
if (page != null){
page.setRedirecting(true);
}
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
if (page != null && !page.isRedirecting()){
if (qName.equals("title")){
page.setTitle(stringBuilder.toString());
} else if (qName.equals("id")){
if (!idSet){
page.setId(Integer.parseInt(stringBuilder.toString()));
idSet = true;
}
} else if (qName.equals("text")){
String articleText = stringBuilder.toString();
articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
articleText = articleText.replaceAll("\\n", " "); //remove new lines
articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space
Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
Matcher matcher = pattern.matcher(articleText);
matcher.find();
try {
page.setSummaryText(matcher.group());
} catch (IllegalStateException se){
page.setSummaryText("None");
}
page.setText(articleText);
} else if (qName.equals("page")){
pages.add(page);
page = null;
}
} else {
page = null;
}
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
stringBuilder.append(ch,start, length);
}
public ArrayList<Page> getPages() {
return pages;
}
}
Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.
You need some sort of pipeline to pass the data on to its actual destination without ever storing it all in memory at once.
What I've sometimes done for this sort of situation is similar to the following.
Create an interface for processing a single element:
public interface PageProcessor {
void process(Page page);
}
Supply an implementation of this to the PageHandler through a constructor:
public class Read {
public static void main(String[] args) {
XMLManager.load(new PageProcessor() {
@Override
public void process(Page page) {
// Obviously you want to do something other than just printing,
// but I don't know what that is...
System.out.println(page);
}
}) ;
}
}
public class XMLManager {
public static void load(PageProcessor processor) {
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
SAXParser parser = factory.newSAXParser();
File file = new File("pages-articles.xml");
PageHandler pageHandler = new PageHandler(processor);
parser.parse(file, pageHandler);
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Send data to this processor instead of putting it in the list:
public class PageHandler extends DefaultHandler {
private final PageProcessor processor;
private Page page;
private StringBuilder stringBuilder;
private boolean idSet = false;
public PageHandler(PageProcessor processor) {
this.processor = processor;
}
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//Unchanged from your implementation
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
//Unchanged from your implementation
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
// Elide code not needing change
} else if (qName.equals("page")){
processor.process(page);
page = null;
}
} else {
page = null;
}
}
}
Of course, you can make your interface handle chunks of multiple records rather than just one and have the PageHandler collect pages locally in a smaller list and periodically send the list off for processing and clear the list.
Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.
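For example, a sketch of such a buffering implementation (PageProcessor and Page are the types defined above; handleBatch() is a placeholder for the real destination):
import java.util.ArrayList;
import java.util.List;

public class BatchingPageProcessor implements PageProcessor {
    private final int batchSize;
    private final List<Page> buffer = new ArrayList<>();

    public BatchingPageProcessor(int batchSize) {
        this.batchSize = batchSize;
    }

    @Override
    public void process(Page page) {
        buffer.add(page);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Call once after parsing completes to drain the final partial batch.
    public void flush() {
        if (!buffer.isEmpty()) {
            handleBatch(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    private void handleBatch(List<Page> batch) {
        // e.g. bulk-insert into a database; at most batchSize pages are held here
    }
}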
Don Roby's approach is somewhat reminiscent of the approach I followed when creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically, each complexType has its Java POJO equivalent, and handlers for the particular type are activated when the context changes to that element. I used this approach for SEPA, transaction banking and, for instance, discogs (30GB). You can specify which elements you want to process at runtime, declaratively, using a properties file.
XML2J uses mapping of complexTypes to Java POJOs on the one hand, but lets you specify events you want to listen on.
E.g.
account/@process = true
account/accounts/@process = true
account/accounts/@detach = true
The essence is in the third line. The detach makes sure individual accounts are not added to the accounts list. So it won't overflow.
class AccountType {
private List<AccountType> accounts = new ArrayList<>();
public void addAccount(AccountType tAccount) {
accounts.add(tAccount);
}
// etc.
};
In your code you need to implement the process method (by default the code generator generates an empty method):
class AccountsProcessor implements MessageProcessor {
static private Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);
// assuming Spring data persistency here
final String path = new ClassPathResource("spring-config.xml").getPath();
ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(path);
AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);
@Override
public void process(XMLEvent evt, ComplexDataType data)
throws ProcessorException {
if (evt == XMLEvent.END) {
if( data instanceof AccountType) {
process((AccountType)data);
}
}
}
private void process(AccountType data) {
if (logger.isInfoEnabled()) {
// do some logging
}
repo.save(data);
}
}
Note that XMLEvent.END marks the closing tag of an element. So, when you are processing it, it is complete. If you have to relate it (using a FK) to its parent object in the database, you could process the XMLEvent.BEGIN for the parent, create a placeholder in the database and use its key to store with each of its children. In the final XMLEvent.END you would then update the parent.
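A sketch of that parent/child pattern against the process() signature shown above. Note that savePlaceholder(), saveChild(), updateParent() and the AccountsType parent type are hypothetical, modeled on the earlier example, and are not part of XML2J:
private Long parentKey; // key of the placeholder parent row

@Override
public void process(XMLEvent evt, ComplexDataType data) throws ProcessorException {
    if (evt == XMLEvent.BEGIN && data instanceof AccountsType) {
        parentKey = repo.savePlaceholder();                // hypothetical: insert a parent stub, keep its key
    } else if (evt == XMLEvent.END && data instanceof AccountType) {
        repo.saveChild(parentKey, (AccountType) data);     // hypothetical: store the child with its FK
    } else if (evt == XMLEvent.END && data instanceof AccountsType) {
        repo.updateParent(parentKey, (AccountsType) data); // hypothetical: finalize the parent row
    }
}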
Note that the code generator generates everything you need. You just have to implement that method and of course the DB glue code.
There are samples to get you started. The code generator even generates your POM files, so you can immediately after generation build your project.
The default process method is like this:
@Override
public void process(XMLEvent evt, ComplexDataType data)
throws ProcessorException {
/*
* TODO Auto-generated method stub implement your own handling here.
* Use the runtime configuration file to determine which events are to be sent to the processor.
*/
if (evt == XMLEvent.END) {
data.print( ConsoleWriter.out );
}
}
Downloads:
https://github.com/lolkedijkstra/xml2j-core
https://github.com/lolkedijkstra/xml2j-gen
https://sourceforge.net/projects/xml2j/
First mvn clean install the core (it has to be in the local Maven repo), then the generator. And don't forget to set up the environment variable XML2J_HOME as per the directions in the user manual.