I'm getting the error in the title occasionally from a process that parses lots of XML files.
The files themselves seem OK, and running the process again on the same files that generated the error works just fine.
The exception occurs on a call to XMLReader.parse(InputStream is)
Could this be a bug in the parser (I use Piccolo)? Or is it something about how I open the file stream?
No multithreading is involved.
Piccolo seemed like a good idea at the time, but I don't really have a good excuse for using it. I will try to switch to the default SAX parser and see if that helps.
Update: It didn't help, and I found that Piccolo is considerably faster for some of the workloads, so I went back.
I should probably tell the end of this story: it was a stupid mistake. There were two processes: one that produces XML files and another that reads them. The reader just scans the directory and tries to process every new file it sees.
Every once in a while, the reader would detect a file before the producer was done writing, and so it legitimately raised an exception for "Unexpected end of file". Since we're talking about small files here, this event was pretty rare. By the time I came to check, the producer would have already finished writing the file, so to me it seemed like the parser was complaining for nothing.
I wrote "No multithreading was involved". Obviously this was very misleading.
One solution would be to write the file elsewhere and move it to the monitored folder only after it was done. A better solution would be to use a proper message queue.
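For the first option, a minimal sketch of what the producer side could look like with Java NIO (all paths and file names here are made up for illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Producer {
    public static void main(String[] args) throws IOException {
        // Hypothetical locations: 'staging' is outside the folder the reader scans.
        Path staging = Paths.get("/data/staging/report-42.xml");
        Path monitored = Paths.get("/data/incoming/report-42.xml");

        // Write the complete file outside the monitored folder first.
        Files.write(staging, "<report/>".getBytes());

        // Then move it in; on the same filesystem this is effectively a rename,
        // so the reader never sees a half-written file.
        Files.move(staging, monitored, StandardCopyOption.ATOMIC_MOVE);
    }
}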
I'm experiencing something similar with Piccolo under XMLBeans. After a quick Google, I came across the following post:
XMLBEANS-226 - Exception "Unexpected end of file after null"
The post states that use of the Apache Commons (v1.4 onwards) class org.apache.commons.io.input.AutoCloseInputStream may resolve this exception (not tried it myself, apologies).
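For reference, the wrapping itself is just a decoration of the stream. A rough, untested sketch with a plain SAX reader (the file name is made up, and your XMLBeans parse call would presumably take the wrapped stream in the same place):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.io.input.AutoCloseInputStream;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class ParseWithAutoClose {
    public static void main(String[] args) throws Exception {
        // AutoCloseInputStream closes the underlying stream as soon as EOF is reached.
        InputStream in = new AutoCloseInputStream(new FileInputStream("data.xml")); // hypothetical file
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.parse(new InputSource(in));
    }
}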
Is this a multithreaded scenario? I.e. do you parse more than one file at the same time?
Any particular reason you do not use the default XML parser in the JRE?
Related
I have created a Java process which writes to a plain text file and another Java process which consumes this text file. The 'consumer' reads then deletes the text file. For the sake of simplicity, I do not use file locks (I know it may lead to concurrency problems).
The 'consumer' process runs every 30 minutes from crontab. The 'producer' process currently just redirects whatever it receives from the standard input to the text file. This is just for testing - in the future, the 'producer' process will write the text file by itself.
The 'producer' process opens a FileOutputStream once and keeps writing to the text file using this output stream. The problem is when the 'consumer' deletes the file. Since I'm in a UNIX environment, this situation is handled 'gracefully': the 'producer' keeps working as if nothing happened, since the inode of the file is still valid, but the file can no longer be found in the file system. This thread provides a way to handle this situation using C. Since I'm using Java, which is portable and therefore hides all platform-specific features, I'm not able to use the solution presented there.
Is there a portable way in Java to detect when the file was deleted while the FileOutputStream was still open?
This isn't a robust way for your processes to communicate, and the best I can advise is to stop doing that.
As far as I know there isn't a reliable way for a C program to detect when a file being written is unlinked, let alone a Java program. (The accepted answer you've linked to can only poll the directory entry to see if it's still there; I don't consider this sufficiently robust).
As you've noticed, UNIX doesn't consider it abnormal for an open file to be unlinked (indeed, it's an established practice to create a named tempfile, grab a filehandle, then delete it from the directory so that other processes can't get at it, before reading and writing).
If you must use files, consider having your consumer poll a directory. Have a .../pending/ directory for files in the process of being written and .../inbox/ for files that are ready for processing.
Producer creates a new unique filename (e.g. a UUID) and writes a new file to pending/.
After closing the file, Producer moves the file to inbox/ -- as long as both dirs are on the same filesystem, this will just be a relink, so the file will never be incomplete in inbox/.
Consumer looks for files in inbox/, reads them and deletes when done (a minimal sketch of this side follows below).
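A rough sketch of the consumer side under this scheme (the directory name and the processing step are placeholders):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Consumer {
    public static void main(String[] args) throws IOException {
        Path inbox = Paths.get("/data/inbox"); // hypothetical location
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox)) {
            for (Path file : files) {
                process(file);      // whatever your consumer actually does
                Files.delete(file); // done, remove it from the inbox
            }
        }
    }

    private static void process(Path file) throws IOException {
        // Placeholder: the file is guaranteed to be complete at this point.
        byte[] content = Files.readAllBytes(file);
        System.out.println(file + ": " + content.length + " bytes");
    }
}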
You can enhance this with more directories if there are eventually multiple consumers, but there's no immediate need.
But polling files/directories is always a bit fragile. Consider a database or a message queue.
You can check the filename itself for existence:
import java.nio.file.Files;
import java.nio.file.Paths;

if (!Files.exists(Paths.get("/path/to/file"))) {
    // The consumer has deleted the file.
}
but in any case, shouldn't the consumer be waiting for the producer to finish writing the file before it reads & deletes it? If it did, you wouldn't have this problem.
To solve this the way you're intending to, you might have to look at JNI, which lets you call C/C++ functions from within Java, but this might also require you to write a wrapper library around stat/fstat first (in C/C++).
However, that will cause you major headaches.
This might be a workaround which doesn't require much change to your code right now (I assume).
You can let the producer write to a new file each time it's producing new data. Depending on the amount, you might want to group the data so that the directory won't be flooded with files. For example, one file per minute that contains all data that's been produced so far.
It might also be a good idea to write the files to another directory first and then move them to your consumer's input directory - I'm a bit paranoid here, because there could be some race conditions causing data loss - moving the files only after everything has been written makes sure no data gets lost.
Hope this helps. Good luck :)
In my Ruby on Rails app I have a routine that writes to a file (through a Java application) and then reads the written file.
write_to_file(file.path, data)
read_file(file.path)
Most of the time this works just fine. But sometimes it looks like the file write has not happened, even though there were no errors either. When I retry the routine with the same data, it has worked each time.
I have begun to wonder whether the file write happens asynchronously and the file is actually read before the data is written to disk. Would this be possible?
write_to_file calls a Java application through a socket connection that takes care of the writing. The Java application returns a simple JSON string back to Rails.
This question is really "what does the Java code do?" and is not really a Ruby question. It's not even really a Java question, because the Java language allows (of course) any kind of implementation.
The Java code could certainly be returning before the file is available for reading. We have no idea. It could be posting a request to a queue, and then returning, for example.
The Java code is what you need to look at. If you don't want to bother with that, you could always do something like this:
sleep 0.01 until File.readable?(file.path)
This is a bit crude and there are more elegant ways to do this, but it would work.
You might be experiencing file buffering where small amounts of data aren't written to the file unless it's flushed. I'm not sure what interface you're using here, but the flush method is intended to deal with this exact situation.
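If the Java side is the culprit, the usual fix is to flush, or better close, the stream before reporting success back over the socket. A minimal illustration, assuming the Java application writes through a buffered writer (the method name and path handling here are made up):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileWriterExample {
    public static void writeToFile(String path, String data) throws IOException {
        // try-with-resources closes (and therefore flushes) the writer before
        // the method returns, so the JSON "done" response is only sent once
        // the data is actually visible to the reader.
        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(path), StandardCharsets.UTF_8)) {
            writer.write(data);
        }
    }
}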
I am using Ubuntu (in case it makes a difference) and I am trying to use Camel to send files from one folder to a processor. The problem is that while I am still saving a file into the folder (which takes about 5-10 seconds), Camel picks it up straight away.
To simulate the process I am using gedit with a txt file with ~500k rows, so saving takes some time.
I have tried adding options:
from("file:src/Data/new/?readLock=changed&readLockMinAge=3m")
I have tried using
.filter(header("CamelFileLastModified").isGreaterThan(new Date(System.currentTimeMillis()-120000))) to give a 2 minute delay.
Nothing seems to influence its behaviour, it picks it up straight away, throws an exception because of some checks while processing file and moves it to the Error folder.
I know there is an issue with FTP file transfers which I will have to face later on, but I cannot even get it working on the local file system.
Any help will be appreciated!
SOLVED
from("file:src/Data/new/?readLock=changed&readLockMinAge=3m")
The parameters actually work as they should. I was using Jetty to run the project, and I should have done a whole project clean/install after any amendments.
I had to amend the parameters a bit to:
from("file:src/Data/new/?readLock=changed&readLockTimeout=65000&readLockMinAge=1m")
because it was complaining that readLockTimeout should be more than readLockCheckInterval + readLockMinAge.
Have a look at the documentation:
Avoid reading files currently being written by another application
Beware the JDK File IO API is a bit limited in detecting whether another application is currently writing/copying a file. And the implementation can be different depending on the OS platform as well. This could lead to Camel thinking the file is not locked by another process and starting to consume it. Therefore you have to do your own investigation into what suits your environment. To help with this Camel provides different readLock options and the doneFileName option that you can use. See also the section Consuming files from folders where others drop files directly.
So I think the doneFileName option will solve your problem.
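A minimal sketch of what that could look like (the endpoint after the route is a placeholder; the producer would write e.g. data.csv and then an empty data.csv.done marker once it has finished writing):

from("file:src/Data/new/?doneFileName=${file:name}.done")
    .to("direct:process");

Camel then leaves data.csv alone until the matching .done file shows up.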
I'm getting the following in my console. It's not coming from my code. I've placed several libraries such as WebLogic, sun, javax, in my step filters. Is there a way to tell which library generated this message? I don't really want to step through all of the library classes that I'm using to try to find this.
[Fatal Error] :1:80: The element type "body" must be terminated by the matching end-tag "</body>".
When this happens to me, I typically first reach for Google. These days most error messages have been posted by someone somewhere, although you have to ensure you strip your specifics (in this case the "body" sounds pretty specific) out of it.
If Google doesn't help or just takes too long, I typically end up running a search for the message string over the binaries (jars) in my lib directory. The results aren't pretty, but once I know which jar contains the message, I can zoom in on the details in a prettier format.
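For the record, that brute-force jar scan can be as simple as something like this (the directory name and search string are placeholders; readAllBytes needs Java 9+):

import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class FindStringInJars {
    public static void main(String[] args) throws Exception {
        String needle = "must be terminated by the matching end-tag"; // the message you're hunting
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(Paths.get("lib"), "*.jar")) {
            for (Path jar : jars) {
                try (ZipFile zip = new ZipFile(jar.toFile())) {
                    Enumeration<? extends ZipEntry> entries = zip.entries();
                    while (entries.hasMoreElements()) {
                        ZipEntry entry = entries.nextElement();
                        // ISO-8859-1 maps every byte to a char, so an ASCII message
                        // inside a .class or .properties entry survives the decoding.
                        String content = new String(zip.getInputStream(entry).readAllBytes(),
                                StandardCharsets.ISO_8859_1);
                        if (content.contains(needle)) {
                            System.out.println(jar + " -> " + entry.getName());
                        }
                    }
                }
            }
        }
    }
}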
I have a simple test case that reliably produces an ArrayIndexOutOfBoundsException in jzlib 1.0.7, depending on the data subsequently written to one and the same instance of ZOutputStream.
Stacktrace:
java.lang.ArrayIndexOutOfBoundsException: 587
at com.jcraft.jzlib.Tree.d_code(Tree.java:149)
at com.jcraft.jzlib.Deflate.compress_block(Deflate.java:691)
at com.jcraft.jzlib.Deflate._tr_flush_block(Deflate.java:897)
at com.jcraft.jzlib.Deflate.flush_block_only(Deflate.java:772)
at com.jcraft.jzlib.Deflate.deflate_slow(Deflate.java:1195)
at com.jcraft.jzlib.Deflate.deflate(Deflate.java:1567)
at com.jcraft.jzlib.ZStream.deflate(ZStream.java:133)
at com.jcraft.jzlib.ZOutputStream.write(ZOutputStream.java:102)
at com.jcraft.jzlib.JZLibTestCase.main(JZLibTestCase.java:51)
at JZLibTestCase.main(JZLibTestCase.java:28)
The problem occurs very rarely and depends on the data subsequently written to an open ZOutputStream from jzlib.
Do you have a hint how to fix this? Have you ever heard of this?
As near as I can tell, you might have found a bug in JZlib. While searching around I came across other places that have your post with the attached source and data files. It does not appear that you're doing anything wrong; you should be able to stream any sequence of bytes to ZOutputStream.
Is there a particular reason you're using JZlib? The two main reasons I understand for using it are support for Z_PARTIAL_FLUSH mode and licensing. If you don't need that flush mode and you're using the Oracle JVM, you should be just fine with the included DeflaterOutputStream. Substituting it in your code for ZOutputStream works without an exception.
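For illustration, a drop-in alternative using only the JDK (the output path and payload are made up; no jzlib involved):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

public class DeflateExample {
    public static void main(String[] args) throws IOException {
        byte[] data = "some payload to compress".getBytes();
        // DeflaterOutputStream from java.util.zip produces a zlib-format stream,
        // so it can generally stand in for jzlib's ZOutputStream.
        try (OutputStream out = new DeflaterOutputStream(new FileOutputStream("out.z"))) {
            out.write(data);
        }
    }
}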
I haven't found a concrete reason for using jzlib by asking my co-workers, but there was definitely a bug in java.util.zip at some point in JRE 1.4 on multi-processor systems, although no one has been able to tell me concretely which one. From that time on we have been using jzlib, which has worked well for many years. Most probably that bug is already fixed. Nevertheless, using java.util.zip does work with the same simple test data that jzlib failed on, that's true.