Java process that invokes pkzip takes a long time to zip - java

I must use pkzip in my Java program to zip files (since the standard Java zip routines do not work on the mainframe). It zips correctly; however, it takes a long time to finish. Here is my code:
Runtime myruntime = Runtime.getRuntime();
Process newproc = myruntime.exec("c:\\app\\pkzipc.exe -add c:\\output\\test.zip c:\\doc\\foo.pdf c:\\doc\\bar.doc");
foo.pdf and bar.doc are about 20 MB each. If I run this on the command line, it takes about a second to zip, but when I run it from Java it takes 30 minutes to an hour to finish. Any idea why?

You need to make sure that you are reading from the standard output and error streams of the child process. If pkzip generates output then it will be buffered by the operating system, and if the buffer fills up then you can expect the child process to block until the buffer is cleared.
The Process object has methods for obtaining the input, output and error streams. Create new threads that read from the output and error streams and either pipe them to System.out and System.err, or just discard the output if you don't care about it.
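For example, a minimal sketch of that idea, reusing the command line from the question (the drain helper and class name are just illustrative, not the asker's code):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class PkzipRunner {

    // Drains one of the child's streams so its OS pipe buffer never fills up.
    static Thread drain(final InputStream in) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    BufferedReader r = new BufferedReader(new InputStreamReader(in));
                    String line;
                    while ((line = r.readLine()) != null) {
                        System.out.println(line);   // or simply discard it
                    }
                } catch (IOException ignored) {
                    // the child exited; nothing more to read
                }
            }
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(
                "c:\\app\\pkzipc.exe -add c:\\output\\test.zip c:\\doc\\foo.pdf c:\\doc\\bar.doc");
        drain(p.getInputStream());   // the child's standard output
        drain(p.getErrorStream());   // the child's standard error
        int exitCode = p.waitFor();  // safe to wait now that both streams are being consumed
        System.out.println("pkzipc exited with code " + exitCode);
    }
}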

Try using Java's java.util.zip API for manipulating .zip files; take a look at the tutorial.
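For reference, a minimal java.util.zip sketch (the file names are the ones from the question; whether this API is usable on the asker's mainframe is a separate issue):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class JavaZip {
    public static void main(String[] args) throws IOException {
        ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("c:\\output\\test.zip"));
        addFile(zip, "c:\\doc\\foo.pdf");
        addFile(zip, "c:\\doc\\bar.doc");
        zip.close();
    }

    static void addFile(ZipOutputStream zip, String path) throws IOException {
        zip.putNextEntry(new ZipEntry(new File(path).getName()));   // entry named after the file
        FileInputStream in = new FileInputStream(path);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) > 0) {
            zip.write(buffer, 0, n);
        }
        in.close();
        zip.closeEntry();
    }
}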

Related

Java ProcessBuilder: Input/Output Stream

I want to invoke an external program from Java code, and Google tells me that Runtime or ProcessBuilder can do this. I tried it, but I ran into a problem: the Java program never exits, meaning both the subprocess and the parent process wait forever; they hang or deadlock.
Someone told me the reason is that the subprocess's buffer is too small: it tries to send data back to the parent process, but the parent does not read it in time, so both of them hang. They advised me to fork a thread responsible for reading the subprocess's buffered data. I did as they said, but there were still problems.
Then I closed the output stream obtained from getOutputStream(), and finally the program succeeded. But I don't know why that works. Is there some relationship between the output stream and the input stream?
You have provided very few details in your question, so I can only provide a general answer.
All processes have three standard streams: standard input, standard output and standard error. Standard input is used for reading in data, standard output for writing out data, and standard error for writing out error messages. When you start an external program using Runtime.getRuntime().exec() or ProcessBuilder, Java will create a Process object for the external program, and this Process object will have methods to access these streams.
These streams are accessed as follows:
process.getOutputStream(): returns the standard input of the external program. This is an OutputStream because it is something your Java code writes to.
process.getInputStream(): returns the standard output of the external program. This is an InputStream because it is something your Java code reads from.
process.getErrorStream(): returns the standard error of the external program. This is an InputStream because, like standard output, it is something your Java code reads from.
Note that the names of getInputStream() and getOutputStream() can be confusing.
All streams between your Java code and the external program are buffered. This means each stream has a small amount of memory (a buffer) where the writer can write data that is yet to be read by the reader. The writer does not have to wait for the reader to read its data immediately; it can leave its output in the buffer and continue.
There are two ways in which writing to buffers and reading from them can hang:
attempting to write data to a buffer when there is not enough space left for the data,
attempting to read from an empty buffer.
In the first situation, the writer will wait until space is made in the buffer by reading data out of it. In the second, the reader will wait until data is written into the buffer.
You mention that closing the stream returned by getOutputStream() caused your program to complete successfully. This closes the standard input of the external program, telling it that there will be nothing more for it to read. If your program then completes successfully, this suggests that your program was waiting for more input to come when it was hanging.
It is perhaps arguable that if you do run an external program, you should close its standard input if you don't need to use it, as you have done. This tells the external program that there will be no more input, and so removes the possibility of it being stuck waiting for input. However, it doesn't answer the question of why your external program is waiting for input.
Most of the time, when you run external programs using Runtime.getRuntime().exec() or ProcessBuilder, you don't use the standard input at all. Typically, you pass whatever inputs the external program needs on the command line and then read its output (if it generates any at all).
Does your external program do what you need it to and then get stuck, apparently waiting for input? Do you ever need to send it data to its standard input? If you start a process on Windows using cmd.exe /k ..., the command interpreter will continue even after the program it started has exited. In this case, you should use /c instead of /k.
Finally, I'd like to emphasise that there are two output streams, standard output and standard error. There can be problems if you read from the wrong stream at the wrong time. If you attempt to read from the external program's standard output while its buffer is empty, your Java code will wait for the external program to generate output. However, if your external program is writing a lot of data to its standard error, it could fill that buffer and then find itself waiting for your Java code to make space by reading from it. The end result is that your Java code and the external program are each waiting for the other to do something, i.e. deadlock.
This problem can be eliminated simply by using a ProcessBuilder and ensuring that you call its redirectErrorStream() method with a true value. Calling this method redirects the standard error of the external program into its standard output, so you only have one stream to read from.
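As a rough sketch of that approach (the command here is only a placeholder; error handling is omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class RunExternal {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("some-command", "arg1");
        pb.redirectErrorStream(true);              // merge standard error into standard output
        Process process = pb.start();
        process.getOutputStream().close();         // nothing to send to its standard input

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);              // a single merged stream to read
        }
        int exitCode = process.waitFor();
        System.out.println("Exit code: " + exitCode);
    }
}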

IO : Writing and reading to the same text file from a C++ program and from another Java program simultaneously?

Is it possible to read and write to the same text file with both a C++ application and a java application at the same time without writing conflicting lines / characters to it ? I have tested with two java applications for now, and it seems like it's possible to write to the file from one process even if the other process has opened the stream, but not closed it. Are there any way to lock the file so that the other process needs to wait ?
I think yes; for example, boost::interprocess (http://www.boost.org/doc/libs/1_50_0/doc/html/interprocess.html) provides file locks (http://www.boost.org/doc/libs/1_50_0/doc/html/interprocess/synchronization_mechanisms.html#interprocess.synchronization_mechanisms.file_lock).
For two processes that are writing to the same file, as long as you flush your output buffers on line boundaries (i.e., flush after you write a newline character sequence), the data written to the file should be interleaved nicely.
If one process is writing while another is reading from the same file, you just have to ensure that the reads don't get ahead of the writes. If a read gets an end-of-file condition (or worse, a partial data line), then you know that the reading process must wait until the writing process has finished writing another chunk of data to the file.
If you need more complicated read/write control, you should consider some kind of locking mechanism.
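On the Java side, one hedged sketch of such a locking mechanism uses java.nio's FileLock (the file name is made up, and note that file locks are advisory, so the C++ side must cooperate, e.g. via the boost::interprocess file_lock mentioned above):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class SharedFileWriter {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("shared.txt", "rw");
        FileChannel channel = raf.getChannel();

        // Blocks until the exclusive lock is available, so the other process waits its turn.
        FileLock lock = channel.lock();
        try {
            channel.position(channel.size());                          // append to the end
            channel.write(ByteBuffer.wrap("a line from Java\n".getBytes()));
        } finally {
            lock.release();                                            // let the other process in
            raf.close();
        }
    }
}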

Splitting text file without reading it

Is there any way to split a text file in Java without reading it?
I want to process a large text file, several GB in size, so I want to split it into small parts, run a thread over each part, and combine the results.
Since I will be reading it in small parts anyway, splitting the file by reading it doesn't make sense: I would have to read the same file twice, which would degrade performance.
Your threading approach is ill-conceived. If you have to do significant processing of your file data, consider the following threading structure:
1 Reader Thread (reads the file and feeds the workers)
Queue with read chunks
1..n Worker Threads (n depends on your CPU cores; they process the data chunks from the reader thread)
Queue or dictionary with processed chunks
1 Writer Thread (writes the results to some file)
Maybe you could combine the reader and writer threads into one, because it doesn't make much sense to parallelize I/O on the same physical hard disk.
Clearly you need some synchronization between the threads; for the queues in particular, think about semaphores. A rough sketch of this structure follows.
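Here is a rough sketch of that structure, assuming a line-oriented file and using a BlockingQueue, which provides the queue synchronization the semaphores would otherwise handle (the file name, chunk granularity and end-of-input marker are all assumptions; the writer thread is omitted):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    // Distinct object used as an end-of-input marker; compared by identity below.
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> chunks = new ArrayBlockingQueue<String>(1000);
        int workers = Runtime.getRuntime().availableProcessors();

        // 1 reader thread: feeds the workers line by line.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    BufferedReader in = new BufferedReader(new FileReader("big.txt"));
                    String line;
                    while ((line = in.readLine()) != null) {
                        chunks.put(line);
                    }
                    in.close();
                    chunks.put(POISON);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        reader.start();

        // n worker threads: take chunks off the queue and process them.
        for (int i = 0; i < workers; i++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            String chunk = chunks.take();
                            if (chunk == POISON) {
                                chunks.put(POISON);   // put it back so the other workers stop too
                                return;
                            }
                            // ... process the chunk and hand the result to a writer thread ...
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).start();
        }
    }
}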
Without reading the content of the file you can't do that; it is simply not possible.
I don't think this is possible for the following reasons:
How do you write a file without "reading" it?
You'll need to read in the text to know where a character boundary is (the encoding is not necessarily 1 byte). This means that you cannot treat the file as binary.
Is it really not possible to read line by line and process it like that? That also saves the additional space that the split files would take up alongside the original. For your reference, reading a text file is simply:
public static void loadFileFromInputStream(InputStream in) throws IOException {
    BufferedReader inputStream = new BufferedReader(new InputStreamReader(in));
    String record = inputStream.readLine();
    while (record != null) {
        // do something with the record
        // ...
        record = inputStream.readLine();
    }
}
You're only reading one line at a time... so the size of the file does not impact performance at all. You can also stop anytime you have to. If you're adventurous you can also add the lines to separate threads to speed up processing. That way, IO can continue churning along while you process your data.
Good luck! If, for some reason, you do find a solution, please post it here. Thanks!
Technically speaking, it can't be done without reading the file. But you also don't need to keep the entire file contents in memory to do the splitting. Just open a stream to the file and write out to other files, redirecting the output to a new file after a certain number of bytes have been written to the current one. This way you never need to keep more than one byte of file data in memory at any given time, although using a larger buffer, around 8 or 16 kB, will dramatically increase performance.
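For instance, a hedged sketch of that byte-copying approach (the file names and part size are placeholders, and it splits at arbitrary byte offsets rather than line boundaries):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class ByteSplitter {
    public static void main(String[] args) throws IOException {
        final long partSize = 64L * 1024 * 1024;     // 64 MB per part, for example
        byte[] buffer = new byte[16 * 1024];         // 16 kB copy buffer

        BufferedInputStream in = new BufferedInputStream(new FileInputStream("big.txt"));
        int part = 0;
        long writtenToPart = 0;
        BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("part-" + part));
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            writtenToPart += read;
            if (writtenToPart >= partSize) {          // start the next part file
                out.close();
                part++;
                writtenToPart = 0;
                out = new BufferedOutputStream(new FileOutputStream("part-" + part));
            }
        }
        out.close();
        in.close();
    }
}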
Something has to read your file in order to split it (and you probably want to split it at line boundaries, not at some multiple of kilobytes).
If you are running on a Linux machine, you could delegate the splitting to an external command like csplit. Your Java program would then simply run a csplit yourbigfile.txt command.
In the literal sense no. To literally split a file into smaller files, you have to read the large one and write the smaller ones.
However, I think you really want to know if you can have different threads sequentially reading different "parts" of a file at the same time. And the answer is that you can do that. Just have each thread create its own RandomAccessFile object for the file, seek to the relevant place, and start reading.
(A FileInputStream would probably work too, though I don't think the Java API spec guarantees that skip is implemented using an OS-level "seek" operation on the file.)
There are a couple of possible complications:
If the file is text, you presumably want each thread to start processing at the start of some line in the file. So each thread has to start by finding the end of a line, and make sure that it reads to the end of the last line in its "part".
If the file uses a variable width character encoding (e.g. UTF-8), then you need to deal with the case where your partition boundaries fall in the middle of a character.
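As a rough sketch of the seek-and-align idea described above (the file name and chunk arithmetic are assumptions, and it sidesteps the variable-width encoding caveat by relying on RandomAccessFile.readLine, which treats each byte as one character):

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkReader implements Runnable {
    private final String path;
    private final long start;
    private final long end;   // exclusive; a line that starts before this point is read to its end

    ChunkReader(String path, long start, long end) {
        this.path = path;
        this.start = start;
        this.end = end;
    }

    public void run() {
        try {
            RandomAccessFile file = new RandomAccessFile(path, "r");
            // Back up one byte and discard a line so we begin at the first line
            // that starts inside our chunk; the previous chunk finishes the partial one.
            file.seek(start == 0 ? 0 : start - 1);
            if (start != 0) {
                file.readLine();
            }
            String line;
            while (file.getFilePointer() < end && (line = file.readLine()) != null) {
                // ... process the line ...
            }
            file.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws IOException {
        String path = "big.txt";
        int threads = 4;
        RandomAccessFile probe = new RandomAccessFile(path, "r");
        long length = probe.length();
        probe.close();
        long chunk = length / threads;
        for (int i = 0; i < threads; i++) {
            long chunkStart = i * chunk;
            long chunkEnd = (i == threads - 1) ? length : (i + 1) * chunk;
            new Thread(new ChunkReader(path, chunkStart, chunkEnd)).start();
        }
    }
}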

Sending Large files as stream to process.getOutputStream

I am using the gzip utilities on a Windows machine. I compressed a file and stored it in the DB as a blob. When I want to decompress this file using the gzip utility, I write the byte stream to process.getOutputStream. But after about 30 KB, it is unable to read the file; it just hangs there.
I have tried memory arguments and read/flush logic, but if I write the same data to a file instead, it is pretty fast.
OutputStream stdin = proc.getOutputStream();
Blob blob = Hibernate.createBlob(inputFileReader);
InputStream source = blob.getBinaryStream();
byte[] buffer = new byte[256];
long readBufferCount = 0;
int bytesRead;
while ((bytesRead = source.read(buffer)) > 0)
{
    stdin.write(buffer, 0, bytesRead);   // write only the bytes actually read
    stdin.flush();
    readBufferCount += bytesRead;
    log.info("Reading the file - Read bytes: " + readBufferCount);
}
stdin.flush();
Regards,
Mani Kumar Adari.
I suspect that the problem is that the external process (connected to proc) is either
not reading its standard input, or
it is writing stuff to its standard output that your Java application is not reading.
Bear in mind that Java talks to the external process using a pair of "pipes", and these have a limited amount of buffering. If you exceed the buffering capacity of a pipe, the writer process will be blocked writing to the pipe until the reader process has read enough data from the pipe to make space. If the reader doesn't read, then the pipeline locks up.
If you provided more context (e.g. the part of the application that launches the gzip process) I'd be able to be more definitive.
FOLLOWUP
gzip.exe is a Unix utility that we are using on Windows. gzip.exe works fine from the command prompt, but not from the Java program. Is there any way to increase the size of the buffer that Java writes to for a pipe? I am concerned about the input part at present.
On UNIX, the gzip utility is typically used one of two ways:
gzip file compresses file turning it into file.gz.
... | gzip | ... (or something similar) which writes a compressed version of its standard input to its standard output.
I suspect that you are doing the equivalent of the latter, with the Java application acting as both the source of the gzip command's input and the destination of its output. And this is precisely the scenario that can lock up ... if the Java application is not implemented correctly. For instance:
Process proc = Runtime.getRuntime().exec(...); // gzip.exe pathname
OutputStream out = proc.getOutputStream();
while (...) {
    out.write(...);
}
out.flush();
InputStream in = proc.getInputStream();
while (...) {
    in.read(...);
}
If the write phase of the application above writes too much data, it is guaranteed to lock up.
Communication between the java application and gzip is via two pipes. As I stated above, a pipe will buffer a certain amount of data, but that amount is relatively small, and certainly bounded. This is the cause of the lockup. Here is what happens:
The gzip process is created with a pair of pipes connecting it to the Java application process.
The Java application writes data to its out stream.
The gzip process reads that data from its standard input, compresses it and writes it to its standard output.
Steps 2 and 3 are repeated a few times, until finally the gzip process's attempt to write to its standard output blocks.
What has been happening is that gzip has been writing into its output pipe, but nothing has been reading from it. Eventually, we reach the point where we've exhausted the output pipe's buffer capacity, and the write to the pipe blocks.
Meanwhile, the Java application is still writing to the out Stream, and after a couple more rounds, this too blocks because we've filled the other pipe.
The only solution is for the Java application to read and write at the same time. The simple way to do this is to create a second thread and do the writing to the external process from one thread and the reading from the process in the other one.
(Changing the Java buffering or the Java read / write sizes won't help. The buffering that matters is in the OS implementations of the pipes, and there's no way to change that from pure Java, if at all.)
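For instance, a minimal sketch of that two-thread structure (the gzip command line and the blob-stream helper are placeholders, not the original application's code):

import java.io.InputStream;
import java.io.OutputStream;

public class GzipPump {
    public static void main(String[] args) throws Exception {
        final Process proc = Runtime.getRuntime().exec("gzip.exe -d");

        // Writer thread: feeds the compressed data to gzip's standard input.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    OutputStream stdin = proc.getOutputStream();
                    InputStream source = openBlobStream();      // however you obtain the blob stream
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = source.read(buffer)) > 0) {
                        stdin.write(buffer, 0, n);
                    }
                    stdin.close();                               // tell gzip there is no more input
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        writer.start();

        // Main thread: reads gzip's standard output at the same time.
        InputStream stdout = proc.getInputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = stdout.read(buffer)) > 0) {
            // ... write the decompressed bytes wherever they need to go ...
        }
        proc.waitFor();
    }

    // Placeholder: in the original code this came from Hibernate's Blob.getBinaryStream().
    private static InputStream openBlobStream() {
        throw new UnsupportedOperationException("supply the blob's InputStream here");
    }
}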

The best way to monitor output of process along with its execution

I have started a process from my Java code. This process takes a very long time to run and may generate some output from time to time. I need to react to each piece of output as soon as it is generated; what is the best way to do this?
What kind of reaction are you talking about? Is the process writing to its standard output and/or standard error? If so, I suspect Process.getInputStream and Process.getErrorStream are what you're looking for. Read from both of those and react accordingly. Note that you may want to read from them in different threads, to prevent the individual buffer for either stream from filling up.
Alternatively, if you don't need the two separately, set redirectErrorStream in ProcessBuilder to true, so the error and output streams are merged.
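A rough sketch of reacting to each line as it appears, using the merged-stream variant (the command and the reaction are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ProcessMonitor {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("long-running-command");
        pb.redirectErrorStream(true);     // one merged stream, if you don't need them separately
        Process process = pb.start();

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            // React to each line the moment it is produced.
            System.out.println("got: " + line);
        }
        System.out.println("process finished with exit code " + process.waitFor());
    }
}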
You should start a thread which reads from Process.getInputStream() and getErrorStream() (or alternatively use ProcessBuilder.redirectErrorStream(true)) and handle whatever shows up in the stream. There are many ways to handle it; the right way depends on how the data is being used. Please give more details.
Here is one real-life example: SbtRunner uses ProcessRunner to send commands to a command line application and wait for the command to finish execution (the application will print "> " when a command finishes execution). There is some indirection happening to make it easier to read from the process' output (the output is written to a MulticastPipe from where it is then read by an OutputReader).
