Why is InputStream.available() so time consuming? - java

I have implemented my own class to read pcap files. (Binary files, i.e. tcpdump, wireshark)
public class PcapReader implements Iterator<PcapPacket> {
    private InputStream is;

    public PcapReader(File file) throws FileNotFoundException, IOException {
        is = new DataInputStream(
                new BufferedInputStream(
                        new FileInputStream(file)));
    }

    @Override
    public boolean hasNext() {
        try {
            return (is.available() > 0);
        } catch (IOException e) {
            return false;
        }
    }

    // pseudo code!
    @Override
    public PcapPacket next() {
        is.read(header);
        is.read(body);
        return new PcapPacket(header, body);
    }
    // more code here
}
Then I use it like this:
PcapReader reader = new PcapReader(file);
while (reader.hasNext()) {
PcapPacket pcapPacket = reader.next();
//process packet
}
The file under test is 190 MB, and I use JVisualVM to profile.
hasNext() is called 1.7 million times and takes 7.7 seconds.
next() is called the same number of times and takes 3.6 seconds.
My main question is why hasNext() is so time consuming in absolute terms, and why it takes twice as long as next().

When you call is.available() in your hasNext() method, the call goes down to the FileInputStream.available() implementation. This is a native method, as one can see from the FileInputStream source code.
In the end, this is indeed a time-consuming operation, because the operating system has to be asked on every call whether more data is available to be read. So each hasNext() pays for a round trip into native code and the operating system just to check whether there is a "next" byte, whereas most read() calls are served from the BufferedInputStream's in-memory buffer.

I'm sure that the internal (native) implementation of available() is not something as simple as return availableSize; but something more complicated. The stream determines the amount of available data through the OS API on each call; this matters, for example, for log files, which may be written to while the stream is reading them.

I have implemented my own class to read pcap files.
Because you're not using jNetPcap, or because you are using jNetPcap but need something that can read from a File?
If the latter, you probably want to use a pattern other than one that has a "more data is available" method and a separate "so read that data" method; something that reads the data and either returns a "packet available"/"end of file"/"error" indication or throws an exception for one or both of the latter conditions (DataInputStream appears to throw exceptions for both I/O errors and EOF, so it might make sense to do the same for your class).
Yeah, that means it can't be an Iterator, but maybe Iterators weren't originally intended to represent records in a sequential file (besides, if you really want it to be an Iterator, what are you going to do about the remove method?).
And if you can avoid needing to read from a File, you could then use jNetPcap's own routines for reading capture files, which, in libpcap 1.1.0 and later, can also read some pcap-ng files.
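The "read and report end of file" pattern described above could look roughly like the following. This is only a sketch under stated assumptions: the class name SimplePcapReader is made up for illustration, it assumes the classic pcap layout (24-byte global header, 16-byte record header with the captured length at offset 8) written in little-endian byte order, and it returns the raw packet bytes instead of the question's PcapPacket type. It never calls available().

```java
import java.io.BufferedInputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical reader: one method both detects end of file and reads the record.
class SimplePcapReader implements Closeable {
    private static final int GLOBAL_HEADER_LENGTH = 24; // assumed classic pcap layout
    private static final int RECORD_HEADER_LENGTH = 16;

    private final DataInputStream in;

    SimplePcapReader(InputStream source) throws IOException {
        in = new DataInputStream(new BufferedInputStream(source));
        // Skip the global file header; a real reader would check the magic number here.
        in.readFully(new byte[GLOBAL_HEADER_LENGTH]);
    }

    /** Returns the next packet body, or null at a clean end of file. */
    byte[] nextPacket() throws IOException {
        int first = in.read(); // served from the buffer in the common case
        if (first < 0) {
            return null; // no more records
        }
        byte[] header = new byte[RECORD_HEADER_LENGTH];
        header[0] = (byte) first;
        // readFully throws EOFException if the file ends in the middle of a record.
        in.readFully(header, 1, RECORD_HEADER_LENGTH - 1);
        // Assumption: little-endian file, captured length stored at offset 8.
        int length = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt(8);
        if (length < 0) {
            throw new IOException("Unsupported record length: " + length);
        }
        byte[] body = new byte[length];
        in.readFully(body);
        return body;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The calling loop then becomes: byte[] packet; while ((packet = reader.nextPacket()) != null) { /* process packet */ }. A truncated file surfaces as an EOFException instead of being silently reported as "no more packets".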

Related

Usefulness of DELETE_ON_CLOSE

There are many examples on the internet showing how to use StandardOpenOption.DELETE_ON_CLOSE, such as this:
Files.write(myTempFile, ..., StandardOpenOption.DELETE_ON_CLOSE);
Other examples similarly use Files.newOutputStream(..., StandardOpenOption.DELETE_ON_CLOSE).
I suspect all of these examples are probably flawed. The purpose of writing a file is that you're going to read it back at some point; otherwise, why bother writing it? But wouldn't DELETE_ON_CLOSE cause the file to be deleted before you have a chance to read it?
If you create a work file (to work with large amounts of data that are too large to keep in memory) then wouldn't you use RandomAccessFile instead, which allows both read and write access? However, RandomAccessFile doesn't give you the option to specify DELETE_ON_CLOSE, as far as I can see.
So can someone show me how DELETE_ON_CLOSE is actually useful?
First of all, I agree with you: in the example Files.write(myTempFile, ..., StandardOpenOption.DELETE_ON_CLOSE), the use of DELETE_ON_CLOSE is meaningless. After a (not so intense) search of the internet, the only example I could find that shows this usage was the one you might have got it from (http://softwarecave.org/2014/02/05/create-temporary-files-and-directories-using-java-nio2/).
This option is not intended to be used with Files.write(...) only. The API documentation makes this quite clear:
This option is primarily intended for use with work files that are used solely by a single instance of the Java virtual machine. This option is not recommended for use when opening files that are open concurrently by other entities.
Sorry, I can't give you a meaningful short example, but think of such a file as something like the swap file or partition used by an operating system: the current JVM needs to store data on disk temporarily, and after shutdown the data is of no use anymore. As a practical example, a JEE application server might decide to serialize some entities to disk to free up memory.
Edit: Maybe the following (oversimplified) code can serve as an example to demonstrate the principle. (So please, nobody start a discussion about how this "data management" could be done differently, that using a fixed temporary filename is bad, and so on.)
- In the try-with-resources block you need, for some reason, to externalize data (the reasons are not the subject of this discussion).
- You have random read/write access to this externalized data.
- The externalized data is of use only inside the try-with-resources block.
- With the StandardOpenOption.DELETE_ON_CLOSE option you don't need to handle the deletion after use yourself; the JVM will take care of it (the limitations and edge cases are described in the API documentation).
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.EnumSet;
import java.util.Random;

// class name chosen only so that the example compiles on its own
public class ExternalDataExample {
    static final int RECORD_LENGTH = 20;
    static final String RECORD_FORMAT = "%-" + RECORD_LENGTH + "s";

    // add exception handling, left out only for the example
    public static void main(String[] args) throws Exception {
        EnumSet<StandardOpenOption> options = EnumSet.of(
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.READ,
                StandardOpenOption.DELETE_ON_CLOSE
        );
        Path file = Paths.get("/tmp/enternal_data.tmp");
        try (SeekableByteChannel sbc = Files.newByteChannel(file, options)) {
            // during your business processing the below two cases might happen
            // several times in random order

            // example of huge datastructure to externalize
            String[] sampleData = {"some", "huge", "datastructure"};
            for (int i = 0; i < sampleData.length; i++) {
                // fixed-width ASCII records, so every record is RECORD_LENGTH bytes
                byte[] buffer = String.format(RECORD_FORMAT, sampleData[i])
                        .getBytes(StandardCharsets.US_ASCII);
                ByteBuffer byteBuffer = ByteBuffer.wrap(buffer);
                sbc.position(i * RECORD_LENGTH);
                sbc.write(byteBuffer);
            }

            // example of processing which needs the externalized data
            Random random = new Random();
            byte[] buffer = new byte[RECORD_LENGTH];
            ByteBuffer byteBuffer = ByteBuffer.wrap(buffer);
            for (int i = 0; i < 10; i++) {
                sbc.position(RECORD_LENGTH * random.nextInt(sampleData.length));
                byteBuffer.clear(); // reset position and limit before each read
                sbc.read(byteBuffer);
                System.out.printf("loop: %d %s%n", i,
                        new String(buffer, StandardCharsets.US_ASCII));
            }
        }
    }
}
DELETE_ON_CLOSE is intended for temporary work files.
If an operation needs to store data temporarily in a file, but the file is not needed outside the current execution, DELETE_ON_CLOSE is a good solution.
One example is when you need to store information that can't be kept in memory, for instance because it is too large.
Another example is when you need to store information temporarily, need it only at a later point, and don't want it to occupy memory in the meantime.
Imagine also a situation in which a process takes a long time to complete. You store information in a file and use it only later (perhaps many minutes or hours afterwards). This guarantees that memory is not used for that information while you don't need it.
DELETE_ON_CLOSE tries to delete the file when you explicitly close it by calling close(), or when the JVM shuts down if it was not closed manually before.
Here are two possible ways it can be used:
1. When calling Files.newByteChannel
This method returns a SeekableByteChannel suitable for both reading and writing, in which the current position can be modified.
Seems quite useful for situations where some data needs to be stored out of memory for read/write access and doesn't need to be persisted after the application closes.
2. Write to a file, read back, delete:
An example using an arbitrary text file:
Path p = Paths.get("C:\\test", "foo.txt");
System.out.println(Files.exists(p));
try {
    Files.createFile(p);
    System.out.println(Files.exists(p));
    try (BufferedWriter out = Files.newBufferedWriter(p, Charset.defaultCharset(), StandardOpenOption.DELETE_ON_CLOSE)) {
        out.append("Hello, World!");
        out.flush();
        try (BufferedReader in = Files.newBufferedReader(p, Charset.defaultCharset())) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
} catch (IOException ex) {
    ex.printStackTrace();
}
System.out.println(Files.exists(p));
This outputs (as expected):
false
true
Hello, World!
false
This example is obviously trivial, but I imagine there are plenty of situations where such an approach may come in handy.
However, I still believe the old File.deleteOnExit method may be preferable, as you won't need to keep the output stream open for the duration of any read operations on the file.
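For comparison, a minimal sketch of the File.deleteOnExit variant might look like this. The class and method names are made up for illustration, and the temp file prefix is arbitrary. The point is only that the writer is closed before the file is read back, and the file is still cleaned up when the JVM terminates normally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper illustrating write, close, read back, delete at JVM exit.
class DeleteOnExitSketch {
    static List<String> writeThenRead(List<String> lines) throws IOException {
        Path temp = Files.createTempFile("work", ".tmp");
        // Register the file for deletion when the JVM terminates normally.
        temp.toFile().deleteOnExit();
        // Files.write opens, writes, and closes the file before returning.
        Files.write(temp, lines, StandardCharsets.UTF_8);
        // No stream has to stay open while the data is read back.
        return Files.readAllLines(temp, StandardCharsets.UTF_8);
    }
}
```

The trade-off is that the file stays on disk until the JVM exits, which may be a long time for a server process.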

Creating Zip file while client is downloading

I am trying to develop something like Dropbox (a very basic one). Downloading a single file is easy: just use the ServletOutputStream. What I want is this: when the client asks for multiple files, I zip the files on the server side and then send them to the user. But if the files are big, it takes too long to zip them and send them to the user.
Is there any way to send the files while they are being compressed?
Thanks for your help.
Part of the Java API for ZIP files is actually designed to provide "on the fly" compression. It all fits nicely both with the java.io API and the servlet API, which means this is even... kind of easy (no multithreading required - even for performance reasons, because your CPU will usually be faster at zipping than your network will be at sending the contents).
The part you'll be interacting with is ZipOutputStream. It is a FilterOutputStream (which means it is designed to wrap an output stream that already exists; in your case, that would be the response's OutputStream), and it will compress every byte you send it, using ZIP compression.
So, say you have a get request
protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException {
    // Your code to handle the request
    List<YourFileObject> responseFiles = ... // Whatever you need to do
    // We declare that the response will contain raw bytes
    resp.setContentType("application/octet-stream");
    // We open a ZIP output stream
    try (ZipOutputStream zipStream = new ZipOutputStream(resp.getOutputStream())) { // This is Java 7, but not that different from Java 6
        // We need to loop over each file you want to send
        for (YourFileObject fileToSend : responseFiles) {
            // We give a name to the file
            zipStream.putNextEntry(new ZipEntry(fileToSend.getName()));
            // and we copy its content
            copy(fileToSend, zipStream);
        }
    }
}
Of course, you should do proper exception handling. A couple of quick notes, though:
- The ZIP file format mandates that each file has a name, so you must create a new ZipEntry each time you start a new file (writing without a current entry fails with a ZipException anyway).
- Proper use of the API would be to close each entry once you are done writing to it (at the end of the file). But the Java implementation does that for you: each time you call putNextEntry, it closes the previous entry (if need be) all by itself.
- Likewise, you must not forget to close the ZIP stream, because this properly closes the last entry and flushes everything that is needed to create a proper ZIP file. Failure to do so will result in a corrupt file. Here, the try-with-resources statement does this: it closes the ZipOutputStream once everything is written to it.
- The copy method here is just what you would use to transfer all the bytes from the original file to the output stream; there is nothing ZIP-specific about it. Just call outputStream.write(byte[] bytes).
**EDIT : ** to clarify...
For example, given a YourFileType that has the following methods :
public interface YourFileType {
    public byte[] getContent();
    public InputStream getContentAsStream();
}
Then the copy method could look like (this is all very basic Java IO, you could maybe use a library such as commons io to not reinvent the wheel...)
public void copy(YourFileType file, OutputStream os) throws IOException {
    os.write(file.getContent());
}
Or, for a full streaming implementation :
public void copy(YourFileType file, OutputStream os) throws IOException {
    try (InputStream fileContent = file.getContentAsStream()) {
        byte[] buffer = new byte[4096]; // 4096 is kind of a magic number
        int readBytesCount = 0;
        while ((readBytesCount = fileContent.read(buffer)) >= 0) {
            os.write(buffer, 0, readBytesCount);
        }
    }
}
Using this kind of implementation, your client will start receiving a response almost as soon as you start writing to the ZipOutputStream (the only delay would be that of internal buffers), meaning it should not time out (unless you spend too long building the content to send, but that would not be the fault of the zipping part).
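To try the idea outside a servlet container, here is a self-contained sketch of the same loop that writes to any OutputStream. The class name ZipStreamingSketch and the use of a Map from entry name to InputStream are assumptions made for this illustration; in the servlet, the target would be the response's output stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: compresses each source stream straight into the target stream.
class ZipStreamingSketch {
    static void writeZip(Map<String, InputStream> files, OutputStream target) throws IOException {
        // Closing the ZipOutputStream finishes the archive and also closes the target.
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            byte[] buffer = new byte[4096];
            for (Map.Entry<String, InputStream> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                try (InputStream content = file.getValue()) {
                    int n;
                    // Compressed bytes are pushed to the target as the buffers fill up,
                    // so nothing has to be zipped completely before sending starts.
                    while ((n = content.read(buffer)) != -1) {
                        zip.write(buffer, 0, n);
                    }
                }
                zip.closeEntry();
            }
        }
    }
}
```

A map with a stable iteration order (for example a LinkedHashMap) keeps the entries in the archive in the order they were added.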

Rolling file implementation

I have always been curious how a rolling file is implemented in logs.
How would one even start creating a file writing class in any language in order to ensure that the file size is not exceeded?
The only possible solution I can think of is this:
write method:
    size = file size + size of string to write
    if (size > limit)
        close the file writer
        open file reader
        read the file
        close file reader
        open file writer (clears the whole file)
        remove the size from the beginning to accommodate for new string to write
        write the new truncated string
    write the string we received
This seems like a terrible implementation, but I cannot think of anything better.
Specifically I would love to see a solution in java.
EDIT: By "remove the size from the beginning" I mean this: let's say I have a 20-byte string (which is the limit) and I want to write another 3-byte string. I therefore remove 3 bytes from the beginning and am left with the last 17 bytes, and after appending the new string I have 20 bytes.
Because your question made me look into it, here's an example from the logback logging framework. The RollingFileAppender#rollover() method looks like this:
public void rollover() {
    synchronized (lock) {
        // Note: This method needs to be synchronized because it needs exclusive
        // access while it closes and then re-opens the target file.
        //
        // make sure to close the hereto active log file! Renaming under windows
        // does not work for open files
        this.closeOutputStream();
        try {
            rollingPolicy.rollover(); // this actually does the renaming of files
        } catch (RolloverFailure rf) {
            addWarn("RolloverFailure occurred. Deferring roll-over.");
            // we failed to roll-over, let us not truncate and risk data loss
            this.append = true;
        }
        try {
            // update the currentlyActiveFile
            currentlyActiveFile = new File(rollingPolicy.getActiveFileName());
            // This will also close the file. This is OK since multiple
            // close operations are safe.
            // COMMENT MINE this also sets the new OutputStream for the new file
            this.openFile(rollingPolicy.getActiveFileName());
        } catch (IOException e) {
            addError("setFile(" + fileName + ", false) call failed.", e);
        }
    }
}
As you can see, the logic is pretty similar to what you posted. They close the current OutputStream, perform the rollover, then open a new one (openFile()). Obviously, this is all done in a synchronized block since many threads are using the logger, but only one rollover should occur at a time.
A RollingPolicy is a policy on how to perform a rollover and a TriggeringPolicy is when to perform a rollover. With logback, you usually base these policies on file size or time.
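To make the close, rename, reopen sequence concrete, here is a deliberately small sketch of a size-based rolling writer. It is not logback's implementation: the class name SizeRollingWriter, the single ".1" backup file, and the byte limit are all assumptions chosen for illustration. Note that it rolls whole files rather than trimming bytes from the beginning of the current one.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical writer: keeps one active file and one backup ("<name>.1").
class SizeRollingWriter implements AutoCloseable {
    private final Path active;
    private final Path backup;
    private final long limit;   // maximum size of the active file in bytes
    private OutputStream out;
    private long size;          // current size of the active file

    SizeRollingWriter(Path active, long limit) throws IOException {
        this.active = active;
        this.backup = active.resolveSibling(active.getFileName() + ".1");
        this.limit = limit;
        open();
    }

    private void open() throws IOException {
        out = Files.newOutputStream(active, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        size = Files.size(active);
    }

    // Synchronized so that only one thread at a time can trigger a rollover.
    synchronized void write(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Triggering policy: roll when this write would exceed the limit.
        if (size > 0 && size + bytes.length > limit) {
            rollover();
        }
        out.write(bytes);
        out.flush();
        size += bytes.length;
    }

    // Rolling policy: close the active file, rename it, open a fresh one.
    private void rollover() throws IOException {
        out.close(); // must be closed first; renaming open files fails on Windows
        Files.move(active, backup, StandardCopyOption.REPLACE_EXISTING);
        open();
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}
```

With a limit of 10 bytes, writing "12345678" and then "abcdef" leaves "abcdef" in the active file and "12345678" in the backup. A single write larger than the limit still goes into one file, so the limit is approximate.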

Recovering from IOException: network name no longer available

I'm trying to read in a large (700GB) file and incrementally process it, but the network I'm working on will occasionally go down, cutting off access to the file. This throws a java.io.IOException telling me that "The specified network name is no longer available". Is there a way that I can catch this exception and wait for, say, fifteen minutes, and then retry the read, or is the Reader object fried once access to the file is lost?
If the Reader is rendered useless once the connection is lost, is there a way that I can rewrite this in such a way as to allow me to "save my place" and then begin my read from there without having to read and discard all the data before it? Even just munching data without processing it takes a long time when there's 500GB of it to get through.
Currently, the code looks something like this (edited for brevity):
class Processor {
    BufferedReader br;

    Processor(String fname) throws FileNotFoundException {
        br = new BufferedReader(new FileReader(fname));
    }

    void process() {
        try {
            String line;
            while ((line = br.readLine()) != null) {
                // ...code for processing the line goes here...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Thank you for your time.
You can keep track of how much you have read in a variable. For example, here I keep track in a variable called read, and buff is a char[] (so the count is in characters, which is also the unit that skip() takes). Not sure if this is possible using the readLine method.
read += br.read(buff);
Then, if you need to restart, you can reopen the file and skip that many characters:
br.skip(read);
Then you can keep processing away. Good luck.
I doubt that the underlying fd will still be usable after this error, but you would have to try it. More probably you will have to reopen the file and skip to where you were up to.
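One way to "save your place" so that reopening does not require reading and discarding the earlier data is to remember a byte offset and seek to it. The following is a sketch under assumptions, not a tested recipe for network shares: the class name ResumableLineReader and its methods are made up, and it uses RandomAccessFile.readLine, which is unbuffered (slow on very large files) and decodes each byte as a single character, so it only suits single-byte encodings.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.function.Consumer;

// Hypothetical reader that remembers the byte offset of the next unprocessed line.
class ResumableLineReader {
    private final String fileName;
    private long offset; // byte offset just after the last fully processed line

    ResumableLineReader(String fileName) {
        this.fileName = fileName;
    }

    long offset() {
        return offset;
    }

    // Opens the file, seeks to the saved offset, and processes lines until end of file.
    void readRemaining(Consumer<String> handler) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(offset); // jump directly, no read-and-discard
            String line;
            while ((line = raf.readLine()) != null) {
                handler.accept(line);
                // Only advance once the line was handled, so a failure re-reads it.
                offset = raf.getFilePointer();
            }
        }
    }

    // Retries after an IOException by reopening the file at the saved offset.
    void readWithRetry(Consumer<String> handler, int maxRetries, long waitMillis)
            throws IOException, InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                readRemaining(handler);
                return;
            } catch (IOException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                Thread.sleep(waitMillis); // e.g. fifteen minutes in the question's scenario
            }
        }
    }
}
```

Every retry opens a fresh file handle, so nothing depends on the old descriptor surviving the network failure.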

How to find unclosed I/O resources in Java?

Many I/O resources in Java such as InputStream and OutputStream need to be closed when they are finished with, as discussed here.
How can I search my project for places where such resources are not being closed, e.g. this kind of error:
private void readFile(File file) throws IOException {
    InputStream in = new FileInputStream(file);
    int nextByte = in.read();
    while (nextByte != -1) {
        // Do something with the byte here
        // ...
        // Read the next byte
        nextByte = in.read();
    }
    // Oops! Not closing the InputStream
}
I've tried some static analysis tools such as PMD and FindBugs, but they don't flag the above code as being wrong.
It's probably a matter of settings: I ran FindBugs through my IDE plugin and it reported OS_OPEN_STREAM.
If FindBugs with modified rules doesn't work for you, another slower approach is heap analysis. VisualVM allows you to query all objects of a specific type that are open at any given time within a heap dump using OQL. You could then check for streams open to files that shouldn't be accessed at that point in the program.
Running it is as simple as:
%>jvisualvm
Choose the running process. Choose option save heap dump (or something to that effect), open the heap dump and look at class instances for file streams in the browser, or query for them.
Java 7 added a feature for using closeable resources within a scope (the so-called try-with-resources statement), such as:
public void someMethod() throws IOException {
    try (InputStream is = new FileInputStream(file)) {
        // do something here
    } // the stream is closed here
}
In older versions, the common technique is a try-catch-finally chain.
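For reference, the pre-Java 7 pattern might be sketched as follows. The class and method names are made up for illustration; the essential part is that close() is called in the finally block, so the stream is released whether or not reading throws.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical example of closing a stream without try-with-resources.
class LegacyCloseSketch {
    static int countBytes(File file) throws IOException {
        InputStream in = null;
        try {
            in = new FileInputStream(file);
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } finally {
            // Runs on both normal return and exception.
            if (in != null) {
                try {
                    in.close();
                } catch (IOException ignored) {
                    // Nothing useful can be done if closing an input stream fails.
                }
            }
        }
    }
}
```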
