I'm trying to read in a large (700GB) file and incrementally process it, but the network I'm working on will occasionally go down, cutting off access to the file. This throws a java.io.IOException telling me that "The specified network name is no longer available". Is there a way that I can catch this exception and wait for, say, fifteen minutes, and then retry the read, or is the Reader object fried once access to the file is lost?
If the Reader is rendered useless once the connection is lost, is there a way that I can rewrite this in such a way as to allow me to "save my place" and then begin my read from there without having to read and discard all the data before it? Even just munching data without processing it takes a long time when there's 500GB of it to get through.
Currently, the code looks something like this (edited for brevity):
class Processor {
    BufferedReader br;

    Processor(String fname) throws FileNotFoundException {
        br = new BufferedReader(new FileReader(fname)); // pass the variable, not the literal "fname"
    }

    void process() {
        try {
            String line;
            while ((line = br.readLine()) != null) {
                // ...code for processing the line goes here...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Thank you for your time.
You can keep track of the number of characters read in a variable. For example, here I keep track in a variable called read, and buff is a char[]. I'm not sure if this is possible using the readLine method.
read+=br.read(buff);
Then, if you need to restart, you can skip that many characters:
br.skip(read);
Then you can keep processing away. Good luck
I doubt that the underlying fd will still be usable after this error, but you would have to try it. More probably you will have to reopen the file and skip to where you were up to.
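For illustration, here is a rough sketch of that reopen-and-skip approach (this is not the original code; the fifteen-minute wait and the character counting are assumptions). It tracks how many characters have been consumed and, after an IOException, waits, reopens the file, and skips back to that offset. Counting characters around readLine() is only approximate because line terminators are stripped; the +1 below assumes "\n" endings.

import java.io.*;

class ResumableProcessor {
    private static final long RETRY_WAIT_MS = 15 * 60 * 1000L; // fifteen minutes, as in the question
    private final String fname;
    private long charsRead = 0;                // position reached before the last failure

    ResumableProcessor(String fname) {
        this.fname = fname;
    }

    void process() throws InterruptedException {
        while (true) {
            try (BufferedReader br = new BufferedReader(new FileReader(fname))) {
                br.skip(charsRead);            // jump past everything already processed
                String line;
                while ((line = br.readLine()) != null) {
                    // ...code for processing the line goes here...
                    charsRead += line.length() + 1;   // +1 for the terminator; assumes "\n"
                }
                return;                        // reached EOF normally
            } catch (IOException e) {
                e.printStackTrace();           // e.g. "The specified network name is no longer available"
                Thread.sleep(RETRY_WAIT_MS);   // wait, then retry from charsRead
            }
        }
    }
}

The skip still has to be honored by the underlying stream, so resuming is not instantaneous, but it avoids re-running the processing code over data you have already handled.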
I am trying to read continuously from a named pipe using Java. This question answers it for Python/bash.
public class PipeProducer {
    private BufferedReader pipeReader;

    public PipeProducer(String namedPipe) throws IOException {
        this.pipeReader = new BufferedReader(new FileReader(new File(namedPipe)));
    }

    public void process() throws IOException {
        String msg;
        while ((msg = this.pipeReader.readLine()) != null) {
            //Process
        }
    }

    public static void main(String[] args) throws IOException {
        PipeProducer p = new PipeProducer("/tmp/testpipe");
        while (true) {
            p.process();
            System.out.println("Encountered EOF");
            Date now = new Date();
            System.out.println("End : " + now);
        }
    }
}
Questions
What happens if there is no data from the pipe for some time?
Can the Reader object be reused when EOF is encountered?
Is EOF sent by the pipe only when it terminates, and not otherwise?
Does the pipe guarantee to be alive and working unless something really goes wrong?
Environment is CentOS 6.7 with Java 7
This is tested and works fine, but corner cases need to be handled so that continuous operation is ensured.
What happens if there is no data from the pipe for some time?
The program blocks until there is data to read or until EOF is detected, just like a Reader connected to any other kind of file.
Can the Reader object be reused when EOF is encountered?
I wouldn't count on it. It would be safer to close the Reader and create a new one in that case.
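As a rough sketch of that (reusing the /tmp/testpipe path from the question; the class name is made up), close the Reader when readLine() returns null and open a new one, which will block until another writer connects:

import java.io.*;

public class PipeConsumer {
    public static void main(String[] args) {
        while (true) {
            try (BufferedReader pipeReader =
                     new BufferedReader(new FileReader("/tmp/testpipe"))) {
                String msg;
                while ((msg = pipeReader.readLine()) != null) {
                    // process msg
                }
                System.out.println("Encountered EOF, reopening pipe");
            } catch (IOException e) {
                e.printStackTrace();
            }
            // the next iteration reopens the FIFO; the open blocks until a writer appears
        }
    }
}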
Is EOF sent by the pipe only when it terminates, and not otherwise?
On Unix, EOF will be received from the pipe after it goes from having one writer to having zero, when no more data are available from it. I am uncertain whether Windows named pipe semantics differ in this regard, but since you're on Linux, that doesn't matter to you.
Does the pipe guarantee to be alive and working unless something really goes wrong?
If the named pipe in fact exists on the file system and you have sufficient permission, then you should reliably be able to open it for reading, but that may block until there is at least one writer. Other than that, I'm not sure what you mean.
What happens if there is no data from the pipe for some time?
Nothing. It blocks.
Can the Reader object be reused when EOF is encountered?
Reused for what? It's got to the end of the data. The question does not arise.
Is EOF sent by the pipe only when it terminates, and not otherwise?
It is sent when the peer closes its end of the pipe.
Does the pipe guarantee to be alive and working unless something really goes wrong?
Nothing is guaranteed, in pipes or in life, but in the absence of an error you should continue to read any data that is sent.
I have a timing problem.
I am currently developing an app in Java where I have to build a network analyzer.
For that I use JPCAP to capture all the packets and write them to a file, and from there I will bulk-insert them into a DB.
The problem is that when I write the entire object to the file, like this,
UDPPacket udpPacket = (UDPPacket) packet;
wtf.writeToFile("packets.txt", udpPacket + "\n");
everything is working nice and smooth, but when I try to write like this
String str=""+udpPacket.src_ip+" "+udpPacket.dst_ip+""
+udpPacket.src_port+" "+udpPacket.dst_port+" "+udpPacket.protocol +
" Wi-fi "+udpPacket.dst_ip.getCanonicalHostName()+"\n";
wtf.writeToFile("packets.txt",str +"\n");
writing to the file takes a lot more time.
The function that writes to the file is this:
public void writeToFile(String name, String str) {
    try {
        PrintWriter writer = new PrintWriter(new FileOutputStream(new File(name), this.restart));
        if (!str.equalsIgnoreCase("0")) {
            writer.append(str);
            this.restart = true;
        } else {
            this.restart = false;
            writer.print("");
        }
        writer.close();
    } catch (IOException e) {
        System.out.println(e);
    }
}
Can anyone give me a hint, what's the best way to do this?
Thanks a lot
EDIT:
7354.120266 ns - packet print
241471.110451 ns - with StringBuilder
Keep the PrintWriter open. Don't open and close it for every line you want to write to the file. And don't flush it either: just close it when you exit. Basically you should remove your writeToFile() method and just call PrintWriter.write() or whatever directly when necessary.
NB You are writing text, not objects.
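A minimal sketch of that idea (the class and method names here are made up): open the writer once, call it for every packet, and close it only at shutdown. The BufferedWriter is optional but keeps each println() cheap.

import java.io.*;

class PacketLog {
    private final PrintWriter writer;

    PacketLog(String name, boolean append) throws IOException {
        // opened once, not on every call
        writer = new PrintWriter(new BufferedWriter(new FileWriter(name, append)));
    }

    void writeLine(String str) {
        writer.println(str);   // buffered; no open/flush/close per packet
    }

    void close() {
        writer.close();        // flushes any remaining data and releases the file
    }
}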
I found the problem.
As @KevinO said, getCanonicalHostName() was the problem.
Thanks a lot.
I have always been curious how rolling files are implemented in logging frameworks.
How would one even start creating a file-writing class, in any language, that ensures the file size is not exceeded?
The only possible solution I can think of is this:
write method:
size = file size + size of string to write
if(size > limit)
close the file writer
open file reader
read the file
close file reader
open file writer (clears the whole file)
remove the size from the beginning to accommodate for new string to write
write the new truncated string
write the string we received
This seems like a terrible implementation, but I cannot think of anything better.
Specifically I would love to see a solution in java.
EDIT: By "remove the size from the beginning" I mean: let's say I have a 20-byte string (which is the limit) and I want to write another 3-byte string; I remove 3 bytes from the beginning, am left with the last 17 bytes, and by appending the new string I have 20 bytes again.
Because your question made me look into it, here's an example from the logback logging framework. The RollingFileAppender#rollover() method looks like this:
public void rollover() {
    synchronized (lock) {
        // Note: This method needs to be synchronized because it needs exclusive
        // access while it closes and then re-opens the target file.
        //
        // make sure to close the hereto active log file! Renaming under windows
        // does not work for open files
        this.closeOutputStream();
        try {
            rollingPolicy.rollover(); // this actually does the renaming of files
        } catch (RolloverFailure rf) {
            addWarn("RolloverFailure occurred. Deferring roll-over.");
            // we failed to roll-over, let us not truncate and risk data loss
            this.append = true;
        }
        try {
            // update the currentlyActiveFile
            currentlyActiveFile = new File(rollingPolicy.getActiveFileName());
            // This will also close the file. This is OK since multiple
            // close operations are safe.
            // COMMENT MINE this also sets the new OutputStream for the new file
            this.openFile(rollingPolicy.getActiveFileName());
        } catch (IOException e) {
            addError("setFile(" + fileName + ", false) call failed.", e);
        }
    }
}
As you can see, the logic is pretty similar to what you posted. They close the current OutputStream, perform the rollover, then open a new one (openFile()). Obviously, this is all done in a synchronized block since many threads are using the logger, but only one rollover should occur at a time.
A RollingPolicy is a policy for how to perform a rollover, and a TriggeringPolicy determines when to perform a rollover. With logback, you usually base these policies on file size or time.
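For a bare-bones illustration of the same structure outside a logging framework (file names and the size limit are arbitrary, and error handling is minimal), here is a sketch that, once the limit is reached, closes the current file, renames it to a backup, and starts a fresh one, instead of truncating the front of the file:

import java.io.*;

class SimpleRollingWriter {
    private static final long LIMIT = 10 * 1024 * 1024;   // 10 MB, arbitrary
    private final File file = new File("app.log");
    private PrintWriter out;
    private long written;

    SimpleRollingWriter() throws IOException {
        written = file.length();                           // resume counting if the file already exists
        out = new PrintWriter(new FileWriter(file, true));
    }

    synchronized void write(String line) throws IOException {
        if (written + line.length() > LIMIT) {
            rollover();
        }
        out.println(line);
        written += line.length() + 1;
    }

    private void rollover() throws IOException {
        out.close();                                       // close before renaming (required on Windows)
        file.renameTo(new File("app.log.1"));              // keep one backup; real policies keep several and check the result
        out = new PrintWriter(new FileWriter(file));       // start a fresh, empty file
        written = 0;
    }
}

The synchronized keyword plays the same role as the lock in the logback code: only one rollover may happen at a time.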
I have implemented my own class to read pcap files. (Binary files, i.e. tcpdump, wireshark)
public class PcapReader implements Iterator<PcapPacket> {

    private InputStream is;

    public PcapReader(File file) throws FileNotFoundException, IOException {
        is = new DataInputStream(              // assign directly; "this(...)" is not valid here
                 new BufferedInputStream(
                     new FileInputStream(file)));
    }

    @Override
    public boolean hasNext() {
        try {
            return (is.available() > 0);
        } catch (IOException e) {
            return false;
        }
    }

    //pseudo code!
    @Override
    public PcapPacket next() {
        is.read(header);
        is.read(body);
        return new PcapPacket(header, body);
    }

    //more code here
}
Then I use it like this:
PcapReader reader = new PcapReader(file);
while (reader.hasNext()) {
PcapPacket pcapPacket = reader.next();
//process packet
}
The file under test is 190 MB. I also use JVisualVM to profile.
hasNext() is called 1.7 million times and takes 7.7 seconds
next() is called the same number of times and takes 3.6 seconds
My main question is why hasNext() is so time-consuming in absolute terms, and also why it takes twice as long as next().
When you call is.available() in your hasNext() method, it goes down to the FileInputStream.available() implementation. This is a native method, as one may see from the FileInputStream source code.
In the end, this is indeed a time-consuming operation, as the Operating System's implementation of the file operations has to check ahead whether more data is available to be read. So it will actually do a read operation without updating the file pointer (or update it back to the original position), just to check if there is a "next" byte.
I'm sure that the internal (native) implementation of the available() method is not something as simple as return availableSize;, but something more complicated. The stream counts available data using the OS API; especially, for example, for log files that are being written while the stream reads them.
I have implemented my own class to read pcap files.
Because you're not using jNetPcap, or because you are using jNetPcap but need something that can read from a File?
If the latter, you probably want to use a pattern other than one that has a "more data is available" method and a separate "so read that data" method; something that reads the data and either returns a "packet available"/"end of file"/"error" indication or throws an exception for one or both of the latter conditions (DataInputStream appears to throw exceptions for both I/O errors and EOF, so it might make sense to do the same for your class).
Yeah, that means it can't be an Iterator, but maybe Iterators weren't originally intended to represent records in a sequential file (besides, if you really want it to be an Iterator, what are you going to do about the remove method?).
And if you can avoid needing to read from a File, you could then use jNetPcap's own routines for reading capture files, which, in libpcap 1.1.0 and later, can also read some pcap-ng files.
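To illustrate the pattern suggested above, next() can detect end-of-file itself and return null, so no available() call (and no hasNext()) is needed. This is only a sketch, not jNetPcap and not real pcap parsing: dis is assumed to be a DataInputStream field, the 16-byte header is a placeholder, and bodyLength() is a hypothetical helper. As noted, a method that throws IOException no longer fits Iterator.

public PcapPacket next() throws IOException {
    byte[] header = new byte[16];               // placeholder header size, not the real pcap layout
    try {
        dis.readFully(header);                  // DataInputStream throws EOFException at end of file
    } catch (EOFException eof) {
        return null;                            // clean end of stream
    }
    byte[] body = new byte[bodyLength(header)]; // hypothetical helper that decodes the record length
    dis.readFully(body);                        // an EOF here would mean a truncated file
    return new PcapPacket(header, body);
}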
Today, while working on a servlet that writes some information to a file on my hard disk, I was using the following code to perform the write operation:
File f=new File("c:/users/dell/desktop/ja/MyLOgs.txt");
PrintWriter out=new PrintWriter(new FileWriter(f,true));
out.println("the name of the user is "+name+"\n");
out.println("the email of the user is "+ email+"\n");
out.close(); //**my question is about this statement**
When I was not using that statement, the servlet compiled fine but nothing was written to the file; when I included it, the write operation was performed successfully. My questions are:
Why was the data not being written to the file when I was not including that statement (even though my servlet compiled without any errors)?
To what extent does the close operation matter for streams?
Calling close() causes all the data to be flushed. You have constructed a PrintWriter without enabling auto-flush (a second argument to one of the constructors), which would mean you would have to manually call flush(), which close() does for you.
Closing also frees up any system resources used by having the file open. Although the VM and Operating System will eventually close the file, it is good practice to close it when you are finished with it to save memory on the computer.
You may also wish to put the close() inside a finally block to ensure it always gets called. Such as:
PrintWriter out = null;
try {
    File f = new File("c:/users/dell/desktop/ja/MyLOgs.txt");
    out = new PrintWriter(new FileWriter(f, true));
    out.println("the name of the user is " + name + "\n");
    out.println("the email of the user is " + email + "\n");
} finally {
    if (out != null) {   // guard against the FileWriter constructor having thrown
        out.close();
    }
}
See: PrintWriter
Sanchit also makes a good point about getting the Java 7 VM to automatically close your streams the moment you no longer need them.
When you close a PrintWriter, it will flush all of its data out to wherever you want the data to go. It doesn't automatically do this because if it did every time you wrote to something, it would be very inefficient as writing is not an easy process.
You could achieve the same effect with flush(), but you should always close streams - see here: http://www.javapractices.com/topic/TopicAction.do?Id=8 and here: http://docs.oracle.com/javase/tutorial/jndi/ldap/close.html. Always call close() on streams when you are done using them. Additionally, to make sure the stream is always closed regardless of exceptions, you could do this:
try {
    //do stuff
} finally {
    outputStream.close();
}
It is because the PrintWriter buffers your data in order to avoid making an I/O operation for every single write (which is very expensive). When you call close(), the buffer is flushed into the file. You can also call flush() to force the data to be written without closing the stream.
Streams automatically flush their data before closing. So you can either manually flush the data every once in a while using out.flush(), or you can just close the stream once you are done with it. When the program ends, streams close and your data gets flushed; this is why most of the time people do not close their streams!
Using Java 7 you can do something like the code below, which will auto-close your streams in the reverse of the order you opened them.
public static void main(String[] args) {
    String name = "";
    String email = "";
    File f = new File("c:/users/dell/desktop/ja/MyLOgs.txt");
    try (FileWriter fw = new FileWriter(f, true); PrintWriter out = new PrintWriter(fw)) {
        out.println("the name of the user is " + name + "\n");
        out.println("the email of the user is " + email + "\n");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
PrintWriter buffers the data to be written, so it will not write to disk until its buffer is full. Calling close() will ensure that any remaining data is flushed, as well as closing the OutputStream.
close() statements typically appear in finally blocks.
Why the data was not being written to the file when I was not including that statement?
When the process terminates, the unmanaged resources will be released. For InputStreams this is fine. For OutputStreams, you could lose any buffered data, so you should at least flush the stream before exiting the program.