Counting bytes consumed by char streams - java

I have a large text file (csv) on disk that I'm splitting into lines. Something like this:
String line;
BufferedReader reader = new BufferedReader(new FileReader(file));
while ((line = reader.readLine()) != null) {
...
}
What I want to do is compute the offset from the start of the file for, say, every 1,000th line, so that if in the future I want to read the 10,001st line, I can jump straight to offset X and then start iterating.
The file could be encoded in any way, so there is no strong relationship between bytes and chars.
Does anyone know of any "counting readers", or an alternative approach? I'm very happy to implement a Reader myself, but don't want to write a very complex class if I can avoid it.

BufferedReader is not suited to random access. Look instead at the Channel interface and its implementations, such as FileChannel.
Simple example of reading using a channel:
RandomAccessFile aFile = new RandomAccessFile("data/nio-data.txt", "rw");
FileChannel inChannel = aFile.getChannel();
ByteBuffer buf = ByteBuffer.allocate(48);
int bytesRead = inChannel.read(buf);
while (bytesRead != -1) {
System.out.println("Read " + bytesRead);
buf.flip();
while(buf.hasRemaining()){
System.out.print((char) buf.get());
}
buf.clear();
bytesRead = inChannel.read(buf);
}
aFile.close();
Source: http://tutorials.jenkov.com/java-nio/channels.html
As for your question of reading from where you left off, FileChannel defines a method read(ByteBuffer dst, long position), where position is the byte offset in the file at which you want to start reading.
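For the original goal of recording a byte offset every 1,000 lines, one option is to skip the Reader while building the index and count raw bytes instead. The following is only a sketch under stated assumptions: the names LineOffsetIndex and index are made up for illustration, and it is valid only for encodings in which the byte 0x0A always means a line feed (ASCII, ISO-8859-1, UTF-8), so not for UTF-16 or UTF-32.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class LineOffsetIndex {

    /**
     * Returns the byte offsets at which lines 0, every, 2*every, ... start.
     * Assumes 0x0A never occurs inside a multi-byte character.
     */
    static List<Long> index(Path file, int every) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);          // line 0 starts at byte 0
        long position = 0;        // bytes consumed so far
        long linesSeen = 0;       // complete lines consumed so far
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                position++;
                if (b == '\n') {
                    linesSeen++;
                    if (linesSeen % every == 0) {
                        // The next line starts right after this newline.
                        // If the file ends here, this entry equals the file size.
                        offsets.add(position);
                    }
                }
            }
        }
        return offsets;
    }
}
```

To read from line k * every later, position a FileChannel or RandomAccessFile at offsets.get(k) first, and only then wrap the stream in an InputStreamReader and BufferedReader with the file's charset.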

Related

Why does position() method of FileChannel return zero always?

My app reads a text file line by line and records the offset of each line until the end of the file.
But position() always returns 0.
What is wrong with my code?
String buffer;
long offset;
RandomAccessFile raf = new RandomAccessFile("data.txt", "r");
FileChannel channel = raf.getChannel();
BufferedReader br = new BufferedReader(new InputStreamReader(Channels.newInputStream(channel)));
while (true) {
offset = channel.position(); // offset is always 0. why?
if ((buffer = br.readLine()) == null) // buffer has correct value.
return;
………………………………
}
I cannot reproduce your error; offset is not always 0 when I run your code. Still, the code doesn't do what you expect. You create a BufferedReader on top of your FileChannel. The BufferedReader fills its buffer (and thus advances the channel's position) and then serves reads from that buffer until it's empty. So after calling br.readLine() once, the offset is not the length of the line you've read; it is the number of bytes that were pulled into the buffer.
You would do better to use a BufferedReader on a FileInputStream directly and count the characters by some other means.
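If exact per-line offsets matter more than speed, one such "other means" is RandomAccessFile: its readLine() leaves the file pointer just past the line terminator, so getFilePointer() reports the real position. A rough sketch follows; LineOffsets and lineStarts are hypothetical names, and because RandomAccessFile.readLine() is unbuffered and turns each byte into one char, the strings it returns are only meaningful for single-byte text.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class LineOffsets {

    /** Returns the byte offset at which each line of the file starts. */
    static List<Long> lineStarts(String path) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long offset = raf.getFilePointer();   // 0 for a freshly opened file
            while (raf.readLine() != null) {      // handles \n, \r and \r\n
                starts.add(offset);
                offset = raf.getFilePointer();    // start of the next line
            }
        }
        return starts;
    }
}
```

Each readLine() call here goes to the file byte by byte, so this is likely to be slow on large files; it is meant to show where the offsets come from, not to be fast.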

Fastest way to incrementally read a large file

When given a buffer of MAX_BUFFER_SIZE, and a file that far exceeds it, how can one:
Read the file in blocks of MAX_BUFFER_SIZE?
Do it as fast as possible
I tried using NIO
RandomAccessFile aFile = new RandomAccessFile(fileName, "r");
FileChannel inChannel = aFile.getChannel();
ByteBuffer buffer = ByteBuffer.allocate(CAPACITY);
int bytesRead = inChannel.read(buffer);
buffer.flip();
while (buffer.hasRemaining()) {
buffer.get();
}
buffer.clear();
bytesRead = inChannel.read(buffer);
aFile.close();
And regular IO
InputStream in = new FileInputStream(fileName);
long length = fileName.length();
if (length > Integer.MAX_VALUE) {
throw new IOException("File is too large!");
}
byte[] bytes = new byte[(int) length];
int offset = 0;
int numRead = 0;
while (offset < bytes.length
&& (numRead = in.read(bytes, offset, bytes.length - offset)) >= 0) {
offset += numRead;
}
if (offset < bytes.length) {
throw new IOException("Could not completely read file " + fileName);
}
in.close();
It turns out that regular IO is about 100 times faster than NIO at doing the same thing. Am I missing something? Is this expected? Is there a faster way to read the file in buffer-sized chunks?
Ultimately I am working with a large file that I don't have enough memory to read all at once. Instead, I'd like to read it incrementally, in blocks that would then be used for processing.
If you want to make your first example faster:
FileChannel inChannel = new FileInputStream(fileName).getChannel();
ByteBuffer buffer = ByteBuffer.allocateDirect(CAPACITY);
while(inChannel.read(buffer) > 0)
buffer.clear(); // do something with the data and clear/compact it.
inChannel.close();
If you want it to be even faster:
FileChannel inChannel = new RandomAccessFile(fileName, "r").getChannel();
MappedByteBuffer buffer = inChannel.map(FileChannel.MapMode.READ_ONLY, 0, inChannel.size());
// access the buffer as you wish.
inChannel.close();
The map call itself can take 10 - 20 microseconds for files up to 2 GB in size; the file's contents are typically paged in only as the buffer is accessed.
Assuming that you need to read the entire file into memory at once (as you're currently doing), neither reading smaller chunks nor NIO are going to help you here.
In fact, you'd probably be best reading larger chunks - which your regular IO code is automatically doing for you.
Your NIO code is currently slower, because you're only reading one byte at a time (using buffer.get();).
If you want to process in chunks - for example, transferring between streams - here is a standard way of doing it without NIO:
InputStream is = ...;
OutputStream os = ...;
byte[] buffer = new byte[1024];
int read;
while((read = is.read(buffer)) != -1){
os.write(buffer, 0, read);
}
This uses a buffer size of only 1 KB, but can transfer an unlimited amount of data.
(If you extend your question with details of what you're actually looking to do at a functional level, I could improve this answer further.)
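To make the point about byte-at-a-time access concrete, here is a hedged sketch of chunked reading through a FileChannel that processes each block via the backing array instead of calling buffer.get() per byte. ChunkedRead, countNewlines and the bufferSize parameter are illustrative names, and counting newlines merely stands in for whatever processing is needed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of any library.
final class ChunkedRead {

    /** Reads the file in blocks of at most bufferSize bytes and counts '\n' bytes. */
    static long countNewlines(Path file, int bufferSize) throws IOException {
        long count = 0;
        byte[] chunk = new byte[bufferSize];
        ByteBuffer buffer = ByteBuffer.wrap(chunk);   // heap buffer backed by chunk
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                int filled = buffer.position();       // bytes placed in chunk by this read
                for (int i = 0; i < filled; i++) {
                    if (chunk[i] == '\n') {
                        count++;
                    }
                }
                buffer.clear();                       // reuse the same block for the next read
            }
        }
        return count;
    }
}
```

Memory use stays at bufferSize regardless of the file's size, which is the property the question asks for.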

Identify the current character being read in "read" method of fileinputstream?

FileInputStream in = new FileInputStream("filetoreadfrom.txt");
while ((c = in.read()) != -1) {
Integer cobj = new Integer(c);
System.out.println("The Current data being read is :" + cobj.byteValue());
out.write(c);
}
The sysouts give an int value representing the byte being read, but I want to print the exact character being read. Is there a way to do it?
An InputStream contains bytes, not characters. What does it even mean to talk about the "character" when you're in the middle of an mp3 file, for example?
If you want to read text data, you need a Reader, e.g. an InputStreamReader wrapped around an InputStream with a specific encoding.
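A minimal sketch of that approach, assuming the caller knows the file's encoding (CharDump and readChars are hypothetical names chosen for this example):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of any library.
final class CharDump {

    /** Decodes the stream with the given charset and returns the characters read. */
    static String readChars(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            int c;
            // Reader.read() returns one UTF-16 code unit per call, or -1 at end of stream.
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```

With UTF-8 input, a character such as é occupies two bytes in the stream but arrives as a single char from the Reader, which is the distinction the byte-oriented loop in the question cannot make.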
Try the type conversion (char) cobj.byteValue(). Note that this only gives the right character for ASCII data.
It's better to use BufferedReader and InputStreamReader, but you can also use code such as:
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(inputFile));
byte[] buffer = new byte[4096];
int len;
while ((len = bis.read(buffer)) >= 0) {
// Decodes with the platform default charset; a multi-byte character split across two chunks will be garbled.
String chunk = new String(buffer, 0, len);
...
}
bis.close();

Prepend lines to file in Java

Is there a way to prepend a line to the File in Java, without creating a temporary file, and writing the needed content to it?
No, there is no way to do that SAFELY in Java. (Or AFAIK, any other programming language.)
No filesystem implementation in any mainstream operating system supports this kind of thing, and you won't find this feature supported in any mainstream programming languages.
Real world file systems are implemented on devices that store data as fixed sized "blocks". It is not possible to implement a file system model where you can insert bytes into the middle of a file without significantly slowing down file I/O, wasting disk space or both.
The solutions that involve an in-place rewrite of the file are inherently unsafe. If your application is killed or the power dies in the middle of the prepend / rewrite process, you are likely to lose data. I would NOT recommend using that approach in practice.
Use a temporary file and renaming. It is safer.
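A sketch of the temporary-file approach using java.nio.file follows. PrependWithTempFile and prepend are hypothetical names; the temporary file is created next to the target on the assumption that a move within one directory is less likely to turn into a copy, and whether the final replacement is atomic depends on the platform and file system.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library.
final class PrependWithTempFile {

    /** Writes prefix followed by the current content of target, then swaps the files. */
    static void prepend(Path target, byte[] prefix) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "prepend", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(tmp)) {
                out.write(prefix);
                Files.copy(target, out);   // streams the original bytes unchanged
            }
            // The original is replaced only after the new content is fully written.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp);     // no-op after a successful move
        }
    }
}
```

If the process dies before the move, the original file is still intact and only the temporary file is left behind.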
There is a way, it involves rewriting the whole file though (but no temporary file). As others mentioned, no file system supports prepending content to a file. Here is some sample code that uses a RandomAccessFile to write and read content while keeping some content buffered in memory:
public static void main(final String args[]) throws Exception {
File f = File.createTempFile(Main.class.getName(), "tmp");
f.deleteOnExit();
System.out.println(f.getPath());
// put some dummy content into our file
BufferedWriter w = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f)));
for (int i = 0; i < 1000; i++) {
w.write(UUID.randomUUID().toString());
w.write('\n');
}
w.flush();
w.close();
// prepend "some uuids" to our file
int bufLength = 4096;
byte[] appendBuf = "some uuids\n".getBytes();
byte[] writeBuf = appendBuf;
byte[] readBuf = new byte[bufLength];
int writeBytes = writeBuf.length;
RandomAccessFile rw = new RandomAccessFile(f, "rw");
int read = 0;
int write = 0;
while (true) {
// seek to read position and read content into read buffer
rw.seek(read);
int bytesRead = rw.read(readBuf, 0, readBuf.length);
// seek to write position and write content from write buffer
rw.seek(write);
rw.write(writeBuf, 0, writeBytes);
// no bytes read - end of file reached
if (bytesRead < 0) {
break;
}
// update seek positions for write and read
read += bytesRead;
write += writeBytes;
writeBytes = bytesRead;
// reuse buffer, create new one to replace (short) append buf
byte[] nextWrite = writeBuf == appendBuf ? new byte[bufLength] : writeBuf;
writeBuf = readBuf;
readBuf = nextWrite;
}
rw.close();
// now show the content of our file
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
}
If the file is small enough to fit in memory, you could read its content into a String and build the new content with a StringBuilder: put the desired line first, append the file content, and write the result back.
No extra temporary file needed.
No. There are no "intra-file shift" operations, only read and write of discrete sizes.
It would be possible to do so by reading a chunk of the file of equal length to what you want to prepend, writing the new content in place of it, reading the next chunk and replacing it with what you read before, and so on, rippling down to the end of the file.
However, don't do that, because if anything stops (out-of-memory, power outage, rogue thread calling System.exit) in the middle of that process, data will be lost. Use the temporary file instead.
private static void addPrependedText(File file) {
File tempFile = new File(file.getAbsolutePath() + "#");
// try-with-resources closes both streams before the rename below
try (FileOutputStream out = new FileOutputStream(tempFile);
BufferedReader br = new BufferedReader(new FileReader(file))) {
out.write("preappendTextDataHere".getBytes());
String currentLine;
while ((currentLine = br.readLine()) != null) {
out.write(("\n" + currentLine).getBytes());
}
} catch (IOException e) {
e.printStackTrace();
return; // leave the original file untouched on failure
}
// replace the original only once the new file is complete;
// renameTo may fail on some platforms if the target exists
if (!tempFile.renameTo(file)) {
System.err.println("Could not rename " + tempFile + " to " + file);
}
}

Java: BufferedReader reads more than a line?

I'm making a program in Java with sockets. I can send commands to the client and from the client to the server. To read the commands I use a BufferedReader; to write them, a PrintWriter. Now I want to transfer a file through that same socket (not simply create a second connection). First I write to the output stream how many bytes the file contains, for example 40000 bytes. So I write the number 40000 through the socket, but the other side of the connection reads 78.
So I was thinking: the BufferedReader reads more than just the line (when I call readLine()), and in that way I lose some bytes of the file data, because they are sitting in the BufferedReader's buffer.
So the number 78 is a byte of the file I want to transmit.
Is this way of thinking right or not? If so, how do I solve this problem?
I hope I've explained well.
Here is my code. My native language is Dutch, so some variable names may sound strange.
public void flushStreamToStream(InputStream is, OutputStream os, boolean closeIn, boolean closeOut) throws IOException {
byte[] buffer = new byte[BUFFERSIZE];
int bytesRead;
if ((!closeOut) && closeIn) { // To Socket from File
action = "Upload";
os.write(is.available()); // Here I write 40000
max = is.available();
System.out.println("Bytes to send: " + max);
while ((bytesRead = is.read(buffer)) != -1) {
startTiming(); // Two lines to compute the speed
os.write(buffer, 0, bytesRead);
stopTiming(); // Speed computation
process += bytesRead;
}
os.flush();
is.close();
return;
}
if ((!closeIn) && closeOut) { // To File from Socket
action = "Download";
int bytesToRead = -1;
bytesToRead = is.read(); // Here it reads 78.
System.out.println("Bytes to read: " + bytesToRead);
max = bytesToRead;
int nextBufferSize;
while ((nextBufferSize = Math.min(BUFFERSIZE, bytesToRead)) > 0) {
startTiming();
bytesRead = is.read(buffer, 0, nextBufferSize);
bytesToRead -= bytesRead;
process += nextBufferSize;
os.write(buffer, 0, bytesRead);
stopTiming();
}
os.flush();
os.close();
return;
}
throw new IllegalArgumentException("The only two boolean combinations are: closeOut == false && closeIn == true AND closeOut == true && closeIn == false");
}
Here is the solution:
Thanks to James' suggestion.
I think laginimaineb's answer was a piece of the solution.
Read the commands:
DataInputStream in = new DataInputStream(is); // Originally a BufferedReader
// Read the request line
String str;
while ((str = in.readLine()) != null) {
if (str.trim().equals("")) {
continue;
}
handleSocketInput(str);
}
Now the flushStreamToStream:
public void flushStreamToStream(InputStream is, OutputStream os, boolean closeIn, boolean closeOut) throws IOException {
byte[] buffer = new byte[BUFFERSIZE];
int bytesRead;
if ((!closeOut) && closeIn) { // To Socket from File
action = "Upload";
DataOutputStream dos = new DataOutputStream(os);
dos.writeInt(is.available());
max = is.available();
System.out.println("Bytes to send: " + max);
while ((bytesRead = is.read(buffer)) != -1) {
startTiming();
dos.write(buffer, 0, bytesRead);
stopTiming();
process += bytesRead;
}
os.flush();
is.close();
return;
}
if ((!closeIn) && closeOut) { // To File from Socket
action = "Download";
DataInputStream dis = new DataInputStream(is);
int bytesToRead = dis.readInt();
System.out.println("Bytes to read: " + bytesToRead);
max = bytesToRead;
int nextBufferSize;
while ((nextBufferSize = Math.min(BUFFERSIZE, bytesToRead)) > 0) {
startTiming();
bytesRead = is.read(buffer, 0, nextBufferSize);
bytesToRead -= bytesRead;
process += nextBufferSize;
os.write(buffer, 0, bytesRead);
stopTiming();
}
os.flush();
os.close();
return;
}
throw new IllegalArgumentException("The only two boolean combinations are: closeOut == false && closeIn == true AND closeOut == true && closeIn == false");
}
Martijn.
I'm not sure I've followed your explanation.
However, yes - you have no real control over how much a BufferedReader will actually read. The point of such a reader is that it optimistically reads chunks of the underlying resource as needed to replenish its buffer. So when you first call readLine, it will see that its internal buffer doesn't have enough to serve your request, and will go off and read however many bytes it feels like into its buffer from the underlying source, which will generally be much more than you asked for just then. Once the buffer's been populated, it returns your line from the buffered content.
Thus once you wrap an input stream in a BufferedReader, you should be sure to only read that stream through the same buffered reader. If you don't, you'll end up losing data (as some bytes will have been consumed and are now sitting in the BufferedReader's cache waiting to be served).
DataInputStream is most likely what you want to use. Also, don't rely on the available() method: it only returns an estimate of the bytes that can be read without blocking, not the total size of the stream.
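One reason a length prefix needs DataOutputStream: OutputStream.write(int) is documented to write only the low-order eight bits of its argument, whereas DataOutputStream.writeInt writes all four bytes. The sketch below illustrates the difference with in-memory streams; LengthPrefix and its two methods are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper, not part of any library.
final class LengthPrefix {

    /** Sends length with write(int) and reads it back with read(): one byte only. */
    static int viaPlainWrite(int length) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        wire.write(length);   // keeps only the low-order 8 bits
        return new ByteArrayInputStream(wire.toByteArray()).read();
    }

    /** Sends length with writeInt and reads it back with readInt: all four bytes. */
    static int viaDataStreams(int length) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(wire)) {
            out.writeInt(length);
        }
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(wire.toByteArray()))) {
            return in.readInt();
        }
    }
}
```

Calling viaPlainWrite(40000) does not give back 40000, while viaDataStreams(40000) does.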
A BufferedReader assumes that it is the only one reading from the underlying input stream.
Its purpose is to minimize the number of reads from the underlying stream (which are expensive, as they can delegate quite deeply). To that end, it keeps a buffer, which it fills by reading as many bytes as possible into it in a single call to the underlying stream.
So yes, your diagnosis is accurate.
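The read-ahead can be observed with in-memory streams. This is only an illustrative sketch (ReadAheadDemo and bytesLeftAfterReadLine are made-up names); it uses a ByteArrayInputStream because its available() reports the exact number of unread bytes, which is not true of streams in general.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class ReadAheadDemo {

    /** Reads one line through a BufferedReader, then reports what is left in the raw stream. */
    static int bytesLeftAfterReadLine(byte[] data) throws IOException {
        InputStream raw = new ByteArrayInputStream(data);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(raw, StandardCharsets.ISO_8859_1));
        reader.readLine();        // asks for one line only
        return raw.available();   // exact for ByteArrayInputStream
    }
}
```

For a short input such as a header line followed by a few payload bytes, this should return 0: the payload has already been pulled into the reader's buffers and can no longer be read from the raw stream.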
Just a wild stab here - 40000 is 1001110001000000 in binary. Now, the first seven bits here are 1001110 which is 78. Meaning, you're writing 2 bytes of information but reading seven bits.
