A log file is created per day, each about 400MB; the JVM has about 2GB of memory.
Another process keeps appending to the current log file (it opens it in 'a' mode).
I want to read this file and achieve the following:
Keep reading newly appended data as it is written.
Store the read offset so I can resume from it after a JVM restart.
This is my simple implementation, but I don't know whether its time and memory consumption are acceptable. Is there a better way to solve this problem?
public static void main(String[] args) throws IOException {
    String filePath = "D://test.log";
    long restoreOffset = restoreOffset();
    RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
    randomAccessFile.seek(restoreOffset);
    while (true) {
        String line = randomAccessFile.readLine();
        if (line != null) {
            // doSomething(line);
            restoreOffset = randomAccessFile.getFilePointer();
            // storeOffset(restoreOffset);
        }
    }
}
It's not, unfortunately.
There are 2 major problems with this code. I'll tackle the simpler one first, but the more important one is the second.
Encoding issues
String line = randomAccessFile.readLine();
This line converts bytes to characters implicitly, and that's generally a bad idea, because bytes aren't characters, and converting from one to the other requires a charset encoding.
This method (readLine() from RAF) is a bizarre case - probably because RandomAccessFile is an incredibly old API. It applies an ISO-8859-1-esque encoding: each byte is treated as a complete char, as if the byte value were the Unicode code point. That isn't a real charset so much as the laziest possible conversion.
The upshot for you is: Unless you can guarantee that this log file shall always only ever contain ASCII characters, this code is broken, and readLine cannot be used at all. Instead you'll have to do considerably more work: read bytes until you hit a newline, then turn the bytes so gathered into a string with new String(byteArray, StandardCharsets.UTF_8), or use ByteBuffer and apply similar tactics. But keep reading, because solving the second problem kinda solves this one automatically.
Buffering
Modern computer systems tend to like 'packeting'. You can't really operate on a single byte. Take SSDs (though this applies to spinning platter disks as well): The actual SSD hardware can't read single bytes. It can only read entire blocks worth of data.
When you therefore ask the OS explicitly for a single byte, that sets off a chain of events: a system call into the OS, the OS reading the entire block from the SSD (unless it already has it cached), and then everything except the one byte you wanted being thrown away before just that byte is returned to you.
If your code then asks for the next byte, that whole routine happens again.
So, if you read 1024 consecutive bytes by calling read() 1024 times, you pay that per-call overhead 1024 times, whereas calling read(byteArr) once with a 1024-byte array pays it once.
Yup, that means the byte-array approach can easily be hundreds of times faster.
The same applies to networking, too. Sending 1 byte a thousand times is usually vastly slower than sending 1000 bytes once; a TCP/IP packet can carry roughly 1500 bytes of payload, so sending much less than that per packet gains you almost nothing.
RAF's readLine() works like the first (bad) scenario: it reads bytes one at a time until it hits a newline character. Thus, reading a 100-character line costs 100 separate single-byte reads instead of one buffered read.
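To make the contrast concrete, here is a minimal sketch of the two styles (the file name is just an example); both loops consume the same data, but the second makes one read call per chunk instead of one per byte:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadStyles {
    public static void main(String[] args) throws IOException {
        // Slow: one read() call (and all its per-call overhead) per byte.
        try (InputStream in = new FileInputStream("D://test.log")) {
            int b;
            while ((b = in.read()) != -1) {
                // handle the single byte b
            }
        }

        // Fast: one read call per 8KB chunk.
        try (InputStream in = new FileInputStream("D://test.log")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                // handle buf[0..n)
            }
        }
    }
}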
The solution
You may want to abandon RandomAccessFile entirely; it's quite an old API.
A major issue with buffering is that it's a lot harder unless you know beforehand how many bytes to read. Here, you don't know that: you want to keep reading until you hit a newline character, but you have no idea how long that will take. Furthermore, buffering APIs tend to return whatever is convenient, and may therefore read fewer bytes than we ask for (they'll always read at least 1, though, unless we hit end of file). So we need code that repeatedly reads a chunk's worth of data, scans the chunk for a newline, and if it isn't there, keeps reading.
Furthermore, opening channels and such is expensive. So, if you want to dig through all log lines, writing code that opens a new channel every time is suboptimal.
How about this, using the newer file API from java.nio.file:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LogLineReader implements AutoCloseable {
    private final byte[] buffer = new byte[1024];
    private final ByteBuffer bb = ByteBuffer.wrap(buffer);
    private final SeekableByteChannel channel;
    private final Charset charset = StandardCharsets.UTF_8;

    public LogLineReader(Path p) throws IOException {
        channel = Files.newByteChannel(p, StandardOpenOption.READ);
        channel.position(111L); // seek to your restored offset here, as in your code
    }

    @Override public void close() throws IOException {
        channel.close();
    }

    // This code buffers: First, our internal buffer is scanned
    // for a newline. If there is no full line in the buffer,
    // we read bytes from the file and check again until we find one.
    public String readLine() throws IOException {
        if (!channel.isOpen()) return null;
        int scanStart = 0;
        while (true) {
            // Scan through the bytes we have buffered for a newline.
            for (int i = scanStart; i < bb.position(); i++) {
                if (buffer[i] == '\n') {
                    // Found it. Take all bytes up to the newline, turn into
                    // a string.
                    String res = new String(buffer, 0, i, charset);
                    // Copy all bytes from _after_ the newline to the front.
                    System.arraycopy(buffer, i + 1, buffer, 0, bb.position() - i - 1);
                    // Adjust the position (which represents how many bytes are buffered).
                    bb.position(bb.position() - i - 1);
                    return res;
                }
            }
            scanStart = bb.position();
            // If we get here, the buffer is empty or contains no newline.
            if (scanStart == bb.limit()) {
                throw new IOException("Log line too long");
            }
            int read = channel.read(bb); // let's fetch more bytes!
            if (read == -1) {
                // We've reached the end of the file.
                if (bb.position() == 0) return null;
                String res = new String(buffer, 0, bb.position(), charset);
                bb.position(0); // buffer fully consumed; the next call will return null
                return res;
            }
        }
    }
}
For the sake of efficiency, this code cannot deal with log lines longer than 1024 bytes; feel free to up that number. If you want to be able to read lines of unbounded size, at some point a gigantic buffer becomes a problem. If you must, you could write code that resizes the buffer when it fills up, or you could update this code so that it keeps reading but only returns a truncated string containing the first 1024 characters. I'll leave that as an exercise for you.
NB: I also didn't test this, but at the very least it should give you the general gist of using SeekableByteChannel, and the concept of buffers.
To use:
Path p = Paths.get("D://logfile.txt");
try (LogLineReader reader = new LogLineReader(p)) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // do something with line
    }
}
You must ensure the LLR object is closed, hence, use try-with-resources.
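One more note: your original code also persists the read offset. If you add that to this class, keep in mind that channel.position() points past whatever is still sitting in the internal buffer, so the offset you want to store is the channel position minus the buffered byte count. A hypothetical accessor (my addition, not part of the code above) could look like this:

// Hypothetical addition to LogLineReader: the offset of the first byte that
// readLine() has not yet returned. Persist this value, and pass it to
// channel.position(...) in the constructor after a restart.
public long consumedOffset() throws IOException {
    return channel.position() - bb.position();
}

Store that value with your existing storeOffset logic and seek back to it when the JVM comes back up.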
Related
I need to manipulate content as it is being written to an OutputStream. Specifically, I need to replace CR or LF with CRLF to canonicalize text. This is easy for simple character sets where CR=13 and LF=10, but not so simple with multi-byte character sets. The characters should be replaced, not the bytes. It is non-trivial in general to do that in the output stream itself.
The built-in class OutputStreamWriter converts from characters to bytes for a configured encoding. I'm looking for a class that does the opposite, that is an OutputStream configured with a character set that buffers data as needed and translates the written bytes into characters with the character set (or throws on invalid byte sequences), making the characters available in some way, for example by forwarding the call to a Writer.
In other words I want to convert from bytes to characters on-the-fly as content is being written. I could write everything to a buffer and read it back with an InputStreamReader, but that is inefficient for very large payloads that won't fit in memory.
Is there a class like this somewhere (ideally open source, as I don't think it is built in)? If not, are there similar examples for efficient streaming conversion I could use as a starting point? The JDK classes I've seen are optimized for converting many bytes at a time, not for streaming use.
I wrote an implementation based on CharsetDecoder. Create a decoder and allocate a ByteBuffer and CharBuffer in the constructor:
decoder = charset.newDecoder();
byteBuf = ByteBuffer.allocate(bufferSize);
charBuf = CharBuffer.allocate(bufferSize);
Then implement write:
public void write(int b) throws IOException {
    if (!byteBuf.hasRemaining()) {
        decodeAndWriteByteBuffer(false);
    }
    byteBuf.put((byte) b);
}
And decodeAndWriteByteBuffer:
private void decodeAndWriteByteBuffer(boolean endOfInput) throws IOException {
    byteBuf.flip();
    CoderResult cr;
    do {
        cr = byteBuf.hasRemaining() || endOfInput
                ? decoder.decode(byteBuf, charBuf, endOfInput)
                : CoderResult.UNDERFLOW;
        if (cr.isUnderflow()) {
            if (endOfInput) {
                do {
                    cr = decoder.flush(charBuf);
                    writeCharBuffer();
                } while (cr.isOverflow());
                if (cr.isError()) {
                    cr.throwException();
                }
            }
        } else if (cr.isOverflow()) {
            writeCharBuffer();
        } else {
            cr.throwException();
        }
    } while (cr.isOverflow());
    byteBuf.compact();
}
The remaining details are left as an exercise to the reader. It seems to work, though it is too early to say anything about performance.
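For context, here is one way the surrounding class might be fleshed out. This is a sketch of my own: it assumes the decoded characters are forwarded to a wrapped Writer, and the class name, constructor, writeCharBuffer() helper, and close() are my inventions rather than part of the answer above.

import java.io.IOException;
import java.io.OutputStream;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Sketch of an OutputStream that decodes the bytes written to it and
// forwards the resulting characters to a Writer.
public class DecodingOutputStream extends OutputStream {

    private final CharsetDecoder decoder;
    private final ByteBuffer byteBuf;
    private final CharBuffer charBuf;
    private final Writer out;

    public DecodingOutputStream(Writer out, Charset charset, int bufferSize) {
        this.out = out;
        this.decoder = charset.newDecoder();
        this.byteBuf = ByteBuffer.allocate(bufferSize);
        this.charBuf = CharBuffer.allocate(bufferSize);
    }

    @Override
    public void write(int b) throws IOException {
        if (!byteBuf.hasRemaining()) {
            decodeAndWriteByteBuffer(false);
        }
        byteBuf.put((byte) b);
    }

    private void decodeAndWriteByteBuffer(boolean endOfInput) throws IOException {
        byteBuf.flip();
        CoderResult cr;
        do {
            cr = byteBuf.hasRemaining() || endOfInput
                    ? decoder.decode(byteBuf, charBuf, endOfInput)
                    : CoderResult.UNDERFLOW;
            if (cr.isUnderflow()) {
                if (endOfInput) {
                    do {
                        cr = decoder.flush(charBuf);
                        writeCharBuffer();
                    } while (cr.isOverflow());
                    if (cr.isError()) {
                        cr.throwException();
                    }
                }
            } else if (cr.isOverflow()) {
                writeCharBuffer();
            } else {
                cr.throwException();
            }
        } while (cr.isOverflow());
        byteBuf.compact();
    }

    // Drain whatever has been decoded so far to the wrapped Writer.
    private void writeCharBuffer() throws IOException {
        charBuf.flip();
        out.write(charBuf.toString());
        charBuf.clear();
    }

    @Override
    public void close() throws IOException {
        decodeAndWriteByteBuffer(true); // flush the decoder at end of input
        out.close();
    }
}

Overriding write(byte[], int, int) to feed bytes into byteBuf in bulk would be the obvious next optimization.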
I need to know the number of lines of a file before processing it; in the worst-case scenario I would read it twice. So I wrote this code, but it doesn't work. Is it just not possible?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
    numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection(numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
    StringTokenizer st = new StringTokenizer(line, ",");
    while (st.hasMoreElements()) {
        System.out.println(st.nextElement());
    }
}
Here's a similar question with java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
EDIT:
Here's a reference related to inputstreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."
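To illustrate that last suggestion, a rough sketch of such a wrapper might look like this (the class and method names are my own, not from any library):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wraps another stream and tracks how many bytes have been read so far.
// Combined with the known total size (e.g. File.length()), this gives a
// rough progress estimate while Jackson (or anything else) consumes it.
public class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}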
You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere so that you can. For example, you might copy it to a temporary file, or, if you're prepared to rely on the input not being too large, store the contents in memory, e.g. as a byte[], a char[], a String, or a List of lines.
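For example, a sketch of the temporary-file approach (all names here are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class TwoPassRead {

    static void process(InputStream in) throws IOException {
        // Stash the stream in a temporary file so it can be read twice.
        Path tmp = Files.createTempFile("upload", ".txt");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);

            // Pass 1: count the lines.
            long numLines;
            try (Stream<String> lines = Files.lines(tmp)) {
                numLines = lines.count();
            }

            // Pass 2: the real processing, now that numLines is known.
            try (BufferedReader reader = Files.newBufferedReader(tmp)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // handle line, e.g. feed it to TextFileDataCollection
                }
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}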
I am basically looking for a solution that allows me to stream the lines and replace them IN THE SAME FILE, a la Files.lines
Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?
Basically, no.
Any change to a file that involves changing the number of bytes between offsets A and B can only be done by rewriting the file, or creating a new one. In either case, everything after B has to be loaded / read into memory.
This is not a Java-specific restriction. It is a consequence of the way that modern operating systems represent files, and the low-level (i.e. syscall) APIs that they provide to applications.
In the specific case where you replace one line (or sequence of lines) with a line (or sequence of lines) of exactly the same length, then you can do the replacement using either RandomAccessFile, or by mapping the file into memory. Note that the latter approach won't cause the entire file to be read into memory.
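To illustrate the same-length case, here is a minimal sketch using RandomAccessFile (the file name, offset, and replacement text are made up; the replacement must encode to exactly as many bytes as the text it overwrites):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class SameLengthReplace {
    public static void main(String[] args) throws IOException {
        long lineStartOffset = 4096L;     // known byte offset where the line begins
        String newLine = "status=DONE ";  // must be exactly as many bytes as the old text

        try (RandomAccessFile raf = new RandomAccessFile("big.log", "rw")) {
            raf.seek(lineStartOffset);
            raf.write(newLine.getBytes(StandardCharsets.US_ASCII));
        }
    }
}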
It is also possible to replace or delete lines while updating the file "in place" (changing the file length ...). See @Sergio Montoro's answer for an example. However, with an in-place update, there is a risk that the file will be corrupted if the application is interrupted. And this does involve reading and rewriting all bytes in the file after the insertion / deletion point. And that entails loading them into memory.
There was a mechanism in Java 1: RandomAccessFile; but any such in-place mechanism requires that you know the start offset of the line, and that the new line is the same length as the old one.
Otherwise you have to copy the file up to that line, substitute the new line in the output, and then continue the copy.
You certainly don't have to load the entire file into memory.
Yes.
A FileChannel allows random read/write access to any position of a file. Therefore, if you have a read-ahead buffer which is long enough, you can replace lines even if the new line is longer than the old one.
The following example is a toy implementation which makes two assumptions: 1st) the input file is ISO-8859-1 encoded with Unix LF line endings, and 2nd) each new line is never going to be longer than the next line (a one-line read-ahead buffer).
Unless you definitely cannot create a temporary file, you should benchmark this approach against the more natural stream-in -> stream-out approach, because I do not know what performance a spinning drive will give you with an algorithm that constantly jumps back and forth in a file.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import static java.nio.file.StandardOpenOption.*;
import java.io.IOException;

public class ReplaceInFile {

    public static void main(String args[]) throws IOException {
        Path file = Paths.get(args[0]);
        ByteBuffer writeBuffer;
        long readPos = 0L;
        long writePos;
        String line_m;
        String line_n;
        String line_t;
        FileChannel channel = FileChannel.open(file, READ, WRITE);
        channel.position(0);
        writePos = readPos;
        line_m = readLine(channel);
        do {
            readPos += line_m.length() + 1;
            channel.position(readPos);
            line_n = readLine(channel); // one line of read-ahead
            line_t = transformLine(line_m) + "\n";
            writeBuffer = ByteBuffer.allocate(line_t.length());
            writeBuffer.put(line_t.getBytes("ISO8859_1"));
            System.out.print("replaced line " + line_m + " with " + line_t);
            // We may only overwrite bytes that have already been read (line_m and line_n).
            assert writePos + line_t.length() <= readPos + line_n.length() + 1;
            channel.position(writePos);
            writeBuffer.flip();
            while (writeBuffer.hasRemaining()) {
                channel.write(writeBuffer);
            }
            writePos += line_t.length();
            line_m = line_n;
        } while (line_m.length() > 0);
        channel.truncate(writePos); // drop leftover bytes if the rewritten content is shorter
        channel.close();
        System.out.println("Done!");
    }

    public static String transformLine(String input) throws IOException {
        return input.replace("<", "&lt;").replace(">", "&gt;");
    }

    public static String readLine(FileChannel channel) throws IOException {
        ByteBuffer readBuffer = ByteBuffer.allocate(1);
        StringBuilder line = new StringBuilder();
        do {
            int read = channel.read(readBuffer);
            if (read < 1) break;
            readBuffer.rewind();
            char c = (char) readBuffer.get();
            readBuffer.rewind();
            if (c == '\n') break;
            line.append(c);
        } while (true);
        return line.toString();
    }
}
I have a large log file and I want to read it 1MB at a time.
For example, if I have a 100MB text file and read 1MB at a time, that takes 100 reads.
Any ideas?
You can wrap your file in an InputStream and then call read(byte[] b, int off, int len), passing the number of bytes to be read in len and the right offset in off (keeping in mind that read may return fewer bytes than requested), or just use read() to read one byte of the InputStream at a time and put a loop around it:
for (int i = 0; i < 1048576; i++) {
    int b = input.read();
    if (b == -1) break; // end of stream
    // do something with b
}
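If you want the 1MB-buffer flavour of that instead, a minimal sketch (file name made up) could look like this; note that read may return fewer bytes than requested, so always use the returned count:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadInChunks {
    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[1024 * 1024]; // 1MB at a time
        try (InputStream input = new FileInputStream("big.log")) {
            int n;
            // read may fill only part of the array, so loop until -1
            while ((n = input.read(chunk, 0, chunk.length)) != -1) {
                // do something with chunk[0..n)
            }
        }
    }
}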
The simplest approach works if you do not have to read exactly 1MB, i.e. you can just read the file line by line and stop once you exceed 1MB. In that case just count the bytes you have read:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(myfile)));
String line = null;
int bytesCount = 0;
while ((line = reader.readLine()) != null) {
    // process the line
    bytesCount += line.getBytes().length; // note: line terminators are not counted
    if (bytesCount > 1024 * 1024) {
        // 1MB reached. Do what you need here.
    }
}
If, however, you need exactly 1MB, the task is a little more complicated, because you still want to use convenient tools for text reading like BufferedReader. In this case create your own input stream that counts bytes and wraps another input stream. Once the limit is reached your stream should return -1 as a marker of EOF, but it should also offer a method that signals it to continue reading the next chunk. The implementation only takes a couple of minutes, so I am leaving the details to you as an exercise; a rough sketch follows below.
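For the general idea, such a chunk-limiting stream might look like the following (all names are mine; I've called the "continue" method nextChunk() rather than overloading InputStream.reset()):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Presents the wrapped stream as a series of fixed-size chunks: read()
// reports end-of-stream once `limit` bytes have been consumed, and
// nextChunk() allows reading to continue with the next chunk.
public class ChunkedInputStream extends FilterInputStream {
    private final long limit;
    private long readInChunk = 0;

    public ChunkedInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        if (readInChunk >= limit) return -1; // pretend the chunk has ended
        int b = super.read();
        if (b != -1) readInChunk++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (readInChunk >= limit) return -1;
        int toRead = (int) Math.min(len, limit - readInChunk);
        int n = super.read(buf, off, toRead);
        if (n > 0) readInChunk += n;
        return n;
    }

    // Start the next 1MB (or whatever `limit` is) chunk.
    public void nextChunk() {
        readInChunk = 0;
    }
}

You would wrap this in an InputStreamReader/BufferedReader, read until readLine() returns null, call nextChunk(), and then build a fresh reader for the next chunk, since a BufferedReader that has already seen end-of-stream will not reliably resume.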
I have a log file that I am reading to a string
public static String read(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    FileInputStream fs = new FileInputStream(path);
    InputStream in = new BufferedInputStream(fs);
    int r;
    while ((r = in.read()) != -1) {
        sb.append((char) r);
    }
    fs.close();
    in.close();
    return sb.toString();
}
Then I have a parser that iterates over the entire string once
void parse () {
String con = read("log.txt");
for (int i = 0; i < con.length; i++) {
/* parsing action */
}
}
This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just place the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
How can I parse the file in one iteration over the contents and still have separate methods for parsing and reading?
In C# I understand there is some sort of yield return mechanism, but I'm stuck with Java.
What are my options in Java?
This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just place the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
It's worse than just a huge waste of CPU cycles. It's a huge waste of memory to read the entire file into a string if you're only going to use it once, and the use is looking at one character at a time moving forward, as your code indicates. And if your file is large, you'll exhaust memory.
You should parse as you read, and never have the entire file loaded into memory at once.
If the parsing action needs to be called from more than one place, make it a function and call it rather than copying the same code all over the place. Copying a single-line function call is fine.
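One common way to structure that in Java (the closest equivalent to C#'s yield return here is to invert control with a callback) is to have the read method push each piece of input into a handler. A rough sketch, with names of my own:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.function.IntConsumer;

public class StreamingParser {

    // Reading: streams the file one character at a time into a callback,
    // so the whole file is never held in memory.
    static void read(String path, IntConsumer handler) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            int r;
            while ((r = in.read()) != -1) {
                handler.accept(r);
            }
        }
    }

    // Parsing: receives one character at a time and keeps whatever state it needs.
    static class Parser implements IntConsumer {
        @Override
        public void accept(int c) {
            /* parsing action for the character c */
        }
    }

    public static void main(String[] args) throws IOException {
        Parser parser = new Parser();
        read("log.txt", parser); // single pass: parse while reading
    }
}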