How to read a text file (.log file) 1 MB at a time

I have a large log file and I want to read it 1 MB at a time.
For example, I have a 100 MB text file and want to read 1 MB per read, which takes 100 reads.
Any relevant ideas?

You can open the file as an InputStream and call read(byte[] b, int off, int len), passing the number of bytes to read in len and the index in b at which to start storing them in off. Alternatively, use read() to read one byte at a time from the InputStream and wrap the call in a loop:
for (int i = 0; i < 1048576; i++) {
    int b = input.read(); // returns -1 at end of file
    if (b == -1) break;
    // do something with b
}
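The read(byte[] b, int off, int len) variant could look roughly like this. ChunkReader and readChunk are hypothetical names, not JDK classes; the loop is there because a single read call is allowed to return fewer bytes than requested.

```java
import java.io.IOException;
import java.io.InputStream;

final class ChunkReader {
    static final int CHUNK_SIZE = 1024 * 1024; // 1 MB

    /**
     * Fills buf from in, looping until buf is full or the stream ends.
     * Returns the number of bytes stored; 0 means nothing was left to read.
     */
    static int readChunk(InputStream in, byte[] buf) throws IOException {
        int filled = 0;
        while (filled < buf.length) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n == -1) {
                break; // end of stream
            }
            filled += n;
        }
        return filled;
    }
}
```

To use it, allocate one byte[ChunkReader.CHUNK_SIZE], call readChunk in a loop, and stop when it returns 0. Keep in mind that a chunk boundary can fall in the middle of a line or of a multi-byte character.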

The simplest approach applies if you do not have to stop at exactly 1 MB, i.e. you can read the file line by line and stop once the total exceeds 1 MB. In this case just count the bytes you have read:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(myfile)));
String line = null;
int bytesCount = 0;
while ((line = reader.readLine()) != null) {
    // process the line
    bytesCount += line.getBytes().length; // does not include the line terminator
    if (bytesCount > 1024 * 1024) {
        // 1 MB reached. Do what you need here,
        // e.g. reset bytesCount before starting the next chunk.
    }
}
If, however, you need exactly 1 MB, the task is a little more complicated, because you still want to use convenient text-reading tools like BufferedReader. In this case, create your own input stream that wraps another input stream and counts bytes. Once the limit is reached, your stream should return -1 as an EOF marker. It should also implement a method such as reset() that signals it to continue reading. The implementation will take a couple of minutes, so I am leaving it to you as an exercise.
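One possible shape for such a counting stream, as a sketch only. ChunkLimitInputStream and nextChunk() are names made up for this example; nextChunk() is used instead of reset() because InputStream.reset() already has a different meaning (returning to a mark).

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper: reports EOF after `limit` bytes until nextChunk() is called. */
final class ChunkLimitInputStream extends FilterInputStream {
    private final long limit;
    private long remaining;

    ChunkLimitInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
        this.remaining = limit;
    }

    /** Re-arms the counter so that the next chunk can be read. */
    void nextChunk() {
        remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // pretend EOF at the chunk boundary
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1; // pretend EOF at the chunk boundary
        }
        // Never hand out more bytes than the current chunk still allows.
        int n = super.read(b, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```

If read still returns -1 right after nextChunk(), the underlying stream is really exhausted. How a Reader layered on top behaves after it has seen EOF once is something you would need to verify yourself; creating a new reader per chunk is the conservative choice, at the cost of possibly splitting a line or a multi-byte character at the boundary.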

Related

How to read a large log file that another process is currently writing

A log file is created each day; one file is about 400 MB, and the JVM has about 2 GB of memory.
Another process writes the large log file in 'a' (append) mode.
I want to read this file and achieve the following:
Read newly appended data as it is written
Store the offset so that reading can resume after a JVM restart
This is my simple implementation, but I don't know whether its time and memory consumption are good. Is there a better way to solve this problem?
public static void main(String[] args) throws IOException {
String filePath = "D://test.log";
long restoreOffset = resotoreOffset();
RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
randomAccessFile.seek(restoreOffset);
while (true) {
String line = randomAccessFile.readLine();
if(line != null) {
// doSomething(line);
restoreOffset = randomAccessFile.getFilePointer();
//storeOffset(restoreOffset);
}
}
}

It's not, unfortunately.
There are 2 major problems with this code. First I'll tackle the simple one, but the most important one is the second point.
Encoding issues
String line = randomAccessFile.readLine();
This line converts bytes to characters implicitly, and that's generally a bad idea, because bytes aren't characters, and converting from one to the other requires a charset encoding.
This method (readLine() from RandomAccessFile) is a bizarre case, probably because RandomAccessFile is an incredibly old API. It converts bytes to chars by treating each byte as a complete char whose Unicode code point equals the byte value. That is effectively ISO-8859-1, applied without asking you, so any multi-byte UTF-8 sequence comes out as garbage.
The upshot for you is: Unless you can guarantee that this log file shall always only ever contain ASCII characters, this code is broken, and readLine cannot be used at all. Instead you'll have to do considerably more work: read bytes until you hit a newline, then turn the bytes so gathered into a string with new String(byteArray, StandardCharsets.UTF_8), or use ByteBuffer and apply similar tactics. But keep reading, because solving the second problem kinda solves this one automatically.
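As a rough sketch of the "read bytes until you hit a newline, then decode" idea: Utf8Lines is a hypothetical helper, and it assumes the log is UTF-8 with '\n' line endings (a trailing '\r' is not stripped).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class Utf8Lines {
    /**
     * Collects raw bytes up to the next '\n' (or end of stream) and only then
     * decodes them as UTF-8. Returns null once the stream is exhausted.
     */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                break; // line terminator is not part of the result
            }
            line.write(b);
        }
        if (!sawAny) {
            return null;
        }
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Decoding only after the whole line has been collected means a multi-byte character is never cut in half. Pass in a buffered stream, because this helper calls read() once per byte.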
Buffering
Modern computer systems tend to like 'packeting'. You can't really operate on a single byte. Take SSDs (though this applies to spinning platter disks as well): The actual SSD hardware can't read single bytes. It can only read entire blocks worth of data.
When you therefore ask the OS explicitly for a single byte, that ends up setting off a chain of events that causes the SSD to read the entire block, then pass that entire block to the operating system, which will then disregard everything except the one byte you wanted, and returns just that.
If your code then asks for the next byte, we do that routine again.
So, if you read 1024 bytes consecutively from an SSD that has 1024-byte blocks, doing so by calling read() 1024 times causes, in this simplified model, 1024 reads, whereas calling read(byteArr) once, passing it a 1024-byte array, causes a single read.
Yup, that means the byte array solution issues about 1000 times fewer reads. The operating system's caching softens the hardware cost in practice, but every read() is still a separate call into the OS, so the byte array version remains far faster.
The same applies to networking, too. Sending 1 byte a thousand times is usually nearly 1000 times slower than sending 1000 bytes once; a TCP/IP packet on a typical Ethernet network can carry roughly 1460 bytes of data, so sending any less than that gains you almost nothing.
RAF's readLine() works like the first (bad) scenario: It reads bytes one at a time until it hits a newline character. Thus, reading a 100-character line costs about 100 separate reads instead of one.
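To make the difference in read counts visible without touching real hardware, here is a small sketch. ReadCallCounter is a hypothetical wrapper that counts how many read calls reach the underlying stream; an in-memory ByteArrayInputStream stands in for the file, so this shows call counts only, not timings.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter extends FilterInputStream {
    int calls = 0; // number of read calls that reached the wrapped stream

    ReadCallCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return super.read(b, off, len);
    }

    /** Reads `size` bytes one at a time, straight from the counted stream. */
    static int callsForSingleByteReads(int size) throws IOException {
        ReadCallCounter raw = new ReadCallCounter(new ByteArrayInputStream(new byte[size]));
        for (int i = 0; i < size; i++) {
            raw.read();
        }
        return raw.calls;
    }

    /** Reads `size` bytes one at a time, but through a buffer of `size` bytes. */
    static int callsForBufferedReads(int size) throws IOException {
        ReadCallCounter raw = new ReadCallCounter(new ByteArrayInputStream(new byte[size]));
        InputStream buffered = new BufferedInputStream(raw, size);
        for (int i = 0; i < size; i++) {
            buffered.read();
        }
        return raw.calls;
    }
}
```

For 1024 bytes, the unbuffered loop reaches the underlying stream 1024 times, while the buffered one reaches it once, because the first read() fills the whole buffer in a single bulk call.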
The solution
You may want to abandon RandomAccessFile entirely; it's quite an old API.
A major issue with buffering is that it's a lot harder unless you know how many bytes to read beforehand. Here, you don't know that: You want to keep reading until you hit a newline character, but you have no idea how long it'll be until we get there. Furthermore, buffering APIs tend to just return what's convenient, and may therefore read fewer bytes than we ask for (it'll always read at least 1, though, unless we hit end of file). So, we need to write code that will repeatedly read entire chunk's worth of data, analyse the chunk for a newline, and if it's not there, keep reading.
Furthermore, opening channels and such is expensive. So, if you want to dig through all log lines, writing code that opens a new channel every time is suboptimal.
How about this, using the newer file API from java.nio.file:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LogLineReader implements AutoCloseable {
    private final byte[] buffer = new byte[1024];
    // bb wraps buffer; bb.position() is the number of bytes currently buffered.
    private final ByteBuffer bb = ByteBuffer.wrap(buffer);
    private final SeekableByteChannel channel;
    private final Charset charset = StandardCharsets.UTF_8;

    public LogLineReader(Path p) throws IOException {
        this(p, 0L);
    }

    public LogLineReader(Path p, long startOffset) throws IOException {
        channel = Files.newByteChannel(p, StandardOpenOption.READ);
        channel.position(startOffset); // e.g. the offset you stored before a restart
    }

    @Override public void close() throws IOException {
        channel.close();
    }

    // This code buffers: First, our internal buffer is scanned
    // for a new line. If there is no full line in the buffer,
    // we read bytes from the file and check again until we find one.
    public String readLine() throws IOException {
        if (!channel.isOpen()) return null;
        int scanStart = 0;
        while (true) {
            // Scan through the bytes we have buffered for a newline.
            for (int i = scanStart; i < bb.position(); i++) {
                if (buffer[i] == '\n') {
                    // Found it. Take all bytes up to the new line, turn into
                    // a string.
                    String res = new String(buffer, 0, i, charset);
                    // Copy all bytes from _after_ the newline to the front.
                    int remaining = bb.position() - i - 1;
                    System.arraycopy(buffer, i + 1, buffer, 0, remaining);
                    // Adjust the position (which represents how many bytes are buffered).
                    bb.position(remaining);
                    return res;
                }
            }
            scanStart = bb.position();
            // If we get here, the buffer is empty or contains no newline.
            if (scanStart == bb.limit()) {
                throw new IOException("Log line too long");
            }
            int read = channel.read(bb); // let's fetch more bytes!
            if (read == -1) {
                // we've reached the end of the file.
                if (bb.position() == 0) return null;
                String res = new String(buffer, 0, bb.position(), charset);
                bb.position(0); // mark the buffer as consumed so this tail is returned only once
                return res;
            }
        }
    }
}
For the sake of efficiency, this code cannot deal with log lines longer than 1024 bytes; feel free to up that number. If you want to be capable of reading log lines of unlimited size, at some point a gigantic buffer becomes a problem. If you must, you could write code that resizes the buffer when you hit 1024, or you could update this code so that it keeps reading but returns only a truncated string with the first 1024 bytes. I'll leave that as an exercise for you.
NB: I also didn't test this, but at the very least it should give you the general gist of using SeekableByteChannel, and the concept of buffers.
To use:
Path p = Paths.get("D://logfile.txt");
try (LogLineReader reader = new LogLineReader(p)) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// do something with line
}
}
You must ensure the LogLineReader object is closed; hence, use try-with-resources.

Java: number of lines in a file without processing it

I need to know the number of lines in a file before processing it (or, in the worst-case scenario, read it twice), so I wrote this code, but it doesn't work. Maybe it's just not possible?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection (numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
StringTokenizer st = new StringTokenizer(reader2.readLine(), ",");
while (st.hasMoreElements()) {
System.out.println(st.nextElement());
}
}

Here's a similar question with java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
EDIT:
Here's a reference related to inputstreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."

You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
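For illustration, a minimal sketch of the mark()/reset() route. MarkResetCount is a hypothetical helper; it assumes the caller passes a stream that supports marking (for example a BufferedInputStream) and a readLimit at least as large as the input.

```java
import java.io.IOException;
import java.io.InputStream;

final class MarkResetCount {
    /**
     * Counts '\n' bytes, then rewinds to where the stream was on entry so the
     * caller can read the same data again. If more than readLimit bytes are
     * read, the mark may be invalidated and reset() will throw.
     */
    static int countLinesAndRewind(InputStream in, int readLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported by " + in.getClass().getName());
        }
        in.mark(readLimit);
        int lines = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                lines++;
            }
        }
        in.reset(); // back to the marked position
        return lines;
    }
}
```

This only suits inputs known to be small, since a buffered stream has to keep everything read after the mark in memory.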
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere in order to ensure your ability to do so. For example, you might copy it to a temporary file, or if you're prepared to rely on the input not being too large for it then you might store the contents in memory as a List of byte, byte[], char, or String.
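The temporary-file option might be sketched like this. TwoPassInput and its method names are made up for the example, and the line counter assumes the data is UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TwoPassInput {
    /** Copies the whole stream to a temporary file, which can then be opened as often as needed. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("two-pass-", ".tmp");
        // createTempFile already created the file, so it has to be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** First pass: counts lines the same way BufferedReader.readLine() splits them. */
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }
}
```

After counting, open the same Path again with Files.newBufferedReader for the real processing pass, and delete the temporary file when you are done.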

How to inflate a git tree object?

I'm writing some Java classes to read information from Git objects. Every class works the same way: the file is retrieved using the repo path and the hash, then it is opened, inflated and read a line at a time. This works very well for blobs and commits, but somehow the inflating doesn't work for tree objects.
The code I use to read the files is the same everywhere:
FileInputStream fis = new FileInputStream(path);
InflaterInputStream inStream = new InflaterInputStream(fis);
BufferedReader bf = new BufferedReader(new InputStreamReader(inStream));
and it works without issues for every object besides trees. When I try to read a tree this way I get this:
tree 167100644 README.mdDRwJiU��#�%?^>n��40000 dir1*�j4ކ��K-�������100644 file1�⛲��CK�)�wZ���S�100644 file2�⛲��CK�)�wZ���S�100644 file4�⛲��CK�)�wZ���S�
It seems that the file names and the octal mode are decoded the right way, while the hashes aren't (and I didn't have any problem decoding the other hashes with the above code). Is there some difference between the encoding of the hashes in tree objects and in the other git objects?

The core of the problem is that there are two encodings inside a git tree file (and this isn't very clear from the documentation). Most of the file is ASCII text, which can be read with whatever you like, but the hashes are not text-encoded; they are simply raw bytes.
Since there are two different encodings, the best solution is to read the file byte by byte, keeping in mind what's where.
My solution (I'm only interested in the names and hashes of the contents, so the rest is simply thrown away):
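FileInputStream fis = new FileInputStream(this.filepath);
InflaterInputStream inStream = new InflaterInputStream(fis);
int i = -1;
// Header ("tree <size>"), terminated by a 0 byte
while ((i = inStream.read()) > 0) {
    // skipped
}
// Content data: "<mode> <name>\0<20 raw hash bytes>", repeated until end of stream.
// The read() in the loop condition consumes the first mode digit, which is discarded anyway.
while ((i = inStream.read()) != -1) {
    while ((i = inStream.read()) != 0x20 && i != -1) { // 0x20 is the space char
        // Permission bytes
    }
    // Filename: 0-terminated (this cast assumes ASCII file names)
    StringBuilder filename = new StringBuilder();
    while ((i = inStream.read()) > 0) {
        filename.append((char) i);
    }
    // Hash: 20 bytes long, can contain any value, the only way
    // to be sure is to count the bytes
    StringBuilder hash = new StringBuilder();
    for (int count = 0; count < 20; count++) {
        i = inStream.read();
        hash.append(String.format("%02x", i)); // zero-pad so every byte yields two hex digits
    }
}
inStream.close();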

OIDs are stored raw in trees, not as text, so the answer to your question as asked in the title is "you're already doing it", and the answer to your question in the text is "yes."
To answer a "why do it that way?" follow-up: it has its upsides and downsides, and you hit a downside. There's not much point talking about it; the pain/gain ratio on any change to that decision would be horrendous.
and read a line at a time.
Don't Do That. One upside of the store-as-binary call is it breaks code that relies on never encountering an embedded newline much, much faster than would otherwise be the case. I recommend "if you misuse it or misunderstand it, it should break as fast as possible" as an excellent design rule to follow, right along with "be conservative in what you send, and liberal in what you accept".

In java, when reading in a file one character at a time, how do I determine EOF?

I have to read in a file, use an algorithm to encode each letter, and then print the letters to another file. I know that generally, to find the end of a file, you would use readLine and check whether it returns null. I am using a BufferedReader. Is there any way to check whether there is another character to read? Basically, how do I know that I just read the last character of the file?
I guess I could use readLine and see if there was another line, if I knew how to determine when I was at the end of my current line.
I found that the File class has a method called size() that supposedly returns the length of the file in bytes. Would that tell me how many characters are in the file? Could I do while(charCount<length)?

I don't exactly understand what you want to do. I guess you may want to read a file character by character. If so, you can do:
FileInputStream fileInput = new FileInputStream("file.txt");
int r;
while ((r = fileInput.read()) != -1) {
char c = (char) r;
// do something with the character c
}
fileInput.close();
FileInputStream.read() returns -1 when there are no more bytes to read. It returns an int and not a char, so a cast is needed.
Please note that this won't work if your file is in UTF-8 format and contains multi-byte characters. In that case you have to wrap the FileInputStream in an InputStreamReader and specify the appropriate charset. I'm omitting it here for the sake of simplicity.
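For completeness, a sketch of the charset-aware variant. CharReading is a hypothetical class name, and UTF-8 is an assumption about the file; substitute whatever encoding your file actually uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class CharReading {
    /** Reads the stream one decoded character at a time until read() returns -1 (EOF). */
    static String readCharByChar(InputStream in) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int r;
            while ((r = reader.read()) != -1) {
                char c = (char) r; // a decoded UTF-16 code unit, not a raw byte
                out.append(c);     // stand-in for "do something with the character c"
            }
        }
        return out.toString();
    }
}
```

The EOF check is the same as before: Reader.read() also returns -1 at the end. The difference is that a two-byte sequence such as the UTF-8 encoding of é now arrives as a single char.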

From my understanding, buffered streams return -1 from read() when there are no bytes left. So you could write:
BufferedInputStream in = new BufferedInputStream(new FileInputStream("filename"));
int currentChar;
while ((currentChar = in.read()) != -1) {
    //do something
}
in.close();

Java iteration reading & parsing

I have a log file that I am reading into a string
public static String read (String path) throws IOException {
StringBuilder sb = new StringBuilder();
FileInputStream fs = new FileInputStream(path);
InputStream in = new BufferedInputStream(fs);
int r;
while ((r = in.read()) != -1) {
sb.append((char)r);
}
fs.close();
in.close();
return sb.toString();
}
Then I have a parser that iterates over the entire string once
void parse () throws IOException {
    String con = read("log.txt");
    for (int i = 0; i < con.length(); i++) {
        /* parsing action */
    }
}
This is a huge waste of CPU cycles. I loop over all the content in read, then I loop over all the content in parse. I could just place the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
How can I parse the file in one iteration over the contents and still have separate methods for parsing and reading?
In C# I understand there is some sort of yield return thing, but I'm locked with Java.
What are my options in Java?

This is a huge waste of CPU cycles. I loop over all the content in read, then I loop over all the content in parse. I could just place the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
It's worse than just a huge waste of cpu cycles. It's a huge waste of memory to read the entire file into a string, if you're only going to use it once and the use is looking at one character at a time moving forward, as your code indicates. And if your file is large, you'll exhaust memory.
You should parse as you read, and never have the entire file loaded into memory at once.
If the parsing action needs to be called from more than one place, make it a function and call it rather than copying the same code all over the place. Copying a single-line function call is fine.
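One way to keep reading and parsing in separate methods while still making a single pass is to hand the parsing action to the reading loop as a callback. This is a sketch with made-up names (StreamingParser, forEachChar), not the only way to structure it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.IntConsumer;

final class StreamingParser {
    /**
     * The reading loop lives here once; the parsing action is passed in,
     * so each character is handled as soon as it is read and the file is
     * never held in memory as a whole.
     */
    static void forEachChar(Reader source, IntConsumer parsingAction) throws IOException {
        try (Reader in = new BufferedReader(source)) {
            int r;
            while ((r = in.read()) != -1) {
                parsingAction.accept(r);
            }
        }
    }
}
```

A caller could then write something like StreamingParser.forEachChar(new FileReader("log.txt"), c -> handle(c)), where handle is whatever the parsing action does; different parsers reuse the same loop without copying it.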
