Java: number of lines in a file without processing it

I need to know the number of lines of a file before processing it, because I need the line count before reading it, or in the worst-case scenario read it twice... so I wrote the code below, but it doesn't work. Maybe it's just not possible?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
    numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection(numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
    // tokenize the line we already read; calling readLine() again here would skip every other line
    StringTokenizer st = new StringTokenizer(line, ",");
    while (st.hasMoreElements()) {
        System.out.println(st.nextElement());
    }
}

Here's a similar question with java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
EDIT:
Here's a reference related to inputstreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."
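To illustrate that last idea, here is a minimal sketch of such a wrapper (the class and method names are invented for this example, not from any library):
class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;          // count the single byte handed out
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;         // count the chunk handed out
        return n;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
Comparing getBytesRead() against the known file length (e.g. File.length()) then gives a rough progress estimate, with the read-ahead caveat mentioned in the quote.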

You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
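For illustration, a minimal sketch of what using that API looks like, assuming the stream actually supports it (getInputStream() here is the method from your code):
InputStream in = new BufferedInputStream(getInputStream()); // BufferedInputStream supports mark/reset
if (in.markSupported()) {
    in.mark(8 * 1024 * 1024); // the mark stays valid only while fewer than this many bytes are read past it
    // ... read ahead here, e.g. to count lines ...
    in.reset();               // rewind to the mark, provided the limit wasn't exceeded
}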
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere in order to ensure your ability to do so. For example, you might copy it to a temporary file, or if you're prepared to rely on the input not being too large for it then you might store the contents in memory as a List of byte, byte[], char, or String.
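And if you go the stash-it-somewhere route, here is a sketch of the temporary-file variant using java.nio.file (again assuming getInputStream() is the method from your code):
Path tmp = Files.createTempFile("input", ".txt");        // stash the stream's data
try (InputStream in = getInputStream()) {
    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
}
long numLines;
try (Stream<String> lines = Files.lines(tmp)) {           // first pass: count lines
    numLines = lines.count();
}
try (BufferedReader reader = Files.newBufferedReader(tmp)) {
    // second pass: process the data, now that numLines is known
}
Files.delete(tmp);                                        // clean up when done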

Related

How to read a large log file that another process is currently writing

Log files are created by day; each file is about 400MB, and the JVM has about 2GB of memory.
One process writes the large log file in append ('a') mode.
I want to read this file and achieve the following:
Keep reading the newly written data as it is appended
Store the offset so that reading can resume after a JVM restart
This is my simple implementation, but I don't know whether its time and memory consumption are acceptable. I want to know if there is a better way to solve this problem.
public static void main(String[] args) throws IOException {
    String filePath = "D://test.log";
    long restoreOffset = resotoreOffset();
    RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
    randomAccessFile.seek(restoreOffset);
    while (true) {
        String line = randomAccessFile.readLine();
        if (line != null) {
            // doSomething(line);
            restoreOffset = randomAccessFile.getFilePointer();
            //storeOffset(restoreOffset);
        }
    }
}
It's not, unfortunately.
There are 2 major problems with this code. First I'll tackle the simple one, but the most important one is the second point.
Encoding issues
String line = randomAccessFile.readLine();
This line converts bytes to characters implicitly, and that's generally a bad idea, because bytes aren't characters, and converting from one to the other requires a charset encoding.
This method (readLine() from RAF) is a bizarre case - probably because RandomAccessFile is an incredibly old API. Using this method will apply some bizarro ISO-8859-1-esque charset encoding: it converts bytes to chars by taking each byte as a complete char, assuming each byte value is the corresponding Unicode code point, which isn't actually a sane encoding, just lazy programming.
The upshot for you is: Unless you can guarantee that this log file shall always only ever contain ASCII characters, this code is broken, and readLine cannot be used at all. Instead you'll have to do considerably more work: read bytes until you hit a newline, then turn the bytes so gathered into a string with new String(byteArray, StandardCharsets.UTF_8), or use ByteBuffer and apply similar tactics. But keep reading, because solving the second problem kinda solves this one automatically.
Buffering
Modern computer systems tend to like 'packeting'. You can't really operate on a single byte. Take SSDs (though this applies to spinning platter disks as well): The actual SSD hardware can't read single bytes. It can only read entire blocks worth of data.
When you therefore ask the OS explicitly for a single byte, that sets off a chain of events that causes the SSD to read the entire block and pass it to the operating system, which then disregards everything except the one byte you wanted and returns just that.
If your code then asks for the next byte, we do that routine again.
So, if you read 1024 bytes consecutively from an SSD that has 1024-byte blocks, doing so by calling read() 1024 times causes the SSD to perform 1024 reads, whereas calling read(byteArr) once, passing it a 1024-byte array, causes the SSD to perform a single read.
Yup, that means the byte array solution is literally 1000 times faster.
The same applies to networking, too. Sending 1 byte a thousand times is usually nearly 1000 times slower than sending 1000 bytes once; a typical TCP/IP packet can carry about 1500 bytes' worth of data, so sending any less than that gains you almost nothing.
RAF's readLine() works like the first (bad) scenario: It reads bytes one at a time until it hits a newline character. Thus, to read a 100 character string, it's 100x slower than just knowing you need to read 100 characters and reading them in one go.
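To make the difference concrete, here is an illustrative sketch (using the file path from your code) of the two ways to read the same file with a plain, unbuffered FileInputStream:
// One OS-level request per byte: the pattern RAF's readLine() effectively uses.
try (InputStream in = new FileInputStream("D://test.log")) {
    int b;
    while ((b = in.read()) != -1) {
        // handle one byte at a time
    }
}

// One request per 8 KB chunk: orders of magnitude fewer round trips.
try (InputStream in = new FileInputStream("D://test.log")) {
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
        // handle n bytes at once
    }
}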
The solution
You may want to abandon RandomAccessFile entirely; it's quite an old API.
A major issue with buffering is that it's a lot harder unless you know beforehand how many bytes to read. Here, you don't know that: you want to keep reading until you hit a newline character, but you have no idea how long it will be until you get there. Furthermore, buffering APIs tend to just return what's convenient, and may therefore read fewer bytes than asked for (they'll always read at least 1, though, unless end of file is hit). So we need code that repeatedly reads an entire chunk's worth of data, scans the chunk for a newline, and, if it isn't there, keeps reading.
Furthermore, opening channels and such is expensive. So, if you want to dig through all log lines, writing code that opens a new channel every time is suboptimal.
How about this, using the newer file API from java.nio.file:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LogLineReader implements AutoCloseable {
    private final byte[] buffer = new byte[1024];
    private final ByteBuffer bb = ByteBuffer.wrap(buffer);
    private final SeekableByteChannel channel;
    private final Charset charset = StandardCharsets.UTF_8;

    public LogLineReader(Path p) throws IOException {
        channel = Files.newByteChannel(p, StandardOpenOption.READ);
        channel.position(111L); // you seek to pos 111 in your code...
    }

    @Override public void close() throws IOException {
        channel.close();
    }

    // This code buffers: First, our internal buffer is scanned
    // for a new line. If there is no full line in the buffer,
    // we read bytes from the file and check again until we find one.
    public String readLine() throws IOException {
        if (!channel.isOpen()) return null;
        int scanStart = 0;
        while (true) {
            // Scan through the bytes we have buffered for a newline.
            // bb.position() is how many bytes are currently buffered.
            for (int i = scanStart; i < bb.position(); i++) {
                if (buffer[i] == '\n') {
                    // Found it. Take all bytes up to the new line, turn into
                    // a string.
                    String res = new String(buffer, 0, i, charset);
                    // Copy all bytes from _after_ the newline to the front.
                    System.arraycopy(buffer, i + 1, buffer, 0, bb.position() - i - 1);
                    // Adjust the position (which represents how many bytes are buffered).
                    bb.position(bb.position() - i - 1);
                    return res;
                }
            }
            scanStart = bb.position();
            // If we get here, the buffer is empty or contains no newline.
            if (scanStart == bb.limit()) {
                throw new IOException("Log line too long");
            }
            int read = channel.read(bb); // let's fetch more bytes!
            if (read == -1) {
                // we've reached the end of the file.
                if (bb.position() == 0) return null;
                String res = new String(buffer, 0, bb.position(), charset);
                bb.position(0); // buffer is now consumed; the next call returns null
                return res;
            }
        }
    }
}
For the sake of efficiency, this code cannot deal with log lines longer than 1024 bytes; feel free to raise that number. If you want to be able to read log lines of unlimited size, at some point a gigantic buffer becomes a problem. If you must, you could write code that resizes the buffer when you hit the 1024 limit, or you could change this code so that it keeps reading but only returns a truncated string with the first 1024 characters. I'll leave that as an exercise for you.
NB: I also didn't test this, but at the very least it should give you the general gist of using SeekableByteChannel, and the concept of buffers.
To use:
Path p = Paths.get("D://logfile.txt");
try (LogLineReader reader = new LogLineReader(p)) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // do something with line
    }
} // note: the constructor and readLine() throw IOException, so catch it or declare it
You must ensure the LLR object is closed, hence, use try-with-resources.

How to inflate a git tree object?

I'm writing some Java classes to read information from Git objects. Every class works in the same way: the file is located using the repo path and the hash, then it is opened, inflated, and read a line at a time. This works very well for blobs and commits, but somehow the inflating doesn't work for tree objects.
The code I use to read the files is the same everywhere:
FileInputStream fis = new FileInputStream(path);
InflaterInputStream inStream = new InflaterInputStream(fis);
BufferedReader bf = new BufferedReader(new InputStreamReader(inStream));
and it works without issues for every object besides trees. When I try to read a tree this way, I get this:
tree 167100644 README.mdDRwJiU��#�%?^>n��40000 dir1*�j4ކ��K-�������100644 file1�⛲��CK�)�wZ���S�100644 file2�⛲��CK�)�wZ���S�100644 file4�⛲��CK�)�wZ���S�
It seems that the file names and the octal mode are decoded the right way, while the hashes aren't (and I didn't have any problem decoding the other hashes with the above code). Is there some difference between the encoding of the hashes in tree objects and in the other git objects?
The core of the problem is that there are two encodings inside a git tree file (and this isn't so clear from the documentation). Most of the file is ASCII text, which means it can be read with whatever you like, but the hashes are not encoded as text at all: they are simply raw bytes.
Since there are two different encodings, the best solution is to read the file byte by byte, keeping in mind what is where.
My solution (I'm only interested in the name and hashes of the contents, so the rest is simply thrown away):
FileInputStream fis = new FileInputStream(this.filepath);
InflaterInputStream inStream = new InflaterInputStream(fis);
int i = -1;
while ((i = inStream.read()) != 0) {
    //Header ("tree <size>"), terminated by a NUL byte
}
//Content data
while ((i = inStream.read()) != -1) {
    while ((i = inStream.read()) != 0x20) { //0x20 is the space char
        //Permission (mode) bytes
    }
    //Filename: 0-terminated
    String filename = "";
    while ((i = inStream.read()) != 0) {
        filename += (char) i;
    }
    //Hash: 20 bytes long, can contain any value; the only way
    //to be sure is to count the bytes
    String hash = "";
    for (int count = 0; count < 20; count++) {
        i = inStream.read();
        hash += String.format("%02x", i); //zero-pad: Integer.toHexString would drop leading zeros
    }
}
OIDs are stored raw in trees, not as text, so the answer to your question as asked in the title is "you're already doing it", and the answer to your question in the text is "yes."
To answer a possible "why do it that way?" follow-up: it's got its upsides and downsides, and you hit a downside. There's not much point talking about it; the pain/gain ratio of any change to that decision would be horrendous.
and read a line at a time.
Don't Do That. One upside of the store-as-binary call is it breaks code that relies on never encountering an embedded newline much, much faster than would otherwise be the case. I recommend "if you misuse it or misunderstand it, it should break as fast as possible" as an excellent design rule to follow, right along with "be conservative in what you send, and liberal in what you accept".

How to read text file[.log file] every 1Mb

I have a large log file and I want to read it 1 MB at a time.
Example: I have a 100 MB text file and I want to read 1 MB at a time, which needs 100 reads.
Any relevant ideas?
You can open your file with an InputStream and then call read(byte[] b, int off, int len), passing the total number of bytes to be read in len and the right offset in off, or just use read() to read one byte of the InputStream at a time and wrap a loop around that statement:
for (int i = 0; i < 1048576; i++) {
    int b = input.read();
    if (b == -1) break; // stop at end of file
    //do something with b
}
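If you go with the read(byte[] b, int off, int len) variant mentioned above, here is a sketch (assuming input is the same InputStream) that fills one 1 MB chunk at a time; note that read may return fewer bytes than requested, so you keep looping until the chunk is full or the file ends:
byte[] chunk = new byte[1048576];      // 1 MB
int filled = 0;
while (filled < chunk.length) {
    int n = input.read(chunk, filled, chunk.length - filled);
    if (n == -1) {
        break;                         // end of file reached before the chunk filled up
    }
    filled += n;
}
// process the first 'filled' bytes of chunk, then repeat for the next chunk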
The simplest approach applies if you do not have to read exactly 1 MB, i.e. you just read the file line by line and stop once the total exceeds 1 MB. In this case, just count the bytes you have read:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(myfile)));
String line = null;
int bytesCount = 0;
while ((line = reader.readLine()) != null) {
    // process the line
    bytesCount += line.getBytes().length; // note: line terminators are not counted
    if (bytesCount > 1024 * 1024) {
        // 1MB reached. Do what you need here.
    }
}
If, however, you need exactly 1 MB, the task is a little more complicated, because you still want to use convenient tools for text reading like BufferedReader. In this case, create your own input stream that counts bytes and wraps another input stream. Once the limit is reached, your stream should return -1 as a marker of EOF. However, it should also implement a reset() method that signals it to continue reading. The implementation will take a couple of minutes, so I am leaving it to you as an exercise.
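Purely as an illustration of the shape such a wrapper could take (the class name and details are invented here), one possible sketch: it reports end-of-stream after limit bytes, and reset() re-arms it for the next chunk.
class ChunkLimitInputStream extends FilterInputStream {
    private final int limit;
    private int readInChunk = 0;

    ChunkLimitInputStream(InputStream in, int limit) {
        super(in);
        this.limit = limit;
    }

    @Override public int read() throws IOException {
        if (readInChunk >= limit) return -1;            // pretend the chunk has ended
        int b = super.read();
        if (b != -1) readInChunk++;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        if (readInChunk >= limit) return -1;
        int n = super.read(buf, off, Math.min(len, limit - readInChunk));
        if (n > 0) readInChunk += n;
        return n;
    }

    @Override public synchronized void reset() {
        readInChunk = 0;                                // continue with the next chunk
    }
}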

Java large String returned from findWithinHorizon converted to InputStream

I have written an application, and one of its modules parses a huge file and saves it chunk by chunk into a database.
First of all, the following code works; my main goal is to reduce memory usage and generally increase performance.
The following code snippet is a small part of the big picture, but it is the most problematic part according to some YourKit profiling: the lines marked with /*HERE*/ allocate a huge amount of memory.
....
Scanner fileScanner = new Scanner(file, "UTF-8");
String scannedFarm;
try {
    Pattern p = Pattern.compile("(?:^.++$(?:\\r?+\\n)?+){2,100000}+", Pattern.MULTILINE);
    String[] tableName = null;
    /*HERE*/ while ((scannedFarm = fileScanner.findWithinHorizon(p, 0)) != null) {
        boolean continuePrevStream = false;
        Scanner scanner = new Scanner(scannedFarm);
        String[] tmpTableName = scanner.nextLine().split(getSeparator());
        if (tmpTableName.length == 2) {
            tableName = tmpTableName;
        } else {
            if (tableName == null) {
                continue;
            }
            continuePrevStream = true;
        }
        scanner.close();
        /*HERE*/ InputStream is = new ByteArrayInputStream(scannedFarm.getBytes("UTF-8"));
....
It is acceptable to allocate a large amount of memory, since the String is large (I need it to be such a large chunk). My main problem is that the same allocation happens twice as a result of getBytes.
So my question is: is there a way to feed the findWithinHorizon result directly into an InputStream without allocating the memory twice?
Is there a more efficient way to achieve the same functionality?
Not exactly the same approach, but instead of findWithinHorizon you could try reading each line and searching for the pattern within that line. This is sure to reduce memory pressure, because you're not buffering the whole file, as the API docs state:
If horizon is 0, then the horizon is ignored and this method continues
to search through the input looking for the specified pattern without
bound. In this case it may buffer all of the input searching for the
pattern.
Something like:
while (fileScanner.hasNextLine()) {
    String line = fileScanner.nextLine();
    if (linePattern.matcher(line).find()) { // whatever pattern you are grepping for in the line
        // handle the match
    }
}

Java iteration reading & parsing

I have a log file that I am reading to a string
public static String read(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    FileInputStream fs = new FileInputStream(path);
    InputStream in = new BufferedInputStream(fs);
    int r;
    while ((r = in.read()) != -1) {
        sb.append((char) r);
    }
    fs.close();
    in.close();
    return sb.toString();
}
Then I have a parser that iterates over the entire string once
void parse() throws IOException {
    String con = read("log.txt");
    for (int i = 0; i < con.length(); i++) {
        /* parsing action */
    }
}
This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just put the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
How can I parse the file in one iteration over the contents and still have separate methods for parsing and reading?
In C# I understand there is some sort of yield return mechanism, but I'm stuck with Java.
What are my options in Java?
This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just put the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
It's worse than just a huge waste of CPU cycles. It's a huge waste of memory to read the entire file into a string if you're only going to use it once, and that use looks at one character at a time moving forward, as your code indicates. And if your file is large, you'll exhaust memory.
You should parse as you read, and never have the entire file loaded into memory at once.
If the parsing action needs to be called from more than one place, make it a function and call it rather than copying the same code all over the place. Copying a single-line function call is fine.
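As a sketch of that idea (not the original poster's code; a callback is just one way to keep the methods separate), reading and parsing stay in their own methods while the file is traversed only once and never held in memory:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.IntConsumer;

public class StreamingParse {

    // Reads the file character by character and hands each one to the callback.
    static void read(String path, IntConsumer parseAction) throws IOException {
        try (Reader in = new BufferedReader(new FileReader(path))) {
            int r;
            while ((r = in.read()) != -1) {
                parseAction.accept(r);   // parse as we read; the file is never buffered whole
            }
        }
    }

    static void parse() throws IOException {
        read("log.txt", c -> {
            /* parsing action on character c */
        });
    }
}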
