Java iteration reading & parsing

I have a log file that I am reading to a string
public static String read(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    FileInputStream fs = new FileInputStream(path);
    InputStream in = new BufferedInputStream(fs);
    int r;
    while ((r = in.read()) != -1) {
        sb.append((char) r);
    }
    fs.close();
    in.close();
    return sb.toString();
}
Then I have a parser that iterates over the entire string once
void parse() throws IOException {
    String con = read("log.txt");
    for (int i = 0; i < con.length(); i++) {
        /* parsing action */
    }
}
This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just place the /* parsing action */ under the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
How can I parse the file in one iteration over the contents and still have separate methods for parsing and reading?
In C# I understand there is some sort of yield return thing, but I'm stuck with Java.
What are my options in Java?

This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just place the /* parsing action */ under the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
It's worse than just a huge waste of CPU cycles. It's also a huge waste of memory to read the entire file into a string if you're only going to use it once, and your use looks at one character at a time moving forward, as your code indicates. If your file is large, you'll exhaust memory.
You should parse as you read, and never have the entire file loaded into memory at once.
If the parsing action needs to be called from more than one place, make it a function and call it rather than copying the same code all over the place. Copying a single-line function call is fine.
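For example, one way to keep reading and parsing in separate methods while still making a single pass is to have the read loop hand each character to a parsing callback. Below is a minimal sketch; the CharParser interface and the lambda are illustrative names, not part of the original code, and FileReader uses the platform default charset:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LogProcessor {
    // Hypothetical callback interface: the read loop feeds characters to it.
    interface CharParser {
        void accept(char c);
    }

    // Reads the file once, passing each character to the parser as it goes,
    // so the whole file is never held in memory.
    static void read(String path, CharParser parser) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            int r;
            while ((r = in.read()) != -1) {
                parser.accept((char) r);
            }
        }
    }

    void parse() throws IOException {
        read("log.txt", c -> {
            /* parsing action */
        });
    }
}

The reading logic lives in one place, the parsing logic in another, and the file is traversed exactly once.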

Related

how to read a large log file which other process current write

A log file is created per day; each file is about 400 MB, and the JVM has about 2 GB of memory.
Another process writes the large log file in append ('a') mode.
I want to read this file and be able to do the following:
Read newly appended data as it is written
Store the offset so I can resume reading after a JVM restart
This is my simple implementation, but I don't know if the time and memory consumption are good. I want to know if there is a better way to solve this problem.
public static void main(String[] args) throws IOException {
    String filePath = "D://test.log";
    long restoreOffset = restoreOffset();
    RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
    randomAccessFile.seek(restoreOffset);
    while (true) {
        String line = randomAccessFile.readLine();
        if (line != null) {
            // doSomething(line);
            restoreOffset = randomAccessFile.getFilePointer();
            // storeOffset(restoreOffset);
        }
    }
}
It's not, unfortunately.
There are two major problems with this code. I'll tackle the simpler one first, but the second one is the more important.
Encoding issues
String line = randomAccessFile.readLine();
This line converts bytes to characters implicitly, and that's generally a bad idea, because bytes aren't characters, and converting from one to the other requires a charset encoding.
This method (readLine() from RandomAccessFile) is a bizarre case, probably because RandomAccessFile is an incredibly old API. It applies a bizarro ISO-8859-1-esque charset encoding: it converts bytes to chars by treating each byte as a complete char, assuming the byte value is the Unicode code point. That isn't actually a sane encoding, just a lazy one.
The upshot for you is: Unless you can guarantee that this log file shall always only ever contain ASCII characters, this code is broken, and readLine cannot be used at all. Instead you'll have to do considerably more work: read bytes until you hit a newline, then turn the bytes so gathered into a string with new String(byteArray, StandardCharsets.UTF_8), or use ByteBuffer and apply similar tactics. But keep reading, because solving the second problem kinda solves this one automatically.
Buffering
Modern computer systems tend to like 'packeting'. You can't really operate on a single byte. Take SSDs (though this applies to spinning platter disks as well): The actual SSD hardware can't read single bytes. It can only read entire blocks worth of data.
When you therefore ask the OS explicitly for a single byte, that ends up setting off a chain of events that causes the SSD to read the entire block, then pass that entire block to the operating system, which will then disregard everything except the one byte you wanted, and returns just that.
If your code then asks for the next byte, we do that routine again.
So, if you read 1024 bytes consecutively from an SSD that has 1024-byte blocks, doing so by calling read() 1024 times causes the SSD to perform 1024 reads, whereas calling read(byteArr) once, passing it a 1024-byte array, causes the SSD to perform a single read.
Yup, that means the byte array version is roughly a thousand times faster in that scenario.
The same applies to networking, too. Sending 1 byte a thousand times is usually nearly 1000 times slower than sending 1000 bytes once; a TCP/IP packet can carry roughly 1500 bytes worth of data (the common Ethernet MTU), so sending any less than that gains you almost nothing.
RAF's readLine() works like the first (bad) scenario: It reads bytes one at a time until it hits a newline character. Thus, to read a 100 character string, it's 100x slower than just knowing you need to read 100 characters and reading them in one go.
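To make the difference concrete, here is a minimal sketch (the file name is made up) contrasting the two read styles; the second issues one read call per buffer-sized chunk instead of one per byte:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadStyles {
    public static void main(String[] args) throws IOException {
        // Byte-at-a-time: one read call (and potentially one block fetch) per byte.
        try (InputStream in = new FileInputStream("some.log")) {
            int b;
            while ((b = in.read()) != -1) {
                // process b
            }
        }

        // Chunked: one read call fills a whole buffer at once.
        try (InputStream in = new FileInputStream("some.log")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                // process buf[0..n)
            }
        }
    }
}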
The solution
You may want to abandon RandomAccessFile entirely; it's quite an old API.
A major issue with buffering is that it's a lot harder unless you know how many bytes to read beforehand. Here, you don't know that: you want to keep reading until you hit a newline character, but you have no idea how long it'll take to get there. Furthermore, buffering APIs tend to just return what's convenient, and may therefore read fewer bytes than we ask for (they'll always read at least 1, though, unless we hit end of file). So, we need to write code that repeatedly reads an entire chunk's worth of data, scans the chunk for a newline, and if it's not there, keeps reading.
Furthermore, opening channels and such is expensive. So, if you want to dig through all log lines, writing code that opens a new channel every time is suboptimal.
How about this, using the newer file API from java.nio.file:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LogLineReader implements AutoCloseable {
    private final byte[] buffer = new byte[1024];
    private final ByteBuffer bb = ByteBuffer.wrap(buffer);
    private final SeekableByteChannel channel;
    private final Charset charset = StandardCharsets.UTF_8;

    public LogLineReader(Path p) throws IOException {
        channel = Files.newByteChannel(p, StandardOpenOption.READ);
        channel.position(111L); // seek to your restored offset here, as you do in your code
    }

    @Override public void close() throws IOException {
        channel.close();
    }

    // This code buffers: first, our internal buffer is scanned
    // for a newline. If there is no full line in the buffer,
    // we read bytes from the file and check again until we find one.
    public String readLine() throws IOException {
        if (!channel.isOpen()) return null;
        int scanStart = 0;
        while (true) {
            // Scan through the bytes we have buffered for a newline.
            for (int i = scanStart; i < bb.position(); i++) {
                if (buffer[i] == '\n') {
                    // Found it. Take all bytes up to the newline, turn them into a string.
                    String res = new String(buffer, 0, i, charset);
                    // Copy all bytes from _after_ the newline to the front.
                    System.arraycopy(buffer, i + 1, buffer, 0, bb.position() - i - 1);
                    // Adjust the position (which represents how many bytes are buffered).
                    bb.position(bb.position() - i - 1);
                    return res;
                }
            }
            scanStart = bb.position();
            // If we get here, the buffer is empty or contains no newline.
            if (scanStart == bb.limit()) {
                throw new IOException("Log line too long");
            }
            int read = channel.read(bb); // let's fetch more bytes!
            if (read == -1) {
                // We've reached the end of the file.
                if (bb.position() == 0) return null;
                String res = new String(buffer, 0, bb.position(), charset);
                bb.position(0); // don't hand back the same unterminated tail again
                return res;
            }
        }
    }
}
For the sake of efficiency, this code cannot deal with log lines longer than 1024 bytes; feel free to raise that number. If you want to be able to read arbitrarily long log lines, at some point a gigantic buffer becomes a problem. If you must, you could write code that resizes the buffer when you hit 1024, or update this code so that it keeps reading but only returns a truncated string with the first 1024 characters. I'll leave that as an exercise for you.
NB: I also didn't test this, but at the very least it should give you the general gist of using SeekableByteChannel, and the concept of buffers.
To use:
Path p = Paths.get("D://logfile.txt");
try (LogLineReader reader = new LogLineReader(p)) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // do something with line
    }
}
You must ensure the LLR object is closed, hence, use try-with-resources.

Java: number of lines in a file without processing it

I need to know the number of lines of a file before processing it, because I need to know the line count before reading it, or in the worst-case scenario read it twice... so I made this code, but it doesn't work. Maybe it's just not possible?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
    numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection(numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
    StringTokenizer st = new StringTokenizer(reader2.readLine(), ",");
    while (st.hasMoreElements()) {
        System.out.println(st.nextElement());
    }
}
Here's a similar question with java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
EDIT:
Here's a reference related to inputstreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."
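A minimal sketch of the kind of wrapper described there (the class name is made up; it is not a library class), counting bytes as they pass through:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper that counts bytes as they are read, so a caller that
// knows the total size (e.g. File.length()) can estimate progress.
class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}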
You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
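For what it's worth, here is a tiny sketch of that mark/reset facility using BufferedInputStream, which does support it; the read limit is an assumption you must size to your data:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {
    static void peekThenRead(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(1_000_000);    // mark stays valid only while we read less than ~1 MB past it
        int first = in.read(); // look at some data...
        in.reset();            // ...then rewind to the marked position
        // 'first' is now available again from the stream.
    }
}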
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere so that you can. For example, you might copy it to a temporary file, or, if you're prepared to rely on the input not being too large, you might store the contents in memory as a byte[], a char[], a String, or a List of lines.
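For example, here is a sketch of the stash-in-memory approach (assuming the data fits comfortably in memory): copy the stream into a byte array once, count the lines, then re-read from that array as often as needed:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class CountThenProcess {
    static void process(InputStream source) throws IOException {
        // Stash the whole stream in memory so it can be read twice.
        ByteArrayOutputStream copy = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = source.read(buf)) != -1) {
            copy.write(buf, 0, n);
        }
        byte[] data = copy.toByteArray();

        // First pass: count lines.
        int lines = 0;
        for (byte b : data) {
            if (b == '\n') lines++;
        }

        // Second pass: process the same data again.
        InputStream again = new ByteArrayInputStream(data);
        // ... hand 'again' to the real parsing code, now that 'lines' is known
    }
}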

Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?

I am basically looking for a solution that allows me to stream the lines and replace them IN THE SAME FILE, a la Files.lines
Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?
Basically, no.
Any change to a file that involves changing the number of bytes between offsets A and B can only be done by rewriting the file, or creating a new one. In either case, everything after B has to be loaded / read into memory.
This is not a Java-specific restriction. It is a consequence of the way that modern operating systems represent files, and of the low-level (i.e. syscall) APIs that they provide to applications.
In the specific case where you replace one line (or sequence of lines) with a line (or sequence of lines) of exactly the same length, then you can do the replacement using either RandomAccessFile, or by mapping the file into memory. Note that the latter approach won't cause the entire file to be read into memory.
It is also possible to replace or delete lines while updating the file "in place" (changing the file length ...). See @Sergio Montoro's answer for an example. However, with an in place update, there is a risk that the file will be corrupted if the application is interrupted. And this does involve reading and rewriting all bytes in the file after the insertion / deletion point. And that entails loading them into memory.
There was a mechanism in Java 1: RandomAccessFile; but any such in-place mechanism requires that you know the start offset of the line, and that the new line is the same length as the old one.
Otherwise you have to copy the file up to that line, substitute the new line in the output, and then continue the copy.
You certainly don't have to load the entire file into memory.
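As a concrete illustration of that copy approach (a sketch with made-up file names and a trivial transformation), you can stream the lines out to a temporary file and then replace the original, so only one line is held in memory at a time:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class RewriteLines {
    public static void main(String[] args) throws IOException {
        Path original = Paths.get("big.txt");                   // made-up name
        Path temp = Files.createTempFile("rewrite", ".tmp");

        try (Stream<String> lines = Files.lines(original, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            for (String line : (Iterable<String>) lines::iterator) {
                out.write(line.replace("foo", "bar"));          // example transformation
                out.newLine();
            }
        }

        // Replace the original with the rewritten copy.
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
    }
}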
Yes.
A FileChannel allows random read/write to any position of a file. Therefore, if you have a read ahead buffer which is long enough you can replace lines even if the new line is longer than the former one.
The following example is a toy implementation which makes two assumptions: 1st) the input file is ISO-8859-1 Unix LF encoded and 2nd) each new line is never going to be longer than the next line (one line read ahead buffer).
Unless you definitely cannot create a temporary file, you should benchmark this approach against the more natural stream in -> stream out, because I do not know what performance a spinning drive will give you for an algorithm that constantly moves backward and forward in a file.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import static java.nio.file.StandardOpenOption.*;
import java.io.IOException;

public class ReplaceInFile {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        ByteBuffer writeBuffer;
        long readPos = 0L;
        long writePos;
        String line_m;
        String line_n;
        String line_t;
        FileChannel channel = FileChannel.open(file, READ, WRITE);
        channel.position(0);
        writePos = readPos;
        line_m = readLine(channel);
        do {
            readPos += line_m.length() + 1;
            channel.position(readPos);
            line_n = readLine(channel);
            line_t = transformLine(line_m) + "\n";
            writeBuffer = ByteBuffer.allocate(line_t.length());
            writeBuffer.put(line_t.getBytes("ISO8859_1"));
            System.out.print("replaced line " + line_m + " with " + line_t);
            channel.position(writePos);
            writeBuffer.rewind();
            while (writeBuffer.hasRemaining()) {
                channel.write(writeBuffer);
            }
            writePos += line_t.length();
            line_m = line_n;
            // Never write past data that has already been read (one line of read-ahead).
            assert writePos <= readPos + line_n.length() + 1;
        } while (line_m.length() > 0);
        channel.close();
        System.out.println("Done!");
    }

    public static String transformLine(String input) throws IOException {
        return input.replace("<", "&lt;").replace(">", "&gt;");
    }

    public static String readLine(FileChannel channel) throws IOException {
        ByteBuffer readBuffer = ByteBuffer.allocate(1);
        StringBuffer line = new StringBuffer();
        do {
            int read = channel.read(readBuffer);
            if (read < 1) break;
            readBuffer.rewind();
            char c = (char) readBuffer.get();
            readBuffer.rewind();
            if (c == '\n') break;
            line.append(c);
        } while (true);
        return line.toString();
    }
}

Java large String returned from findWithinHorizon converted to InputStream

I have written an application which, in one of its modules, parses a huge file and saves it chunk by chunk into a database.
First of all, the following code works; my main problem is to reduce memory usage and generally increase performance.
The following code snippet is a small part of the big picture, but it is the most problematic part after doing some YourKit profiling. The lines marked with /*HERE*/ allocate a huge amount of memory.
....
Scanner fileScanner = new Scanner(file, "UTF-8");
String scannedFarm;
try {
    Pattern p = Pattern.compile("(?:^.++$(?:\\r?+\\n)?+){2,100000}+", Pattern.MULTILINE);
    String[] tableName = null;
    /*HERE*/ while ((scannedFarm = fileScanner.findWithinHorizon(p, 0)) != null) {
        boolean continuePrevStream = false;
        Scanner scanner = new Scanner(scannedFarm);
        String[] tmpTableName = scanner.nextLine().split(getSeparator());
        if (tmpTableName.length == 2) {
            tableName = tmpTableName;
        } else {
            if (tableName == null) {
                continue;
            }
            continuePrevStream = true;
        }
        scanner.close();
        /*HERE*/ InputStream is = new ByteArrayInputStream(scannedFarm.getBytes("UTF-8"));
....
It is acceptable to allocate a large amount of memory, since the String is large (I need it to be such a large chunk). My main problem is that the same allocation happens twice as a result of getBytes.
So my question is: is there a way to transfer the findWithinHorizon result directly to an InputStream without allocating the memory twice?
Is there a more efficient way to achieve the same functionality?
Not exactly the same approach, but instead of findWithinHorizon you could try reading each line and searching for the pattern within the line context. This is sure to reduce memory pressure, as you're not buffering the whole file; as the API states:
If horizon is 0, then the horizon is ignored and this method continues
to search through the input looking for the specified pattern without
bound. In this case it may buffer all of the input searching for the
pattern.
Something like:
while (fileScanner.hasNextLine()) {
    String line = fileScanner.nextLine();
    if (linePattern.matcher(line).find()) { // linePattern: a per-line version of your pattern
        // handle the match
    }
}

Using StringBuilder to process csv files to save heap space

I am reading a CSV file that has about 50,000 lines and is 1.1 MiB in size (and it can grow larger).
In Code1, I use String to process the csv, while in Code2 I use StringBuilder (only one thread executes the code, so no concurrency issues)
Using StringBuilder makes the code a little bit harder to read than using the normal String class.
Am I prematurely optimizing things with StringBuilder in Code2 to save a bit of heap space and memory?
Code1
fr = new FileReader(file);
BufferedReader reader = new BufferedReader(fr);
String line = reader.readLine();
while (line != null)
{
    int separator = line.indexOf(',');
    String symbol = line.substring(0, separator);
    int begin = separator;
    separator = line.indexOf(',', begin + 1);
    String price = line.substring(begin + 1, separator);
    // Publish this update
    publisher.publishQuote(symbol, price);
    // Read the next line of fake update data
    line = reader.readLine();
}
Code2
fr = new FileReader(file);
BufferedReader reader = new BufferedReader(fr);
StringBuilder stringBuilder = new StringBuilder(reader.readLine());
while (stringBuilder.toString() != null) {
    int separator = stringBuilder.toString().indexOf(',');
    String symbol = stringBuilder.toString().substring(0, separator);
    int begin = separator;
    separator = stringBuilder.toString().indexOf(',', begin + 1);
    String price = stringBuilder.toString().substring(begin + 1, separator);
    publisher.publishQuote(symbol, price);
    stringBuilder.replace(0, stringBuilder.length(), reader.readLine());
}
Edit
I eliminated the toString() calls, so fewer String objects will be produced.
Code3
while (stringBuilder.length() > 0) {
    int separator = stringBuilder.indexOf(",");
    String symbol = stringBuilder.substring(0, separator);
    int begin = separator;
    separator = stringBuilder.indexOf(",", begin + 1);
    String price = stringBuilder.substring(begin + 1, separator);
    publisher.publishQuote(symbol, price);
    Thread.sleep(10);
    stringBuilder.replace(0, stringBuilder.length(), reader.readLine());
}
Also, the original code is downloaded from http://www.devx.com/Java/Article/35246/0/page/1
Will the optimized code increase performance of the app? - my question
The second code sample will not save you any memory nor any computation time. I am afraid you might have misunderstood the purpose of StringBuilder, which is really meant for building strings - not reading them.
Within the loop of your second code sample, every single line contains the expression stringBuilder.toString(), essentially turning the buffered string into a String object over and over again. Your actual string operations are done against these objects. Not only is the first code sample easier to read, it is almost certainly the more performant of the two.
Am I prematurely optimizing things with StringBuilder? - your question
Unless you have profiled your application and come to the conclusion that these very lines cause a notable slowdown in execution speed, yes. Unless you are really sure that something will be slow (e.g. if you recognize high computational complexity), you definitely want to do some profiling before you start making optimizations that hurt the readability of your code.
What kind of optimizations could be done to this code? - my question
If you have profiled the application, and decided this is the right place for an optimization, you should consider looking into the features offered by the Scanner class. Actually, this might both give you better performance (profiling will tell you if this is true) and more simple code.
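For instance, a minimal sketch of what a Scanner-based version might look like (the field layout follows the original code; a plain comma split is assumed, so quoted fields are not handled):

import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class ScannerQuotes {
    static void publishQuotes(File file) throws IOException {
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            while (scanner.hasNextLine()) {
                // Split each line on commas; only the first two fields are used here.
                String[] fields = scanner.nextLine().split(",");
                if (fields.length >= 2) {
                    String symbol = fields[0];
                    String price = fields[1];
                    // publisher.publishQuote(symbol, price);
                }
            }
        }
    }
}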
Am I prematurely optimizing things with StringBuilder in Code2 to save a bit of heap space and memory?
Most probably: yes. But, only one way to find out: profile your code.
Also, I'd use a proper CSV parser instead of what you're doing now: http://ostermiller.org/utils/CSV.html
Code2 is actually less efficient than Code1 because every time you call stringBuilder.toString() you're creating a new java.lang.String instance (in addition to the existing StringBuilder object). This is less efficient in terms of space and time due to the object creation overhead.
Assigning the contents of readLine() directly to a String and then splitting that String will typically be performant enough. You could also consider using the Scanner class.
Memory Saving Tip
If you encounter multiple repeating tokens in your input consider using String.intern() to ensure that each identical token references the same String object; e.g.
String[] tokens = parseTokens(line);
for (String token : tokens) {
    // Construct business object referencing interned version of token.
    BusinessObject bo = new BusinessObject(token.intern());
    // Add business object to collection, etc.
}
StringBuilder is usually used like this:
StringBuilder sb = new StringBuilder();
sb.append("You").append(" can chain ")
  .append(" your ").append(" strings ")
  .append("for better readability.");
String myString = sb.toString(); // only call once when you are done
System.out.println(sb); // also calls sb.toString().. print myString instead
StringBuilder has several good properties:
StringBuffer's operations are synchronized but StringBuilder's are not, so using StringBuilder will improve performance in single-threaded scenarios.
Once the buffer has been expanded, it can be reused by invoking setLength(0) on the object. Interestingly, if you step into the debugger and examine the contents of the StringBuilder, you will see that the contents still exist even after invoking setLength(0). The JVM simply resets the length pointer to the beginning of the buffer; the next time you start appending chars, the pointer moves forward again.
If you are not really sure about the length of the string, it is better to use StringBuilder, because once the buffer has been expanded you can reuse it for content of smaller or equal size.
StringBuffer and StringBuilder are almost the same in all operations, except that StringBuffer is synchronized and StringBuilder is not.
If you don't need multithreading, it is better to use StringBuilder.
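As a small illustration of that buffer reuse (a sketch; the sample rows are made up), the same builder can be cleared with setLength(0) and refilled without reallocating:

import java.util.Arrays;
import java.util.List;

public class BuilderReuse {
    public static void main(String[] args) {
        // Hypothetical rows standing in for lines parsed from a CSV file.
        List<String[]> rows = Arrays.asList(
                new String[] {"GOOG", "123.45"},
                new String[] {"MSFT", "67.89"});

        StringBuilder sb = new StringBuilder(128); // pre-size the buffer once
        for (String[] row : rows) {
            sb.setLength(0);                       // clear contents, keep the buffer
            sb.append(row[0]).append(',').append(row[1]);
            System.out.println(sb);                // println calls sb.toString() for us
        }
    }
}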
