We have some Java code that processes a user-provided file by looping through it with BufferedReader.readLine() to read each line.
The problem is that when the user uploads a file with extremely long lines, such as an arbitrary binary JPG, this can cause out-of-memory issues: even the first readLine() may never return. We want to reject files with overly long lines before the JVM OOMs.
Is there a standard Java idiom to handle this, or do we just switch to read() and write our own safe version of readLine()?
You will need to read the file character by character (or chunk by chunk) yourself (via some form of read()), and then form the lines into Strings when you encounter a newline character. This way you can throw an Exception (avoiding the OOM error) if some maximum number of characters is hit before a newline is encountered.
If you use a Reader instance, it should not be too difficult to implement: read from the Reader into a buffer (allocated to your maximum allowed line length), and convert the buffer to a String when you encounter a newline (or throw an exception if you don't).
There doesn't appear to be any way to set a line length limit for BufferedReader.readLine(), so it will accumulate the entire line before feeding it to your code, however long that line may be.
Therefore, you'll have to do the line-splitting part yourself, and give up once a line is too long.
You might use the following as a starting point:
import java.io.IOException;
import java.io.Reader;

class LineTooLongException extends Exception {}

class ShortLineReader implements AutoCloseable {
    private static final int MAX_LINE_LENGTH = 6000;

    final Reader reader;
    final char[] buf = new char[8192];
    int nextIndex = 0; // start of the not-yet-consumed data in buf
    int maxIndex = 0;  // end of the valid data in buf
    boolean eof;

    public ShortLineReader(Reader reader) {
        this.reader = reader;
    }

    public String readLine() throws IOException, LineTooLongException {
        if (eof) {
            return null;
        }
        for (;;) {
            // Look for a newline in the data we already have buffered.
            for (int i = nextIndex; i < maxIndex; i++) {
                if (buf[i] == '\n') {
                    String result = new String(buf, nextIndex, i - nextIndex);
                    nextIndex = i + 1;
                    return result;
                }
            }
            // No newline yet: give up if the partial line is already too long.
            if (maxIndex - nextIndex > MAX_LINE_LENGTH) {
                throw new LineTooLongException();
            }
            // Compact the buffer and refill it from the reader.
            System.arraycopy(buf, nextIndex, buf, 0, maxIndex - nextIndex);
            maxIndex -= nextIndex;
            nextIndex = 0;
            int c = reader.read(buf, maxIndex, buf.length - maxIndex);
            if (c == -1) {
                eof = true;
                if (maxIndex == 0) {
                    return null; // nothing after the last newline
                }
                return new String(buf, 0, maxIndex);
            } else {
                maxIndex += c;
            }
        }
    }

    @Override
    public void close() throws Exception {
        reader.close();
    }
}
import java.io.File;
import java.io.FileReader;

public class Test {
    public static void main(String[] args) throws Exception {
        File file = new File("D:\\t\\output.log");
//        // Uncomment to generate a test file with one 10,000,000-character line:
//        try (OutputStream fos = new BufferedOutputStream(new FileOutputStream(file))) {
//            for (int i = 0; i < 10000000; i++) {
//                fos.write(65);
//            }
//        }
        try (ShortLineReader r = new ShortLineReader(new FileReader(file))) {
            String s;
            while ((s = r.readLine()) != null) {
                System.out.println(s);
            }
        }
    }
}
Note: This assumes unix-style line termination.
Use a BufferedInputStream to read binary data, rather than a BufferedReader...
For example, if it is an image file, you can read it with ImageIO, either from a File directly:
File file = new File("image.gif");
BufferedImage image = ImageIO.read(file);
or via an InputStream:
InputStream is = new BufferedInputStream(new FileInputStream("image.gif"));
BufferedImage image = ImageIO.read(is);
hope it helps...
There doesn't appear to be a definitive way, but there are a few things you can do:
Check file headers. jMimeMagic seems to be a pretty good library for this purpose.
Check the type of characters the file contains. Essentially, do statistical analysis on the first 'x' bytes of the file and use that to estimate the rest of the content; a sketch of this idea follows after the list.
Check for newlines '\n' or '\r' in the file; binary files usually won't contain newlines.
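A minimal sketch of the second and third checks combined, assuming a simple heuristic (the class name, sample size, and 5% threshold are arbitrary choices, not part of any standard):

import java.io.FileInputStream;
import java.io.IOException;

public class TextFileCheck {
    /** Returns true if the first sampleSize bytes look like text. */
    static boolean looksLikeText(String path, int sampleSize) throws IOException {
        byte[] sample = new byte[sampleSize];
        int read;
        try (FileInputStream in = new FileInputStream(path)) {
            read = in.read(sample);
        }
        if (read <= 0) {
            return true; // an empty file is trivially "text"
        }
        int suspicious = 0;
        for (int i = 0; i < read; i++) {
            int b = sample[i] & 0xFF;
            // Control characters other than tab, newline, and carriage return
            // rarely appear in text files.
            if (b < 0x20 && b != '\t' && b != '\n' && b != '\r') {
                suspicious++;
            }
        }
        // Arbitrary threshold: reject if more than 5% of the sampled bytes are suspicious.
        return suspicious * 20 < read;
    }
}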
Hope that helps.
I have the following code:
import java.io.FileReader;
import java.io.IOException;

public class Reader {
    public static void main(String[] args) throws IOException {
        try (FileReader in = new FileReader("D:/test.txt")) {
            // BufferedReader br = new BufferedReader(in);
            int line = in.read();
            for (int i = 0; i < line; i++) {
                //System.out.println(line);
                System.out.println((char) line);
                line = in.read();
            }
        }
    }
}
and a file Test.txt with the content:
Hello
Java
When I run the above code it only reads Hello. I would like to read multiple lines using FileReader only; I don't want to use BufferedReader or InputStreamReader etc. Is that possible?
I don't think this version of the code prints "Hello".
You are calling:
int line = in.read();
What does this do? Look in the Javadocs for Reader:
public int read()
throws IOException
Reads a single character. This method will block until a character is available, an I/O error occurs, or the end of the stream is reached.
(emphasis mine)
Your code reads the 'H' from 'Hello', which is 72 in ASCII.
Then it reaches your loop with line==72:
for(int i=0;i<line;i++)
... making the decision "is 0 less than 72? Yes, so I'll go into the loop block".
Then each time it reads a character, the value of line changes to another integer, and each time the loop goes around, i increments. So the loop says "Keep going for as long as the ASCII value of the character I just read is greater than the number of iterations I've counted".
... and each time it goes around, it prints that character on a line of its own.
As it happens, for your input, it reads end-of-file (-1), and since i < -1 is never true, the loop's continue condition is not met.
But for longer inputs it would stop on the first 'a' after the 97th character, or the first 'b' after the 98th character, and so on (because ASCII 'a' is 97, etc.)
H
e
l
l
o
J
a
v
a
This isn't what you want:
You don't want your loop to repeat until i >= "the character I just read". You want it to repeat until in.read() returns -1. You have probably been taught how to loop until a condition is met.
You don't want to println() each character, since that adds newlines you don't want. Use print().
You should also look at the Reader.read(char[] cbuf) method, and see if you can write the code to work in bigger chunks.
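For instance, a chunked version might look like this (a sketch using read(char[]) on the same FileReader; the buffer size is an arbitrary choice):

public static void main(String[] args) throws IOException {
    char[] buffer = new char[4096]; // arbitrary chunk size
    int count;
    try (FileReader in = new FileReader("D:/test.txt")) {
        // read() fills up to buffer.length chars and returns the count, or -1 at end of stream
        while ((count = in.read(buffer)) != -1) {
            System.out.print(new String(buffer, 0, count));
        }
    }
}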
Two patterns you'll use over and over again in your programming career are:
Type x = getSomehow();
while (someCondition(x)) {
    doSomethingWith(x);
    x = getSomehow();
}
... and ...
Type x = value_of_x_which_meets_condition;
while (someCondition(x)) {
    x = getSomehow();
    doSomethingWith(x);
}
See if you can construct something with FileReader and the value you get from it, filling in the "somehows".
Reading a file character by character without any buffering stream is extremely inefficient. I would probably wrap the FileReader in a BufferedReader, or simply use a Scanner to read the content of the file, but if you absolutely want/need/have to use only FileReader, then you can try
int line = in.read();
while (line != -1) {
    System.out.print((char) line);
    line = in.read();
}
instead of your for (int i = 0; i < line; i++) {...} loop.
Read slim's answer carefully. In short: the reading condition shouldn't care whether the number of characters you have read so far is less than the numeric value of the currently read character (i < line). Consider a file like
My name
is
not important now
This file contains a few characters you normally won't see, like \r and \n, so in reality it looks like
My name\r\n
\r\n
is\r\n
\r\n
not important now
where the numeric value of \n is 10 and of \r is 13. After printing "My name" your loop keeps going through the \r\n\r\n sequence (each of \r and \n is a single character), and by the time i reaches 10 the character just read is the second \n, whose value is also 10, so your condition i<line fails (10<10 is not true).
So instead of checking i<line you should check that the read value is not EoF (End of File, or End of Stream in our case), which is represented by -1 as specified in the read method documentation, so your condition should look like line != -1. And because you don't need i, just use a while loop here.
Returns:
The character read, or -1 if the end of the stream has been reached
You will have to read the content char by char and parse for a new line sequence.
A new line sequence can be any of the following:
a single carriage return '\r'
a single line feed '\n'
a carriage return followed by a line feed "\r\n"
EDIT
You could try the following:
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public List<String> readLinesUsingFileReader(String filename) throws IOException {
    List<String> lines = null;
    try (FileReader fileReader = new FileReader(filename)) {
        lines = readLines(fileReader);
    }
    return lines;
}

private List<String> readLines(FileReader fileReader) throws IOException {
    List<String> lines = new ArrayList<>();
    boolean newLine = false;
    int c, p = 0; // c: current char, p: previous char
    StringBuilder line = new StringBuilder();
    while (-1 != (c = fileReader.read())) {
        if (c == '\n' && p != '\r') {
            newLine = true;            // lone \n terminates the line
        } else if (c == '\r') {
            newLine = true;            // \r terminates the line
        } else if (c != '\n') {
            line.append((char) c);     // ordinary character
        }                              // (a \n right after \r is skipped)
        if (newLine) {
            lines.add(line.toString());
            line = new StringBuilder();
            newLine = false;
        }
        p = c;
    }
    if (line.length() > 0) {
        lines.add(line.toString());    // last line without a trailing newline
    }
    return lines;
}
Note that the code above reads the whole file into a List; this might not be well suited for large files! In such a case you may want to implement a streaming approach instead, i.e. read one line at a time, for example String readNextLine(FileReader fileReader) { ... }; a sketch of this follows.
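A possible sketch of such a streaming variant, keeping the same CR/LF handling as readLines() above (the one-character lookahead field is an implementation choice, not part of the original answer):

private int pending = -2; // -2 means "no lookahead stored"

/** Returns the next line, or null at end of stream. */
String readNextLine(FileReader fileReader) throws IOException {
    int c = (pending != -2) ? pending : fileReader.read();
    pending = -2;
    if (c == -1) {
        return null;
    }
    StringBuilder line = new StringBuilder();
    while (c != -1) {
        if (c == '\n') {
            break;                  // lone \n ends the line
        }
        if (c == '\r') {
            int next = fileReader.read();
            if (next != '\n') {
                pending = next;     // lone \r: keep the lookahead for the next call
            }
            break;
        }
        line.append((char) c);
        c = fileReader.read();
    }
    return line.toString();
}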
Some basic tests:
Create the test files to read:

private final static String txt0 = "testnl0.txt";
private final static String txt1 = "testnl1.txt";
private final static String txt2 = "testnl2.txt";

@BeforeClass
public static void genTestFile() throws IOException {
    try (OutputStream os = new FileOutputStream(txt0)) {
        os.write((
                "Hello\n" +
                ",\r\n" +
                "World!" +
                "").getBytes());
    }
    try (OutputStream os = new FileOutputStream(txt1)) {
        os.write((
                "\n" +
                "\r\r" +
                "\r\n" +
                "").getBytes());
    }
    try (OutputStream os = new FileOutputStream(txt2)) {
        os.write((
                "").getBytes());
    }
}
Test using the created files:

@Test
public void readLinesUsingFileReader0() throws IOException {
    List<String> lines = readLinesUsingFileReader(txt0);
    Assert.assertEquals(3, lines.size());
    Assert.assertEquals("Hello", lines.get(0));
    Assert.assertEquals(",", lines.get(1));
    Assert.assertEquals("World!", lines.get(2));
}

@Test
public void readLinesUsingFileReader1() throws IOException {
    List<String> lines = readLinesUsingFileReader(txt1);
    Assert.assertEquals(4, lines.size());
    Assert.assertEquals("", lines.get(0));
    Assert.assertEquals("", lines.get(1));
    Assert.assertEquals("", lines.get(2));
    Assert.assertEquals("", lines.get(3));
}

@Test
public void readLinesUsingFileReader2() throws IOException {
    List<String> lines = readLinesUsingFileReader(txt2);
    Assert.assertTrue(lines.isEmpty());
}
If you read the raw characters into a buffer, any newline characters in the file come along with them:

public static void main(String[] args) throws IOException {
    FileReader in = new FileReader("D:/test.txt");
    char[] a = new char[50];
    in.read(a);              // reads up to 50 chars into the array (check the return value in real code)
    for (char c : a)
        System.out.print(c); // prints the characters one by one, newlines included
    in.close();
}
It will print
Hello
Java
I solved the above problem by using this code:

public class Reader {
    public static void main(String[] args) throws IOException {
        try (FileReader in = new FileReader("D:/test.txt")) {
            int line = in.read();
            while (line != -1) {
                System.out.print((char) line);
                line = in.read();
            }
        }
    }
}
But there is one more question: if I write a for loop instead of while, like this:
for (int i = 0; i < line; i++)
it prints only the first line. Could anybody tell me why?
Reader.read() returns the int code of a single char, or -1 if the end of the file has been reached:
http://docs.oracle.com/javase/7/docs/api/java/io/Reader.html#read()
So, read the file char by char and check for LF (line feed, '\n', 0x0A, 10 in decimal), CR (carriage return, '\r', 0x0D, 13 in decimal) and the end-of-stream code (-1).
Note: Windows uses 2 chars to encode the end of a line: "\r\n". Most others, including Linux and macOS, use only "\n".
final StringBuilder line = new StringBuilder(); // line buffer
try (FileReader in = new FileReader("D:/test.txt")) {
    int chAr, prevChar = 0x0A; // chAr - just-read char, prevChar - previously read char
    while (prevChar != -1) { // until the last read char is EOF
        chAr = in.read(); // read the int code of the next char
        switch (chAr) {
            case 0x0D: // CR - just
                break; // skip
            case -1: // EOF
                if (prevChar == 0x0A) {
                    break; // no need for a new line if EOF comes right after LF,
                           // or if no chars were read at all (prevChar still has
                           // its initial value 0x0A)
                }
                // intentional fall-through
            case 0x0A: // LF
                System.out.println("line:" + line.toString()); // flush the line buffer
                line.setLength(0); // clean up the line buffer
                break;
            default: // any other char code
                line.append((char) chAr); // append to the line buffer
        }
        prevChar = chAr; // remember the current char for the next iteration
    }
}
I have a file which is split into two parts by "\n\n": the first part is a reasonably short String and the second is a byte array, which can be quite long.
I am trying to read the file as follows:

byte[] result;
try (final FileInputStream fis = new FileInputStream(file)) {
    final InputStreamReader isr = new InputStreamReader(fis);
    final BufferedReader reader = new BufferedReader(isr);
    String line;
    // reading until \n\n
    while (!(line = reader.readLine()).trim().isEmpty()) {
        // processing the line
    }
    // copying the rest of the byte array
    result = IOUtils.toByteArray(reader);
    reader.close();
}
Even though the resulting array is the size it should be, its contents are broken. If I try to use toByteArray directly on fis or isr, the contents of result are empty.
How can I read the rest of the file correctly and efficiently?
Thanks!
The reason your contents are broken is because the IOUtils.toByteArray(...) function reads your data as a string in the default character encoding, i.e. it converts the 8-bit binary values into text characters using whatever logic your default encoding prescribes. This usually leads to many of the binary values getting corrupted.
Depending on how exactly the charset is implemented, there is a slight chance that this might work:
result = IOUtils.toByteArray(reader, "ISO-8859-1");
ISO-8859-1 uses only a single byte per character. Not all character values are defined, but many implementations will pass them through anyway. Maybe you'll be lucky with it.
But a much cleaner solution would be to read the String at the beginning as binary data first and then convert it to text via new String(bytes), rather than reading the binary data at the end as a String and then converting it back.
This might mean, though, that you need to implement your own version of a BufferedReader for performance purposes.
You can find the source code of the standard BufferedReader via the obvious Google search, which will (for example) lead you here:
http://www.docjar.com/html/api/java/io/BufferedReader.java.html
It's a bit long, but conceptually not too difficult to understand, so hopefully it will be useful as a reference.
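A minimal sketch of that approach, assuming the header is UTF-8 text, the separator is "\n\n", and commons-io is available for the final copy (the class name is illustrative):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

class HeaderAndBody {
    final String header;
    final byte[] body;

    HeaderAndBody(File file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
            int prev = -1, c;
            while ((c = in.read()) != -1) {
                if (c == '\n' && prev == '\n') {
                    break;             // found the "\n\n" separator
                }
                headerBytes.write(c);  // note: the first '\n' of the pair lands here too
                prev = c;
            }
            // Decode the header to text only now, in a known charset.
            header = new String(headerBytes.toByteArray(), StandardCharsets.UTF_8);
            body = IOUtils.toByteArray(in); // the rest of the stream, as untouched bytes
        }
    }
}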
Alternatively, you could read the whole file into a byte array, find the \n\n position, and split the array into the line and the bytes:

byte[] a = Files.readAllBytes(Paths.get("file"));
String line = "";
byte[] result = a;
for (int i = 0; i < a.length - 1; i++) {
    if (a[i] == '\n' && a[i + 1] == '\n') {
        line = new String(a, 0, i);
        int len = a.length - i - 2; // skip both '\n' bytes of the separator
        result = new byte[len];
        System.arraycopy(a, i + 2, result, 0, len);
        break;
    }
}
Thanks for all the comments; the final implementation was done this way:

try (final FileInputStream fis = new FileInputStream(file)) {
    ByteBuffer buffer = ByteBuffer.allocate(64); // max expected header token length
    boolean wasLast = false;
    String headerValue = null, headerKey = null;
    byte[] result = null;
    while (true) {
        int current = fis.read();
        if (current == -1) {
            break; // EOF before "\n\n": malformed header
        }
        if (current == '\n') {
            if (wasLast) {
                // this is \n\n
                break;
            } else {
                // just a new line in the header
                wasLast = true;
                headerValue = new String(buffer.array(), 0, buffer.position());
                buffer.clear();
            }
        } else if (current == '\t') {
            // headerKey\theaderValue\n
            headerKey = new String(buffer.array(), 0, buffer.position());
            buffer.clear();
        } else {
            buffer.put((byte) current);
            wasLast = false;
        }
    }
    // reading the rest
    result = IOUtils.toByteArray(fis);
}
I am getting an OutOfMemory exception. Why? I am using this code for logging. Is this approach correct?
Exceptions and closing of streams are handled in the parent methods.
private static void writeToFile(File file, FileWriter out, String message) throws IOException {
    if (file.exists() && file.isFile()) {
        if ((file.length() + message.getBytes().length) <= FILE_MAX_SIZE_B) {
            out.write(message);
        } else {
            int cutLenght = (int) (file.length() + message.getBytes().length - FILE_MAX_SIZE_B);
            FileInputStream fileInputStream = new FileInputStream(file);
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream));
            char[] buf = new char[1024];
            int numRead = 0;
            StringBuffer text = new StringBuffer(1000);
            while ((numRead = bufferedReader.read(buf)) != -1) {
                text.append(buf, 0, numRead);
            }
            String result = new String(text).substring(cutLenght);
            result += message;
            FileWriter fileWriter = new FileWriter(file, appendToFile);
            writeToFile(file, fileWriter, result);
            bufferedReader.close();
        }
    }
}
EDIT:
I am using this method for writing my logs to a file, so in one second I can produce, for example, 10 log calls. I am getting the error on these lines:
while ((numRead = bufferedReader.read(buf)) != -1) {
    text.append(buf, 0, numRead);
}
My guess is that you are getting the OutOfMemoryError because you are reading the entire contents of the log file back into memory once it has gotten too close to its maximum size.
You could instead read and write it in smaller chunks, but that could be tricky since you have to avoid overwriting something you haven't already read.
Overall, this technique seems like a very inefficient method of maintaining the log data. Some alternative approaches off the top of my head:
(1) maintain a set of n log files, each with maximum size FILE_MAX_SIZE_B/n. When the first log fills up, open the next one for writing, and so on; when the last one fills up, go back to the first one. In this way you discard some of the oldest log data each time you switch files, but not all of it, while still maintaining your overall size limit. A sketch of this approach follows below.
(2) rotate the data within a single file. After each write, add a marker that indicates this is the end of the log stream. When the file has reached its maximum size, just start again at the beginning, overwriting the data that is there. The marker will tell you where the latest message is.
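A minimal sketch of approach (1), assuming the overall budget FILE_MAX_SIZE_B is split across n files and that a message's length in chars approximates its size in bytes (the class name and file-naming scheme are illustrative):

import java.io.Closeable;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

class RotatingLogger implements Closeable {
    private final File[] files;
    private final long perFileLimit;
    private int current = 0;
    private Writer out;

    RotatingLogger(String baseName, int n, long totalLimit) throws IOException {
        files = new File[n];
        for (int i = 0; i < n; i++) {
            files[i] = new File(baseName + "." + i);
        }
        perFileLimit = totalLimit / n;
        out = new FileWriter(files[0], true); // append to the current file
    }

    synchronized void log(String message) throws IOException {
        if (files[current].length() + message.length() > perFileLimit) {
            out.close();
            current = (current + 1) % files.length;      // advance, wrapping around
            out = new FileWriter(files[current], false); // truncate the oldest file
        }
        out.write(message);
        out.flush(); // keep file.length() in sync with what was written
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}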
Try something like this:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

void appendToFile(File f, CharSequence message, Charset cs, long maximumSize) throws IOException {
    long available = maximumSize - f.length();
    if (available > 0) {
        FileOutputStream fos = new FileOutputStream(f, true);
        try {
            CharBuffer chars = CharBuffer.wrap(message);
            ByteBuffer bytes = ByteBuffer.allocate(8 * 1024); // re-used when encoding the string
            CharsetEncoder enc = cs.newEncoder();
            CoderResult res;
            do {
                res = enc.encode(chars, bytes, true);
                bytes.flip();
                long len = Math.min(available, bytes.remaining());
                available -= len;
                fos.write(bytes.array(), bytes.position(), (int) len);
                bytes.clear();
            } while (res == CoderResult.OVERFLOW && available > 0);
        } finally {
            fos.close();
        }
    }
}
Testable with this:

File f = new File(getCacheDir(), "tmp.txt");
f.delete();
// Or whatever charset you want.
Charset cs = Charset.forName("UTF-8");
int maxlen = 2 * 1024; // For this test, 2kb
try {
    for (int i = 0; i < maxlen / 20; i++) {
        // Write 30 characters maxlen/20 times == guaranteed overflow
        appendToFile(f, "123456789012345678901234567890", cs, maxlen);
        System.out.println("Length=" + f.length());
    }
} catch (Throwable t) {
    t.printStackTrace();
}
f.delete();
Well, you're getting OOM because you're trying to load a huge file into memory.
Did you try opening it with append option instead?
You get the OOME because you load the whole file just to keep part of the string. Instead, do a skip() on your input stream and read from there.
I have a log file which gets updated every second. I need to read the log file periodically; once I do a read, I need to store the file pointer position at the end of the last line I read, and on the next periodic read I should start from that point.
Currently, I am using a RandomAccessFile in Java, with the getFilePointer() method to get the offset value and the seek() method to go to the offset position.
However, I have read in most articles, and even in the Java doc recommendations, to use a BufferedReader for efficient reading of a file. How can I achieve this (getting the file pointer and moving to the last line) using a BufferedReader, or is there any other efficient way to achieve this task?
A couple of ways that should work:
open the file using a FileInputStream, skip() the relevant number of bytes, then wrap the BufferedReader around the stream (via an InputStreamReader);
open the file (with either FileInputStream or RandomAccessFile), call getChannel() on the stream/RandomAccessFile to get an underlying FileChannel, call position(offset) on the channel, then call Channels.newInputStream() to get an input stream from the channel, which you can pass to InputStreamReader -> BufferedReader.
I haven't honestly profiled these to see which is better performance-wise, but you should see which works better in your situation.
The problem with RandomAccessFile is essentially that its readLine() method is very inefficient. If it's convenient for you to read from the RAF and do your own buffering to split the lines, then there's nothing wrong with RAF per se-- just that its readLine() is poorly implemented
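A minimal sketch of the first option, assuming the saved offset counts bytes from the start of the file (the method name is illustrative):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

static BufferedReader openAt(File file, long byteOffset) throws IOException {
    FileInputStream in = new FileInputStream(file);
    long skipped = 0;
    while (skipped < byteOffset) {     // skip() may skip fewer bytes than requested
        long n = in.skip(byteOffset - skipped);
        if (n <= 0) {
            break;                     // EOF reached or no progress possible
        }
        skipped += n;
    }
    return new BufferedReader(new InputStreamReader(in));
}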
Neil Coffey's solution is good if you are reading fixed-length files. However, for files that have variable length (data keeps coming in), there are some problems with using a BufferedReader directly on a FileInputStream or FileChannel input stream via an InputStreamReader. For example, consider these cases:
1) You want to read data from some offset to the current file length. So you use a BR on the FileInputStream/FileChannel (via an InputStreamReader) and use its readLine method. But while you are busy reading, some data gets added, which causes BR's readLine to read more data than you expected (up to beyond the previous file length).
2) You finish the readLine work, but when you try to read the current file length/channel position, some data gets added suddenly, which causes the current file length/channel position to increase, yet you have already read less data than this.
In both of the above cases it is difficult to know how much data you have actually read (you cannot just use the length of the data read via readLine, because it skips some chars like the carriage return).
So it is better to read the data as buffered bytes and use a BufferedReader wrapper around those. I wrote some methods like this:
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;

private static final int BYTE_BUFFER_SIZE = 4096;

/** Read data from offset to length bytes in RandomAccessFile using BufferedReader
 * @param offset
 * @param length
 * @param accessFile
 * @throws IOException
 */
public static void readBufferedLines(long offset, long length, RandomAccessFile accessFile) throws IOException {
    if (accessFile == null) return;
    int bufferSize = BYTE_BUFFER_SIZE;
    if (offset < length && offset >= 0) {
        int index = 1;
        long curPosition = offset;
        /*
         * iterate (length-from)/BYTE_BUFFER_SIZE times to read into the buffer,
         * no matter where a new line occurs
         */
        while ((curPosition + (index * BYTE_BUFFER_SIZE)) < length) {
            accessFile.seek(offset); // seek to the last parsed data rather than the last data read into the buffer
            byte[] buf = new byte[bufferSize];
            int read = accessFile.read(buf, 0, bufferSize);
            index++; // increment whether or not the read was successful
            if (read > 0) {
                int lastnewLine = getLastLine(read, buf);
                if (lastnewLine <= 0) { // no new line found in the buffer: grow the buffer and continue
                    bufferSize = bufferSize + read;
                    continue;
                } else {
                    bufferSize = BYTE_BUFFER_SIZE;
                }
                readLine(buf, 0, lastnewLine); // read the lines from the buffer and parse them
                offset = offset + lastnewLine; // update the last data read
            }
        }
        // Read the last chunk. The last chunk size in the worst case is the whole file, when no newline occurs
        if (offset < length) {
            accessFile.seek(offset);
            byte[] buf = new byte[(int) (length - offset)];
            int read = accessFile.read(buf, 0, buf.length);
            if (read > 0) {
                readLine(buf, 0, read);
                offset = offset + read; // update the last data read
            }
        }
    }
}

private static void readLine(byte[] buf, int from, int lastnewLine) throws IOException {
    String readLine = "";
    BufferedReader reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(buf, from, lastnewLine)));
    while ((readLine = reader.readLine()) != null) {
        // do something with readLine
        System.out.println(readLine);
    }
    reader.close();
}

private static int getLastLine(int read, byte[] buf) {
    if (buf == null) return -1;
    if (read > buf.length) read = buf.length;
    while (read > 0 && !(buf[read - 1] == '\n' || buf[read - 1] == '\r')) read--;
    return read;
}

public static void main(String[] args) throws IOException {
    RandomAccessFile accessFile = new RandomAccessFile("C:/sri/test.log", "r");
    readBufferedLines(0, accessFile.length(), accessFile);
    accessFile.close();
}
I had a similar problem, and I created this class to take lines from a BufferedReader and count how many bytes have been read so far using getBytes(). We assume the line separator is a single byte by default, and we re-instantiate the BufferedReader for seek() to work.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;

public class FileCounterIterator {
    public Long position() {
        return _position;
    }

    public Long fileSize() {
        return _fileSize;
    }

    public FileCounterIterator newlineLength(Long newNewlineLength) {
        this._newlineLength = newNewlineLength;
        return this;
    }

    private Long _fileSize = 0L;
    private Long _position = 0L;
    private Long _newlineLength = 1L;
    private RandomAccessFile fp;
    private BufferedReader itr;

    public FileCounterIterator(String filename) throws IOException {
        fp = new RandomAccessFile(filename, "r");
        _fileSize = fp.length();
        this.seek(0L);
    }

    public FileCounterIterator seek(Long newPosition) throws IOException {
        this.fp.seek(newPosition);
        this._position = newPosition;
        itr = new BufferedReader(new InputStreamReader(new FileInputStream(fp.getFD())));
        return this;
    }

    public Boolean hasNext() throws IOException {
        return this._position < this._fileSize;
    }

    public String readLine() throws IOException {
        String nextLine = itr.readLine();
        // Assumes a single-byte newline; getBytes() uses the default charset.
        this._position += nextLine.getBytes().length + _newlineLength;
        return nextLine;
    }
}
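A hypothetical usage example, resuming from a previously saved offset (savedPosition is illustrative):

long savedPosition = 0L; // offset persisted from the previous run
FileCounterIterator it = new FileCounterIterator("C:/sri/test.log");
it.seek(savedPosition);
while (it.hasNext()) {
    String line = it.readLine();
    // process the line; it.position() is the offset to persist for the next run
}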
How can you get the contents of a text file while preserving whether or not it has a newline at the end of the file? Using this technique, it is impossible to tell if the file ends in a newline:

BufferedReader reader = new BufferedReader(new FileReader(fromFile));
StringBuilder contents = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
    contents.append(line);
    contents.append("\n");
}
Don't use readLine(); transfer the contents one character at a time using the read() method. If you use it on a BufferedReader, this will have the same performance, although unlike your code above it will not "normalize" Windows-style CR/LF line breaks.
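A minimal sketch of that approach (it keeps every character exactly as it appears in the file, including any trailing newline; fromFile is the same variable as in the question):

StringBuilder contents = new StringBuilder();
try (Reader reader = new BufferedReader(new FileReader(fromFile))) {
    int c;
    while ((c = reader.read()) != -1) {
        contents.append((char) c);
    }
}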
You can read the whole file content using one of the techniques listed here
My favorite is this one:
public static long copyLarge(InputStream input, OutputStream output)
        throws IOException {
    byte[] buffer = new byte[DEFAULT_BUFFER_SIZE]; // e.g. 4096
    long count = 0;
    int n = 0;
    while ((n = input.read(buffer)) >= 0) {
        output.write(buffer, 0, n);
        count += n;
    }
    return count;
}