Java: Read a Large Text File With 70 Million Lines of Text

I have a big text file with 70 million lines of text.
I have to read the file line by line.
I used two different approaches:
InputStreamReader isr = new InputStreamReader(new FileInputStream(FilePath),"unicode");
BufferedReader br = new BufferedReader(isr);
while((cur=br.readLine()) != null);
and
LineIterator it = FileUtils.lineIterator(new File(FilePath), "unicode");
while(it.hasNext()) cur=it.nextLine();
Is there another approach that can make this task faster?

1) I am fairly sure there is no difference in speed: both use a FileInputStream and buffering internally.
2) You can take measurements and see for yourself.
3) Though there is no performance benefit, I like the Java 7 approach:
try (BufferedReader br = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
for (String line = null; (line = br.readLine()) != null;) {
//
}
}
4) Scanner based version
try (Scanner sc = new Scanner(new File("test.txt"), "UTF-8")) {
while (sc.hasNextLine()) {
String line = sc.nextLine();
}
// note that Scanner suppresses exceptions
if (sc.ioException() != null) {
throw sc.ioException();
}
}
5) This may be faster than the rest
try (SeekableByteChannel ch = Files.newByteChannel(Paths.get("test.txt"))) {
ByteBuffer bb = ByteBuffer.allocateDirect(1000);
for(;;) {
StringBuilder line = new StringBuilder();
int n = ch.read(bb);
// add chars to line
// ...
}
}
It requires a bit of coding, but it can be considerably faster because of ByteBuffer.allocateDirect, which allows the OS to read bytes from the file into the ByteBuffer directly, without copying.
6) Parallel processing can increase speed further. Make a big byte buffer, run several tasks that read bytes from the file into that buffer in parallel, and when they are done, find the first end of line, make a String, find the next one, and so on.
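Approach 5 leaves the line assembly as a comment. A minimal sketch of how it might be filled in follows, assuming UTF-8 input (not the "unicode" charset from the question) with '\n' or "\r\n" line terminators. The class and method names are hypothetical. Whether this beats BufferedReader depends on the system, so measure before adopting it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

class ChannelLineReader {
    // Hypothetical helper: passes each line of a UTF-8 file to the consumer
    // and returns the number of lines seen.
    static long forEachLine(Path file, Consumer<String> consumer) throws IOException {
        long count = 0;
        try (SeekableByteChannel ch = Files.newByteChannel(file)) {
            ByteBuffer bb = ByteBuffer.allocateDirect(64 * 1024); // allocated once, reused
            ByteArrayOutputStream line = new ByteArrayOutputStream(128);
            while (ch.read(bb) != -1) {
                bb.flip();
                while (bb.hasRemaining()) {
                    byte b = bb.get();
                    if (b == '\n') {
                        // In UTF-8 the byte 0x0A never occurs inside a multi-byte
                        // character, so splitting on it before decoding is safe.
                        consumer.accept(decode(line));
                        line.reset();
                        count++;
                    } else {
                        line.write(b);
                    }
                }
                bb.clear();
            }
            if (line.size() > 0) { // last line without a trailing newline
                consumer.accept(decode(line));
                count++;
            }
        }
        return count;
    }

    private static String decode(ByteArrayOutputStream line) {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the CR of a CRLF pair
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }
}
```

A call could look like `ChannelLineReader.forEachLine(Paths.get("test.txt"), line -> { /* process */ });`.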

If you are looking for performance, have a look at the java.nio.* packages; they are reported to be faster than java.io.*.
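One java.nio facility that is often tried for large files is memory mapping. The sketch below is an illustration, not a measured recommendation: it counts '\n' bytes through FileChannel.map, mapping the file in windows of at most Integer.MAX_VALUE bytes because a single mapping cannot be larger. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedLineCounter {
    // Hypothetical helper: counts line-feed bytes in a file via memory mapping.
    static long countNewlines(Path file) throws IOException {
        long count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            while (pos < size) {
                // A single mapping is limited to Integer.MAX_VALUE bytes.
                long len = Math.min(Integer.MAX_VALUE, size - pos);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (map.hasRemaining()) {
                    if (map.get() == '\n') {
                        count++;
                    }
                }
                pos += len;
            }
        }
        return count;
    }
}
```

Counting bytes avoids charset decoding entirely, which is only valid for encodings such as UTF-8 or ASCII where the byte 0x0A always means a line feed.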

In Java 8, for anyone now looking to read large files line by line:
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}

I researched this topic for months in my free time and came up with a benchmark. Here is code that benchmarks the different ways to read a file line by line. Individual performance may vary based on the underlying system.
I ran it on an Intel i5 HP laptop with Windows 10 and Java 8. Here is the code:
import java.io.*;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.stream.Stream;
public class ReadComplexDelimitedFile {
private static long total = 0;
private static final Pattern FIELD_DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");
@SuppressWarnings("unused")
private void readFileUsingScanner() {
String s;
try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
while (stdin.hasNextLine()) {
s = stdin.nextLine();
String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
//Winner
private void readFileUsingCustomBufferedReader() {
try (CustomBufferedReader stdin = new CustomBufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReader() {
try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingLineReader() {
try (LineNumberReader stdin = new LineNumberReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingStreams() {
try (Stream<String> stream = Files.lines((new File(this.getClass().getResource("input.txt").getPath())).toPath())) {
total += stream.mapToInt(s -> FIELD_DELIMITER_PATTERN.split(s, 0).length).sum();
} catch (IOException e1) {
e1.printStackTrace();
}
}
private void readFileUsingBufferedReaderFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (CustomBufferedReader stdin = new CustomBufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
}
} catch (Exception e) {
System.err.println("Error");
}
} catch (Exception e) {
System.err.println("Error");
}
}
public static void main(String args[]) {
// JVM warm-up
for (int i = 0; i < 100000; i++) {
total += i;
}
// We know Scanner is slow; still warming up
ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
List<Long> longList = new ArrayList<>(50);
for (int i = 0; i < 50; i++) {
total = 0;
long startTime = System.nanoTime();
//readComplexDelimitedFile.readFileUsingScanner();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingScanner");
longList.forEach(System.out::println);
// Actual performance test starts here
longList = new ArrayList<>(10);
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingStreams();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingStreams");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingCustomBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingCustomBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingLineReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingLineReader");
longList.forEach(System.out::println);
}
}
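The timing loops in main repeat the same pattern for every reader. As a sketch only, a small helper could factor that pattern out; the class and method names below are hypothetical and not part of the benchmark above.

```java
import java.util.ArrayList;
import java.util.List;

class TimingHarness {
    // Runs the task `runs` times, prints each elapsed time in nanoseconds
    // under the given label, and returns the timings.
    static List<Long> time(String label, int runs, Runnable task) {
        List<Long> timings = new ArrayList<>(runs);
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            timings.add(System.nanoTime() - start);
        }
        System.out.println("Time taken for " + label);
        timings.forEach(System.out::println);
        return timings;
    }
}
```

Inside ReadComplexDelimitedFile, where the private read methods are visible, a call might look like `TimingHarness.time("readFileUsingBufferedReader", 10, readComplexDelimitedFile::readFileUsingBufferedReader);`.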
I had to rewrite BufferedReader to avoid synchronized and a couple of boundary conditions that are not needed. (At least that's what I felt. It is not unit tested, so use it at your own risk.)
import com.sun.istack.internal.NotNull;
import java.io.*;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
/**
* Reads text from a character-input stream, buffering characters so as to
* provide for the efficient reading of characters, arrays, and lines.
* <p>
* <p> The buffer size may be specified, or the default size may be used. The
* default is large enough for most purposes.
* <p>
* <p> In general, each read request made of a Reader causes a corresponding
* read request to be made of the underlying character or byte stream. It is
* therefore advisable to wrap a CustomBufferedReader around any Reader whose read()
* operations may be costly, such as FileReaders and InputStreamReaders. For
* example,
* <p>
* <pre>
* CustomBufferedReader in
* = new CustomBufferedReader(new FileReader("foo.in"));
* </pre>
* <p>
* will buffer the input from the specified file. Without buffering, each
* invocation of read() or readLine() could cause bytes to be read from the
* file, converted into characters, and then returned, which can be very
* inefficient.
* <p>
* <p> Programs that use DataInputStreams for textual input can be localized by
* replacing each DataInputStream with an appropriate CustomBufferedReader.
*
* @author Mark Reinhold
* @see FileReader
* @see InputStreamReader
* @see java.nio.file.Files#newBufferedReader
* @since JDK1.1
*/
public class CustomBufferedReader extends Reader {
private final Reader in;
private char cb[];
private int nChars, nextChar;
private static final int INVALIDATED = -2;
private static final int UNMARKED = -1;
private int markedChar = UNMARKED;
private int readAheadLimit = 0; /* Valid only when markedChar > 0 */
/**
* If the next character is a line feed, skip it
*/
private boolean skipLF = false;
/**
* The skipLF flag when the mark was set
*/
private boolean markedSkipLF = false;
private static int defaultCharBufferSize = 8192;
private static int defaultExpectedLineLength = 80;
private ReadWriteLock rwlock;
/**
* Creates a buffering character-input stream that uses an input buffer of
* the specified size.
*
* @param in A Reader
* @param sz Input-buffer size
* @throws IllegalArgumentException If {@code sz <= 0}
*/
public CustomBufferedReader(@NotNull final Reader in, int sz) {
super(in);
if (sz <= 0)
throw new IllegalArgumentException("Buffer size <= 0");
this.in = in;
cb = new char[sz];
nextChar = nChars = 0;
rwlock = new ReentrantReadWriteLock();
}
/**
* Creates a buffering character-input stream that uses a default-sized
* input buffer.
*
* @param in A Reader
*/
public CustomBufferedReader(@NotNull final Reader in) {
this(in, defaultCharBufferSize);
}
/**
* Fills the input buffer, taking the mark into account if it is valid.
*/
private void fill() throws IOException {
int dst;
if (markedChar <= UNMARKED) {
/* No mark */
dst = 0;
} else {
/* Marked */
int delta = nextChar - markedChar;
if (delta >= readAheadLimit) {
/* Gone past read-ahead limit: Invalidate mark */
markedChar = INVALIDATED;
readAheadLimit = 0;
dst = 0;
} else {
if (readAheadLimit <= cb.length) {
/* Shuffle in the current buffer */
System.arraycopy(cb, markedChar, cb, 0, delta);
markedChar = 0;
dst = delta;
} else {
/* Reallocate buffer to accommodate read-ahead limit */
char ncb[] = new char[readAheadLimit];
System.arraycopy(cb, markedChar, ncb, 0, delta);
cb = ncb;
markedChar = 0;
dst = delta;
}
nextChar = nChars = delta;
}
}
int n;
do {
n = in.read(cb, dst, cb.length - dst);
} while (n == 0);
if (n > 0) {
nChars = dst + n;
nextChar = dst;
}
}
/**
* Reads a single character.
*
* @return The character read, as an integer in the range
* 0 to 65535 (<tt>0x00-0xffff</tt>), or -1 if the
* end of the stream has been reached
* @throws IOException If an I/O error occurs
*/
public char readChar() throws IOException {
for (; ; ) {
if (nextChar >= nChars) {
fill();
if (nextChar >= nChars)
return (char) -1;
}
return cb[nextChar++];
}
}
/**
* Reads characters into a portion of an array, reading from the underlying
* stream if necessary.
*/
private int read1(char[] cbuf, int off, int len) throws IOException {
if (nextChar >= nChars) {
/* If the requested length is at least as large as the buffer, and
if there is no mark/reset activity, and if line feeds are not
being skipped, do not bother to copy the characters into the
local buffer. In this way buffered streams will cascade
harmlessly. */
if (len >= cb.length && markedChar <= UNMARKED && !skipLF) {
return in.read(cbuf, off, len);
}
fill();
}
if (nextChar >= nChars) return -1;
int n = Math.min(len, nChars - nextChar);
System.arraycopy(cb, nextChar, cbuf, off, n);
nextChar += n;
return n;
}
/**
* Reads characters into a portion of an array.
* <p>
* <p> This method implements the general contract of the corresponding
* <code>{@link Reader#read(char[], int, int) read}</code> method of the
* <code>{@link Reader}</code> class. As an additional convenience, it
* attempts to read as many characters as possible by repeatedly invoking
* the <code>read</code> method of the underlying stream. This iterated
* <code>read</code> continues until one of the following conditions becomes
* true: <ul>
* <p>
* <li> The specified number of characters have been read,
* <p>
* <li> The <code>read</code> method of the underlying stream returns
* <code>-1</code>, indicating end-of-file, or
* <p>
* <li> The <code>ready</code> method of the underlying stream
* returns <code>false</code>, indicating that further input requests
* would block.
* <p>
* </ul> If the first <code>read</code> on the underlying stream returns
* <code>-1</code> to indicate end-of-file then this method returns
* <code>-1</code>. Otherwise this method returns the number of characters
* actually read.
* <p>
* <p> Subclasses of this class are encouraged, but not required, to
* attempt to read as many characters as possible in the same fashion.
* <p>
* <p> Ordinarily this method takes characters from this stream's character
* buffer, filling it from the underlying stream as necessary. If,
* however, the buffer is empty, the mark is not valid, and the requested
* length is at least as large as the buffer, then this method will read
* characters directly from the underlying stream into the given array.
* Thus redundant <code>CustomBufferedReader</code>s will not copy data
* unnecessarily.
*
* @param cbuf Destination buffer
* @param off Offset at which to start storing characters
* @param len Maximum number of characters to read
* @return The number of characters read, or -1 if the end of the
* stream has been reached
* @throws IOException If an I/O error occurs
*/
public int read(char cbuf[], int off, int len) throws IOException {
int n = read1(cbuf, off, len);
if (n <= 0) return n;
while ((n < len) && in.ready()) {
int n1 = read1(cbuf, off + n, len - n);
if (n1 <= 0) break;
n += n1;
}
return n;
}
/**
* Reads a line of text. Unlike java.io.BufferedReader, this implementation
* treats only a line feed ('\n') as a line terminator; a carriage return
* ('\r') before it is left in the returned string.
*
* @param ignoreLF Not used by this implementation
* @return A String containing the contents of the line, not including
* the terminating line feed, or null if the end of the
* stream has been reached
* @throws IOException If an I/O error occurs
* @see java.io.LineNumberReader#readLine()
*/
String readLine(boolean ignoreLF) throws IOException {
StringBuilder s = null;
int startChar;
bufferLoop:
for (; ; ) {
if (nextChar >= nChars)
fill();
if (nextChar >= nChars) { /* EOF */
if (s != null && s.length() > 0)
return s.toString();
else
return null;
}
boolean eol = false;
char c = 0;
int i;
/* Scan the buffered characters for the next '\n' */
charLoop:
for (i = nextChar; i < nChars; i++) {
c = cb[i];
if ((c == '\n')) {
eol = true;
break charLoop;
}
}
startChar = nextChar;
nextChar = i;
if (eol) {
String str;
if (s == null) {
str = new String(cb, startChar, i - startChar);
} else {
s.append(cb, startChar, i - startChar);
str = s.toString();
}
nextChar++;
return str;
}
if (s == null)
s = new StringBuilder(defaultExpectedLineLength);
s.append(cb, startChar, i - startChar);
}
}
/**
* Reads a line of text. Unlike java.io.BufferedReader, this implementation
* treats only a line feed ('\n') as a line terminator; a carriage return
* ('\r') before it is left in the returned string.
*
* @return A String containing the contents of the line, not including
* the terminating line feed, or null if the end of the
* stream has been reached
* @throws IOException If an I/O error occurs
* @see java.nio.file.Files#readAllLines
*/
public String readLine() throws IOException {
return readLine(false);
}
/**
* Skips characters.
*
* @param n The number of characters to skip
* @return The number of characters actually skipped
* @throws IllegalArgumentException If <code>n</code> is negative.
* @throws IOException If an I/O error occurs
*/
public long skip(long n) throws IOException {
if (n < 0L) {
throw new IllegalArgumentException("skip value is negative");
}
rwlock.readLock().lock();
long r = n;
try{
while (r > 0) {
if (nextChar >= nChars)
fill();
if (nextChar >= nChars) /* EOF */
break;
if (skipLF) {
skipLF = false;
if (cb[nextChar] == '\n') {
nextChar++;
}
}
long d = nChars - nextChar;
if (r <= d) {
nextChar += r;
r = 0;
break;
} else {
r -= d;
nextChar = nChars;
}
}
} finally {
rwlock.readLock().unlock();
}
return n - r;
}
/**
* Tells whether this stream is ready to be read. A buffered character
* stream is ready if the buffer is not empty, or if the underlying
* character stream is ready.
*
* @throws IOException If an I/O error occurs
*/
public boolean ready() throws IOException {
rwlock.readLock().lock();
try {
/*
* If newline needs to be skipped and the next char to be read
* is a newline character, then just skip it right away.
*/
if (skipLF) {
/* Note that in.ready() will return true if and only if the next
* read on the stream will not block.
*/
if (nextChar >= nChars && in.ready()) {
fill();
}
if (nextChar < nChars) {
if (cb[nextChar] == '\n')
nextChar++;
skipLF = false;
}
}
} finally {
rwlock.readLock().unlock();
}
return (nextChar < nChars) || in.ready();
}
/**
* Tells whether this stream supports the mark() operation, which it does.
*/
public boolean markSupported() {
return true;
}
/**
* Marks the present position in the stream. Subsequent calls to reset()
* will attempt to reposition the stream to this point.
*
* @param readAheadLimit Limit on the number of characters that may be
* read while still preserving the mark. An attempt
* to reset the stream after reading characters
* up to this limit or beyond may fail.
* A limit value larger than the size of the input
* buffer will cause a new buffer to be allocated
* whose size is no smaller than limit.
* Therefore large values should be used with care.
* @throws IllegalArgumentException If {@code readAheadLimit < 0}
* @throws IOException If an I/O error occurs
*/
public void mark(int readAheadLimit) throws IOException {
if (readAheadLimit < 0) {
throw new IllegalArgumentException("Read-ahead limit < 0");
}
rwlock.readLock().lock();
try {
this.readAheadLimit = readAheadLimit;
markedChar = nextChar;
markedSkipLF = skipLF;
} finally {
rwlock.readLock().unlock();
}
}
/**
* Resets the stream to the most recent mark.
*
* @throws IOException If the stream has never been marked,
* or if the mark has been invalidated
*/
public void reset() throws IOException {
rwlock.readLock().lock();
try {
if (markedChar < 0)
throw new IOException((markedChar == INVALIDATED)
? "Mark invalid"
: "Stream not marked");
nextChar = markedChar;
skipLF = markedSkipLF;
} finally {
rwlock.readLock().unlock();
}
}
public void close() throws IOException {
rwlock.readLock().lock();
try {
in.close();
} finally {
cb = null;
rwlock.readLock().unlock();
}
}
public Stream<String> lines() {
Iterator<String> iter = new Iterator<String>() {
String nextLine = null;
@Override
public boolean hasNext() {
if (nextLine != null) {
return true;
} else {
try {
nextLine = readLine();
return (nextLine != null);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}
}
@Override
public String next() {
if (nextLine != null || hasNext()) {
String line = nextLine;
nextLine = null;
return line;
} else {
throw new NoSuchElementException();
}
}
};
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(
iter, Spliterator.ORDERED | Spliterator.NONNULL), false);
}
}
And now the results (times in nanoseconds, as returned by System.nanoTime()):
Time taken for readFileUsingBufferedReaderFileChannel
2902690903
1845190694
1894071377
1815161868
1861056735
1867693540
1857521371
1794176251
1768008762
1853089582
Time taken for readFileUsingBufferedReader
2022837353
1925901163
1802266711
1842689572
1899984555
1843101306
1998642345
1821242301
1820168806
1830375108
Time taken for readFileUsingStreams
1992855461
1930827034
1850876033
1843402533
1800378283
1863581324
1810857226
1798497108
1809531144
1796345853
Time taken for readFileUsingCustomBufferedReader
1759732702
1765987214
1776997357
1772999486
1768559162
1755248431
1744434555
1750349867
1740582606
1751390934
Time taken for readFileUsingLineReader
1845307174
1830950256
1829847321
1828125293
1827936280
1836947487
1832186310
1820276327
1830157935
1829171481
Process finished with exit code 0
Inference:
The test was run on a 200 MB file.
The test was repeated several times.
The data looked like this
Start Date^|^Start Time^|^End Date^|^End Time^|^Event Title ^|^All Day Event^|^No End Time^|^Event Description^|^Contact ^|^Contact Email^|^Contact Phone^|^Location^|^Category^|^Mandatory^|^Registration^|^Maximum^|^Last Date To Register
9/5/2011^|^3:00:00 PM^|^9/5/2011^|^^|^Social Studies Dept. Meeting^|^N^|^Y^|^Department meeting^|^Chris Gallagher^|^cgallagher@schoolwires.com^|^814-555-5179^|^High School^|^2^|^N^|^N^|^25^|^9/2/2011
Bottom line: there is not much difference between BufferedReader and my CustomBufferedReader. The gap is minuscule, so the standard BufferedReader is fine for reading your file.
Trust me, you don't have to rack your brain. Use BufferedReader with readLine; it is properly tested. At worst, if you feel you can improve it, just override it and change it to use StringBuilder instead of StringBuffer, just to shave off half a second.
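To put a number on "minuscule", the ten runs of each reader can be averaged. The sketch below copies the BufferedReader and CustomBufferedReader timings from the results above; the class and method names are hypothetical.

```java
class ResultSummary {
    // Nanosecond timings copied from the results listed above.
    static final long[] BUFFERED_READER = {
        2022837353L, 1925901163L, 1802266711L, 1842689572L, 1899984555L,
        1843101306L, 1998642345L, 1821242301L, 1820168806L, 1830375108L
    };
    static final long[] CUSTOM_BUFFERED_READER = {
        1759732702L, 1765987214L, 1776997357L, 1772999486L, 1768559162L,
        1755248431L, 1744434555L, 1750349867L, 1740582606L, 1751390934L
    };

    // Arithmetic mean of the runs, in nanoseconds.
    static double meanNanos(long[] runs) {
        long sum = 0;
        for (long r : runs) {
            sum += r;
        }
        return (double) sum / runs.length;
    }

    public static void main(String[] args) {
        double standard = meanNanos(BUFFERED_READER);
        double custom = meanNanos(CUSTOM_BUFFERED_READER);
        System.out.printf("BufferedReader mean:       %.3f s%n", standard / 1e9);
        System.out.printf("CustomBufferedReader mean: %.3f s%n", custom / 1e9);
        System.out.printf("Relative gap:              %.1f%%%n",
                100.0 * (standard - custom) / standard);
    }
}
```

On these figures the means come out at roughly 1.88 s and 1.76 s per pass over the 200 MB file, a gap of about 6 percent.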

I had a similar problem, but I only needed the bytes from the file. I read through links provided in the various answers, and ultimately tried writing one similar to #5 in Evgeniy's answer. They weren't kidding, it took a lot of code.
The basic premise is that each line of text is of unknown length. I start with a SeekableByteChannel, read data into a ByteBuffer, then loop over it looking for EOL. When a line carries over between reads, the code accumulates the end offset, then repositions the SeekableByteChannel to the start of the record and reads the entire record into a buffer sized for it.
It is verbose ... but it works. It was plenty fast for what I needed, but I'm sure there are more improvements that can be made.
The process method is stripped down to the basics for kicking off reading the file.
private long startOffset;
private long endOffset;
private SeekableByteChannel sbc;
private final ByteBuffer buffer = ByteBuffer.allocateDirect(1024);
public void process() throws IOException
{
startOffset = 0;
sbc = Files.newByteChannel(FILE, EnumSet.of(READ));
byte[] message = null;
while((message = readRecord()) != null)
{
// do something
}
}
public byte[] readRecord() throws IOException
{
endOffset = startOffset;
boolean eol = false;
boolean carryOver = false;
byte[] record = null;
while(!eol)
{
byte data;
buffer.clear();
final int bytesRead = sbc.read(buffer);
if(bytesRead == -1)
{
return null;
}
buffer.flip();
for(int i = 0; i < bytesRead && !eol; i++)
{
data = buffer.get();
if(data == '\r' || data == '\n')
{
eol = true;
endOffset += i;
if(carryOver)
{
final int messageSize = (int)(endOffset - startOffset);
sbc.position(startOffset);
final ByteBuffer tempBuffer = ByteBuffer.allocateDirect(messageSize);
sbc.read(tempBuffer);
tempBuffer.flip();
record = new byte[messageSize];
tempBuffer.get(record);
}
else
{
record = new byte[i];
// Need to move the buffer position back since the get moved it forward
buffer.position(0);
buffer.get(record, 0, i);
}
// Skip past the newline characters
if(isWindowsOS())
{
startOffset = (endOffset + 2);
}
else
{
startOffset = (endOffset + 1);
}
// Move the file position back
sbc.position(startOffset);
}
}
if(!eol && sbc.position() == sbc.size())
{
// We have hit the end of the file, just take all the bytes
record = new byte[bytesRead];
eol = true;
buffer.position(0);
buffer.get(record, 0, bytesRead);
}
else if(!eol)
{
// The EOL marker wasn't found, continue the loop
carryOver = true;
endOffset += bytesRead;
}
}
// System.out.println(new String(record));
return record;
}
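For comparison, here is a much shorter sketch of reading raw byte records. It swaps in a different technique from the channel code above: a plain InputStream (ideally a BufferedInputStream) read one byte at a time, with records terminated by '\n' or "\r\n". It does not treat a lone '\r' as a terminator, and the names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class ByteRecordReader {
    // Returns the next record without its line terminator,
    // or null once the end of the stream has been reached.
    static byte[] readRecord(InputStream in) throws IOException {
        ByteArrayOutputStream record = new ByteArrayOutputStream(256);
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                break;
            }
            record.write(b);
        }
        if (!sawAny) {
            return null; // nothing left to read
        }
        byte[] bytes = record.toByteArray();
        if (bytes.length > 0 && bytes[bytes.length - 1] == '\r') {
            return Arrays.copyOf(bytes, bytes.length - 1); // drop the CR of a CRLF pair
        }
        return bytes;
    }
}
```

Typical use would wrap the file once, for example `new BufferedInputStream(Files.newInputStream(path))`, and call readRecord in a loop until it returns null. The buffering makes the single-byte read() calls cheap.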

This article is a great way to start.
Also, you need to create test cases in which you read the first 10k lines (or some other number, but it shouldn't be too small) and calculate the reading times accordingly.
Threading might be a good way to go, but it's important to know what you will be doing with the data.
Another thing to consider is how you will store that much data.
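The suggestion to time a fixed prefix of the file can be sketched as below. The helper name and the choice to return both the line count and the elapsed time are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;

class FirstLinesTimer {
    // Reads at most maxLines lines and returns {linesRead, elapsedNanos}.
    static long[] timeFirstLines(BufferedReader reader, int maxLines) throws IOException {
        long start = System.nanoTime();
        long linesRead = 0;
        while (linesRead < maxLines && reader.readLine() != null) {
            linesRead++;
        }
        return new long[] { linesRead, System.nanoTime() - start };
    }
}
```

It could be called with `Files.newBufferedReader(path, StandardCharsets.UTF_8)` and a limit such as 10_000. Repeating the call several times and discarding the first runs reduces the influence of JIT warm-up and a cold file cache.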

I tried the following three methods on a 1 MB file.
I ran the program several times, and BufferedReader looks to be the fastest.
@Test
public void testLargeFileIO_Scanner() throws Exception {
long start = new Date().getTime();
String fileName = "/Downloads/SampleTextFile_1000kb.txt"; //this path is on my local
InputStream inputStream = new FileInputStream(fileName);
try (Scanner fileScanner = new Scanner(inputStream, StandardCharsets.UTF_8.name())) {
while (fileScanner.hasNextLine()) {
String line = fileScanner.nextLine();
//System.out.println(line);
}
}
long end = new Date().getTime();
long time = end - start;
System.out.println("Scanner Time Consumed => " + time);
}
@Test
public void testLargeFileIO_BufferedReader() throws Exception {
long start = new Date().getTime();
String fileName = "/Downloads/SampleTextFile_1000kb.txt"; //this path is on my local
try (BufferedReader fileBufferReader = new BufferedReader(new FileReader(fileName))) {
String fileLineContent;
while ((fileLineContent = fileBufferReader.readLine()) != null) {
//System.out.println(fileLineContent);
}
}
long end = new Date().getTime();
long time = (long) (end - start);
System.out.println("BufferedReader Time Consumed => " + time);
}
@Test
public void testLargeFileIO_Stream() throws Exception {
long start = new Date().getTime();
String fileName = "/Downloads/SampleTextFile_1000kb.txt"; //this path is on my local
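try (Stream<String> inputStream = Files.lines(Paths.get(fileName), StandardCharsets.UTF_8)) {
// Files.lines is lazy: without a terminal operation no line is read, so consume the stream
inputStream.forEach(line -> { /* System.out.println(line); */ });
}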
long end = new Date().getTime();
long time = end - start;
System.out.println("Stream Time Consumed => " + time);
}

Related

Java FileChannel Vs BufferedReader - Spring Batch - Reader

We process huge files (sometimes 50 GB per file). The application reads one such file and, based on the business logic, writes multiple output files (4-6).
The records in the file are of variable length, and the fields in each record are delimiter-separated.
We went by the understanding that reading a file using FileChannel with a ByteBuffer is always better than using BufferedReader.readLine and then splitting by the delimiter.
Buffer sizes tried: 10240 (10 KB) and larger
Commit interval: 5000, 10000, etc.
Below is how we used the file channel to read:
Read byte by byte. Check whether the byte read is a newline char (10), which means end of line.
Check for delimiter bytes. Capture the bytes read in a byte array (we initialized this byte array with a maximum field size of 350 bytes) until the delimiter bytes are encountered.
Convert the bytes read up to that point to a String using UTF-8 encoding, new String(byteArr, 0, index, "UTF-8") to be specific, where index is the number of bytes read until the delimiter.
Reading the file this way using FileChannel took 57 minutes to process the file.
We wanted to decrease this time, so we tried BufferedReader.readLine() followed by a split on the delimiter to see how it fares.
Shockingly, the same file completed processing in only 7 minutes.
What's the catch here? Why does FileChannel take more time than a buffered reader followed by a string split?
I was always under the assumption that the readLine and split combination would have a big performance impact.
Can anyone shed light on whether I was using FileChannel in the wrong way?
Thanks in advance. Hope I have summarized the issue properly.
Below is the sample code:
while (inputByteBuffer.hasRemaining() && (b = inputByteBuffer.get()) != 0){
boolean endOfField = false;
if (b == 10){
break;
}
else{
if (b == 94){//^
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b2 = inputByteBuffer.get();
if (b2 == 124){//|
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b3 = inputByteBuffer.get();
if (b3 == 94){//^
String field = new String(fieldBytes, 0, index, encoding);
if(fieldIndex == -1){
fields = new String[sizeFromAConfiguration];
}else{
fields[fieldIndex] = field;
}
fieldBytes = new byte[maxFieldSize];
endOfField = true;
fieldIndex++;
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b3, index);
}
}
else{
endOfFile = true;
//fields.add(new String(fieldBytes, 0, index, encoding));
fields[fieldIndex] = new String(fieldBytes, 0, index, encoding);
fieldBytes = new byte[maxFieldSize];
endOfField = true;
}
}else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
}
}else{
endOfFile = true;
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (endOfField){
index = 0;
}
else{
index++;
}
}
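For reference, the readLine-and-split variant that finished in 7 minutes is not shown in the question. A minimal sketch of what such a loop might look like follows, assuming the "^|^" delimiter from the code above. The names are hypothetical, and the split limit of -1 (which keeps trailing empty fields) is a choice made for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.regex.Pattern;

class DelimitedLineSplitter {
    // Pattern.quote escapes the regex metacharacters in "^|^";
    // the pattern is compiled once and reused for every line.
    private static final Pattern DELIMITER = Pattern.compile(Pattern.quote("^|^"));

    // Reads every line, splits it into fields and returns the total field count.
    static long countFields(BufferedReader reader) throws IOException {
        long total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = DELIMITER.split(line, -1); // -1 keeps trailing empty fields
            total += fields.length;
        }
        return total;
    }
}
```

In a real job the fields array would be handed to the business logic instead of being counted.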

You're causing a lot of overhead with the constant hasRemaining()/read() checks as well as the constant get() calls. It would probably be better to get() the entire buffer into an array and process that directly, only calling read() when you get to the end.
And to answer a question in comments, you should not allocate a new ByteBuffer per read. This is expensive. Keep using the same one. And NB do not use a DirectByteBuffer for this application. It is not appropriate: it's only appropriate when you want the data to stay south of the JVM/JNI boundary, e.g. when merely copying between channels.
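A sketch of that advice, under stated assumptions: one heap ByteBuffer is allocated once and reused, and the bytes are scanned through its backing array instead of through per-byte get() and hasRemaining() calls. Using array() rather than a bulk get() into a separate array is a variation chosen here; it works only for heap buffers created with ByteBuffer.allocate. The class name is hypothetical, and the loop merely counts line feeds where the field parsing would go.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

class BulkBufferScan {
    // Counts '\n' bytes from a blocking channel using one reusable heap buffer.
    static long countLines(ReadableByteChannel channel) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024); // allocated once, not per read
        byte[] array = buffer.array();                       // backing array of the heap buffer
        long count = 0;
        while (channel.read(buffer) != -1) {
            buffer.flip();
            int limit = buffer.limit();
            for (int i = 0; i < limit; i++) { // plain array loop, no get() per byte
                if (array[i] == '\n') {
                    count++;
                }
            }
            buffer.clear(); // reuse the same buffer for the next read
        }
        return count;
    }
}
```

With a FileChannel the call would be `BulkBufferScan.countLines(FileChannel.open(path, StandardOpenOption.READ))`, wrapped in try-with-resources.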
But I think I would throw this away, or rather rewrite it, using BufferedReader.read(), rather than readLine() followed by string splits, and using much the same logic as you have here, except of course that you don't need to keep calling hasRemaining() and filling the buffer, which BufferedReader will do automatically for you.
You have to take care to store the result of read() into an int, and to check it for -1 after every read().
It isn't clear to me that you should be using a Reader at all actually, unless you know you have multibyte text. Possibly a simple BufferedInputStream would be more appropriate.
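The read()-based rewrite suggested above might look like the following sketch. It assumes records end with '\n' and fields are separated by "^|^"; the class and method names are hypothetical. Each character is appended to the current field, and the field is cut whenever it ends with the delimiter.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

class CharDelimiterParser {
    private static final String DELIM = "^|^";

    // Reads one record and returns its fields, or null at end of stream.
    // Pass a BufferedReader so the single-character reads are served from its buffer.
    static List<String> readRecord(Reader in) throws IOException {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean sawAny = false;
        int c; // must be an int so that -1 (end of stream) can be detected
        while ((c = in.read()) != -1) {
            sawAny = true;
            if (c == '\n') {
                break;
            }
            field.append((char) c);
            if (endsWithDelimiter(field)) {
                field.setLength(field.length() - DELIM.length());
                fields.add(field.toString());
                field.setLength(0);
            }
        }
        if (!sawAny) {
            return null;
        }
        fields.add(field.toString()); // last field of the record
        return fields;
    }

    private static boolean endsWithDelimiter(StringBuilder sb) {
        int n = sb.length();
        int d = DELIM.length();
        if (n < d) {
            return false;
        }
        for (int i = 0; i < d; i++) {
            if (sb.charAt(n - d + i) != DELIM.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

The BufferedReader refills its own buffer, so none of the hasRemaining() and read() bookkeeping from the FileChannel version is needed.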

While one cannot tell with certainty how particular code will behave, I would imagine the best way is to profile it, just as you did. FileChannel, while perceived to be faster, is actually not helping in your case. But this may not be because of the reading from the file but because of the actual processing you do with the content you read.
One article I would like to point out on dealing with files is
https://www.redgreencode.com/why-is-java-io-slow/
Also the corresponding GitHub codebase:
Java IO benchmark
I would like to point out this code, which uses a combination of both worlds:
fos = new FileOutputStream(outputFile);
outFileChannel = fos.getChannel();
bufferedWriter = new BufferedWriter(Channels.newWriter(outFileChannel, "UTF-8"));
Since it is a read in your case, I would consider
File inputFile = new File("C:\\input.txt");
FileInputStream fis = new FileInputStream(inputFile);
FileChannel inputChannel = fis.getChannel();
BufferedReader bufferedReader = new BufferedReader(Channels.newReader(inputChannel,"UTF-8"));
I would also tweak the chunk size; with Spring Batch it is always trial and error to find the sweet spot.
On a completely unrelated note, the reason you are not able to use BufferedReader is the doubling of characters, and I am assuming this happens more commonly with EBCDIC characters. I would simply run a loop like this to identify the troublemakers and eliminate them at the source.
import java.io.UnsupportedEncodingException;
public class EbcdicConvertor {
public static void main(String[] args) throws UnsupportedEncodingException {
int index = 0;
for (int i = -127; i < 128; i++) {
byte[] b = new byte[1];
b[0] = (byte) i;
String cp037 = new String(b, "CP037");
if (cp037.getBytes().length == 2) {
index++;
System.out.println(i + "::" + cp037);
}
}
System.out.println(index);
}
}
The above answer was written without testing my actual hypothesis. Here is an actual program to measure the time. The results, on a 200 MB file, speak for themselves.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
public class ReadComplexDelimitedFile {
private static long total = 0;
private static final Pattern DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");
private void readFileUsingScanner() {
String s;
try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
while (stdin.hasNextLine()) {
s = stdin.nextLine();
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingCustomBufferedReader() {
try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReader() {
try (java.io.BufferedReader stdin = new java.io.BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
}
} catch (Exception e) {
System.err.println("Error");
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderByteFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
int b;
StringBuilder sb = new StringBuilder();
while ((b = stdin.read()) != -1) {
if (b == 10) {
total = total + DELIMITER_PATTERN.split(sb, 0).length;
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingFileChannelStream() {
try (RandomAccessFile fis = new RandomAccessFile(new File(this.getClass().getResource("input.txt").getPath()), "r")) {
try (FileChannel inputChannel = fis.getChannel()) {
ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
ByteBuffer recordBuffer = ByteBuffer.allocate(250);
int recordLength = 0;
while ((inputChannel.read(byteBuffer)) != -1) {
byte b;
byteBuffer.flip();
while (byteBuffer.hasRemaining() && (b = byteBuffer.get()) != -1) {
if (b == 10) {
recordBuffer.flip();
total = total + splitIntoFields(recordBuffer, recordLength);
recordBuffer.clear();
recordLength = 0;
} else {
++recordLength;
recordBuffer.put(b);
}
}
byteBuffer.clear();
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
private int splitIntoFields(ByteBuffer recordBuffer, int recordLength) {
byte b;
String[] fields = new String[17];
int fieldCount = -1;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < recordLength - 1; i++) {
b = recordBuffer.get(i);
if (b == 94 && recordBuffer.get(++i) == 124 && recordBuffer.get(++i) == 94) {
fields[++fieldCount] = sb.toString();
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
fields[++fieldCount] = sb.toString();
return fields.length;
}
public static void main(String args[]) {
// JVM warmup
for (int i = 0; i < 100000; i++) {
total += i;
}
// We know Scanner is slow; this loop is still part of the warm-up
ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
List<Long> longList = new ArrayList<>(50);
for (int i = 0; i < 50; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingScanner();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingScanner");
longList.forEach(System.out::println);
// Actual performance test starts here
longList = new ArrayList<>(10);
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingCustomBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingCustomBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderByteFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderByteFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingFileChannelStream();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingFileChannelStream");
longList.forEach(System.out::println);
}
}
BufferedReader was written a long time ago, so we can rewrite the parts relevant to this example. For instance, we don't care about \r, skipLF, skipCR, and the like.
We are only going to read the file (no need for synchronized).
By extension there is no need for StringBuffer; StringBuilder can be used instead. A performance improvement is seen immediately.
Dangerous hack: remove synchronized and replace StringBuffer with StringBuilder. Don't use it without proper testing or without knowing what you are doing.
public String readLine() throws IOException {
StringBuilder s = null;
int startChar;
bufferLoop:
for (; ; ) {
if (nextChar >= nChars)
fill();
if (nextChar >= nChars) { /* EOF */
if (s != null && s.length() > 0)
return s.toString();
else
return null;
}
boolean eol = false;
char c = 0;
int i;
/* Skip a leftover '\n', if necessary */
charLoop:
for (i = nextChar; i < nChars; i++) {
c = cb[i];
if (c == '\n') {
eol = true;
break charLoop;
}
}
startChar = nextChar;
nextChar = i;
if (eol) {
String str;
if (s == null) {
str = new String(cb, startChar, i - startChar);
} else {
s.append(cb, startChar, i - startChar);
str = s.toString();
}
nextChar++;
return str;
}
if (s == null)
s = new StringBuilder(defaultExpectedLineLength);
s.append(cb, startChar, i - startChar);
}
}
Environment: Java 8, Intel i5, 12 GB RAM, Windows 10
Result (each value is one run, in nanoseconds as measured by System.nanoTime()):
Time taken for readFileUsingBufferedReaderFileChannel::
2581635057 1849820885 1763992972 1770510738 1746444157 1733491399
1740530125 1723907177 1724280512 1732445638
Time taken for readFileUsingBufferedReader
1851027073 1775304769 1803507033 1789979554 1786974538 1802675458
1789672780 1798036307 1789847714 1785302003
Time taken for readFileUsingCustomBufferedReader
1745220476 1721039975 1715383650 1728548462 1724746005 1718177466
1738026017 1748077438 1724608192 1736294175
Time taken for readFileUsingBufferedReaderByteFileChannel
2872857919 2480237636 2917488143 2913491126 2880117231 2904614745
2911756298 2878777496 2892169722 2888091211
Time taken for readFileUsingFileChannelStream
3039447073 2896156498 2538389366 2906287280 2887612064 2929288046
2895626578 2955326255 2897535059 2884476915
Process finished with exit code 0

I tried NIO with all possible options (those provided in this post, and to the best of my knowledge and research) and found that it came nowhere close to BufferedReader for reading a text file.
After changing BufferedReader to use StringBuilder in place of StringBuffer, I didn't see any significant improvement in performance (only a few seconds for some files, and some of them were better with StringBuffer itself).
Removing the synchronized block also gave little or no improvement, and it is not worth tweaking something from which we gain no benefit.
Below is the time taken (reading, processing, writing; the time taken for processing and writing is not significant, not even 20% of the total) for a file of around 50 GB:
NIO : 71.67 (Minutes)
IO (BufferedReader) : 10.84 (Minutes)
Thank you all for taking the time to read and respond to this post and to provide suggestions.

The main issue here is creating a new byte[] very rapidly (fieldBytes = new byte[maxFieldSize];).
Since a new array is created on every iteration, garbage collection is kicked off very often, which triggers "stop the world" pauses to reclaim the memory.
The object creation itself could also be expensive.
We could instead initialize the byte array once and then track the indexes, converting the field to a string up to an end index.
In any case, BufferedReader is faster than FileChannel, at least for reading ASCII files, and to keep the code simple we continued using BufferedReader.
Using BufferedReader also reduces the development and testing effort, since there is no tedious logic to find delimiters and populate the object.
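The idea of allocating the byte array once and tracking an end index could look roughly like the sketch below. The names ReusedFieldBuffer, splitFields and MAX_FIELD_SIZE, the single-byte '|' delimiter and the ASCII charset are all assumptions made for illustration, not part of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ReusedFieldBuffer {
    // Assumed upper bound on a field's length; a longer field would overflow the array.
    private static final int MAX_FIELD_SIZE = 256;
    // Assumed single-byte delimiter, chosen only for this sketch.
    private static final byte DELIMITER = '|';

    static List<String> splitFields(byte[] record) {
        List<String> fields = new ArrayList<>();
        byte[] fieldBytes = new byte[MAX_FIELD_SIZE]; // allocated once, reused for every field
        int end = 0; // number of valid bytes currently held in fieldBytes
        for (byte b : record) {
            if (b == DELIMITER) {
                // Convert only the valid prefix [0, end) to a String.
                fields.add(new String(fieldBytes, 0, end, StandardCharsets.US_ASCII));
                end = 0; // "clear" the buffer by resetting the index, not by reallocating
            } else {
                fieldBytes[end++] = b;
            }
        }
        // The last field has no trailing delimiter.
        fields.add(new String(fieldBytes, 0, end, StandardCharsets.US_ASCII));
        return fields;
    }

    public static void main(String[] args) {
        byte[] record = "id42|Alice|Berlin".getBytes(StandardCharsets.US_ASCII);
        System.out.println(splitFields(record));
    }
}
```

Stale bytes beyond the end index are simply ignored, so the array never needs to be zeroed or replaced between fields.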

How to split binary data into hex strings when characters are in the start and end of the strings

I want to split data based on character values: two right parentheses )) mark the start of a substring and a carriage return CR marks its end. The data comes in the form of bytes. I am stuck on how to split it. This is what I have come up with so far.
public class ByteDecoder {
public static void main(String[] args) throws IOException {
InputStream is = null;
DataInputStream dis = null;
try{
is = new FileInputStream("byte.log");
dis = new DataInputStream(is);
int count = is.available();
byte[] bs = new byte[count];
dis.read(bs);
for (byte b:bs)
{
char c = (char)b;
System.out.println(c);
//convert bytes to hex string
// String c = DatatypeConverter.printHexBinary( bs);
}
}catch(Exception e){
e.printStackTrace();
}finally{
if(is!=null)
is.close();
if(dis!=null)
dis.close();
}
}
}

Using CR (byte value 13) as the end marker of binary data is a bit dangerous, since the binary data itself may contain that byte. Even more dangerous is how the text and bytes were written: the text must have been written as bytes in some encoding.
With that in mind, you could wrap the FileInputStream in your own ByteLogInputStream, which holds the reading state:
/**
* An InputStream converting bytes between ASCII "))" and CR to hexadecimal.
* Typically wrapped as:
* <pre>
* try (BufferedReader in = new BufferedReader(
* new InputStreamReader(
* new ByteLogInputStream(
* new FileInputStream(file)), "UTF-8"))) {
* ...
* }
* </pre>
*/
public class ByteLogInputStream extends InputStream {
private enum State {
TEXT,
AFTER_RIGHT_PARENT,
BINARY
}
private final InputStream in;
private State state = State.TEXT;
private int nextHexDigit = 0;
public ByteLogInputStream(InputStream in) {
this.in = in;
}
@Override
public int read() throws IOException {
if (nextHexDigit != 0) {
int hex = nextHexDigit;
nextHexDigit = 0;
return hex;
}
int ch = in.read();
if (ch != -1) {
switch (state) {
case TEXT:
if (ch == ')') {
state = State.AFTER_RIGHT_PARENT;
}
break;
case AFTER_RIGHT_PARENT:
if (ch == ')') {
state = State.BINARY;
} else {
// A single ')' was not a start marker; go back to plain text.
state = State.TEXT;
}
break;
case BINARY:
if (ch == '\r') {
state = State.TEXT;
} else {
String hex2 = String.format("%02X", ch);
ch = hex2.charAt(0);
nextHexDigit = hex2.charAt(1);
}
break;
}
}
return ch;
}
}
As one binary byte results in two hexadecimal digits, you need to buffer the second digit in nextHexDigit.
I did not override available() (to account for a possible nextHexDigit).
If you want to check whether \r\n follows, you should use a PushbackReader. I used an InputStream, as you did not specify the encoding.

Unchecked or unsafe operations error in javac

I am completing a lab assignment for school and get this error when I compile. The program runs fine, but I would like to fix what is causing the error. The program code and the complete error are below. Thanks as always!
Note: Recompile with -Xlint:unchecked for details.
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package ie.moguntia.webcrawler;
import java.net.*;
import java.io.*;
import java.util.*;
/**
*
* @author Cong
*/
public class SaveURL
{
/**
* Opens a buffered stream on the url and copies the contents to writer
*/
public static void saveURL(URL url, Writer writer)
throws IOException {
BufferedInputStream in = new BufferedInputStream(url.openStream());
for (int c = in.read(); c != -1; c = in.read()) {
writer.write(c);
}
}
/**
* Opens a buffered stream on the url and copies the contents to OutputStream
*/
public static void saveURL(URL url, OutputStream os)
throws IOException {
InputStream is = url.openStream();
byte[] buf = new byte[1048576];
int n = is.read(buf);
while (n != -1) {
os.write(buf, 0, n);
n = is.read(buf);
}
}
/**
* Writes the contents of the url to a string by calling saveURL with a
* string writer as argument
*/
public static String getURL(URL url)
throws IOException {
StringWriter sw = new StringWriter();
saveURL(url, sw);
return sw.toString();
}
/**
* Writes the contents of the url to a new file by calling saveURL with
* a file writer as argument
*/
public static void writeURLtoFile(URL url, String filename)
throws IOException {
// FileWriter writer = new FileWriter(filename);
// saveURL(url, writer);
// writer.close();
FileOutputStream os = new FileOutputStream(filename);
saveURL(url, os);
os.close();
}
/**
* Extract links directly from a URL by calling extractLinks(getURL())
*/
public static Vector extractLinks(URL url)
throws IOException {
return extractLinks(getURL(url));
}
public static Map extractLinksWithText(URL url)
throws IOException {
return extractLinksWithText(getURL(url));
}
/**
* Extract links from a html page given as a raw and a lower case string
* In order to avoid the possible double conversion from mixed to lower case
* a second method is provided, where the conversion is done externally.
*/
public static Vector extractLinks(String rawPage, String page) {
int index = 0;
Vector links = new Vector();
while ((index = page.indexOf("<a ", index)) != -1)
{
if ((index = page.indexOf("href", index)) == -1) break;
if ((index = page.indexOf("=", index)) == -1) break;
String remaining = rawPage.substring(++index);
StringTokenizer st
= new StringTokenizer(remaining, "\t\n\r\"'>#");
String strLink = st.nextToken();
if (! links.contains(strLink)) links.add(strLink);
}
return links;
}
/**
* Extract links (key) with link text (value)
* Note that due to the nature of a Map only one link text is returned per
* URL, even if a link occurs multiple times with different texts.
*/
public static Map extractLinksWithText(String rawPage, String page) {
int index = 0;
Map links = new HashMap();
while ((index = page.indexOf("<a ", index)) != -1)
{
int tagEnd = page.indexOf(">", index);
if ((index = page.indexOf("href", index)) == -1) break;
if ((index = page.indexOf("=", index)) == -1) break;
int endTag = page.indexOf("</a", index);
String remaining = rawPage.substring(++index);
StringTokenizer st
= new StringTokenizer(remaining, "\t\n\r\"'>#");
String strLink = st.nextToken();
String strText = "";
if (tagEnd != -1 && tagEnd + 1 <= endTag) {
strText = rawPage.substring(tagEnd + 1, endTag);
}
strText = strText.replaceAll("\\s+", " ");
links.put(strLink, strText);
}
return links;
}
/**
* Extract links from a html page given as a String
* The return value is a vector of strings. This method does neither check
* the validity of its results nor does it care about html comments, so
* links that are commented out are also retrieved.
*/
public static Vector extractLinks(String rawPage) {
return extractLinks(rawPage, rawPage.toLowerCase().replaceAll("\\s", " "));
}
public static Map extractLinksWithText(String rawPage) {
return extractLinksWithText(rawPage, rawPage.toLowerCase().replaceAll("\\s", " "));
}
/**
* As a standalone program this class is capable of copying a url to a file
*/
public static void main(String[] args) {
try {
if (args.length == 1) {
URL url = new URL(args[0]);
System.out.println("Content-Type: " +
url.openConnection().getContentType());
// Vector links = extractLinks(url);
// for (int n = 0; n < links.size(); n++) {
// System.out.println((String) links.elementAt(n));
// }
Set links = extractLinksWithText(url).entrySet();
Iterator it = links.iterator();
while (it.hasNext()) {
Map.Entry en = (Map.Entry) it.next();
String strLink = (String) en.getKey();
String strText = (String) en.getValue();
System.out.println(strLink + " \"" + strText + "\" ");
}
return;
} else if (args.length == 2) {
writeURLtoFile(new URL(args[0]), args[1]);
return;
}
} catch (Exception e) {
System.err.println("An error occured: ");
e.printStackTrace();
// System.err.println(e.toString());
}
// Display usage information
// (If the program had done anything sensible, we wouldn't be here.)
System.err.println("Usage: java SaveURL <url> [<file>]");
System.err.println("Saves a URL to a file.");
System.err.println("If no file is given, extracts hyperlinks on url to console.");
}
}

You are using the raw (i.e. non-generic) forms of several classes that have generic type parameters, including
Map
HashMap
Vector
Iterator
Set
Map.Entry
Use the generic forms of these classes by supplying appropriate type parameters.
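As a minimal sketch of what the generic forms look like, the example below mirrors the shape of the code in the question (a Vector of links, a Map from link to link text, and the loop over the entry set) using String type parameters. The class GenericLinksDemo, its method names and the sample URLs are made up for illustration; LinkedHashMap is used here only to get a predictable iteration order, whereas the question uses HashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Vector;

final class GenericLinksDemo {
    // Vector<String> instead of the raw Vector: contains() and add() are now type-checked.
    static Vector<String> sampleLinkList() {
        Vector<String> links = new Vector<>();
        String strLink = "http://example.com/a";
        if (!links.contains(strLink)) links.add(strLink);
        return links;
    }

    // Map<String, String> instead of the raw Map: link (key) to link text (value).
    static Map<String, String> sampleLinksWithText() {
        Map<String, String> links = new LinkedHashMap<>();
        links.put("http://example.com/a", "First link");
        links.put("http://example.com/b", "Second link");
        return links;
    }

    public static void main(String[] args) {
        // With type parameters, no Iterator and no (String) casts are needed.
        for (Map.Entry<String, String> en : sampleLinksWithText().entrySet()) {
            String strLink = en.getKey();
            String strText = en.getValue();
            System.out.println(strLink + " \"" + strText + "\" ");
        }
    }
}
```

Once every declaration and return type carries its type parameters, the compiler can verify the element types itself and the unchecked warning should no longer be reported for those uses.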

Reading files bits and saving them

I have a file reader which reads an entire file and writes out its bits.
I have this class which helps with reading:
import java.io.*;
public class FileReader extends ByteArrayInputStream{
private int bitsRead;
private int bitPosition;
private int currentByte;
private int myMark;
private final static int NUM_BITS_IN_BYTE = 8;
private final static int END_POSITION = -1;
private boolean readingStarted;
/**
* Create a BitInputStream for a File on disk.
*/
public FileReader( byte[] buf ) throws IOException {
super( buf );
myMark = 0;
bitsRead = 0;
bitPosition = NUM_BITS_IN_BYTE-1;
currentByte = 0;
readingStarted = false;
}
/**
* Read a binary "1" or "0" from the File.
*/
public int readBit() throws IOException {
int theBit = -1;
if( bitPosition == END_POSITION || !readingStarted ) {
currentByte = super.read();
bitPosition = NUM_BITS_IN_BYTE-1;
readingStarted = true;
}
theBit = (0x01 << bitPosition) & currentByte;
bitPosition--;
if( theBit > 0 ) {
theBit = 1;
}
return( theBit );
}
/**
* Return the next byte in the File as lowest 8 bits of int.
*/
public int read() {
currentByte = super.read();
bitPosition = END_POSITION;
readingStarted = true;
return( currentByte );
}
/**
*
*/
public void mark( int readAheadLimit ) {
super.mark(readAheadLimit);
myMark = bitPosition;
}
/**
* Add needed functionality to super's reset() method. Reset to
* the last valid position marked in the input stream.
*/
public void reset() {
super.pos = super.mark-1;
currentByte = super.read();
bitPosition = myMark;
}
/**
* Returns the number of bits still available to be read.
*/
public int availableBits() throws IOException {
return( ((super.available() * 8) + (bitPosition + 1)) );
}
}
In the class where I call this, I do:
FileInputStream inputStream = new FileInputStream(file);
byte[] fileBits = new byte[inputStream.available()];
inputStream.read(fileBits, 0, inputStream.available());
inputStream.close();
FileReader bitIn = new FileReader(fileBits);
and this works correctly.
However, I have problems with big files above 100 MB because a byte[] has a maximum size.
So I want to read bigger files. Could someone suggest how I can improve this code?
Thanks.

If scaling to large file sizes is important, you'd be better off not reading the entire file into memory. The downside is that handling the IOException in more locations can be a little messy. Also, it doesn't look like your application needs something that implements the InputStream API, it just needs the readBit() method. So, you can safely encapsulate, rather than extend, the InputStream.
class FileReader {
private final InputStream src;
private final byte[] bits = new byte[8192];
private int len;
private int pos;
FileReader(InputStream src) {
this.src = src;
}
int readBit() throws IOException {
int idx = pos / 8;
if (idx >= len) {
int n = src.read(bits);
if (n < 0)
return -1;
len = n;
pos = 0;
idx = 0;
}
// Bits are returned most significant first, matching the readBit() in the question.
return ((bits[idx] & (0x80 >> (pos++ % 8))) == 0) ? 0 : 1;
}
}
Usage would look similar.
FileInputStream src = new FileInputStream(file);
try {
FileReader bitIn = new FileReader(src);
...
} finally {
src.close();
}
If you really do want to read in the entire file, and you are working with an actual file, you can query the length of the file first.
File file = new File(path);
if (file.length() > Integer.MAX_VALUE)
throw new IllegalArgumentException("File is too large: " + file.length());
int len = (int) file.length();
FileInputStream inputStream = new FileInputStream(file);
try {
byte[] fileBits = new byte[len];
for (int pos = 0; pos < len; ) {
int n = inputStream.read(fileBits, pos, len - pos);
if (n < 0)
throw new EOFException();
pos += n;
}
/* Use bits. */
...
} finally {
inputStream.close();
}

org.apache.commons.io.IOUtils.copy(InputStream in, OutputStream out)

Multiple readers for InputStream in Java

I have an InputStream from which I'm reading characters. I would like multiple readers to access this InputStream. It seems that a reasonable way to achieve this is to write incoming data to a StringBuffer or StringBuilder, and have the multiple readers read that. Unfortunately, StringBufferInputStream is deprecated. StringReader reads a string, not a mutable object that's continuously being updated. What are my options? Write my own?

Note: My other answer is more general (and better in my opinion).
As noted by @dimo414, the answer below requires the first reader to always be ahead of the second reader. If this is indeed the case for you, then this answer might still be preferable since it builds upon standard classes.
To create two readers that read independently from the same source, you'll have to make sure they don't consume data from the same stream.
This can be achieved by combining TeeInputStream from Apache Commons and a PipedInputStream and PipedOutputStream as follows:
import java.io.*;
import org.apache.commons.io.input.TeeInputStream;
class Test {
public static void main(String[] args) throws IOException {
// Create the source input stream.
InputStream is = new FileInputStream("filename.txt");
// Create a piped input stream for one of the readers.
PipedInputStream in = new PipedInputStream();
// Create a tee-splitter for the other reader.
TeeInputStream tee = new TeeInputStream(is, new PipedOutputStream(in));
// Create the two buffered readers.
BufferedReader br1 = new BufferedReader(new InputStreamReader(tee));
BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
// Do some interleaved reads from them.
System.out.println("One line from br1:");
System.out.println(br1.readLine());
System.out.println();
System.out.println("Two lines from br2:");
System.out.println(br2.readLine());
System.out.println(br2.readLine());
System.out.println();
System.out.println("One line from br1:");
System.out.println(br1.readLine());
System.out.println();
}
}
Output:
One line from br1:
Line1: Lorem ipsum dolor sit amet, <-- reading from start
Two lines from br2:
Line1: Lorem ipsum dolor sit amet, <-- reading from start
Line2: consectetur adipisicing elit,
One line from br1:
Line2: consectetur adipisicing elit, <-- resumes on line 2

As you've probably noted, once you've read a byte from an input stream, it's gone forever (unless you've saved it somewhere yourself).
The solution below does save the bytes until all subscribing input streams have read it.
It works as follows:
// Create a SplittableInputStream from the originalStream
SplittableInputStream is = new SplittableInputStream(originalStream);
// Fork this to get more input streams reading independently from originalStream
SplittableInputStream is2 = is.split();
SplittableInputStream is3 = is.split();
Each call to is.split() yields a new InputStream that reads the bytes from the point at which is was split.
The SplittableInputStream looks as follows (copy'n'paste away!):
class SplittableInputStream extends InputStream {
// Almost an input stream: The read-method takes an id.
static class MultiplexedSource {
static int MIN_BUF = 4096;
// Underlying source
private InputStream source;
// Read positions of each SplittableInputStream
private List<Integer> readPositions = new ArrayList<>();
// Data to be read by the SplittableInputStreams
int[] buffer = new int[MIN_BUF];
// Last valid position in buffer
int writePosition = 0;
public MultiplexedSource(InputStream source) {
this.source = source;
}
// Add a multiplexed reader. Return new reader id.
int addSource(int splitId) {
readPositions.add(splitId == -1 ? 0 : readPositions.get(splitId));
return readPositions.size() - 1;
}
// Make room for more data (and drop data that has been read by
// all readers)
private void readjustBuffer() {
int from = Collections.min(readPositions);
int to = Collections.max(readPositions);
int newLength = Math.max((to - from) * 2, MIN_BUF);
int[] newBuf = new int[newLength];
System.arraycopy(buffer, from, newBuf, 0, to - from);
for (int i = 0; i < readPositions.size(); i++)
readPositions.set(i, readPositions.get(i) - from);
writePosition -= from;
buffer = newBuf;
}
// Read and advance position for given reader
public int read(int readerId) throws IOException {
// Enough data in buffer?
if (readPositions.get(readerId) >= writePosition) {
readjustBuffer();
buffer[writePosition++] = source.read();
}
int pos = readPositions.get(readerId);
int b = buffer[pos];
if (b != -1)
readPositions.set(readerId, pos + 1);
return b;
}
}
// Non-root fields
MultiplexedSource multiSource;
int myId;
// Public constructor: Used for first SplittableInputStream
public SplittableInputStream(InputStream source) {
multiSource = new MultiplexedSource(source);
myId = multiSource.addSource(-1);
}
// Private constructor: Used in split()
private SplittableInputStream(MultiplexedSource multiSource, int splitId) {
this.multiSource = multiSource;
myId = multiSource.addSource(splitId);
}
// Returns a new InputStream that will read bytes from this position
// onwards.
public SplittableInputStream split() {
return new SplittableInputStream(multiSource, myId);
}
@Override
public int read() throws IOException {
return multiSource.read(myId);
}
}
Finally, a demo:
String str = "Lorem ipsum\ndolor sit\namet\n";
InputStream is = new ByteArrayInputStream(str.getBytes("UTF-8"));
// Create the two buffered readers.
SplittableInputStream is1 = new SplittableInputStream(is);
SplittableInputStream is2 = is1.split();
BufferedReader br1 = new BufferedReader(new InputStreamReader(is1));
BufferedReader br2 = new BufferedReader(new InputStreamReader(is2));
// Do some interleaved reads from them.
System.out.println("One line from br1:");
System.out.println(br1.readLine());
System.out.println();
System.out.println("Two lines from br2:");
System.out.println(br2.readLine());
System.out.println(br2.readLine());
System.out.println();
System.out.println("One line from br1:");
System.out.println(br1.readLine());
System.out.println();
Output:
One line from br1:
Lorem ipsum
Two lines from br2:
Lorem ipsum
dolor sit
One line from br1:
dolor sit

Use TeeInputStream to copy all the bytes read from InputStream to secondary OutputStream, e.g. ByteArrayOutputStream.

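An input stream works like this: once you read a portion of it, it's gone forever. You can't go back and re-read it. What you could do is something like this:
import java.io.*;
import java.util.*;
class InputStreamSplitter {
private final BufferedReader reader;
private final List<Listener> listeners = new ArrayList<>();
InputStreamSplitter(InputStream toReadFrom) {
// BufferedReader is needed for readLine(); InputStreamReader alone has no such method.
this.reader = new BufferedReader(new InputStreamReader(toReadFrom));
}
void addListener(Listener l) {
this.listeners.add(l);
}
void work() throws IOException {
String line;
// Read each line once and hand it to every listener.
while ((line = this.reader.readLine()) != null) {
for (Listener l : this.listeners) {
l.processLine(line);
}
}
}
}
interface Listener {
void processLine(String line);
}
Have all interested parties implement Listener and add them to the InputStreamSplitter.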

Instead of using StringWriter/StringBufferInputStream, write your original InputStream to a ByteArrayOutputStream. Once you've finished reading from the original InputStream, pass the byte array returned from ByteArrayOutputStream.toByteArray to a ByteArrayInputStream. Use this InputStream as the InputStream of choice for passing around other things that need to read from it.
Essentially, all you'd be doing here is storing the contents of the original InputStream into a byte[] cache in memory as you tried to do originally with StringWriter/StringBufferInputStream.
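A minimal sketch of this caching approach, using only standard classes, might look as follows. The class name CachedStreamDemo and the helper readFully are hypothetical, and the approach assumes the whole stream fits in memory.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class CachedStreamDemo {
    /** Drains the stream into a byte[]; only suitable when the data fits in the heap. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream cache = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            cache.write(chunk, 0, n);
        }
        return cache.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream original =
            new ByteArrayInputStream("line1\nline2\n".getBytes(StandardCharsets.UTF_8));
        byte[] data = readFully(original);

        // Each reader gets its own ByteArrayInputStream over the same cached bytes.
        BufferedReader br1 = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
        BufferedReader br2 = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));

        // Interleaved reads: the two readers advance independently.
        System.out.println(br1.readLine());
        System.out.println(br2.readLine());
        System.out.println(br2.readLine());
        System.out.println(br1.readLine());
    }
}
```

Because every ByteArrayInputStream keeps its own position over the shared array, any number of readers can be created, but none of them can start until the original stream has been read to the end.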

Here's another way to read from two streams independently, without presuming one is ahead of the other, but with standard classes. It does, however, eagerly read from the underlying input stream in the background, which may be undesirable, depending on your application.
public static void main(String[] args) throws IOException {
    // Create the source input stream.
    InputStream is = new ByteArrayInputStream("line1\nline2\nline3".getBytes());
    // Create a piped input stream for each reader.
    PipedInputStream in1 = new PipedInputStream();
    PipedInputStream in2 = new PipedInputStream();
    // Start copying the input stream to both piped input streams.
    // TeeOutputStream and IOUtils are not JDK classes; they appear to be the Apache Commons IO ones.
    startCopy(is, new TeeOutputStream(
            new PipedOutputStream(in1), new PipedOutputStream(in2)));
    // Create the two buffered readers.
    BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
    BufferedReader br2 = new BufferedReader(new InputStreamReader(in2));
    // Do some interleaved reads from them.
    // ...
}
private static void startCopy(final InputStream in, final OutputStream out) {
    new Thread() {
        @Override
        public void run() {
            try {
                try {
                    IOUtils.copy(in, out);
                } finally {
                    // Close the pipes so the readers see end of stream rather than a broken pipe.
                    out.close();
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }.start();
}
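The snippet above relies on TeeOutputStream and IOUtils, which are not part of the JDK. As a sketch of the same pattern with JDK classes only, the copy thread below writes each chunk to two PipedOutputStreams and closes them when the source is exhausted, so that both readers reach end of stream. The names PipedFanOut, startCopy and readBoth are chosen for this illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PipedFanOut {
    // Copies the source into both pipes on a background thread, then closes
    // everything so that each reader eventually sees end of stream.
    static Thread startCopy(InputStream source, PipedOutputStream out1, PipedOutputStream out2) {
        Thread copier = new Thread(() -> {
            try (InputStream in = source; PipedOutputStream a = out1; PipedOutputStream b = out2) {
                byte[] chunk = new byte[512];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    a.write(chunk, 0, n); // blocks while the first pipe's buffer is full
                    b.write(chunk, 0, n); // blocks while the second pipe's buffer is full
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        copier.start();
        return copier;
    }

    // Reads the same text through two independent readers, alternating lines.
    static List<List<String>> readBoth(String text) throws IOException {
        PipedInputStream in1 = new PipedInputStream();
        PipedInputStream in2 = new PipedInputStream();
        startCopy(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                new PipedOutputStream(in1), new PipedOutputStream(in2));
        List<String> seen1 = new ArrayList<>();
        List<String> seen2 = new ArrayList<>();
        try (BufferedReader br1 = new BufferedReader(new InputStreamReader(in1, StandardCharsets.UTF_8));
             BufferedReader br2 = new BufferedReader(new InputStreamReader(in2, StandardCharsets.UTF_8))) {
            String l1 = br1.readLine();
            String l2 = br2.readLine();
            while (l1 != null || l2 != null) {
                if (l1 != null) { seen1.add(l1); l1 = br1.readLine(); }
                if (l2 != null) { seen2.add(l2); l2 = br2.readLine(); }
            }
        }
        return Arrays.asList(seen1, seen2);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readBoth("line1\nline2\nline3"));
    }
}
```

A PipedInputStream has a small default buffer (1024 bytes), so the copy thread stalls as soon as one reader falls that far behind. With larger inputs, each reader should therefore run in its own thread rather than being drained one after the other.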

While looking for a way to have an OutputStream send bytes to two or more different InputStreams, I found this thread.
Unfortunately, the closest solution pointed to PipedOutputStream and PipedInputStream, which only connect one-to-one.
So I decided to write a PipedOutputStream extension.
Here it is. The example is in the "main" method of PipedOutputStreamEx.
/**
* Extension of {@link PipedOutputStream} that can be connected to more than one {@link PipedInputStream}
*/
public class PipedOutputStreamEx extends PipedOutputStream {
/**
*
*/
public PipedOutputStreamEx() {
// TODO Auto-generated constructor stub
}
/* REMIND: identification of the read and write sides needs to be
more sophisticated. Either using thread groups (but what about
pipes within a thread?) or using finalization (but it may be a
long time until the next GC). */
private PipedInputStreamEx[] sinks=null;
public synchronized void connect(PipedInputStreamEx... pIns) throws IOException {
for (PipedInputStreamEx snk : pIns) {
if (snk == null) {
throw new NullPointerException();
} else if (sinks != null || snk.connected) {
throw new IOException("Already connected");
}
snk.in = -1;
snk.out = 0;
snk.connected = true;
}
this.sinks = pIns;
}
/**
* Writes the specified <code>byte</code> to the piped output stream.
* <p>
* Implements the <code>write</code> method of <code>OutputStream</code>.
*
* @param b the <code>byte</code> to be written.
* @exception IOException if the pipe is <a href=#BROKEN> broken</a>,
* {@link #connect(java.io.PipedInputStream) unconnected},
* closed, or if an I/O error occurs.
*/
public void write(int b) throws IOException {
if (this.sinks == null) {
throw new IOException("Pipe(s) not connected");
}
for (PipedInputStreamEx sink : this.sinks) {
sink.receive(b);
}
}
/**
* Writes <code>len</code> bytes from the specified byte array
* starting at offset <code>off</code> to this piped output stream.
* This method blocks until all the bytes are written to the output
* stream.
*
* @param b the data.
* @param off the start offset in the data.
* @param len the number of bytes to write.
* @exception IOException if the pipe is <a href=#BROKEN> broken</a>,
* {@link #connect(java.io.PipedInputStream) unconnected},
* closed, or if an I/O error occurs.
*/
public void write(byte b[], int off, int len) throws IOException {
if (sinks == null) {
throw new IOException("Pipe not connected");
} else if (b == null) {
throw new NullPointerException();
} else if ((off < 0) || (off > b.length) || (len < 0) ||
((off + len) > b.length) || ((off + len) < 0)) {
throw new IndexOutOfBoundsException();
} else if (len == 0) {
return;
}
for (PipedInputStreamEx sink : this.sinks) {
sink.receive(b, off, len);
}
}
/**
* Flushes this output stream and forces any buffered output bytes
* to be written out.
* This will notify any readers that bytes are waiting in the pipe.
*
* @exception IOException if an I/O error occurs.
*/
public synchronized void flush() throws IOException {
if (sinks != null) {
for (PipedInputStreamEx sink : this.sinks) {
synchronized (sink) {
sink.notifyAll();
}
}
}
}
/**
* Closes this piped output stream and releases any system resources
* associated with this stream. This stream may no longer be used for
* writing bytes.
*
* @exception IOException if an I/O error occurs.
*/
public void close() throws IOException {
if (sinks != null) {
for (PipedInputStreamEx sink : this.sinks) {
sink.receivedLast();
}
}
}
/**
* Test of this {@link PipedOutputStream} extension
* @param args
* @throws InterruptedException
* @throws IOException
*/
public static void main(String[] args) throws InterruptedException, IOException {
final PipedOutputStreamEx pOut = new PipedOutputStreamEx();
final PipedInputStreamEx pInHash = new PipedInputStreamEx();
final PipedInputStreamEx pInConsole = new PipedInputStreamEx();
pOut.connect(pInHash, pInConsole);
Thread escreve = new Thread("Writing") {
@Override
public void run() {
String[] paraGravar = new String[]{
"linha1 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha2 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha3 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha4 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha5 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha6 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha7 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha8 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha9 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha10 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha11 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha12 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha13 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha14 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha15 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha16 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha17 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha18 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha19 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
, "linha20 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n"
};
for (String s :paraGravar) {
try {
pOut.write(s.getBytes("ISO-8859-1") );
Thread.sleep(100);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
try {
pOut.close();
} catch (IOException e) {
e.printStackTrace();
}
}
};
Thread le1 = new Thread("Reader1 - hash") {
@Override
public void run() {
try {
System.out.println("HASH: "+HashUtil.getHashCRC(pInHash,true));
} catch (Exception e) {
e.printStackTrace();
}
}
};
Thread le2 = new Thread("Reader2 - writes to the console") {
@Override
public void run() {
BufferedReader bIn = new BufferedReader(new InputStreamReader(pInConsole));
String s;
try {
while ( (s=bIn.readLine())!=null) {
Thread.sleep(700); // test simulating a slow reader...
System.out.println(s);
}
} catch (IOException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
};
escreve.start();
le1.start();
le2.start();
escreve.join();
le1.join();
le2.join();
pInHash.close();
pInConsole.close();
}
}
Here is the PipedInputStreamEx code. Unfortunately, I had to copy all of the JDK code to get access to the "connected", "in" and "out" fields.
/**
* Extension of {@link PipedInputStream} that allows more than one of these to be connected to a {@link PipedOutputStream}
* Because the ancestor class has package-private members, we had to copy the inherited code :/
*/
public class PipedInputStreamEx extends PipedInputStream {
@Override
public void connect(PipedOutputStream src) throws IOException {
throw new IOException("connect using PipedOutputStreamEx.connect()");
}
//----------------------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------------------
//----------------------------------------------------------------------------------------------------------
//--------- START of the inherited class code (some methods commented out...)-------------------------------
//----------------------------------------------------------------------------------------------------------
boolean closedByWriter = false;
volatile boolean closedByReader = false;
boolean connected = false;
/* REMIND: identification of the read and write sides needs to be
more sophisticated. Either using thread groups (but what about
pipes within a thread?) or using finalization (but it may be a
long time until the next GC). */
Thread readSide;
Thread writeSide;
private static final int DEFAULT_PIPE_SIZE = 1024;
/**
* The default size of the pipe's circular input buffer.
* @since JDK1.1
*/
// This used to be a constant before the pipe size was allowed
// to change. This field will continue to be maintained
// for backward compatibility.
protected static final int PIPE_SIZE = DEFAULT_PIPE_SIZE;
/**
* The circular buffer into which incoming data is placed.
* @since JDK1.1
*/
protected byte buffer[];
/**
* The index of the position in the circular buffer at which the
* next byte of data will be stored when received from the connected
* piped output stream. <code>in<0</code> implies the buffer is empty,
* <code>in==out</code> implies the buffer is full
* @since JDK1.1
*/
protected int in = -1;
/**
* The index of the position in the circular buffer at which the next
* byte of data will be read by this piped input stream.
* @since JDK1.1
*/
protected int out = 0;
// /**
// * Creates a <code>PipedInputStream</code> so
// * that it is connected to the piped output
// * stream <code>src</code>. Data bytes written
// * to <code>src</code> will then be available
// * as input from this stream.
// *
// * @param src the stream to connect to.
// * @exception IOException if an I/O error occurs.
// */
// public PipedInputStream(PipedOutputStream src) throws IOException {
// this(src, DEFAULT_PIPE_SIZE);
// }
//
// /**
// * Creates a <code>PipedInputStream</code> so that it is
// * connected to the piped output stream
// * <code>src</code> and uses the specified pipe size for
// * the pipe's buffer.
// * Data bytes written to <code>src</code> will then
// * be available as input from this stream.
// *
// * @param src the stream to connect to.
// * @param pipeSize the size of the pipe's buffer.
// * @exception IOException if an I/O error occurs.
// * @exception IllegalArgumentException if <code>pipeSize <= 0</code>.
// * @since 1.6
// */
// public PipedInputStream(PipedOutputStream src, int pipeSize)
// throws IOException {
// initPipe(pipeSize);
// connect(src);
// }
/**
* Creates a <code>PipedInputStream</code> so
* that it is not yet {@linkplain #connect(java.io.PipedOutputStream)
* connected}.
* It must be {@linkplain java.io.PipedOutputStream#connect(
* java.io.PipedInputStream) connected} to a
* <code>PipedOutputStream</code> before being used.
*/
public PipedInputStreamEx() {
initPipe(DEFAULT_PIPE_SIZE);
}
/**
* Creates a <code>PipedInputStream</code> so that it is not yet
* {@linkplain #connect(java.io.PipedOutputStream) connected} and
* uses the specified pipe size for the pipe's buffer.
* It must be {@linkplain java.io.PipedOutputStream#connect(
* java.io.PipedInputStream)
* connected} to a <code>PipedOutputStream</code> before being used.
*
* @param pipeSize the size of the pipe's buffer.
* @exception IllegalArgumentException if <code>pipeSize <= 0</code>.
* @since 1.6
*/
public PipedInputStreamEx(int pipeSize) {
initPipe(pipeSize);
}
private void initPipe(int pipeSize) {
if (pipeSize <= 0) {
throw new IllegalArgumentException("Pipe Size <= 0");
}
buffer = new byte[pipeSize];
}
// /**
// * Causes this piped input stream to be connected
// * to the piped output stream <code>src</code>.
// * If this object is already connected to some
// * other piped output stream, an <code>IOException</code>
// * is thrown.
// * <p>
// * If <code>src</code> is an
// * unconnected piped output stream and <code>snk</code>
// * is an unconnected piped input stream, they
// * may be connected by either the call:
// * <p>
// * <pre><code>snk.connect(src)</code> </pre>
// * <p>
// * or the call:
// * <p>
// * <pre><code>src.connect(snk)</code> </pre>
// * <p>
// * The two
// * calls have the same effect.
// *
// * @param src The piped output stream to connect to.
// * @exception IOException if an I/O error occurs.
// */
// public void connect(PipedOutputStream src) throws IOException {
// src.connect(this);
// }
/**
* Receives a byte of data. This method will block if no input is
* available.
* @param b the byte being received
* @exception IOException If the pipe is <a href=#BROKEN> <code>broken</code></a>,
* {@link #connect(java.io.PipedOutputStream) unconnected},
* closed, or if an I/O error occurs.
* @since JDK1.1
*/
protected synchronized void receive(int b) throws IOException {
checkStateForReceive();
writeSide = Thread.currentThread();
if (in == out)
awaitSpace();
if (in < 0) {
in = 0;
out = 0;
}
buffer[in++] = (byte)(b & 0xFF);
if (in >= buffer.length) {
in = 0;
}
}
/**
* Receives data into an array of bytes. This method will
* block until some input is available.
* @param b the buffer into which the data is received
* @param off the start offset of the data
* @param len the maximum number of bytes received
* @exception IOException If the pipe is <a href=#BROKEN> broken</a>,
* {@link #connect(java.io.PipedOutputStream) unconnected},
* closed, or if an I/O error occurs.
*/
synchronized void receive(byte b[], int off, int len) throws IOException {
checkStateForReceive();
writeSide = Thread.currentThread();
int bytesToTransfer = len;
while (bytesToTransfer > 0) {
if (in == out)
awaitSpace();
int nextTransferAmount = 0;
if (out < in) {
nextTransferAmount = buffer.length - in;
} else if (in < out) {
if (in == -1) {
in = out = 0;
nextTransferAmount = buffer.length - in;
} else {
nextTransferAmount = out - in;
}
}
if (nextTransferAmount > bytesToTransfer)
nextTransferAmount = bytesToTransfer;
assert(nextTransferAmount > 0);
System.arraycopy(b, off, buffer, in, nextTransferAmount);
bytesToTransfer -= nextTransferAmount;
off += nextTransferAmount;
in += nextTransferAmount;
if (in >= buffer.length) {
in = 0;
}
}
}
private void checkStateForReceive() throws IOException {
if (!connected) {
throw new IOException("Pipe not connected");
} else if (closedByWriter || closedByReader) {
throw new IOException("Pipe closed");
} else if (readSide != null && !readSide.isAlive()) {
throw new IOException("Read end dead");
}
}
private void awaitSpace() throws IOException {
while (in == out) {
checkStateForReceive();
/* full: kick any waiting readers */
notifyAll();
try {
wait(1000);
} catch (InterruptedException ex) {
throw new java.io.InterruptedIOException();
}
}
}
/**
* Notifies all waiting threads that the last byte of data has been
* received.
*/
synchronized void receivedLast() {
closedByWriter = true;
notifyAll();
}
/**
* Reads the next byte of data from this piped input stream. The
* value byte is returned as an <code>int</code> in the range
* <code>0</code> to <code>255</code>.
* This method blocks until input data is available, the end of the
* stream is detected, or an exception is thrown.
*
* @return the next byte of data, or <code>-1</code> if the end of the
* stream is reached.
* @exception IOException if the pipe is
* {@link #connect(java.io.PipedOutputStream) unconnected},
* <a href=#BROKEN> <code>broken</code></a>, closed,
* or if an I/O error occurs.
*/
public synchronized int read() throws IOException {
if (!connected) {
throw new IOException("Pipe not connected");
} else if (closedByReader) {
throw new IOException("Pipe closed");
} else if (writeSide != null && !writeSide.isAlive()
&& !closedByWriter && (in < 0)) {
throw new IOException("Write end dead");
}
readSide = Thread.currentThread();
int trials = 2;
while (in < 0) {
if (closedByWriter) {
/* closed by writer, return EOF */
return -1;
}
if ((writeSide != null) && (!writeSide.isAlive()) && (--trials < 0)) {
throw new IOException("Pipe broken");
}
/* might be a writer waiting */
notifyAll();
try {
wait(1000);
} catch (InterruptedException ex) {
throw new java.io.InterruptedIOException();
}
}
int ret = buffer[out++] & 0xFF;
if (out >= buffer.length) {
out = 0;
}
if (in == out) {
/* now empty */
in = -1;
}
return ret;
}
/**
* Reads up to <code>len</code> bytes of data from this piped input
* stream into an array of bytes. Less than <code>len</code> bytes
* will be read if the end of the data stream is reached or if
* <code>len</code> exceeds the pipe's buffer size.
* If <code>len </code> is zero, then no bytes are read and 0 is returned;
* otherwise, the method blocks until at least 1 byte of input is
* available, end of the stream has been detected, or an exception is
* thrown.
*
* @param b the buffer into which the data is read.
* @param off the start offset in the destination array <code>b</code>
* @param len the maximum number of bytes read.
* @return the total number of bytes read into the buffer, or
* <code>-1</code> if there is no more data because the end of
* the stream has been reached.
* @exception NullPointerException If <code>b</code> is <code>null</code>.
* @exception IndexOutOfBoundsException If <code>off</code> is negative,
* <code>len</code> is negative, or <code>len</code> is greater than
* <code>b.length - off</code>
* @exception IOException if the pipe is <a href=#BROKEN> <code>broken</code></a>,
* {@link #connect(java.io.PipedOutputStream) unconnected},
* closed, or if an I/O error occurs.
*/
public synchronized int read(byte b[], int off, int len) throws IOException {
if (b == null) {
throw new NullPointerException();
} else if (off < 0 || len < 0 || len > b.length - off) {
throw new IndexOutOfBoundsException();
} else if (len == 0) {
return 0;
}
/* possibly wait on the first character */
int c = read();
if (c < 0) {
return -1;
}
b[off] = (byte) c;
int rlen = 1;
while ((in >= 0) && (len > 1)) {
int available;
if (in > out) {
available = Math.min((buffer.length - out), (in - out));
} else {
available = buffer.length - out;
}
// A byte is read beforehand outside the loop
if (available > (len - 1)) {
available = len - 1;
}
System.arraycopy(buffer, out, b, off + rlen, available);
out += available;
rlen += available;
len -= available;
if (out >= buffer.length) {
out = 0;
}
if (in == out) {
/* now empty */
in = -1;
}
}
return rlen;
}
/**
* Returns the number of bytes that can be read from this input
* stream without blocking.
*
* @return the number of bytes that can be read from this input stream
* without blocking, or {@code 0} if this input stream has been
* closed by invoking its {@link #close()} method, or if the pipe
* is {@link #connect(java.io.PipedOutputStream) unconnected}, or
* <a href=#BROKEN> <code>broken</code></a>.
*
* @exception IOException if an I/O error occurs.
* @since JDK1.0.2
*/
public synchronized int available() throws IOException {
if(in < 0)
return 0;
else if(in == out)
return buffer.length;
else if (in > out)
return in - out;
else
return in + buffer.length - out;
}
/**
* Closes this piped input stream and releases any system resources
* associated with the stream.
*
* @exception IOException if an I/O error occurs.
*/
public void close() throws IOException {
closedByReader = true;
synchronized (this) {
in = -1;
}
}
//----------------------------------------------------------------------------------------------------------
//--------- END of the inherited class code ----------------------------------------------------------------
//----------------------------------------------------------------------------------------------------------
}

In Java 8, for anyone now looking to read large files line by line:
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}
