Fast read text file character-by-character(java)

Sorry for my English. I am trying to read a large text file character by character (without using readLine()) as fast as possible, but I have not managed to make it fast enough. My code:
for(int i = 0; (i = textReader.read()) != -1; ) {
char character = (char) i;
}
This reads a 1 GB text file in 56666 ms. How can I read it faster?
Update:
This method reads the 1 GB file in 28833 ms:
FileInputStream fIn = null;
FileChannel fChan = null;
ByteBuffer mBuf;
int count;
try {
fIn = new FileInputStream(textReader);
fChan = fIn.getChannel();
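mBuf = ByteBuffer.allocate(128);
do {
mBuf.clear(); // reset position and limit before each read
count = fChan.read(mBuf);
if(count != -1) {
mBuf.flip(); // switch the buffer from being filled to being read
for(int i = 0; i < count; i++) {
char c = (char)mBuf.get();
}
}
} while(count != -1);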
}catch(Exception e) {
}

The fastest way to read input is to use a buffer. Here is an example of a class that has an internal buffer.
class Parser
{
final private int BUFFER_SIZE = 1 << 16;
private DataInputStream din;
private byte[] buffer;
private int bufferPointer, bytesRead;
public Parser(InputStream in)
{
din = new DataInputStream(in);
buffer = new byte[BUFFER_SIZE];
bufferPointer = bytesRead = 0;
}
public int nextInt() throws Exception
{
int ret = 0;
byte c = read();
while (c <= ' ') c = read();
//boolean neg = c == '-';
//if (neg) c = read();
do
{
ret = ret * 10 + c - '0';
c = read();
} while (c > ' ');
//if (neg) return -ret;
return ret;
}
private void fillBuffer() throws Exception
{
bytesRead = din.read(buffer, bufferPointer = 0, BUFFER_SIZE);
if (bytesRead == -1) buffer[0] = -1;
}
public byte read() throws Exception // public so that the usage example below compiles
{
if (bufferPointer == bytesRead) fillBuffer();
return buffer[bufferPointer++];
}
}
This parser has a nextInt() function; if you want the next character, you can call read().
This is the fastest way to read from a file (as far as I know).
You would initialize this parser like this:
Parser p = new Parser(new FileInputStream("text.txt"));
int c;
while((c = p.read()) != -1)
System.out.print((char)c);
This code reads 250 MB in 7782 ms.
Disclaimer:
the code is not mine, it has been posted as a solution to a problem on CodeChef by the user 'Kamalakannan CM'
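A minimal sketch of the same buffering idea, with read() returning an int so that end of input (-1) cannot be confused with the byte value 0xFF. The class name ByteReader and the 64 KiB buffer size are assumptions chosen for illustration; they are not part of the answer above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: pulls bytes from an InputStream in 64 KiB chunks.
class ByteReader {
    private static final int BUFFER_SIZE = 1 << 16;
    private final InputStream in;
    private final byte[] buffer = new byte[BUFFER_SIZE];
    private int pos = 0;   // index of the next unread byte in buffer
    private int limit = 0; // number of valid bytes in buffer

    ByteReader(InputStream in) {
        this.in = in;
    }

    // Returns the next byte as a value in 0..255, or -1 once the stream is exhausted.
    int read() throws IOException {
        if (pos == limit) {
            limit = in.read(buffer, 0, BUFFER_SIZE);
            pos = 0;
            if (limit <= 0) {
                limit = 0; // keep pos == limit so later calls also return -1
                return -1;
            }
        }
        return buffer[pos++] & 0xFF;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        ByteReader r = new ByteReader(new ByteArrayInputStream(data));
        int c;
        while ((c = r.read()) != -1) {
            System.out.print((char) c);
        }
        System.out.println();
    }
}
```

Casting each byte to char, as in main, is only meaningful for single-byte encodings such as ASCII or ISO-8859-1; multi-byte text needs a Reader with an explicit charset.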

I would use BufferedReader, which buffers its reads. A short sample:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.nio.CharBuffer;
public class Main {
public static void main(String... args) {
try (FileReader fr = new FileReader("a.txt")) {
try (BufferedReader reader = new BufferedReader(fr)) {
CharBuffer charBuffer = CharBuffer.allocate(8192);
reader.read(charBuffer);
} catch (IOException e) {
e.printStackTrace();
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
The single-argument constructor uses a default buffer size of 8192 characters. If you want a different buffer size, you can use the BufferedReader(Reader in, int sz) constructor. Alternatively, you can read into an array buffer:
....
char[] buffer = new char[255];
reader.read(buffer);
....
or read one character at a time:
int ch = reader.read(); // "char" is a reserved word and cannot be used as a variable name
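The array variant above can be turned into a complete loop. The following is a sketch only: the class name CharChunks, the method count, and the 8192-char chunk size are assumptions for illustration. It visits every character of the input while calling read once per chunk instead of once per character.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CharChunks {
    // Counts occurrences of target by reading the source in char[] chunks.
    static long count(Reader source, char target) throws IOException {
        long hits = 0;
        char[] chunk = new char[8192];
        try (BufferedReader reader = new BufferedReader(source)) {
            int n;
            // read returns the number of chars placed in chunk, or -1 at end of input
            while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (chunk[i] == target) {
                        hits++;
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(count(new StringReader("banana"), 'a')); // prints 3
    }
}
```

For a file, the source would typically be a FileReader or the result of Files.newBufferedReader with an explicit charset.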

Related

How to convert Reader to InputStream in java

I need to convert a Reader object into an InputStream. My current solution is below, but since it will handle large chunks of data, I am concerned that it will increase memory usage drastically.
private static InputStream getInputStream(final Reader reader) {
char[] buffer = new char[10240];
StringBuilder builder = new StringBuilder();
int charCount;
try {
while ((charCount = reader.read(buffer, 0, buffer.length)) != -1) {
builder.append(buffer, 0, charCount);
}
reader.close();
} catch (final IOException e) {
e.printStackTrace();
}
return new ByteArrayInputStream(builder.toString().getBytes(StandardCharsets.UTF_8));
}
Since I use a StringBuilder, this keeps the full content of the Reader in memory, which I want to avoid. Is there a way to pipe the Reader instead? Any help would be highly appreciated.

Using the Apache Commons IO library, you can do this conversion in one line:
//import org.apache.commons.io.input.ReaderInputStream;
InputStream inputStream = new ReaderInputStream(reader, StandardCharsets.UTF_8);
You can read the documentation for this class at https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/ReaderInputStream.html
It might be worth trying this to see if it solves the memory issue too.

First: this is a rare requirement; usually it is the other way around, or there is a FileChannel, so one can use a ByteBuffer.
A PipedInputStream would be possible, with a second thread writing to a PipedOutputStream. However, that is not needed.
A Reader gives chars. Unicode code points are derived from either one or two chars (the latter a surrogate pair).
/**
* InputStream that supplies the text of a Reader as UTF-8 bytes.
*/
public class ReaderInputStream extends InputStream {
private final Reader reader;
private boolean eof;
private int byteCount;
private byte[] bytes = new byte[6];
public ReaderInputStream(Reader reader) {
this.reader = reader;
}
@Override
public int read() throws IOException {
if (byteCount > 0) {
int c = bytes[0] & 0xFF; // mask so that bytes >= 0x80 are not returned as negative values
--byteCount;
for (int i = 0; i < byteCount; ++i) {
bytes[i] = bytes[i + 1];
}
return c;
}
if (eof) {
return -1;
}
int c = reader.read();
if (c == -1) {
eof = true;
return -1;
}
char ch = (char) c;
String s;
if (Character.isHighSurrogate(ch)) {
c = reader.read();
if (c == -1) {
// Error, low surrogate expected.
eof = true;
//return -1;
throw new IOException("Expected a low surrogate char instead of EOF");
}
char ch2 = (char) c;
if (!Character.isLowSurrogate(ch2)) {
throw new IOException("Expected a low surrogate char");
}
s = new String(new char [] {ch, ch2});
} else {
s = Character.toString(ch);
}
byte[] bs = s.getBytes(StandardCharsets.UTF_8);
byteCount = bs.length;
System.arraycopy(bs, 0, bytes, 0, byteCount);
return read();
}
}
Path source = Paths.get("...");
Path target = Paths.get("...");
try (Reader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
InputStream in = new ReaderInputStream(reader)) {
Files.copy(in, target);
}
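For comparison, the piped alternative mentioned at the start of this answer could look roughly like the sketch below. The class name PipedReaderBridge, the method name toInputStream, and the buffer sizes are assumptions for illustration. A second thread encodes the chars to UTF-8 and writes them into the pipe while the caller reads bytes from the other end, so only the pipe buffer is held in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class PipedReaderBridge {
    // Returns an InputStream that yields the reader's text encoded as UTF-8.
    static InputStream toInputStream(Reader reader) throws IOException {
        PipedInputStream in = new PipedInputStream(8192);
        PipedOutputStream out = new PipedOutputStream(in);
        Thread pump = new Thread(() -> {
            // Closing the writer closes the pipe, which signals end of stream to the consumer.
            try (Reader r = reader;
                 Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                char[] chunk = new char[4096];
                int n;
                while ((n = r.read(chunk)) != -1) {
                    w.write(chunk, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: a real implementation should hand the failure to the consumer.
                e.printStackTrace();
            }
        });
        pump.setDaemon(true);
        pump.start();
        return in;
    }
}
```

The cost of this design is the extra thread and the need to propagate its errors, which is why the single-threaded stream above is usually the simpler choice.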

Java compressing a .txt file

I am currently trying to write a program that reads a compressed file bit by bit and converts the bits into strings of 0s and 1s.
The school provided a class and a method that reads one bit and converts it to a char. So to read one bit and convert it to a char, all I need to write is:
char oneBit = inputFile.readBit();
in my main method.
How do I get my program to read every bit in the compressed file and convert it to a char using the readBit method? And how would I combine all the char 0s and 1s into strings of 0s and 1s?
The readBit method:
public char readBit() {
char c = 0;
if (bitsRead == 8)
try {
if (in.available() > 0) { // We have not reached the end of the
// file
buffer = (char) in.read();
bitsRead = 0;
} else
return 0;
} catch (IOException e) {
System.out.println("Error reading from file ");
System.exit(0); // Terminate the program
}
// return next bit from the buffer; bit is converted first to char
if ((buffer & 128) == 0)
c = '0';
else
c = '1';
buffer = (char) (buffer << 1);
++bitsRead;
return c;
}
where in is the input file.

Try using this resource
Sample implementation.
public class BitAnswer {
final static int RADIX = 10;
public static void main(String[] args) {
BitInputStream bis = new BitInputStream("<file_name>");
int result = bis.readBit();
while( result != -1 ) {
System.out.print(Character.forDigit(result, RADIX));
result = bis.readBit();
}
System.out.println("\nAll bits read!");
}
}
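Since neither the school's class nor BitInputStream is available here, the following sketch swaps in plain bit masking on an InputStream to show the overall loop: read until the end of input and append one '0' or '1' char per bit to a StringBuilder. The class name BitsToString and the method toBitString are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class BitsToString {
    // Appends eight '0'/'1' chars per input byte, most significant bit first.
    static String toBitString(InputStream in) throws IOException {
        StringBuilder bits = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            for (int mask = 0x80; mask != 0; mask >>= 1) {
                bits.append((b & mask) == 0 ? '0' : '1');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xA5};
        System.out.println(toBitString(new ByteArrayInputStream(data))); // prints 10100101
    }
}
```

With the readBit method from the question, which returns the char value 0 at the end of the file, the same pattern would be a loop that appends each returned char to the StringBuilder until readBit returns 0.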

public void compress(){
String inputFileName = "c://tmp//content.txt";
String outputFileName = "c://tmp//compressedContent.txt";
FileOutputStream fos = null;
StringBuilder sb = new StringBuilder();
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
OutputStream outputStream= null;
try (BufferedReader br = new BufferedReader(new FileReader(new File(inputFileName)))) {
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
outputStream = new DeflaterOutputStream(byteArrayOutputStream); // GZIPOutputStream(byteArrayOutputStream) - use if you want unix .gz format
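outputStream.write(sb.toString().getBytes());
outputStream.close(); // finish the deflater so that all compressed bytes reach byteArrayOutputStream
String compressedText = Base64.getEncoder().encodeToString(byteArrayOutputStream.toByteArray());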
fos=new FileOutputStream(outputFileName);
fos.write(compressedText.getBytes());
System.out.println("done compress");
} catch (Exception e) {
e.printStackTrace();
}finally{
try{
if (outputStream != null) {
outputStream.close();
}
if (byteArrayOutputStream != null) {
byteArrayOutputStream.close();
}
if(fos != null){
fos.close();
}
}catch (Exception e) {
e.printStackTrace();
}
System.out.println("closed streams !!! ");
}
}

Given InputStream replace character and produce OutputStream

I have a lot of massive files that I need to convert to CSV by replacing certain characters.
I am looking for a reliable approach that, given an InputStream, produces an OutputStream with every character c1 replaced by c2.
The trick is to read and write at the same time, because I can't fit a whole file in memory.
Do I need to run it in a separate thread if I want to read and write at the same time?
Thanks a lot for your advice.

To copy data from an input stream to an output stream, you write the data as you read it, either a byte (or character) at a time or a line at a time.
Here is an example that reads in a file converting all 'x' characters to 'y'.
BufferedInputStream in = new BufferedInputStream(new FileInputStream("input.dat"));
BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("output.dat"));
int ch;
while((ch = in.read()) != -1) {
if (ch == 'x') ch = 'y';
out.write(ch);
}
out.close();
in.close();
Or, if you can use a Reader and process a line at a time, you can use this approach:
BufferedReader reader = new BufferedReader(new FileReader("input.dat"));
PrintWriter writer = new PrintWriter(
new BufferedOutputStream(new FileOutputStream("output.dat")));
String str;
while ((str = reader.readLine()) != null) {
str = str.replace('x', 'y'); // replace character at a time
str = str.replace("abc", "ABC"); // replace string sequence
writer.println(str);
}
writer.close();
reader.close();
BufferedInputStream and BufferedReader read ahead and keep 8K of characters in a buffer for performance. Very large files can be processed while only keeping 8K of characters in memory at a time.
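The byte-at-a-time loop above can also be written as a reusable method that works on whole chunks, which answers the threading question: no second thread is needed, because each chunk is written before the next one is read. This is a sketch under assumptions: the names ByteReplacer and replace are made up for illustration, and replacing single bytes is only safe when both characters are encoded as one byte each (for example ASCII characters in an ASCII-compatible encoding).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class ByteReplacer {
    // Copies in to out, replacing every byte equal to from with to.
    // Only one 8 KiB chunk is held in memory at any time.
    static void replace(InputStream in, OutputStream out, byte from, byte to) throws IOException {
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                if (chunk[i] == from) {
                    chunk[i] = to;
                }
            }
            out.write(chunk, 0, n);
        }
        out.flush();
    }
}
```

A call such as replace(new FileInputStream("input.dat"), new FileOutputStream("output.dat"), (byte) 'x', (byte) 'y') would mirror the first example; the caller remains responsible for closing both streams.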

FileWriter writer = new FileWriter("Report.csv");
BufferedReader reader = new BufferedReader(new InputStreamReader(YOURSOURCE, Charsets.UTF_8));
String line;
while ((line = reader.readLine()) != null) {
line = line.replace(c1, c2); // String is immutable, so keep the result; c1 and c2 are the chars to swap
writer.append(line);
writer.append('\n');
}
writer.flush();
writer.close();

You can find a related answer here: Filter (search and replace) array of bytes in an InputStream
I took @aioobe's answer in that thread and built a replacing input stream module in Java, which you can find in my GitHub gist: https://gist.github.com/lhr0909/e6ac2d6dd6752871eb57c4b083799947
Putting the source code here as well:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Queue;
/**
* Created by simon on 8/29/17.
*/
public class ReplacingInputStream extends FilterInputStream {
private Queue<Integer> inQueue, outQueue;
private final byte[] search, replacement;
public ReplacingInputStream(InputStream in, String search, String replacement) {
super(in);
this.inQueue = new LinkedList<>();
this.outQueue = new LinkedList<>();
this.search = search.getBytes();
this.replacement = replacement.getBytes();
}
private boolean isMatchFound() {
Iterator<Integer> iterator = inQueue.iterator();
for (byte b : search) {
if (!iterator.hasNext() || (b & 0xFF) != iterator.next()) { // queue holds values 0..255, so compare unsigned
return false;
}
}
return true;
}
private void readAhead() throws IOException {
// Work up some look-ahead.
while (inQueue.size() < search.length) {
int next = super.read();
inQueue.offer(next);
if (next == -1) {
break;
}
}
}
@Override
public int read() throws IOException {
// Next byte already determined.
while (outQueue.isEmpty()) {
readAhead();
if (isMatchFound()) {
for (byte a : search) {
inQueue.remove();
}
for (byte b : replacement) {
outQueue.offer(b & 0xFF); // keep values in 0..255, as InputStream.read() requires
}
} else {
outQueue.add(inQueue.remove());
}
}
return outQueue.remove();
}
@Override
public int read(byte b[]) throws IOException {
return read(b, 0, b.length);
}
// copied straight from the InputStream implementation, just needed to use `read()` from this class
@Override
public int read(byte b[], int off, int len) throws IOException {
if (b == null) {
throw new NullPointerException();
} else if (off < 0 || len < 0 || len > b.length - off) {
throw new IndexOutOfBoundsException();
} else if (len == 0) {
return 0;
}
int c = read();
if (c == -1) {
return -1;
}
b[off] = (byte)c;
int i = 1;
try {
for (; i < len ; i++) {
c = read();
if (c == -1) {
break;
}
b[off + i] = (byte)c;
}
} catch (IOException ee) {
}
return i;
}
}

Read large file error "outofmemoryerror"(java)

Sorry for my English. I want to read a large file, but an OutOfMemoryError occurs while reading it. I do not understand how to work with memory in the application. The following code does not work:
try {
StringBuilder fileData = new StringBuilder(1000);
BufferedReader reader = new BufferedReader(new FileReader(file));
char[] buf = new char[8192];
int bytesread = 0,
bytesBuffered = 0;
while( (bytesread = reader.read( buf )) > -1 ) {
String readData = String.valueOf(buf, 0, bytesread);
bytesBuffered += bytesread;
fileData.append(readData); // the error occurs here
if (bytesBuffered > 1024 * 1024) {
bytesBuffered = 0;
}
}
System.out.println(fileData.toString().toCharArray());
} finally {
}

You need to preallocate a large buffer to avoid reallocation.
File file = ...;
StringBuilder fileData = new StringBuilder((int) file.length()); // File has length(), not size(); assumes the file is smaller than 2 GB
And run with a larger heap size:
java -Xmx2G
==== update
A while loop using a buffer doesn't need much memory to run. Treat the input as a stream and match your search string against it. It's a really simple state machine. If you need to search for multiple words, you can find a trie implementation (with stream support) for that.
// the match state model
...xxxxxxabxxxxxaxxxxxabcdexxxx...
ab a abcd
File file = new File("path_to_your_file");
String yourSearchWord = "abcd";
int matchIndex = 0;
boolean matchPrefix = false;
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
int chr;
while ((chr = reader.read()) != -1) {
if (matchPrefix == false) {
char searchChar = yourSearchWord.charAt(0);
if (chr == searchChar) {
matchPrefix = true;
matchIndex = 0;
}
} else {
char searchChar = yourSearchWord.charAt(++matchIndex);
if (chr == searchChar) {
if (matchIndex == yourSearchWord.length() - 1) {
// match!!
System.out.println("match: " + matchIndex);
matchPrefix = false;
matchIndex = 0;
}
} else {
matchPrefix = false;
matchIndex = 0;
}
}
}
}
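The simple state machine above restarts from scratch after a mismatch, so it can miss a match that begins inside a failed partial match (for example "aab" in "aaab"). A possible refinement, sketched here with the Knuth-Morris-Pratt failure table, keeps the streaming behaviour while handling that case. The class name StreamSearch and the method indexOf are assumptions for illustration.

```java
import java.io.IOException;
import java.io.Reader;

class StreamSearch {
    // Returns the char offset of the first occurrence of word in the stream, or -1 if absent.
    static long indexOf(Reader reader, String word) throws IOException {
        if (word.isEmpty()) {
            return 0;
        }
        // fail[i] = length of the longest proper prefix of word that is also a suffix of word[0..i]
        int[] fail = new int[word.length()];
        for (int i = 1, k = 0; i < word.length(); i++) {
            while (k > 0 && word.charAt(i) != word.charAt(k)) {
                k = fail[k - 1];
            }
            if (word.charAt(i) == word.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        long offset = 0;  // number of chars consumed so far
        int matched = 0;  // number of chars of word currently matched
        int c;
        while ((c = reader.read()) != -1) {
            while (matched > 0 && c != word.charAt(matched)) {
                matched = fail[matched - 1]; // fall back instead of restarting at zero
            }
            if (c == word.charAt(matched)) {
                matched++;
            }
            offset++;
            if (matched == word.length()) {
                return offset - word.length();
            }
        }
        return -1;
    }
}
```

Passing a BufferedReader keeps the per-character read calls cheap, and memory use stays proportional to the length of the search word rather than the size of the file.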

Try this. It might be helpful:
try{
BufferedReader reader = new BufferedReader(new FileReader(file));
String txt = "";
while( (txt = reader.readLine()) != null){ // read() returns an int; readLine() returns null at end of file
System.out.println(txt);
}
}catch(Exception e){
System.out.println("Error : "+e.getMessage());
}

You should not hold such big files in memory, because you will run out of it, as you have seen. Since you use Java 7, you need to read the file manually as a stream and check the content on the fly. Otherwise you could use the Stream API of Java 8. This is just an example. It works, but keep in mind that the position of the found word could vary due to encoding issues, so this is not production code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class FileReader
{
private static String wordToFind = "SEARCHED_WORD";
private static File file = new File("YOUR_FILE");
private static int currentMatchingPosition;
private static int foundAtPosition = -1;
private static int charsRead;
public static void main(String[] args) throws IOException
{
try (FileInputStream fis = new FileInputStream(file))
{
System.out.println("Total size to read (in bytes) : " + fis.available());
int c;
while ((c = fis.read()) != -1)
{
charsRead++;
checkContent(c);
}
if (foundAtPosition > -1)
{
System.out.println("Found word at position: " + (foundAtPosition - wordToFind.length()));
}
else
{
System.out.println("Didn't find the word!");
}
}
catch (IOException e)
{
e.printStackTrace();
}
}
private static void checkContent(int c)
{
if (currentMatchingPosition >= wordToFind.length())
{
//already found....
return;
}
if (wordToFind.charAt(currentMatchingPosition) == (char)c)
{
foundAtPosition = charsRead;
currentMatchingPosition++;
}
else
{
currentMatchingPosition = 0;
foundAtPosition = -1;
}
}
}
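The Java 8 Stream API route mentioned above might look like the following sketch. The class name LineSearch and the method contains are assumptions for illustration. Files.lines reads the file lazily, one line at a time, so the whole file is never held in memory; the trade-offs are that a match spanning a line break is not found and that a single extremely long line is still loaded in full.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class LineSearch {
    // Returns true if any line of the file contains word.
    static boolean contains(Path file, String word) throws IOException {
        // try-with-resources closes the underlying file when the stream is done
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.anyMatch(line -> line.contains(word));
        }
    }
}
```

Because anyMatch short-circuits, reading stops at the first line that contains the word.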

Removing ASCII characters in a string with encoding

I have a byte array that is filled by a serial port event; the code is shown below:
private InputStream input = null;
......
......
public void SerialEvent(SerialEvent se){
if(se.getEventType() == SerialPortEvent.DATA_AVAILABLE){
int length = input.available();
if(length > 0){
byte[] array = new byte[length];
int numBytes = input.read(array);
String text = new String(array);
}
}
}
The variable text contains the characters below:
"\033[K", "\033[m", "\033[H2J", "\033[6;1H" ,"\033[?12l", "\033[?25h", "\033[5i", "\033[4i", "\033i" and similar types..
As of now, I use String.replace to remove all these characters from the string.
I have tried new String(array , 'CharSet'); //Tried with all CharSet options, but I wasn't able to remove them.
Is there any way to remove those characters without using the replace method?

I gave an unsatisfying answer; thanks to @OlegEstekhin for pointing that out.
As no one else has answered yet, and a solution is not a two-liner, here it goes.
Make a wrapping InputStream that throws away escape sequences. I have used a PushbackInputStream, so that a partially skipped sequence can still be pushed back to be read first. Here a FilterInputStream would suffice.
public class EscapeRemovingInputStream extends PushbackInputStream {
public static void main(String[] args) {
String s = "\u001B[kHello \u001B[H12JWorld!";
byte[] buf = s.getBytes(StandardCharsets.ISO_8859_1);
ByteArrayInputStream bais = new ByteArrayInputStream(buf);
EscapeRemovingInputStream bin = new EscapeRemovingInputStream(bais);
try (InputStreamReader in = new InputStreamReader(bin,
StandardCharsets.ISO_8859_1)) {
int c;
while ((c = in.read()) != -1) {
System.out.print((char) c);
}
System.out.println();
} catch (IOException ex) {
Logger.getLogger(EscapeRemovingInputStream.class.getName()).log(
Level.SEVERE, null, ex);
}
}
private static final Pattern ESCAPE_PATTERN = Pattern.compile(
"\u001B\\[(k|m|H\\d+J|\\d+;\\d+H|\\?\\d+\\w|\\d*i)"); // ';' separates the two numeric arguments, as in "\033[6;1H"
private static final int MAX_ESCAPE_LENGTH = 20;
private final byte[] escapeSequence = new byte[MAX_ESCAPE_LENGTH];
private int escapeLength = 0;
private boolean eof = false;
public EscapeRemovingInputStream(InputStream in) {
super(in, MAX_ESCAPE_LENGTH); // PushbackInputStream(in, size); this(...) would call a constructor that does not exist
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
for (int i = 0; i < len; ++i) {
int c = read();
if (c == -1) {
return i == 0 ? -1 : i;
}
b[off + i] = (byte) c;
}
return len;
}
@Override
public int read() throws IOException {
int c = eof ? -1 : super.read();
if (c == -1) { // Throw away a trailing half escape sequence.
eof = true;
return c;
}
if (escapeLength == 0 && c != 0x1B) {
return c;
} else {
escapeSequence[escapeLength] = (byte) c;
++escapeLength;
String esc = new String(escapeSequence, 0, escapeLength,
StandardCharsets.ISO_8859_1);
if (ESCAPE_PATTERN.matcher(esc).matches()) {
escapeLength = 0;
} else if (escapeLength == MAX_ESCAPE_LENGTH) {
escapeLength = 0;
unread(escapeSequence);
return super.read(); // No longer registering the escape
}
return read();
}
}
}
User calls EscapeRemovingInputStream.read
this read may call some read's itself to fill an byte buffer escapeSequence
(a push-back may be done calling unread)
the original read returns.
The recognition of an escape sequence seems grammatical: command letter, numerical argument(s). Hence I use a regular expression.
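When the text is already available as a String, the same grammatical observation can be applied directly with one regular expression. This is a sketch under a stated assumption: the sequences are standard control sequences of the form ESC [ parameters final-byte. It does not cover a bare "\033i" without the bracket, and the class name EscapeStripper and the method strip are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeStripper {
    // ESC [ followed by parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
    // and one final byte (0x40-0x7E).
    private static final Pattern CSI = Pattern.compile("\u001B\\[[0-?]*[ -/]*[@-~]");

    // Returns text with every matching control sequence removed.
    static String strip(String text) {
        return CSI.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("\u001B[KHello \u001B[6;1HWorld!")); // prints Hello World!
    }
}
```

Compiling the pattern once and reusing it avoids the repeated per-sequence String.replace calls from the question.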
