I want to read big CSV files (approx. 1 GB) fast, line by line, in UTF-8. I have created a class for it, but it doesn't work properly. UTF-8 decodes a Cyrillic symbol from 2 bytes. I use a byte buffer to read the file; for example, it has a length of 10 bytes. So if a symbol is composed of the 10th and 11th bytes in the file, it won't be decoded correctly :(
public class MyReader extends InputStream {
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(10);
private int buffSize = 0;
private int position = 0;
private boolean EOF = false;
private CharBuffer charBuffer;
private MyReader() {}
static MyReader getFromFile(final String path) throws IOException {
MyReader myReader = new MyReader();
myReader.channel = FileChannel.open(Path.of(path),
StandardOpenOption.READ);
myReader.initNewBuffer();
return myReader;
}
private void initNewBuffer() {
try {
buffSize = channel.read(buffer);
buffer.position(0);
charBuffer = Charset.forName("UTF-8").decode(buffer);
buffer.position(0);
} catch (IOException e) {
throw new RuntimeException("Error reading file: {}", e);
}
}
@Override
public int read() throws IOException {
if (EOF) {
return -1;
}
if (position < charBuffer.length()) {
return charBuffer.array()[position++];
} else {
initNewBuffer();
if (buffSize < 1) {
EOF = true;
} else {
position = 0;
}
return read();
}
}
public char[] readLine() throws IOException {
int readResult = 0;
int startPos = position;
while (readResult != -1) {
readResult = read();
}
return Arrays.copyOfRange(charBuffer.array(), startPos, position);
}
}
A bad solution, but it works :)
private void initNewBuffer() {
try {
buffSize = channel.read(buffer);
buffer.position(0);
charBuffer = StandardCharsets.UTF_8.decode(buffer);
if (buffSize > 0) {
byte edgeByte = buffer.array()[buffSize - 1];
if (edgeByte == (byte) 0xd0 ||
edgeByte == (byte) 0xd1 ||
edgeByte == (byte) 0xc2 ||
edgeByte == (byte) 0xd2 ||
edgeByte == (byte) 0xd3
) {
channel.position(channel.position() - 1);
charBuffer.limit(charBuffer.limit()-1);
}
}
buffer.position(0);
} catch (IOException e) {
throw new RuntimeException("Error reading file: {}", e);
}
}
First: the gain is questionable.
The Files class has many nice and quite fast, production-ready methods.
Bytes with the high bit set to 1 (negative as a signed byte) are part of a UTF-8 multi-byte sequence.
Those whose two high bits are 10 are continuation bytes.
A sequence can be up to 4 bytes long (the original definition allowed up to 6).
So when a buffer ends in the middle of such a sequence, the next buffer starts with continuation bytes that belong to the previous buffer.
The programming logic I gladly leave to you.
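A minimal sketch of that logic, for what it is worth: it uses a CharsetDecoder and ByteBuffer.compact() so that an incomplete trailing sequence is carried over into the next read. The class name, buffer sizes and the printing are my own choices, not taken from the question.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedUtf8Reader {
    // Decode a UTF-8 file in fixed-size chunks, carrying incomplete sequences over.
    public static void dump(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
            ByteBuffer bytes = ByteBuffer.allocate(8192);
            CharBuffer chars = CharBuffer.allocate(8192);
            boolean eof = false;
            while (!eof) {
                eof = channel.read(bytes) == -1;
                bytes.flip();
                // endOfInput = eof: while false, an incomplete sequence stays in 'bytes'
                decoder.decode(bytes, chars, eof);
                chars.flip();
                System.out.print(chars);   // or collect into lines here
                chars.clear();
                bytes.compact();           // keep the leftover bytes for the next read
            }
        }
    }
}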
I need to convert a Reader object into an InputStream. My solution right now is shown below. But my concern is that since this will handle big chunks of data, it will increase memory usage drastically.
private static InputStream getInputStream(final Reader reader) {
char[] buffer = new char[10240];
StringBuilder builder = new StringBuilder();
int charCount;
try {
while ((charCount = reader.read(buffer, 0, buffer.length)) != -1) {
builder.append(buffer, 0, charCount);
}
reader.close();
} catch (final IOException e) {
e.printStackTrace();
}
return new ByteArrayInputStream(builder.toString().getBytes(StandardCharsets.UTF_8));
}
Since I use a StringBuilder, this will keep the full content of the reader object in memory. I want to avoid this. Is there a way I can pipe the Reader object? Any help regarding this is highly appreciated.
Using the Apache Commons IO library, you can do this conversion in one line:
//import org.apache.commons.io.input.ReaderInputStream;
InputStream inputStream = new ReaderInputStream(reader, StandardCharsets.UTF_8);
You can read the documentation for this class at https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/ReaderInputStream.html
It might be worth trying this to see if it solves the memory issue too.
First: this is a rare requirement; often it is needed the other way around, or there is a FileChannel, so one can use a ByteBuffer.
A PipedInputStream would be possible, writing to a PipedOutputStream in a second thread. However, that is not needed here.
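For completeness, such a piped version might look roughly like the sketch below (my own illustration; as said, it needs a second thread because the pipe blocks when its buffer is full):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ReaderPipe {
    // Pump the Reader into a PipedOutputStream on a helper thread,
    // re-encoding the chars as UTF-8 bytes.
    public static InputStream toInputStream(Reader reader) throws IOException {
        PipedInputStream in = new PipedInputStream();
        PipedOutputStream out = new PipedOutputStream(in);
        new Thread(() -> {
            try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                char[] buffer = new char[8192];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    writer.write(buffer, 0, n);
                }
            } catch (IOException e) {
                e.printStackTrace();   // a real implementation should propagate this
            }
        }).start();
        return in;
    }
}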
A Reader gives chars. A Unicode code point is represented by either one or two chars (the latter being a surrogate pair).
/**
* InputStream that presents the chars of a Reader as UTF-8 encoded bytes.
*/
public class ReaderInputStream extends InputStream {
private final Reader reader;
private boolean eof;
private int byteCount;
private byte[] bytes = new byte[6];
public ReaderInputStream(Reader reader) {
this.reader = reader;
}
@Override
public int read() throws IOException {
if (byteCount > 0) {
int c = bytes[0] & 0xFF; // mask to keep the value in 0..255 as required by InputStream.read()
--byteCount;
for (int i = 0; i < byteCount; ++i) {
bytes[i] = bytes[i + 1];
}
return c;
}
if (eof) {
return -1;
}
int c = reader.read();
if (c == -1) {
eof = true;
return -1;
}
char ch = (char) c;
String s;
if (Character.isHighSurrogate(ch)) {
c = reader.read();
if (c == -1) {
// Error: a low surrogate char was expected, not end of stream.
eof = true;
throw new IOException("Expected a low surrogate char instead of EOF");
}
char ch2 = (char) c;
if (!Character.isLowSurrogate(ch2)) {
throw new IOException("Expected a low surrogate char");
}
s = new String(new char [] {ch, ch2});
} else {
s = Character.toString(ch);
}
byte[] bs = s.getBytes(StandardCharsets.UTF_8);
byteCount = bs.length;
System.arraycopy(bs, 0, bytes, 0, byteCount);
return read();
}
}
Path source = Paths.get("...");
Path target = Paths.get("...");
try (Reader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
InputStream in = new ReaderInputStream(reader)) {
Files.copy(in, target);
}
Is there any way to compare two files in Android?
For example: I have two files under the same folder which are the same.
They are the same (also in size), but their names are like myFileA.pdf and myFileB.pdf. So how can I identify whether they are the same or not?
What I have already tried:
compareTo() method: tried myFileA.compareTo(myFileB), but that gives some weird values like -1, -2, etc. I think those values depend on the files' paths.
myFile.length(): but in some rare cases (very rare ones), two different files can have the same size, so I think this is not a proper way.
NOTE: I said the files are under the same folder just as an example; they can be anywhere, e.g. myFileA.pdf can be in NewFolder1 and myFileB.pdf can be in NewFolder2.
Some time ago I wrote a utility to compare the content of two streams in an efficient way: it stops the comparison as soon as the first difference is found.
Here is the code, which I think is quite self-explanatory:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
public class FileComparator implements Comparator<File> {
@Override
public int compare(File file1, File file2) {
// one or both null
if (file1 == file2) {
return 0;
} else if (file1 == null && file2 != null) {
return -1;
} else if (file1 != null && file2 == null) {
return 1;
}
if (file1.isDirectory() || file2.isDirectory()) {
throw new IllegalArgumentException("Unable to compare directory content");
}
// not same size
if (file1.length() < file2.length()) {
return -1;
} else if (file1.length() > file2.length()) {
return 1;
}
try {
return compareContent(file1, file2);
} catch (IOException e) {
throw new RuntimeException(e.getMessage(), e);
}
}
private int bufferSize(long fileLength) {
int multiple = (int) (fileLength / 1024);
if (multiple <= 1) {
return 1024;
} else if (multiple <= 8) {
return 1024 * 2;
} else if (multiple <= 16) {
return 1024 * 4;
} else if (multiple <= 32) {
return 1024 * 8;
} else if (multiple <= 64) {
return 1024 * 16;
} else {
return 1024 * 64;
}
}
private int compareContent(File file1, File file2) throws IOException {
final int BUFFER_SIZE = bufferSize(file1.length());
// check content
try (BufferedInputStream is1 = new BufferedInputStream(new FileInputStream(file1), BUFFER_SIZE);
     BufferedInputStream is2 = new BufferedInputStream(new FileInputStream(file2), BUFFER_SIZE)) {
byte[] b1 = new byte[BUFFER_SIZE];
byte[] b2 = new byte[BUFFER_SIZE];
int read1 = -1;
int read2 = -1;
int read = -1;
do {
read1 = is1.read(b1);
read2 = is2.read(b2);
if (read1 < read2) {
return -1;
} else if (read1 > read2) {
return 1;
} else {
// read1 is equals to read2
read = read1;
}
if (read >= 0) {
if (read != BUFFER_SIZE) {
// clear the buffer not filled from the read
Arrays.fill(b1, read, BUFFER_SIZE, (byte) 0);
Arrays.fill(b2, read, BUFFER_SIZE, (byte) 0);
}
// compare the content of the two buffers
if (!Arrays.equals(b1, b2)) {
return new String(b1).compareTo(new String(b2));
}
}
} while (read >= 0);
// no difference found
return 0;
}
}
}
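A possible usage, as an illustration only (the paths are made up): since the class implements Comparator<File>, a result of 0 means the files have identical content.

File a = new File("/sdcard/NewFolder1/myFileA.pdf");
File b = new File("/sdcard/NewFolder2/myFileB.pdf");
boolean identical = new FileComparator().compare(a, b) == 0;
System.out.println("Files are identical: " + identical);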
Comparing two files:
public static boolean compareFiles(File file1, File file2) {
    byte[] buffer1 = new byte[1024];
    byte[] buffer2 = new byte[1024];
    try (FileInputStream fileInputStream1 = new FileInputStream(file1);
         FileInputStream fileInputStream2 = new FileInputStream(file2)) {
        int read1;
        while ((read1 = fileInputStream1.read(buffer1)) != -1) {
            // read exactly as many bytes from the second file as we got from the first
            int read2 = 0;
            while (read2 < read1) {
                int r = fileInputStream2.read(buffer2, read2, read1 - read2);
                if (r == -1) {
                    return false; // second file is shorter
                }
                read2 += r;
            }
            if (!Arrays.equals(Arrays.copyOf(buffer1, read1), Arrays.copyOf(buffer2, read1))) {
                return false;
            }
        }
        return fileInputStream2.read() == -1; // second file must also be at its end
    } catch (Exception ignore) {
        return false;
    }
}
Of course, before you do that, you should compare the file sizes. Only if they match should you compare the contents.
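Putting both checks together might look like this small sketch (it reuses the compareFiles method above; the name sameFile is mine):

public static boolean sameFile(File file1, File file2) {
    // Cheap size check first; only equal-sized files need a content comparison.
    if (file1.length() != file2.length()) {
        return false;
    }
    return compareFiles(file1, file2);
}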
Sorry for my English. I am trying to read a really big text file character by character (not using readLine()) as fast as possible, but I have not managed it yet. My code:
for(int i = 0; (i = textReader.read()) != -1; ) {
char character = (char) i;
}
It reads a 1 GB text file in 56666 ms. How can I read faster?
Update:
This method reads the 1 GB file in 28833 ms:
FileInputStream fIn = null;
FileChannel fChan = null;
ByteBuffer mBuf;
int count;
try {
fIn = new FileInputStream(textReader);
fChan = fIn.getChannel();
mBuf = ByteBuffer.allocate(128);
do {
count = fChan.read(mBuf);
if(count != -1) {
mBuf.rewind();
for(int i = 0; i < count; i++) {
char c = (char)mBuf.get();
}
}
} while(count != -1);
}catch(Exception e) {
}
The fastest way to read input is to use a buffer. Here is an example of a class that has an internal buffer.
class Parser
{
final private int BUFFER_SIZE = 1 << 16;
private DataInputStream din;
private byte[] buffer;
private int bufferPointer, bytesRead;
public Parser(InputStream in)
{
din = new DataInputStream(in);
buffer = new byte[BUFFER_SIZE];
bufferPointer = bytesRead = 0;
}
public int nextInt() throws Exception
{
int ret = 0;
byte c = read();
while (c <= ' ') c = read();
//boolean neg = c == '-';
//if (neg) c = read();
do
{
ret = ret * 10 + c - '0';
c = read();
} while (c > ' ');
//if (neg) return -ret;
return ret;
}
private void fillBuffer() throws Exception
{
bytesRead = din.read(buffer, bufferPointer = 0, BUFFER_SIZE);
if (bytesRead == -1) buffer[0] = -1;
}
private byte read() throws Exception
{
if (bufferPointer == bytesRead) fillBuffer();
return buffer[bufferPointer++];
}
}
This parser has a function, nextInt, that will give you the next int; if you want the next char you can call the read() function.
This is the fastest way to read from a file (as far as I know)
You would initialize this parser like this:
Parser p = new Parser(new FileInputStream("text.txt"));
int c;
while((c = p.read()) != -1)
System.out.print((char)c);
This code reads 250 MB in 7782 ms.
Disclaimer:
the code is not mine; it was posted as a solution to a problem on CodeChef by the user 'Kamalakannan CM'.
I would use a BufferedReader, since it reads buffered. A short sample:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.nio.CharBuffer;
public class Main {
public static void main(String... args) {
try (FileReader fr = new FileReader("a.txt")) {
try (BufferedReader reader = new BufferedReader(fr)) {
CharBuffer charBuffer = CharBuffer.allocate(8192);
reader.read(charBuffer);
} catch (IOException e) {
e.printStackTrace();
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
The default constructor uses a buffer size of 8192. If you want a different buffer size, you can use the constructor that takes an explicit size, BufferedReader(Reader, int). Alternatively you can read into an array buffer:
....
char[] buffer = new char[255];
reader.read(buffer);
....
or read one character at a time:
int c = reader.read();
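A complete, self-contained version of that character-by-character loop might look like the sketch below ("a.txt" is just a placeholder path, and the explicit buffer size is a guess, not a benchmark result):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CharByChar {
    public static void main(String[] args) {
        try (BufferedReader reader = new BufferedReader(new FileReader("a.txt"), 1 << 16)) {
            int c;
            while ((c = reader.read()) != -1) {
                char character = (char) c;
                // process the character here
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}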
I have a byte array which is filled by a serial port event and code is shown below:
private InputStream input = null;
......
......
public void serialEvent(SerialPortEvent se){
if(se.getEventType() == SerialPortEvent.DATA_AVAILABLE){
int length = input.available();
if(length > 0){
byte[] array = new byte[length];
int numBytes = input.read(array);
String text = new String(array);
}
}
}
The variable text contains the below characters,
"\033[K", "\033[m", "\033[H2J", "\033[6;1H" ,"\033[?12l", "\033[?25h", "\033[5i", "\033[4i", "\033i" and similar types..
As of now, I use String.replace to remove all these characters from the string.
I have tried new String(array, charset) with all the Charset options, but I was not able to remove those characters.
Is there any way where I can remove those characters without using replace method?
I gave an unsatisfying answer earlier; thanks to @OlegEstekhin for pointing that out.
As no one else has answered yet, and a solution is not a two-liner, here it goes.
Make a wrapping InputStream that throws away escape sequences. I have used a PushbackInputStream, so that a partially collected sequence which turns out not to be an escape can be pushed back and read normally first. A plain FilterInputStream would also suffice here.
public class EscapeRemovingInputStream extends PushbackInputStream {
public static void main(String[] args) {
String s = "\u001B[kHello \u001B[H12JWorld!";
byte[] buf = s.getBytes(StandardCharsets.ISO_8859_1);
ByteArrayInputStream bais = new ByteArrayInputStream(buf);
EscapeRemovingInputStream bin = new EscapeRemovingInputStream(bais);
try (InputStreamReader in = new InputStreamReader(bin,
StandardCharsets.ISO_8859_1)) {
int c;
while ((c = in.read()) != -1) {
System.out.print((char) c);
}
System.out.println();
} catch (IOException ex) {
Logger.getLogger(EscapeRemovingInputStream.class.getName()).log(
Level.SEVERE, null, ex);
}
}
private static final Pattern ESCAPE_PATTERN = Pattern.compile(
"\u001B\\[(k|m|H\\d+J|\\d+:\\d+H|\\?\\d+\\w|\\d*i)");
private static final int MAX_ESCAPE_LENGTH = 20;
private final byte[] escapeSequence = new byte[MAX_ESCAPE_LENGTH];
private int escapeLength = 0;
private boolean eof = false;
public EscapeRemovingInputStream(InputStream in) {
super(in, MAX_ESCAPE_LENGTH); // push-back buffer large enough for a whole escape sequence
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
for (int i = 0; i < len; ++i) {
int c = read();
if (c == -1) {
return i == 0 ? -1 : i;
}
b[off + i] = (byte) c;
}
return len;
}
@Override
public int read() throws IOException {
int c = eof ? -1 : super.read();
if (c == -1) { // Throw away a trailing half escape sequence.
eof = true;
return c;
}
if (escapeLength == 0 && c != 0x1B) {
return c;
} else {
escapeSequence[escapeLength] = (byte) c;
++escapeLength;
String esc = new String(escapeSequence, 0, escapeLength,
StandardCharsets.ISO_8859_1);
if (ESCAPE_PATTERN.matcher(esc).matches()) {
escapeLength = 0;
} else if (escapeLength == MAX_ESCAPE_LENGTH) {
escapeLength = 0;
unread(escapeSequence);
return super.read(); // No longer registering the escape
}
return read();
}
}
}
The user calls EscapeRemovingInputStream.read; this read may itself call several reads to fill the byte buffer escapeSequence (a push-back may be done by calling unread); then the original read returns.
The recognition of an escape sequence seems grammatical: a command letter with numerical argument(s). Hence I use a regular expression.
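As a quick illustration (not part of the original answer), the pattern can be checked against some of the sequences mentioned in the question:

import java.util.regex.Pattern;

public class EscapePatternDemo {
    public static void main(String[] args) {
        // Same pattern as in EscapeRemovingInputStream above.
        Pattern p = Pattern.compile("\u001B\\[(k|m|H\\d+J|\\d+:\\d+H|\\?\\d+\\w|\\d*i)");
        String[] samples = { "\u001B[m", "\u001B[H2J", "\u001B[?25h", "\u001B[5i" };
        for (String s : samples) {
            System.out.println(s.replace("\u001B", "ESC") + " matches: " + p.matcher(s).matches());
        }
    }
}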
I am writing an FLV parser in Java and have come up against an issue. The program successfully parses and groups together tags into packets and correctly identifies and assigns a byte array for each tag's body based upon the BodyLength flag in the header. However in my test files it successfully completes this but stops before the last 4 bytes.
The byte sequence left out in the first file is :
00 00 14 C3
And in the second:
00 00 01 46
Clearly it is an issue with the final 4 bytes of both files; however, I cannot spot the error in my logic. I suspect it might be:
while (in.available() != 0)
However, I also doubt this is the case, as the program successfully enters the loop for the final tag; it just stops 4 bytes short. Any help is greatly appreciated. (I know proper exception handling is not yet in place.)
Parser.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.lang.reflect.Array;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.InputMismatchException;
/**
*
* @author A
*
* Parser class for FLV files
*/
public class Parser {
private static final int HEAD_SIZE = 9;
private static final int TAG_HEAD_SIZE = 15;
private static final byte[] FLVHEAD = { 0x46, 0x4C, 0x56 };
private static final byte AUDIO = 0x08;
private static final byte VIDEO = 0x09;
private static final byte DATA = 0x12;
private static final int TYPE_INDEX = 4;
private File file;
private FileInputStream in;
private ArrayList<Packet> packets;
private byte[] header = new byte[HEAD_SIZE];
Parser() throws FileNotFoundException {
throw new FileNotFoundException();
}
Parser(URI uri) {
file = new File(uri);
init();
}
Parser(File file) {
this.file = file;
init();
}
private void init() {
packets = new ArrayList<Packet>();
}
public void parse() {
boolean test = false;
try {
test = parseHeader();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
if (test) {
System.out.println("Header Verified");
// Add header packet to beginning of list & then null packet
Packet p = new Packet(PTYPE.P_HEAD);
p.setSize(header.length);
p.setByteArr(header);
packets.add(p);
p = null;
try {
parseTags();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
} else {
try {
in.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// throw FileNotFoundException because incorrect file
}
}
private boolean parseHeader() throws FileNotFoundException, IOException {
if (file == null)
throw new FileNotFoundException();
in = new FileInputStream(file);
in.read(header, 0, 9);
return Arrays.equals(FLVHEAD, Arrays.copyOf(header, FLVHEAD.length));
}
private void parseTags() throws IOException {
if (file == null)
throw new FileNotFoundException();
byte[] tagHeader = new byte[TAG_HEAD_SIZE];
Arrays.fill(tagHeader, (byte) 0x00);
byte[] body;
byte[] buf;
PTYPE pt;
int OFFSET = 0;
while (in.available() != 0) {
// Read first 5 - bytes, previous tag size + tag type
in.read(tagHeader, 0, 5);
if (tagHeader[TYPE_INDEX] == AUDIO) {
pt = PTYPE.P_AUD;
} else if (tagHeader[TYPE_INDEX] == VIDEO) {
pt = PTYPE.P_VID;
} else if (tagHeader[TYPE_INDEX] == DATA) {
pt = PTYPE.P_DAT;
} else {
// Header should've been dealt with - if previous data types not
// found then throw exception
System.out.println("Unexpected header format: ");
System.out.print(String.format("%02x\n", tagHeader[TYPE_INDEX]));
System.out.println("Last Tag");
packets.get(packets.size()-1).diag();
System.out.println("Number of tags found: " + packets.size());
throw new InputMismatchException();
}
OFFSET = TYPE_INDEX;
// Read body size - 3 bytes
in.read(tagHeader, OFFSET + 1, 3);
// Body size buffer array - padding for 1 0x00 bytes
buf = new byte[4];
Arrays.fill(buf, (byte) 0x00);
// Fill size bytes
buf[1] = tagHeader[++OFFSET];
buf[2] = tagHeader[++OFFSET];
buf[3] = tagHeader[++OFFSET];
// Calculate body size
int bSize = ByteBuffer.wrap(buf).order(ByteOrder.BIG_ENDIAN)
.getInt();
// Initialise Array
body = new byte[bSize];
// Timestamp
in.read(tagHeader, ++OFFSET, 3);
Arrays.fill(buf, (byte) 0x00);
// Fill size bytes
buf[1] = tagHeader[OFFSET++];
buf[2] = tagHeader[OFFSET++];
buf[3] = tagHeader[OFFSET++];
int milliseconds = ByteBuffer.wrap(buf).order(ByteOrder.BIG_ENDIAN)
.getInt();
// Read padding
in.read(tagHeader, OFFSET, 4);
// Read body
in.read(body, 0, bSize);
// Diagnostics
//printBytes(body);
Packet p = new Packet(pt);
p.setSize(tagHeader.length + body.length);
p.setByteArr(concat(tagHeader, body));
p.setMilli(milliseconds);
packets.add(p);
p = null;
// Zero out for next iteration
body = null;
Arrays.fill(buf, (byte)0x00);
Arrays.fill(tagHeader, (byte)0x00);
milliseconds = 0;
bSize = 0;
OFFSET = 0;
}
in.close();
}
private byte[] concat(byte[] tagHeader, byte[] body) {
int aLen = tagHeader.length;
int bLen = body.length;
byte[] C = (byte[]) Array.newInstance(tagHeader.getClass()
.getComponentType(), aLen + bLen);
System.arraycopy(tagHeader, 0, C, 0, aLen);
System.arraycopy(body, 0, C, aLen, bLen);
return C;
}
private void printBytes(byte[] b) {
System.out.println("\n--------------------");
for (int i = 0; i < b.length; i++) {
System.out.print(String.format("%02x ", b[i]));
if (((i % 8) == 0 ) && i != 0)
System.out.println();
}
}
}
Packet.java
public class Packet {
private PTYPE type = null;
byte[] buf;
int milliseconds;
Packet(PTYPE t) {
this.setType(t);
}
public void setSize(int s) {
buf = new byte[s];
}
public PTYPE getType() {
return type;
}
public void setType(PTYPE type) {
if (this.type == null)
this.type = type;
}
public void setByteArr(byte[] b) {
this.buf = b;
}
public void setMilli(int milliseconds) {
this.milliseconds = milliseconds;
}
public void diag(){
System.out.println("|-- Tag Type: " + type);
System.out.println("|-- Milliseconds: " + milliseconds);
System.out.println("|-- Size: " + buf.length);
System.out.println("|-- Bytes: ");
for(int i = 0; i < buf.length; i++){
System.out.print(String.format("%02x ", buf[i]));
if (((i % 8) == 0 ) && i != 0)
System.out.println();
}
System.out.println();
}
}
jFLV.java
import java.net.URISyntaxException;
public class jFLV {
/**
* #param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
Parser p = null;
try {
p = new Parser(jFLV.class.getResource("sample.flv").toURI());
} catch (URISyntaxException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
p.parse();
}
}
PTYPE.java
public enum PTYPE {
P_HEAD,P_VID,P_AUD,P_DAT
};
Both your use of available() and your calls to read are broken. Admittedly I would have somewhat expected this to be okay for a FileInputStream (until you reach the end of the stream, at which point ignoring the return value of read could still be disastrous), but I personally assume that streams can always return partial data.
available() only tells you whether there's any data available right now. It's very rarely useful - just ignore it. If you want to read until the end of the stream, you should usually keep calling read until it returns -1. It's slightly tricky to combine that with "I'm trying to read the next block", admittedly. (It would be nice if InputStream had a peek() method, but it doesn't. You can wrap it in a BufferedInputStream and use mark/reset to test that at the start of each loop... ugly, but it should work.)
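A rough sketch of that mark/reset "peek" (my own illustration; exception handling and the surrounding setup are omitted, and file is a placeholder):

BufferedInputStream bin = new BufferedInputStream(new FileInputStream(file));
while (true) {
    bin.mark(1);              // remember the current position
    if (bin.read() == -1) {
        break;                // genuine end of stream
    }
    bin.reset();              // push the peeked byte back
    // ... read and parse the next tag from bin here ...
}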
Next, you're ignoring the result of InputStream.read (in multiple places). You should always use the result of this, rather than assuming it has read the amount of data you've asked for. You might want a couple of helper methods, e.g.
static byte[] readExactly(InputStream input, int size) throws IOException {
byte[] data = new byte[size];
readExactly(input, data);
return data;
}
static void readExactly(InputStream input, byte[] data) throws IOException {
int index = 0;
while (index < data.length) {
int bytesRead = input.read(data, index, data.length - index);
if (bytesRead < 0) {
throw new EOFException("Expected more data");
}
index += bytesRead;
}
}
You should use one of the read methods instead of available(), as available() "Returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream."
It is not designed to tell you how much data is left to read.
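In other words, loop on the return value of read rather than on available(). A minimal sketch (in is your input stream, and process is a placeholder for your own handling):

byte[] chunk = new byte[4096];
int n;
while ((n = in.read(chunk)) != -1) {
    // only the first n bytes of chunk are valid in this iteration
    process(chunk, n);
}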