OutputStream translating bytes into characters with charset (opposite of OutputStreamWriter)

I need to manipulate content as it is being written to an OutputStream. Specifically, I need to replace CR or LF with CRLF to canonicalize text. This is easy for simple character sets where CR=13 and LF=10, but not so simple with multi-byte character sets. The characters should be replaced, not the bytes. It is non-trivial in general to do that in the output stream itself.
The built-in class OutputStreamWriter converts from characters to bytes for a configured encoding. I'm looking for a class that does the opposite, that is an OutputStream configured with a character set that buffers data as needed and translates the written bytes into characters with the character set (or throws on invalid byte sequences), making the characters available in some way, for example by forwarding the call to a Writer.
In other words I want to convert from bytes to characters on-the-fly as content is being written. I could write everything to a buffer and read it back with an InputStreamReader, but that is inefficient for very large payloads that won't fit in memory.
Is there a class like this somewhere (ideally open source, as I don't think it is built in)? If not, are there similar examples for efficient streaming conversion I could use as a starting point? The JDK classes I've seen are optimized for converting many bytes at a time, not for streaming use.

I wrote an implementation based on CharsetDecoder. Create a decoder and allocate a ByteBuffer and CharBuffer in the constructor:
decoder = charset.newDecoder();
byteBuf = ByteBuffer.allocate(bufferSize);
charBuf = CharBuffer.allocate(bufferSize);
Then implement write:
public void write(int b) throws IOException {
if (!byteBuf.hasRemaining()) {
decodeAndWriteByteBuffer(false);
}
byteBuf.put((byte) b);
}
And decodeAndWriteByteBuffer:
private void decodeAndWriteByteBuffer(boolean endOfInput) throws IOException {
byteBuf.flip();
CoderResult cr;
do {
cr = byteBuf.hasRemaining() || endOfInput
? decoder.decode(byteBuf, charBuf, endOfInput)
: CoderResult.UNDERFLOW;
if (cr.isUnderflow()) {
if (endOfInput) {
do {
cr = decoder.flush(charBuf);
writeCharBuffer();
} while (cr.isOverflow());
if (cr.isError()) {
cr.throwException();
}
}
} else if (cr.isOverflow()) {
writeCharBuffer();
} else {
cr.throwException();
}
} while (cr.isOverflow());
byteBuf.compact();
}
The remaining details are left as an exercise to the reader. It seems to work, though it is too early to say anything about performance.
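For readers who want those remaining details filled in, here is one possible complete version. It is a minimal sketch, not the author's actual class: the name DecodingOutputStream, the decodeAll helper, the minimum buffer size of 16 and the choice to report (rather than replace) invalid input are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical class name; decodes written bytes and forwards the chars to a Writer.
final class DecodingOutputStream extends OutputStream {
    private final Writer out;
    private final CharsetDecoder decoder;
    private final ByteBuffer byteBuf;
    private final CharBuffer charBuf;
    private boolean closed;

    DecodingOutputStream(Writer out, Charset charset, int bufferSize) {
        // Assumption: 16 bytes is longer than any single encoded character.
        if (bufferSize < 16) throw new IllegalArgumentException("bufferSize < 16");
        this.out = out;
        // REPORT turns invalid byte sequences into exceptions.
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        this.byteBuf = ByteBuffer.allocate(bufferSize);
        this.charBuf = CharBuffer.allocate(bufferSize);
    }

    @Override
    public void write(int b) throws IOException {
        if (!byteBuf.hasRemaining()) {
            decode(false);
        }
        byteBuf.put((byte) b);
    }

    @Override
    public void flush() throws IOException {
        decode(false); // an incomplete trailing sequence stays buffered
        out.flush();
    }

    @Override
    public void close() throws IOException {
        if (closed) return;
        closed = true;
        decode(true);  // a dangling partial sequence is an error here
        out.close();
    }

    private void decode(boolean endOfInput) throws IOException {
        byteBuf.flip();
        CoderResult cr;
        do {
            cr = decoder.decode(byteBuf, charBuf, endOfInput);
            drain();
            if (cr.isError()) cr.throwException();
        } while (cr.isOverflow());
        if (endOfInput) {
            do {
                cr = decoder.flush(charBuf);
                drain();
            } while (cr.isOverflow());
        }
        byteBuf.compact(); // keep undecoded bytes for the next call
    }

    // Forwards everything decoded so far and empties the char buffer.
    private void drain() throws IOException {
        charBuf.flip();
        out.write(charBuf.array(), charBuf.arrayOffset() + charBuf.position(), charBuf.remaining());
        charBuf.clear();
    }

    // Demo helper: pushes the bytes through one at a time and returns the decoded text.
    static String decodeAll(byte[] bytes, Charset charset) {
        StringWriter sink = new StringWriter();
        try (DecodingOutputStream s = new DecodingOutputStream(sink, charset, 16)) {
            for (byte b : bytes) s.write(b);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }
}
```

For the CR/LF canonicalization from the question, the Writer passed in would be a filtering Writer that rewrites line endings before passing characters on.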

Related

How to read a large log file that another process is currently writing

A log file is created each day; one file is about 400 MB, and the JVM has about 2 GB of memory.
One process writes the large log file in append ('a') mode.
I want to read this file and support the following:
Keep reading newly appended data.
Store the offset so that reading can resume after a JVM restart.
This is my simple implementation, but I don't know whether its time and memory consumption are good. Is there a better way to solve this problem?
public static void main(String[] args) throws IOException {
String filePath = "D://test.log";
long restoreOffset = resotoreOffset();
RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
randomAccessFile.seek(restoreOffset);
while (true) {
String line = randomAccessFile.readLine();
if(line != null) {
// doSomething(line);
restoreOffset = randomAccessFile.getFilePointer();
//storeOffset(restoreOffset);
}
}
}

It's not, unfortunately.
There are 2 major problems with this code. First I'll tackle the simple one, but the most important one is the second point.
Encoding issues
String line = randomAccessFile.readLine();
This line converts bytes to characters implicitly, and that's generally a bad idea, because bytes aren't characters, and converting from one to the other requires a charset encoding.
This method (readLine() from RAF) is a bizarre case - probably because RandomAccessFile is incredibly old API. Using this method will apply some bizarro ISO-8859-1 esque charset encoding: It converts bytes to chars by taking each byte as a complete char, assuming the byte represents the unicode character as listed, which isn't actually a sane encoding, just a lazy programmer.
The upshot for you is: Unless you can guarantee that this log file shall always only ever contain ASCII characters, this code is broken, and readLine cannot be used at all. Instead you'll have to do considerably more work: read bytes until you hit a newline, then turn the bytes so gathered into a string with new String(byteArray, StandardCharsets.UTF_8), or use ByteBuffer and apply similar tactics. But keep reading, because solving the second problem kinda solves this one automatically.
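To make the encoding problem concrete, here is a small sketch that reads the same line twice: once with RAF's readLine() and once by collecting the bytes and decoding them as UTF-8. The class and method names (RafLineDemo, readUtf8Line, compare) are made up for this illustration, it only treats '\n' as a line end, and it still reads one byte at a time, so it fixes the encoding problem only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class RafLineDemo {
    // Collects bytes up to '\n' (or end of file), then decodes them as UTF-8.
    // Returns null when there is nothing left to read.
    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = raf.read()) != -1 && b != '\n') {
            line.write(b);
        }
        if (b == -1 && line.size() == 0) return null;
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    // Writes text as one UTF-8 line to a temp file and returns
    // { result of readLine(), result of readUtf8Line() }.
    static String[] compare(String text) {
        try {
            Path tmp = Files.createTempFile("raf-demo", ".log");
            try {
                Files.write(tmp, (text + "\n").getBytes(StandardCharsets.UTF_8));
                try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                    String naive = raf.readLine(); // one char per byte
                    raf.seek(0);
                    String decoded = readUtf8Line(raf);
                    return new String[] { naive, decoded };
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the input "café", readLine() hands back the two UTF-8 bytes of "é" as two separate characters, while the explicit decode returns the original text.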
Buffering
Modern computer systems tend to like 'packeting'. You can't really operate on a single byte. Take SSDs (though this applies to spinning platter disks as well): The actual SSD hardware can't read single bytes. It can only read entire blocks worth of data.
When you therefore ask the OS explicitly for a single byte, that ends up setting off a chain of events that causes the SSD to read the entire block, then pass that entire block to the operating system, which will then disregard everything except the one byte you wanted, and returns just that.
If your code then asks for the next byte, we do that routine again.
So, if you read 1024 bytes consecutively from an SSD that has 1024-byte blocks, doing so by calling read() 1024 times causes the SSD to perform 1024 reads, whereas calling read(byteArr) once, passing it a 1024-byte array, causes the SSD to perform a single read.
Yup, that means the byte array solution asks for roughly 1000 times fewer reads. In practice the OS caches blocks, which narrows the gap, but every single-byte read() is still a separate system call.
The same applies to networking, too. Sending 1 byte a thousand times is usually far slower than sending 1000 bytes once; on a typical Ethernet link a TCP segment can carry up to about 1460 bytes of payload (1500-byte MTU), so sending any less than that gains you almost nothing.
RAF's readLine() works like the first (bad) scenario: It reads bytes one at a time until it hits a newline character. Thus, to read a 100 character string, it's 100x slower than just knowing you need to read 100 characters and reading them in one go.
The solution
You may want to abandon RandomAccessFile entirely, it's quite old API.
A major issue with buffering is that it's a lot harder unless you know how many bytes to read beforehand. Here, you don't know that: You want to keep reading until you hit a newline character, but you have no idea how long it'll be until we get there. Furthermore, buffering APIs tend to just return what's convenient, and may therefore read fewer bytes than we ask for (it'll always read at least 1, though, unless we hit end of file). So, we need to write code that will repeatedly read entire chunk's worth of data, analyse the chunk for a newline, and if it's not there, keep reading.
Furthermore, opening channels and such is expensive. So, if you want to dig through all log lines, writing code that opens a new channel every time is suboptimal.
How about this, using the newer file API from java.nio.file:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LogLineReader implements AutoCloseable {
    private final byte[] buffer = new byte[1024];
    // bb is a view on buffer; bb.position() is the number of bytes buffered.
    private final ByteBuffer bb = ByteBuffer.wrap(buffer);
    private final SeekableByteChannel channel;
    private final Charset charset = StandardCharsets.UTF_8;

    public LogLineReader(Path p) throws IOException {
        channel = Files.newByteChannel(p, StandardOpenOption.READ);
        channel.position(111L); // placeholder: seek to your restored offset here
    }

    @Override public void close() throws IOException {
        channel.close();
    }

    // This code buffers: First, our internal buffer is scanned
    // for a new line. If there is no full line in the buffer,
    // we read bytes from the file and check again until we find one.
    public String readLine() throws IOException {
        if (!channel.isOpen()) return null;
        int scanStart = 0;
        while (true) {
            // Scan through the bytes we have buffered for a newline.
            for (int i = scanStart; i < bb.position(); i++) {
                if (buffer[i] == '\n') {
                    // Found it. Take all bytes up to the new line, turn into
                    // a string.
                    String res = new String(buffer, 0, i, charset);
                    // Copy all bytes from _after_ the newline to the front.
                    int remaining = bb.position() - i - 1;
                    System.arraycopy(buffer, i + 1, buffer, 0, remaining);
                    // Adjust the position (which represents how many bytes are buffered).
                    bb.position(remaining);
                    return res;
                }
            }
            scanStart = bb.position();
            // If we get here, the buffer is empty or contains no newline.
            if (scanStart == bb.limit()) {
                throw new IOException("Log line too long");
            }
            int read = channel.read(bb); // let's fetch more bytes!
            if (read == -1) {
                // we've reached the end of the file.
                if (bb.position() == 0) return null;
                String res = new String(buffer, 0, bb.position(), charset);
                bb.clear(); // otherwise the same tail would be returned again
                return res;
            }
        }
    }
}
For the sake of efficiency, this code cannot deal with log lines longer than 1024 bytes; feel free to up that number. If you want to be capable of reading infinite size loglines, at some point a gigantic buffer is a problem. If you must, you could write code that resizes the buffer if you hit 1024, or you can update this code so that it keeps reading but only returns a truncated string built from the first 1024 bytes. I'll leave that as an exercise for you.
NB: I also didn't test this, but at the very least it should give you the general gist of using SeekableByteChannel, and the concept of buffers.
To use:
Path p = Paths.get("D://logfile.txt");
try (LogLineReader reader = new LogLineReader(p)) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// do something with line
}
}
You must ensure the LLR object is closed, hence, use try-with-resources.

Splitting a string with byte length limits in java

I want to split a String to a String[] array, whose elements meet following conditions.
s.getBytes(encoding).length should not exceed maxsize(int).
If I join the splitted strings with StringBuilder or + operator, the result should be exactly the original string.
The input string may have unicode characters which can have multiple bytes when encoded in e.g. UTF-8.
The desired prototype is shown below.
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize)
And the testing code:
public boolean isNice(String str, String encoding, int max)
{
//boolean success=true;
StringBuilder b=new StringBuilder();
String[] splitted= SplitStringByByteLength(str,encoding,max);
for(String s: splitted)
{
if(s.getBytes(encoding).length>max)
return false;
b.append(s);
}
if(str.compareTo(b.toString())!=0)
return false;
return true;
}
Though it seems easy when the input string has only ASCII characters, the fact that it could contain multibyte characters confuses me.
Thank you in advance.
Edit: I added my code implementation. (Inefficient)
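public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) throws UnsupportedEncodingException
{
ArrayList<String> splitted=new ArrayList<String>();
StringBuilder builder=new StringBuilder();
// Java strings are not '\0'-terminated, so loop up to src.length().
for(int i=0;i<src.length();++i)
{
String tmp=builder.toString();
char c=src.charAt(i);
builder.append(c);
if(builder.toString().getBytes(encoding).length>maxsize)
{
// c does not fit: emit the chunk without it and start the next chunk with c.
// Assumes maxsize is large enough to hold any single character.
splitted.add(tmp);
builder=new StringBuilder();
builder.append(c);
}
}
// Do not lose the final, partially filled chunk.
if(builder.length()>0)
splitted.add(builder.toString());
return splitted.toArray(new String[splitted.size()]);
}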
Is this the only way to solve this problem?

The class CharsetEncoder has provision for your requirement. Extract from the Javadoc of the encode method:
public final CoderResult encode(CharBuffer in,
ByteBuffer out,
boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...
In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:
...
CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.
A possible code could be:
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) {
Charset cs = Charset.forName(encoding);
CharsetEncoder coder = cs.newEncoder();
ByteBuffer out = ByteBuffer.allocate(maxsize); // output buffer of required size
CharBuffer in = CharBuffer.wrap(src);
List<String> ss = new ArrayList<>(); // a list to store the chunks
int pos = 0;
while(true) {
CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
int newpos = src.length() - in.length();
String s = src.substring(pos, newpos);
ss.add(s); // add what has been encoded to the list
pos = newpos; // store new input position
out.rewind(); // and rewind output buffer
if (! cr.isOverflow()) {
break; // everything has been encoded
}
}
return ss.toArray(new String[0]);
}
This will split the original string in chunks that when encoded in bytes fit as much as possible in byte arrays of the given size (assuming of course that maxsize is not ridiculously small).
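For anyone who wants to run this, here is a self-contained variant of the same encoder technique. It is a sketch: the class name ByteLimitedSplitter is made up, it returns a List instead of an array, and it adds one thing the code above does not have, namely an exception when no progress is possible (maxBytes too small for the next character, or an unencodable character), so a tiny limit cannot cause an endless loop.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.ArrayList;
import java.util.List;

final class ByteLimitedSplitter {
    // Splits src into chunks whose encoded form fits in maxBytes each.
    static List<String> split(String src, Charset cs, int maxBytes) {
        CharsetEncoder encoder = cs.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(src);
        List<String> chunks = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Encode until the output buffer is full or the input is used up.
            CoderResult cr = encoder.encode(in, out, true);
            int newPos = in.position(); // chars consumed so far
            if (cr.isError() || (cr.isOverflow() && newPos == pos)) {
                throw new IllegalArgumentException(
                        "cannot encode the character at index " + newPos
                        + " within " + maxBytes + " bytes");
            }
            chunks.add(src.substring(pos, newPos));
            pos = newPos;
            out.clear(); // reuse the output buffer for the next chunk
            if (cr.isUnderflow()) {
                break;   // all input has been encoded
            }
        }
        return chunks;
    }
}
```

Because the encoder consumes a surrogate pair only as a whole, the chunk boundaries never fall inside one.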

The problem lies in the existence of Unicode "supplementary characters" (see Javadoc of the Character class), that take up two "character places" (a surrogate pair) in a String, and you shouldn't split your String in the middle of such a pair.
An easy approach to splitting would be to stick to the worst case that a single Unicode code point can take at most four bytes in UTF-8. With a limit of, say, 400 bytes, split the string after every 99 code points (using string.offsetByCodePoints(pos, 99) ). In most cases, you won't fill the 400 bytes, but you'll be on the safe side.
Some words about code points and characters
When Java started, Unicode had fewer than 65536 characters, so Java decided that 16 bits were enough for a character. Later the Unicode standard exceeded the 16-bit limit, and Java had a problem: a single Unicode element (now called a "code point") no longer fit into a single Java character.
They decided to go for an encoding into 16-bit entities, being 1:1 for most usual code points, and occupying two "characters" for the exotic code points beyond the 16-bit limit (the pair built from so-called "surrogate characters" from a spare code range below 65535). So now it can happen that e.g. string.charAt(5) and string.charAt(6) must be seen in combination, as a "surrogate pair", together encoding one Unicode code point.
That's the reason why you shouldn't split a string at an arbitrary index.
To help the application programmer, the String class then got a new set of methods, working in code point units, and e.g. string.offsetByCodePoints(pos, 99) means: from the index pos, advance by 99 code points forward, giving an index that will often be pos+99 (in case the string doesn't contain anything exotic), but might be up to pos+198, if all the following string elements happen to be surrogate pairs.
Using the code-point methods, you are safe not to land in the middle of a surrogate pair.
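As an illustration of the code-point approach, here is a minimal sketch. The class name CodePointSplitter and the parameter codePointsPerChunk are invented for this example; for a UTF-8 byte limit you would pass a value no larger than the limit divided by 4.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointSplitter {
    // Splits s into chunks of at most codePointsPerChunk code points,
    // never cutting through a surrogate pair.
    static List<String> split(String s, int codePointsPerChunk) {
        List<String> chunks = new ArrayList<>();
        // Count once so offsetByCodePoints is never asked to run past the end.
        int remaining = s.codePointCount(0, s.length());
        int pos = 0;
        while (remaining > 0) {
            int take = Math.min(codePointsPerChunk, remaining);
            int end = s.offsetByCodePoints(pos, take);
            chunks.add(s.substring(pos, end));
            pos = end;
            remaining -= take;
        }
        return chunks;
    }
}
```

The chunks are usually smaller than the byte limit allows, which is the price of not encoding anything while splitting.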

How to convert clob to string with encoding in java

We are doing massive batch of xml processing and the logic to convert clob to string is shown below.
import java.sql.Clob
import org.apache.commons.io.IOUtils
String extractXml(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
String sourceXml
try {
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream()), encoding) // 1. Encoding not working
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream(), encoding), encoding) // 2. Encoding working
} catch (Exception e) {
...
}
return sourceXml
}
My queries:
a. I am not sure why (1) doesn't work even though I am using getCharacterStream() instead of getAsciiStream(),
but (2) seems to work fine. Maybe that is because I am explicitly overriding the system encoding?
b. Solution (2) looks a bit odd, as the encoding is specified twice (once for the byte array and once for creating the string).
I am not sure if there are any performance issues, and I wonder if there is a better way to write this.
c. I thought of dropping the Apache Commons library and using a plain Java solution.
The surprising thing is that I did not give any explicit encoding, yet it seems to work perfectly.
Is it because it streams characters straight into a string buffer?
/*
* working perfectly and returns the text correctly encoded
*/
String extractXmlWithoutApacheCommons(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
StringBuffer sb = new StringBuffer((int) xmlClob.length())
try {
Reader r = xmlClob.getCharacterStream()
char[] cbuf = new char[2048]
int n = 0
while ((n = r.read(cbuf, 0, cbuf.length)) != -1) {
if (n > 0) {
sb.append(cbuf, 0, n)
}
}
} catch (Exception e) {
...
}
return sb.toString()
}
Can you guys please shed some light to understand them.

The Clob already has an encoding. It's whatever you've specified in the database, and once you read it on Java side it'll be a String (with the implicit UTF-16 encoding, not that it matters at all).
Whatever you think you're doing with all those encoding tricks is wrong and useless. You only need to specify an encoding when turning bytes to chars or the other way around. You're dealing with chars only (except in your first example where you for some unknown reason want to turn them to bytes).
If you want to use IOUtils, then readFully(Reader input, char[] buffer) would be the method to use.
The platform default encoding has no effect in this whole question, since you shouldn't be working with bytes at all.
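To see why variant (1) from the question breaks while variant (2) happens to work, it helps to reproduce the round trip without any Clob. IOUtils.toByteArray(Reader) without a charset argument encodes the characters with the platform default, so (1) encodes with one charset and decodes with another whenever the default differs from encoding. The sketch below mimics that with plain JDK calls; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // chars -> bytes -> chars, the detour the question's code takes.
    static String viaBytes(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    // Variant (1) on a machine whose default charset is ISO-8859-1
    // while the requested encoding is UTF-8: the text is damaged.
    static String mismatched(String text) {
        return viaBytes(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
    }

    // Variant (2): same charset both ways, so the detour is lossless but pointless.
    static String matched(String text) {
        return viaBytes(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
    }
}
```

Staying with chars the whole way, as in the question's third snippet, avoids the question of charsets altogether.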
Edit:
A slightly more modern way with the standard JDK classes would be to use Reader.read(CharBuffer target) like
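CharBuffer cb = CharBuffer.allocate((int) xmlClob.length());
// Stop once the buffer is full: read() returns 0, not -1, for a full buffer.
while (cb.hasRemaining() && r.read(cb) != -1)
;
cb.flip(); // toString() only covers position..limit, so switch to reading first
return cb.toString();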
but it doesn't really make a huge difference (it's a bit nicer looking).

Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?

I am basically looking for a solution that allows me to stream the lines and replace them IN THE SAME FILE, a la Files.lines

Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?
Basically, no.
Any change to a file that involves changing the number of bytes between offsets A and B can only be done by rewriting the file, or creating a new one. In either case, everything after B has to be loaded / read into memory.
This is not a Java-specific restriction. It is a consequence of the way that modern operating systems represent files, and the low-level (i.e. syscall) APIs that they provide to applications.
In the specific case where you replace one line (or sequence of lines) with a line (or sequence of lines) of exactly the same length, then you can do the replacement using either RandomAccessFile, or by mapping the file into memory. Note that the latter approach won't cause the entire file to be read into memory.
It is also possible to replace or delete lines while updating the file "in place" (changing the file length ...). See @Sergio Montoro's answer for an example. However, with an in-place update, there is a risk that the file will be corrupted if the application is interrupted. And this does involve reading and rewriting all bytes in the file after the insertion / deletion point. And that entails loading them into memory, though not all at once.

There was a mechanism in Java 1: RandomAccessFile; but any such in-place mechanism requires that you know the start offset of the line, and that the new line is the same length as the old one.
Otherwise you have to copy the file up to that line, substitute the new line in the output, and then continue the copy.
You certainly don't have to load the entire file into memory.
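The copy-and-substitute approach can be sketched as follows. This is only an illustration: the names LineRewriter, rewrite and demo are invented, the temporary file is created next to the original so the final move stays on one file system, and line endings are normalized to "\n" on the way through.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.function.UnaryOperator;

final class LineRewriter {
    // Streams file line by line through transform into a temp file,
    // then moves the temp file over the original.
    static void rewrite(Path file, Charset cs, UnaryOperator<String> transform) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "rewrite", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(file, cs);
             BufferedWriter out = Files.newBufferedWriter(tmp, cs)) {
            String line;
            while ((line = in.readLine()) != null) { // one line in memory at a time
                out.write(transform.apply(line));
                out.write('\n');
            }
        } catch (IOException | RuntimeException e) {
            Files.deleteIfExists(tmp); // leave the original untouched on failure
            throw e;
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    // Demo helper: writes lines to a temp file, rewrites it, reads it back.
    static List<String> demo(List<String> lines, UnaryOperator<String> transform) {
        try {
            Path file = Files.createTempFile("rewrite-demo", ".txt");
            try {
                Files.write(file, lines, StandardCharsets.UTF_8);
                rewrite(file, StandardCharsets.UTF_8, transform);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the program dies halfway, the original file is still intact and only the temp file needs cleaning up.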

Yes.
A FileChannel allows random read/write to any position of a file. Therefore, if you have a read ahead buffer which is long enough you can replace lines even if the new line is longer than the former one.
The following example is a toy implementation which makes two assumptions: 1st) the input file is ISO-8859-1 Unix LF encoded and 2nd) each new line is never going to be longer than the next line (one line read ahead buffer).
Unless you definitely cannot create a temporary file, you should benchmark this approach against the more natural stream in -> stream out, because I do not know what performance a spinning drive will give you for an algorithm that constantly moves forward and backward in a file.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import static java.nio.file.StandardOpenOption.*;
import java.io.IOException;
public class ReplaceInFile {
public static void main(String args[]) throws IOException {
Path file = Paths.get(args[0]);
ByteBuffer writeBuffer;
long readPos = 0l;
long writePos;
String line_m;
String line_n;
String line_t;
FileChannel channel = FileChannel.open(file, READ, WRITE);
channel.position(0);
writePos = readPos;
line_m = readLine(channel);
do {
readPos += line_m.length() + 1;
channel.position(readPos);
line_n = readLine(channel);
line_t = transformLine(line_m)+"\n";
writeBuffer = ByteBuffer.allocate(line_t.length()+1);
writeBuffer.put(line_t.getBytes("ISO8859_1"));
System.out.print("replaced line "+line_m+" with "+line_t);
channel.position(writePos);
writeBuffer.flip(); // limit = bytes actually put, so no stray trailing zero byte is written
while (writeBuffer.hasRemaining()) {
channel.write(writeBuffer);
}
writePos += line_t.length();
line_m = line_n;
assert writePos <= readPos + line_n.length() + 1; // must not overwrite bytes that have not been read yet
} while (line_m.length() > 0);
channel.close();
System.out.println("Done!");
}
public static String transformLine(String input) throws IOException {
return input.replace("<", "&lt;").replace(">", "&gt;");
}
public static String readLine(FileChannel channel) throws IOException {
ByteBuffer readBuffer = ByteBuffer.allocate(1);
StringBuffer line = new StringBuffer();
do {
int read = channel.read(readBuffer);
if (read<1) break;
readBuffer.rewind();
char c = (char) (readBuffer.get() & 0xFF); // mask so bytes >= 0x80 map to ISO-8859-1 chars
readBuffer.rewind();
if (c=='\n') break;
line.append(c);
} while (true);
return line.toString();
}
}

Using something else instead of String

I have a big file and I want to do some "operations" on it (find some text, check if some text exists, get the offset of some text, maybe change the file).
My current approach is this:
public ResultSet getResultSet(String fileName) throws IOException {
InputStream in = new FileInputStream(fileName);
byte[] buffer = new byte[CAPACITY];
byte[] doubleBuffer = new byte[2 * CAPACITY];
long len = in.read(doubleBuffer);
while (true) {
String reconstitutedString = new String(doubleBuffer, 0 ,doubleBuffer.length);
//...do stuff
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write(doubleBuffer, CAPACITY, CAPACITY);
readUntilNow += len;
len = in.read(buffer);
if (len <= 0) {
break;
}
os.write(buffer, 0, CAPACITY);
doubleBuffer = os.toByteArray();
os.close();
}
in.close();
return makeResult();
}
I would like to replace the String reconstitutedString with something else. What would be the best alternative, considering that I want to query the content of that data, for example the kind of information I could get by calling indexOf on a String?

You may use StringBuffer or StringBuilder. These two classes are almost like the String class, with the advantage of mutability.
Moreover, you can easily convert them to a String whenever you require some functionality that only String provides. To convert them, just use the toString() method.
You may use some other data type as an alternative to String, depending on your situation. But in general StringBuffer and StringBuilder are the best alternatives to String. Use StringBuffer when you need synchronization and StringBuilder in all other cases.

The best type to do split or indexOf on is String. Just use it.

The most natural choice would be CharBuffer. Like String and StringBuilder it implements the CharSequence interface, therefore it can be used with a lot of text oriented APIs, most notably the regex engine which is the back-end for most search, split, and replacing operations.
What makes CharBuffer the natural choice is that it is also the type that is used by the charset package which provides the necessary operations for converting characters from and to bytes. By dealing with this API you can do the conversion directly from and to CharBuffers without additional data copying steps.
Note that Java’s regex API is prepared for processing buffers containing partially read files and can report whether reading more data might change the result (see hitEnd() and requireEnd()).
These are the necessary tools for building applications which can process large files in smaller chunks and without creating String instance out of it (or only when necessary, e.g. when extracting a matching subsequence).
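A rough sketch of how these pieces fit together is shown below. The names ChunkSearch, decode and firstCompleteMatch are made up for the example, and Charset.decode is used for brevity; a real chunked reader would keep a CharsetDecoder around so that a multi-byte sequence split across two chunks is carried over instead of being replaced.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkSearch {
    // Bytes of one chunk -> CharBuffer, without going through String.
    static CharBuffer decode(byte[] chunk, Charset cs) {
        return cs.decode(ByteBuffer.wrap(chunk));
    }

    // Returns the first match that cannot change when more data arrives,
    // or null if there is no match yet or the match touches the end of
    // the chunk (hitEnd() is true) and might grow with more input.
    static String firstCompleteMatch(CharBuffer chars, Pattern p) {
        Matcher m = p.matcher(chars); // CharBuffer is a CharSequence
        if (m.find() && !m.hitEnd()) {
            return m.group();         // a String is created only for the hit
        }
        return null;
    }
}
```

When null comes back, the caller would read the next chunk, append it to what is still buffered and search again.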
