Testing for unseen characters in Java

I'm writing a program in Java that tests the validity of several FTP commands. These commands must end in a carriage return and line feed (the sequence "\r\n"). I'm using a BufferedReader to read in lines, but I cannot come up with a way to check whether a line ends in this sequence. Any ideas?

Do not use BufferedReader; its level of abstraction is too high for this task. Use a plain InputStream and read into a byte array. An InputStream delivers the bytes exactly as they are; you can inspect them and produce strings yourself later using new String(array, offset, length). There may also be other invalid characters, such as 0x0C, in the input.
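For example, here is a minimal sketch of that idea (the class and method names are only illustrative): read the stream byte by byte and reject any bare CR or LF that is not part of the exact "\r\n" pair.

import java.io.IOException;
import java.io.InputStream;

public class CrLfChecker {

    // Returns true only if every line terminator in the stream is the exact sequence \r\n.
    public static boolean allLinesEndWithCrLf(InputStream in) throws IOException {
        int prev = -1;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n' && prev != '\r') {
                return false;   // bare LF without a preceding CR
            }
            if (prev == '\r' && b != '\n') {
                return false;   // bare CR not followed by LF
            }
            prev = b;
        }
        return prev != '\r';    // a trailing lone CR is also invalid
    }
}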

Related

Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would handle multiple character encodings, including UTF-8, in which different characters may be represented by different numbers of bytes.
An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library; second best would be an Apache or Google library.
The solution must be scalable. Reading the entire file into memory is not a solution, and returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader from an InputStreamReader wrapped around a FileInputStream, and keep the FileInputStream accessible to your code, you should be able to get the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine().
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance for positional precision.
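Here is a minimal sketch of that approach (the file name is illustrative). Keep in mind that both BufferedReader and InputStreamReader may read ahead, so the channel position is only an upper bound on where the next line actually starts:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ChannelPositionExample {
    public static void main(String[] args) throws IOException {
        try (FileInputStream fis = new FileInputStream("data.txt");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fis, StandardCharsets.UTF_8), 1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                long pos = fis.getChannel().position();   // bytes consumed so far
                System.out.println(pos + ": " + line);
            }
        }
    }
}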
Alternate Solution
What would be wrong with keeping track of the bytes yourself:
long startingPoint = 0; // or the position where earlier processing stopped
String line;
while ((line = bufferedReader.readLine()) != null) {
    startingPoint += line.getBytes(StandardCharsets.UTF_8).length;
    startingPoint += 2; // the stripped line terminator ("\r\n" here; use 1 for "\n")
}
This would give you a byte count accurate to what you've already processed, regardless of underlying marking or buffering. You have to account for line endings in your tally, as in the snippet above, since readLine() strips them.
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8( RandomAccessFile in ) throws IOException {
    String rv = null;
    String lineBytes = in.readLine();
    if ( null != lineBytes ) {
        // readLine() put each raw byte into the low 8 bits of a char, so recover
        // the original bytes with ISO-8859-1 before decoding them as UTF-8.
        rv = new String( lineBytes.getBytes( StandardCharsets.ISO_8859_1 ),
                StandardCharsets.UTF_8 );
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );
The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML (XimpleWare) implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides code that maintains a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
I would suggest java.io.LineNumberReader. You can get and set the line number and therefore continue at a certain line index.
Since it extends BufferedReader, it is also capable of handling UTF-8.
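A minimal sketch of how that could look (the file name and method are illustrative). Note that setLineNumber(int) only changes the counter value; to actually resume at a stored line index you still have to skip the earlier lines by reading them:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;

public class LineNumberResume {

    // Returns the first line after 'linesAlreadyRead' lines, or null if the file is shorter.
    static String readLineAfter(String file, int linesAlreadyRead) throws IOException {
        try (LineNumberReader r = new LineNumberReader(new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8))) {
            while (r.getLineNumber() < linesAlreadyRead) {
                if (r.readLine() == null) {
                    return null;   // file is shorter than expected
                }
            }
            return r.readLine();   // first line not yet processed
        }
    }
}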
Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char not a byte. So you do not have to worry about character width.
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
By using a RandomAccessFile and not a Reader you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other is to use a BoundedInputStream.
There is one in this question: Java: reading strings from a random access file with buffered input.
E.g. https://stackoverflow.com/a/4305478/16549
RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
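Here is a minimal sketch of Solution A (the method name is illustrative): read raw bytes with RandomAccessFile until an EOL character, so the file pointer never moves past the line just read and getFilePointer()/seek() stay meaningful. Decoding the bytes as UTF-8 is an assumption about the file's encoding:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RafLines {

    static String readRawLine(RandomAccessFile in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                break;                          // LF terminates the line
            }
            if (b == '\r') {                    // CR: consume a following LF, if any
                long mark = in.getFilePointer();
                if (in.read() != '\n') {
                    in.seek(mark);              // lone CR: rewind the extra byte
                }
                break;
            }
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) {
            return null;                        // end of file
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}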
Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't succeed in converting the characters returned by RandomAccessFile.readLine back to a correct string when the line contained non-Latin characters.
So I reworked the approach and wrote a function, similar to RandomAccessFile.readLine itself, that collects the line's data directly into a byte array rather than a String, and then constructs the desired String from that byte array.
The code below (in Kotlin) completely satisfied my needs.
After calling the function, file.channel.position() will return the exact position of the next line (if any):
import java.io.ByteArrayOutputStream
import java.io.RandomAccessFile
import java.nio.charset.Charset

fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true            // EOF or '\n'
            13 -> {                         // '\r'
                eol = true
                val cur = filePointer
                if (read() != '\n'.toInt()) {
                    seek(cur)               // lone CR: rewind the extra byte
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        String(lineBytes.toByteArray(), charset)
}

BufferedReader and InputStreamReader in Java

I recently started with Java and want to understand a Java module of a large app. I came across this line of Java code:
String line = (new BufferedReader(new InputStreamReader(System.in))).readLine();
What does this Java code do? Is there a C/C++ equivalent of this?
System.in is the standard input.
InputStreamReader allows you to associate a stream that reads from the specified input (in this case the standard input), so now we have a stream.
BufferedReader is an "abstraction" to help you work with streams. For example, it implements readLine(): instead of your reading character by character until you find a '\n' to get the whole line, it just returns the line as a String after this process.
So this line means: "Read a line from standard input and store it in line variable".
> What does this java code do:
String line is your string object
new BufferedReader(...).readLine() is an instance of BufferedReader, which reads text from a character input stream; readLine() is a method it implements that reads up to a newline character.
new InputStreamReader(...) gives you an instance of an InputStreamReader, which is the "bridge" between the standard-input byte stream and the character stream that a BufferedReader wants.
System.in is the standard input (byte stream)
> Is there a C/C++ equivalent of this
Well... there's no language called C/C++... ;)
So I'll assume you wanted an answer for each of them.
In C, there are no "strings"; you have to use a character array. You can read data into a character array from stdin with something like:
char input[100];
...
scanf("%99[^\n]", input);
or
fgets(input, 100, stdin);
In C++, you'd use:
using namespace std;
string line;
getline(cin, line);
Your snippet uses a BufferedReader, chained to an InputStreamReader, to read a line from the standard input console and store it in the String line.
BufferedReader
Read text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders.
BufferedReader#readLine()
Read a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
InputStreamReader
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream. To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
System
The System class contains several useful class fields and methods. It cannot be instantiated.
Among the facilities provided by the System class are standard input, standard output, and error output streams; access to externally defined "properties"; a means of loading files and libraries; and a utility method for quickly copying a portion of an array.
System.in
The "standard" input stream. This stream is already open and ready to supply input data. Typically this stream corresponds to keyboard input or another input source specified by the host environment or user.
What the code does is simply read a line from the input stream. From a design-pattern point of view, this is a decorator. Using a BufferedReader is aimed at improving I/O performance.
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream. To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
For top efficiency, consider wrapping an InputStreamReader within a BufferedReader. For example:
BufferedReader in
= new BufferedReader(new InputStreamReader(System.in));

Periodically incrementally reading text file as byte array ignoring partially appended lines

I am trying to read into byte array a text file which is appended by another process. I'd like to "poll" the file periodically to extract only new "full" lines of text (the lines ended with a new line). What's the best way to do this in Java 6 using standard libraries?
I am not interested in storing and creating Strings, so probably all the "readLine()" methods from readers are not the ones I should look into. I am thinking of using RandomAccessFile, but I am wondering how to truncate the read byte array so it would end at the last new line character.
You can have one thread which polls the file's length. When the length increases, read that length as a byte[] (and no more, so don't use BufferedInputStream) and you will be able to continue reading the file later. Take the byte[] and write it to a PipedOutputStream.
In your main thread you can use BufferedReader + InputStreamReader + PipedInputStream.
With that, a plain readLine() will do fine.
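Here is a rough sketch of that two-thread idea (the file name, charset, and poll interval are illustrative assumptions). The polling thread copies only newly appended bytes into a pipe; the consuming thread's readLine() then blocks until a complete, newline-terminated line is available:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.RandomAccessFile;

public class TailPoller {
    public static void main(String[] args) throws Exception {
        final PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut);

        Thread poller = new Thread(new Runnable() {
            public void run() {
                long readSoFar = 0;
                try {
                    RandomAccessFile raf = new RandomAccessFile("appended.log", "r");
                    while (true) {
                        long len = raf.length();
                        if (len > readSoFar) {
                            byte[] chunk = new byte[(int) (len - readSoFar)];
                            raf.seek(readSoFar);
                            raf.readFully(chunk);
                            pipeOut.write(chunk);   // hand the new bytes to the consumer
                            pipeOut.flush();
                            readSoFar = len;
                        }
                        Thread.sleep(500);          // poll interval
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        poller.setDaemon(true);
        poller.start();

        // readLine() returns only complete lines, since the pipe is never closed.
        BufferedReader reader = new BufferedReader(new InputStreamReader(pipeIn, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println("new line: " + line);
        }
    }
}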

Most elegant way to read a file and operate on lines as bytes

I have a database dump file I need to operate on rawly. I need to read the file in, operating on it line by line, but I can't have the whole file in memory (they can be 10gb + theoretically).
I want to be able to read it and operate on each line individually as I go, until the end of the file. It has to be weird character friendly (can have all sorts of bytes in them).
You could adapt the old NIO example Grep and remove the pattern matching if you don't need it.
If the line breaks themselves don't interest you, you can use BufferedReader#readLine() and convert the string back to a byte[].
The other way is to use a byte[] as a buffer (it has to be large enough for a line) and InputStream#read(byte[]) to fill it with bytes. Then you can search the buffer for linefeeds and work with that part of the buffer. Once you find no more linefeeds, move the remaining data to the left via System#arraycopy(), fill the rest with new data through InputStream#read(byte[], int, int), and go on.
But be careful: depending on the encoding (e.g. Unicode), one byte does not have to be one character.
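Here is a rough sketch of that buffer-scanning approach (the buffer size and the LineHandler callback are illustrative, not from any library). Lines are handed to the caller as raw byte slices, so no String is ever created; the buffer must be at least as large as the longest line, and any trailing '\r' is left in the slice:

import java.io.IOException;
import java.io.InputStream;

public class ByteLineScanner {

    interface LineHandler {
        void processLine(byte[] buf, int offset, int length);
    }

    static void scan(InputStream in, LineHandler handler) throws IOException {
        byte[] buf = new byte[64 * 1024];   // must hold the longest possible line
        int filled = 0;
        int n;
        while ((n = in.read(buf, filled, buf.length - filled)) != -1) {
            filled += n;
            int lineStart = 0;
            for (int i = 0; i < filled; i++) {
                if (buf[i] == '\n') {
                    handler.processLine(buf, lineStart, i - lineStart);
                    lineStart = i + 1;
                }
            }
            if (lineStart == 0 && filled == buf.length) {
                throw new IOException("line longer than buffer");
            }
            // Shift the unfinished tail to the front and refill behind it.
            System.arraycopy(buf, lineStart, buf, 0, filled - lineStart);
            filled -= lineStart;
        }
        if (filled > 0) {
            handler.processLine(buf, 0, filled);   // last line without a trailing \n
        }
    }
}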

readLine() in Java does not handle Chinese characters properly

I have a text file with Chinese words written to a line. The line is surrounded with "\r\n", and written using fileOutputStream.write(string.getBytes()).
I have no problems reading lines of English words, my buffered reader parses it with readLine() perfectly. However, it recognizes the Chinese sentence as multiple lines, thus screwing up my programme flow.
Any solutions?
Using string.getBytes() encodes the String using the platform default encoding. That is rarely what you want, especially when you're trying to write characters that are not native to your current locale.
Specify the encoding instead (using string.getBytes("UTF-8"), for example).
A cleaner and more Java-esque way would be to wrap your OutputStream in an OutputStreamWriter like this:
Writer w = new OutputStreamWriter(out, "UTF-8");
Then you can simply call writer.write(string) and don't need to repeat the encoding each time you want to write a String.
And, as commented below, specify the same encoding when reading the file (using a Reader, preferably).
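Here is a minimal sketch of the round trip with an explicit charset on both sides (the file name is illustrative). Because writer and reader agree on UTF-8, readLine() sees the Chinese text as a single line:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ChineseRoundTrip {
    public static void main(String[] args) throws IOException {
        String text = "你好，世界";

        // Write with an explicit encoding instead of string.getBytes().
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("chinese.txt"), StandardCharsets.UTF_8)) {
            w.write(text + "\r\n");
        }

        // Read back with the same encoding.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream("chinese.txt"), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine());   // prints the whole sentence on one line
        }
    }
}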
If you're outputting the text via fileOutputStream.write(string.getBytes()), you're outputting with the default encoding for the platform. It's important to ensure you're then reading with the appropriate encoding, and using methods that are encoding-aware. The problem won't be in your BufferedReader instance, but whatever Reader you have under it that's converting bytes into characters.
This article may be of use: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
