This is on CentOS 6.2. I am writing to a text file, and it is adding "ETX M" to the beginning. (ETX is the name of the control character.)
file.setLength(0);
file.seek(0);
file.writeUTF(somestring);
To quote from the documentation for RandomAccessFile.writeUTF():
First, two bytes are written to the file, starting at the current file pointer, as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for each character.
If you don't want this, convert the string to bytes manually and write those bytes with the basic write() method (nb: writeBytes() is not what you want). However, you're going to need some way to keep track of the size of the string in order to read it again (unless you're using fixed-width fields).
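A minimal sketch of that idea, assuming a UTF-8 payload and a hypothetical four-byte length prefix (via writeInt) as the size-tracking mechanism; the class and method names are illustrative:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class ManualWrite {
    // Writes the string as raw UTF-8 bytes, prefixed by a 4-byte length,
    // instead of relying on writeUTF's 2-byte-length modified-UTF-8 format.
    static void writeString(RandomAccessFile file, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        file.writeInt(bytes.length); // length prefix (omit for fixed-width fields)
        file.write(bytes);           // the basic write(), not writeBytes()
    }

    static String readString(RandomAccessFile file) throws IOException {
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("demo", ".bin");
        tmp.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
            writeString(raf, "somestring");
            raf.seek(0);
            System.out.println(readString(raf)); // somestring
        }
    }
}
```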
Related
In java.io.RandomAccessFile, when the writeChar() method is used to write a char array to a text file in a loop, as in
RandomAccessFile txtfile = new RandomAccessFile("Hello.txt","rw");
char c[] = {'S','i','g','n','e','d'};
for(char k:c) txtfile.writeChar(k);
this gives a result in Hello.txt, when opened in plain Notepad, of
S i g n e d
but when opened with the text editor Notepad++, Hello.txt is shown as
[NUL]S[NUL]i[NUL]g[NUL]n[NUL]e[NUL]d
and when I used the writeUTF() method to write a String to Hello.txt, as in
txtfile.writeUTF("hello");
the result has a blank space in front, and when opened in Notepad++ it shows as
[ENQ]hello
How can I write or append a normal line to a file without extra characters (like [NUL] or [ENQ]) in this case?
Please post an answer showing how to write any String to a file with RandomAccessFile in Java!
Thanks in advance!
A char value in Java is two bytes. RandomAccessFile is meant to read and write binary data. It does not do any text transformations. When asked to write a character, it just writes the in-memory representation of that character to disk.
Look at the documentation for the RandomAccessFile class:
writeChar(int v)
Writes a char to the file as a two-byte value, high byte first.
writeByte(int v)
Writes a byte to the file as a one-byte value.
writeBytes(String s)
Writes the string to the file as a sequence of bytes.
So use writeByte instead of writeChar to write a character to the file as a single ASCII byte that all editors should deal with in the same way.
To write a String to the file as single-byte characters in one call, use the writeBytes method, which discards the high eight bits of each char and writes the remaining byte.
If you only want to write text to a file, it's better to use a FileWriter or OutputStreamWriter class to do so. These classes write text taking character encoding into account. The former assumes a default character encoding. The latter allows you to specify the character encoding you want the class to use to convert text to bytes before writing to the file.
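For instance, a small sketch using OutputStreamWriter with an explicit charset (the file name and content here are arbitrary):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TextWrite {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("hello", ".txt");
        tmp.deleteOnExit();
        // OutputStreamWriter converts chars to bytes using the charset you
        // pick, so editors see plain text with no stray NUL or length bytes.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(tmp), StandardCharsets.UTF_8)) {
            out.write("Signed");
            out.write(System.lineSeparator());
        }
        // Read it back with the same charset to verify.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(tmp), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine()); // Signed
        }
    }
}
```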
My Java program is trying to read a text file (a mainframe VSAM file converted to a flat file). I believe this means the file is encoded in EBCDIC.
I am using com.ibm.jzos.FileFactory.newBufferedReader(fullyQualifiedFileName, ZFile.DEFAULT_EBCDIC_CODE_PAGE); to open the file.
and use String inputLine = inputFileReader.readLine() to read a line and store it in a Java String variable for processing. I have read that text, when stored in a String variable, becomes Unicode.
How can I ensure that the content is not corrupted when storing in the java string variable?
A CharsetDecoder will map the bytes to the correct Unicode characters for the String, and vice versa.
The only problem is that BufferedReader.readLine drops the line endings (including the EBCDIC end-of-line NEL character, \u0085, which is also a recognized Unicode newline). So on writing, write the NEL yourself, or set the system line separator property.
Nothing is easier than writing a unit test with 256 EBCDIC characters and converting them back and forth.
If you have read the file with the correct character set (which is the biggest assumption here), then it doesn't matter that Java itself uses Unicode internally, Unicode contains all characters of EBCDIC.
A character set specifies the mapping between a character (codepoint) and one or more bytes. A file is nothing more than a stream of bytes, if you apply the right character set, then the right characters are mapped in memory.
Say byte 1 maps to 'A' in character set X, and to bytes 0 and 65 in UTF-16. Then reading a file that contains byte 1 using character set X will make the system read the character 'A', even if that system in memory uses bytes 0 and 65 to store that character.
However, there is no way to know whether you used the right character set, unless you specifically know what the actual result should be.
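The round-trip unit test mentioned above might look like the following sketch; "IBM037" is assumed here as the code page (and assumed to be present in the JRE), so substitute whichever EBCDIC code page your file actually uses:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class EbcdicRoundTrip {
    public static void main(String[] args) {
        // "IBM037" is one common EBCDIC code page; substitute your own.
        Charset ebcdic = Charset.forName("IBM037");

        // All 256 possible byte values.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;

        String decoded = new String(all, ebcdic);    // bytes -> Unicode chars
        byte[] reencoded = decoded.getBytes(ebcdic); // Unicode chars -> bytes

        // A single-byte EBCDIC code page should round-trip losslessly.
        System.out.println(Arrays.equals(all, reencoded));
    }
}
```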
I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes.
An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library; second best would be an Apache or Google library.
The solution must be scalable: reading the entire file into memory is not a solution.
Returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader from an InputStreamReader wrapping a FileInputStream (FileReader itself exposes no channel), and keep the FileInputStream accessible to your code, you should be able to get the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine().
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance gains for positional precision.
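A sketch of that approach, with the caveat quoted earlier: the InputStreamReader decoder's own read-ahead means the channel position can overshoot the characters actually consumed, so treat the number as approximate:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class LineOffsets {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("lines", ".txt");
        tmp.deleteOnExit();
        try (Writer w = new FileWriter(tmp)) {
            w.write("first\nsecond\n"); // "second" starts at byte offset 6
        }
        FileInputStream fis = new FileInputStream(tmp);
        // Buffer size 1 limits BufferedReader's own read-ahead; the
        // decoder inside InputStreamReader may still read ahead.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fis, StandardCharsets.UTF_8), 1);
        reader.readLine(); // consume "first"
        // At least 6; may overshoot up to the file length due to read-ahead.
        System.out.println(fis.getChannel().position());
        reader.close();
    }
}
```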
Alternate Solution
What would be wrong with keeping track of the bytes yourself?
long startingPoint = 0; // or starting position if this file has been previously processed
while (readingLines) {
    String line = bufferedReader.readLine();
    startingPoint += line.getBytes().length;
}
This would give you a byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since readLine strips them (and note that getBytes() without an argument uses the platform default charset).
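A self-contained sketch of that tally, assuming UTF-8 content and single-byte '\n' line endings (both assumptions matter, as noted):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ByteTally {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("tally", ".txt");
        tmp.deleteOnExit();
        try (Writer w = new FileWriter(tmp)) {
            w.write("alpha\nbeta\n"); // 11 bytes on disk
        }
        long startingPoint = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(tmp))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine strips the terminator, so add its length back;
                // assumes '\n' endings and UTF-8 on disk.
                startingPoint += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
        }
        System.out.println(startingPoint); // 11
    }
}
```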
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String
readNextLineAsUTF8( RandomAccessFile in ) throws IOException {
    String rv = null;
    // readLine() maps each raw byte to the low eight bits of a char,
    // so decoding with ISO-8859-1 recovers the original bytes exactly
    // (the no-argument getBytes() would use the platform default charset).
    String rawLine = in.readLine();
    if ( null != rawLine ) {
        rv = new String( rawLine.getBytes( StandardCharsets.ISO_8859_1 ),
                         StandardCharsets.UTF_8 );
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );
The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML XimpleWare implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
I would suggest java.io.LineNumberReader. You can get and set the line number and therefore continue at a certain line index (note, though, that setLineNumber only changes the counter; it does not reposition the stream).
Since it is a BufferedReader, it is also capable of handling UTF-8.
Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char, not a byte, so you do not have to worry about character width.
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
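As a sketch of the loop idea, here is a byte-oriented variant using read() (which returns -1 at end of file instead of throwing) that stops exactly at the EOL byte, so the file pointer is never past the start of the next line; UTF-8 content is assumed when decoding, and the names are illustrative:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ByteLineReader {
    // Reads one line's raw bytes, stopping at '\n' (dropping any '\r');
    // returns null at end of file. The file position is left exactly at
    // the start of the next line -- no read-ahead.
    static String readLine(RandomAccessFile in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b = in.read();
        if (b == -1) return null;
        while (b != -1 && b != '\n') {
            if (b != '\r') buf.write(b);
            b = in.read();
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("eol", ".txt");
        tmp.deleteOnExit();
        try (Writer w = new FileWriter(tmp)) { w.write("one\ntwo\n"); }
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
            System.out.println(readLine(raf));        // one
            System.out.println(raf.getFilePointer()); // 4: start of "two"
        }
    }
}
```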
By using a RandomAccessFile and not a Reader you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other way would be to use a BoundedInputStream.
There is one in this question Java: reading strings from a random access file with buffered input
E.g. https://stackoverflow.com/a/4305478/16549
RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
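Combined with getFilePointer(), that allows remembering where a line starts and jumping back to it later; a minimal sketch:

```java
import java.io.*;

public class SeekDemo {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("seek", ".txt");
        tmp.deleteOnExit();
        try (Writer w = new FileWriter(tmp)) { w.write("hello\nworld\n"); }
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
            raf.readLine();                      // skip "hello"
            long pos = raf.getFilePointer();     // remember start of next line
            raf.readLine();                      // read past "world"...
            raf.seek(pos);                       // ...and jump back to it
            System.out.println(raf.readLine());  // world
        }
    }
}
```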
Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't succeed in converting the byte array (taken from RandomAccessFile.readLine) to a correct string when the file line contains non-Latin characters.
So I reworked the approach, writing a function similar to RandomAccessFile.readLine itself that collects the line's data not into a string but directly into a byte array, and then constructs the desired String from that byte array.
The following code completely satisfied my needs (it is in Kotlin).
After calling the function, file.channel.position() will return the exact position of the next line (if any):
fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true // \n
            13 -> { // \r
                eol = true
                val cur = filePointer
                if (read() != '\n'.toInt()) {
                    seek(cur)
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        String(lineBytes.toByteArray(), charset)
}
I have an array of strings that I need to save into a txt file. I'm only allowed to make files of at most 64 KB, so I need to know when to stop putting strings into the file.
Is there some method by which, given an array of strings, I can find out how big the file will be without creating it?
Is the file going to be ASCII-encoded? If so, every character you write will be 1 byte. Add up the string lengths as you go, and if the total number of characters exceeds 64 KB, you know to stop. Don't forget to include the newlines between strings, if you're adding them.
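A sketch of that tally, assuming ASCII content and one '\n' byte appended per string (countThatFit and LIMIT are illustrative names):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SizeBudget {
    static final int LIMIT = 64 * 1024; // 64 KB

    // Returns how many strings fit in the file, counting one byte per
    // ASCII character plus one byte for each '\n' terminator.
    static int countThatFit(String[] strings) {
        int total = 0, fit = 0;
        for (String s : strings) {
            int size = s.getBytes(StandardCharsets.US_ASCII).length + 1; // +1 for '\n'
            if (total + size > LIMIT) break;
            total += size;
            fit++;
        }
        return fit;
    }

    public static void main(String[] args) {
        String[] lines = new String[1000];
        Arrays.fill(lines, "x".repeat(99)); // 99 chars + '\n' = 100 bytes each
        System.out.println(countThatFit(lines)); // 655: 65500 <= 65536 < 65600
    }
}
```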
Java ships with an I/O library named NIO. If you do not know how to use NIO, look at the following links to learn more:
http://en.wikipedia.org/wiki/New_I/O
https://blogs.oracle.com/slc/entry/javanio_vs_javaio
http://docs.oracle.com/javase/tutorial/essential/io/fileio.html
We all know that all data types are just bytes in the end. Characters are no different, with a little more detail: the characters (letters, numbers, symbols and so on) of the world are mapped to a table named Unicode, and through a character encoding algorithm each character becomes a certain number of bytes when you save text to a file. As I could spend hours talking about this, I suggest you take a look at the following links to understand more about character encoding:
http://www.w3schools.com/tags/ref_charactersets.asp
https://stackoverflow.com/questions/3049090/character-sets-explained-for-dummies
https://www.w3.org/International/questions/qa-what-is-encoding.en
http://unicode-table.com/en/
http://en.wikipedia.org/wiki/Character_encoding
By using Charset, CharsetEncoder and CharsetDecoder, you can choose a specific character encoding to save your text; depending on it, the final size of your file may vary. With UTF-8 (the 8 here means bits), ASCII characters are saved with 1 byte each, while other characters take 2 to 4 bytes. With UTF-16 (the 16 here means bits), most characters are saved with 2 bytes each (supplementary characters take 4). So the encoding you use determines how many bytes each saved character occupies. On the following link you can find the encodings supported by the current Java API:
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
With the NIO library, you do not need to actually save a file to know its size. If you use a Charset encoder with a ByteBuffer, you can know the final size of your file without even saving it.
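For example, Charset.encode returns the encoded bytes in a ByteBuffer, so the byte count is known without touching the disk:

```java
import java.nio.charset.StandardCharsets;

public class EncodedSize {
    public static void main(String[] args) {
        String text = "h\u00e9llo"; // 'é': 2 bytes in UTF-8, 1 UTF-16 code unit
        // Charset.encode returns a ByteBuffer; remaining() is the byte
        // count the file would occupy, without writing anything to disk.
        int utf8Bytes  = StandardCharsets.UTF_8.encode(text).remaining();
        int utf16Bytes = StandardCharsets.UTF_16BE.encode(text).remaining();
        System.out.println(utf8Bytes);  // 6
        System.out.println(utf16Bytes); // 10
    }
}
```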
Any questions, please comment.
I have a database dump file I need to operate on raw. I need to read the file in, operating on it line by line, but I can't have the whole file in memory (the files can theoretically be 10 GB+).
I want to be able to read it and operate on each line individually as I go, until the end of the file. It has to be friendly to odd characters (lines can contain all sorts of bytes).
You could adapt the old NIO example grep and remove the pattern match if you don't need it.
If the exact line-break bytes don't interest you, you can use BufferedReader#readLine() and convert the string back to a byte[].
The other way is to use a byte[] as a buffer (it has to be large enough for a line) and use InputStream#read(byte[]) to fill it with bytes. Then you can search the buffer for linefeeds and work with part of the buffer. Once you find no more linefeeds, move the remaining data to the left via System#arraycopy() and fill the rest with new data through InputStream#read(byte[], int, int), and go on.
But be careful! Depending on the encoding (e.g. UTF-16), one byte doesn't have to be one character.
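A sketch of that buffer-scanning loop, working purely on bytes so arbitrary content survives (forEachLine is an illustrative name; it assumes no line is longer than the buffer):

```java
import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedLines {
    // Streams input in fixed-size chunks, handing each '\n'-terminated
    // slice of raw bytes to a callback; memory use stays bounded by the
    // buffer size. Assumes no line is longer than the buffer.
    static void forEachLine(InputStream in, Consumer<byte[]> handler)
            throws IOException {
        byte[] buf = new byte[8192];
        int len = 0;
        int n;
        while ((n = in.read(buf, len, buf.length - len)) != -1) {
            len += n;
            int start = 0;
            for (int i = 0; i < len; i++) {
                if (buf[i] == '\n') {
                    handler.accept(Arrays.copyOfRange(buf, start, i));
                    start = i + 1;
                }
            }
            // Shift the unfinished tail to the front and refill.
            System.arraycopy(buf, start, buf, 0, len - start);
            len -= start;
        }
        if (len > 0) handler.accept(Arrays.copyOfRange(buf, 0, len)); // last line, no '\n'
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "one\ntwo\nthree".getBytes();
        List<String> lines = new ArrayList<>();
        forEachLine(new ByteArrayInputStream(data), b -> lines.add(new String(b)));
        System.out.println(lines); // [one, two, three]
    }
}
```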