Is there a way to perform binary search on a file stored in UTF format in sorted order? I am able to perform binary search on a plain text file using RandomAccessFile: first I find the length of the file, then I jump to the middle position using seek and read the bytes there. However, this does not seem feasible for a file stored in UTF format, because a seek can land at an arbitrary byte in the middle of a multi-byte character. Also, with DataInputStream I am unable to jump to a particular position in the file. Is it possible to do binary search on such a file? If yes, using which classes?
Yes, it is possible. If you jump into the middle of a file, you will first need to go to the nearest record separator and then use the text starting after the record separator.
Depending on the exact file format you have, a line feed, a TAB character or something similar could be used as the record separator.
Locating the record separator is easy if it is a character with a Unicode number below 32 (which NL, CR and TAB all are): then you don't need to care about the multibyte UTF-8 encoding when locating the separator, since bytes below 128 never occur inside a multi-byte UTF-8 sequence. If it's a wide-character Unicode format (UTF-16 or UTF-32), it isn't much more difficult either.
DataInputStream is the wrong class for random access. (Streaming is more or less the opposite of random access.) Have a look at RandomAccessFile instead.
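A sketch of that approach with RandomAccessFile: seek to a byte offset, discard the partial line you landed in, and compare the next complete line. All class and method names here are mine, '\n' is assumed as the record separator, and the file is assumed to be sorted consistently with String.compareTo.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class Utf8BinarySearch {

    // Seek to pos, skip the (possibly partial) record there, and return the
    // next complete line decoded as UTF-8, or null at end of file. Safe for
    // UTF-8 because '\n' (0x0A) never occurs inside a multi-byte sequence.
    static String lineAfter(RandomAccessFile raf, long pos) throws IOException {
        raf.seek(pos);
        int b = '\n';
        if (pos > 0) {                        // mid-file: discard partial line
            while ((b = raf.read()) != -1 && b != '\n') { }
            if (b == -1) return null;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while ((b = raf.read()) != -1 && b != '\n') buf.write(b);
        if (b == -1 && buf.size() == 0) return null;
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Binary search over byte offsets: halve the offset range and compare
    // the first complete line after the midpoint with the key.
    static boolean contains(RandomAccessFile raf, String key) throws IOException {
        long lo = 0, hi = raf.length();
        while (lo < hi) {
            long mid = (lo + hi) / 2;
            String line = lineAfter(raf, mid);
            if (line != null && line.equals(key)) return true;
            if (line == null || line.compareTo(key) > 0) hi = mid;
            else lo = mid + 1;
        }
        return false;
    }
}
```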
Related
I am trying to implement Huffman tree compression. Roughly, it works by giving codes shorter than 8 bits to the most common characters in a text document and longer codes to the less common characters. Then a binary tree is encoded that lets you navigate down, with 1s telling you to go left and 0s telling you to go right, which leads you to the characters.
So obviously there are chunks that aren't 8 bits long. I have been padding them as needed with 0s at the end and converting them to characters. However, I just found that Java can write up to 3 bytes per character. Because this is about compression, I obviously want one byte.
The problem is that I don't know what bytes are going to end up being written. Three different codes shorter than 8 bits might get mushed together. I need to be able to write any code to the text file. There are invalid byte sequences, however, and so my entire approach is all gummed up.
Is there any way that I can let any byte sequence be valid in a certain section of the file and just store it as it literally is, without worrying about a character ending the file prematurely or causing other mischief? I am coding on a Mac, so that is a problem, unlike on Windows, where they just have the length of the file at the beginning so that they don't need an end-of-file character.
If there is no direct solution here, then perhaps I could make my own encoding that would not end the file early and nest that inside a more common one?
This looks like a good use case for Java's Bitset: https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html
When writing the data out to a file, output the number of bits that were encoded, followed by the serialized stream of bits.
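A minimal sketch of that idea, assuming the Huffman codes are already available as strings of '0'/'1' (class and method names are mine). Note that BitSet.toByteArray drops trailing zero bytes, which is exactly why the bit count must be written first.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.BitSet;

public class BitWriter {

    // Append each code (a String of '0'/'1' characters) to a BitSet, then
    // serialize: first the total number of valid bits, then the packed bytes.
    static byte[] pack(String... codes) throws IOException {
        BitSet bits = new BitSet();
        int n = 0;
        for (String code : codes)
            for (char c : code.toCharArray())
                bits.set(n++, c == '1');
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(n);                 // number of valid bits
        data.write(bits.toByteArray());   // bit k lives in byte k/8, bit k%8
        return out.toByteArray();
    }
}
```

The reader does the inverse: read the int, read the remaining bytes into BitSet.valueOf, and stop after exactly that many bits.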
My Java program is trying to read a text file (a mainframe VSAM file converted to a flat file). I believe this means the file is encoded in EBCDIC format.
I am using com.ibm.jzos.FileFactory.newBufferedReader(fullyQualifiedFileName, ZFile.DEFAULT_EBCDIC_CODE_PAGE); to open the file,
and String inputLine = inputFileReader.readLine() to read a line and store it in a Java String variable for processing. I have read that text stored in a String variable becomes Unicode.
How can I ensure that the content is not corrupted when storing in the java string variable?
The CharsetDecoder will map the bytes to their correct Unicode characters for the String, and vice versa.
The only problem is that BufferedReader.readLine drops the line endings, including the EBCDIC end-of-line NEL character (\u0085, also a recognized Unicode newline). So when writing, write the NEL yourself, or set the line.separator system property.
Nothing is easier than writing a unit test with all 256 EBCDIC byte values and converting them back and forth.
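Such a round-trip test might look like this, assuming the JVM ships the Cp1047 EBCDIC charset (part of the jdk.charsets module in OpenJDK); the class and method names are mine:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class EbcdicRoundTrip {

    // Decode all 256 byte values with the given single-byte charset and
    // encode them back; for a bijective charset like Cp1047 the bytes
    // must come back unchanged.
    static boolean roundTrips(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;
        String decoded = new String(all, cs);
        byte[] back = decoded.getBytes(cs);
        return Arrays.equals(all, back);
    }
}
```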
If you have read the file with the correct character set (which is the biggest assumption here), then it doesn't matter that Java itself uses Unicode internally; Unicode contains all characters of EBCDIC.
A character set specifies the mapping between a character (codepoint) and one or more bytes. A file is nothing more than a stream of bytes; if you apply the right character set, the right characters are mapped in memory.
Say byte 1 maps to 'A' in character set X, while UTF-16 represents 'A' as the bytes 0 and 65. Then reading a file that contains byte 1 using character set X will make the system read the character 'A', even if that system uses the bytes 0 and 65 in memory to store that character.
However, there is no way to know whether you used the right character set, unless you specifically know what the actual result should be.
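To illustrate: the same byte decodes to different characters under different charsets. Byte 0xC1 is 'A' in EBCDIC (Cp1047) but 'Á' in ISO-8859-1 (availability of Cp1047 depends on the JRE; the class name is mine).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Decode a single byte value with the given charset.
    static String decode(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs);
    }

    public static void main(String[] args) {
        System.out.println(decode(0xC1, Charset.forName("Cp1047")));     // A
        System.out.println(decode(0xC1, StandardCharsets.ISO_8859_1));   // Á
    }
}
```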
I am trying to divide .txt files into an ArrayList of strings, and so far it works, but the first word in the file always starts with the character (int) 65279, and I can't even copy that character here. Also, in the GUI it looks like the second letter of the word is missing, while at the same time it works in the console. Other words are as they should be. I am using UTF-8 format .txt files. How can I change the format in NetBeans and the GUI made in this IDE?
U+FEFF is the byte order mark (65279 in decimal). It's used to indicate the character encoding/endianness (so you can easily tell the difference between big- and little-endian UTF-16, for example).
If it's causing you a problem, the simplest thing is just to strip it:
if (text.startsWith("\ufeff")) {
    text = text.substring(1);
}
I have an array of strings that I need to save into a .txt file. I'm only allowed to make files of at most 64 KB, so I need to know when to stop putting strings into the file.
Is there some method that, given an array of strings, can find out how big the file will be without creating it?
Is the file going to be ASCII-encoded? If so, every character you write will be 1 byte. Add up the string lengths as you go, and if the total number of characters exceeds 64 K, you know to stop. Don't forget to count the newlines between strings, if you're writing them.
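A sketch of that running total, measuring the actual encoded byte length so it stays correct even if the file is not pure ASCII (class, constant and method names are mine; a '\n' after each line is assumed):

```java
import java.nio.charset.StandardCharsets;

public class SizeBudget {
    static final int LIMIT = 64 * 1024;   // 64 KB cap

    // Count how many strings fit within LIMIT bytes, using the real
    // UTF-8 encoded length of each string plus one byte for its '\n'.
    static int countThatFit(String[] lines) {
        long total = 0;
        int count = 0;
        for (String line : lines) {
            long next = total + line.getBytes(StandardCharsets.UTF_8).length + 1;
            if (next > LIMIT) break;
            total = next;
            count++;
        }
        return count;
    }
}
```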
Java ships with an input/output library named NIO. If you do not know how to use NIO, look at the following links to learn more:
http://en.wikipedia.org/wiki/New_I/O
https://blogs.oracle.com/slc/entry/javanio_vs_javaio
http://docs.oracle.com/javase/tutorial/essential/io/fileio.html
We all know that all data types are just bytes in the end. With characters it is the same, with a little more detail. The characters (letters, numbers, symbols and so on) of the world are mapped to a table named Unicode, and by using a character encoding algorithm you get a certain number of bytes when you save text to a file. As I could spend hours talking about this, I suggest you take a look at the following links to understand more about character encoding:
http://www.w3schools.com/tags/ref_charactersets.asp
https://stackoverflow.com/questions/3049090/character-sets-explained-for-dummies
https://www.w3.org/International/questions/qa-what-is-encoding.en
http://unicode-table.com/en/
http://en.wikipedia.org/wiki/Character_encoding
By using Charset, CharsetEncoder and CharsetDecoder, you can choose a specific character encoding to save your text, and depending on it, the final size of your file may vary. With UTF-8 (the 8 here means bits), each character is saved with 1 to 4 bytes (1 byte for ASCII characters). With UTF-16 (the 16 here means bits), each character is saved with 2 or 4 bytes. This means that the encoding you use determines the number of bytes for each character saved. On the following link you can find the encodings supported by the current Java API:
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
With the NIO library, you do not need to actually save a file to know its size. If you encode your text into a ByteBuffer, you already know the final size of your file without even saving it.
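For example, Charset.encode hands back a ByteBuffer whose remaining() is exactly the byte count the file would have (a sketch; the class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class EncodedSize {

    // Encode the text in memory; remaining() is the number of bytes
    // that writing this text with this charset would put on disk.
    static int byteLength(String text, Charset cs) {
        ByteBuffer encoded = cs.encode(text);
        return encoded.remaining();
    }
}
```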
Any questions, please comment.
I am new to Woodstox. I have to read all possible combinations of Unicode characters and write them to an XML file. Woodstox fails while reading certain Unicode characters. Can someone tell me how I can either skip such a character when it is encountered, or otherwise write that Unicode character to the XML file?
The exception I get is:
Error on line 1 column 1404735 of 24364002-data-set-results.xml:
SXXP0003: Error reported by XML parser: Character reference "" is an invalid XML character.
Exception is : net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException: Character reference "" is an invalid XML character.
I am not familiar with Woodstox either, but I can say that U+FFFE is indeed not a valid Unicode character, so it is probably a problem with the input rather than with the parser. FFFE shows up when the byte order mark (U+FEFF), which some encoders put at the start of UTF-16 output to indicate the byte order (little or big endian), is read with the wrong byte order. Depending on whether the first two bytes read back as FEFF or as FFFE, the decoder knows which byte order to choose. The BOM is optional, though, and not all decoders support it.
When used as such, it is always the first two bytes of the file.
So, what you need to check is:
Are you using the correct character encoding (usually either UTF-8 or UTF-16)?
If using UTF-16, does your file start with FFFE or FEFF?
Does Woodstox have a setting that enables automatic detection of the byte order?
Worst case, if your file starts with FFFE or FEFF, simply remove those bytes before you feed the file to Woodstox. Make sure that you then set the correct byte order in Woodstox, though.
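Stripping the BOM programmatically might look like this, using a PushbackInputStream so that non-BOM bytes are put back untouched (a sketch; class and method names are mine, and the caller still has to pick the right charset):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomStripper {

    // Consume a leading UTF-8 BOM (EF BB BF) or UTF-16 BOM (FE FF / FF FE)
    // if present; otherwise push the peeked bytes back onto the stream.
    static InputStream withoutBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.read(head, 0, 3);
        if (n == 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                   && (head[2] & 0xFF) == 0xBF) {
            return pb;                          // UTF-8 BOM consumed
        }
        if (n >= 2 && (((head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
                    || ((head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE))) {
            if (n == 3) pb.unread(head, 2, 1);  // keep the byte after the BOM
            return pb;                          // UTF-16 BOM consumed
        }
        if (n > 0) pb.unread(head, 0, n);       // no BOM: put everything back
        return pb;
    }
}
```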