string offsets in java and python - java

I'm a beginner Python programmer, and I used to know a bit of Java.
I have some text files (in Turkish) and corresponding XML files which contain the offsets of
the connectives in the text. For example:
<Conn>
<Span>
<Text>ama</Text>
<BeginOffset>281</BeginOffset>
<EndOffset>284</EndOffset>
</Span>
</Conn>
This says that there is an 'ama' at offset 281 in the txt file. But when I read this file with Python,
'ama' is at byte 301, or at character 272, of the file. As far as I know, the Java application doesn't specify any encoding when reading the txt files. I have tried reading the files as Unicode, UTF-8, etc.
I need a way to map these offsets to the correct positions in the files. My guess is that the problem is due to Turkish characters (which may take different numbers of bytes in different encodings), but I couldn't figure it out.
Any suggestions would be very welcome.
Thanks
Edit:
I used the following code in Python 3.3:
f = open(path, encoding='utf-8')
text = f.read()
text[272:275] # returns 'ama' but it should be text[281:284]
inbytes = text.encode(encoding='utf-8')
inbytes[292:295] # returns 'ama' but this is also incorrect

As @Gene says, it's the end-of-line markers. Since the Java application was written on Windows, it counts each line ending ('\r\n') as 2, whereas Python counts each '\n' as 1. I count the '\n' characters up to the offset and subtract that count from the given offset.
Thank you very much for your insightful comments.
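A minimal Python sketch of the newline adjustment described above. It assumes the offsets in the XML count every line ending as two characters ('\r\n') while the text read in Python contains only '\n'. The helper name crlf_offset_to_index is made up for illustration.

```python
def crlf_offset_to_index(text, offset):
    """Map an offset that counts each line ending as 2 characters
    to an index into `text`, where each line ending is a single '\n'."""
    seen = 0  # position as a CRLF-counting tool would see it
    for i, ch in enumerate(text):
        if seen >= offset:
            return i
        seen += 2 if ch == "\n" else 1
    return len(text)


# Small stand-in for the real file: two line breaks come before 'ama'.
sample = "ab\ncd\nama x"
begin = crlf_offset_to_index(sample, 8)   # 8 is the offset in the CRLF version
end = crlf_offset_to_index(sample, 11)
found = sample[begin:end]
```

With the real data, the BeginOffset and EndOffset values from the XML would be passed in place of 8 and 11.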

Related

Is there any encoding that would let me safely write and read any 8 bit char code (the whole 256 not just the 128) to and from a file?

I am trying to implement Huffman Tree compression. Pretty much how it works is giving < 8-bit codes to the most common characters in text documents and larger codes to the less common characters. Then there is a binary tree encoded that lets you navigate down with 1's telling you to go left and 0's telling you to go right which leads you to the characters.
So obviously there are chunks that aren't 8 bits long. I have been padding them as needed with 0's at the end and converting them to characters. However, I just found that Java writes 3 bytes per character. Because this is about compression, I obviously want one byte.
The problem is that I don't know which bytes are going to end up being written. Three different < 8-bit codes might get mushed together. I need to be able to write any code to the text file. There are invalid byte sequences, however, and so my entire approach is all gummed up.
Is there any way that I can let any byte sequence be valid in a certain section of the file and just store it literally, without worrying about a character ending the file prematurely or causing other mischief? I am coding on a Mac, so that is a problem, unlike on Windows, where they just have the length of the file at the beginning so that they don't need an end-of-file character.
If there is no direct solution here, then perhaps I could make my own encoding that would not end the file and nest that inside a more common one?
This looks like a good use case for Java's BitSet: https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html
When writing out the data to a file, you should first output the number of values which were encoded; after that you only need the serialized stream of bits.
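The same idea (a length header followed by the packed bits) can be sketched in Python, used here only for illustration in place of Java's BitSet. The helper names pack_bits and unpack_bits, and the 4-byte big-endian header holding the bit count, are assumptions and not part of the answer. The point is that the data is handled as raw bytes, so no character encoding is involved and every byte value is valid.

```python
import struct


def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes, prefixed by the bit count."""
    padded = bits + "0" * (-len(bits) % 8)  # pad with zeros to a whole byte
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return struct.pack(">I", len(bits)) + body


def unpack_bits(data):
    """Recover the original bit string, dropping the padding bits."""
    (count,) = struct.unpack(">I", data[:4])
    bits = "".join(format(byte, "08b") for byte in data[4:])
    return bits[:count]


packed = pack_bits("1011001110")  # 10 bits -> 4 header bytes + 2 data bytes
restored = unpack_bits(packed)
```

The resulting bytes object would be written with a file opened in binary mode ("wb") and read back with "rb".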

Java unexpected character parsing txt file

I am trying to split txt files into an ArrayList of strings, and so far it works, but the first word in the file always starts with (int) 65279, and I can't even copy this character here. Also, in the GUI it looks like the second letter of the word is missing, while it displays correctly in the console. The other words are as they should be. I am using UTF-8 .txt files. How can I change the format in NetBeans and in a GUI made in this IDE?
U+FEFF is the byte order mark. It's used to indicate the character encoding/endianness (so you can easily tell the difference between big and little-endian UTF-16, for example).
If it's causing you a problem, the simplest thing is just to strip it:
if (text.startsWith("\ufeff")) {
    text = text.substring(1);
}
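For comparison, a small Python sketch of the same situation. It assumes a UTF-8 file that starts with a byte order mark; the sample word is made up. Python's "utf-8-sig" codec drops a leading BOM while decoding, whereas plain "utf-8" keeps it as U+FEFF (65279).

```python
import codecs

# Bytes as they would appear at the start of a UTF-8 file saved with a BOM.
raw = codecs.BOM_UTF8 + "merhaba".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as the first character
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped during decoding

first_code = ord(with_bom[0])
```

When reading from disk, the same codec name can be passed as open(path, encoding="utf-8-sig").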

Where is hex code of the "EOF" character?

As far as I know, at the end of every file, especially text files, there is a hex code for an EOF or NULL character. And when we write a program to read the contents of a text file, we keep calling the read function until we receive that EOF hex code.
My question: I downloaded some tools to get a hex view of a text file, but I can't see any hex code for EOF (End Of File/NULL) or EOT (End Of Text).
ASCII/hex code tables:
This is the output of the hex viewer tools:
Note: My input file is a text file whose content is "Where is hex code of "EOF"?"
I appreciate your time and consideration.
There is no such thing as an EOF character. The operating system knows exactly how many bytes a file contains (this is stored alongside other metadata like permissions, creation date, and the name), and hence can tell programs that try to read the eleventh byte of a ten-byte file: You've reached the end of file, there are no more bytes to read.
In fact, the "EOF" value returned for example by C functions like getchar is explicitly an int value outside the range of a byte, so it cannot possibly be stored in a file!
Sometimes, certain file formats insist on adding NUL terminators (probably because that's how strings are usually stored in C), though usually these delimit multiple records in a single file, not the file as a whole. And such decoration usually disqualifies a file from being considered a "text file".
ASCII codes like ETX and NUL date back to the days of teletypewriters and friends. NUL is used in C for in-memory strings, but this has no bearing on file systems.
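A short Python sketch of this behaviour, using a temporary file whose content is the sentence from the question. The size comes from file system metadata, the bytes read back contain nothing beyond what was written, and reading past the end yields an empty result rather than a special byte.

```python
import os
import tempfile

content = b'Where is hex code of "EOF"?'

# Write the sample text to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

size = os.path.getsize(path)  # length as recorded by the file system

with open(path, "rb") as f:
    data = f.read()       # every byte stored in the file
    past_end = f.read(1)  # reading again at end of file

os.remove(path)
```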
There was - a long long time ago - an End Of File marker but it hasn't been used in files for many years.
You can demonstrate a distant echo of it on windows using:
C:\>copy con junk.txt
Hello
Hello again
- Press <Ctrl> and <z>
C:\>dump junk.txt
junk.txt:
00000000 4865 6c6c 6f0d 0a48 656c 6c6f 2061 6761 Hello..Hello aga
00000010 696e 0d0a in..
C:\>
Note the use of Ctrl-Z as an EOT marker.
However, notice also that the Ctrl-Z does not appear in the file any more - it used to appear as a 0x1a but only on some operating systems and even then not consistently.
Use of ETX (0x03) stopped even before those dim and distant times.
There is no such thing as EOF. EOF is just a value returned by file reading functions to tell you the file pointer reached the end of the file.
The EOT byte (0x04) is used to this day by unix tty terminals to indicate end of input. You type it with a Ctrl + D (ie. ^D) to end input to shells or any other program reading from stdin.
However, as others have pointed out, this is distinct from EOF, which is a condition rather than a piece of data per se.
There once were even different EOF characters for different operating systems, though I have not seen one in a long time. (Typically, files were stored in blocks of 128 bytes.) They were a pain to handle in code, much like BOMs are nowadays.
Instead, there is still an int read() that normally delivers a byte value, but delivers -1 at EOF.
The NUL character is a string terminator in C. In Java you can have a NUL character in the middle of a string. To cooperate with C, Java's modified UTF-8 (used, for example, by DataOutputStream.writeUTF and JNI) uses a multi-byte encoding both for Unicode characters > 127 and for NUL.
(Some of this is probably known already.)
In the 7bit Wintel world it is 0x1A or chr(26).
It is still commonly found in older text files and archives and is still produced by some file transmission protocols. In particular text files downloaded from BBS systems were commonly terminated with this character.
There are other such sentinel values for older systems, and, like the EOL variants (CR, LF, CR+LF), they need to be anticipated from time to time.
It can be a source of annoyance to see it still being used, on the same level as return(0) for instance.

java-how can someone find out the size of a txt file before creation

I have an array of strings that I need to save into a txt file. I'm only allowed to create files of at most 64 KB, so I need to know when to stop putting strings into the file.
Is there some method that, given an array of strings, can tell me how big the file will be without creating the file?
Is the file going to be ASCII-encoded? If so, every character you write will be 1 byte. Add up the string lengths as you go, and if the total number of characters exceeds 64 KB, you know to stop. Don't forget to include newlines between strings, in case you're writing them.
Java ships with an input/output library named NIO. I assume you already know how to use it. If you do not, look at the following links to learn more:
http://en.wikipedia.org/wiki/New_I/O
https://blogs.oracle.com/slc/entry/javanio_vs_javaio
http://docs.oracle.com/javase/tutorial/essential/io/fileio.html
We all know that all data types are just bytes in the end. Characters are the same, with a little more detail. The characters of the world (letters, numbers, symbols and so on) are mapped to a table named Unicode, and a character encoding determines how many bytes you get when you save text to a file. Since I could spend hours talking about this, I suggest you take a look at the following links to understand more about character encoding:
http://www.w3schools.com/tags/ref_charactersets.asp
https://stackoverflow.com/questions/3049090/character-sets-explained-for-dummies
https://www.w3.org/International/questions/qa-what-is-encoding.en
http://unicode-table.com/en/
http://en.wikipedia.org/wiki/Character_encoding
By using Charset, CharsetEncoder and CharsetDecoder, you can choose a specific character encoding for saving your text, and the final size of your file may vary depending on that choice. With UTF-8 (the 8 refers to 8-bit code units), ASCII characters take 1 byte each, while other characters take 2 to 4 bytes. With UTF-16 (16-bit code units), most characters take 2 bytes, and characters outside the Basic Multilingual Plane take 4. This means that the number of bytes per saved character depends on both the encoding and the character. At the following link you can find the encodings supported by the Java API:
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
With the NIO library, you do not need to actually save a file to know its size. If you encode the text into a ByteBuffer, you already know the final size of your file without saving it.
Any questions, please comment.
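The "measure before writing" idea can be sketched in Python for comparison with the Java approach above. The helper name take_until_limit, the default encoding, and the choice to write one '\n' after every string are assumptions for illustration. Each string is encoded in memory and its byte length is added to a running total, so nothing is written to disk.

```python
def take_until_limit(strings, limit=64 * 1024, encoding="utf-8", newline="\n"):
    """Return the leading strings that fit within `limit` bytes once encoded,
    together with the total number of bytes they would occupy."""
    chosen, total = [], 0
    for s in strings:
        size = len((s + newline).encode(encoding))  # bytes this line would take
        if total + size > limit:
            break
        chosen.append(s)
        total += size
    return chosen, total


# 'ç' takes 2 bytes in UTF-8, so the second line costs 4 bytes, not 3.
kept, used = take_until_limit(["ab", "çd", "ef"], limit=7)
```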

Get filename as UTF-8? (ä,ü,ö ... is always '?')

I have to read the names of some files and put them in a list as strings. It's not so hard; I just have problems with some characters like ä, ö, ü: they always show up as '?' in my string.
What's the problem? Well, the encoding. OK, this should be easy... that's what I thought. So I tried to use calls like:
new String(insert.getBytes("UTF-8"))
or
new String(insert.getBytes("ISO-8859-1"), "UTF-8")
because most of the files are ISO-8859-1.
It's not helping. This is my code:
...
File[] fileList = dir.listFiles();
String insert;
for (File f : fileList) {
    ...
    insert = f.getName().substring(0, f.getName().length() - 4);
    insert = insert.charAt(0) + insert.substring(1, insert.length()).toLowerCase().replaceFirst("([0-9]*(_s?(i)?(_dat)?)*$)", "").replaceFirst("_", " ");
    ...
    System.out.println("test UTF8: " + new String(insert.getBytes("UTF-8"))); //not helping
    System.out.println("test ISO , UTF8: " + new String(insert.getBytes("ISO-8859-1"), "UTF-8")); //not helping
    ...
    names.add(insert);
}
In the end there are a lot of strings with '?' characters in my list.
How can I fix the problem? And what's the best approach if not all the files are ISO-8859-1? (Let's say there are a lot of files with unknown encodings.)
Thank you!
Given the extended comments back and forth under the question, it now looks like this is either a font problem or (perhaps more likely) a filename encoding problem.
I asked Lissy to run the following commands to let us figure out what the problem is. If she is sure that the filenames contain "ä", but that character does not appear when she lists the files with ls, then these commands will tell us whether this is a font or an encoding problem.
touch filenäme
ls filen*me
If this shows "filenäme" in the output of ls then we know the problem is with the creation/copy of the files onto this system. This could happen if the program which created the files didn't realize what the filesystem encoding was or was too stupid to do the right thing. The convmv program will probably be the best way to fix this.
convmv -f ENCODING -t utf8 -r .
The question is what the proper encoding is. Possibilities include UTF-16, cp850, or perhaps iso8859-1. convmv --list will show you the list of encodings currently known to your system. Since the command listed above only shows you what it would do, it is safe to run it several times with different encodings until you find one which works for all files.
If this is a font problem, we'll have to look into that.
Unexpected question marks, splats, etc. in a String are a sign that something somewhere doesn't recognize a particular character when converting from one character set to another.
In your case, the problem could be occurring in a couple of places:
It could be occurring when your Java program is reading the file names from the directory (in the dir.listFiles() call).
It could be happening when you print the characters to the console stream.
In either case, the root cause is most likely a mismatch between what Java thinks the locale settings should be and the settings that the operating system and/or command shell are using.
As an experiment, try to list a directory containing the problematic file names from the command line. Do you see question marks or other splats there?
A second experiment is to modify your Java program to dump one of the problem Strings as a sequence of numbers representing the character codes of each of its characters. Do you see the character code for an ASCII / Unicode '?'?
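The second experiment can be sketched in Python as follows. The helper name code_points is made up, and the sample names stand in for real entries returned by a directory listing. If the dump shows U+003F, the '?' is really in the string; if it shows U+00E4, the string is fine and only the display is wrong.

```python
def code_points(name):
    """Return the Unicode code point of every character in `name`."""
    return [f"U+{ord(ch):04X}" for ch in name]


good = code_points("filen\u00e4me")  # name containing a real 'ä'
bad = code_points("filen?me")        # name where 'ä' was already lost
```

In practice the names would come from something like os.listdir(directory) instead of literals.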
The encoding of the content of the file name has nothing to do with the encoding of the file name itself.
You should get correct results from System.out.println(insert)
If you don't, it means that the shell has a different character encoding than the default character encoding for your system (this rarely happens; it would usually be the result of an explicit command to switch encodings in the shell).
If the file names are displayed correctly when you list the directory in the shell, I would expect them to be displayed correctly without specifying an encoding in your Java program.
If the shell is incapable of displaying the character (it is substituting the replacement character 0xFFFD (�) for these unprintable characters), there's nothing you can do from your Java application to change that. You need to change the terminal character encoding, install the right fonts, etc.; that is an operating system issue, not a Java issue.
At the same time, even if your terminal can't display the correct results, the Java program should be handling the character encodings correctly without your intervention.
The library behind the File API is figuring out the correct character encoding for your system and doing the necessary decoding into characters. Likewise, the database driver should negotiate with the database to determine the correct encoding, and do any necessary encoding into bytes on behalf of your application.
In a comment you wrote:
@mdrg: well, theres a Problem. I have to read the name of the files and then put them into a database. And there are a lot of '?' , that shouldnt be... – Lissy 27 mins ago
My guess is that the column you're inserting the filenames into specifies US-ASCII as the encoding and replaces characters outside that range with a replacement character, which in your case is the question mark.
So you have to find out the encoding for the column in your database table where you store the filenames. Various products have various syntaxes for retrieving that information.
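The effect guessed at here can be reproduced in a few lines of Python. This only illustrates what an ASCII-only target does to such a name; it is not a claim about any particular database. Encoding to US-ASCII with replacement turns every character outside that range into a question mark, and the original character cannot be recovered afterwards.

```python
name = "filen\u00e4me"  # contains 'ä'

# What a US-ASCII-only store would keep if it replaces unknown characters.
stored = name.encode("ascii", errors="replace")
read_back = stored.decode("ascii")

# A UTF-8 capable store keeps the character intact.
round_trip = name.encode("utf-8").decode("utf-8")
```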
In Java 1.6 you can use System.console() instead of System.out.println() to display accented characters on the console.
public class Test {
    public static void main(String args[]) {
        String s = "caractères français : à é \u00e9"; // Unicode for "é"
        System.console().writer().println(s);
    }
}
and the output is
C:\temp>java Test
caractères français : à é é
