As far as I know, at the end of every file, especially text files, there is a hex code for an EOF or NULL character. When we write a program that reads the contents of a text file, we keep calling the read function until we receive that EOF hex code.
My question: I downloaded some tools to see a hex view of a text file, but I can't see any hex code for EOF (End Of File/NULL) or EOT (End Of Text).
ASCII/Hex code tables :
This is the output of the hex viewer tool:
Note: My input file is a text file whose content is "Where is hex code of "EOF"?"
Appreciate your time and consideration.
There is no such thing as an EOF character. The operating system knows exactly how many bytes a file contains (this is stored alongside other metadata like permissions, creation date, and the name), and hence can tell programs that try to read the eleventh byte of a ten-byte file: you've reached the end of the file, there are no more bytes to read.
In fact, the "EOF" value returned for example by C functions like getchar is explicitly an int value outside the range of a byte, so it cannot possibly be stored in a file!
Sometimes, certain file formats insist on adding NUL terminators (probably because that's how strings are usually stored in C), though usually these delimit multiple records in a single file, not the file as a whole. And such decoration usually disqualifies a file from being considered a "text file".
ASCII codes like ETX and NUL date back to the days of teletypewriters and friends. NUL is used in C for in-memory strings, but this has no bearing on file systems.
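To check this without a hex viewer, here is a minimal Java sketch. It assumes the text from the question is written as plain ASCII to a temporary file; the class and method names are made up for illustration. It compares the size reported by the file system with the bytes actually read back.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class NoEofByteDemo {
    // Writes the text to a temporary file and returns the raw bytes read back.
    static byte[] roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("eof-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.US_ASCII));
                // The size comes from file system metadata, not from a marker byte.
                long size = Files.size(tmp);
                byte[] bytes = Files.readAllBytes(tmp);
                if (size != bytes.length) {
                    throw new IllegalStateException("size mismatch");
                }
                return bytes;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = roundTrip("Where is hex code of \"EOF\"?");
        System.out.println("bytes in file: " + bytes.length);             // 27
        System.out.printf("last byte: 0x%02X%n", bytes[bytes.length - 1]); // 0x3F, the '?'
    }
}
```

The last byte in the file is the final '?' of the text itself; nothing is appended after it.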
There was - a long long time ago - an End Of File marker but it hasn't been used in files for many years.
You can demonstrate a distant echo of it on windows using:
C:\>copy con junk.txt
Hello
Hello again
- Press <Ctrl> and <z>
C:\>dump junk.txt
junk.txt:
00000000 4865 6c6c 6f0d 0a48 656c 6c6f 2061 6761 Hello..Hello aga
00000010 696e 0d0a in..
C:\>
Note the use of Ctrl-Z as an EOT marker.
However, notice also that the Ctrl-Z does not appear in the file any more - it used to appear as a 0x1a but only on some operating systems and even then not consistently.
Use of ETX (0x03) stopped even before those dim and distant times.
There is no such thing as EOF. EOF is just a value returned by file reading functions to tell you the file pointer reached the end of the file.
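As a rough illustration in Java (the class and method names are hypothetical), InputStream.read() returns an int: a byte value in the range 0 to 255, or -1 once the stream is exhausted. Bytes such as 0x1A, 0x04, 0x00 or 0xFF inside the data are just data and do not stop the loop.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ReadUntilEof {
    // Counts bytes by calling read() until it reports end of stream with -1.
    static int countBytes(InputStream in) {
        try {
            int count = 0;
            // read() yields 0..255 for data, so a 0xFF byte arrives as 255, never as -1.
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Contains Ctrl-Z (0x1A), EOT (0x04), NUL (0x00) and 0xFF as ordinary data.
        byte[] data = {0x48, 0x69, 0x1A, 0x04, 0x00, (byte) 0xFF};
        System.out.println(countBytes(new ByteArrayInputStream(data))); // prints 6
    }
}
```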
The EOT byte (0x04) is used to this day by Unix tty terminals to indicate end of input. You type it with Ctrl+D (i.e. ^D) to end input to shells or any other program reading from stdin.
However, as others have pointed out, this is distinct from EOF, which is a condition rather than a piece of data per se.
There once were even different EOF characters for different operating systems, but I no longer see them. (Files were typically stored in blocks of 128 bytes, so a marker was needed to show where the text ended inside the last block.) They were a pain to code around, much like BOMs are today.
Instead there is still an int read() that normally delivers a byte value, but delivers -1 at EOF.
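The NUL character is a string terminator in C. In Java you can have a NUL character in the middle of a string. To cooperate with C, Java's modified UTF-8 (used by DataOutputStream.writeUTF and JNI, for example) uses a multi-byte encoding not only for Unicode characters above 127 but also for NUL, so the encoded bytes never contain a zero byte. The standard UTF-8 encoding still writes NUL as a single 0x00 byte.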
(Some of this is probably known already.)
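The NUL handling can be seen with a small sketch (class and method names are illustrative only). It compares standard UTF-8 from String.getBytes with the modified UTF-8 written by DataOutputStream.writeUTF, which prefixes the data with a two-byte length.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NulEncodingDemo {
    // Standard UTF-8, as produced by String.getBytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, as produced by DataOutputStream.writeUTF.
    // The first two bytes of the result are a length prefix.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                out.writeUTF(s);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(standardUtf8("\0"))); // [0]
        System.out.println(Arrays.toString(modifiedUtf8("\0"))); // [0, 2, -64, -128], i.e. length 2, then C0 80
    }
}
```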
In the 7bit Wintel world it is 0x1A or chr(26).
It is still commonly found in older text files and archives and is still produced by some file transmission protocols. In particular text files downloaded from BBS systems were commonly terminated with this character.
There are other such sentinel values for older systems and, like the EOL variants (CR, LF, CR+LF), they need to be anticipated from time to time.
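If you have to read such legacy files, one way to anticipate the marker is to drop trailing 0x1A bytes before processing the content. This is only a sketch with made-up names, and it assumes that a trailing 0x1A is never meaningful data in your files.

```java
import java.util.Arrays;

class LegacyEofMarker {
    static final byte SUB = 0x1A; // Ctrl-Z, chr(26)

    // Returns a copy of the content without any trailing 0x1A bytes.
    // A 0x1A in the middle of the content is left untouched.
    static byte[] stripTrailingSub(byte[] content) {
        int end = content.length;
        while (end > 0 && content[end - 1] == SUB) {
            end--;
        }
        return Arrays.copyOf(content, end);
    }

    public static void main(String[] args) {
        byte[] legacy = {'H', 'i', '\r', '\n', SUB};
        System.out.println(stripTrailingSub(legacy).length); // prints 4
    }
}
```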
It can be a source of annoyance to see it still being used, on the same level as return(0) for instance.
I'm using these lines of code:
String string = "Some usefull information − don't know what happens with my output";
System.out.println(string);
String str2verify = driver.findElement(By.xpath("//someWellFormXpath")).getText();
Assert.assertEquals(str2verify , "Some usefull information − don't know what happens with my output");
And I'm getting this in my console, so comparing the strings with equals doesn't work.
Output
Some usefull information ? don't know what happens with my output
expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]
java.lang.AssertionError: expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]
This is the process:
You write some text. In an editor. That is showing strings to you.
You save your file. Files are bytes, not characters, so your editor applies a charset encoding to do this. Which one? Your editor will know; you didn't mention which one you use, so I can't tell you.
Javac reads your file. Files are bytes, but javac needs characters, so javac applies a charset encoding to do this. Which one? "The platform default", unless you use the -encoding parameter, or the tool that calls javac for you has a way to tell it which -encoding parameter to use.
Javac emits class files. These are byte based so this doesn't require encoding.
Your java JVM runs your class file. As part of running, a string is printed to standard out.
System.out refers to 'standard out'. These things are, on pretty much every OS, a stream of bytes. Meaning, when you send strings there, the JVM first encodes your string using some charset encoding, then it goes to standard out.
Something is connected to the other end of standard out and sees these bytes. It converts the bytes back to a string, also using some encoding.
The characters are sent to the font rendering engine on your OS. Even if the character 'survived' all those conversions back and forth, it is possible your font doesn't have a glyph for it. The intent is clearly for that character to be an emdash (a dash that is as long as the letter 'm' - the standard 'minus' character is an ndash, not the same thing; that one is a bit shorter).
Count em up - that's like 6 conversions. They all need to be using the same charset encoding. So, check that your editor and javac agree on what charset encoding your source file is in. Then, check that the console thing that is showing the string is in agreement with standard out (which should be 'platform default', whatever that might be), then, check if the font you use has emdash.
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
Then write to ps, not System.out - that's how you can explicitly force some charset to be used when writing to output.
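To see what such a conversion does to the bytes, here is a hedged sketch. It assumes the problematic character is U+2212 (MINUS SIGN), which is how it appears in the question, and it writes to an in-memory buffer instead of the real console; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class EncodingLossDemo {
    // Prints the text through a PrintStream that uses the named charset
    // and returns the bytes that would have been sent to standard out.
    static byte[] printWith(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream ps = new PrintStream(sink, true, charsetName);
            ps.print(text);
            ps.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String minus = "\u2212"; // assumed to be the character from the question
        // A charset that cannot represent the character silently substitutes '?'.
        System.out.println(printWith(minus, "US-ASCII").length); // 1 byte: 0x3F
        // UTF-8 can represent it, using three bytes: E2 88 92.
        System.out.println(printWith(minus, "UTF-8").length);    // 3 bytes
    }
}
```

The '?' in the console output of the question is exactly this kind of silent substitution happening at one of the conversion steps.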
It turns out that this character doesn't have a representation in the cp-1252 charset encoding, so in the end I had to change all the files in my project to UTF-8 to be able to save it.
This encoding issue was a real headache.
Thanks for all the suggestions friends.
IntelliJ IDEA 14.0.1
Plugin: jetbrains-bitbucket-connector
I'm trying to commit files, but get the error:
Error:transaction abort!
rollback completed abort:
decoding near 'C:\Users\����\AppDa': 'utf8' codec can't decode byte
0xcc in position 9: invalid continuation byte!
Has anyone encountered this error? How can it be solved?
Thanks.
This probably isn't the answer which you're looking for but gives you some insight on what might be going on:
On most systems, file paths are made up from bytes since file systems were designed decades before Unicode. Unicode is retrofitted to them by interpreting the bytes as UTF-8 encoded strings. Unfortunately, there is no way to say "this is Cp-1251" and "this is UTF-8" inside of a file name. Therefore, the "convert file name to string" code relies on the platform's default encoding. NTFS solved the problem by always storing file names as Unicode (ignoring the local code page) but the names are translated into the local code page when you use a tool which displays them on screen.
And then comes Python 2 where Unicode was also retrofitted in a similar way. Python just has the advantage that you have two types of objects (str and unicode) so in theory, you can tell raw bytes and Unicode apart. The problems start when you get a bunch of bytes from somewhere and the logic says "this should be Unicode" - which happens when you read file names from disk.
In your case, the file system passes bytes which contain Cp1251-encoded characters to Python, but the Python code tries to read them as UTF-8 encoded Unicode. For ASCII characters (code points below 128) this works, but it breaks for everything with a code point of 128 or above. \xCC is a typical case: in UTF-8 it can only appear as the first byte of a two-byte sequence and must be followed by a continuation byte (0x80 to 0xBF), while in Cp1251 it is an ordinary letter that is usually followed by another letter. This is why you see this error so often in Europe - we use those characters a lot.
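The same failure can be reproduced outside Python. The following Java sketch uses a strict UTF-8 decoder on a hand-built byte array; the four bytes after "C:\Users\" are a made-up Cp1251 user name chosen only so that 0xCC lands at position 9, as in the error message.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Returns true only if the bytes form valid UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "C:\Users\" in ASCII followed by four hypothetical Cp1251 letters.
        byte[] path = {'C', ':', '\\', 'U', 's', 'e', 'r', 's', '\\',
                (byte) 0xCC, (byte) 0xE0, (byte) 0xF8, (byte) 0xE0};
        // 0xCC announces a two-byte sequence, but 0xE0 is not a continuation byte.
        System.out.println(isValidUtf8(path)); // prints false
    }
}
```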
Now the people who created Mercurial are well aware of all this. Most of the time, Mercurial should just work. See https://www.mercurial-scm.org/pipermail/mercurial/2009-January/023762.html
As I see it, your problem could be caused by:
Somehow, Windows used the local code page to create your home directory (unlikely)
Mercurial gets the path as Unicode but for some reason, it thinks that the string is raw bytes and tried to decode using a UTF-8 decoder. Since the decoding is applied twice, this fails. Maybe you have an old version of Mercurial. Try to update.
Maybe you showed us the wrong part of the error message and the problem is actually in a file which you tried to commit. In that case, we can ignore the odd � characters in the error message. Make sure you use the correct encoding when you edit the file.
To see which one it is, I suggest creating a folder C:\dev and working there. If this works, then there is something wrong with your home folder or Mercurial has a bug.
The error says that the path to your file contains a few bytes that are not valid UTF-8, so the decoder cannot decode the path and aborts the operation.
Look at the characters in the path and correct any that show up as unknown:
'C:\Users\����\AppDa'
Here the ���� show that these characters could not be decoded as UTF-8.
edit:
Check your string with this tool to see which encoding it is in. link to tool
You can then use that encoding, but this is not a practical solution. Consider using a UTF-16 character set instead, since it covers a larger range of characters; note that the available encodings vary by platform and language.
I had the same problem, if you use Mercurial also, then here is the solution:
go to [project directory]/.hg
open "hgrc" file
below the [ui] section, insert username = my_name_only_utf_characters <mail@example.com>
save & commit
I have an array of strings that I need to save into a txt file. I'm only allowed to create files of at most 64 KB, so I need to know when to stop putting strings into the file.
Is there some method that, given an array of strings, lets me find out how big the file will be without creating the file?
Is the file going to be ASCII-encoded? If so, every character you write will be 1 byte. Add up the string lengths as you go, and if the total number of characters goes greater than 64k, you know to stop. Don't forget to include newlines between strings, in case you're doing that.
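The counting described above could look roughly like this in Java. It is a sketch with invented names, and it assumes every string is followed by a single "\n" that takes one byte, which holds for ASCII, ISO-8859-1 and UTF-8 but not for UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FileSizeBudget {
    // Returns how many leading strings fit into maxBytes when each string
    // is written followed by a one-byte "\n" separator.
    static int countThatFit(String[] lines, Charset charset, long maxBytes) {
        long total = 0;
        int count = 0;
        for (String line : lines) {
            long needed = line.getBytes(charset).length + 1L; // +1 for the newline
            if (total + needed > maxBytes) {
                break;
            }
            total += needed;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {"abc", "de", "fghi"};
        // "abc\n" is 4 bytes and "de\n" is 3 bytes, so two strings fit into 7 bytes.
        System.out.println(countThatFit(lines, StandardCharsets.US_ASCII, 7)); // prints 2
        // For the real limit from the question, pass 64 * 1024 as maxBytes.
    }
}
```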
Java ships with a library for input and output named NIO. I assume you already know how to use it. If you do not, look at the following links to learn more:
http://en.wikipedia.org/wiki/New_I/O
https://blogs.oracle.com/slc/entry/javanio_vs_javaio
http://docs.oracle.com/javase/tutorial/essential/io/fileio.html
We all know that all data types are just bytes in the end. Characters are the same, with a little more detail. The characters of the world (letters, numbers, symbols and so on) are mapped to a table named Unicode, and a character encoding algorithm turns them into a certain number of bytes when you save text to a file. Since I could spend hours talking about this, I suggest you take a look at the following links to understand more about character encoding:
http://www.w3schools.com/tags/ref_charactersets.asp
https://stackoverflow.com/questions/3049090/character-sets-explained-for-dummies
https://www.w3.org/International/questions/qa-what-is-encoding.en
http://unicode-table.com/en/
http://en.wikipedia.org/wiki/Character_encoding
By using Charset, CharsetEncoder and CharsetDecoder, you can choose a specific character encoding to save your text, and the final size of your file may vary depending on that choice. With UTF-8 (the 8 refers to its 8-bit code units), ASCII characters take 1 byte each and all other characters take 2 to 4 bytes. With UTF-16 (16-bit code units), most characters take 2 bytes and characters outside the Basic Multilingual Plane take 4. In other words, the number of bytes per saved character depends on both the encoding and the character. The following link lists the encodings supported by the Java API:
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
With the NIO library, you do not need to actually save a file to know its size. If you encode the text into a ByteBuffer, you already know the final size of the file without saving it.
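As a minimal sketch of that idea (the class and method names are mine, not part of NIO), Charset.encode returns a ByteBuffer whose remaining() is the number of bytes the text would occupy in that encoding:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // Number of bytes the text occupies in the given charset, computed in memory.
    static int encodedSize(String text, Charset charset) {
        ByteBuffer encoded = charset.encode(text);
        return encoded.remaining();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00E4";     // "ä"
        String emoji = "\uD83D\uDE00"; // U+1F600, outside the Basic Multilingual Plane
        System.out.println(encodedSize("a", StandardCharsets.UTF_8));          // 1
        System.out.println(encodedSize(aUmlaut, StandardCharsets.UTF_8));      // 2
        System.out.println(encodedSize(aUmlaut, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedSize(aUmlaut, StandardCharsets.UTF_16BE));   // 2
        System.out.println(encodedSize(emoji, StandardCharsets.UTF_8));        // 4
        System.out.println(encodedSize(emoji, StandardCharsets.UTF_16BE));     // 4
    }
}
```

UTF_16BE is used here rather than UTF_16 because the latter also writes a two-byte byte order mark, which would add to the count.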
Any questions, please comment.
I'm a beginner Python programmer and I used to know a bit of Java.
I have some text files (in Turkish) and corresponding XML files which contain the offsets of the connectives in the text. For example:
<Conn>
  <Span>
    <Text>ama</Text>
    <BeginOffset>281</BeginOffset>
    <EndOffset>284</EndOffset>
  </Span>
</Conn>
This says that there is an 'ama' at offset 281 in the txt file. But when I read the file with Python, 'ama' is at the 301st byte, or the 272nd character, of the file. As far as I know, the Java application doesn't specify any encoding while reading the txt files. I tried reading the files as Unicode, UTF-8, etc.
I need to find a way to get from these offsets to the correct positions in the files. My guess is that the problem is due to Turkish characters (which may take different numbers of bytes in different encodings), but I couldn't figure it out.
Any suggestions would be very welcome.
Thanks
Edit:
I used following code in python3.3:
f = open(path, encoding='utf-8')
text = f.read()
text[272:275]  # returns 'ama', but according to the XML it should be text[281:284]
inbytes = text.encode(encoding='utf-8')
inbytes[292:295]  # returns b'ama', but this offset does not match the XML either
As @Gene says, it's the end-of-line markers. Since the Java application was written on Windows, it counts each line break as 2 characters ('\r\n'), but Python counts it as 1 ('\n'). I count the '\n' characters before the offset and subtract that count from the given offset.
Thank you very much for your insightful comments.
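For reference, the correction can be written as a small function. The sketch below is in Java with invented names; it assumes the offsets were produced over text with "\r\n" line endings and that the text in hand uses "\n" only (which is what Python's text mode normally yields, since it translates line endings on read).

```java
class OffsetConverter {
    // Converts an offset counted over CRLF text into an index into
    // the same text with LF-only line endings.
    static int crlfToLfOffset(String lfText, int crlfOffset) {
        int crlfPos = 0; // position in the imagined CRLF version of the text
        for (int i = 0; i < lfText.length(); i++) {
            if (crlfPos >= crlfOffset) {
                return i;
            }
            // Each '\n' here stood for two characters ("\r\n") in the original.
            crlfPos += (lfText.charAt(i) == '\n') ? 2 : 1;
        }
        return lfText.length();
    }

    public static void main(String[] args) {
        String text = "a\nb\nama"; // in the CRLF version, "ama" spans offsets 6 to 9
        int begin = crlfToLfOffset(text, 6);
        int end = crlfToLfOffset(text, 9);
        System.out.println(text.substring(begin, end)); // prints "ama"
    }
}
```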
I have to read the names of some files and put them in a list as strings. It's not so hard; I just have problems with some characters like ä, ö, ü ... they always show up as '?' in my string.
What's the problem? Well, the encoding. OK, this should be easy... that's what I thought. So I tried to use constructs like:
new String(insert.getBytes("UTF-8"))
or
new String(insert.getBytes("ISO-8859-1"), "UTF-8")
because most of the files are ISO-8859-1.
It's not helping. This is my code:
...
File[] fileList = dir.listFiles();
String insert;
for (File f : fileList) {
    ...
    insert = f.getName().substring(0, f.getName().length() - 4);
    insert = insert.charAt(0) + insert.substring(1, insert.length()).toLowerCase().replaceFirst("([0-9]*(_s?(i)?(_dat)?)*$)", "").replaceFirst("_", " ");
    ...
    System.out.println("test UTF8: " + new String(insert.getBytes("UTF-8"))); // not helping
    System.out.println("test ISO , UTF8: " + new String(insert.getBytes("ISO-8859-1"), "UTF-8")); // not helping
    ...
    names.add(insert);
}
At the end there are a lot of strings with '?' characters in my list.
How to fix the problem? And whats the best way if there are not only ISO-8859-1 files? (lets say there are a lot of unknown encoded files)
Thank You!
Given the extended comments back and forth under the question, it now looks like this is either a font problem or (perhaps more likely) a filename encoding problem.
I asked Lissy to run the following commands to let us figure out what the problem is. If she is sure that the filenames contain "ä", but that character does not appear when she lists the files with ls, then these commands will tell us whether this is a font problem or an encoding problem.
touch filenäme
ls filen*me
If this shows "filenäme" in the output of ls then we know the problem is with the creation/copy of the files onto this system. This could happen if the program which created the files didn't realize what the filesystem encoding was or was too stupid to do the right thing. The convmv program will probably be the best way to fix this.
convmv -f ENCODING -t utf8 -r .
The question is what the proper encoding is. Possibilities include UTF-16, cp850, or perhaps iso8859-1. convmv --list will show you the list of encodings currently known to your system. Since the command above only shows what it would do, it is safe to run it several times with different encodings until you find one which works for all files; then add --notest to actually rename them.
If this is a font problem, we'll have to look into that.
Unexpected question marks, splats, etc. in a String are a sign that something somewhere doesn't recognize a particular character when converting from one character set to another.
In your case, the problem could be occurring in a couple of places:
It could be occurring when your Java program is reading the file names from the directory (in the dir.listFiles() call).
It could be happening when you print the characters to the console stream.
In either case, the root cause is most likely a mismatch between what Java thinks the locale settings should be and the settings that the operating system and/or command shell are using.
As an experiment, try to list a directory containing the problematic file names from the command line. Do you see question marks or other splats there?
A second experiment is to modify your Java program to dump one of the problem Strings as a sequence of numbers representing the character codes of its characters. Do you see the character code for an ASCII / Unicode '?' (63, or 0x3F)?
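A possible way to do that dump, as a sketch with made-up names, is to print every code point of the String in U+XXXX notation:

```java
class CodePointDump {
    // Renders each code point of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // A real "ä" shows up as U+00E4; a character already lost shows up as U+003F.
        System.out.println(dump("\u00E4?")); // prints "U+00E4 U+003F"
    }
}
```

If the dump shows U+003F or U+FFFD where you expected an umlaut, the character was lost before the String reached your code; if it shows U+00E4, the String is fine and only the display is wrong.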
The encoding of the content of a file has nothing to do with the encoding of the file name itself.
You should get correct results from System.out.println(insert)
If you don't, it means that the shell has a different character encoding than the default character encoding for your system (this rarely happens; it would usually be the result of an explicit command to switch encodings in the shell).
If the file names are displayed correctly when you list the directory in the shell, I would expect them to be displayed correctly without specifying an encoding in your Java program.
If the shell is incapable of displaying the character (it is substituting the replacement character U+FFFD (�) for these unprintable characters), there's nothing you can do from your Java application to change that. You need to change the terminal character encoding, install the right fonts, etc.; that is an operating system issue, not a Java issue.
At the same time, even if your terminal can't display the correct results, the Java program should be handling the character encodings correctly without your intervention.
The library behind the File API is figuring out the correct character encoding for your system and doing the necessary decoding into characters. Likewise, the database driver should negotiate with the database to determine the correct encoding, and do any necessary encoding into bytes on behalf of your application.
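This is also why the getBytes/new String attempts in the question cannot help: once the file name is a Java String it is already decoded, and pushing it through a mismatched pair of charsets only damages it. A small sketch (names are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongRoundTrip {
    // Encodes the string with one charset and decodes the bytes with another.
    static String reinterpret(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String name = "\u00E4"; // "ä"
        // Same charset both ways: nothing changes.
        System.out.println(reinterpret(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(name)); // true
        // ISO-8859-1 bytes read as UTF-8: the lone byte 0xE4 is malformed and becomes U+FFFD.
        System.out.println(reinterpret(name, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8).equals(name)); // false
        // UTF-8 bytes read as ISO-8859-1: the two bytes C3 A4 become two separate characters.
        System.out.println(reinterpret(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```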
In a comment you wrote:
@mdrg: well, theres a Problem. I have to read the name of the files and then put them into a database. And there are a lot of '?' , that shouldnt be... – Lissy 27 mins ago
My guess is that the column you're inserting the filenames into specifies US-ASCII as the encoding and replaces characters outside that range with a replacement character, which in your case is the question mark.
So you have to find out the encoding for the column in your database table where you store the filenames. Various products have various syntaxes for retrieving that information.
Since Java 1.6 you can use System.console() instead of System.out.println() to display accented characters on the console.
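public class Test {
    public static void main(String[] args) {
        String s = "caractères français : à é \u00e9"; // \u00e9 is the Unicode escape for "é"
        // System.console() returns null when no console is attached (e.g. redirected output).
        if (System.console() != null) {
            System.console().writer().println(s);
        } else {
            System.out.println(s);
        }
    }
}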
and the output is
C:\temp>java Test
caractères français : à é é