Get file encoding ASCII or EBCDIC with Java

I have a file whose extension is .b3c. I want to know whether it's encoded in ASCII or EBCDIC. How can I achieve that in Java, please?
Help is needed.
Thanks

Assuming the text file contains multiple lines of text, check for the newline character.
In ASCII, lines end with an LF / \n / 0x0a. Sure, on Windows there's also a CR, but we can ignore that part.
In EBCDIC, lines end with an NL / \025 / 0x15.
ASCII text files will not contain a 0x15 / NAK, and EBCDIC text files will not contain a 0x0a / SMM, so look for both:
If only one of them is found, you know the character set.
If both are found, the file is a binary file, and not a text file, so reject the file.
If neither is found, the file could have just one line of text, in which case further analysis might be needed. Hopefully that won't be the case here, so the simple test described so far should be enough.
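A minimal sketch of that test might look like this (the class and method names are my own invention, not part of the answer):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LineEndingSniffer {

    // Returns "ASCII", "EBCDIC", "BINARY" (both bytes found) or
    // "UNKNOWN" (neither found; possibly a one-line file).
    static String sniff(String path) throws IOException {
        boolean sawLf = false;  // 0x0a: LF in ASCII, SMM in EBCDIC
        boolean sawNl = false;  // 0x15: NL in EBCDIC, NAK in ASCII
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == 0x0a) sawLf = true;
                if (b == 0x15) sawNl = true;
            }
        }
        if (sawLf && sawNl) return "BINARY";
        if (sawLf) return "ASCII";
        if (sawNl) return "EBCDIC";
        return "UNKNOWN";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sniff(args[0]));
    }
}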

Related

Export (Android/Java) string data with extended characters for import into Excel

I need to export string data that includes the 'degrees' symbol ("\u00B0"). This data is exported as a CSV text file with UTF-8 encoding. As would be expected, the degrees symbol is encoded as two bytes (0xC2, 0xB0) in the UTF-8 output. When the CSV file is imported into Excel, it is displayed as a capital A with a circumflex accent, followed by the degrees symbol.
I know that "UTF-8" only represents 7-bit ASCII as single bytes, not 8-bit "extended ASCII", and that "US-ASCII" supports 7-bit ASCII, period.
Is there some way to specify encoding such that the 0xC2 prefix byte is suppressed?
I'm leaning toward allowing normal processing to occur, then reading & overwriting the file contents, stripping the extra byte.
I'd really prefer a more elegant solution...
Excel assumes csv files are in an 8-bit code page.
To get Excel to parse your csv as UTF-8, you need to add a UTF-8 Byte Order Mark to the start of the file.
Edit:
If you're in Western Europe or US, Excel will likely use Windows-1252 character set for decoding and encoding when encountering files without a Unicode Byte Order Mark.
As 0xC2 and 0xB0 are both legal Windows-1252 characters, Excel will decode to the following:
0xC2 = Â
0xB0 = °
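A sketch of the Byte Order Mark fix described above (the file name and row contents here are placeholders):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CsvWithBom {
    public static void main(String[] args) throws Exception {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("out.csv"), StandardCharsets.UTF_8)) {
            w.write('\uFEFF');                        // BOM, written as EF BB BF
            w.write("temperature\n21.5\u00B0\n");     // degrees symbol round-trips
        }
    }
}

With the leading EF BB BF bytes in place, Excel decodes the rest of the file as UTF-8 and the stray Â disappears.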

Delete non-UTF-8 characters in Eclipse using regex

Is there a way to do that in Eclipse? I have a lot of non-UTF-8 characters like sch�ma or propri�t� (it's French :)). For now, I am deleting those characters by hand. How can I remove them?
Those characters are in the UTF-8 character set.
Either the text is encoded incorrectly or you have your file encoding set incorrectly in Eclipse.
Try right clicking the file -> properties. Then check that the text file encoding is set to UTF-8; if it's not, select Other and change it to UTF-8.
I would write a little program that reads a file, removes all chars > 127, and writes the result back to the file. [I would pass the file names as command line arguments]
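Something like this sketch, assuming the files should be read as UTF-8 (malformed bytes then decode to U+FFFD, which is above 127 and gets stripped too):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StripNonAscii {
    public static void main(String[] args) throws Exception {
        for (String name : args) {                    // file names from the command line
            Path p = Paths.get(name);
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            String ascii = text.replaceAll("[^\\x00-\\x7F]", "");   // drop chars > 127
            Files.write(p, ascii.getBytes(StandardCharsets.US_ASCII));
        }
    }
}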

Perform binary search on a file written in UTF format

Is there a way to perform binary search on a file stored in UTF format in sorted order? I am able to perform binary search on a text file using RandomAccessFile: first I find the length of the file, then jump to the middle position using seek(), and after jumping to the middle I read the bytes. However, I am not finding this feasible for a file stored in UTF format, as the jump can land in the middle of a multibyte character. Also, with DataInputStream I am unable to jump to a particular position in the file. Is it possible to do binary search on such a file? If yes, then using which classes?
Yes, it is possible. If you jump into the middle of a file, you will first need to go to the nearest record separator and then use the text starting after the record separator.
Depending on the exact file format you have, a line feed, a TAB character or something similar could be used as the record separator.
Locating the record separator is easy if it is a character with a Unicode number below 32 (which NL, CR, TAB fulfill). Then you don't need to care about the multibyte UTF-8 encoding (for locating the separator). If it's a wide character Unicode format, then it isn't much more difficult either.
DataInputStream is the wrong class for random access. (Streaming is sort of the opposite of random access.) Have a look at RandomAccessFile instead.
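A sketch of that approach, assuming LF-terminated UTF-8 records where the whole line is the key and the file is sorted in String.compareTo order (all names here are mine):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class SortedFileSearch {

    // Binary search over byte offsets; each probe realigns on the next
    // record separator before comparing. Returns the matching line or null.
    static String search(RandomAccessFile raf, String key) throws IOException {
        long lo = 0, hi = raf.length();
        while (lo < hi) {
            long mid = (lo + hi) / 2;
            raf.seek(mid);
            if (mid > 0) skipPastSeparator(raf);   // discard the partial record
            long start = raf.getFilePointer();
            if (start >= hi) {                     // no record starts in (mid, hi)
                hi = mid;
                continue;
            }
            String line = readUtf8Line(raf);
            int cmp = key.compareTo(line);
            if (cmp == 0) return line;
            if (cmp < 0) hi = mid;                 // target is at or before mid
            else lo = raf.getFilePointer();        // target starts after this line
        }
        return null;
    }

    // Safe to scan byte-by-byte: in UTF-8, bytes below 0x80 never occur
    // inside a multibyte sequence, so 0x0a is always a real line feed.
    static void skipPastSeparator(RandomAccessFile raf) throws IOException {
        int b;
        while ((b = raf.read()) != -1 && b != '\n') { }
    }

    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = raf.read()) != -1 && b != '\n') {
            buf.write(b);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}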

String encoding doesn't output all characters

My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, it shows extra weird special symbols like 'Â'.
I've tried converting the String to ASCII, but that messes up å, ä, ö, Ø, which I use.
I've tried food = food.replace("Â", "") and indexOf(), but the string won't find it. Yet it's there in a hex editor.
So, in summary: when I use text.setText() (Android), the output looks fine with NO weird symbols, but when I save the text to *.txt I get about 4 of 'Â'. I do not want ASCII, because I use other non-ASCII characters.
The 'Â' is displayed as whitespace on my Android and in Notepad.
Thanks!
Have A great Weekend!
EDIT:
Solved it by removing all Non-breaking-spaces:
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
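A sketch of that, with the charset extraction kept deliberately naive (a real client should use a proper Content-Type parser):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchWithCharset {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(args[0]).openConnection();
        String charset = "UTF-8";                   // fallback only
        String contentType = conn.getContentType(); // e.g. "text/html; charset=ISO-8859-1"
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = param.substring("charset=".length());
                }
            }
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}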
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it detects that a file is UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting 'Â' characters. The Unicode NON-BREAKING-SPACE character is U+00A0. When you encode that as UTF-8, you get the bytes C2 A0. But C2 in Latin-1 is CAPITAL-A-CIRCUMFLEX, and A0 in Latin-1 is NON-BREAKING-SPACE. So the "confusion" is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
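That round trip is easy to reproduce in a few lines:

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String nbsp = "\u00a0";                                // NON-BREAKING SPACE
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);   // [C2, A0]
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread);                           // "Â" followed by NBSP
    }
}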

How to access file names with non-English characters

I have a problem when dealing with non-English filenames.
My program cannot guarantee that directories and filenames are in English; if some filenames use Japanese or Chinese characters, it will display characters like '?'.
Can anybody suggest what I need to do to access non-English file names?
The problem is apparently that "it" is using the wrong character set to display the filenames. The solution depends on whether "it" is your program (via a GUI), some other application, the command shell / terminal emulator, or the user's web browser. If you could provide more information, maybe I could offer some suggestions.
But turning the characters into underscores is most likely a bad solution. It is liable to lead to filename clashes, and those Chinese / Japanese / etc characters are most likely meaningful to the people who created the files.
By the way, the correct term for "english" letters is Latin.
EDIT
For your use-case, you don't need to store the PDF file using a filename that bears any relation to the supplied filename. I suggest that you try to solve the problem by using a filename consisting of digits and Latin letters generated from (say) System.currentTimeMillis(). If that fails, then your real problem has nothing to do with the filenames at all.
EDIT 2
You ask about the statement
if (fileName.startsWith("=?iso-8859"))
This seems to be trying to unpick a filename in MIME encoded-word format; see RFC 2047 Section 2.
Firstly, I think that code may be unnecessary. The javadoc is not specific, but I think that the Part.getFileName() method should deal with decoding of the filename.
Second, if the decoding is necessary, then you are going about it the wrong way. The stuff after the charset cannot simply be treated as the value of the filename. Look at the RFC.
Third, if you need to, you should use the relevant MimeUtility methods to decode "word" tokens ... like the filename.
Fourthly, ISO-8859-1 is NOT a suitable encoding for characters in non-Latin character sets.
Finally, examine the raw email headers of the emails that you are trying to decode and look for the header line that starts
Content-Disposition: attachment; filename=...
If the filename looks like "=?iso-8859-1?...", and the filename is supposed to contain japanese / chinese / etc characters, then the problem is in the client (or whatever) that constructed the email. The character set needs to be "utf-8" or one of the other multibyte character sets.
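For the third point, JavaMail's MimeUtility can do the decoding; a sketch (the encoded-word sample here is invented for illustration, and the javax.mail jar must be on the classpath):

import javax.mail.internet.MimeUtility;

public class FilenameDecoder {
    public static void main(String[] args) throws Exception {
        // An RFC 2047 encoded-word, as it might appear in a filename parameter.
        String raw = "=?iso-8859-1?Q?r=E9sum=E9.pdf?=";
        String decoded = MimeUtility.decodeText(raw);   // "résumé.pdf"
        System.out.println(decoded);
    }
}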
Java uses Unicode natively - you don't need to replace special characters, as Unicode has no special characters - every code point is treated equally. Your replaceSpChars() may be the culprit here.
