How to read non-English characters in filename, using Java programme

I'm trying to read mail in my outbox which usually contains one attached pdf file. If the pdf file name contains English characters, the function below works fine. But if the file name contains any non-English character (for example, filename1(chinesecharacter).pdf) my function is not able to read it. Can anybody tell me what changes I need to make in my function?

Just simply check the ASCII (or Unicode?) values against the range(s) of values with English characters. Every character corresponds to a number in its character set.
Or you could create an array of all English characters, and check it against that. There may also be an API function in Java.

This check suggests you may have a problem decoding file names in character sets other than ISO 8859, e.g. UTF-8, because the handling of RFC 2047 encoded file names is too weak:
    if (fileName.startsWith("=?iso-8859"))
    {
        String strFolder = strFolderName.substring(strFolderName.lastIndexOf("/") + 1,
                strFolderName.length());
        fileName = strFolder + ".pdf";
    }
http://en.wikipedia.org/wiki/MIME#Encoded-Word

Related

Using Get-Content in PowerShell as Java input adds an extra character

I am practicing running a Java program from the command line on Windows 10. The program uses Scanner(System.in) to read input from a file and prints each line it reads. The PowerShell command is as follows:
Get-Content source.txt | java test.TestPrint
The content of the source.txt file is as follows:
:
a
2
!
And the TestPrint.java file is as follows:
    package test;

    import java.util.Scanner;

    public class TestPrint {
        public static void main(String[] args) {
            // Echo each line from standard input until a line containing only "q"
            Scanner in = new Scanner(System.in);
            while (in.hasNextLine())
            {
                String str = in.nextLine();
                if (str.equals("q")) break;
                System.out.println(str);
            }
        }
    }
Then something weird happened. The result is:
?:
a
2
!
As you can see, a question mark is added at the beginning of the first line. When I change the first character of source.txt from ":" to "a", the result is:
a
a
2
!
This time a space is added at the beginning of the first line.
I tested various characters and found a pattern: if the character is greater than "?" (63 in ASCII), such as "A" (65) or "[" (91), a space is added. If the character is less than or equal to "?", a question mark is added.

Could this be a Unicode issue (see: Java Unicode problems)? Try specifying the encoding you want to read with:
Scanner in = new Scanner(System.in, "UTF-8");
EDIT:
Upon further research: in PowerShell 5.1 and earlier, the default code page is Windows-1252, while PowerShell 6+ and the cross-platform versions have switched to UTF-8. So (from the comments) you may have to specify the Windows-1252 encoding:
Scanner in = new Scanner(System.in, "Windows-1252");
To find out what encoding is being used, execute the following in PowerShell:
[System.Text.Encoding]::Default
And you should be able to see what encoding is being used (for me in PowerShell v 5.1 it was Windows-1252, for PowerShell 6 it was UTF-8).

There is no text but encoded text.
Every program reading a text file or stream must know and use the same character encoding that the writer used.
An adaptive default character encoding is a 90s solution to a 70s and 80s problem (approx). Today, it's usually best to avoid constructors and methods that use a default, and in PowerShell, add an encoding argument where needed to control input or output.
To prevent data loss, you can use the Unicode character set throughout. UTF-8 is the most common for files and streams. (PowerShell and Java use UTF-16 for text datatypes.)
But you need to start from what you know the character encoding of the text file is. If you don't know this metadata, that's data loss right there.
Unicode provides that if a file or stream is known to be Unicode, it can start with metadata called a BOM. The BOM indicates which specific Unicode character encoding is being used and what the byte order is (for character encodings with code units longer than a byte). [This provision doesn't solve any problem that I've seen and causes problems of its own.]
(A character encoding, at the abstract level, is a map between codepoints and code units and is therefore independent of byte order. In practice, a character encoding takes the additional step of serializing/deserializing code units to/from byte sequences. So, sometimes using or not using a BOM is included in the encoding's name or description. A BOM might also be referred to as a signature. Ergo, "UTF-8 with signature.")
As metadata, a BOM, if present, should be used if needed and always discarded when putting text into text datatypes. Unfortunately, Java's standard libraries don't discard the BOM. You can use popular libraries or a dozen or so lines of your own code to do this.
Again, start by knowing the character encoding of the text file, and pass that metadata into the processing as an argument.
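The "dozen or so lines of your own code" could look something like the sketch below. It assumes the input is already known to be UTF-8; the class name BomSkippingReader and the method openUtf8 are made up for illustration and are not part of any library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a stream as UTF-8 and drops a leading BOM.
public class BomSkippingReader {

    /** Opens the stream as UTF-8 text and discards a leading U+FEFF, if present. */
    public static BufferedReader openUtf8(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                 // remember the start of the text
        if (reader.read() != 0xFEFF) {
            reader.reset();             // first char was not a BOM: rewind
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // The UTF-8 BOM (EF BB BF) followed by two short lines.
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, ':', '\n', 'a', '\n'};
        try (BufferedReader r = openUtf8(new ByteArrayInputStream(bytes))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

For piped input, the same helper could wrap System.in in place of the byte array, provided the sender really does write UTF-8.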

Java unexpected character parsing txt file

I am trying to split .txt files into an ArrayList of strings, and so far it works, but the first word in the file always starts with the character (int) 65279, which I can't even copy here. Also, in the GUI it looks like the second letter of the word is missing, while in the console it looks fine. The other words are as they should be. I am using UTF-8 encoded .txt files. How can I change the format in NetBeans and in the GUI made in this IDE?

U+FEFF is the byte order mark (65279 in decimal). It's used to indicate the character encoding/endianness (so you can easily tell the difference between big- and little-endian UTF-16, for example).
If it's causing you a problem, the simplest thing is just to strip it:
    if (text.startsWith("\ufeff")) {
        text = text.substring(1);
    }

In Java, How to detect if a string is unicode escaped

I have a properties file which may or may not contain Unicode-escaped characters in its values. Please see the sample below. My job is to ensure that if a value in the properties file contains a non-ASCII character, it is Unicode-escaped. So, in the sample below, the first entry is OK, and all entries like the second one should be converted to look like the first.
##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling
Essentially my question is how can I differentiate in Java between cari\u00F1o and cariño since as far as Java is concerned it treats them as identical.

Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly. That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages, such as those from Eastern Europe, Russia, or China, without escaping them.
As such there are only a few non-ascii characters that can appear in a properties file without being escaped.
To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class. The Properties class does all the unescaping for you when you load a file through it. You should open the file as an InputStream, using FileInputStream or Class.getResourceAsStream. Once you do so, you can scan through the input stream one byte at a time and ensure that all bytes are in the 0x20-0x7E range, plus the line terminators \r and \n; that is the ASCII range of characters you would expect in a properties file.
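A minimal sketch of that byte scan follows. The class and method names are invented for this example, and allowing the tab character is an extra assumption on top of the range given above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical checker: reports the first byte that is not plain printable ASCII.
public class PropertiesAsciiCheck {

    /** Returns the offset of the first disallowed byte, or -1 if every byte is allowed. */
    public static long firstNonAsciiOffset(InputStream in) throws IOException {
        long offset = 0;
        int b;
        while ((b = in.read()) != -1) {
            // Printable ASCII plus CR and LF; tab is allowed as an assumption.
            boolean allowed = (b >= 0x20 && b <= 0x7E) || b == '\r' || b == '\n' || b == '\t';
            if (!allowed) {
                return offset;
            }
            offset++;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Simulate two one-line properties files saved as ISO-8859-1.
        byte[] escaped = "escaped=cari\\u00F1o\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] raw = "nonescaped=cari\u00F1o\n".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(firstNonAsciiOffset(new ByteArrayInputStream(escaped))); // -1
        System.out.println(firstNonAsciiOffset(new ByteArrayInputStream(raw)));     // 15
    }
}
```

For a real file, a FileInputStream wrapped in a BufferedInputStream would replace the in-memory streams.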
I would suggest that your translators don't try to write properties files directly. They should provide you with documents like spreadsheets that you convert into properties file. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.

You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).
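If running an external tool is not convenient, the same kind of conversion can be sketched in a few lines of plain Java. This is an illustration only, not the tool's actual implementation; the class UnicodeEscaper and its method are hypothetical names, and the choice of uppercase hex digits simply follows the sample file in the question.

```java
// Hypothetical helper: rewrites non-ASCII characters as backslash-u escapes.
public class UnicodeEscaper {

    /** Leaves ASCII untouched and escapes every other UTF-16 code unit. */
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c); // ASCII, including escapes that are already present
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first argument contains a raw n-with-tilde, the second is already escaped.
        System.out.println(escapeNonAscii("nonescaped=cari\u00F1o"));
        System.out.println(escapeNonAscii("escaped=cari\\u00F1o"));
    }
}
```

Both calls should produce a value ending in the escaped form of "cariño"; the second line passes through unchanged because it is already pure ASCII.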

Your problem is that the Java Properties class decodes the properties files, assuming ISO-8859-1 encoding, and parsing escaped unicode characters.
So from a Properties point of view, these two strings are indeed the same.
I believe if you need to differentiate these two, you will need to write your own parser.
It's actually a feature that you do not need to care about this by default. The one thing that strikes me as most odd is that the (only) supported encoding is ISO-8859-1, probably for historical reasons.

The library ICU4J seems to be what you're looking for. See the Normalization page.

Adding an annotation with Hebrew letters in iText

When I add an annotation using:
PdfAnnotation.createFileAttachment(writer,null,null , null, , "שם קובץ", "שם קובץ");
the Hebrew letters in the annotation are not shown.
Is there a way to fix it?

You're using Hebrew characters in your code. That's not safe. Please replace them with a unicode notation (you'll need to know their unicode value; for instance \u00a0 is the value for a non-breaking space). If you don't do this, compilers could interpret the characters incorrectly (see the remarks that were given).
It appears to me that you don't have the correct number of parameters in the method. I assume that you're using this method.
You're using a 'short-cut' method that assumes that the characters aren't Unicode characters. Please don't. Use the method where you create a PdfFileSpecification object, and use methods such as setUnicodeFileName() with the unicode parameter set to true. This way, iText knows that the characters should be interpreted as Unicode characters.
You probably want the characters to appear from right to left. I don't know if this is supported in PDF. I browsed ISO-32000-1 and looked at Table 44 (Entries in a file specification dictionary), but all I saw was: Unicode text string that provides file specification of the form described in 7.11.2, "File Specification Strings." This is a text string encoded using PDFDocEncoding or UTF-16BE with a leading byte-order marker (as defined in 7.9.2.2, "Text String Type"). You'll have to dig into those sections if you want to know more.
You pass null as the value for the Rectangle. That doesn't make sense. Are you sure you want to add a file attachment annotation? Based on your code I would assume that you want to add a document-level attachment instead. That's done like this: writer.addFileAttachment(fs); with fs an instance of the PdfFileSpecification class.

How to access a file name with non-English characters

I have a problem when dealing with non-English file names.
The problem is that my program cannot guarantee that those directories and file names are in English. If a file name uses Japanese or Chinese characters, it will display a character like '?'.
Can anybody suggest what I need to do to access a non-English file name?

The problem is that my program cannot guarantee that those directories and file names are in English. If a file name uses Japanese or Chinese characters, it will display a character like '?'.
The problem is apparently that "it" is using the wrong character set to display the filenames. The solution depends on whether "it" is your program (via a GUI), some other application, the command shell / terminal emulator, or the user's web browser. If you could provide more information, maybe I could offer some suggestions.
But turning the characters into underscores is most likely a bad solution. It is liable to lead to filename clashes, and those Chinese / Japanese / etc characters are most likely meaningful to the people who created the files.
By the way, the correct term for "english" letters is Latin.
EDIT
For your use case, you don't need to store the PDF file using a filename that bears any relation to the supplied filename. I suggest that you try to solve the problem by using a filename consisting of Latin letters and digits generated from (say) System.currentTimeMillis(). If that fails, then your real problem has nothing to do with the filenames at all.
EDIT 2
You ask about the statement
if (fileName.startsWith("=?iso-8859"))
This seems to be trying to unpick a filename in MIME encoded-word format; see RFC 2047, Section 2.
Firstly, I think that code may be unnecessary. The javadoc is not specific, but I think that the Part.getFileName() method should deal with decoding of the filename.
Second, if the decoding is necessary, then you are going about it the wrong way. The stuff after the charset cannot simply be treated as the value of the filename. Look at the RFC.
Third, if you need to, you should use the relevant MimeUtility methods to decode "word" tokens ... like the filename.
Fourthly, ISO-8859-1 is NOT a suitable encoding for characters in non-Latin character sets.
Finally, examine the raw email headers of the emails that you are trying to decode and look for the header line that starts
Content-Disposition: attachment; filename=...
If the filename looks like "=?iso-8859-1?...", and the filename is supposed to contain japanese / chinese / etc characters, then the problem is in the client (or whatever) that constructed the email. The character set needs to be "utf-8" or one of the other multibyte character sets.
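To make the encoded-word format concrete, here is a rough, standard-library-only sketch of what decoding involves. It is an illustration under simplifying assumptions, not JavaMail's implementation: the class EncodedWordSketch is a made-up name, and the sketch ignores details such as dropping whitespace between adjacent encoded words. In a real mail program the MimeUtility methods mentioned above are the better choice.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified decoder for RFC 2047 encoded-words.
public class EncodedWordSketch {

    // Shape of an encoded-word: =?charset?encoding?encoded-text?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([bBqQ])\\?([^?]*)\\?=");

    /** Replaces each encoded-word in the value with its decoded text. */
    public static String decode(String value) {
        Matcher m = ENCODED_WORD.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1));
            byte[] raw = m.group(2).equalsIgnoreCase("B")
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(raw, charset)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** "Q" encoding: '_' stands for a space and "=XX" for the byte with hex value XX. */
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // A UTF-8, Base64-encoded name and an ISO-8859-1, Q-encoded name.
        System.out.println(decode("=?UTF-8?B?5paH5Lu2LnBkZg==?="));
        System.out.println(decode("=?ISO-8859-1?Q?caf=E9.pdf?="));
    }
}
```

The point of the sketch is that the charset, the encoding letter, and the encoded text all have to be honoured; treating everything after the "=?iso-8859" prefix as the filename cannot work.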

Java uses Unicode natively - you don't need to replace special characters, as Unicode has no special characters - every code point is treated equally. Your replaceSpChars() may be the culprit here.
