I have a Java JSP/servlet application running in Tomcat, fronted by Apache.
The server side checks that only letters in the ranges [A..Z] and [a..z], digits, and punctuation symbols are accepted.
However, when, for example, a Chinese character is entered, the value on the server side looks something like '&#5960;'.
Hence, as far as the server side is concerned, these are valid punctuation symbols and digits.
Any pointers that can help? This is driving me insane after a 10-hour coding marathon.
You can use Apache Commons Lang's StringEscapeUtils.unescapeHtml() in Java.
unescapeHtml(String str)
does the following:
Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes.
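For example, a minimal sketch assuming Commons Lang 2.x is on the classpath (in Commons Lang 3 the method is named unescapeHtml4()):

import org.apache.commons.lang.StringEscapeUtils;

String raw = "&#5960;"; // a numeric entity as received from the form
String decoded = StringEscapeUtils.unescapeHtml(raw);
System.out.println(decoded); // prints the actual Unicode character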
You need to process the text using a Unicode encoding like UTF-8.
First make sure your server is handling requests with UTF-8 encoding. Where you set or configure that will depend on how you're implementing your JSPs/Servlets, but see: http://docs.oracle.com/javaee/6/api/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)
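A minimal sketch of a servlet Filter that does this, assuming you want UTF-8 applied to every request before any parameter is read:

import java.io.IOException;
import javax.servlet.*;

public class CharsetFilter implements Filter {
    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (request.getCharacterEncoding() == null) {
            // Must run before the first getParameter() call, or it has no effect
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }
}

Map the filter to /* in web.xml so it runs for every request.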
com.google.gwt.json.client.JSONParser.parseStrict(jsonStr) throws a syntax error when the JSON string contains non-printable/unrecognized Unicode characters. Hence, I am trying to remove non-printable Unicode characters on the client side. Following How can I replace non-printable Unicode characters in Java?, I'm trying to implement that code on the client side, but Character.getType(codePoint) isn't GWT-compatible on the client.
Is there any way to overcome this issue? Any other way to get the character type on the client side? Any other suggestions on how to solve the main problem?
Many Thanks!
David.
By the way, I've tried my_string.replaceAll("\\p{C}", "?") and it worked on the server side but not on the client side.
You can add a native JS method and use a regex inside it to replace every non-printable or non-ASCII character, like this:
private native String replaceUnprintableChars(String text, String replacement) /*-{
    // Replace everything outside the printable ASCII range U+0020..U+007E
    return text.replace(/[^\u0020-\u007E]/g, replacement);
}-*/;
// Somewhere else....
String replacedText = replaceUnprintableChars(originalString, "?");
The regex shown will replace every non-printable or non-ASCII character with the replacement string (e.g. "?"). If you want to keep non-ASCII printable characters (Latin accented letters, for example), you can tweak the expression to broaden the range.
Of course, you can do this with Java regexes too:
text.replaceAll("[^\\u0020-\\u007E]", "?");
But I came up with the JS solution first, don't know why!
com.google.common.base.CharMatcher in Guava looks promising. I have not tested this, but the class is annotated @GwtCompatible.
There is a guide to using it in the StringsExplained article in the Guava docs.
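Untested, but a sketch along these lines should work (assuming you only want to keep printable ASCII):

import com.google.common.base.CharMatcher;

// Replace everything outside U+0020..U+007E with '?'
String cleaned = CharMatcher.inRange('\u0020', '\u007E')
        .negate()
        .replaceFrom(originalString, '?');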
I've replaced com.google.gwt.json.client.JSONParser.parseStrict(jsonStr) with com.google.gwt.json.client.JSONParser.parseLenient(jsonStr), and the lenient parser was able to handle those non-printable/unrecognized Unicode characters just as the server side does.
I feel comfortable using JSONParser.parseLenient since the jsonStr doesn't come from user input.
I am trying to do the above. One option is to get a set of characters which count as special characters and then handle them with some Java logic, but then I have to make sure I include all special characters.
Is there any better way of doing this?
You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char), which returns an int matching one of the constant values of Character, such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character; you then need to decide which categories count as 'special' characters and which you will accept as part of the text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class offers methods which help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
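A hedged sketch of how that fits together (here text is assumed to be the String being validated, and the category checks are just examples):

for (int i = 0; i < text.length(); ) {
    int cp = text.codePointAt(i);             // correct even for supplementary characters
    int type = Character.getType(cp);
    boolean special = type == Character.CURRENCY_SYMBOL
                   || type == Character.MATH_SYMBOL
                   || type == Character.OTHER_PUNCTUATION;
    // ... accept or reject the character based on its category ...
    i += Character.charCount(cp);             // advance by one or two chars
}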
Also be aware that Java 6 is far less knowledgeable about Unicode than Java 7. The newer version has added far more to its Unicode database, and code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category, so bear this in mind when writing your code.
It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\\p{Cc}]+", "");
I have to use HttpClient 2.0 (I cannot use anything newer), and I am running into the following issue: when I use the method (POST, in this case), it encodes the parameters as hexadecimal ASCII escapes, and spaces are turned into '+' (something the receiver doesn't want).
Does anyone know a way to avoid it?
Thanks a lot.
Even your browser does that, converting the space character into '+'. See the URLEncoder documentation: http://download.oracle.com/javase/1.5.0/docs/api/java/net/URLEncoder.html
It URL-encodes the string, typically converting characters to UTF-8 percent-escapes.
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
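You can see these rules in action with a small sketch (exception handling omitted; URLEncoder.encode declares UnsupportedEncodingException):

import java.net.URLEncoder;

String encoded = URLEncoder.encode("a b!", "UTF-8");
System.out.println(encoded); // prints "a+b%21"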
Also, see here http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
To answer your question: if you do not want the encoded form, URLDecoder.decode will help you undo the encoded string.
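For example, a minimal sketch of the receiving side (exception handling omitted):

import java.net.URLDecoder;

String encoded = "hello+world%21";
String decoded = URLDecoder.decode(encoded, "UTF-8"); // "hello world!"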
You could in theory avoid this by constructing the query string or request body containing parameters by hand.
But this would be a bad thing to do, because the HTML, HTTP, URL and URI specs all mandate that reserved characters in request parameters are encoded. And if you violate this, you may find that server-side HTTP stacks, proxies and so on reject your requests as invalid, or misbehave in other ways.
The correct way to deal with this issue is to do one of the following:
If the server is implemented in Java EE technology, use the relevant servlet API methods (e.g. ServletRequest.getParameter(...)) to fetch the request parameters. These will take care of any decoding for you.
If the parameters are part of a URL query string, you can instantiate a java.net.URI object and use its getQuery() method to return the query with the percent-encoding removed.
If your server is implemented some other way (or if you need to unpick the request URL's query string or POST data yourself), then use URLDecoder.decode or equivalent to remove the % encoding and replace the '+'s ... after you have figured out where the query and parameter boundaries, etc. are.
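For the query-string case, a minimal sketch (note that java.net.URI decodes %-escapes but does not turn '+' into a space; that rule is specific to form encoding):

import java.net.URI;

URI uri = new URI("http://example.com/search?q=hello%20world"); // declares URISyntaxException
System.out.println(uri.getQuery()); // prints "q=hello world"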
I'm stumbling upon a few xxx_fr.properties, xxx_en.properties, etc. files and I'm a bit surprised, for they contain both HTML entities and \uXXXX escapes.
I guess the HTML entities are fine as long as these resources are served to something expecting HTML, but what about the \uXXXX escapes?
Does Java specify that \uXXXX escapes are fine in .properties files?
Yes - see the documentation for load(Reader), which states:
Characters in keys and elements can be represented in escape sequences similar to those used for character and string literals.
and then clarifies that
Only a single 'u' character is allowed in a Unicode escape sequence.
Hence a Unicode escape sequence containing a single 'u' character is definitely supported.
Note that there's nothing special going on here at loading time with HTML entities - the string &amp; for example would simply be seen within Java as a String containing 5 characters. As you point out, this might be interpreted in a special way if it were output to some other component later.
On the other hand, the escape sequence \u0061 would be seen within Java as the single-character string 'a', and would be indistinguishable from the file having contained that character instead.
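As a quick sketch (the file name and key are made up for illustration): given a messages.properties containing

greeting=caf\u00e9

loading it yields the decoded character (exception handling omitted):

import java.io.FileReader;
import java.util.Properties;

Properties props = new Properties();
props.load(new FileReader("messages.properties"));
System.out.println(props.getProperty("greeting")); // prints "café"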
The \u type of escaping is the standard Java way of representing Unicode characters. You can read about it in the Java Internationalization FAQ, with "How do I specify non-ASCII strings in a properties file?" being the question you're most interested in:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#properties-escape
And that's not related only to Properties; you can use those escapes in your typical Java code as well. See the Text Representation section:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#core-textrep
I am dealing with non-English filenames.
The problem is that my program cannot guarantee those directories and filenames are in English; if some filenames use Japanese or Chinese characters, it will display characters like '?'.
Can anybody suggest what I need to do to access non-English filenames?
The problem is that my program cannot guarantee those directories and filenames are in English. If a filename uses Japanese or Chinese characters, it will display some characters like '?'.
The problem is apparently that "it" is using the wrong character set to display the filenames. The solution depends on whether "it" is your program (via a GUI), some other application, the command shell / terminal emulator, or the user's web browser. If you could provide more information, maybe I could offer some suggestions.
But turning the characters into underscores is most likely a bad solution. It is liable to lead to filename clashes, and those Chinese / Japanese / etc characters are most likely meaningful to the people who created the files.
By the way, the correct term for "english" letters is Latin.
EDIT
For your use-case, you don't need to store the PDF file using a filename that bears any relation to the supplied filename. I suggest that you try to solve the problem by using a filename consisting of Latin letters and digits generated from (say) System.currentTimeMillis(). If that fails, then your real problem has nothing to do with the filenames at all.
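Something along these lines (the prefix and extension are just illustrative):

// Generate a safe, ASCII-only name instead of reusing the supplied one
String safeName = "upload-" + System.currentTimeMillis() + ".pdf";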
EDIT 2
You ask about the statement
if (fileName.startsWith("=?iso-8859"))
This seems to be trying to unpick a filename in MIME encoded-word format; see RFC 2047, Section 2.
Firstly, I think that code may be unnecessary. The javadoc is not specific, but I think the Part.getFileName() method should deal with decoding of the filename.
Second, if the decoding is necessary, then you are going about it the wrong way. The stuff after the charset cannot simply be treated as the value of the filename. Look at the RFC.
Third, if you need to, you should use the relevant MimeUtility methods to decode "word" tokens ... like the filename.
Fourthly, ISO-8859-1 is NOT a suitable encoding for characters in non-Latin character sets.
Finally, examine the raw email headers of the emails that you are trying to decode and look for the header line that starts
Content-Disposition: attachment; filename=...
If the filename looks like "=?iso-8859-1?...", and the filename is supposed to contain Japanese/Chinese/etc. characters, then the problem is in the client (or whatever) that constructed the email. The character set needs to be "utf-8" or one of the other multibyte character sets.
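If you do end up decoding it yourself, a minimal sketch with JavaMail's MimeUtility (the sample value is hypothetical):

import javax.mail.internet.MimeUtility;

String raw = "=?utf-8?B?5paH5Lu2LnBkZg==?="; // hypothetical encoded-word filename
String decoded = MimeUtility.decodeText(raw); // declares UnsupportedEncodingException; yields "文件.pdf"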
Java uses Unicode natively - you don't need to replace special characters, as Unicode has no special characters - every code point is treated equally. Your replaceSpChars() may be the culprit here.