Remove illegal characters from file in notepad++ or java

Remove illegal characters from file in notepad++ or java - java

I have a huge file and that file contains a lot of illegal characters like in the image below, but these are not all. They are of many different kinds so it's not possible to search for them all and replace them.
Is there a way i can remove these characters. I've tried a lot of solutions like converting to ANSI, or some regex expression but they didn't work. Please help.
EDIT: Even if anyone can tell me how to remove these characters in java, that will be fine too.

Instead of removing specific characters it's easier to implement a white-list filter if you know which types of characters you are expecting.
As per this answer, which explains how to remove emoticons you can try:
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");
To understand what \p{} groups are available look at Classes for Unicode scripts, blocks, categories and binary properties docs:
\p{IsLatin} A Latin script character (script)
\p{InGreek} A character in the Greek block (block)
\p{Lu} An uppercase letter (category)
\p{IsAlphabetic} An alphabetic character (binary property)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)

Related

Regex for matching Unicode pattern

I am trying to validate a file's content when is uploaded and I am stuck at the Unicode encoding. I am not interested to find Unicode special characters, that are not in the ASCII range. I am trying to find if the content of the file contains at least one Unicode pattern, like \u0046 for example.
For example, I exclude any file that contains the 'script' word, but what if the file contains this word written in Unicode? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?
So, as far as I have searched on the Internet, I've seen Unicode characters written like \u0046, or like U+0046. Based on this, I have written the following regex:
(\\u|U\+)....
This means, \u or U+ followed by four characters. This pattern accomplishes what I desire, but I wonder if there are any other ways to write a Unicode character. It is always \u or U+? Can it be more or less than 4 characters after \u or U+?
Thanks

The notation U+Any-number-of-hex-digits belongs to Unicode will not be functional anywhere in code. In java source code and *.properties \u followed by four hex digits is a UTF-16 encoding of Unicode, automatically parsed.
The pattern to search for that:
"\\\\u[0-9A-Fa-f]{4}"
Or a String.contains on:
"\\u"
In other languages than Java \Uxxxxxx (six hex chars) is possible, for the full UTF-32 range. Unfortunately upto Java 8 not so.

How can Java handle characters that are unencodeable in UTF-16?

Since Java holds characters internally in UTF-16, what if you need to output in a certain encoding that includes characters that are not in unicode at all?

Java can only handle characters which are present in Unicode, basically. Text outside the BMP (i.e. above U+FFFF) is encoded as surrogate pairs (as each char is a UTF-16 code unit)... but if you want characters which aren't in Unicode at all, you're on your own - you could probably find some area of Unicode which is reserved for private use, and map the characters there... but you may well have "fun" in all kinds of odd ways.
Do you definitely need to handle characters which aren't in Unicode? I thought it covered almost everything these days...

Java regex to distinguish special characters while allowing non english chars

I am trying to do above. One option is get a set of chars which are special characters and then with some java logic we can accomplish this. But then I have to make sure I include all special chars.
Is there any better way of doing this ?

You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character method does offer methods which help you detect this situation and work around it. See the Character.isSupplementaryCodepoint(int) and Character.codepointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than is Java 7. The newest version of Java has added far more to its Unicode database, but code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category, so you need to bear this in mind when writing your code.

It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\p{Cc}]+", "");

Pattern objects not matching with different languages

I have the following reg expression that works fine when the user's inputs English.
But it always fails when using Portuguese characters.
Pattern p = Pattern.compile("^[a-zA-Z]*$");
Matcher matcher = p.matcher(fieldName);
if (!matcher.matches())
{
....
}
Is there any way to get the pattern object to recognise valid Portuguese characters such as ÁÂÃÀÇÉÊÍÓÔÕÚç....?
Thanks

You want a regular expression that will match the class of all alphabetic letters. Across all the scripts of the world, there's loads of those, but luckily we can tell Java 6's RE engine that we're after a letter and it will use the magic of Unicode classes to do the rest. In particular, the L class matches all types of letters, upper, lower and “oh, that concept doesn't apply in my language”:
Pattern p = Pattern.compile("^\\p{L}*$");
// the rest is identical, so won't repeat it...
When reading the docs, remember that backslashes will need to be doubled up if placed in a Java literal so as to stop the Java compiler from interpreting them as something else. (Also be aware that that RE is not suitable for things like validating the names of people, which is an entirely different and much more difficult problem.)

It should work with "^\p{IsAlphabetic}*$", that takes into account Unicode characters. For reference see the options in http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Check out the Pattern doc and particularly the section on Unicode:
Unicode blocks and categories are written with the \p and \P
constructs as in Perl. \p{prop} matches if the input has the property
prop, while \P{prop} does not match if the input has that property.
Blocks are specified with the prefix In, as in InMongolian. Categories
may be specified with the optional prefix Is: Both \p{L} and \p{IsL}
denote the category of Unicode letters. Blocks and categories can be
used both inside and outside of a character class.
(for Java 1.4.x). I suspect you're interested in identifying Unicode letters and not particularly Portuguese letters?

Checking for specific strings with regex

I have a list of arbitrary length of Type String, I need to ensure each String element in the list is alphanumerical or numerical with no spaces and special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but not too sure on how to write the appropriate expression
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression which works but I feel that it may be bad in terms of it being unclear or to complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex

WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Uniocode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[[:alpha:]\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like [:alpha:] and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.

It is possible to refrase the description of needed regex to "contains at least one number" so the followind would work /.*[\pN].*/. Or, if you would like to limit your search to letters numbers and punctuation you shoud use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this /.*?[\pN].*?/. This way it would fail faster if there are no numbers.
I would like to recomend you a great book on regular expressions: Mastering regular expressions, it has a great introduction, in depth explanation of how regular expressions work and a chapter on regular expressions in java.

It looks like you just want to make sure that there are no spaces in the string. If so, you can this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).

Here is a partial answer, which does 0-9 and special characters OR 0-9.
^([\d]+|[\\/\-_]*)*$
This can be read as ((1 or more digits) OR (0 or more special char \ / - '_')) 0 or more times. It requires a digit, will take digits only, and will reject strings consisting of only special characters.
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.