Java Regexp to Match ASCII Characters - java

What regex would match any ASCII character in java?
I've already tried:
^[\\p{ASCII}]*$
but found that it didn't match lots of things that I wanted (like spaces, parentheses, etc.). I'm hoping to avoid explicitly listing all 128 ASCII characters in a format like:
^[a-zA-Z0-9!@#$%^*(),.<>~`[]{}\\/+=-\\s]*$

The first try was almost correct
"^\\p{ASCII}*$"

I have never used \\p{ASCII} but I have used ^[\\u0000-\\u007F]*$

If you only want the printable ASCII characters you can use ^[ -~]*$ - i.e. all characters between space and tilde.
https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart
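A minimal sketch comparing the patterns suggested so far. The class and method names (AsciiPatterns, isAscii, isPrintableAscii) are made up for illustration. It suggests that \p{ASCII} does accept spaces and parentheses, so the original failure probably came from somewhere else.

```java
import java.util.regex.Pattern;

class AsciiPatterns {
    // Any of the 128 ASCII code points (0x00 to 0x7F), control characters included.
    static final Pattern ANY_ASCII = Pattern.compile("^\\p{ASCII}*$");
    // The same range written out explicitly.
    static final Pattern ANY_ASCII_RANGE = Pattern.compile("^[\\x00-\\x7F]*$");
    // Printable ASCII only: space (0x20) through tilde (0x7E).
    static final Pattern PRINTABLE_ASCII = Pattern.compile("^[ -~]*$");

    static boolean isAscii(String s) {
        return ANY_ASCII.matcher(s).matches();
    }

    static boolean isPrintableAscii(String s) {
        return PRINTABLE_ASCII.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAscii("foo (bar) baz!"));      // true
        System.out.println(isAscii("caf\u00e9"));           // false, e-acute is not ASCII
        System.out.println(isAscii("tab\there"));           // true, tab is ASCII
        System.out.println(isPrintableAscii("tab\there"));  // false, tab is a control character
    }
}
```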

For JavaScript it'll be /^[\x00-\x7F]*$/.test('blah')

I think the question is about extracting the ASCII characters from a raw string that contains both ASCII and special characters.
public String getOnlyASCII(String raw) {
Pattern asciiPattern = Pattern.compile("\\p{ASCII}*$");
Matcher matcher = asciiPattern.matcher(raw);
String asciiString = null;
if (matcher.find()) {
asciiString = matcher.group();
}
return asciiString;
}
The method above returns the trailing run of ASCII characters in the string, dropping any non-ASCII text that precedes it. Thanks to @Oleg Pavliv for the pattern.
For example:
raw = ��+919986774157
asciiString = +919986774157
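If the non-ASCII characters can appear anywhere in the string, not just at the start, a negated class with replaceAll may be simpler. This is a sketch only; AsciiStripper and stripNonAscii are hypothetical names.

```java
class AsciiStripper {
    // Removes every character outside the ASCII range, wherever it occurs.
    static String stripNonAscii(String raw) {
        return raw.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Non-ASCII characters at the start and in the middle are both dropped.
        System.out.println(stripNonAscii("\u00e9+9199\u4e2d86774157")); // +919986774157
    }
}
```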

Related

Eliminating Unicode Characters and Escape Characters from String

I want to remove all Unicode Characters and Escape Characters like (\n, \t) etc. In short I want just alphanumeric string.
For example :
\u2029My Actual String\u2029
\nMy Actual String\n
I want to fetch just 'My Actual String'. Is there any way to do so, either by using a built in string method or a Regular Expression ?
Try
String stg = "\u2029My Actual String\u2029 \nMy Actual String";
Pattern pat = Pattern.compile("(?!(\\\\(u|U)\\w{4}|\\s))(\\w)+");
Matcher mat = pat.matcher(stg);
String out = "";
while(mat.find()){
out+=mat.group()+" ";
}
System.out.println(out);
The regex matches everything except Unicode and escape characters.
Output:
My Actual String My Actual String
Try this:
anyString = anyString.replaceAll("\\\\u[0-9a-fA-F]{4}|\\\\.", "");
to remove escaped characters (the four characters after \u are hex digits, so \d alone would miss sequences such as \u00e9). If you also want to remove all other special characters, use this one:
anyString = anyString.replaceAll("\\\\u[0-9a-fA-F]{4}|\\\\.|[^a-zA-Z0-9\\s]", "");
(I guess you want to keep the whitespace; if not, remove \\s from the pattern above.)
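The two situations are easy to confuse, so here is a small sketch that separates them: a string holding the actual characters (a real U+2029 or newline) versus a string holding the escape sequences spelled out with a backslash. ControlStripper and both method names are assumptions for illustration.

```java
class ControlStripper {
    // For strings that contain the real characters: keep ASCII letters,
    // digits and plain spaces, drop everything else, then trim the ends.
    static String keepAlphanumeric(String s) {
        return s.replaceAll("[^\\p{Alnum} ]", "").trim();
    }

    // For strings that contain the escapes as literal text: remove a backslash
    // followed by u and four hex digits, or a backslash followed by any one character.
    static String stripLiteralEscapes(String s) {
        return s.replaceAll("\\\\u[0-9a-fA-F]{4}|\\\\.", "");
    }

    public static void main(String[] args) {
        System.out.println(keepAlphanumeric("\u2029My Actual String\u2029")); // My Actual String
        System.out.println(keepAlphanumeric("\nMy Actual String\n"));         // My Actual String
        System.out.println(stripLiteralEscapes("\\u2029My Actual String\\n")); // My Actual String
    }
}
```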

Java Regex to Validate Full Name allow only Spaces and Letters

I want a regex that validates only letters and spaces. Basically this is to validate a full name, e.g. Mr Steve Collins or Steve Collins. I tried the regex "[a-zA-Z]+\.?" but it didn't work. Can someone assist me, please?
P.S. I use Java.
public static boolean validateLetters(String txt) {
String regx = "[a-zA-Z]+\\.?";
Pattern pattern = Pattern.compile(regx,Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(txt);
return matcher.find();
}
What about:
Peter Müller
François Hollande
Patrick O'Brian
Silvana Koch-Mehrin
Validating names is a difficult issue, because valid names do not consist only of the letters A-Z.
At least you should use the Unicode property for letters and add more special characters. A first approach could be e.g.:
String regx = "^[\\p{L} .'-]+$";
\\p{L} is a Unicode Character Property that matches any kind of letter from any language
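As a rough check of that first approach, the sketch below runs the pattern against the sample names above. NameValidator and isValidName are hypothetical names. Note the pattern is permissive: a string made only of spaces or dots would also pass.

```java
import java.util.regex.Pattern;

class NameValidator {
    // Letters from any script, plus space, dot, apostrophe and hyphen.
    static final Pattern NAME = Pattern.compile("^[\\p{L} .'-]+$");

    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("Peter M\u00fcller"));      // true
        System.out.println(isValidName("Fran\u00e7ois Hollande")); // true
        System.out.println(isValidName("Patrick O'Brian"));        // true
        System.out.println(isValidName("Silvana Koch-Mehrin"));    // true
        System.out.println(isValidName("Steve C0llins"));          // false, contains a digit
    }
}
```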
Try this regex (allowing letters, dots, and spaces):
"^[A-Za-z\s]{1,}[\.]{0,1}[A-Za-z\s]{0,}$" //regular
"^\pL+[\pL\pZ\pP]{0,}$" //unicode
This will also ensure DOT never comes at the start of the name.
For those who use java/android and struggle with this matter try:
"^\\p{L}+[\\p{L}\\p{Z}\\p{P}]{0,}"
This works with names like
José Brasão
You could even try this expression ^[a-zA-Z\\s]*$ for checking a string with only letters and spaces (nothing else).
For me it worked. Hope it works for you as well.
Or go through this piece of code once:
CharSequence inputStr = expression;
Pattern pattern = Pattern.compile("^[a-zA-Z\\s]*$");
Matcher matcher = pattern.matcher(inputStr);
if (matcher.matches()) {
    // the pattern matches
} else {
    // the pattern does not match
}
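The choice between matches() and find() is probably why the code in the question "didn't work": find() succeeds as soon as any part of the input matches, while matches() requires the whole input to match. A small sketch using the question's pattern (FindVersusMatches and its method names are made up):

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // The unanchored pattern from the question: letters, then an optional dot.
    static final Pattern LETTERS = Pattern.compile("[a-zA-Z]+\\.?");

    // True if some substring matches the pattern.
    static boolean findAnywhere(String s) {
        return LETTERS.matcher(s).find();
    }

    // True only if the entire string matches the pattern.
    static boolean matchWhole(String s) {
        return LETTERS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(findAnywhere("Steve 123"));   // true, "Steve" is found
        System.out.println(matchWhole("Steve 123"));     // false
        System.out.println(matchWhole("Steve Collins")); // false, the pattern has no space
        System.out.println(matchWhole("Mr."));           // true
    }
}
```

With find(), the pattern needs ^ and $ anchors to behave like a whole-string check.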
please try this regex (allow only Alphabets and space)
"[a-zA-Z][a-zA-Z ]*"
If you want it for iOS:
NSString *yourstring = @"hello";
NSString *Regex = @"[a-zA-Z][a-zA-Z ]*";
NSPredicate *TestResult = [NSPredicate predicateWithFormat:@"SELF MATCHES %@",Regex];
if ([TestResult evaluateWithObject:yourstring] == true)
{
// validation passed
}
else
{
// invalid name
}
Regex pattern for matching only alphabets and white spaces:
String regexUserName = "^[A-Za-z\\s]+$";
Accept only letters, spaces, and a few punctuation characters used in names (. ' -):
if (!(Pattern.matches("^[\\p{L} .'-]+$", name.getText()))) {
JOptionPane.showMessageDialog(null, "Please enter a valid character", "Error", JOptionPane.ERROR_MESSAGE);
name.setFocusable(true);
}
My personal choice is:
^\p{L}+[\p{L}\p{Pd}\p{Zs}']*\p{L}+$|^\p{L}+$, Where:
^\p{L}+ - It should start with 1 or more letters.
[\p{Pd}\p{Zs}'\p{L}]* - It can have letters, space character (including invisible), dash or hyphen characters and ' in any order 0 or more times.
\p{L}+$ - It should finish with 1 or more letters.
|^\p{L}+$ - Or it just should contain 1 or more letters (It is done to support single letter names).
Support for dots (full stops) was dropped, as in British English it can be dropped in Mr or Mrs, for example.
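A quick sketch of how that pattern behaves in Java, where each backslash has to be doubled inside the string literal. StrictNameCheck and isName are hypothetical names.

```java
import java.util.regex.Pattern;

class StrictNameCheck {
    // Starts and ends with a letter; letters, dashes, spaces and ' in between.
    // The second alternative covers single-letter names.
    static final Pattern NAME =
        Pattern.compile("^\\p{L}+[\\p{L}\\p{Pd}\\p{Zs}']*\\p{L}+$|^\\p{L}+$");

    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isName("Jos\u00e9 Bras\u00e3o")); // true
        System.out.println(isName("Koch-Mehrin"));           // true
        System.out.println(isName("A"));                     // true
        System.out.println(isName("Steve "));                // false, ends with a space
        System.out.println(isName("Mr. Collins"));           // false, dots are not allowed
    }
}
```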
To validate for only letters and spaces, try this
String name1_exp = "^[a-zA-Z]+[\\-'\\s]?[a-zA-Z ]+$";
Validates such values as:
"", "FIR", "FIR ", "FIR LAST"
/^[A-Za-z]*$|^[A-Za-z]+\s[A-Za-z]*$/
check this out.
String name validation only accept alphabets and spaces
public static boolean validateLetters(String txt) {
String regx = "^[a-zA-Z\\s]+$";
Pattern pattern = Pattern.compile(regx,Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(txt);
return matcher.find();
}
To support languages like Hindi, which can contain combining marks (\p{M}) between the letters:
My solution is ^[\p{L}\p{M}]+([\p{L}\p{Pd}\p{Zs}'.]*[\p{L}\p{M}])+$|^[\p{L}\p{M}]+$
You can find all the test cases for this here
https://regex101.com/r/3XPOea/1/tests
@amal: This code will match your requirement. Only letters and spaces are allowed; digits are rejected. "^" denotes the beginning of the line and "$" denotes the end of the line.
public static boolean validateLetters(String txt) {
String regx = "^[a-zA-Z ]+$";
Pattern pattern = Pattern.compile(regx,Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(txt);
return matcher.find();
}
Try with this:
public static boolean userNameValidation(String name){
return name.matches("(?i)(^[a-z])((?![? .,'-]$)[ .]?[a-z]){3,24}$");
}
For Java, you can use the pattern below for name validation. It uses \p{Alpha} (letters) and \p{Blank} (spaces or tabs) inside a negated class, so it matches any character that is not allowed:
"[^\\p{Alpha}\\p{Blank}]"
Can get a reference from Wikipedia for ASCII values also.
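Because that class is negated, the name is valid when the pattern finds nothing. A minimal sketch under that reading (AlphaBlankCheck and isLettersAndBlanks are made-up names); by default \p{Alpha} covers ASCII letters only.

```java
import java.util.regex.Pattern;

class AlphaBlankCheck {
    // Matches a single character that is neither an ASCII letter nor a space/tab.
    static final Pattern DISALLOWED = Pattern.compile("[^\\p{Alpha}\\p{Blank}]");

    // Valid when no disallowed character can be found.
    static boolean isLettersAndBlanks(String s) {
        return !DISALLOWED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isLettersAndBlanks("Steve\tCollins")); // true
        System.out.println(isLettersAndBlanks("Steve2"));         // false
        System.out.println(isLettersAndBlanks("M\u00fcller"));    // false, not an ASCII letter
    }
}
```

An empty string passes this check, so a separate length test may be needed.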

Regex for checking if a string is strictly alphanumeric

How can I check if a string contains only numbers and letters, i.e. is alphanumeric?
Considering you want to check for ASCII Alphanumeric characters, Try this:
"^[a-zA-Z0-9]*$". Use this RegEx in String.matches(Regex), it will return true if the string is alphanumeric, else it will return false.
public boolean isAlphaNumeric(String s){
String pattern= "^[a-zA-Z0-9]*$";
return s.matches(pattern);
}
If it will help, read this for more details about regex: http://www.vogella.com/articles/JavaRegularExpressions/article.html
In order to be unicode compatible:
^[\pL\pN]+$
where
\pL stands for any letter
\pN stands for any number
It's 2016 or later and things have progressed. This matches Unicode alphanumeric strings:
^[\\p{IsAlphabetic}\\p{IsDigit}]+$
See the reference (section "Classes for Unicode scripts, blocks, categories and binary properties"). There's also this answer that I found helpful.
See the documentation of Pattern.
Assuming US-ASCII alphabet (a-z, A-Z), you could use \p{Alnum}.
A regex to check that a line contains only such characters is "^[\\p{Alnum}]*$".
That also matches empty string. To exclude empty string: "^[\\p{Alnum}]+$".
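To make the ASCII versus Unicode distinction concrete, here is a sketch that puts \p{Alnum} next to the \p{IsAlphabetic}\p{IsDigit} class mentioned earlier. AlnumCheck and its method names are assumptions; since String.matches always tests the whole string, the anchors are left out.

```java
class AlnumCheck {
    // ASCII letters and digits only, at least one character.
    static boolean isAsciiAlnum(String s) {
        return s.matches("\\p{Alnum}+");
    }

    // Letters and digits from any script, at least one character.
    static boolean isUnicodeAlnum(String s) {
        return s.matches("[\\p{IsAlphabetic}\\p{IsDigit}]+");
    }

    public static void main(String[] args) {
        System.out.println(isAsciiAlnum("abc123"));      // true
        System.out.println(isAsciiAlnum("caf\u00e91"));  // false
        System.out.println(isUnicodeAlnum("caf\u00e91")); // true
        System.out.println(isUnicodeAlnum("a b"));       // false, space is not alphanumeric
    }
}
```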
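Use character classes. POSIX tools write this as ^[[:alnum:]]*$, but java.util.regex does not support the [[:alnum:]] bracket syntax, so in Java the equivalent is:
^\p{Alnum}*$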
Pattern pattern = Pattern.compile("^[a-zA-Z0-9]*$");
Matcher matcher = pattern.matcher("Teststring123");
if(matcher.matches()) {
// yay! alphanumeric!
}
Try [0-9a-zA-Z]+ to allow only letters and digits, with at least one character.
It may need modification, so test it first:
http://www.regexplanet.com/advanced/java/index.html
Pattern pattern = Pattern.compile("^[0-9a-zA-Z]+$");
Matcher matcher = pattern.matcher(phoneNumber);
if (matcher.matches()) {
}
To consider all Unicode letters and digits, Character.isLetterOrDigit can be used. In Java 8, this can be combined with String#codePoints and IntStream#allMatch.
boolean alphanumeric = str.codePoints().allMatch(Character::isLetterOrDigit);
To include [a-zA-Z0-9_], you can use \w.
So myString.matches("\\w*"). (.matches must match the entire string so ^\\w*$ is not needed. .find can match a substring)
https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
If you want to include foreign language letters as well, you can try:
String string = "hippopotamus";
if (string.matches("^[\\p{L}0-9']+$")){
// string is alphanumeric; do something here...
}
Or if you wanted to allow a specific special character, but not any others. For example for # or space, you can try:
String string = "#somehashtag";
if(string.matches("^[\\p{L}0-9'#]+$")){
// string is alphanumeric plus #; do something here...
}
Strictly mixed alphanumeric regex (the string must contain both digits and letters; digits alone or letters alone are rejected)
For example:
special char (not allowed)
123 (not allowed)
asdf (not allowed)
1235asdf (allowed)
String name="^[^<a-zA-Z>]\\d*[a-zA-Z][a-zA-Z\\d]*$";
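The pattern above only accepts strings whose first character is a non-letter, so asdf1235 is rejected and a leading symbol such as ! slips through. One alternative, swapped in here rather than taken from the answer, is a pair of lookaheads. MixedAlnum and hasLettersAndDigits are hypothetical names.

```java
class MixedAlnum {
    // (?=.*[0-9])    requires at least one digit somewhere
    // (?=.*[a-zA-Z]) requires at least one letter somewhere
    // [a-zA-Z0-9]+   allows nothing but ASCII letters and digits
    static boolean hasLettersAndDigits(String s) {
        return s.matches("(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(hasLettersAndDigits("1235asdf")); // true
        System.out.println(hasLettersAndDigits("asdf1235")); // true
        System.out.println(hasLettersAndDigits("123"));      // false
        System.out.println(hasLettersAndDigits("asdf"));     // false
        System.out.println(hasLettersAndDigits("12#ab"));    // false
    }
}
```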
To check if a String is alphanumeric, you can use a method that goes through every character in the string and checks if it is alphanumeric.
public static boolean isAlphaNumeric(String s){
for(int i = 0; i < s.length(); i++){
char c = s.charAt(i);
if(!Character.isDigit(c) && !Character.isLetter(c))
return false;
}
return true;
}

java regex to filter out non-English text

I found a few references to regex filtering out non-English but none of them is in Java, aside from the fact that they are all referring to somewhat different problems than what I am trying to solve:
1. Replace all non-English characters with a space.
2. Create a method that returns true if a string contains any non-English character.
By "English text" I mean not only actual letters and numbers but also punctuation.
So far, what I have been able to come with for goal #1 is quite simple:
String.replaceAll("\\W", " ")
In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?
As for goal #2, I could simply trim() the string after the above replaceAll(), then check if it's empty. But... Is there a more efficient way to do this?
"In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?"
\W is equivalent to [^\w], and \w is equivalent to [a-zA-Z_0-9]. Using \W will replace everything which isn't a letter, a number, or an underscore — like tabs and newline characters. Whether or not that's a problem is really up to you.
"By 'English text' I mean not only actual letters and numbers but also punctuation."
In that case, you might want to use a character class which omits punctuation; something like
[^\w.,;:'"]
"Create a method that returns true if a string contains any non-English character."
Use Pattern and Matcher.
Pattern p = Pattern.compile("\\W");
boolean containsSpecialChars(String string)
{
Matcher m = p.matcher(string);
return m.find();
}
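Putting both goals together, a sketch might look like the following. It assumes that "English" means ASCII letters, digits, punctuation and whitespace, which is only one possible reading of the question; EnglishFilter and its method names are made up.

```java
import java.util.regex.Pattern;

class EnglishFilter {
    // Anything that is not an ASCII letter/digit, ASCII punctuation or whitespace.
    private static final Pattern NON_ENGLISH =
        Pattern.compile("[^\\p{Alnum}\\p{Punct}\\s]");

    // Goal 1: replace every non-English character with a space.
    static String blankOutNonEnglish(String s) {
        return NON_ENGLISH.matcher(s).replaceAll(" ");
    }

    // Goal 2: report whether any non-English character is present.
    static boolean containsNonEnglish(String s) {
        return NON_ENGLISH.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonEnglish("Hello, world!"));  // false
        System.out.println(containsNonEnglish("na\u00efve"));     // true
        System.out.println(blankOutNonEnglish("na\u00efve"));     // na ve
    }
}
```

find() stops at the first offending character, so goal 2 does not need the replace-then-trim detour.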
This works for me
private static boolean isEnglish(String text) {
CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
CharsetEncoder isoEncoder = Charset.forName("ISO-8859-1").newEncoder();
return asciiEncoder.canEncode(text) || isoEncoder.canEncode(text);
}
Here is my solution. I assume the text may contain English words, punctuation marks and standard ASCII symbols such as @, %, # etc.
private static final String IS_ENGLISH_REGEX = "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; \\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";
private static boolean isEnglish(String text) {
if (text == null) {
return false;
}
return text.matches(IS_ENGLISH_REGEX);
}
Assuming an english word is made up of characters from: [a-zA-Z_0-9]
To return true if a string contains any non-English character, use string.matches:
return !string.matches("^\\w+$");

Unicode to string conversion in Java

I am building a toy language. The syntax \#0061 is supposed to convert the given Unicode code point to a character:
String temp = yytext().substring(2);
After that I tried to append '\u' to the string, but that generated an error.
I also tried "\\" + "u" + temp; but this does not do any conversion.
I am basically trying to convert Unicode to a character by supplying only '0061' to a method. Help.
Strip the '#' and use Integer.parseInt("0061", 16) to convert the hex digits to an int. Then cast to a char.
(If you had implemented the lexer by hand, an alternative would be to do the conversion on the fly as your lexer matches the Unicode literal. But on rereading the question, I see that you are using a lexer generator ... good move!)
"I am basically trying to convert Unicode to a character by supplying only '0061' to a method, help."
char fromUnicode(String codePoint) {
return (char) Integer.parseInt(codePoint, 16);
}
You need to handle bad inputs and such, but that will work otherwise.
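One way the bad-input handling could look is sketched below. CodePointParser and fromHex are hypothetical names, and returning a String instead of a char is a design choice made here so that code points above U+FFFF also work.

```java
class CodePointParser {
    // Converts hex digits such as "0061" into the character(s) they denote.
    static String fromHex(String hex) {
        int codePoint;
        try {
            codePoint = Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Not a hex number: " + hex, e);
        }
        // Rejects negative values and anything above U+10FFFF.
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + hex);
        }
        // toChars yields one char, or a surrogate pair for supplementary code points.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromHex("0061")); // a
    }
}
```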
You need to convert the particular codepoint to a char. You can do that with a little help of regex:
String string = "blah #0061 blah";
Matcher matcher = Pattern.compile("\\#((?i)[0-9a-f]{4})").matcher(string);
while (matcher.find()) {
int codepoint = Integer.valueOf(matcher.group(1), 16);
string = string.replace(matcher.group(0), String.valueOf((char) codepoint));
}
System.out.println(string); // blah a blah
Edit as per the comments, if it is a single token, then just do:
String string = "0061";
char c = (char) Integer.parseInt(string, 16);
System.out.println(c); // a
\uXXXX is an escape sequence. It has already been converted into the actual character value before execution; it is not "evaluated" in any way at runtime.
What you probably want to do is define a mapping from your #XXXX syntax to Unicode code points and cast them to char.
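Such a mapping could be sketched with Matcher.appendReplacement, which rewrites every token in one pass. The #XXXX token shape follows the examples in this thread; EscapeExpander and expand are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeExpander {
    // A '#' followed by exactly four hex digits, captured as group 1.
    private static final Pattern TOKEN = Pattern.compile("#([0-9a-fA-F]{4})");

    static String expand(String input) {
        Matcher m = TOKEN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' from being read as replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("blah #0061 blah")); // blah a blah
    }
}
```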
