Java regex: a regular expression for extracting the first alphabetical characters

How can I extract the first alphabetical characters of a string in Java? For example, after applying a regex to the string "ABD123EZ13" I should get "ABD". Is this possible? I searched for a while and didn't find anything.
I found this regex:
String firstThreeCharacters = str.replaceAll("(?i)^[^a-z]*([a-z])[^a-z]*([a-z])[^a-z]*([a-z]).*$", "$1$2$3");
It extracts the first n characters, but it doesn't check whether those characters are alphabetical or not.
Other Examples:
"AAAA" => "AAAA"
"1231" => ""
"_abvbv" => ""
"abd_12df" => "abd"

You may use
String result = s.replaceFirst("(?s)\\P{L}.*", "");
Details
(?s) - a Pattern.DOTALL modifier to make . match line break chars
\\P{L} - any char other than a Unicode letter
.* - any 0+ chars, up to the end of the string.
You do not need replaceAll since there will be at most one replacement operation; replaceFirst is fine.
If you only need to handle ASCII letters, replace \\P{L} with \\P{Alpha}, which matches any char other than an ASCII letter.
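As a rough check of the replaceFirst approach, the sketch below runs it over the examples from the question. The class and method names (LeadingLetters, leadingUnicodeLetters, leadingAsciiLetters) are made up for illustration; only the two patterns come from the answer.

```java
class LeadingLetters {
    // Remove everything from the first non-letter onward (any Unicode letter counts).
    static String leadingUnicodeLetters(String s) {
        return s.replaceFirst("(?s)\\P{L}.*", "");
    }

    // Same idea, but only ASCII letters (a-z, A-Z) count as letters.
    static String leadingAsciiLetters(String s) {
        return s.replaceFirst("(?s)\\P{Alpha}.*", "");
    }

    public static void main(String[] args) {
        String[] samples = {"ABD123EZ13", "AAAA", "1231", "_abvbv", "abd_12df"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" => \"" + leadingUnicodeLetters(s) + "\"");
        }
        // The first char below is U+00E9 (e with acute accent):
        // a Unicode letter, but not an ASCII one.
        String accented = "\u00e9lan1";
        System.out.println(leadingUnicodeLetters(accented).length()); // 4 letters kept
        System.out.println(leadingAsciiLetters(accented).isEmpty());  // true
    }
}
```

The two variants only differ on input that contains non-ASCII letters, so for plain ASCII data either one should do.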
A matching approach is probably the easiest: the patterns ^\p{L}+ or ^\p{Alpha}+ match one or more letters at the start of the string only:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "abd_12df";
Pattern pattern = Pattern.compile("^\\p{L}+"); // or just Pattern.compile("^[a-zA-Z]+") to get the first one or more ASCII letters
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(0)); // abd
}

Related

Split string using multiple patterns, where second pattern matches smaller parts of the first

I'm reading special "formatting codes" in a string and am trying to split the string so that I have those formatting codes and the string's text separated.
There are two "types" of formatting codes: "Encoded" hex colors: §x§7§3§7§5§f§f and other codes in the format of §r.
Given the example string: §x§7§3§7§5§f§f§ltest1 §rtest2
I need the larger pattern split as a whole, and then the smaller ones. I can do what I want on those patterns separately, but am having trouble combining them into a single regex. Because the second pattern matches pieces of the first pattern, it's just splitting everything into smaller groups.
I'm trying this:
for (String substr : "§x§7§3§7§5§f§f§ltest1 §rtest2".split("((?<=(§x(§[0-9a-f]){6}))|(?<=§[0-9a-z])|(?=§[0-9a-z]))")) {
System.out.println(substr);
}
My expected output is:
§x§7§3§7§5§f§f
§l
test1
§r
test2
My actual output is:
§x
§7
§3
§7
§5
§f
§f
§l
test1
§r
test2
When I split the expressions up into different split tests, they work, they're just not working together.

Instead of splitting, you could just use this simplified regex for matching:
§x(?:§[0-9a-f]){6}|§[0-9a-z]|[^§\s]+
RegEx Details:
§x(?:§[0-9a-f]){6}: Match §x followed by six hex digits, each prefixed with §
|: OR
§[0-9a-z]: Match § followed by a digit or a lowercase letter
|: OR
[^§\s]+: Match 1+ characters that are neither whitespace nor §
Code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "§x(?:§[0-9a-f]){6}|§[0-9a-z]|[^§\\s]+";
final String string = "§x§7§3§7§5§f§f§ltest1 §rtest2";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
// Each match is one token: a hex color code, a short code, or a run of text
while (matcher.find()) {
    System.out.println(matcher.group(0));
}

You can use the following regex (it begins with a literal space, followed by ?):
 ?((?:§[^§])(?=[^§])|[^§ ]{2,})
How it works:
" ?" optionally match a space character
((?:§[^§])(?=[^§])|[^§ ]{2,}) capture either of the following:
(?:§[^§])(?=[^§]) match the following:
(?:§[^§]) match § followed by any character except §
(?=[^§]) lookahead ensuring that the next character exists and is not § (like (?!§), except that it also fails at the end of the string)
[^§ ]{2,} match any character except § or space two or more times
Replace each match with \n$1 (a newline followed by the captured group).
Result:
§x§7§3§7§5§f§f
§l
test1
§r
test2
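The answer above gives no Java code, so here is a minimal sketch of how the substitution might be applied with String.replaceAll. The class and method names (FormatCodeSplitter, splitCodes) are invented for this example, and the pattern is assumed to start with a literal space as described in the explanation.

```java
class FormatCodeSplitter {
    // Pattern from the answer; the first character is a literal space.
    static final String REGEX = " ?((?:§[^§])(?=[^§])|[^§ ]{2,})";

    static String[] splitCodes(String input) {
        // Put every match on its own line, then split on the inserted newlines.
        return input.replaceAll(REGEX, "\n$1").split("\n");
    }

    public static void main(String[] args) {
        for (String part : splitCodes("§x§7§3§7§5§f§f§ltest1 §rtest2")) {
            System.out.println(part);
        }
    }
}
```

Note that the hex color code is never matched by the pattern; it simply stays in place in front of the first inserted newline. If the input starts with something the pattern does match, the returned array begins with an empty string, which the caller would need to skip.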

Regex to find a given number of characters after last underscore

I need to find two characters after the last underscore in given filename.
Example string:
sample_filename_AB12123321.pdf
I am using [^_]*(?=\.pdf), but it finds all the characters after the underscore like AB12123321.
I need to find the first two characters AB only.
Moreover, there is no way to access the code, I can only modify the regex pattern.

If you want to solve the problem using a regex you may use:
(?<=_)[^_]{2}(?=[^_]*$)
Details
(?<=_) - an underscore must appear immediately to the left of the current position
[^_]{2} - any 2 chars other than underscore (this is the whole match; the pattern has no capturing group)
(?=[^_]*$) - immediately to the right of the current position, there must appear any 0+ chars other than underscore and then the end of string.
In Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "sample_filename_AB12123321.pdf";
Pattern pattern = Pattern.compile("(?<=_)[^_]{2}(?=[^_]*$)");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(0));
}
Output: AB.
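To see how the lookarounds behave on other file names, the following sketch wraps the same pattern in a helper. The names AfterLastUnderscore and firstTwoAfterLastUnderscore are hypothetical, as are the extra sample file names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterLastUnderscore {
    private static final Pattern TWO_AFTER_LAST_UNDERSCORE =
            Pattern.compile("(?<=_)[^_]{2}(?=[^_]*$)");

    // Returns the two chars right after the last underscore, if there are any.
    static Optional<String> firstTwoAfterLastUnderscore(String name) {
        Matcher m = TWO_AFTER_LAST_UNDERSCORE.matcher(name);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] names = {"sample_filename_AB12123321.pdf", "a_b_CD.txt", "report.pdf", "abc_"};
        for (String name : names) {
            System.out.println(name + " -> " + firstTwoAfterLastUnderscore(name).orElse("(no match)"));
        }
    }
}
```

Names without an underscore, or with fewer than two characters after the last one, produce no match at all, so the caller has to decide what to do in that case.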

Erase any string that doesn't match a pattern using replaceall()

I need to replace ALL characters that don't follow a pattern with "".
I have strings like:
MCC-QX-1081
TEF-CO-QX-4949
SPARE-QX-4500
The closest I have come so far is the following regex:
String regex = "[^QX,-,\\d]";
Using the String replaceAll method I get QX1081, but the expected result is QX-1081.

You're using a character class which matches single characters, not patterns.
You want something like
String resultString = subjectString.replaceAll("^.*?(QX-\\d+)?$", "$1");
which works as long as nothing follows the QX-digits part in your strings.

Put the dash at the end of the character class: [^QX,\d-]
Then you only need substring to drop the leading dash.
I don't know exactly what you expect for all strings, but to match a literal dash in a character class, put it first or last in the class (or escape it).

You are using a character class, in which the hyphen has to be escaped or placed at the start or the end, as in [^QX,\d-]; otherwise ,-, denotes a range from a comma to a comma. Changing that alone gives you -QX-1081, though, which is not the desired result.
Instead, you could match the whole string and replace it with the first capturing group, $1:
^(?:[A-Z]+-)+(QX-\d+)$
In a Java string literal the backslash has to be doubled, so \d is written as \\d.
That will match:
^ Start of the string
(?:[A-Z]+-)+ Repeat 1+ times: one or more uppercase characters followed by a hyphen
(QX-\d+) Capture in a group QX- followed by 1+ digits
$ End of the string
For example:
String result = "MCC-QX-1081".replaceAll("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
System.out.println(result); // QX-1081
Note that if you are doing just one replacement, you could also use replaceFirst.
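The difference between the repaired character class and the capture-group approach can be sketched on the three sample strings from the question. KeepQxCode and its method names are invented for this illustration, and "ABC-123" is an extra, hypothetical input without a QX part.

```java
class KeepQxCode {
    // Character class with the hyphen moved to the end: deletes single chars only.
    static String viaCharClass(String s) {
        return s.replaceAll("[^QX,\\d-]", "");
    }

    // Match the whole string and keep only the captured QX-digits part.
    static String viaCaptureGroup(String s) {
        return s.replaceAll("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
    }

    public static void main(String[] args) {
        String[] samples = {"MCC-QX-1081", "TEF-CO-QX-4949", "SPARE-QX-4500", "ABC-123"};
        for (String s : samples) {
            System.out.println(s + " | char class: " + viaCharClass(s)
                    + " | capture group: " + viaCaptureGroup(s));
        }
    }
}
```

The character class keeps every hyphen, so leading dashes survive. The capture-group version leaves a string untouched when the pattern does not match, which may or may not be what is wanted for inputs without a QX part.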

Regex expression to capture only words without numbers or symbols

I need some regex that given the following string:
"test test3 t3st test: word%5 test! testing t[st"
will match only words in a-z chars:
Should match: test testing
Should not match: test3 t3st test: word%5 test! t[st
I have tried ([A-Za-z])\w+ but word%5 should not be a match.

You may use
String patt = "(?<!\\S)\\p{Alpha}+(?!\\S)";
It will match 1 or more letters that are enclosed with whitespace or start/end of string locations. Alternative patterns are (?<!\S)[a-zA-Z]+(?!\S) (same as the one above) or (?<!\S)\p{L}+(?!\S) (if you also want to match all Unicode letters).
Details:
(?<!\\S) - a negative lookbehind that fails the match if there is a non-whitespace char immediately to the left of the current location
\\p{Alpha}+ - 1 or more ASCII letters (same as [a-zA-Z]+, but if you use a Pattern.UNICODE_CHARACTER_CLASS modifier flag, \p{Alpha} will be able to match Unicode letters)
(?!\\S) - a negative lookahead that fails the match if there is a non-whitespace char immediately to the right of the current location.
A Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "test test3 t3st test: word%5 test! testing t[st";
Pattern pattern = Pattern.compile("(?<!\\S)\\p{Alpha}+(?!\\S)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
Output: test and testing.

Try this:
Pattern tokenPattern = Pattern.compile("[\\p{L}]+");
[\\p{L}]+ matches each run of one or more Unicode letters (the brackets around \\p{L} are redundant).
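The two suggested patterns behave quite differently on the sample string, which the sketch below tries to make visible. WordMatches and findAll are hypothetical names; the patterns and the input are taken from the posts above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatches {
    // Collect every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "test test3 t3st test: word%5 test! testing t[st";
        // Whitespace-bounded: only whole tokens made of letters.
        System.out.println(findAll("(?<!\\S)\\p{Alpha}+(?!\\S)", s)); // [test, testing]
        // Unbounded: also picks up letter runs inside test3, word%5, t[st, ...
        System.out.println(findAll("[\\p{L}]+", s));
    }
}
```

Without the lookarounds, fragments such as the "test" in test3 or the "word" in word%5 are returned as well, so the unbounded pattern alone does not meet the requirement in the question.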

Java Regex for multiline text

I need to match a string against a regex in Java. The string is multiline and therefore contains multiple \n characters, like the following:
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
String regex = "\\S*";
System.out.println(text.matches(regex));
I only want to check whether the text contains at least one non-whitespace character. The output is false. I have also tried \\S*(\n)* for the regex, which also returns false.
In the real program, neither the text nor the regex is hard-coded. What is the right regex to check whether a multiline string contains any non-whitespace character?

The problem is not directly related to the multiple lines. It is that matches has to match the whole string, not just a part of it.
If you want to check for at least one non-whitespace character, use:
"\\s*\\S[\\s\\S]*"
Which means
Zero or more whitespace characters at the start of the string
One non-whitespace character
Zero or more other characters (whitespace or non-whitespace) up to the end of the string

If you just want to check whether there is at least one non white space character in the string, you can just trim the text and check the size without involving regex at all.
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
if (!text.trim().isEmpty()){
//your logic here
}
If you really want to use regex, you can use a simple regex like below.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
String regex = ".*\\S+.*";
// MULTILINE has no effect here: the pattern contains no ^ or $
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
    // your logic here
}

Using String.matches()
!text.matches("\\s*")
Checks whether the input text consists solely of whitespace characters (this includes newlines), then inverts the match result with !
Using Matcher.find()
Pattern regexp = Pattern.compile("\\S");
regexp.matcher(text).find()
Will search for the first non-whitespace character, which is more efficient as it will stop on the first match and also uses a pre-compiled pattern.
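The approaches from the answers above can be put side by side in a small sketch. The class NonWhitespaceCheck and its method names are made up for this comparison; the regexes are the ones quoted in the answers.

```java
import java.util.regex.Pattern;

class NonWhitespaceCheck {
    private static final Pattern NON_WHITESPACE = Pattern.compile("\\S");

    // matches() must cover the whole string, so the pattern allows anything around one \S.
    static boolean viaMatches(String text) {
        return text.matches("\\s*\\S[\\s\\S]*");
    }

    // True unless the whole string is whitespace (or empty).
    static boolean viaNegatedMatches(String text) {
        return !text.matches("\\s*");
    }

    // find() stops at the first non-whitespace character.
    static boolean viaFind(String text) {
        return NON_WHITESPACE.matcher(text).find();
    }

    // No regex at all.
    static boolean viaTrim(String text) {
        return !text.trim().isEmpty();
    }

    public static void main(String[] args) {
        String text = "abcde\n" + "fghij\n" + "klmno\n";
        System.out.println(text.matches("\\S*"));    // false: \S* cannot match the \n chars
        System.out.println(viaMatches(text));        // true
        System.out.println(viaNegatedMatches(text)); // true
        System.out.println(viaFind(text));           // true
        System.out.println(viaTrim(text));           // true
    }
}
```

All four helpers agree on the sample text and on whitespace-only or empty input, so the choice mostly comes down to readability and whether the pattern is reused.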
