Regex ignore tokens that do not start with letter - java

How can I write a regex that ignores any token that does not start with a letter? It should be usable in Java.
Example: for the input it 's super cool, the regex should match [it, super, cool] and ignore ['s].

Alternative regex:
"(?:^|\\s)([A-Za-z]+)"
Regex in context:
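import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String input = "it 's super cool";
        // Group 1 is a run of ASCII letters at the start of the input or right after whitespace
        Matcher matcher = Pattern.compile("(?:^|\\s)([A-Za-z]+)").matcher(input);
        while (matcher.find()) {
            String result = matcher.group(1);
            System.out.println(result);
        }
    }
}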
Output:
it
super
cool
Note: To match letters in any language (e.g. Hindi, German, Chinese, English etc.), use the following regex instead:
"(?:^|\\s)(\\p{L}+)"
More about the Pattern class and the classes for Unicode scripts, blocks, categories and binary properties can be found in the Javadoc for java.util.regex.Pattern.

You can use (?<!\\p{Punct})(\\p{L}+), which matches a run of letters that is not immediately preceded by a punctuation mark. Note that (?<! introduces a negative lookbehind. Check the documentation of Pattern to learn more about it.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String str = "it 's super cool";
        Pattern pattern = Pattern.compile("(?<!\\p{Punct})(\\p{L}+)");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
it
super
cool
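One caveat with the lookbehind pattern above: it only inspects the single character before the match, so for a longer token such as 'super the engine can skip the first letter and still match uper. The following is a minimal sketch of a stricter variant, assuming tokens are separated by whitespace. The class name LetterTokens and the helper findAll are made up for illustration and are not part of the original answers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterTokens {
    // Pattern from the answer: letters not directly preceded by punctuation.
    static final Pattern LOOSE = Pattern.compile("(?<!\\p{Punct})(\\p{L}+)");
    // Stricter sketch (assumption: whitespace-separated tokens): letters that
    // begin at the start of the input or right after a whitespace character.
    static final Pattern STRICT = Pattern.compile("(?<!\\S)\\p{L}+");

    // Collects every match of the pattern in the input, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "it 'super cool";
        System.out.println(findAll(LOOSE, input));  // [it, uper, cool]
        System.out.println(findAll(STRICT, input)); // [it, cool]
    }
}
```

On the original input it 's super cool both patterns give it, super, cool; they only differ when the token after the punctuation mark has more than one letter.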

Related

How to build the regex to find particular word

I need to find and print out a particular word in a String. What regex would you recommend to find "9.1.1_offline" in the following String:
EGA_SAMPLE_APP-iOS-master-<Any word>-200710140849862
Other examples are:
EGA_SAMPLE_APP-iOS-master-9.2.3_online-200710140849862
EGA_SAMPLE_APP-iOS-master-10.2.3_offline-200710140849862
Use the regex, \\d+\\.\\d+\\.\\d+\\_(offline|online)
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        // Test strings
        String[] arr = { "EGA_SAMPLE_APP-iOS-master-9.1.1_offline-200710140849862",
                "EGA_SAMPLE_APP-iOS-master-9.2.3_online-200710140849862",
                "EGA_SAMPLE_APP-iOS-master-10.2.3_offline-200710140849862" };
        Pattern pattern = Pattern.compile("\\d+\\.\\d+\\.\\d+\\_(offline|online)");
        // Print the matching string
        for (String s : arr) {
            Matcher matcher = pattern.matcher(s);
            while (matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }
}
Output:
9.1.1_offline
9.2.3_online
10.2.3_offline
Explanation of the regex:
\\d+ specifies one or more digits
\\. specifies a .
\\_ specifies a _ (the backslash is optional here, since _ is not a regex metacharacter)
(offline|online) specifies offline or online.
[Update]
Based on the edited question i.e. find anything between EGA_SAMPLE_APP-iOS-master- and -An_integer_number: Use the regex, EGA_SAMPLE_APP-iOS-master-(.*)-\\d+
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        // Test strings
        String[] arr = { "EGA_SAMPLE_APP-iOS-master-9.1.1_offline-200710140849862",
                "EGA_SAMPLE_APP-iOS-master-9.2.3_online-200710140849862",
                "EGA_SAMPLE_APP-iOS-master-10.2.3_offline-200710140849862",
                "EGA_SAMPLE_APP-iOS-master-anything here-200710140849862" };
        // Define regex pattern
        Pattern pattern = Pattern.compile("EGA_SAMPLE_APP-iOS-master-(.*)-\\d+");
        // Print the matching string
        for (String s : arr) {
            Matcher matcher = pattern.matcher(s);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}
Output:
9.1.1_offline
9.2.3_online
10.2.3_offline
anything here
Explanation of the regex:
.* matches any sequence of characters (greedily), and the parentheses around it form a capturing group, which the code reads with group(1).
I can suggest the following one-line option using String#replaceAll:
String input = "EGA_SAMPLE_APP-iOS-master-9.2.3_online-200710140849862";
String target = input.replaceAll(".*\\b(\\d+\\.\\d+\\.\\d+_(?:online|offline))\\b.*", "$1");
System.out.println(target);
This prints:
9.2.3_online
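One thing to keep in mind with the replaceAll one-liner is that it returns the input unchanged when the pattern does not match, so a caller cannot easily tell "no version found" from a real result. Below is a minimal sketch of an alternative that uses Matcher.find() and returns null when nothing matches. The class name VersionExtractor and the method extract are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VersionExtractor {
    // Same core pattern as the one-liner, without the surrounding .* parts.
    private static final Pattern VERSION =
            Pattern.compile("\\b\\d+\\.\\d+\\.\\d+_(?:online|offline)\\b");

    // Returns the first version token, or null when the input contains none.
    static String extract(String input) {
        Matcher matcher = VERSION.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String noVersion = "EGA_SAMPLE_APP-iOS-master-200710140849862";
        // replaceAll leaves the string as it is because nothing matched
        System.out.println(noVersion.replaceAll(
                ".*\\b(\\d+\\.\\d+\\.\\d+_(?:online|offline))\\b.*", "$1"));
        System.out.println(extract(noVersion)); // null
        System.out.println(extract(
                "EGA_SAMPLE_APP-iOS-master-9.2.3_online-200710140849862")); // 9.2.3_online
    }
}
```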

Java replaceAll but the specified regex

I haven't been able to get my head around this for quite some time. I have this piece of code:
getStringFromDom(doc).replaceAll("contract=\"\\d*\"|name=\"\\p{L}*\"", "");
Basically I need it to work the opposite way: to replace everything BUT the specified regex. I've been trying to do it with a negative lookahead, to no avail.
For your particular task, I think
getStringFromDom(doc).replaceAll(".*?(contract=\"\\d*\"|name=\"\\p{L}*\").*", "$1");
should do what you need.
You want to remove everything that does not match the pattern. This is the same as simply keeping the pattern matches. Use the regex to find matches for that pattern, then collect the matches in a StringBuilder.
// regex is your pattern, input is the text to search
Matcher m = Pattern.compile(regex).matcher(input);
StringBuilder sb = new StringBuilder();
while (m.find()) {
    sb.append(m.group()).append('\n');
}
String result = sb.toString();
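The find-and-collect idea above can be sketched as a small runnable program using the pattern from the question. The sample input string, the class name KeepOnlyMatches and the method keepMatches are assumptions made for this example; the real document would come from getStringFromDom(doc).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepOnlyMatches {
    // Returns every match of the regex, one per line, dropping everything else.
    static String keepMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the serialized DOM
        String input = "<doc contract=\"123\" id=\"7\" name=\"foo\">text</doc>";
        String regex = "contract=\"\\d*\"|name=\"\\p{L}*\"";
        // Prints:
        // contract="123"
        // name="foo"
        System.out.print(keepMatches(regex, input));
    }
}
```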
I also think that removing what you are not looking for is a double negative. Concentrate on what you are looking for and use pattern matching for that. This example searches your document for any name attributes:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
    public static void main(String[] args) {
        String input = "<AnotherDoc accNum=\"1111\" docDate=\"2017-09-26\" docNum=\"2222\" name=\"foo\"> <anotherTag>some date</anotherTag>";
        Pattern pattern = Pattern.compile("name=\"[^\\\"]*\""); // the value is any run of characters except "
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
This prints:
name="foo"

Java regex can't work if have \n character

I have a project that detects whether an editor has written HTML entities, but when the text contains \n it doesn't work. Why?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
    public static void main(String[] args) {
        String text = "asdasdas <h1>Test</h1></div>";
        String regex = ".*<[^&lt]+>.*";
        Pattern pattern = Pattern.compile(regex);
        Matcher m = pattern.matcher(text);
        System.out.println(m.matches());
    }
}
If you want . to match \n as well, compile the pattern with the DOTALL flag:
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
By default, . does not match line terminators; with DOTALL it matches any character, including \n.
You can also use Pattern.MULTILINE, which makes ^ and $ match at the start and end of each line instead of only at the start and end of the whole input. It does not change what . matches, so on its own it will not make this regex match across lines.
The Oracle documentation for Pattern may help you understand these flags better, rather than just applying the code. The More You Know... :)
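To make the effect of the flags concrete, here is a small sketch that runs the regex from the question against a text containing \n, once with no flags, once with DOTALL and once with MULTILINE. The input string with the embedded newline, the class name DotAllDemo and the helper matchesWithFlags are assumptions added for this illustration.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Compiles the regex with the given flags and tests it against the whole text.
    static boolean matchesWithFlags(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "asdasdas\n<h1>Test</h1></div>"; // note the \n
        String regex = ".*<[^&lt]+>.*";

        // No flags: '.' stops at the line break, so the whole text cannot match
        System.out.println(matchesWithFlags(regex, text, 0));                 // false
        // DOTALL: '.' also matches '\n'
        System.out.println(matchesWithFlags(regex, text, Pattern.DOTALL));    // true
        // MULTILINE only affects ^ and $, which this regex does not use
        System.out.println(matchesWithFlags(regex, text, Pattern.MULTILINE)); // false
    }
}
```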

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir#dur:~/NetBeansProjects/regex$
thufir#dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir#dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String matchesLastWordFine = "a b cd efg hi";
        lastWord(matchesLastWordFine);
        String noMatchFound = matchesLastWordFine + ".";
        lastWord(noMatchFound);
    }
    private static void lastWord(String sentence) {
        System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
        Pattern pattern = Pattern.compile("(\\w+)$");
        Matcher matcher = pattern.matcher(sentence);
        String match = null;
        while (matcher.find()) {
            match = matcher.group();
            System.out.println(match);
        }
    }
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, to match just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
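Although the question asks only for the regex, a short Java sketch may help check the lookahead variant above. It uses the last pattern (without the inner capturing group, which is not needed when the whole match is read). The class name LastWord and the method lastWord are illustrative names, not taken from the original posts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWord {
    // Word characters followed by an optional punctuation mark and the end of the line.
    // The lookahead checks the punctuation without making it part of the match.
    private static final Pattern LAST_WORD = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Returns the last word of the sentence, or null if there is none.
    static String lastWord(String sentence) {
        Matcher matcher = LAST_WORD.matcher(sentence);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));  // hi
        System.out.println(lastWord("a b cd efg hi.")); // hi
        System.out.println(lastWord("a b cd efg hi!")); // hi
    }
}
```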
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case), while matcher.group(0) returns the full match. So your regex is almost correct. One caveat concerns your use of $, which anchors the match to the end of the line: use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1, which means matcher.group(0) holds the word with the punctuation and matcher.group(1) holds the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
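The difference between group(0) and group(1) described in this answer can be seen in a short sketch. The class name GroupDemo and the method groups are hypothetical; the pattern and the sample sentence come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns {full match, first capturing group} for the first match, or null if none.
    static String[] groups(String sentence) {
        Matcher matcher = Pattern.compile("(\\w+)\\p{Punct}").matcher(sentence);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(0), matcher.group(1) };
    }

    public static void main(String[] args) {
        String[] result = groups("a b cd efg hi.");
        System.out.println(result[0]); // hi.
        System.out.println(result[1]); // hi
    }
}
```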
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
The capture group will give the correct match.
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String matchesLastWordFine = "a b cd efg hi";
        lastWord(matchesLastWordFine);
        String noMatchFound = matchesLastWordFine + ".";
        lastWord(noMatchFound);
    }
    private static void lastWord(String sentence) {
        System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
        Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
        Matcher matcher = pattern.matcher(sentence);
        String match = null;
        while (matcher.find()) {
            match = matcher.group();
        }
        System.out.println(match);
    }
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?

Find a number of a given number of digits between given separators

What regex/pattern can I use to find the following pattern in a string?
#nnnn:
nnnn can be any 4-digit number, as long as it is surrounded by a hashtag and a colon.
I have tried the code below:
String string = "#8226:";
if (string.matches(".*\\d:.*")) {
    System.out.println("Yes");
}
It DOES work, but it matches other strings like below:
"This is a string 1234: Hahaha!" // Outputs "Yes"
"Hello 1834: World!!!" // Outputs "Yes"
I want it to only match the pattern at the top of the question.
Can anybody tell me where I went wrong?
This can be done with a regular expression:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FindPattern {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("#[0-9]{4}:");
        String text = "#1233:#3433:abc#3993: #a343:___#8888:ki";
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
output is:
#1233:
#3433:
#3993:
#8888:
You already have a pattern: #nnnn:. The only problem is that this is not a Java-compatible regular expression. Let's convert it.
# and : are valid character literals, so leave these untouched.
As you probably know (judging by your solution), a digit is denoted by the \d sequence (note that there are some alternatives, e.g. [0-9] or \p{Digit}). Just replace each n with \d:
#\d\d\d\d:
There are four equal subpatterns here, so we can shorten it with a fixed quantifier:
#\d{4}:
You can now write string.matches("#\\d{4}:"). Note that this is slow because it compiles the given regex pattern on every call. If this code is called frequently, I would consider using a precompiled Pattern like:
Pattern HASH_NUMBER_COLON_PATTERN = Pattern.compile("#\\d{4}:");
// ...
if (HASH_NUMBER_COLON_PATTERN.matcher(yourString).matches()) {
    // ...
}
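Note that matches() requires the whole string to fit the pattern, whereas find() looks for the pattern anywhere inside the string. Which one is right depends on whether the input is expected to be exactly #nnnn: or merely to contain it. A minimal sketch of both, with hypothetical helper names isExactly and contains:

```java
import java.util.regex.Pattern;

class HashNumber {
    // Compiled once and reused, as suggested above.
    private static final Pattern HASH_NUMBER_COLON = Pattern.compile("#\\d{4}:");

    // True only if the entire string is of the form #nnnn:
    static boolean isExactly(String s) {
        return HASH_NUMBER_COLON.matcher(s).matches();
    }

    // True if #nnnn: occurs anywhere in the string
    static boolean contains(String s) {
        return HASH_NUMBER_COLON.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactly("#8226:"));               // true
        System.out.println(isExactly("id #8226: ok"));         // false
        System.out.println(contains("id #8226: ok"));          // true
        System.out.println(contains("Hello 1834: World!!!"));  // false
    }
}
```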
It can be even better to use a regular expression builder library, such as regex-builder, JavaVerbalExpressions or RegexBee. These tools can make your intention very clear. RegexBee example:
Pattern HASH_NUMBER_COLON_PATTERN = Bee
        .then(Bee.fixedChar('#'))
        .then(Bee.intBetween(1000, 9999))
        .then(Bee.fixedChar(':'))
        .toPattern();
