Regular Expression That Contains All Of The Specific Letters In Java - java

I have a regular expression, which selects all the words that contains all (not! any) of the specific letters, just works fine on Notepad++.
Regular Expression Pattern;
^(?=.*B)(?=.*T)(?=.*L).+$
Input Text File;
AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
And output of the regular expression in notepad++;
LABAT
BALAT
LATAB
As It is useful for Notepad++, I tried the same regular expression on java but it is simply failed.
Here is my test code;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.lev.kelimelik.resource.*;
public class Test {
public static void main(String[] args) {
String patternString = "^(?=.*B)(?=.*T)(?=.*L).+$";
String dictionary =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
Matcher m = p.matcher(dictionary);
while(m.find())
{
System.out.println("Match: " + m.group());
}
}
}
The output is errorneous as below;
Match: AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
My question is simply, what is the java-compatible version of this regular expression?

Java-specific answer
In real life, we rarely need to validate lines, and I see that in fact, you just use the input as an array of test data. The most common scenario is reading input line by line and perform checks on it. I agree in Notepad++ it would be a bit different solution, but in Java, a single line should be checked separately.
That said, you should not copy the same approaches on different platforms. What is good in Notepad++ does not have to be good in Java.
I suggest this almost regex-free approach (String#split() still uses it):
String dictionary_str =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
String[] dictionary = dictionary_str.split("\n"); // Split into lines
for (int i=0; i<dictionary.length; i++) // Iterate through lines
{
if(dictionary[i].indexOf("B") > -1 && // There must be B
dictionary[i].indexOf("T") > -1 && // There must be T
dictionary[i].indexOf("L") > -1) // There must be L
{
System.out.println("Match: " + dictionary[i]); // No need matching, print the whole line
}
}
See IDEONE demo
Original regex-based answer
You should not rely on .* ever. This construct causes backtracking issues all the time. In this case, you can easily optimize it with a negated character class and possessive quantifiers:
^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)
The regex breakdown:
^ - start of string
(?=[^B]*+B) - right at the start of the string, check for at least one B presence that may be preceded with 0 or more characters other than B
(?=[^T]*+T) - still right at the start of the string, check for at least one T presence that may be preceded with 0 or more characters other than T
(?=[^L]*+L)- still right at the start of the string, check for at least one L presence that may be preceded with 0 or more characters other than L
See Java demo:
String patternString = "^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)";
String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT", "BALAT", "LA", "AB", "LATAB", "TAB"};
for (int i=0; i<dictionary.length; i++)
{
Pattern p = Pattern.compile(patternString);
Matcher m = p.matcher(dictionary[i]);
if(m.find())
{
System.out.println("Match: " + dictionary[i]);
}
}
Output:
Match: LABAT
Match: BALAT
Match: LATAB

Change your Pattern to:
String patternString = ".*(?=.*B)(?=.*L)(?=.*T).*";
Output
Match: LABAT
Match: BALAT
Match: LATAB

I did not debug your situation, but I think your problem is caused by matching the entire string rather than individual words.
You're matching "AL\nBAL\nBAK\nLABAT\n" plus some more. Of course that string has all the required characters. You can see it in the fact that your output only contains one Match: prefix.
Please have a look at this answer. You need to use Pattern.MULTILINE.

Related

Regex display with arrays

So I have a regex question. When running this code
if (str1.trim().contains(search2)){
String str3 = str1;
str3 = str3.replaceAll("[^-?0-9]+", " ");
System.out.println("location: " + Arrays.asList(str3.trim().split(" ")));
System.out.println(" ");
}
it produces
location: [290, -70]
is it possible to replace the bracket characters with "[ x, x]" with "x x" so that they just show the characters within quotes?
location: "290 -70"?
I'm kinda new to regex so I tried some things like .replace("[", " "); but it did not work.
EDIT ----
Here's my entire code.
public static void main (String [] args) throws IOException {
BufferedReader in = new BufferedReader (new FileReader ("/Users/Dannybwee/Documents/workspace/csc199/src/csc199/test.txt"));
String str;
List<String> finallist = new ArrayList<String>();
while ((str = in.readLine()) != null){
finallist.add(str);
}
String search = "node";
String search2 = "position";
for (String str1: finallist) {
if (str1.trim().contains(search)){
System.out.print("{ key " + str1+ ",\n" +
"name: " + str1 + ",\n" +
"Truth: 'Tainted'," + "\n" +
"False: 'NotTainted, \n");
}
if (str1.trim().contains(search2)){
String str3 = str1;
str3 = str3.replaceAll("[^-?0-9]+", " ");
System.out.println("location: " + Arrays.asList(str3.trim().split(" ")));
System.out.println("}");
}
}
}
What i'm trying to do is take a text file, and then change the formatting of the text. I thought it would be easiest to take the file and scan for what needed to change. for instance, All I want is to change the brackets outputed above to braces.
So basically I want it to output location: "290 -70" instead of location: [290, -70] without the comma and brackets
I'm splitting because the line is positions = (number number); What I'm trying to do is just extract the number from that index
Then if you split, you get ["(number", "number)"].
You want to remove the round brackets, not the square ones. And you have already done that [^-?0-9]+ removes all characters but one or more 0-9, -, and ?
You don't need to split anything.
if (str1.trim().contains(search2)){
str1 = st1.replaceAll("[^-?0-9]+", " ");
System.out.println("location: \"" + str1 + "\"");
System.out.println("}");
}
You could also forget the regex entirely and use str1.substring(1, str1.length() - 1)
By the way, if you are trying to produce JSON, it isn't valid. The keys need to be quoted
you can specify the literal bracket with the backslash "escape character" \[. This is common for many regex entries that also correspond to triggered characters.
\\ , \. , \( ... etc
It is important to note that in Java we must escape our escape character, therefore whenever you use it you'll need a single backslash for each backslash:
\\[, \\\\, \\., \\( ... etc
You can implement this into your existing code, or you could make your life a little easier by using a pattern matcher.
Pattern p = Pattern.compile("\\D+?(-?\\d++)\\D+?(-?\\d++)\\D*");
Matcher m = p.matcher(STRING);
String results = "location: "+m.group(1)+" "+m.group(2);
\\D+? eliminates non-digit (0-9) characters reluctantly, this will spare the '-' when found.
(-?\\d++) will capture m.group(n) which will possessively contain as many digits as it can find in a row. Since the '-' was spared earlier it should be present for this capture if at all.

regex seems to be off for special characters (e.g. +-.,!##$%^&*;)

I am using regex to print out a string and adding a new line after a character limit. I don't want to split up a word if it hits the limit (start printing the word on the next line) unless a group of concatenated characters exceed the limit where then I just continue the end of the word on the next line. However when I hit special characters(e.g. +-.,!##$%^&*;) as you'll see when I test my code below, it adds an additional character to the limit for some reason. Why is this?
My function is:
public static String limiter(String str, int lim) {
str = str.trim().replaceAll(" +", " ");
str = str.replaceAll("\n +", "\n");
Matcher mtr = Pattern.compile("(.{1," + lim + "}(\\W|$))|(.{0," + lim + "})").matcher(str);
String newStr = "";
int ctr = 0;
while (mtr.find()) {
if (ctr == 0) {
newStr += (mtr.group());
ctr++;
} else {
newStr += ("\n") + (mtr.group());
}
}
return newStr ;
}
So my input is:
String str = " The 123456789 456789 +-.,!##$%^&*();\\/|<>\"\' fox jumpeded over the uf\n 2 3456 green fence ";
With a character line limit of 7.
It outputs:
456789 +
-.,!##$%
^&*();\/
|<>"
When the correct output should be:
456789
+-.,!##
$%^&*()
;\/|<>"
My code is linked to an online compiler you can run here:
https://ideone.com/9gckP1
You need to replace the (\W|$) with \b as your intention is to match whole words (and \b provides this functionality). Also, since you do not need trailing whitespace on newly created lines, you need to also use \s*.
So, use
Matcher mtr = Pattern.compile("(?U)(.{1," + lim + "}\\b\\s*)|(.{0," + lim + "})").matcher(str);
See demo
Note that (?U) is used here to "fix" the word boundary behavior to keep it in sync with \w (so that diacritics were not considered word characters).
In your pattern, \\W is part of the first capturing group. It is adding this one (non-word) character to the .{1,limit} pattern.
Try with: "(.{1," + lim + "})(\W|$)|(.{0," + lim + "})"
(I can't currently use your regex online compiler)

What does regex "\\p{Z}" mean?

I am working with some code in java that has an statement like
String tempAttribute = ((String) attributes.get(i)).replaceAll("\\p{Z}","")
I am not used to regex, so what is the meaning of it? (If you could provide a website to learn the basics of regex that would be wonderful) I've seen that for a string like
ept as y it gets transformed into eptasy, but this doesn't seem right. I believe the guy who wrote this wanted to trim leading and trailing spaces maybe.
It removes all the whitespace (replaces all whitespace matches with empty strings).
A wonderful regex tutorial is available at regular-expressions.info.
A citation from this site:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
The OP stated that the code fragment was in Java. To comment on the statement:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
the sample code below shows that this does not apply in Java.
public static void main(String[] args) {
// some normal white space characters
String str = "word1 \t \n \f \r " + '\u000B' + " word2";
// various regex patterns meant to remove ALL white spaces
String s = str.replaceAll("\\s", "");
String p = str.replaceAll("\\p{Space}", "");
String b = str.replaceAll("\\p{Blank}", "");
String z = str.replaceAll("\\p{Z}", "");
// \\s removed all white spaces
System.out.println("s [" + s + "]\n");
// \\p{Space} removed all white spaces
System.out.println("p [" + p + "]\n");
// \\p{Blank} removed only \t and spaces not \n\f\r
System.out.println("b [" + b + "]\n");
// \\p{Z} removed only spaces not \t\n\f\r
System.out.println("z [" + z + "]\n");
// NOTE: \p{Separator} throws a PatternSyntaxException
try {
String t = str.replaceAll("\\p{Separator}","");
System.out.println("t [" + t + "]\n"); // N/A
} catch ( Exception e ) {
System.out.println("throws " + e.getClass().getName() +
" with message\n" + e.getMessage());
}
} // public static void main
The output for this is:
s [word1word2]
p [word1word2]
b [word1
word2]
z [word1
word2]
throws java.util.regex.PatternSyntaxException with message
Unknown character property name {Separator} near index 12
\p{Separator}
^
This shows that in Java \\p{Z} removes only spaces and not "any kind of whitespace or invisible separator".
These results also show that in Java \\p{Separator} throws a PatternSyntaxException.
First of all, \p means you are going to match a class, a collection of character, not single one. For reference, this is Javadoc of Pattern class. https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Unicode scripts, blocks, categories and binary properties are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property.
And then Z is the name of a class (collection,set) of characters. In this case, it's abbreviation of Separator . Separator containts 3 sub classes: Space_Separator(Zs), Line_Separator(Zl) and Paragraph_Separator(Zp).
Refer here for which characters those classes contains here: Unicode Character Database or
Unicode Character Categories
More document: http://www.unicode.org/reports/tr18/#General_Category_Property

Java pattern for [j-*]

Please help me with the pattern matching. I want to build a pattern which will match the word starting with j- or c- in the following in a string (Say for example)
[j-test] is a [c-test]'s name with [foo] and [bar]
The pattern needs to find [j-test] and [c-test] (brackets inclusive).
What I have tried so far?
String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
Pattern patt = Pattern.compile("\\[[*[j|c]\\-\\w\\-\\+\\d]+\\]");
Matcher m = patt.matcher(template);
while (m.find()) {
System.out.println(m.group());
}
And its giving output like
[j-test]
[c-test]
[foo]
[bar]
which is wrong. Please help me, thanks for your time on this thread.
Inside a character class, you don't need to use alternation to match j or c. Character class itself means, match any single character from the ones inside it. So, [jc] itself will match either j or c.
Also, you don't need to match the pattern that is after j- or c-, as you are not bothered about them, as far as they start with j- or c-.
Simply use this pattern:
Pattern patt = Pattern.compile("\\[[jc]-[^\\]]*\\]");
To explain:
Pattern patt = Pattern.compile("(?x) " // Embedded flag for Pattern.COMMENT
+ "\\[ " // Match starting `[`
+ " [jc] " // Match j or c
+ " - " // then a hyphen
+ " [^ " // A negated character class
+ " \\]" // Match any character except ]
+ " ]* " // 0 or more times
+ "\\] "); // till the closing ]
Using (?x) flag in the regex, ignores the whitespaces. It is often helpful, to write readable regexes.

Getting the "context" text of a matched group

I'm using the Matcher class of Java to get some strings, now when I get my matches, I also find their begin index and end index. Now what I want to do is get the x preceding and proceeding characters.
So what I did was just call the substring method on the string with {begin index minusx} to {end index plusx}, but it seems to be a little heavy, for every match, I'll have to loop the string for it's context.
I wanted to know whether there's a better way to do that.
Here is what I've done so far:
The part that bothers me is the text.substring, how expensive is it
String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("\\d{2}").matcher(text);
int x = 5;
while (matcher.find()) {
String match = matcher.group();
int start = matcher.start();
int end = matcher.end();
String pretext = text.substring(start - x, start);
String postext = text.substring(end, end + x);
System.out.println(pretext + " - " + match + " - " + postext);
}
Suggested answer of using grouping to solve this:
using the regex (.{5})(\d{2}(.{5}).
First of all, this wouldn't be able to captures ones without at least 5 characters before it. So the solution to that is (.{0,5})(\d{2})(.{0.5}), very nice for that simple regex (\d{2})but for one like "c?at" and the given text "cat" this would match the groups
c
at
String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("(.{5})(\\d{2})(.{5})").matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1) + " - " + matcher.group(2) + " - " + matcher.group(3));
}
output :
Some - 22 - text
with - 44 - char

Categories