Java pattern for [j-*] - java

Please help me with pattern matching. I want to build a pattern that matches the bracketed words starting with j- or c- in a string such as:
[j-test] is a [c-test]'s name with [foo] and [bar]
The pattern needs to find [j-test] and [c-test] (brackets inclusive).
What I have tried so far?
String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
Pattern patt = Pattern.compile("\\[[*[j|c]\\-\\w\\-\\+\\d]+\\]");
Matcher m = patt.matcher(template);
while (m.find()) {
    System.out.println(m.group());
}
And it's giving output like
[j-test]
[c-test]
[foo]
[bar]
which is wrong. Please help me, thanks for your time on this thread.

Inside a character class, you don't need alternation to match j or c. A character class already means "match any single character from the ones listed", so [jc] matches either j or c. In your pattern, the | and * inside the brackets are treated as literal characters, and the outer class accepts any mix of the listed characters, which is why [foo] and [bar] match too.
Also, you don't need to spell out what follows j- or c-, since you only care that the word starts with j- or c-.
Simply use this pattern:
Pattern patt = Pattern.compile("\\[[jc]-[^\\]]*\\]");
To explain:
Pattern patt = Pattern.compile("(?x) " // Embedded flag for Pattern.COMMENTS
+ "\\[ " // Match starting `[`
+ " [jc] " // Match j or c
+ " - " // then a hyphen
+ " [^ " // A negated character class
+ " \\]" // Match any character except ]
+ " ]* " // 0 or more times
+ "\\] "); // till the closing ]
The (?x) flag makes the regex engine ignore whitespace in the pattern, which is often helpful for writing readable regexes.
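As a quick check, here is a minimal runnable sketch that applies the suggested pattern to the string from the question. The class name BracketTagDemo and the helper findTags are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketTagDemo {
    // "[", then j or c, then a hyphen, then any run of non-"]" characters, then "]".
    private static final Pattern TAG = Pattern.compile("\\[[jc]-[^\\]]*\\]");

    // Hypothetical helper: collects every bracketed j-/c- word in the input.
    static List<String> findTags(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
        System.out.println(findTags(template)); // [[j-test], [c-test]]
    }
}
```

[foo] and [bar] are skipped because the character right after the opening bracket must be j or c, followed by a hyphen.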

Related

change a part of file via regex and java

I get two different results between an online regex tester and my Java code.
My input text:
Examples:
#DATA
|id|author|zip|city|element|
Odl data - Odl data - Odl data
#END
I want to replace Odl data - Odl data - Odl data (in my example) with foo.
My regex is:
#DATA[\s\S].*[\s\S]([\s\S]*)#END
I want to replace Group 1 with foo.
Online demo:
https://regex101.com/r/Nq9fas/2
My java code:
final String regex = "#DATA[\\s\\S].*[\\s\\S]([\\s\\S]*)#END";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}
I want to keep the first line (|id|author|zip|city|element|), but my regex changes all the data between #DATA and #END.
The point is to match what you need to remove/replace and match and capture what you need to keep.
You may use a replaceFirst with
(#DATA\r?\n.*\r?\n)[\s\S]*(#END)
See the regex demo. In Java:
String res = s.replaceFirst("(#DATA\r?\n.*\r?\n)[\\s\\S]*(#END)", "$1foo\n$2");
Note: If you have only one line to replace, use
(#DATA\r?\n.*\r?\n).*([\s\S]*#END)
Note 2: If you have several such "blocks" in the text, make the [\s\S] quantifier lazy (*? instead of *) and use replaceAll instead of replaceFirst:
(#DATA\r?\n.*\r?\n).*([\s\S]*?#END)
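The capture-and-keep idea can be sketched as a small runnable example. It assumes everything between the header line and #END should be replaced, and it uses the lazy quantifier with replaceAll so that several blocks would also be handled. The class name DataBlockDemo and the method replaceData are hypothetical.

```java
import java.util.regex.Matcher;

public class DataBlockDemo {
    // Group 1 keeps the "#DATA" line and the header line, group 2 keeps "#END".
    // Everything in between is matched lazily and dropped by the replacement.
    static String replaceData(String input, String replacement) {
        return input.replaceAll(
                "(#DATA\\r?\\n.*\\r?\\n)[\\s\\S]*?(#END)",
                "$1" + Matcher.quoteReplacement(replacement) + "\n$2");
    }

    public static void main(String[] args) {
        String s = "Examples:\n"
                + "#DATA\n"
                + "|id|author|zip|city|element|\n"
                + "Odl data - Odl data - Odl data\n"
                + "#END";
        System.out.println(replaceData(s, "foo"));
    }
}
```

Matcher.quoteReplacement is used only so that a replacement containing $ or \ is inserted literally.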

Why doesn't the regex match

I have written the following code:
public static void main(String[] args) {
// String to be scanned to find the pattern.
String line = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n"
+ "'7859','1194','FIRM','21'";
String pattern = "^'*','*','*','*'$";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find()) {
System.out.println("Found value: " + m.group(0));
System.out.println("Found value: " + m.group(1));
} else {
System.out.println("NO MATCH");
}
}
It always returns NO MATCH.
Expected result: 2 rows.
What am I doing wrong?
There are several problems in your code:
A star (*) matches the preceding element zero or more times. In your pattern, '* means "zero or more single quotes", so '*' matches a run of quotes followed by one more quote, not arbitrary text between quotes.
Also, the star quantifier is "greedy" by default, meaning it will consume as many matching characters as possible, including the ending quote of your groups. In your case, you may want to make it "reluctant" (by appending a ? to it: *?), so that it matches only the text inside the single quotes.
The lines must be matched one by one, so the multi-line input must be split on the line separator (\n), unless you use the multi-line match option, which I think is not what you want here.
Your pattern has no capturing groups, so m.group(1) cannot work. group(0) is the whole match; capturing groups are numbered from 1, so once you add parentheses you get groups 1 to 4.
Here is your code, corrected as explained above :
public static void main(String[] args) {
String line = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n" +
"'7859','1194','FIRM','21'";
Pattern r = Pattern.compile("'(.*?)','(.*?)','(.*?)','(.*?)'");
String[] lines = line.split("\n");
for (String l : lines) {
System.out.println("Line : " + l);
Matcher m = r.matcher(l);
if (m.find()) {
System.out.println("Found value: " + m.group(1));
System.out.println("Found value: " + m.group(2));
System.out.println("Found value: " + m.group(3));
System.out.println("Found value: " + m.group(4));
} else {
System.out.println("NO MATCH");
}
}
}
And here is the result :
Line : '7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'
Found value: 7858
Found value: 1194
Found value: FSP,FRB,FWF,FBVS,FRRC
Found value: 15
Line : '7859','1194','FIRM','21'
Found value: 7859
Found value: 1194
Found value: FIRM
Found value: 21
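"^'*','*','*','*'$" does not match anything because '* matches as many ' characters as possible, not the text between the quotes. It does not match what you want.
Also, without Pattern.MULTILINE, ^ and $ anchor only at the start and end of the whole input, not at each line, so they won't work on this two-line string.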
I think that this regex is what you need:
'[0-9A-Z,]*','[0-9A-Z,]*','[0-9A-Z,]*','[0-9A-Z,]*'
Here I have added the character class [0-9A-Z,] to match numbers, letters and ,s. I think that this will give you what you need.
You could try with this expression:
(?<=^|[\r\n]+)'([^']*)','([^']*)','([^']*)','([^']*)'(?=[\r\n]+|$)
Breakdown:
(?<=^|[\r\n]+) is a positive look-behind checking for either the start of the input or a sequence of linebreak characters
'([^']*)' matches and captures one of your groups. You could use '(.*?)' (i.e. a reluctant quantifier) instead, but the former version is safer since it won't match if your input lines contain more than 4 groups
(?=[\r\n]+|$) is a positive look-ahead checking that your groups are followed by either a sequence of linebreak characters or the end of the input sequence
I also made the following assumptions about your code:
Your input contains multiple lines which you can't or don't want to split (otherwise String[] lines = input.split("[\\r\\n]+") would be better).
A matching line always consists of 4 groups which you want to access using group(1) etc.
Your groups can contain any character except a single quote. If a group is only allowed to contain certain characters (e.g. digits), it would be safer to reflect that in the expression (e.g. '[0-9]+')
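For comparison, here is a sketch of a simpler variant that replaces the lookarounds with ^ and $ under Pattern.MULTILINE, which is the multi-line option mentioned earlier. It assumes the same four-group rows as in the question. The class name QuotedRowDemo and the method parseRows are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedRowDemo {
    // With MULTILINE, ^ and $ match at every line boundary, so no split is needed.
    private static final Pattern ROW = Pattern.compile(
            "^'([^']*)','([^']*)','([^']*)','([^']*)'$", Pattern.MULTILINE);

    // Hypothetical helper: returns the four captured values of each matching row.
    static List<List<String>> parseRows(String input) {
        List<List<String>> rows = new ArrayList<>();
        Matcher m = ROW.matcher(input);
        while (m.find()) {
            rows.add(Arrays.asList(m.group(1), m.group(2), m.group(3), m.group(4)));
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n"
                + "'7859','1194','FIRM','21'";
        for (List<String> row : parseRows(line)) {
            System.out.println(row);
        }
    }
}
```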

Regular Expression That Contains All Of The Specific Letters In Java

I have a regular expression, which selects all the words that contains all (not! any) of the specific letters, just works fine on Notepad++.
Regular Expression Pattern;
^(?=.*B)(?=.*T)(?=.*L).+$
Input Text File;
AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
And output of the regular expression in notepad++;
LABAT
BALAT
LATAB
Since it works in Notepad++, I tried the same regular expression in Java, but it simply failed.
Here is my test code;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.lev.kelimelik.resource.*;
public class Test {
public static void main(String[] args) {
String patternString = "^(?=.*B)(?=.*T)(?=.*L).+$";
String dictionary =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
Matcher m = p.matcher(dictionary);
while(m.find())
{
System.out.println("Match: " + m.group());
}
}
}
The output is erroneous, as shown below;
Match: AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
My question is simply, what is the java-compatible version of this regular expression?
Java-specific answer
In real life, we rarely need to validate lines this way, and I see that in fact you just use the input as an array of test data. The most common scenario is reading input line by line and performing checks on each line. I agree that in Notepad++ the solution would be a bit different, but in Java each line should be checked separately.
That said, you should not copy the same approaches on different platforms. What is good in Notepad++ does not have to be good in Java.
I suggest this almost regex-free approach (String#split() still uses it):
String dictionary_str =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
String[] dictionary = dictionary_str.split("\n"); // Split into lines
for (int i=0; i<dictionary.length; i++) // Iterate through lines
{
if(dictionary[i].indexOf("B") > -1 && // There must be B
dictionary[i].indexOf("T") > -1 && // There must be T
dictionary[i].indexOf("L") > -1) // There must be L
{
System.out.println("Match: " + dictionary[i]); // No need matching, print the whole line
}
}
See IDEONE demo
Original regex-based answer
Avoid relying on .* where you can: it often causes needless backtracking. In this case, you can easily optimize it with a negated character class and possessive quantifiers:
^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)
The regex breakdown:
^ - start of string
(?=[^B]*+B) - right at the start of the string, check for at least one B presence that may be preceded with 0 or more characters other than B
(?=[^T]*+T) - still right at the start of the string, check for at least one T presence that may be preceded with 0 or more characters other than T
(?=[^L]*+L)- still right at the start of the string, check for at least one L presence that may be preceded with 0 or more characters other than L
See Java demo:
String patternString = "^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)";
String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT", "BALAT", "LA", "AB", "LATAB", "TAB"};
Pattern p = Pattern.compile(patternString); // compile once, reuse for every word
for (int i=0; i<dictionary.length; i++)
{
    Matcher m = p.matcher(dictionary[i]);
    if(m.find())
    {
        System.out.println("Match: " + dictionary[i]);
    }
}
Output:
Match: LABAT
Match: BALAT
Match: LATAB
Change your pattern to the following, and compile it without Pattern.DOTALL so that . stops at line breaks:
String patternString = ".*(?=.*B)(?=.*L)(?=.*T).*";
Output
Match: LABAT
Match: BALAT
Match: LATAB
I did not debug your situation, but I think your problem is caused by matching the entire string rather than individual words.
You're matching "AL\nBAL\nBAK\nLABAT\n" plus some more. Of course that string has all the required characters. You can see it in the fact that your output only contains one Match: prefix.
Please have a look at this answer. You need to use Pattern.MULTILINE (instead of Pattern.DOTALL) so that ^ and $ match at each line boundary.
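A minimal sketch of the MULTILINE suggestion: the original lookahead pattern is kept unchanged, and only the flag is swapped from Pattern.DOTALL to Pattern.MULTILINE. The class name AllLettersDemo and the method wordsWithAllLetters are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllLettersDemo {
    // MULTILINE makes ^ and $ match at each line; without DOTALL, "." never crosses a line break,
    // so every lookahead only inspects the current word.
    private static final Pattern ALL_BTL =
            Pattern.compile("^(?=.*B)(?=.*T)(?=.*L).+$", Pattern.MULTILINE);

    // Hypothetical helper: returns the lines that contain B, T and L.
    static List<String> wordsWithAllLetters(String dictionary) {
        List<String> result = new ArrayList<>();
        Matcher m = ALL_BTL.matcher(dictionary);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String dictionary = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        for (String word : wordsWithAllLetters(dictionary)) {
            System.out.println("Match: " + word);
        }
    }
}
```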

Java regex to parse any number of Markdown-style links

I'm trying to parse a string for any occurrences of markdown style links, i.e. [text](link). I'm able to get the first of the links in a string, but if I have multiple links I can't access the rest. Here is what I've tried, you can run it on ideone:
Pattern p;
try {
p = Pattern.compile("[^\\[]*\\[(?<text>[^\\]]*)\\]\\((?<link>[^\\)]*)\\)(?:.*)");
} catch (PatternSyntaxException ex) {
System.out.println(ex);
throw(ex);
}
Matcher m1 = p.matcher("Hello");
Matcher m2 = p.matcher("Hello [world](ladies)");
Matcher m3 = p.matcher("Well, [this](that) has [two](too many) keys.");
System.out.println("m1 matches: " + m1.matches()); // false
System.out.println("m2 matches: " + m2.matches()); // true
System.out.println("m3 matches: " + m3.matches()); // true
System.out.println("m2 text: " + m2.group("text")); // world
System.out.println("m2 link: " + m2.group("link")); // ladies
System.out.println("m3 text: " + m3.group("text")); // this
System.out.println("m3 link: " + m3.group("link")); // that
System.out.println("m3 end: " + m3.end()); // 44 - I want 18
System.out.println("m3 count: " + m3.groupCount()); // 2 - I want 4
System.out.println("m3 find: " + m3.find()); // false - I want true
I know I can't have repeating groups, but I figured find would work, however it does not work as I expected it to. How can I modify my approach so that I can parse each link?
Can't you go through the matches one by one and do the next match from an index after the previous match? You can use this regex:
\[(?<text>[^\]]*)\]\((?<link>[^\)]*)\)
find() looks for the next match anywhere in the input, even if that match is only a substring of the entire string; each call to find() moves on to the next match. matches() tries to match the entire string and fails if it cannot. Use something like this:
while (m.find()) {
    String s = m.group(1);
    // s now contains the first group of the current match (the link text)
}
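Putting the shorter regex and the find() loop together might look like the sketch below, which uses the third test string from the question. The class name MarkdownLinkDemo and the method extractLinks are hypothetical, and the "text -> link" output format is chosen only for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkdownLinkDemo {
    // No leading or trailing "match everything" parts, so each find() consumes exactly one link.
    private static final Pattern LINK =
            Pattern.compile("\\[(?<text>[^\\]]*)\\]\\((?<link>[^)]*)\\)");

    // Hypothetical helper: returns one "text -> link" entry per Markdown-style link.
    static List<String> extractLinks(String input) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(input);
        while (m.find()) {
            links.add(m.group("text") + " -> " + m.group("link"));
        }
        return links;
    }

    public static void main(String[] args) {
        for (String link : extractLinks("Well, [this](that) has [two](too many) keys.")) {
            System.out.println(link);
        }
    }
}
```

groupCount() still reports 2 here, because it counts the groups in the pattern rather than the number of matches. The links are obtained by calling find() repeatedly.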
The regular expression I've used to match what you need (without groups) is \[\w+\]\(.+\)
It is kept simple just to show the idea. Basically it does the following:
Match an opening square bracket: \[
Followed by any word character (at least 1): \w+
Then the closing square bracket: \]
This will look for patterns like [blabla]
Then the same with parentheses...
Match an opening parenthesis: \(
Followed by any character (at least 1): .+
Then the closing parenthesis: \)
So it matches (ble...ble...)
Now if you want to store the matches on groups you can use additional parenthesis like this:
(\[\w+\])(\(.+\)) in this way you can have stored the words and links.
Hope to help.
I've tried on regexplanet.com and it's working
Update: workaround .*(\[\w+\])(\(.+\))*.*

Can you help with regular expressions in Java?

I have a bunch of strings which may or may not have random symbols and numbers in them. Some examples are:
contains(reserved[j])){
close();
i++){
letters[20]=word
I want to find any character that is NOT a letter, and replace it with a white space, so the above examples look like:
contains reserved j
close
i
letters word
What is the best way to do this?
It depends what you mean by "not a letter", but assuming you mean that letters are a-z or A-Z then try this:
s = s.replaceAll("[^a-zA-Z]", " ");
If you want to collapse multiple symbols into a single space then add a plus at the end of the regular expression.
s = s.replaceAll("[^a-zA-Z]+", " ");
yourInputString = yourInputString.replaceAll("[^\\p{Alpha}]", " ");
^ denotes "all characters except"
\p{Alpha} denotes all alphabetic characters
See Pattern for details.
I want to find any character that is NOT a letter
That will be [^\p{Alpha}]+. The [] denote a character class. \p{Alpha} matches any alphabetic character (both uppercase and lowercase; it is basically the same as [\p{Upper}\p{Lower}] or a-zA-Z). The ^ inside the class negates it. The + means one or more matches in sequence.
and replace it with a white space
That will be " ".
Summarized:
string = string.replaceAll("[^\\p{Alpha}]+", " ");
Also see the java.util.regex.Pattern javadoc for a concise overview of available patterns. You can learn more about regexes at the great site http://regular-expression.info.
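To see the effect on the four examples from the question, here is a small sketch using the a-zA-Z variant with the + quantifier. The trim() call is an addition not found in the answers above; it only removes the space left over when a string starts or ends with symbols. The class name LettersOnlyDemo and the method lettersOnly are made up.

```java
public class LettersOnlyDemo {
    // Replaces every run of non-letter characters with one space, then trims the ends.
    // trim() is an extra step assumed here to match the expected output in the question.
    static String lettersOnly(String s) {
        return s.replaceAll("[^a-zA-Z]+", " ").trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "contains(reserved[j])){",
            "close();",
            "i++){",
            "letters[20]=word"
        };
        for (String sample : samples) {
            System.out.println(lettersOnly(sample));
        }
    }
}
```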
Use the regexp /[^a-zA-Z]/, which matches every character that is not in the ranges a-z or A-Z.
In Ruby I would do:
"contains(reserved[j]))".gsub(/[^a-zA-Z]/, " ")
=> "contains reserved j "
In Java it should be something like:
import java.util.regex.*;
...
String inputStr = "contains(reserved[j])){";
String patternStr = "[^a-zA-Z]";
String replacementStr = " ";
// Compile regular expression
Pattern pattern = Pattern.compile(patternStr);
// Replace all occurrences of pattern in input
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll(replacementStr);
