Java regex to parse any number of Markdown-style links


I'm trying to parse a string for any occurrences of markdown style links, i.e. [text](link). I'm able to get the first of the links in a string, but if I have multiple links I can't access the rest. Here is what I've tried, you can run it on ideone:
Pattern p;
try {
    p = Pattern.compile("[^\\[]*\\[(?<text>[^\\]]*)\\]\\((?<link>[^\\)]*)\\)(?:.*)");
} catch (PatternSyntaxException ex) {
    System.out.println(ex);
    throw ex;
}
Matcher m1 = p.matcher("Hello");
Matcher m2 = p.matcher("Hello [world](ladies)");
Matcher m3 = p.matcher("Well, [this](that) has [two](too many) keys.");
System.out.println("m1 matches: " + m1.matches()); // false
System.out.println("m2 matches: " + m2.matches()); // true
System.out.println("m3 matches: " + m3.matches()); // true
System.out.println("m2 text: " + m2.group("text")); // world
System.out.println("m2 link: " + m2.group("link")); // ladies
System.out.println("m3 text: " + m3.group("text")); // this
System.out.println("m3 link: " + m3.group("link")); // that
System.out.println("m3 end: " + m3.end()); // 44 - I want 18
System.out.println("m3 count: " + m3.groupCount()); // 2 - I want 4
System.out.println("m3 find: " + m3.find()); // false - I want true
I know I can't have repeating groups, but I figured find would work, however it does not work as I expected it to. How can I modify my approach so that I can parse each link?

Can't you go through the matches one by one and do the next match from an index after the previous match? You can use this regex:
\[(?<text>[^\]]*)\]\((?<link>[^\)]*)\)
find() looks for the next substring that matches the pattern, and each call resumes where the previous match ended. matches() succeeds only if the entire string matches the pattern. In your code, matches() has already consumed the whole input, which is why the following find() returns false. Use something like this:
while (m.find()) {
    String s = m.group(1);
    // s now contains the link text, e.g. "this"
}
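A minimal sketch of that loop applied to the question's third input. The class name MarkdownLinkFinder and the method findLinks are made up for illustration; the pattern is the one suggested above, and the named groups text and link come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkdownLinkFinder {
    // Pattern from the answer: no leading [^\[]* and no trailing (?:.*),
    // so each match covers exactly one [text](link) occurrence.
    private static final Pattern LINK =
            Pattern.compile("\\[(?<text>[^\\]]*)\\]\\((?<link>[^\\)]*)\\)");

    /** Returns every link as "text -> link", in order of appearance. */
    static List<String> findLinks(String input) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(input);
        // Each find() call resumes after the end of the previous match.
        while (m.find()) {
            links.add(m.group("text") + " -> " + m.group("link"));
        }
        return links;
    }

    public static void main(String[] args) {
        // prints [this -> that, two -> too many]
        System.out.println(findLinks("Well, [this](that) has [two](too many) keys."));
    }
}
```

Because the pattern no longer tries to cover the whole string, use find() with it rather than matches().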

The regular expression I've used to match what you need (without groups) is \[\w+\]\(.+\)
It is deliberately simple, just to show the idea. Basically it does the following:
Match an opening square bracket: \[
Followed by word characters (at least 1): \w+
Then the closing square bracket: \]
This looks for patterns like [blabla]
Then the same with parentheses...
Match an opening parenthesis: \(
Followed by any characters (at least 1): .+
Then the closing parenthesis: \)
So it matches (ble...ble...)
Note that .+ is greedy, so if a line contains several links it runs to the last closing parenthesis; [^)]+ avoids that.
Now if you want to store the matches in groups you can use additional parentheses like this:
(\[\w+\])(\(.+\)) so that the words and the links are captured.
Hope this helps.
I've tried it on regexplanet.com and it's working.
Update: workaround .*(\[\w+\])(\(.+\))*.*

Related

How do I replace a certain char in between 2 strings using regex

I'm new to regex and have been trying to work this out on my own, but I can't get it working. I have an input that contains start and end flags, and I want to replace a certain char, but only if it's between the flags.
For example, the start flag is START, the end flag is END, the char I'm trying to replace is " and I want to replace it with \"
I would say input.replaceAll(regex, '\\\"');
I tried writing a regex that matches only the correct " chars, but so far I have only been able to match all chars between the flags rather than just the " chars: (?<=START)(.*)(?=END)
Example input:
This " is START an " example input END string ""
START This is a "" second example END
This" is "a START third example END " "
Expected output:
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "

Find all characters between START and END, and for those characters replace " with \".
To achieve this, apply a replacer function to all matches of characters between START and END:
string = Pattern.compile("(?<=START).*?(?=END)").matcher(string)
.replaceAll(mr -> mr.group().replace("\"", "\\\\\""));
which produces your expected output.
Some notes on how this works.
The first step is to match all characters between START and END, which uses lookarounds with a reluctant quantifier:
(?<=START).*?(?=END)
The ? after the .* changes the match from greedy (as many chars as possible while still matching) to reluctant (as few chars as possible while still matching). This prevents the middle quote in the following input from being altered:
START a"b END c"d START e"f END
A greedy quantifier will match from the first START all the way past the next END to the last END, incorrectly including c"d.
The next step is to replace " with \" in each match. The full match is group 0, or just MatchResult#group(), and we don't need regex for this replacement: a plain String#replace is enough (and yes, replace() replaces all occurrences).
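A self-contained sketch of the approach described above, assuming Java 9 or later (Matcher.replaceAll with a function was added in Java 9). The class name QuoteEscaper and the method escapeQuotes are hypothetical. One deliberate difference from the snippet above: instead of hand-escaping the backslash for the replacement syntax, it wraps the result in Matcher.quoteReplacement, so any $ or \ already present between the flags is also treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteEscaper {
    // Reluctant match of everything between a START and the nearest following END.
    private static final Pattern BETWEEN_FLAGS = Pattern.compile("(?<=START).*?(?=END)");

    /** Replaces " with \" only inside START ... END sections. */
    static String escapeQuotes(String input) {
        return BETWEEN_FLAGS.matcher(input).replaceAll(
                // Plain string replace on the matched text, then quote the result so
                // replaceAll does not interpret backslashes or dollar signs in it.
                mr -> Matcher.quoteReplacement(mr.group().replace("\"", "\\\"")));
    }

    public static void main(String[] args) {
        // prints START a\"b END c"d START e\"f END
        System.out.println(escapeQuotes("START a\"b END c\"d START e\"f END"));
    }
}
```

Since . does not match line terminators by default, a multi-line input is handled line by line without extra flags.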

For now I've been able to solve it by creating 3 capture groups and repeatedly replacing the match until there are no more matches left. I even had to insert a placeholder, because replacing with \" would leave the " char in place and create an infinite loop. Once there are no more matches left, I replace the placeholder and get the expected result.
I still feel like there has to be a much cleaner way to do this using only 1 replace statement...
Code that worked for me:
class Playground {
    public static void main(String[] args) {
        String input = "\"ThSTARTis is a\" te\"\"stEND \" !!!";
        String regex = "(.*START.+)\"+(.*END+.*)";
        while (input.matches(regex)) {
            input = input.replaceAll(regex, "$1---replace---$2");
        }
        String result = input.replace("---replace---", "\\\"");
        System.out.println(result);
    }
}
Output:
"ThSTARTis is a\" te\"\"stEND " !!!
I would love any suggestions as to how I could solve this in a better/cleaner way.

Another option is to make use of the \G anchor with 2 capture groups. In the replacement use the 2 capture groups followed by \"
(?:(START)(?=.*END)|\G(?!^))((?:(?!START|END)(?>\\+\"|[^\r\n\"]))*)\"
Explanation
(?: Non capture group
(START)(?=.*END) Capture group 1, match START and assert there is END to the right
| Or
\G(?!^) Assert the current position at the end of the previous match
) Close non capture group
( Capture group 2
(?: Non capture group
(?!START|END) Negative lookahead, assert not START or END directly to the right
(?>\\+\"|[^\r\n\"]) Match 1+ times \ followed by " or match any char except " or a newline
)* Close the non capture group and optionally repeat it
) Close group 2
\" Match "
See a Java regex demo and a Java demo
For example:
String regex = "(?:(START)(?=.*END)|\\G(?!^))((?:(?!START|END)(?>\\\\+\\\"|[^\\r\\n\\\"]))*)\\\"";
String string = "This \" is START an \" example input END string \"\"\n"
+ "START This is a \"\" second example END\n"
+ "This\" is \"a START third example END \" \"";
String subst = "$1$2\\\\\"";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
String result = matcher.replaceAll(subst);
System.out.println(result);
Output
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "

Why doesn't the regex match

I have written the following code:
public static void main(String[] args) {
    // String to be scanned to find the pattern.
    String line = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n"
            + "'7859','1194','FIRM','21'";
    String pattern = "^'*','*','*','*'$";
    // Create a Pattern object
    Pattern r = Pattern.compile(pattern);
    // Now create matcher object.
    Matcher m = r.matcher(line);
    if (m.find()) {
        System.out.println("Found value: " + m.group(0));
        System.out.println("Found value: " + m.group(1));
    } else {
        System.out.println("NO MATCH");
    }
}
It always returns NO MATCH.
Expected result: 2 rows.
What am I doing wrong?

There are several problems in your code:
In a regex, a "star" (*) matches the preceding character zero or more times. In your pattern, '*' therefore means "zero or more single quotes, followed by another single quote", not "any text between quotes".
Also, the star quantifier is "greedy" by default, meaning it will consume as many matching chars as possible, including the closing quote of your groups. In your case, you may want to make it "reluctant" (by appending a ? to it: *?), so that it matches only the text inside the single quotes.
The lines must be matched one by one, so the multi-line input must be split on the line-separator character (\n), unless you use the multi-line match option, but I think that is not what you want here.
Capturing groups are numbered from 1 (group 0 is the whole match), so the groups are numbered 1 to 4 in your case.
Here is your code, corrected as explained above:
public static void main(String[] args) {
    String line = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n" +
            "'7859','1194','FIRM','21'";
    Pattern r = Pattern.compile("'(.*?)','(.*?)','(.*?)','(.*?)'");
    String[] lines = line.split("\n");
    for (String l : lines) {
        System.out.println("Line : " + l);
        Matcher m = r.matcher(l);
        if (m.find()) {
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
            System.out.println("Found value: " + m.group(3));
            System.out.println("Found value: " + m.group(4));
        } else {
            System.out.println("NO MATCH");
        }
    }
}
And here is the result :
Line : '7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'
Found value: 7858
Found value: 1194
Found value: FSP,FRB,FWF,FBVS,FRRC
Found value: 15
Line : '7859','1194','FIRM','21'
Found value: 7859
Found value: 1194
Found value: FIRM
Found value: 21

"^'*','*','*','*'$" does not match anything because '* matches zero or more ' characters, not the text between the quotes.
Also, without the MULTILINE flag, ^ and $ match only at the start and end of the whole input, not at each line.
I think that this regex is what you need:
'[0-9A-Z,]*','[0-9A-Z,]*','[0-9A-Z,]*','[0-9A-Z,]*'
Here I have added the character class [0-9A-Z,] to match numbers, letters and ,s. I think that this will give you what you need.

You could try with this expression:
(?<=^|[\r\n]+)'([^']*)','([^']*)','([^']*)','([^']*)'(?=[\r\n]+|$)
Breakdown:
(?<=^|[\r\n]+) is a positive look-behind checking for either the start of the input or a sequence of linebreak characters
'([^']*)' matches and captures one of your groups. You could use '(.*?)' (i.e. a reluctant quantifier) instead, but the former version is safer since it won't match if your input lines contain more than 4 groups
(?=[\r\n]+|$) is a positive look-ahead checking that your groups are followed by either a sequence of linebreak characters or the end of the input sequence
I also made the following assumptions about your code:
Your input contains multiple lines which you can't or don't want to split (otherwise String[] lines = input.split("[\\r\\n]+") would be better).
A matching line always consists of 4 groups which you want to access using group(1) etc.
Your groups can contain any character except a single quote. If a group is only allowed to contain certain characters (e.g. digits), it would be safer to reflect that in the expression (e.g. '[0-9]+')
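For the case where the input is not split, the multi-line option mentioned in an earlier answer can be sketched as follows. With Pattern.MULTILINE, ^ and $ match at each line boundary, so the original anchors can stay. The class name QuotedRowParser and the method parseRows are hypothetical, and the sketch assumes exactly 4 quoted fields per line with no single quotes inside a field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedRowParser {
    // MULTILINE makes ^ and $ match at the start and end of every line.
    private static final Pattern ROW = Pattern.compile(
            "^'([^']*)','([^']*)','([^']*)','([^']*)'$", Pattern.MULTILINE);

    /** Returns one String[4] per matching line of the input. */
    static List<String[]> parseRows(String input) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(input);
        while (m.find()) {
            String[] fields = new String[m.groupCount()];
            // Capturing groups are numbered from 1; group 0 is the whole line.
            for (int i = 1; i <= m.groupCount(); i++) {
                fields[i - 1] = m.group(i);
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n"
                + "'7859','1194','FIRM','21'";
        for (String[] row : parseRows(input)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```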

Regular Expressions - Find hexadecimal numbers' matches excluding the 0 of the next hexadecimal number

Goal - I need to retrieve all hexadecimal numbers from my input String.
Example inputs and matches --
1- Input = "0x480x8600x89dfh0x89BABCE" (The "" are not included in the input).
should produce following matches:
0x48 ( as opposed to 0x480)
0x860
0x89df
0x89BABCE
I have tried this Pattern:
"0[xX][\\da-fA-F]+"
But it results in the following matches:
0x480
0x89df
0x89BABCE
2- Input = "0x0x8600x89dfh0x89BABCE" (The "" are not included in the input).
Should produce following matches:
0x860
0x89df
0x89BABCE
Is such a regex possible?
I know that I can first split my input using the String.split("0[xX]"), and then for each String I can write logic to retrieve the first valid match, if there is one.
But I want to know if I can achieve the desired result using just a Pattern and a Matcher.
Here's my current code.
package toBeDeleted;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    public static void main(String[] args) {
        String pattern = "0[xX][\\da-fA-F]+";
        // Same input as example 1 above
        String input = "0x480x8600x89dfh0x89BABCE";
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println("Starting at : " + m.start()
                    + ", Ending at : " + m.end()
                    + ", element matched : " + m.group());
        }
    }
}

Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
-- Jamie Zawinski
If you just use .split("0x"), and tack 0x back on each (non-empty) result, you'll be done.

You can use a lookahead to check that a "0" digit is not followed by an "x", i.e. that it is not the start of the next hex number:
String pattern = "0[xX]([1-9a-fA-F]|0(?![xX]))+";
This does not produce 0x0 as a match for the second example. However, you stated that matches should exclude the "0" that begins the next hex number, so I'm not sure why it would be expected.
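A short sketch that runs the lookahead pattern against both example inputs from the question. The class name HexFinder and the method findHex are made up for illustration, and the capturing group is written as a non-capturing one since its content is not used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexFinder {
    // A digit is either a non-zero hex digit, or a 0 that is NOT followed by x/X
    // (a 0 followed by x belongs to the next number's "0x" prefix).
    private static final Pattern HEX =
            Pattern.compile("0[xX](?:[1-9a-fA-F]|0(?![xX]))+");

    static List<String> findHex(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = HEX.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [0x48, 0x860, 0x89df, 0x89BABCE]
        System.out.println(findHex("0x480x8600x89dfh0x89BABCE"));
        // prints [0x860, 0x89df, 0x89BABCE]
        System.out.println(findHex("0x0x8600x89dfh0x89BABCE"));
    }
}
```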

Simple java Regex doesn't

When I use this code, I don't get the expected result:
pattern = Pattern.compile("create\\stable\\s(\\w*)\\s\\(", Pattern.CASE_INSENSITIVE);
matcher = pattern.matcher("create table CONTACT (");
if (matcher.matches()) {
    for (int i = 0; i < matcher.groupCount(); i++) {
        System.out.println("table : " + matcher.group(i) + matcher.start(i) + " - " + matcher.end(i));
    }
}
I expect to catch CONTACT, but the regex catches the whole expression "create table CONTACT (".
Does anyone have an idea what the problem is?
Thanks

The regex engine actually counts the entire regex as a group. The first group in your regex is actually the second group returned by the match, which is at index 1.
If you ignore the first group, then you should find what you're looking for in the second.
The reason that the group isn't printed by your code is that groupCount() doesn't count the entire regex as a group, so your loop runs only once, for index 0. From the Javadoc of groupCount():
"Group zero denotes the entire pattern by convention. It is not included in this count."
You probably don't need a loop, and you can just extract the desired string directly with group(1).
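A minimal sketch of both options, using the pattern and input from the question. The class name TableNameExtractor and the method tableName are hypothetical. The loop runs from 1 to groupCount() inclusive, so it visits only the capturing groups and skips group 0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableNameExtractor {
    private static final Pattern CREATE_TABLE =
            Pattern.compile("create\\stable\\s(\\w*)\\s\\(", Pattern.CASE_INSENSITIVE);

    /** Returns the table name, or null if the statement does not match. */
    static String tableName(String sql) {
        Matcher m = CREATE_TABLE.matcher(sql);
        // group(0) would be the whole statement; group(1) is the first (...) group.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(tableName("create table CONTACT (")); // prints CONTACT

        // If a loop is wanted, start at 1 and include groupCount() itself.
        Matcher m = CREATE_TABLE.matcher("create table CONTACT (");
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                System.out.println("table : " + m.group(i) + " " + m.start(i) + " - " + m.end(i));
            }
        }
    }
}
```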

Capturing group numbers start from 1, not from 0 (group 0 is the entire match).
Following expression:
matcher.group(i)
should be replaced with:
matcher.group(i+1)
Or simply print group 1 if you want to print only one group:
System.out.println("table: " + matcher.group(1));

Java pattern for [j-*]

Please help me with pattern matching. I want to build a pattern that matches the words starting with j- or c- in a string such as this one:
[j-test] is a [c-test]'s name with [foo] and [bar]
The pattern needs to find [j-test] and [c-test] (brackets inclusive).
What I have tried so far?
String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
Pattern patt = Pattern.compile("\\[[*[j|c]\\-\\w\\-\\+\\d]+\\]");
Matcher m = patt.matcher(template);
while (m.find()) {
System.out.println(m.group());
}
And it's giving output like
[j-test]
[c-test]
[foo]
[bar]
which is wrong. Please help me, thanks for your time on this thread.

Inside a character class, you don't need alternation to match j or c. A character class by itself means "match any single character from the ones listed", so [jc] will match either j or c.
Also, you don't need to spell out what comes after j- or c-, since you don't care about it as long as the word starts with j- or c-.
Simply use this pattern:
Pattern patt = Pattern.compile("\\[[jc]-[^\\]]*\\]");
To explain:
Pattern patt = Pattern.compile("(?x) " // Embedded flag for Pattern.COMMENTS
+ "\\[ " // Match starting `[`
+ " [jc] " // Match j or c
+ " - " // then a hyphen
+ " [^ " // A negated character class
+ " \\]" // Match any character except ]
+ " ]* " // 0 or more times
+ "\\] "); // till the closing ]
With the (?x) flag, whitespace in the regex is ignored. It is often helpful for writing readable regexes.
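A runnable sketch applying the compact pattern to the template string from the question. The class name BracketTagFinder and the method findTags are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketTagFinder {
    // "[", then j or c, then "-", then anything up to the closing "]".
    private static final Pattern TAG = Pattern.compile("\\[[jc]-[^\\]]*\\]");

    static List<String> findTags(String template) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(template);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
        // prints [[j-test], [c-test]]
        System.out.println(findTags(template));
    }
}
```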
