Regex code not collecting multiple lines of matching pattern - java

I'm new to using regex and I was hoping that someone could help me with this.
I have this regex code which is supposed to identify tab groups in a tablature file. It works on regex testing websites such as regexr.com, regextester.com, and extendsclass.com/regex-tester, but when I code it in java using the example text shown below, I am given each individual line as its own separate group, instead of 4 groups containing all the text which are separated only by one newline.
I have read through the Stack Overflow thread "Regular expression works on regex101.com, but not on prod" and have been careful to avoid string literal problems and multiline problems. I've also tried the regex with the other engines on regex101 and it worked there, but it still does not work in my Java code shown below.
I tried enabling the multiline flag but it still doesn't work. I thought it was a problem with my code, but then I got the same wrong output on other regex tester websites: myregexp.com and freeformatter.com/java-regex-tester.
Here is the original regex. It is long, so it might be easier to use the simplified regex below, as they both have the same problem I was talking about:
RealRegexCode = (^|[\n\r])(((?<=^|[\n\r])[^\S\n\r]*\|*[^\S\n\r]*((E|A|D|G|B|e|a|d|g|b)[^\S\n\r]*\|*(?=(([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))[|\r\n]|$)))((([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))\|)+(((?<=\|)[^\S\n\r]*((E|A|D|G|B|e|a|d|g|b)[^\S\n\r]*\|*(?=(([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))[|\r\n]|$)))((([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))\|)+)*(\n|\r|$))+
Here is a simplified regex code that displays the same problem, provided for the sake of debugging
SimplifiedRegexCode = (^|[\n\r])([^\n\r]+(\n|\r|$))+
Here is the code that finds the matches using the regex pattern:
public static void main(String[] args) {
    // Every backslash in a Java string literal must be doubled ("\p" is not a valid escape).
    String filePath = "C:\\Users\\stani\\IdeaProjects\\project\\src\\testing files\\guitar - a thousand matches by passenger.txt";
    Path path = Path.of(filePath);
    List<String> stuff = new ArrayList<>();
    try {
        String rootStr = Files.readString(path);
        Pattern pattern = Pattern.compile("(^|[\\n\\r])([^\\n\\r]+(\\n|\\r|$))+");
        Matcher ptrnMatcher = pattern.matcher(rootStr);
        // Collect every full match of the pattern.
        while (ptrnMatcher.find()) {
            stuff.add(ptrnMatcher.group());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println(new Patterns().MeasureGroupCollection);
    for (String s : stuff)
        System.out.println(s);
}
And here is the text I was testing it with. It might help to copy and paste this in a text editor as stack overflow might distort how the text looks:
e|---------------------------------|------------------------------------|
e|------------------------------------------------------------------|
B|-----1--------(1)----1-----------|-------1---------------1----------1-|
B|-----1--------(1)----0---------0-----1---------1-----3--------(3)-|
G|-----------0------------0--------|-------------0----------------0-----|
G|-----------0---------------0---------------0---------------0------|
D|-----0h2-----2-------2-----------|-------2-------2-------0--------0---|
D|-----2-------2-------2-------2-------2-------2-------0-------0----|
A|-3-------3-------3-------3-------|------------------------------------|
A|-0-------0--------------------------------------------------------|
E|-----------------------------0---|---1-------1-------3-------3--------|
E|-----------------0-------0--------1------1-------3-------3--------|
e|-------------------------------------------------------------------|
B|-----1---------1-----1---------1-----3---------3-------1---------1-|
G|-----------0---------------0---------------0-----------------0-----|
D|-----3-------2-------2-------2-------0-------0---------2-------2---|
A|-----------------3-------3-------------------------3-------3-------|
E|-1-------1-----------------------3-------3-------------------------|
It should identify four different groups from the text. However, in Java and in the two testers I mentioned above, it recognizes each line as its own separate group (i.e., 12 groups).

I couldn't help but respond to this as I am familiar with both regex and guitar haha.
For your short regex, please see the following regex on regex101.com:
https://regex101.com/r/NqGhoh/1/
The multiline modifier is required.
The main problem with this is that you are handling newlines on the front and back of the expression. I have modified the expression in a couple ways:
Made the regex match newlines only on the end, always looking for a ^ at the beginning.
Matching the carriage return new line combination as \r?\n as a carriage return should always be followed by a newline when it is used.
Used non-capturing groups to reduce overhead and complexity when looking at matches. This is the ?: just inside the parentheses. It means the group won't be captured in the result, just used for grouping.
I started testing your longer regex and may update that as well, though it sounds like you already know what to do with the shorter one corrected.
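The changes listed above can be sketched in Java roughly as follows. The pattern here is my own reconstruction of the idea, not necessarily the exact expression behind the regex101 link, and the class name, method name, and sample text are made up for illustration. It also assumes the file uses Windows line endings (\r\n), which would explain the one-line-per-match output: the original pattern consumes only the \r, and the next line cannot start on the leftover \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TabGroupFinder {
    // Simplified pattern from the question, compiled without any flags.
    static final Pattern ORIGINAL =
            Pattern.compile("(^|[\\n\\r])([^\\n\\r]+(\\n|\\r|$))+");

    // Assumed rewrite: anchor on ^ (MULTILINE), consume the line break only at
    // the end of each line as \r?\n, and use non-capturing groups throughout.
    static final Pattern REWRITTEN =
            Pattern.compile("^(?:[^\\r\\n]+(?:\\r?\\n|$))+", Pattern.MULTILINE);

    // Returns every full match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Two groups of two lines, separated by a blank line, with CRLF endings.
        String text = "e|--1--|\r\nB|--0--|\r\n\r\ne|--3--|\r\nB|--2--|";
        System.out.println(findAll(ORIGINAL, text).size());  // prints 4 (one per line)
        System.out.println(findAll(REWRITTEN, text).size()); // prints 2 (one per group)
    }
}
```

If the file only ever contains bare \n line endings, the \r? part is harmless, so the same pattern should work on both kinds of files.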

Related

Replacing substrings in String

I am 16 and trying to learn Java. I have a paper that my uncle gave me that has things to do in Java. One of these things is to write and execute a program that will accept an extended message as a string such as
Each time she saw the painting, she was happy
and replace the word she with the word he.
Each time he saw the painting, he was happy.
This part is simple, but he wants me to be able to take any form of she and replace it with he (she to he, She to He, she? to he?, she. to he., she' to he' and so on). Can someone help me make a program to accomplish this?
I have this
public static void main(String[] args) {
Scanner keyboard = new Scanner(System.in);
System.out.println("Write Sentence");
String original = keyboard.nextLine();
String changeWord = "he";
String modified = original.replaceAll("she", changeWord);
System.out.println(modified);
}
If this isn't the right site to find answers like this, can you redirect me to a site that answers such questions?
The best way to do this is with regular expressions (regex). Regex allow you to match patterns or classes of words so you can deal with general cases. Consider the cases you have already listed:
(she to he, She to He, she? to he?, she. to he., she' to he' and so on)
What is common between these cases? Can you think of some general rule(s) that would apply to all such transformations?
But also consider some cases you haven't listed: for example, as you've written it now, your code will change the word "ashes" to "ahes" because "ashes" contains "she". A properly written regular expression allows you to avoid this.
Before delving into regex, try and express, in plain English, a rule or set of rules for what you want to replace and what it should be replaced with.
Then, learn some regex and attempt to apply those rules.
Lastly, try and write some tests (i.e. using JUnit) for various cases so you can see which cases your code is working for and which cases it isn't working for.
Once you have done this, if something still doesn't work, feel free to post a new question here showing us your code and explaining what doesn't work. We'll be happy to help.
I would recommend this regular expression to solve this. It seems you have to search and replace separately the uppercase S and the lowercase s
String modified = original
.replaceAll("(she)(\\W)", "he$2")
.replaceAll("(She)(\\W)", "He$2");
Explanation :
The pattern (she) will match the word she and store it as the first captured group of characters
The pattern (\\W) will match one non-word character, i.e. anything other than a letter, digit, or underscore (e.g. ', .), and store it as the second captured group of characters
Both of these patterns must match consecutive parts of the input string for replaceAll to replace something.
"he$2" puts the word he into the resulting string, followed by the second captured group of characters (in our case the group has only one character)
The above means that the regular expression will match a pattern like She'll and replace it with He'll, but it will not match a pattern like Sherlock because there She is followed by the word character r. Note that it also will not match a she at the very end of the input, because \\W needs an actual character to match.
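As a variation on the answer above (not the answerer's own code), here is a minimal sketch that swaps the (\\W) group for the word boundary \\b. A boundary matches a position rather than a character, so it also covers she at the start or end of the input and does not touch words like "ashes". The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class PronounSwap {
    // \b matches the position between a word character and a non-word
    // character (or the start/end of the input); it consumes nothing.
    private static final Pattern LOWER = Pattern.compile("\\bshe\\b");
    private static final Pattern UPPER = Pattern.compile("\\bShe\\b");

    // Replaces the whole words "she" and "She", leaving longer words alone.
    static String replaceShe(String text) {
        String result = LOWER.matcher(text).replaceAll("he");
        return UPPER.matcher(result).replaceAll("He");
    }

    public static void main(String[] args) {
        // prints: Each time he saw the painting, he was happy
        System.out.println(replaceShe("Each time she saw the painting, she was happy"));
        // prints: He'll scatter the ashes, said he
        System.out.println(replaceShe("She'll scatter the ashes, said she"));
    }
}
```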

Java Regular Expression not evaluating

I have a string that changes frequently, in the form of:
*** START OF THIS PROJECT FILENAME ***
Where FILENAME can be multiple words in different instances. I tried running the regex :
Pattern.matches("\\*\\*\\* START OF THIS PROJECT ", line);
where line is equal to one such string.
I also tried using the Matcher, where beginningOfFilePattern is also set to the same regex pattern above:
Matcher beginFileAccelerator;
beginFileAccelerator = beginningOfFilePattern.matcher(line);
if (beginFileAccelerator.find()) {
    // Do something
}
I've exhaustively tried at least 30 different combinations of regex, and I simply can't find the solution. If anyone could lend me an eye I would greatly appreciate it.
Pattern.matches tries to match the entire string against the pattern, because under the covers it uses Matcher#matches, which says:
Attempts to match the entire region against the pattern.
In your case, that will fail at the end, because the input doesn't end with "PROJECT ". It has more after that.
To allow anything at the end, add .*:
Pattern.matches("\\*\\*\\* START OF THIS PROJECT .*", line)
// Here -----------------------------------------^
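To make the difference concrete, here is a small sketch comparing matches(), lookingAt(), and find() on a line of the form given in the question. The class and method names are invented for the example; the Matcher methods themselves are standard java.util.regex.

```java
import java.util.regex.Pattern;

final class StartMarker {
    private static final Pattern PREFIX =
            Pattern.compile("\\*\\*\\* START OF THIS PROJECT ");

    // True only if the ENTIRE line is matched by the pattern.
    static boolean wholeLineMatches(String line) {
        return PREFIX.matcher(line).matches();
    }

    // True if the pattern matches at the start of the line; the rest is ignored.
    static boolean startsWithMarker(String line) {
        return PREFIX.matcher(line).lookingAt();
    }

    // True if the pattern matches anywhere in the line.
    static boolean containsMarker(String line) {
        return PREFIX.matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "*** START OF THIS PROJECT FILENAME ***";
        System.out.println(wholeLineMatches(line)); // prints false
        System.out.println(startsWithMarker(line)); // prints true
        System.out.println(containsMarker(line));   // prints true
        // Appending .* lets matches() accept whatever follows the prefix.
        System.out.println(Pattern.matches("\\*\\*\\* START OF THIS PROJECT .*", line)); // prints true
    }
}
```

Since find() already succeeds on such a line, a failing find() in the original code would suggest that line does not contain quite what is expected (different spacing, for example), which is worth printing out and checking.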

Simple Java regular expression matching fails

Before y'all jump on me for posting something similar to previous questions asked, yes, there seem to be a number of regex related questions but nothing which seems to help me, or at least that I can see.
I am trying to parse strings in JAVA using PATTERN and MATCHER and am really having no joy. My regular expression seems to match my input string when I use a few of the online regular expression testing websites but Java simply does not match my expression.
My input string is:
"Big apple" title="Little Apple" type="Container" url="http://malcolm.com/testing"
The regular expression I am using to match is ".*" title="(.*)" type="Container" url="(.*)"
Essentially I want to pull out the text within the second and the fourth set of quotes. There will always be 4 sets of quotes with text within and around.
I am coding as follows:
Variable XMLSubstring contains the string above (including the quotes) and is as stated, even when I print it out.
Pattern p = Pattern.compile(".* title=\"(.*)\" type=\"Container\" url=\"(.*)\"");
m = p.matcher(XMLSubstring);
It doesn't appear to be rocket science I'm attempting but I'm pulling my hair out trying to debug the bloody thing.
Is there something wrong with my regex pattern?
Is there something wrong with the code I am using?
Am I simply a moron and should stop coding with immediate effect?
EDIT & UPDATE: I have found the problem. My string had a space at the end of it which was breaking the parser! How silly, and I think based on that, I need to accept the third suggestion of mine and give up programming. Thanks all for your assistance.
Try this,
String str="\"Big apple\" title=\"Little Apple\" type=\"Container\" url=\"http://malcolm.com/testing\"";
Pattern p=Pattern.compile(".* title=\\\".*\\\" type=\\\"Container\\\" url=\\\".*\\\"");
Matcher m=p.matcher(str);
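The trailing-space problem from the question's update can be reproduced with a short sketch. The question does not show how the match was run, so this assumes matches() was used (it requires the whole input to match); the class and method names are made up. Trimming the input first sidesteps the stray space, and the two capture groups give the title and URL.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeParser {
    // Pattern from the question, with capture groups for title and url.
    private static final Pattern ATTRS =
            Pattern.compile(".* title=\"(.*)\" type=\"Container\" url=\"(.*)\"");

    // Assumption: the original code used matches(), which fails on a trailing space.
    static boolean rawMatches(String input) {
        return ATTRS.matcher(input).matches();
    }

    // Trims surrounding whitespace, then returns [title, url], or an empty list.
    static List<String> parse(String input) {
        Matcher m = ATTRS.matcher(input.trim());
        if (!m.matches()) {
            return List.of();
        }
        return List.of(m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        String s = "\"Big apple\" title=\"Little Apple\" type=\"Container\" "
                + "url=\"http://malcolm.com/testing\" "; // note the trailing space
        System.out.println(rawMatches(s)); // prints false
        System.out.println(parse(s));      // prints [Little Apple, http://malcolm.com/testing]
    }
}
```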

Using regular expression with Java (some specific characters)

I have this example:
String str = "HellMCo I fiCZMnd thBVMis site intZereVCsting";
String tags = "BCMVZ";
I need a regular expression that helps me to find every combination of tags. As you can see in str we find four variations. I don't know too much about regular expressions.
I'm starting to test with this pattern:
(\d{,1}[BCMVZ])
PS: I'm testing here http://regexpal.com/ but my pattern doesn't work.
So my real question is, how can I detect any variation of any character from another string?
Maybe try something like:
[BCMVZ]+
It finds any run of one or more of the characters BCMVZ.
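In Java, the character class can be built from the tags string and applied with a Matcher. This is only a sketch: the class name, method name, and minLength parameter are mine, and it assumes tags contains only plain letters (characters such as ], ^, or - would need escaping inside the brackets).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagRuns {
    // Finds every run of at least minLength consecutive tag characters.
    // Assumption: tags holds only letters, so no escaping is needed in [...].
    static List<String> findRuns(String text, String tags, int minLength) {
        Pattern pattern = Pattern.compile("[" + tags + "]{" + minLength + ",}");
        List<String> runs = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        String str = "HellMCo I fiCZMnd thBVMis site intZereVCsting";
        String tags = "BCMVZ";
        System.out.println(findRuns(str, tags, 1)); // prints [MC, CZM, BVM, Z, VC]
        System.out.println(findRuns(str, tags, 2)); // prints [MC, CZM, BVM, VC]
    }
}
```

With [BCMVZ]+ the lone Z in "intZereVCsting" counts as a match too, giving five results; requiring at least two characters gives the four variations mentioned in the question.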

Regular Expressions to match an <a> tag

I am writing a small java program for a class, and I can't quite figure out why my regex isn't working properly. In the special case of having 2 tags on the same line that is read in, it only matches the second one.
Here is a link that has the regex included, along with a simple set of test data:
Regex Test Link.
In my java program I have the following code:
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String[] results;
System.out.println(p.toString());
Matcher m = null;
while((line = input.readLine()) != null) {
m = p.matcher(line);
while(m.find()) {
System.out.println("Matches: " + m.group(1));
}
}
The goal is to extract the href value, as long as it starts with http://, the website ends in either no page (like http://www.google.com) or ends in index.htm or index.html (like http://www.google.com/index.html).
My regex works for every case of the above, but doesn't match in the special case of 2 tags that are on the same line.
Any help is appreciated.
Just use a proper HTML parsing library, such as HTML cleaner. It is theoretically impossible to properly parse HTML with a regex - there are so many constructs that will confound it. For example:
<![CDATA[ > bar ]]>
This is not a link. This is literal text in XHTML.
baz
This is only one link.
<a rel="next" href="bar?2">Next</a>
This is a realistic example of a link with a relation attribute and a relative URI.
<a name="foo">The href="http://example.com" part is the link destination...</a>
This is a named anchor, not a link. However your regex would parse out the literal text here as a link.
Foo
Does your regex handle line-spanning links properly?
There are all kinds of other Fun edge cases that can occur. Save yourself time and headaches. These problems have already been solved and wrapped up in nice neat libraries for you to use. Take advantage of this.
Regexes may be a powerful tool, but as they say - when all you have is a hammer, everything looks like a nail. You are currently trying to hammer in a screw.
This worked for me in that regex tester page
<a[^>]*>[^<]*</a>
Regex Solution
So I was playing around and realized my issue. I adjusted my regex a bit. My main problem was at the beginning my .* was causing everything to match up until the last tag, and therefore it was really only matching once instead of twice. I made that .* lazy and it matched twice instead of once. That was the only issue. Once that regex was added to java, my loop code worked fine.
Thanks everyone that responded. While you may not have provided the answer, your comments got me thinking in the right direction!
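For anyone hitting the same thing, the greedy versus lazy difference can be shown with a small sketch. My actual regex is behind the link above, so the two patterns here are simplified stand-ins made up for illustration, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkMatch {
    // Hypothetical stand-ins for the real regex: identical except for the leading .*
    static final String GREEDY = ".*<a\\s[^>]*href=\"([^\"]*)\"";
    static final String LAZY = ".*?<a\\s[^>]*href=\"([^\"]*)\"";

    // Collects group 1 (the href value) of every match found in the line.
    static List<String> hrefs(String regex, String line) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://www.google.com\">G</a> "
                + "<a href=\"http://www.example.com/index.html\">E</a>";
        // Greedy .* swallows the first tag, so only the last link is reported.
        System.out.println(hrefs(GREEDY, line)); // prints [http://www.example.com/index.html]
        // Lazy .*? stops at the first <a it can use, so both links are found.
        System.out.println(hrefs(LAZY, line));   // prints [http://www.google.com, http://www.example.com/index.html]
    }
}
```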
You would have to look through all the matches you got per line and find which one looks like a url (like with some more regex ;))
