Java Regular Expression not evaluating - java

I have a string that changes frequently, in the form of:
*** START OF THIS PROJECT FILENAME ***
Where FILENAME can be multiple words in different instances. I tried running the regex :
Pattern.matches("\\*\\*\\* START OF THIS PROJECT ", line);
where line is equal to one such string.
I also tried using a Matcher, where beginningOfFilePattern is compiled from the same regex pattern as above:
Matcher beginFileAccelerator;
beginFileAccelerator = beginningOfFilePattern.matcher(line);
if (beginFileAccelerator.find()) {
// Do something
}
I've exhaustively tried at least 30 different combinations of regex, and I simply can't find the solution. If anyone could lend me an eye, I would greatly appreciate it.

Pattern.matches tries to match the entire string against the pattern, because under the covers it uses Matcher#matches, which says:
Attempts to match the entire region against the pattern.
In your case, that will fail at the end, because the input doesn't end with "PROJECT ". It has more after that.
To allow anything at the end, add .*:
Pattern.matches("\\*\\*\\* START OF THIS PROJECT .*", line)
// Here -----------------------------------------^
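To make the difference concrete, here is a minimal sketch comparing the three Matcher operations on a line of this shape. The class name, method names and the sample line are made up for illustration; only the pattern comes from the question.

```java
import java.util.regex.Pattern;

class MatchVsFind {
    // The prefix from the question, with each '*' escaped.
    static final Pattern PREFIX = Pattern.compile("\\*\\*\\* START OF THIS PROJECT ");

    // Whole-string match: succeeds only if the pattern covers the entire line.
    static boolean matchesWhole(String line) {
        return PREFIX.matcher(line).matches();
    }

    // Prefix match: succeeds if the line starts with the pattern.
    static boolean startsWithPrefix(String line) {
        return PREFIX.matcher(line).lookingAt();
    }

    // Substring match: succeeds if the pattern occurs anywhere in the line.
    static boolean containsPrefix(String line) {
        return PREFIX.matcher(line).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample line; FILENAME here is "SOME FILE NAME".
        String line = "*** START OF THIS PROJECT SOME FILE NAME ***";
        System.out.println(matchesWhole(line));      // false
        System.out.println(startsWithPrefix(line));  // true
        System.out.println(containsPrefix(line));    // true
        // Same as the fix above: let ".*" consume the rest of the line.
        System.out.println(Pattern.matches("\\*\\*\\* START OF THIS PROJECT .*", line)); // true
    }
}
```

If you only care that the line begins with the marker, lookingAt() avoids having to append .* to the pattern.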


Regex code not collecting multiple lines of matching pattern

I'm new to using regex and I was hoping that someone could help me with this.
I have this regex code which is supposed to identify tab groups in a tablature file. It works on regex testing websites such as regexr.com, regextester.com, and extendsclass.com/regex-tester, but when I code it in java using the example text shown below, I am given each individual line as its own separate group, instead of 4 groups containing all the text which are separated only by one newline.
I have read through the Stack Overflow thread "Regular expression works on regex101.com, but not on prod" and have been careful to avoid string literal problems and multiline problems. I've also tried the regex with the other engines on regex101 and it worked there, but it still does not work in my Java code shown below.
I tried enabling the multiline flag but it still doesn't work. I thought it was a problem with my code, but then I got the same wrong output on other regex tester websites: myregexp.com and freeformatter.com/java-regex-tester
Here is the original regex. It is long, so it might be easier to use the simplified regex further down, as they both have the same problem I was talking about:
RealRegexCode = (^|[\n\r])(((?<=^|[\n\r])[^\S\n\r]*\|*[^\S\n\r]*((E|A|D|G|B|e|a|d|g|b)[^\S\n\r]*\|*(?=(([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))[|\r\n]|$)))((([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))\|)+(((?<=\|)[^\S\n\r]*((E|A|D|G|B|e|a|d|g|b)[^\S\n\r]*\|*(?=(([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))[|\r\n]|$)))((([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))\|)+)*(\n|\r|$))+
Here is a simplified regex code that displays the same problem, provided for the sake of debugging
SimplifiedRegexCode = (^|[\n\r])([^\n\r]+(\n|\r|$))+
here is the code that finds the matches using the regex pattern:
public static void main(String[] args){
String filePath = "C:\\Users\\stani\\IdeaProjects\\project\\src\\testing files\\guitar - a thousand matches by passenger.txt";
Path path = Path.of(filePath);
List<String> stuff = new ArrayList<>();
try {
String rootStr = Files.readString(path);
Pattern pattern = Pattern.compile("(^|[\\n\\r])([^\\n\\r]+(\\n|\\r|$))+");
Matcher ptrnMatcher = pattern.matcher(rootStr);
while (ptrnMatcher.find()) {
stuff.add(ptrnMatcher.group());
}
}catch (Exception e) {
e.printStackTrace();
}
System.out.println(new Patterns().MeasureGroupCollection);
for (String s:stuff)
System.out.println(s);
}
And here is the text I was testing it with. It might help to copy and paste this in a text editor as stack overflow might distort how the text looks:
e|---------------------------------|------------------------------------|
e|------------------------------------------------------------------|
B|-----1--------(1)----1-----------|-------1---------------1----------1-|
B|-----1--------(1)----0---------0-----1---------1-----3--------(3)-|
G|-----------0------------0--------|-------------0----------------0-----|
G|-----------0---------------0---------------0---------------0------|
D|-----0h2-----2-------2-----------|-------2-------2-------0--------0---|
D|-----2-------2-------2-------2-------2-------2-------0-------0----|
A|-3-------3-------3-------3-------|------------------------------------|
A|-0-------0--------------------------------------------------------|
E|-----------------------------0---|---1-------1-------3-------3--------|
E|-----------------0-------0--------1------1-------3-------3--------|
e|-------------------------------------------------------------------|
B|-----1---------1-----1---------1-----3---------3-------1---------1-|
G|-----------0---------------0---------------0-----------------0-----|
D|-----3-------2-------2-------2-------0-------0---------2-------2---|
A|-----------------3-------3-------------------------3-------3-------|
E|-1-------1-----------------------3-------3-------------------------|
It should identify four different groups from the text. However, in java and in the two testers I mentioned above, it recognizes each line as its own different group (i.e 12 groups)
I couldn't help but respond to this as I am familiar with both regex and guitar haha.
For your short regex, please see the following regex on regex101.com:
https://regex101.com/r/NqGhoh/1/
The multiline modifier is required.
The main problem with this is that you are handling newlines on the front and back of the expression. I have modified the expression in a couple ways:
Made the regex match newlines only on the end, always looking for a ^ at the beginning.
Matching the carriage return new line combination as \r?\n as a carriage return should always be followed by a newline when it is used.
Used non-capturing groups to reduce overhead and complexity when looking at matches. This is the ?: just inside the parentheses. It means the group won't be captured in the result, just used for grouping.
I started testing your longer regex and may update that as well, though it sounds like you already know what to do with the shorter one corrected.
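The regex101 link is not reproduced here, so the following is only my reconstruction of the three changes listed above (anchor on ^, match \r?\n at the end of each line, use non-capturing groups), not necessarily the exact expression from the link. The class name, method name and the tiny sample text are assumptions. The sample uses \r\n line endings, which is one way the original (\n|\r|$) alternative can end a match after a single line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TabGroups {
    // One or more consecutive non-empty lines; MULTILINE lets ^ and $ match at line boundaries.
    static final Pattern GROUP =
            Pattern.compile("(?:^[^\\r\\n]+(?:\\r?\\n|$))+", Pattern.MULTILINE);

    // Returns each run of consecutive non-empty lines as one string.
    static List<String> findGroups(String text) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(text);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical input: two blocks separated by an empty line, Windows line endings.
        String text = "e|--1--|\r\nB|--0--|\r\n\r\ne|--3--|\r\nB|--2--|\r\n";
        List<String> groups = findGroups(text);
        System.out.println(groups.size()); // 2
        for (String g : groups) {
            System.out.println("[" + g.trim() + "]");
        }
    }
}
```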

Java Regexp for matching all the content between "<" and ">" in a paragraph [duplicate]

I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means "everything except whitespace", and this is exactly what you want.
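Since this page is mostly about Java, here is a small Java sketch of the three variants side by side. It assumes the log is a single line with the records separated by spaces (the greedy behaviour described in the question implies there are no line breaks in between); the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns capture group 1 of the first match, or null if the regex does not match.
    static String extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumption: the whole log is one line, records separated by single spaces.
        String log = "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM "
                + "J0000010: Project name: E:\\foo.pf "
                + "J0000011: Job name: MBiek Direct Mail Test "
                + "J0000020: Document 1 - Completed successfully";

        // Greedy: runs on to the last "J#######:" marker.
        System.out.println(extract("Project name:\\s+(.*)\\s+J[0-9]{7}:", log));
        // prints: E:\foo.pf J0000011: Job name: MBiek Direct Mail Test

        // Lazy: stops at the first marker.
        System.out.println(extract("Project name:\\s+(.*?)\\s+J[0-9]{7}:", log));
        // prints: E:\foo.pf

        // Negated class: cannot cross whitespace at all.
        System.out.println(extract("Project name:\\s+(\\S*)\\s+J[0-9]{7}:", log));
        // prints: E:\foo.pf
    }
}
```

Note that the \S* variant only works as long as the project path contains no spaces; the lazy version does not have that restriction.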
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, before each additional character it consumes with the ".", the regex engine first tries to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes it easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Replacing substrings in String

I am 16 and trying to learn Java. I have a paper that my uncle gave me that has things to do in Java. One of these things is to write and execute a program that will accept an extended message as a string such as
Each time she saw the painting, she was happy
and replace the word she with the word he.
Each time he saw the painting, he was happy.
This part is simple, but he wants me to be able to take any form of she and replace it with he (she to he, She to He, she? to he?, she. to he., she' to he' and so on). Can someone help me make a program to accomplish this?
I have this
public static void main(String[] args) {
Scanner keyboard = new Scanner(System.in);
System.out.println("Write Sentence");
String original = keyboard.nextLine();
String changeWord = "he";
String modified = original.replaceAll("she", changeWord);
System.out.println(modified);
}
If this isn't the right site to find answers like this, can you redirect me to a site that answers such questions?
The best way to do this is with regular expressions (regex). Regex allow you to match patterns or classes of words so you can deal with general cases. Consider the cases you have already listed:
(she to he, She to He, she? to he?, she. to he., she' to he' and so on)
What is common between these cases? Can you think of some general rule(s) that would apply to all such transformations?
But also consider some cases you haven't listed: for example, as you've written it now, your code will change the word "ashes" to "ahes" because "ashes" contains "she". A properly written regular expression allows you to avoid this.
Before delving into regex, try and express, in plain English, a rule or set of rules for what you want to replace and what it should be replaced with.
Then, learn some regex and attempt to apply those rules.
Lastly, try and write some tests (i.e. using JUnit) for various cases so you can see which cases your code is working for and which cases it isn't working for.
Once you have done this, if something still doesn't work, feel free to post a new question here showing us your code and explaining what doesn't work. We'll be happy to help.
I would recommend this regular expression to solve this. It seems you have to search and replace separately the uppercase S and the lowercase s
String modified = original
.replaceAll("(she)(\\W)", "he$2")
.replaceAll("(She)(\\W)", "He$2");
Explanation :
The pattern (she) will match the word she and store it as the first captured group of characters
The pattern (\\W) will match one non-word character, i.e. anything other than a letter, digit or underscore (e.g. ', .), and store it as the second captured group of characters
Both of these patterns must match consecutive parts of the input string for replaceAll to replace something.
"he$2" put in the resulting string the word he followed by the second captured group of characters (in our case the group has only one character)
The above means that the regular expression will match a pattern like She'll and replace with He'll, but it will not match a pattern like Sherlock because here She is followed by an alphabetic character r
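One thing to watch with the (\\W) approach: it needs a character after the word, so a "she" at the very end of the input is left alone, and it does not check what comes before the word. As an alternative sketch (not the code from the answer above), the word-boundary anchor \b handles both ends without consuming anything. The class name, method name and sample sentence are made up for illustration.

```java
class PronounSwap {
    // Replaces the standalone words "she"/"She" with "he"/"He".
    // \b matches between a word character and a non-word character (or the string edge),
    // so "ashes" and "Sherlock" are left untouched.
    static String replaceShe(String original) {
        return original
                .replaceAll("\\bshe\\b", "he")
                .replaceAll("\\bShe\\b", "He");
    }

    public static void main(String[] args) {
        String input = "She said she'd sweep the ashes, didn't she";
        System.out.println(replaceShe(input));
        // prints: He said he'd sweep the ashes, didn't he
    }
}
```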

Regular Expressions to match an <a> tag

I am writing a small java program for a class, and I can't quite figure out why my regex isn't working properly. In the special case of having 2 tags on the same line that is read in, it only matches the second one.
Here is a link that has the regex included, along with a simple set of test data:
Regex Test Link.
In my java program I have the following code:
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String[] results;
System.out.println(p.toString());
Matcher m = null;
while((line = input.readLine()) != null) {
m = p.matcher(line);
while(m.find()) {
System.out.println("Matches: " + m.group(1));
}
}
The goal is to extract the href value, as long as it starts with http://, the website ends in either no page (like http://www.google.com) or ends in index.htm or index.html (like http://www.google.com/index.html).
My regex works for every case of the above, but doesn't match in the special case of 2 tags that are on the same line.
Any help is appreciated.
Just use a proper HTML parsing library, such as HTML cleaner. It is theoretically impossible to properly parse HTML with a regex - there are so many constructs that will confound it. For example:
<![CDATA[ > bar ]]>
This is not a link. This is literal text in XHTML.
baz
This is only one link.
<a rel="next" href="bar?2">Next</a>
This is a realistic example of a link with a relation attribute and a relative URI.
<a name="foo">The href="http://example.com" part is the link destination...</a>
This is a named anchor, not a link. However your regex would parse out the literal text here as a link.
Foo
Does your regex handle line-spanning links properly?
There are all kinds of other Fun edge cases that can occur. Save yourself time and headaches. These problems have already been solved and wrapped up in nice neat libraries for you to use. Take advantage of this.
Regexes may be a powerful tool, but as they say - when all you have is a hammer, everything looks like a nail. You are currently trying to hammer in a screw.
This worked for me in that regex tester page
<a[^>]*>[^<]*</a>
Regex Solution
So I was playing around and realized my issue. I adjusted my regex a bit. My main problem was at the beginning my .* was causing everything to match up until the last tag, and therefore it was really only matching once instead of twice. I made that .* lazy and it matched twice instead of once. That was the only issue. Once that regex was added to java, my loop code worked fine.
Thanks everyone that responded. While you may not have provided the answer, your comments got me thinking in the right direction!
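For anyone hitting the same thing, here is a stripped-down sketch of the effect. The actual regex from the test link is not shown in this thread, so both patterns below are simplified stand-ins I made up, as are the class name, method name and the sample line. The only difference between the two patterns is .* versus .*? at the start.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefs {
    // Collects capture group 1 of every match of the given regex in the line.
    static List<String> hrefs(String regex, String line) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical line with two links on it.
        String line = "<a href=\"http://www.google.com\">G</a> "
                + "<a href=\"http://www.example.com/index.html\">E</a>";

        // Greedy prefix: ".*" swallows the first tag, so only the last link is reported.
        System.out.println(hrefs(".*<a\\s+href=\"(http://[^\"]+)\"", line));
        // prints: [http://www.example.com/index.html]

        // Lazy prefix: each find() stops at the next tag, so both links are reported.
        System.out.println(hrefs(".*?<a\\s+href=\"(http://[^\"]+)\"", line));
        // prints: [http://www.google.com, http://www.example.com/index.html]
    }
}
```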
You would have to look through all the matches you got per line and find which one looks like a url (like with some more regex ;))

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why the error and how can I optimize the Regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
As for why it is crashing, it's the second * around the group. I'm not sure why you need that, since that sounds like you are asking for :
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$
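To tie this back to the stated requirement (replace whatever is on the right-hand side of =), here is a minimal sketch using the look-behind pattern from above. The class name, method name and the replacement value are made up for illustration. As far as I can tell, the slowdown in the original comes from the nested quantifiers: a run of word characters can be split between the inner * and the outer * in exponentially many ways, and the engine tries them all whenever the overall match fails at a starting position, which is why the longer key takes so much longer than the shorter ones. The pattern below has a single quantifier, so that blow-up cannot occur.

```java
import java.util.regex.Matcher;

class ReplaceValue {
    // Replaces everything after the last '=' (spaces, word characters, '-' and '.') with newValue.
    static String replaceValue(String input, String newValue) {
        // quoteReplacement keeps '$' and '\' in newValue from being read as group references.
        return input.replaceAll("(?<==)[\\s\\w\\-.]*$", Matcher.quoteReplacement(newValue));
    }

    public static void main(String[] args) {
        System.out.println(replaceValue("paginationInput.entriesPerPage=5", "10"));
        // prints: paginationInput.entriesPerPage=10
        System.out.println(replaceValue("paginationInput=5", "10"));
        // prints: paginationInput=10
    }
}
```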
