Regular expression- prefix and suffix without a specific string

I'd like your help with the following problem.
I'm trying to define a regular expression that represents a valid comment in Java.
For that I want a prefix "/*", then everything (including newlines and tabs) except another "*/", then a suffix "*/".
I tried this one: "/\*"[^"\*/"]"\*/" but it does not work. It takes /*fdfsd */ */ as one valid comment.
What should I do?

You can try with
yourString.matches("(?s)/[*]((?![*]/).)*[*]/")
This matches /* at the start and */ at the end. In between, a negative look-ahead checks before each character (represented by the dot) that it is not the * that begins a */. The (?s) flag lets the dot match line terminators as well, which you need because comments may span several lines. The look-ahead is evaluated at every character, so performance could be improved, but for now it does the trick.
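A minimal sketch of this approach, for illustration only. The class and method names are hypothetical, and the (?s) flag is included on the assumption that a comment may span several lines, as the question requires.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class BlockCommentCheck {
    // (?s) lets '.' match line terminators; the negative look-ahead (?![*]/)
    // stops the dot from consuming the '*' that starts a closing "*/".
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("(?s)/[*]((?![*]/).)*[*]/");

    static boolean isBlockComment(String text) {
        // matches() succeeds only if the whole input is one single comment
        return BLOCK_COMMENT.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBlockComment("/* one\n\ttwo */")); // true
        System.out.println(isBlockComment("/*fdfsd */ */"));    // false
    }
}
```

Because matches() must consume the entire input, the second call fails: the pattern cannot step past the first */ and there is still text left over.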

Related

Regex for adding a word to a specific line if line does not contain the word

I have a YAML file with multiple lines and I know there's one line that looks like this:
...
schemas: core,ext,plugin
...
Note that there is an unknown number of whitespace characters at the beginning of this line (because YAML). The line can be identified uniquely by the schemas: expression. The number of existing values for the schemas property is unknown, but greater than zero. And I do not know what these values are, except that one of them might be foo.
I would like to use a regex match-and-replace to append the word ,foo to this line if foo is not already contained in the list of values at any position. foo might appear on any other line but I want to ignore these instances. I don't want the other lines to be modified.
I've tried different regular expressions with lookarounds and capture groups, but none did the job. My latest attempt that looked promising at first was:
(?s)(?!.*foo)(.*schemas:.*)
But this does not match if foo is contained on any other line, which is not what I want.
Any assistance would be very much appreciated. Thanks.
(I use the Java regex engine, btw.)

Would this work?
^(?!.*foo)(\s*schemas:.*)$
If you want to make sure that values such as
food, fool, etc.
are not mistaken for foo (so the line still matches), you can use this:
^(?!.*(?:foo\s*$|foo,))(\s*schemas:.*)$
Replacement:
$1,foo
If I understood your question correctly, you want to make sure only one line is checked for the negative lookahead. This should accomplish that. I tested it on https://regex101.com/ using the Java 8 engine. You can also check what each operator does there.
Explanation:
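Wrapping the expression with
^$
makes sure that only one line is considered at a time. In Java this requires multiline mode (Pattern.MULTILINE or a leading (?m)) when the whole file is processed as a single string.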
The negative lookahead
(?!.*(?:foo\s*$|foo,))
looks for any "foo" followed by either optional whitespace and the end of the line, or a comma, within this line. If you want to make the expression faster you could probably turn the lookahead into a lookbehind, so that the simpler check for "schemas:" comes first. However, I don't know if this actually improves performance.
^(\s*schemas:.*)(?<!(?:foo\s?$|foo,))$
With lookbehinds you can't use the * quantifier, so this regex would still match (and append a duplicate) if foo is followed by more than one whitespace character. Note also that the lookbehind only inspects the text immediately before the end of the line, so a foo in the middle of the list (as in foo,bar) is not detected by this variant.
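A hedged sketch of how the lookahead version might be applied from Java code. The class and method names are made up for illustration, and Pattern.MULTILINE is assumed because the whole file is passed in as one string. It also assumes no blank line directly precedes the schemas line, since \s can match line breaks too.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class SchemasAppender {
    // MULTILINE makes ^ and $ match at the start and end of every line,
    // so the negative lookahead only inspects the line being tested.
    private static final Pattern SCHEMAS_WITHOUT_FOO = Pattern.compile(
            "^(?!.*(?:foo\\s*$|foo,))(\\s*schemas:.*)$", Pattern.MULTILINE);

    static String appendFoo(String yaml) {
        // $1 is the whole schemas line; ",foo" is appended to it
        return SCHEMAS_WITHOUT_FOO.matcher(yaml).replaceAll("$1,foo");
    }

    public static void main(String[] args) {
        String yaml = "name: foo\n  schemas: core,ext,plugin\nother: 1\n";
        System.out.print(appendFoo(yaml));
    }
}
```

The foo on the first line is ignored because that line does not contain schemas:, and a schemas line that already lists foo is left unchanged.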

Remove comments from a java file and maintain file structure

I am working on a project that requires me to remove comments from a java file. Currently, I am using the regular expression
(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)
which I got from http://ostermiller.org/findcomment.html.
The regular expression works well, but the problem is that I need to preserve the file structure when I remove the comments. In other words, if I have a 3 line block comment, I need it to be replaced with 3 blank lines. This is necessary so that the code remains on the same line numbers as the original.
How would I replace the 3 line block comment with 3 blank lines?
Edit:
I was able to solve my problem by making use of SableCC.

I haven't fully sussed out what that regex is doing, but if it matches the entire comment, then you can get the matched comment, check to see how many newlines it contains, and then replace the match with that many newlines instead of replacing it with the empty string.
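The idea above could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, the pattern is the one quoted in the question, and the sketch does not try to handle comment markers that appear inside string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class CommentStripper {
    // Pattern from the question: block comments or line comments
    private static final Pattern COMMENT = Pattern.compile(
            "(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)");

    static String strip(String source) {
        Matcher m = COMMENT.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep only the line terminators of the matched comment
            String lineBreaks = m.group().replaceAll("[^\\r\\n]", "");
            m.appendReplacement(out, Matcher.quoteReplacement(lineBreaks));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String src = "int a; /* one\ntwo\nthree */ int b; // tail\nint c;";
        System.out.println(strip(src));
    }
}
```

Since every line break inside a comment survives, the remaining code stays on its original line numbers.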

If you're set on regex you can try this
~/(?:/.*?$|\*[^*]*\*/)~
DEMO
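This makes use of one non-capturing group with two alternatives (the ~ characters are just pattern delimiters).
Since all comments (single-line and multi-line) have to start with a /, that's the first character of the regex. It is then followed by either another / or a *, which is where the alternation comes in. The first alternative, /.*?$, handles single-line comments (with $ matching at the end of each line in multiline mode), while the second, \*[^*]*\*/, matches multi-line comments, provided the comment body contains no other * character.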
If your multi-line comments are formatted with leading * followed by a <space>, like this:
/* mu
* lti
* line
* comment
*/
then this DEMO should do the trick (I don't think a line can start with a * in Java, unless it's in a comment).
Unfortunately, I have not found a suitable substitution to preserve line spacing if they are not formatted as above.

Regular expression, excluding .. in suffix of email addy [duplicate]

Possible Duplicate:
Using a regular expression to validate an email address
This is homework, I've been working on it for a while, I've done lots of reading and feel I have gotten pretty familiar with regex for a beginner.
I am trying to find a regular expression for validating/invalidating a list of emails. There are two addresses which are giving me problems, I can't get them both to validate the correct way at the same time. I've gone through a dozen different expressions that work for all the other emails on the list but I can't get those two at the same time.
First, the addresses.
me@example..com - invalid
someone.nothere@1.0.0.127 - valid
The part of my expression which validates the suffix
I originally started with
@.+\\.[[a-z]0-9]+
I also had a second pattern for checking some more invalid addresses, and checked each email against both patterns (one checked for validity, the other for invalidity), but my professor said he wanted it all in one expression.
@[[\\w]+\\.[\\w]+]+
or
@[\\w]+\\.[\\w]+
I've tried it written many, many different ways but I'm pretty sure I was just using different syntax to express these two expressions.
I know what I want it to do, I want it to match a character class of "character+"."character+"+
The plus sign being at least one. It works for the invalid class when I only allow the character class to repeat one time (and obviously the IP doesn't get matched), but when I allow the character class to repeat itself it matches the second period even though it isn't preceded by a character. I don't understand why.
I've even tried grouping everything with () and putting {1} after the escaped . and changing the \w to a-z and replacing + with {1,}; nothing seems to require the period to be surrounded by characters.

You need a negative look-ahead :
@\w+\.(?!\.)
See http://www.regular-expressions.info/lookaround.html
test in Perl :
Perl> $_ = 'someone.nothere@1.0.0.127'
someone.nothere@1.0.0.127
Perl> print "OK\n" if /\@\w+\.(?!\.)/
OK
1
Perl> $_ = 'me@example..com'
me@example..com
Perl> print "OK\n" if /\@\w+\.(?!\.)/
Perl>

@([\\w]+\\.)+[\\w]+
Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least one more word character.

I think you want this:
@[\\w]+(\\.[\\w]+)+
This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dailin's answer.)
The problem with what you were doing before was that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
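A small sketch of the "word followed by one or more .word groups" idea, using @ as the separator between the local part and the domain. The class and method names are hypothetical, and the trailing $ is an added assumption so that nothing may follow the last word. It only checks the suffix; it is not a complete email validator.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class EmailSuffixCheck {
    // '@', a word, then one or more ".word" groups up to the end of the input.
    // Every '.' must be followed by at least one word character, so ".." fails.
    private static final Pattern SUFFIX = Pattern.compile("@\\w+(\\.\\w+)+$");

    static boolean hasValidSuffix(String address) {
        // find() is enough here because only the suffix is being checked
        return SUFFIX.matcher(address).find();
    }

    public static void main(String[] args) {
        System.out.println(hasValidSuffix("someone.nothere@1.0.0.127")); // true
        System.out.println(hasValidSuffix("me@example..com"));           // false
    }
}
```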

The official standard, RFC 2822, describes the syntax that valid email addresses must follow. It can be implemented with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
A more practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

is it possible to use replaceAll() with wildcards

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.
What I'm looking to do is parse a string (which contains valid HTML, to a point); then, after I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;

You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want: it outputs This Extended, because .* is greedy and runs all the way to the last semicolon. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can have huge performance bottle-necks as a result.
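To make the difference concrete, here is a small sketch comparing the greedy, negated-class, and reluctant variants on the two-entity example. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of any library.
final class EntityStripper {
    // Greedy: .* runs to the LAST ';' in the input
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Negated class: stops at the first ';' after each '&'
    static String stripNegated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    // Reluctant: also stops at the first ';', but by expanding step by step
    static String stripReluctant(String s) {
        return s.replaceAll("&.*?;", "");
    }

    public static void main(String[] args) {
        String s = "This &escape;String &anotherescape;Extended";
        System.out.println(stripGreedy(s));    // This Extended
        System.out.println(stripNegated(s));   // This String Extended
        System.out.println(stripReluctant(s)); // This String Extended
    }
}
```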

The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why the error and how can I optimize the Regex for my requirement so as to avoid other peculiar cases.

You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
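As for why it appears to hang, it's the second * around the group: both quantifiers inside the group can match the empty string, so when the match fails at the = the engine tries an exponential number of ways to split the preceding characters between repetitions of the group (catastrophic backtracking). I'm not sure why you need that, since it sounds like you are asking for: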
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
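A minimal sketch of the look-behind approach applied to the replacement task from the question. The class and method names are hypothetical, and the sketch assumes each input is a single key=value line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class ValueReplacer {
    // The look-behind pins the match to the position right after '=';
    // there is no nested quantifier, so backtracking stays linear.
    private static final Pattern VALUE = Pattern.compile("(?<==)[\\s\\w\\-.]*$");

    static String replaceValue(String line, String newValue) {
        // quoteReplacement guards against '$' or '\' in the new value
        return VALUE.matcher(line).replaceFirst(Matcher.quoteReplacement(newValue));
    }

    public static void main(String[] args) {
        System.out.println(replaceValue("paginationInput.entriesPerPage=5", "10"));
    }
}
```

The input that made the original pattern spin is handled immediately here, because the engine can only attempt a match at the single position that follows the = sign.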

Try this:
=\\s*(.*)$
