java regexp for reluctant matching - java

I need to find an expression for the following problem:
String given = "{ \"questionID\" :\"4\", \"question\":\"What is your favourite hobby?\",\"answer\" :\"answer 4\"},{ \"questionID\" :\"5\", \"question\" :\"What was the name of the first company you worked at?\",\"answer\" :\"answer 5\"}";
What I want to get: "{ \"questionID\" :\"4\", \"question\":\"What is your favourite hobby?\",\"answer\" :\"*******\"},{ \"questionID\" :\"5\", \"question\" :\"What was the name of the first company you worked at?\",\"answer\" :\"******\"}";
What I am trying:
String regex = "(.*answer\"\\s:\"){1}(.*)(\"[\\s}]?)";
String rep = "$1*****$3";
System.out.println(given.replaceAll(regex, rep));
What I am getting:
"{ \"questionID\" :\"4\", \"question\":\"What is your favourite hobby?\",\"answer\" :\"answer 4\"},{ \"questionID\" :\"5\", \"question\" :\"What was the name of the first company you worked at?\",\"answer\" :\"******\"}";
Because of the greedy behaviour, the first group catches both "answer" parts, whereas I want it to stop after finding enough, perform replacement, and then keep looking further.

The pattern
("answer"\s*:\s*")(.*?)(")
Seems to do what you want. Here's the escaped version for Java:
(\"answer\"\\s*:\\s*\")(.*?)(\")
The key here is to use (.*?) to match the answer and not (.*). The latter matches as many characters as possible, the former will stop as soon as possible.
The above pattern won't work if there are escaped double quotes (\") in the answer. Here's a more complex version that will allow them:
("answer"\s*:\s*")((.*?)[^\\])?(")
You'll have to use $4 instead of $3 in the replacement pattern.
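As a minimal sketch of the first (reluctant) pattern in use, the class below wraps it in a helper and applies the question's replacement string. The class name AnswerMasker and the method mask are made up for illustration, and the input is shortened from the question's string.

```java
import java.util.regex.Pattern;

class AnswerMasker {
    // Group 1: the "answer" key up to the opening quote of its value.
    // Group 2: the value, matched reluctantly so it stops at the next quote.
    // Group 3: the closing quote.
    private static final Pattern ANSWER =
            Pattern.compile("(\"answer\"\\s*:\\s*\")(.*?)(\")");

    // Hypothetical helper: masks every answer value in the input.
    static String mask(String json) {
        return ANSWER.matcher(json).replaceAll("$1*****$3");
    }

    public static void main(String[] args) {
        String given = "{ \"questionID\" :\"4\",\"answer\" :\"answer 4\"},"
                + "{ \"questionID\" :\"5\",\"answer\" :\"answer 5\"}";
        System.out.println(mask(given));
    }
}
```

Because (.*?) gives up as soon as the closing quote can match, each "answer" value is replaced separately instead of the first group swallowing everything up to the last one.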

The following regex works for me:
regex = "(?<=answer\"\\s:\")(answer.*?)(?=\"})";
rep = "*****";
given.replaceAll(regex, rep);
The \ and " might be incorrectly escaped, since I tested this outside Java.
http://regexr.com?303mm

Related

search and replace string in java using pattern

Given the string
Content ID [9283745997] Content ID [9283005997] There can be text in between Content ID [9283745953] Content ID [9283741197] Content ID [928374500] There can be valid text here which should not be removed.
I want to remove every occurrence of Content ID followed by a bracketed number such as [9283745997]; any digits can appear between the square brackets. Eventually I want the result string to be
There can be text in between There can be valid text here which should not be removed.
Could anyone please provide a valid regex to capture this recurring text, given that the numerals within the square brackets differ each time?
I appreciate your help!
My solution to this was:
Pattern p = Pattern.compile("(Content ID \\[\\d*\\] )");
Matcher m = p.matcher(str);
StringBuffer sb = new StringBuffer();
while(m.find()) {
m.appendReplacement(sb, "");
}
m.appendTail(sb);
System.out.println(sb);
So basically you are trying to remove each occurrence of Content ID [one or more digits].
To do this you can use the replaceAll("regex", "replacement") method of the String class. As the replacement you can use the empty String "".
The only remaining question is which regex to use.
to match Content ID just write it literally as "Content ID "
to match [ or ] you have to add \ before each of them, because they are regex metacharacters and need to be escaped (in a Java string literal you write \ as "\\")
to represent one digit (a character in the range 0-9) regex uses \d (again, in Java you write \ as "\\", which gives "\\d")
to say "one or more of the previously described element" just add + after that element. For example, to match one or more letters a you can write a+.
Now you should be able to build the correct regex. If you have any questions, feel free to ask them in the comments.
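Putting those pieces together could look like the sketch below. The class name ContentIdStripper and the method strip are hypothetical, and the optional trailing space in the pattern is an addition for this sketch so that no double spaces are left behind.

```java
class ContentIdStripper {
    // Assembled from the pieces described above:
    //   "Content ID "   literal text
    //   \\[ and \\]     escaped square brackets
    //   \\d+            one or more digits
    //   " ?"            optional trailing space (an assumption of this sketch)
    private static final String CONTENT_ID = "Content ID \\[\\d+\\] ?";

    // Hypothetical helper: removes every "Content ID [digits]" marker.
    static String strip(String text) {
        return text.replaceAll(CONTENT_ID, "");
    }

    public static void main(String[] args) {
        String str = "Content ID [9283745997] Content ID [9283005997] "
                + "There can be text in between Content ID [9283745953] "
                + "There can be valid text here which should not be removed.";
        System.out.println(strip(str));
    }
}
```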
Try this one:
(Content ID \[[0-9]+\])
You can test it here: http://regexpal.com/
I would use the regex
Content ID \[\d+\] ?
Implement it like this:
str.replaceAll("Content ID \\[\\d+\\] ?", "");
You can find an explanation and demonstration here: http://regex101.com/r/qD5rJ6

Regex to find hostname from Jdbc url

I am new to regex. I would like to retrieve the Hostname from postgreSQL jdbc URL using regex.
Assume the postgreSQL url will be jdbc:postgresql://production:5432/dbname. I need to retrieve "production", which is the hostname. I want to try with regex and not with Java split function. I tried with
Pattern PortFinderPattern = Pattern.compile("[//](.*):*");
final Matcher match = PortFinderPattern.matcher(url);
if (match.find()) {
System.out.println(match.group(1));
}
But it's matching all the string from hostname till the end.
Pattern PortFinderPattern = Pattern.compile(".*://([^:]+).*");
regex without grouping :
"(?<=//)[^:]*"
[//]([\\w\\d\\-\\.]+):
Should be enough to find it reliably. Though this is probably a better regex:
The Hostname Regex
There are some errors in your regex:
[//] - This is only one character, because the [] marks a character class, so it will not fully match //. To match it, you need to write it like this: [/][/] or \/\/.
(.*) - This will match all characters to the end of line. You need to be more specific if you want to go till a certain character. For example you could go to the colon by fetching all characters, which are not colons, like this: ([^:]*).
:* - This matches zero or more colons, so the colon becomes optional. I guess you forgot to put a dot (any character) after the colon, like this: :.*.
So here is your regex corrected: \/\/([^:]*):.*.
Hope this helps.
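BTW, if the port number after production (:5432) is optional, then I suggest the following regex, which excludes both / and : from the host group so that the port is never captured along with the hostname:
\/\/([^/:]*)(?::\d+)?\/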
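A minimal Java sketch of the optional-port idea follows. It uses a host group that excludes both / and :, since a host group that only excludes / would also swallow ":5432". The class name JdbcHost and the method hostOf are made up for illustration, and in a Java string the slashes need no escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JdbcHost {
    // "//", then the host (anything except '/' and ':'), then an optional
    // ":port", then the '/' that introduces the database name.
    private static final Pattern HOST =
            Pattern.compile("//([^/:]*)(?::\\d+)?/");

    // Hypothetical helper: returns the hostname, or null if the URL does not fit.
    static String hostOf(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hostOf("jdbc:postgresql://production:5432/dbname"));
        System.out.println(hostOf("jdbc:postgresql://production/dbname"));
    }
}
```

Note that this pattern requires a / after the host (and optional port), so a URL that ends right after the hostname is not matched.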
To also capture the Oracle and MySQL JDBC URL variants with their quirks (e.g. Oracle allows @ instead of //, or even @//), I use this regexp to get the hostname: [/@]+([^:/@]+)([:/]+|$) The hostname is then in group 1.
Code e.g.
String jdbcURL = "jdbc:oracle:thin:@//hostname:1521/service.domain.local";
Pattern hostFinderPattern = Pattern.compile("[/@]+([^:/@]+)([:/]+|$)");
final Matcher match = hostFinderPattern.matcher(jdbcURL);
if (match.find()) {
    System.out.println(match.group(1));
}
This works for all these URLs (and other variants):
jdbc:oracle:thin:@//hostname:1521/service.domain.local
jdbc:oracle:thin:@hostname:1521/service.domain.local
jdbc:oracle:thin:@hostname/service.domain.local
jdbc:mysql://localhost:3306/sakila?profileSQL=true
jdbc:postgresql://production:5432/dbname
jdbc:postgresql://production/
jdbc:postgresql://production
This assumes that
The hostname is after // or @ or a combination thereof (single / would also work, but I don't think JDBC allows that).
After the hostname either : or / or the end of the string follows.
Note that the + quantifiers are greedy; this is especially important for the middle one.

Regex: strip all tags except those containing keyword "univ"

[introduction][position]Lead Researcher and Research Manager[/position] in the [affiliation]Web Search and Mining Group, Microsoft Research[/affiliation]</b>.
I am a [position]lead researcher[/position] at [affiliation]Microsoft Research[/affiliation]. I am also [position]adjunct professor[/position] of [affiliation]Peking University[/affiliation], [affiliation]Xian Jiaotong University[/affiliation] and [affiliation]Nankai University[/affiliation].
I joined [affiliation]Microsoft Research[/affiliation] in June 2001. Prior to that, I worked at the Research Laboratories of NEC Corporation.
I obtained a [bsdegree]B.S.[/bsdegree] in [bsmajor]Electrical Engineering[/bsmajor] from [bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate] and a [msdegree]M.S.[/msdegree] in [msmajor]Computer Science[/msmajor] from [msuniv]Kyoto University[/msuniv] in [msdate]1990[/msdate]. I earned my [phddegree]Ph.D.[/phddegree] in [phdmajor]Computer Science[/phdmajor] from the [phduniv]University of Tokyo[/phduniv] in [phddate]1998[/phddate].
I am interested in [interests]statistical learning[/interests], [interests]natural language processing[/interests], [interests]data mining, and information retrieval[/interests].[/introduction]
I'm able to strip all tags from the paragraph above with:
String stripped = html.replaceAll("\\[.*?\\]", "");
But I'd like to keep three pairs of tags in the paragraph, which are [bsuniv][/bsuniv], [msuniv][/msuniv] and [phduniv][/phduniv]. In other words, I don't want to strip the tags containing the keyword "univ". I can't find a convenient way to rewrite the regular expression. Can anyone help me?
You can use a negative lookahead assertion here:
str = str.replaceAll("\\[(.(?!univ))*?\\]", "");
or:
str = str.replaceAll("\\[((?!univ).)*?\\]", "");
Both of them will give you the desired output for this input. There is only one difference:
The first one matches a character and then applies the negative lookahead, so it only accepts a character that is not followed by univ. A tag that begins with univ, such as [univ], would therefore still be stripped.
The second one applies the negative lookahead at the position before each character, and only if univ does not start there does it go ahead and match a single character.
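The second variant can be tried on a shortened piece of the paragraph with a sketch like this one. The class name TagStripper and the method strip are hypothetical.

```java
class TagStripper {
    // \\[ ... \\] matches one bracketed tag. ((?!univ).)*? consumes the tag
    // body one character at a time and refuses any position where "univ"
    // starts, so tags containing "univ" cannot be matched at all.
    private static final String NON_UNIV_TAG = "\\[((?!univ).)*?\\]";

    // Hypothetical helper: strips every tag that does not contain "univ".
    static String strip(String text) {
        return text.replaceAll(NON_UNIV_TAG, "");
    }

    public static void main(String[] args) {
        String sample = "a [bsdegree]B.S.[/bsdegree] from "
                + "[bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate]";
        System.out.println(strip(sample));
    }
}
```

The match is case sensitive, so the capitalised "Univ" in "University" plays no role; only the lowercase keyword inside the brackets blocks a match.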

Why isn't this regexp backtrack working

I have tried to use the following kind of regex
([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))|(FakeEmail:)|(Email:)|(\1\2)|(\1\3)
(pretend that \1 is the email regex group, \2 is FakeEmail: and \3 is Email:, because I didn't count the parens to figure out the real grouping)
What I am trying to do is say "Find the word email: and if you find it, pick up any email address following the word."
I got that email regex from another question on Stack Overflow.
My test string could be something like
"This guy is spamming me from
FakeEmail: fakeemailAdress@someplace.com
but here is is real info:
Email: testemail@someplace.com"
Any tips? Thanks
I'm either quite confused as to what you're trying to do, or your Regex is just very wrong. In particular:
Why do you have Email: at the end, instead of the beginning - to match your example?
Why do you have both your Email: and your \1\2 separated by pipe characters, almost as if they're in fields? This is compiling the pattern as ORs. (Find the email pattern, OR the word "Email:", OR whatever \1\2 will end up meaning as it is out of context here.)
If all you're trying to do is match something like Email: testemail@someplace.com, you don't need any backreferences.
Something like this is probably all you need:
Email:\s+([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
Also, I'd strongly advise against trying to validate an email address so strictly. You may want to read http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx . I'd simplify the pattern to something more along the lines of:
Email:\s+(\S+)@(\S+\.\S+)
Try:
(Fake)?Email: *([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
With this, captured group \1 will be unmatched (null in Java) if it's a real email and contain "Fake" if it's a fake email, while \2 will be the email itself.
Do you actually want to capture it if it's FakeEmail though? If you want to capture all Email but ignore all FakeEmail then do:
\bEmail: *([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
The word boundary prevents the Email bit from matching "FakeEmail".
UPDATE: note that your regex only matches lowercase, since the character classes contain a-z but not A-Z. Make sure you compile the pattern with the case-insensitive flag, i.e.:
Pattern.compile("(Fake)?Email: .....", Pattern.CASE_INSENSITIVE)
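As a rough sketch of how the optional (Fake)? group behaves in Java, the class below labels each address it finds. It assumes addresses use the usual @ separator; the class name LabelledEmails and the method find are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelledEmails {
    // Group 1 is the optional "Fake" prefix, group 2 the address itself.
    private static final Pattern LABELLED = Pattern.compile(
            "(Fake)?Email: *([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+"
                    + "(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns "fake: <address>" or "real: <address>" per match.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LABELLED.matcher(text);
        while (m.find()) {
            // group(1) is null when the optional prefix did not take part in the match.
            String kind = (m.group(1) == null) ? "real" : "fake";
            found.add(kind + ": " + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "FakeEmail: fakeemailAdress@someplace.com\n"
                + "Email: testemail@someplace.com";
        find(text).forEach(System.out::println);
    }
}
```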
You can use the following code to match these kinds of email address:
String text = "This guy is spamming me from\n" +
    "FakeEmail: fakeemail+Adress@someplace.com\n" +
    "fakeEmail: \n" +
    "fakeemail@someplace.com\n" +
    "but here is is real info:\n" +
    "Email: test.email+info@someplace.com\n";
Matcher m = Pattern.compile("(?i)(?s)Email:\\s*([_a-z\\d\\+-]+(\\.[_a-z\\d\\+-]+)*@[a-z\\d-]+(\\.[a-z\\d-]+)*(\\.[a-z]{2,4}))").matcher(text);
while (m.find()) {
    System.out.printf("Email is [%s]%n", m.group(1));
}
This will match email text:
appearing on a different line from its label, because \s* also matches line breaks (the (?s) flag only affects . and is not strictly needed here)
ignoring case, by using (?i)
email addresses with a period . in them
email addresses with a plus sign + in them
Output of the code above:
Email is [fakeemail+Adress@someplace.com]
Email is [fakeemail@someplace.com]
Email is [test.email+info@someplace.com]

match a string of characters between tags:

I have the following strings:
<PAUL SAINT-KARL 1997-05-07>
<BOB DEAN 2001-05-07>
<GUY JEDDY 2007-05-07>
I want a java regex that would match this type of pattern "name and date" and then extract the name and date separately.
I am able to match them separately with the following Java regexes:
1) (\d{4}-\d{2}-\d{2})>
2) <([ A-Z&#;0-9-]*+)
What I'm looking for is one regex that would identify the full text pattern as provided, and then extract the subsections, such as the actual name, and the date.
I'm looking to use Matcher.group() to retrieve the complete match from the target string.
Thanks
Try this:
"<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>"
I changed the *+ to *? to make the * match lazily.
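For illustration, the combined pattern could be used with Matcher groups roughly as follows. The class name NameAndDate and the method parse are hypothetical; group 1 holds the name and group 2 the date.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameAndDate {
    // Group 1: the name, matched lazily so it stops before the final space.
    // Group 2: the date in yyyy-MM-dd form.
    private static final Pattern TAG =
            Pattern.compile("<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>");

    // Hypothetical helper: returns {name, date}, or null if the input does not match.
    static String[] parse(String tag) {
        Matcher m = TAG.matcher(tag);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = parse("<PAUL SAINT-KARL 1997-05-07>");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

The possessive *+ from the question cannot work here: the character class also covers spaces, digits and hyphens, so it consumes the date and never gives it back for the second group.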
