Regex to match numeric values within parentheses - java

I am trying to create a regex to match each of the numeric values within parentheses.
Here is a link to a regex101 which I have been using for testing.
For instance with a String:
At present there are no approved therapies to directly target NRAS activating mutations. However, N-Ras activation may predict sensitivity to inhibitors of the Raf/MEK/ERK, PI3K/Akt, and other downstream pathways (23274911, 22392911, 21993244). The MEK inhibitors trametinib and cobimetinib (in combination with vemurafenib) have been FDA-approved for BRAF V600E- and V600K-mutant melanoma, and are currently being studied in clinical trials in solid tumors and hematologic malignancies (22663011, 25265494). Several preclinical studies have suggested that combinations of MEK inhibitors with inhibitors of other downstream molecules, such as PI3K, eIF4A, and Plk1, result in synergistic growth inhibition in NRAS-mutant melanoma in vitro and in vivo (19492075).
I would like to match each of the numeric values inside the parentheses (shown in bold in the original post). I am currently using the following
Pattern citationPattern = Pattern.compile("(.?\\()(\\d+)");
Matcher match = citationPattern.matcher(treatmentApproach);
and am able to match the first value within each pair of parentheses. How can I extend it to cases where there is more than one value within the parentheses, for example (22663011, 25265494)? Thanks for your help!

I would just pull the matches from between the literal ( and ). Something like,
Pattern citationPattern = Pattern.compile("\\((.*?)\\)");
Matcher match = citationPattern.matcher(treatmentApproach);
while (match.find()) {
System.out.println(match.group());
}
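To get each number individually rather than the whole parenthesised chunk, one option is a two-step match: first find groups that contain only digits, commas, and whitespace, then pull every digit run out of each group. The following is a minimal sketch under that assumption; the class and method names (CitationExtractor, extract) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CitationExtractor {
    // A parenthesised group made up only of digits, commas, and whitespace,
    // so "(in combination with vemurafenib)" is skipped.
    private static final Pattern GROUP = Pattern.compile("\\(([\\d,\\s]+)\\)");
    // A single run of digits inside such a group.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static List<String> extract(String text) {
        List<String> ids = new ArrayList<>();
        Matcher group = GROUP.matcher(text);
        while (group.find()) {
            // Second pass: collect every number inside the current group.
            Matcher number = NUMBER.matcher(group.group(1));
            while (number.find()) {
                ids.add(number.group());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String text = "pathways (23274911, 22392911, 21993244). The MEK inhibitors "
                + "(in combination with vemurafenib) are studied (22663011, 25265494).";
        // prints [23274911, 22392911, 21993244, 22663011, 25265494]
        System.out.println(extract(text));
    }
}
```

Restricting the outer pattern to digits and separators is a design choice: it avoids picking up numbers from parenthesised prose, at the cost of ignoring groups that mix text and numbers.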

I would use a pattern like this: [\(|, ](\d+)[, \)]
See it in action here: https://regex101.com/r/hoggh0/1
This finds any number between [open brackets or ", "] and [close brackets or ", "]
The only way this would fall over is if you had numbers in the text that sit between commas but not between brackets, and you wanted to exclude them. This pattern would include them.

"(\([0-9]+([, ]*[0-9])*\))" might be what you need. At least it's working here: https://regex101.com/r/tkQP8z/5.

Related

Match custom pattern in regex multiple times

I am trying to parse a query which I need to modify to replace a specific property and its value with another property and different values. I am struggling to write a regex that will match the specific property and its value that I need.
Here are some examples to illustrate my point. test:property is the property name that we need to match.
Property with a single value: test:property:schema:Person
Property with multiple values (there is no limit on how many values there can be - this example uses 3): test:property:(schema:Person OR schema:Organization OR schema:Place)
Property with a single value in brackets: test:property:(schema:Person)
Property with another property in the query string (i.e. there are other parts of the string that I'm not interested in): test:property:schema:Person test:otherProperty:anotherValue
Also note that other combinations are possible such as other properties being before the property I need to capture, my property having multiple values with another property present in the query.
I want to match on the entire test:property section with each value captured within that match. Given the examples above these are the results I am looking for:
# | Match | Groups
1 | test:property:schema:Person | schema:Person
2 | test:property:(schema:Person OR schema:Organization OR schema:Place) | schema:Person, schema:Organization, schema:Place
3 | test:property:(schema:Person) | schema:Person
4 | test:property:schema:Person | schema:Person
Note: #1 and #4 produce the same output. I wanted to illustrate that the rest of the string should be ignored (I only need to change the test:property key and value).
The pattern of schema:Person is defined as \w+\:\w+, i.e. one or more word characters, followed by a colon, followed by one or more word characters.
If we define the known parts of the string with names I think I can express what I want to match.
schema:Person - <TypeName> - note that the first part, schema in this case, is not fixed and can be different
test:property - <MatchProperty>
<MatchProperty>: // property name (which is known and the same - in the examples this is `test:property`) followed by a colon
( // optional open bracket
<TypeName>
(OR <TypeName>)* // optional additional TypeNames separated by an OR
) // optional close bracket
Every example I've found has had simple alphanumeric characters in the repeating section but my repeating pattern contains the colon which seems to be tripping me up. The closest I've got is this:
(test\:property:(?:\(([\w+\:\w+]+ [OR [\w+\:\w+]+)\))|[\w+\:\w+]+)
Which works okayish when there are no other properties (although the match for example #2 contains the entire property and value as the first group result, and a second group with the property value) but goes crazy when other properties are included.
Also, putting that regex through https://regex101.com/ I know it's not right as the backslash characters in the square brackets are being matched exactly. I started to have a go with capturing and non-capturing groups but got as far as this before giving up!
(?:(\w+\:\w+))(?:(\sOR\s))*(?:(\w+\:\w+))*
This isn't a complete solution if you want pure regex because there are some limitations to regex and Java regex in particular, but the regexes I came up with seem to work.
If you're looking to match the entire sequence, the following regex will work.
test:property:(?:\((\w+:\w+)(?:\sOR\s(\w+:\w+))*\)|(\w+:\w+))
Unfortunately, the repeated capture groups will only capture the last match, so in queries with multiple values (like example 2), groups 1 and 2 will be the first and last values (schema:Person and schema:Place). In queries without parentheses, the value will be in group 3.
If you know the maximum number of values, you could just generate a massive regex that will have enough groups, but this might not be ideal depending on your application.
The other regex to find values in groups of arbitrary length uses regex's positive lookbehind to match valid values. You can then generate an array of matches.
(?<=test:property:(?:(?:\((?:\w+:\w+\sOR\s)+)|\(?))\w+:\w+
The issue with this method is that Java lookbehind appears to have some limitations, specifically not allowing unbounded or complex quantifiers. I'm not a Java person so I haven't tried things out for myself, but it seems like this wouldn't work either. If someone else has another solution, please post another answer!
With this in mind, I would probably suggest going with a combination regex + string parsing method. You can use regex to parse out the value or multiple values (separated by OR), then split the string to get your final values.
To match the entire part inside parentheses or the single value no parentheses, you can use this regex:
test:property:(?:\((\w+:\w+(?:\sOR\s\w+:\w+)*)\)|(\w+:\w+))
It's still split into two groups where one matches values with parentheses and the other matches values without (to avoid matching unpaired parentheses), but it should be usable.
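The regex plus string parsing approach could look roughly like this in Java. This is a sketch, not a tested solution for every query shape: the class and method names (PropertyValues, values) are invented here, and it assumes values are always separated by a single whitespace character, OR, and another whitespace character, as in the examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyValues {
    // Group 1: everything inside the parentheses; group 2: a bare single value.
    private static final Pattern PROPERTY = Pattern.compile(
            "test:property:(?:\\((\\w+:\\w+(?:\\sOR\\s\\w+:\\w+)*)\\)|(\\w+:\\w+))");

    static List<String> values(String query) {
        List<String> result = new ArrayList<>();
        Matcher m = PROPERTY.matcher(query);
        while (m.find()) {
            // Exactly one of the two groups participates in any match.
            String raw = m.group(1) != null ? m.group(1) : m.group(2);
            // Split the captured text on the OR separator to get each value.
            result.addAll(Arrays.asList(raw.split("\\sOR\\s")));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [schema:Person, schema:Organization, schema:Place]
        System.out.println(values(
                "test:property:(schema:Person OR schema:Organization OR schema:Place)"));
        // prints [schema:Person]
        System.out.println(values(
                "test:property:schema:Person test:otherProperty:anotherValue"));
    }
}
```

For the replacement itself, m.start() and m.end() give the span of the whole test:property section, so the same loop can be used to rebuild the query with a different property and values.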
If you want to play around with these regexes or learn more, here's a regexr: https://regexr.com/65kma

Replacing substrings in String

I am 16 and trying to learn Java. My uncle gave me a paper with things to do in Java. One of these things is to write and execute a program that will accept an extended message as a string, such as
Each time she saw the painting, she was happy
and replace the word she with the word he.
Each time he saw the painting, he was happy.
This part is simple, but he wants me to be able to take any form of she and replace it with he (she to he, She to He, she? to he?, she. to he., she' to he', and so on). Can someone help me make a program to accomplish this?
I have this
public static void main(String[] args) {
Scanner keyboard = new Scanner(System.in);
System.out.println("Write Sentence");
String original = keyboard.nextLine();
String changeWord = "he";
String modified = original.replaceAll("she", changeWord);
System.out.println(modified);
}
If this isn't the right site to find answers like this, can you redirect me to a site that answers such questions?
The best way to do this is with regular expressions (regex). Regex allow you to match patterns or classes of words so you can deal with general cases. Consider the cases you have already listed:
(she to he, She to He, she? to he?, she. to he., she' to he' and so on)
What is common between these cases? Can you think of some general rule(s) that would apply to all such transformations?
But also consider some cases you haven't listed: for example, as you've written it now, your code will change the word "ashes" to "ahes" because "ashes" contains "she". A properly written regular expression allows you to avoid this.
Before delving into regex, try and express, in plain English, a rule or set of rules for what you want to replace and what it should be replaced with.
Then, learn some regex and attempt to apply those rules.
Lastly, try and write some tests (i.e. using JUnit) for various cases so you can see which cases your code is working for and which cases it isn't working for.
Once you have done this, if something still doesn't work, feel free to post a new question here showing us your code and explaining what doesn't work. We'll be happy to help.
I would recommend this regular expression to solve this. It seems you have to search and replace separately the uppercase S and the lowercase s
String modified = original
.replaceAll("(she)(\\W)", "he$2")
.replaceAll("(She)(\\W)", "He$2");
Explanation :
The pattern (she) will match the word she and store it as the first captured group of characters
The pattern (\\W) will match one non-word character, i.e. anything other than a letter, digit, or underscore (e.g. ', .), and store it as the second captured group of characters
Both of these patterns must match consecutive parts of the input string for replaceAll to replace something.
"he$2" puts into the resulting string the word he followed by the second captured group of characters (in our case the group has only one character)
The above means that the regular expression will match a pattern like She'll and replace with He'll, but it will not match a pattern like Sherlock because here She is followed by an alphabetic character r
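One limitation of requiring a trailing \\W is that a she at the very end of the input, or a she inside a longer word such as "ashe,", is handled incorrectly. An alternative, sketched below, is to use the word boundary \\b on both sides and choose the replacement from the captured first letter. The class and method names (PronounSwap, replaceShe) are hypothetical, and all-caps "SHE" is deliberately left untouched in this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PronounSwap {
    // \b on both sides means "she" must stand alone as a word;
    // group 1 remembers whether the first letter was upper or lower case.
    private static final Pattern SHE = Pattern.compile("\\b([Ss])he\\b");

    static String replaceShe(String input) {
        Matcher m = SHE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Preserve the capitalisation of the original word.
            String replacement = m.group(1).equals("S") ? "He" : "he";
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: He said he'll sweep the ashes; was it he?
        System.out.println(replaceShe("She said she'll sweep the ashes; was it she?"));
    }
}
```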

Using regex in java to extract a string between two words in html syntax

I have a json feed that feeds html that is used to populate the calendar, I need to retrieve some of the information from it. For example title, time and location. I wanted to use regex to get content between
<span class=\"title\">
and
<\/span><br/><b>
and I am trying to use this code
for(int i = 0; i < json.length(); i++)
{
JSONObject object = new JSONObject(json.getJSONObject(i));
System.out.println(object.getNames(object));
Pattern p = Pattern.compile("(?i)(<span class=\"title\">)(.+?)(<\\/span>)");
Matcher m = p.matcher(json.get(0).toString());
m.find();
System.out.println(m.group(0));
But it doesn't seem to do the job... I have tried multiple iterations and tried researching examples online, but I am not sure if I am doing something wrong in the regex syntax. Help would be appreciated.
{"hoverContent":"<b>Title: <\/b><span class=\"title\">Accounting Awareness<\/span><br/><b>Time: <\/b><span class=\"time\">5:30 PM - 6:30 PM<br/><b>Location: <\/b><span class=\"location\">1185 Grainger Hall<\/span><br/><b>Description: <\/b><br/><span class=\"description\">Information from Kristen Fuhremann, Director of Professional Programs in Accounting and Q&A from a panel of current and former students who will share their experiences in the accounting program. Panel includes a grad of the IMAcc program currently in law school, a candidate for the IMAcc program who studied abroad, an accounting and finance double major, and an IMAcc student who is also a TA for AIS 100. Casual Attire is appropriate.<br />Contact: Natalie Dickson, <a href=\"mailto:ndickson#wisc.edu\">ndickson#wisc.edu<\/a><\/span><br/>","title":"Accounting Awareness","start":"2013-09-30 17:30:00","allDay":false,"itemId":"2356754a-8178-4afd-b4cf-7f5f5ce89868","end":"2013-09-30 18:30:00"}
null
m.group(0) always returns the entire string that matches the regex. It looks like you want to return a particular group, so you need to use m.group(1) to get the text that matches the first group, m.group(2) for the second group, and so on. In this regex:
"(?i)(<span class=\"title\">)(.+?)(<\\/span>)"
anything in parentheses, except for things that begin with (?, counts as a group, so the portion in (.+?) is the second capture group, and you can try retrieving it with m.group(2). In this case, there's no need to put the <span stuff in parentheses, so you could say
"(?i)<span class=\"title\">(.+?)<\\/span>"
and now use m.group(1) to get at the first (and only) capture group.
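Put together, the single-group version might be used as follows. This is only a sketch: it assumes the HTML has already been read out of the JSON as a plain string (so sequences such as \/ and \" are already decoded), and the names TitleExtractor and title are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // One capture group: the text between the opening title span and </span>.
    private static final Pattern TITLE =
            Pattern.compile("(?i)<span class=\"title\">(.+?)<\\/span>");

    // Returns the captured title, or null when the pattern does not match.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        // Check the result of find() before asking for a group.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<b>Title: </b><span class=\"title\">Accounting Awareness</span><br/>";
        // prints Accounting Awareness
        System.out.println(title(html));
    }
}
```

Calling m.group(...) without a successful find() throws an IllegalStateException, which is why the sketch tests the return value first.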
Using regexp to parse something is not really a good idea from design standpoint.
I would personally just wrap the content in a fake tag and parse it using XML parser. There will be overhead, but you don't use regexp to parse JSON, right? Why not do the same for XML?
Try this regex with DOTALL mode, also avoid redundant escaping:
Pattern p = Pattern.compile("(?si)<span class=\"title\">(.+?)</span>");

How to retrieve portion of number that's within parenthesis in Java?

For part of my Java assignment I'm required to select all records that have a certain area code. I have custom objects within an ArrayList, like ArrayList<Foo>.
Each object has a String phoneNumber variable. They are formatted like "(555) 555-5555"
My goal is to search through each custom object in the ArrayList<Foo> (call it listOfFoos) and place the objects with area code "616" in a temporaryListOfFoos ArrayList<Foo>.
I have looked into tokenizers, but was unable to get the syntax correct. I feel like what I need to do is similar to the post "Ignore parentheses with string tokenizer?", but since I'm only trying to retrieve the first 3 digits (and I don't care about the remaining 7), it didn't give me exactly what I was looking for.
What I did as a temporary work-around, was...
for (int i = 0; i<listOfFoos.size();i++){
if (listOfFoos.get(i).getPhoneNumber().contains("616")){
tempListOfFoos.add(listOfFoos.get(i));
}
}
This worked for our current dataset, however, if there was a 616 anywhere else in the phone numbers [like "(555) 616-5555"] it obviously wouldn't work properly.
If anyone could give me advice on how to retrieve only the first 3 digits, while ignoring the parentheses, I would greatly appreciate it.
You have two options:
Use value.startsWith("(616)") or,
Use regular expressions with the pattern ^\(616\).* (written as "^\\(616\\).*" in a Java string literal)
The first option will be a lot quicker.
areaCode = number.substring(number.indexOf('(') + 1, number.indexOf(')')).trim() should do the job for you, given the formatting of phone numbers you have.
Or if you don't have any extraneous spaces, just use areaCode = number.substring(1, 4).
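As a rough illustration of the substring approach applied to a list, the sketch below works on plain strings rather than the asker's Foo objects. The names AreaCodeFilter, areaCode, and withAreaCode are made up, and numbers without a well-formed "(...)" prefix are treated as having no area code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AreaCodeFilter {
    // Returns the text between the first '(' and the first ')', trimmed,
    // or an empty string when the parentheses are missing or out of order.
    static String areaCode(String number) {
        int open = number.indexOf('(');
        int close = number.indexOf(')');
        if (open < 0 || close < open) {
            return "";
        }
        return number.substring(open + 1, close).trim();
    }

    // Keeps only the numbers whose area code equals the requested one.
    static List<String> withAreaCode(List<String> numbers, String code) {
        List<String> matches = new ArrayList<>();
        for (String number : numbers) {
            if (areaCode(number).equals(code)) {
                matches.add(number);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> numbers = Arrays.asList("(616) 555-1234", "(555) 616-5555");
        // prints [(616) 555-1234]
        System.out.println(withAreaCode(numbers, "616"));
    }
}
```

With the asker's objects, the same comparison would be made on whatever getPhoneNumber() returns for each element.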
I think what you need is a capturing group. Have a look at the Groups and capturing section in this document.
Once you are done matching the input with a pattern (for example "\\((\\d+)\\) \\d+-\\d+"), you can get the number in the parentheses using a matcher (object of java.util.regex.Matcher) with matcher.group(1).
You could use a regular expression as shown below. The pattern will ensure the entire phone number conforms to your pattern ((XXX) XXX-XXXX) plus grabs the number within the parentheses.
int areaCodeToSearch = 555;
String pattern = String.format("\\((%d)\\) \\d{3}-\\d{4}", areaCodeToSearch);
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(phoneNumber);
if (m.matches()) {
String areaCode = m.group(1);
// ...
}
Whether you choose to use a regular expression versus a simple String lookup (as mentioned in other answers) will depend on how bothered you are about the format of the entire string.

Use RegExp to replace XML tags with whitespaces (in the length of the tags)

I need to strip all xml tags from an xml document, but keep the space the tags occupy, so that the textual content stays at the same offsets as in the xml. This needs to be done in Java, and I thought RegExp would be the way to go, but I have found no simple way to get the length of the tags that match my regular expression.
Basically what I want is this:
Pattern p = Pattern.compile("<[^>]+>[^<]*]+>");
Matcher m = p.matcher(stringWithXMLContent);
String strippedContent = m.replaceAll("THIS IS A STRING OF WHITESPACES IN THE LENGTH OF THE MATCHED TAG");
Hope somebody can help me to do this in a simple way!
Since < and > characters always surround starting and ending tags in XML, this may be simpler with a straightforward state machine. Simply loop over all characters (in some writeable form, not stored in a string), and if you encounter a <, flip on the "replacement mode" and start replacing all characters with spaces until you encounter a >. (Be sure to replace both the initial < and the closing >.)
If you care about layout, you may wish to avoid replacing tab characters and/or newline characters. If all you care about is overall string length, that obviously won't matter.
Edit: If you want to support comments, processing instructions and/or CDATA sections, you'll need to explicitly recognize these too; also, attribute values unfortunately can include > as well. All this means a full-fledged implementation will be more complex than you'd like.
A regular transducer would be perfect for this task; but unfortunately those aren't exactly commonly found in class libraries...
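A minimal sketch of that state machine in Java, under the simplifying assumptions stated above: every < opens a tag, the next > closes it, and comments, CDATA sections, and > inside attribute values are not handled. The names TagBlanker and blankTags are invented for the example.

```java
class TagBlanker {
    static String blankTags(String xml) {
        // Work on a char array so characters can be overwritten in place.
        char[] chars = xml.toCharArray();
        boolean inTag = false;
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c == '<') {
                inTag = true; // entering "replacement mode"
            }
            if (inTag) {
                // Keep line breaks and tabs so the layout is preserved.
                if (c != '\n' && c != '\r' && c != '\t') {
                    chars[i] = ' ';
                }
                if (c == '>') {
                    inTag = false; // the closing > is blanked as well
                }
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String xml = "<a>hi</a>";
        String blanked = blankTags(xml);
        // Text content keeps its original offset; total length is unchanged.
        System.out.println("[" + blanked + "] " + (blanked.length() == xml.length()));
    }
}
```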
Pattern p = Pattern.compile("<[^>]+>[^<]*]+>");
In the spirit of You Can't Parse XML With Regexp, you do know that's not an adequate pattern for arbitrary XML, right? (It's perfectly valid to have a > character in an attribute value, for example, not to mention other non-tag constructs.)
I have found no simple way to get the length of the tags that match my regular expression.
Instead of using replaceAll, repeatedly call find on the Matcher. You can then read start/end to get the indexes to replace, or use the appendReplacement method on a buffer, e.g.
StringBuffer b= new StringBuffer();
while (m.find()) {
String spaces= StringUtils.repeat(" ", m.end()-m.start());
m.appendReplacement(b, spaces);
}
m.appendTail(b);
stringWithXMLContent= b.toString();
(StringUtils comes from Apache Commons. For more background and library-free alternatives see this question.)
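For reference, here is a library-free variant of the same loop that builds the padding with java.util.Arrays.fill instead of StringUtils. It is a sketch: the simple pattern <[^>]+> stands in for whatever tag pattern is actually used, and the names TagPadder and padTags are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPadder {
    // Simplified stand-in pattern: anything from < to the next >.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String padTags(String xml) {
        Matcher m = TAG.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Build a run of spaces exactly as long as the matched tag.
            char[] pad = new char[m.end() - m.start()];
            Arrays.fill(pad, ' ');
            m.appendReplacement(out, new String(pad));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<p id=\"x\">text</p>";
        String padded = padTags(xml);
        // The text content stays at the same offset as in the original.
        System.out.println(padded.indexOf("text") == xml.indexOf("text"));
    }
}
```

Because the replacement consists only of spaces, there is no risk of appendReplacement interpreting $ or \ in it.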
Why not use an XML pull parser and simply echo everything that you want to keep as you encounter it (e.g. character content)? Whenever you reach a start or end tag, work out its length from the name of the element plus any attributes it has, and write the appropriate number of spaces.
The SAX API also has callbacks for ignorable whitespace, so you can also echo all whitespace that occurs in your document.
Maybe m.start() and m.end() can help.
m.start() => "The index of the first character matched"
m.end() => "The offset after the last character matched"
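m.end() - m.start() is the length of the matched tag, which is the number of spaces you need; (m.end() - m.start()) - 2 is the length without the two angle brackets.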
string.replaceAll("(</?[a-zA-Z]{1}>)*", "")
You can also try this. It matches <, then an optional /, then exactly one letter (lowercase or uppercase), then >, and the trailing * allows the whole pattern to repeat. Note that it only covers single-letter tags without attributes, and it deletes the tags rather than padding them with spaces.
:)
