Regex: Formatting/Multiple Formats - java

I'm fairly new to regex, but not new to Java as a coding language. I'm currently trying to create a regex that will parse a user's input into two separate values, but I'm not sure how to approach it.
For example, suppose a user is guessing the final score of a basketball game. There's a handful of formats they could use:
57-89
57:89
57/89
etc.
I guess my question is first, how would I go about having my Regex expression handle multiple digits? That is, recognizing a valid guess regardless of how many digits they were to put in for each value. Second of all, how would I go about creating a Regex expression that would handle multiple formats, such as the ones listed above?
Thanks ahead of time.

If the input is in the following format:
<integer><non-integer delimiter><integer>
then this split method will parse it into a String[] with each integer as a separate element:
inputString.split("[^0-9]+");
[^0-9]+ is the regex for the delimiter:
[] character class;
^ exclude the following characters;
0-9 character range 0, 1, ..., 9;
+ one or more occurrences (that means it will work for multicharacter delimiter, e.g. 59 - 87).
More information on Java regexes is in the java.util.regex.Pattern documentation.
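As a minimal sketch of the split approach above, the two parts can be converted to ints once the input is known to contain exactly two numbers. The class and method names (ScoreParser, parseScore) are made up for illustration and are not part of the original answer.

```java
import java.util.Arrays;

class ScoreParser {
    // Splits on any run of non-digit characters and expects exactly two numbers.
    static int[] parseScore(String input) {
        String[] parts = input.trim().split("[^0-9]+");
        // A leading delimiter produces an empty first element, so reject that too.
        if (parts.length != 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("Expected two numbers, got: " + input);
        }
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        for (String guess : new String[] { "57-89", "57:89", "57/89", "59 - 87" }) {
            System.out.println(guess + " -> " + Arrays.toString(parseScore(guess)));
        }
    }
}
```

The length check matters because split silently accepts inputs such as a single number; without it, parts[1] would throw an ArrayIndexOutOfBoundsException.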

Related

Why is my String array length 3 instead of 2?

I'm trying to understand regex. I wanted to make a String[] using split to show me how many letters are in a given string expression?
import java.util.*;
import java.io.*;
public class Main {
    public static String simpleSymbols(String str) {
        String result = "";
        String[] alpha = str.split("[\\+\\w\\+]");
        int alphaLength = alpha.length;
        // System.out.print(alphaLength);
        String[] charCount = str.split("[a-z]");
        int charCountLength = charCount.length;
        System.out.println(charCountLength);
        return result;
    }
}
My input string is "+d+=3=+s+". I split the string to count the number of letters in string. The array length should be two but I'm getting three. Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
So, a few things pop out to me:
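First, your regex is close, but the square brackets turn it into a character class: [\+\w\+] matches any single character that is either a + or a word character, not the sequence +b+. Without the brackets, \+\w\+ matches a + followed by one word character followed by a +. If you're ever unsure how a regex will behave, you can use https://regexr.com/ to check it. Just put your regex at the top and enter your string at the bottom to see whether it matches correctly.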
Second, I see you're using the split function. While it is convenient for quickly splitting strings, you need to be careful about what you are splitting on. split discards whatever the regex matches, so splitting on [a-z] removes exactly the letters you were looking for, which makes them impossible to find in the result. If you print the array, you will see the following (for an input string of +d+=3=+s+):
+
+=3=+
+
This shows that you accidentally cut out what you were looking for in the first place. Now, there are several ways of fixing this, depending on what your criteria are.
If what you wanted was just to separate on all +s, and it doesn't matter whether you find only what is directly bounded by +s, then split works well. Because + is a regex metacharacter, it has to be escaped: str.split("\\+"). For +d+=3=+s+ this returns an empty first element (the input starts with a +), followed by:
d
=3=
s
However, you can see that this poses a few problems. First, it doesn't strip out the =3= that we don't want, and second, it does not truly give us values that are surrounded by a +_+ format, where the underscore represents the string/char you're looking for.
Seeing as you're using \w, you intend to find word characters that are surrounded by +s. However, if you're just looking for one letter, I would suggest using a class like [a-z] or [a-zA-Z] to be more specific, since \w also matches digits and the underscore. If you want to find multiple alphabetical characters, add a quantifier after the class: a * (0 or more) or a + (1 or more), depending on what exactly you're looking for.
I won't give you the answer outright, but I'll give you a clue as to what to move towards. Try using a pattern and a matcher to find the regex that you listed above and then if you find a match, make sure to store it somewhere :)
Also, for future reference, you should always start a function name with a lower case, at least in Java. Only constants and class names should start in a capital :)
I am trying to use split to count the number of letters in that string. The array length should be two, but I'm getting three.
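The regex passed to split is used as the delimiter, and the delimiters do not appear in the result. In your case str.split("[a-z]") uses the letters as delimiters to separate your input string, which leaves three substrings: "+", "+=3=+" and "+".
If you really want to count the number of letters using split, the pattern has to match everything that is not a letter. str.split("[^a-z]") splits at every single non-letter and so produces empty strings between adjacent delimiters; str.split("[^a-z]+") treats each run of non-letters as one delimiter, but it still returns a leading empty string when the input starts with a non-letter, as +d+=3=+s+ does. I would recommend using java.util.regex.Matcher.find() instead to find all the letters.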
Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
Similarly, check the functions in "java.util.regex.Matcher".
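A rough sketch of the Matcher.find() route, for both counting letters and finding letters wrapped in + signs. The class and method names are hypothetical, and the lookaround pattern is one possible choice rather than the only way to express +b+.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterFinder {
    // Any single ASCII letter.
    private static final Pattern LETTER = Pattern.compile("[a-zA-Z]");
    // A letter with a + immediately before and after it. The lookarounds do not
    // consume the + signs, so in "+a+b+" both a and b are found.
    private static final Pattern PLUS_WRAPPED = Pattern.compile("(?<=\\+)[a-zA-Z](?=\\+)");

    // Counts letters by advancing the matcher until no further match exists.
    static int countLetters(String s) {
        Matcher m = LETTER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Collects every letter that sits directly between two + signs.
    static List<String> plusWrappedLetters(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = PLUS_WRAPPED.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "+d+=3=+s+";
        System.out.println(countLetters(input));
        System.out.println(plusWrappedLetters(input));
    }
}
```

Unlike split, find() reports the matches themselves, so there are no empty leading or trailing elements to account for.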

Regex to match a fixed sub string in a String

I am trying to write a regular expression to verify the presence of a specific number in a fixed position in a String.
String: 109300300330066611111111100000000017000656052086116020170111Name 1
Number to find: 111111111 (Starting from position 17)
I have written the following regular expression:
^.{16}(?<Ones>111111111)(.*)
My understanding is:
Let first 16 characters be whatever they are
Use the Named Capturing Group to grab the specific word
Let the rest of the characters be whatever they are
I am new to regex, is there any issue with the above approach?
Can it be done in other/better way?
I am using Java 8.
Without more details of why you're doing what you're doing, there's just one possible improvement I can see. You repeated any character 16 times at the beginning of the string rather than writing out 16 .s, which is nice and readable, but then, it would be nice to do the same for the repeated 1s:
^.{16}(?<Ones>1{9})(.*)
Otherwise, the string of 1s is hard to understand without the coder manually counting how many there are in the regex.
If you want to hard-code the ones, you know the starting position, and you just want to know whether they are there, using a regex seems unnecessary. You can use this:
String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
// indexOf returns an int, so compare with ==; searching from index 16 skips any earlier occurrence
if (s.indexOf("111111111", 16) == 16) doSomething();
Another possible solution without regex:
// substring(16, 25) throws if s is shorter than 25 characters
if (s.substring(16, 25).equals("111111111")) doSomething();
Otherwise your regex looks good.
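For comparison, here is a small sketch that puts the named-group regex next to a non-regex check. The class and method names are invented for this example, and String.startsWith(prefix, offset) is swapped in as an alternative to the indexOf and substring variants because it simply returns false for inputs that are too short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedPositionCheck {
    // 16 arbitrary characters, then nine 1s captured in the named group "Ones".
    private static final Pattern ONES_AT_17 = Pattern.compile("^.{16}(?<Ones>1{9})(.*)");

    // Regex variant: the match succeeds only if the nine 1s start at index 16.
    static boolean matchesWithRegex(String s) {
        Matcher m = ONES_AT_17.matcher(s);
        return m.find() && "111111111".equals(m.group("Ones"));
    }

    // Non-regex variant: checks the same position without building a matcher.
    static boolean matchesWithStartsWith(String s) {
        return s.startsWith("111111111", 16);
    }

    public static void main(String[] args) {
        String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
        System.out.println(matchesWithRegex(s));
        System.out.println(matchesWithStartsWith(s));
    }
}
```

The regex version is mainly worth it when the captured group or the remainder of the line is needed afterwards; for a plain yes/no answer the string method is shorter.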

Regex expression for comma and dash separated text of items

I have a Java web application where I get some input from the user. Once I have this input I have to parse it, and the parsing depends on what kind of input I get. I decided to use Java's Pattern class for some predefined user inputs.
So I need the last 2 regex patterns:
a) Enumeration:
input can be - A03,B24.1,A25.7
The simple way would be to check whether there is a comma in there ([^,]+), but that would end up requiring a lot of updates to the parsing function, which I would like to avoid. So, in addition to the comma, it should check if it starts with
letter
minimum 3 letters (combined with numbers)
can have one dot in the word
minimum 1 comma (updated it)
b) Mixed
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part has, with the addition that it can contain at least one dash.
I've tried multiple online regex generators but didn't get it right. It would be much appreciated if you could help.
Here is what I have for a simple range such as B24.1-B35.5:
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit1: Valid and invalid inputs
for a) Enumeration
A03,B24.1,A25.7 - Valid
A03,B24.1 - Valid
A03,B24.1-B25.1 - Invalid, because in this case (enumeration) it should not contain a dash
A03 - Invalid, because there is no comma
for b) Mixed
everything that an enumeration has, with the addition that it can contain a dash too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or numbers [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a decimal point . and one or more letters and numbers, i.e. (?:\.[A-Za-z0-9]{1,})?
The same thing repeated, separated by a comma ,. There must also be at least one comma, hence the +, i.e. (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: to indicate non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to #diginoise and #anubhava for pointing out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
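To try the two patterns from this answer in Java, something like the following sketch could work. The class, constant, and method names are hypothetical; the segment pattern is the one given above, with {1,} written as the equivalent +. matches() is used so the whole input has to fit the pattern, which is what makes a dash invalid in the enumeration case.

```java
import java.util.regex.Pattern;

class CodeListValidator {
    // One item: a letter, at least two more letters/digits, optional ".xyz" part.
    private static final String SEGMENT = "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]+)?";

    // (a) Enumeration: items separated by commas, at least one comma.
    private static final Pattern ENUMERATION =
            Pattern.compile(SEGMENT + "(?:," + SEGMENT + ")+");

    // (b) Mixed: items separated by a comma or a dash, at least one separator.
    private static final Pattern MIXED =
            Pattern.compile(SEGMENT + "(?:[,-]" + SEGMENT + ")+");

    static boolean isEnumeration(String input) {
        return ENUMERATION.matcher(input).matches();
    }

    static boolean isMixed(String input) {
        return MIXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "A03,B24.1,A25.7", "A03,B24.1-B25.1", "A03" };
        for (String s : samples) {
            System.out.println(s + " enumeration=" + isEnumeration(s) + " mixed=" + isMixed(s));
        }
    }
}
```

Building both patterns from one SEGMENT string keeps the two rules in sync if the definition of an item changes.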
As I said in the comments, I would chop the input by commas and verify each segment separately. Your domain (ICD-10-CM codes) is very well defined, and I would be very wary of any input that could be invalid yet pass the validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least well-being) are at stake. Not to mention astronomical damages and the lawyers ever-chasing ambulances. Therefore, avoid the easy solution and implement the bomb proof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore, just get all possible codes. According to this gist there are 42784 different codes. Just load them into memory, and any code that is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to occupy less space, but verbatim all codes are under 300kB on disk, so a few MB at most in memory. That is not a massive cost to pay to keep people from having the left kidney removed instead of the right one.
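A minimal sketch of the set-lookup idea, assuming the full code list has already been loaded into a Set<String>. The class name CodeSetValidator and the tiny sample set are hypothetical stand-ins; a real application would fill the set from the complete list rather than from a few hard-coded entries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CodeSetValidator {
    private final Set<String> knownCodes;

    CodeSetValidator(Set<String> knownCodes) {
        this.knownCodes = new HashSet<>(knownCodes);
    }

    // Chops the input by commas and requires every segment to be a known code.
    // The -1 limit keeps trailing empty segments, so "A," or "," is rejected.
    boolean allValid(String input) {
        for (String segment : input.split(",", -1)) {
            if (!knownCodes.contains(segment.trim())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample entries taken from the example above; NOT the full code list.
        Set<String> sample = new HashSet<>(
                Arrays.asList("O09.7", "O09.70", "O09.71", "O09.72", "O09.73"));
        CodeSetValidator validator = new CodeSetValidator(sample);
        System.out.println(validator.allValid("O09.70,O09.71"));
        System.out.println(validator.allValid("O09.70,O09.1"));
    }
}
```

A HashSet lookup is constant time on average, so checking each segment against tens of thousands of codes costs very little.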

Replacing substrings in String

I am 16 and trying to learn Java. I have a paper that my uncle gave me that has things to do in Java. One of these things is to write and execute a program that will accept an extended message as a string, such as
Each time she saw the painting, she was happy
and replace the word she with the word he.
Each time he saw the painting, he was happy.
This part is simple, but he wants me to be able to take any form of she and replace it with he (she to he, She to He, she? to he?, she. to he., she' to he' and so on). Can someone help me make a program to accomplish this?
I have this
public static void main(String[] args) {
    Scanner keyboard = new Scanner(System.in);
    System.out.println("Write Sentence");
    String original = keyboard.nextLine();
    String changeWord = "he";
    String modified = original.replaceAll("she", changeWord);
    System.out.println(modified);
}
If this isn't the right site to find answers like this, can you redirect me to a site that answers such questions?
The best way to do this is with regular expressions (regex). Regex allow you to match patterns or classes of words so you can deal with general cases. Consider the cases you have already listed:
(she to he, She to He, she? to he?, she. to he., she' to he' and so on)
What is common between these cases? Can you think of some general rule(s) that would apply to all such transformations?
But also consider some cases you haven't listed: for example, as you've written it now, your code will change the word "ashes" to "ahes" because "ashes" contains "she". A properly written regex allows you to avoid this.
Before delving into regex, try and express, in plain English, a rule or set of rules for what you want to replace and what it should be replaced with.
Then, learn some regex and attempt to apply those rules.
Lastly, try and write some tests (i.e. using JUnit) for various cases so you can see which cases your code is working for and which cases it isn't working for.
Once you have done this, if something still doesn't work, feel free to post a new question here showing us your code and explaining what doesn't work. We'll be happy to help.
I would recommend this regular expression to solve this. It seems you have to search for and replace the uppercase She and the lowercase she separately:
String modified = original
.replaceAll("(she)(\\W)", "he$2")
.replaceAll("(She)(\\W)", "He$2");
Explanation :
The pattern (she) will match the word she and store it as the first captured group of characters
The pattern (\\W) will match one non-word character, that is, anything other than a letter, digit, or underscore (e.g. ', .), and store it as the second captured group of characters
Both of these patterns must match consecutive parts of the input string for replaceAll to replace something.
"he$2" puts the word he into the resulting string, followed by the second captured group of characters (in our case the group has only one character)
The above means that the regular expression will match a pattern like She'll and replace it with He'll, but it will not match a pattern like Sherlock, because here She is followed by an alphabetic character r. Because (\\W) has to match a character, a she at the very end of the input is left unchanged.
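As an alternative sketch, the word-boundary anchor \b can be swapped in for the (\\W) group. It matches a position rather than a character, so she at the end of the input and she inside ashes are both handled. The class and method names here are made up for illustration, and only the two capitalizations from the question are covered.

```java
class PronounReplacer {
    // Replaces the standalone words "she" and "She"; \b matches the position
    // between a word character and a non-word character (or the string edge),
    // so nothing besides the word itself is consumed or needs to be put back.
    static String replaceShe(String text) {
        return text
                .replaceAll("\\bshe\\b", "he")
                .replaceAll("\\bShe\\b", "He");
    }

    public static void main(String[] args) {
        System.out.println(replaceShe("Each time she saw the painting, she was happy"));
        System.out.println(replaceShe("She'll sweep the ashes, said Sherlock."));
    }
}
```

An all-caps SHE is left alone by this sketch; handling it would need a third replacement or a case-insensitive match with a custom replacement step.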

Regular Expression for IP validation which works in JFLAP

I noticed that regular expressions which we programmers use in our programs for tasks such as
email address validation
IP validation
...
are a bit different from those Regular Expressions which are used in Automata (if I'm not mistaken)
By the way I want to design an NFA and eventually a DFA for IP validation.
I have found a lot of regular expressions such as the following one:
\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
But I can not convert it to an NFA or DFA using JFLAP.
What should I do?
You don't need to directly convert the regex, you can rewrite it once you understand what it's trying to do.
A valid IPv4 address is 4 numbers separated by dots. Each number can be from 0 to 255. Regex doesn't do numeric ranges very well, so that's why it looks the way it does. For each number, the regex you posted accepts one of three shapes: 25 followed by a digit from 0 to 5 (250-255), 2 followed by a digit from 0 to 4 and any digit (200-249), or an optional 0 or 1 followed by one or two digits (0-199).
The easiest way to validate an IP address is to split it with the . as the delimiter, convert the strings to numbers, and check their range.
That said, there is nothing non-standard in the regex you posted. It's as simple as they come, I don't know why it doesn't work as-is for you.
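The split-and-check approach mentioned above could look roughly like this in Java. The class and method names are hypothetical, and this sketch accepts leading zeros (such as 01), just as the posted regex does.

```java
class Ipv4Check {
    // Splits on literal dots and checks that there are exactly four
    // all-digit parts, each in the range 0..255.
    static boolean isValidIpv4(String s) {
        // The -1 limit keeps trailing empty parts, so "1.2.3.4." is rejected.
        String[] parts = s.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            if (part.isEmpty() || part.length() > 3) {
                return false;
            }
            for (char c : part.toCharArray()) {
                if (c < '0' || c > '9') {
                    return false;
                }
            }
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String ip : new String[] { "192.168.0.1", "256.1.1.1", "1.2.3" }) {
            System.out.println(ip + " -> " + isValidIpv4(ip));
        }
    }
}
```

The digit check runs before Integer.parseInt, so parseInt never sees a sign or a non-digit and cannot throw here.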
