Java Regular expression Set Minimum characters - java

I'm new to using regex in Java and I'm having trouble getting my regular expression to work.
I want to keep only words of at least 3 characters in a string; any word of 1 or 2 characters should be deleted.
Here's my string:
It might be more sensible for real users if I also included a lower limit on the number of letters.
The output I want:
might more sensible for real users also included lower limit the number letters.
So I did some googling, but it still doesn't work.
Here's the complete code (1-5 are the regexes I've tried):
String input = "It might be more sensible for real users if I also included a lower limit on the number of letters.";
//1. /^[a-zA-Z]{3,}$/
//2. /^[a-zA-Z]{3,30}$/
//3. \\b[a-zA-Z]{4,30}\\b
//4. ^\\W*(?:\\w+\\b\\W*){3,30}$
//5. [+]?(?:[a-zA-Z]\\s*){3,30}
String output = input.replaceAll("/^[a-zA-Z]{3,}$/", "");
System.out.println(output);

You can try this:
package com.stackoverflow.answer;
public class RegexTest {
public static void main(String[] args) {
String input = "It might be more sensible for real users if I also included a lower limit on the number of letters.";
System.out.println("BEFORE: " + input);
input = input.replaceAll("\\b[\\w']{1,2}\\b", "").replaceAll("\\s{2,}", " ").trim(); // trim() drops the space left where the leading "It" was removed
System.out.println("AFTER: " + input);
}
}

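You can use \\w{1,2} to match any run of 1-2 word characters. You then need to make sure they are not adjacent to other word characters before removing them, so you check for non-word characters (\\W) or the beginning or end of the line (^ and $), like so:
String output = input.replaceAll("(^|\\W)\\w{1,2}($|\\W)", " ");
Note that the replacement is a single space, which stands in for the up to two separators consumed by the match. Because each match consumes the separators around it, two short words in a row (such as "if I") share one separator, so the second one is left in place; a word-boundary pattern (\\b) avoids this.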
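A minimal sketch of the same idea using word boundaries (\\b), which do not consume the neighbouring separators, so consecutive short words are all removed. The class name ShortWordFilter and the method dropShortWords are hypothetical names chosen for illustration:

```java
class ShortWordFilter {
    // Hypothetical helper: drops every word made of 1-2 word characters.
    static String dropShortWords(String input) {
        return input
                .replaceAll("\\b\\w{1,2}\\b", "") // \b is zero-width, so the spaces stay
                .replaceAll("\\s{2,}", " ")       // collapse the gaps left behind
                .trim();                          // drop a gap at either end
    }

    public static void main(String[] args) {
        String input = "It might be more sensible for real users if I also included "
                + "a lower limit on the number of letters.";
        System.out.println(dropShortWords(input));
    }
}
```

For the question's input this should print the desired sentence without a leading space.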


How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers with the same issue. I probably haven't found anything that solves my issue only because I can't phrase it properly. I'll update the title, or close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the gap in between is some \t characters first, then some spaces).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and spaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please see the fiddle link for a complete, runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on the remaining whitespace (which should now occur only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
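As a rough, runnable sketch of the trim-and-split approach applied to two of the sample inputs from the question (the class name VersionSplit and the method parts are hypothetical):

```java
class VersionSplit {
    // Hypothetical helper: returns {version, buildId} for one input line.
    static String[] parts(String data) {
        // trim() removes leading/trailing whitespace; \s+ covers tabs, spaces and newlines
        return data.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] samples = {
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
            "v3.1.build.dev.12345.team \n12345-12345-12345-12345 "
        };
        for (String sample : samples) {
            String[] arr = parts(sample);
            System.out.println("Version: \"" + arr[0] + "\", build id: \"" + arr[1] + "\"");
        }
    }
}
```

This assumes the input always contains exactly two whitespace-separated tokens; a length check on the array would be needed for anything less regular.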
(^v\S+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. The first captures the first section (v3.1.build.dev.12345.team), the second captures the last section (12345-12345-cici-12345).
It breaks down like this: (^v\S+) ensures that the string starts with a v, then captures every non-whitespace character that follows (stopping at the first space, tab, or newline). \s+ matches whitespace (spaces, tabs, newlines) as many times as it can. (\d+-\d+-\w+-\d+) reads the build id, ensuring that it conforms to your specified format. Note that this still captures the dashes, so you can split the string afterwards to get the individual parts. If you want, you could even make each part its own capture group, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
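A possible sketch of the "own capture groups" variant mentioned above, assuming the version is a run of non-whitespace characters starting with v. The class name BuildIdParts and the method parse are hypothetical, and matches() is used so that the whole input has to fit the format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuildIdParts {
    // Group 1 is the version, groups 2-5 are the four dash-separated build-id parts.
    private static final Pattern LINE =
            Pattern.compile("(v\\S+)\\s+(\\d+)-(\\d+)-(\\w+)-(\\d+)\\s*");

    // Hypothetical helper: returns the five captured parts, or null if the input does not fit.
    static String[] parse(String input) {
        Matcher m = LINE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
    }

    public static void main(String[] args) {
        String[] parts = parse("v3.1.build.dev.12345.team \n12345-12345-cici-12345 ");
        System.out.println(String.join(" | ", parts));
    }
}
```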
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] means match anything except what [\\s] matches, that is, any non-whitespace character. .+ worked well in your case, but all it really says is match one or more of any character (except line terminators), including whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the order from left to right represents group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can contain both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
Without that, your regex is saying: match a single whitespace character followed by one or more characters other than a line terminator. In samples 3 and 4, the run of trailing spaces before the \n satisfies that on its own (a space followed by more spaces), so you get a match made only of whitespace, which becomes "" once you strip the whitespace from it.
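To make the difference concrete, here is a small sketch that counts how many times each pattern is found. It assumes the sample has several spaces before the newline, as the question describes; the class name MatchCount and the method countMatches are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCount {
    // Hypothetical helper: counts successive find() hits of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample: three spaces before the newline and three at the end.
        String sample = "v3.1.build.dev.12345.team   \n12345-12345-cici-12345   ";
        // One whitespace, then anything up to the line end: the spaces before \n match on their own.
        System.out.println(countMatches("[\\s].+", sample));
        // A whole run of whitespace (including \n), then the build id: a single match.
        System.out.println(countMatches("[\\s]+.+", sample));
    }
}
```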
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture groups capture the version and build id respectively. Here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the Matcher class interprets the dot: by default, . DOES NOT match line terminators, so it stops matching just before the \n. To change that, you need to pass the flag Pattern.DOTALL via Pattern.compile(String regex, int flags).
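A tiny sketch of that default behaviour and of the effect of Pattern.DOTALL (the class name DotAllDemo and the method dotMatchesNewline are hypothetical, and "a.b" is just a toy pattern):

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical helper: does "a.b" match the three-character string a, newline, b?
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(0));              // default flags
        System.out.println(dotMatchesNewline(Pattern.DOTALL)); // dot also matches \n
    }
}
```

With default flags the match fails because the dot refuses the newline; with DOTALL it succeeds.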
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

What is the most effective way to find a match between the relevant parts of user input and an array?

In my application the user will be presented with a list. They will then input which item in the list they are interested in.
I have an array of strings that should match what the user could possibly write, which (abbreviated) looks something like this: ["first", "top", "second", "third", ... ,"bottom"].
How can I match this against the relevant part of the user input as efficiently as possible? The case where the user writes something that matches my array exactly is easy, but that is definitely not guaranteed in this application. That is, how can I efficiently match "mine is the first one" or "its the one at the bottom" against my array?
Instead of looping through all your words and check contains for each of them every time a user inputs something, I would suggest only looping once to create a regex to match, and use that regex-String every time a user inputs something.
So your array {"first", "top", "second", "third", "bottom"} would become the following regex-String:
"^.*(first|top|second|third|bottom).*$"
This means:
^ # Start looking at the start of the String
.* # Any amount of random characters (can be none)
(first|top|second|third|bottom)
# Followed by one of these words (the `|` are OR-statements)
.* # Followed by any amount of random characters again
$ # Followed by the end of the String
(The ^$ could be removed if you use the String#matches builtin, since it always tries to match the entire given String implicitly, but I would add it as clarification.)
After that you can use this regex every time a user inputs something using userInput.matches(regex).
Here is some possible test code:
class Main{
public static void main(String[] a){
String[] array = {"first", "top", "second", "third", "bottom"};
// Create the regex-String of the array once:
String regex = "^.*(";
for(int i=0; i<array.length; i++){
regex += array[i];
if(i < array.length - 1) // All except for the last word in the array:
regex += "|";
}
regex += ").*$";
// Check all user inputs:
// TODO: Actually use user inputs instead of this String-array of test cases of course
for(String userInputTestCase : new String[]{"mine is the first one",
"its the one at the bottom",
"no match",
"top and bottom",
"first"}){
System.out.println(userInputTestCase + " → " + userInputTestCase.matches(regex));
}
}
}
Resulting in:
mine is the first one → true
its the one at the bottom → true
no match → false
top and bottom → true
first → true
So if one or multiple of the words from the array is/are present, it will result in true (including exact matches like the last test case).
Try it online.
EDIT: If you want the matched Strings instead, you could use a slightly modified regex with Pattern.compile(regex).matcher(inputString):
Change these lines:
String regex = "("; // was `String regex = "^.*(";` above
...
regex += ")"; // was `regex += ").*$";` above
And add the following:
// TODO: Put the imports at the top of the class
// This is done once, just like creating the regex above:
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(regex);
// The following is done for every user input:
java.util.regex.Matcher matcher = pattern.matcher(userInputTestCase);
java.util.List<String> results = new java.util.ArrayList<>();
while(matcher.find()){
results.add(matcher.group());
}
System.out.println(userInputTestCase + " → " + results);
Try it online.
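A slightly more defensive sketch of the same approach, under a few assumptions that are not in the answer above: keywords are escaped with Pattern.quote in case they contain regex metacharacters, word boundaries (\\b) keep "top" from matching inside "stopped", and matching is case-insensitive. The class name KeywordMatcher and the method findAll are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordMatcher {
    private final Pattern pattern;

    // Built once; the compiled pattern is reused for every user input.
    KeywordMatcher(String... keywords) {
        String alternation = Arrays.stream(keywords)
                .map(Pattern::quote) // treat each keyword as a literal
                .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns every keyword found in the input, lower-cased, in order of appearance.
    List<String> findAll(String userInput) {
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(userInput);
        while (matcher.find()) {
            results.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return results;
    }

    public static void main(String[] args) {
        KeywordMatcher km = new KeywordMatcher("first", "top", "second", "third", "bottom");
        System.out.println(km.findAll("Mine is the FIRST one"));
        System.out.println(km.findAll("top and bottom"));
        System.out.println(km.findAll("it stopped"));
    }
}
```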

Regex pattern doesn't work when ending without a space

I want to remove strings that contain either http or https. I have the following code segment:
String line="abc http://someurl something https://someurl";
if (line.contains("https") || line.contains("http")) {
System.out.println(line);
String x = line.replaceAll("https?://.*?\\s+", " ");
System.out.println(x);
}
The output is: abc something https://someurl (doesn't remove the ending url)
Desired output is: abc something
I'm guessing its a simple change to the regex...
Edit: Sorry, the previous example didn't contain an actual url after the http.
Your regex is
https?://.*?\\s+
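That final token \s+ means one or more whitespace characters, so a URL at the very end of the string, with nothing after it, is never matched. Simply switching to \s* does not help: the lazy .*? would then match as little as possible (nothing at all), and only the http:// prefix would be removed. Instead, let the match end either at whitespace or at the end of the input:
String x = line.replaceAll("https?://.*?(\\s+|$)", " ");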
That said, if the URLs you have are valid and don't contain any space characters, it would probably make more sense to match non-space characters with \S and replace with the empty string, rather than look for space characters, match them, and then replace with another space:
String x = line.replaceAll("https?://\\S*", "");
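A short sketch of the \\S-based variant that also tidies up the whitespace left behind, assuming URLs never contain whitespace (the class name UrlStripper and the method stripUrls are hypothetical):

```java
class UrlStripper {
    // Hypothetical helper: removes http/https URLs and normalises the remaining spaces.
    static String stripUrls(String line) {
        return line
                .replaceAll("https?://\\S*", "") // URL = scheme plus the following non-whitespace run
                .replaceAll("\\s+", " ")         // collapse the gaps where URLs used to be
                .trim();
    }

    public static void main(String[] args) {
        String line = "abc http://someurl something https://someurl";
        System.out.println(stripUrls(line));
    }
}
```

For the question's input this should print the desired "abc something".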

Looking for method to remove spaces on sides, change all letters to small with first letter as capital letter

I have been trying for a while to write a method that takes user input and removes any spaces in front of and after the text. I tried .trim(), but it doesn't seem to work on input strings with two words. I also didn't manage to give both the first and second word a capital first letter.
If the user inputs the following string, I want every word to be all lowercase except for its first letter, e.g. Long Jump
so if user inputs:
"LONG JuMP"
or
" LoNg JUMP "
change it to
"Long Jump"
private String normalisera(String s) {
return s.trim().substring(0,1).toUpperCase() + s.substring(1).toLowerCase();
}
I tried the method above, but it only works if the input is a single word, not two. It should work with both.
To remove all extra spaces you can do something like this:
string = string.trim().replaceAll(" +", " ");
The above code calls trim to get rid of the spaces at the start and end, then uses a regex to replace every run of spaces with a single space.
To capitalize the first letter of each word, if you're using Apache's commons-lang, you can use WordUtils.capitalizeFully. Otherwise, you'll need a homebrewed solution.
Simply iterate through the String, and if the current character is a space, mark the next character to be uppercased. Otherwise, make it lowercase.
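One way the character-by-character idea might look, as a sketch rather than a definitive implementation (the class name CapitalizeByScan and the method capitalizeWords are hypothetical):

```java
class CapitalizeByScan {
    // Hypothetical helper: trims, collapses spaces, and capitalises the first letter of each word.
    static String capitalizeWords(String s) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = true; // the first character starts a word
        for (char c : s.trim().replaceAll(" +", " ").toCharArray()) {
            if (c == ' ') {
                sb.append(c);
                upperNext = true; // the character after a space starts a new word
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("  LoNg   JUMP "));
    }
}
```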
Split your problems into smaller ones:
You need to be able to:
iterate over all words and ignore all whitespaces (you can use Scanner#next for that)
edit single word into new form (create helper method like String changeWord(String){...})
create new String which will collect edited versions of each word (you can use StringBuilder or better StringJoiner with delimiter set as one space)
So your general solution can look something like:
public static String changeWord(String word) {
//code similar to your current solution
}
public static String changeText(String text) {
StringJoiner sj = new StringJoiner(" ");// space will be delimiter
try(Scanner sc = new Scanner(text)){
while (sc.hasNext()) {
sj.add(changeWord(sc.next()));
}
}
return sj.toString();
}
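A complete version of the outline above might look like the following sketch. The body of changeWord is an assumption based on the asker's own substring approach, and the class name WordNormalizer is hypothetical:

```java
import java.util.Scanner;
import java.util.StringJoiner;

class WordNormalizer {
    // Assumed implementation: first letter upper case, the rest lower case.
    static String changeWord(String word) {
        return word.substring(0, 1).toUpperCase() + word.substring(1).toLowerCase();
    }

    static String changeText(String text) {
        StringJoiner sj = new StringJoiner(" "); // a single space between words
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {               // next() skips any amount of whitespace
                sj.add(changeWord(sc.next()));
            }
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        System.out.println(changeText(" LoNg JUMP "));
    }
}
```

Scanner#next never returns an empty token, so substring(0, 1) is safe here; input without any words simply yields an empty string.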
Since Strings are immutable and you cannot make in place changes you need to store it in a separate variable and then do your manipulations like this:
String s = " some output ";
String sTrimmed = s.trim();
System.out.println(s);
System.out.println(sTrimmed);
Change your code like this for the rest of your code as well.

Regular expression for string with apostrophes

I'm trying to build a regex that filters all non-alphabetical characters out of a string, except that single quotes inside the string should be kept.
So for example when I enter
car's34
as a result I want to get
car's
when I enter
*&* Lisa's car 0)*
I want to get
Lisa's
at the moment I use this:
string.replaceAll("[^A-Za-z]", "")
however, it gives me only letters and removes the single quotes I want to keep.
This will also remove apostrophes that are not "part of words":
string = string.replaceAll("[^A-Za-z' ]+|(?<=^|\\W)'|'(?=\\W|$)", "")
.replaceAll(" +", " ").trim();
This first simply adds an apostrophe to the list of chars you want to keep, but uses look arounds to find apostrophes not within words, so
I'm a ' 123 & 'test'
would become
I'm a test
Note how the solitary apostrophe was removed, as well as the apostrophes wrapping test, but I'm was preserved.
The subsequent replaceAll() is to replace multiple spaces with a single space, which will result if there's a solitary apostrophe in the input. A further call to trim() was added in case it occurs at the end of the input.
Here's a test:
String string = "I'm a ' 123 & 'test'";
string = string.replaceAll("[^A-Za-z' ]+|(?<=^|\\W)'|'(?=\\W|$)", "").replaceAll(" +", " ").trim();
System.out.println(string);
Output:
I'm a test
Isn't this working ?
[^A-Za-z']
The obvious solution would be:
string.replaceAll("[^A-Za-z']", "")
I suspect you want something more.
You can try the regular expression:
[^\p{L}' ]
\p{L} denote the category of Unicode letters.
On the other hand, you should store the Pattern in a constant to avoid recompiling the expression every time, something like this:
private static final Pattern REGEX_PATTERN =
Pattern.compile("[^\\p{L}' ]");
public static void main(String[] args) {
String input = "*&* Lisa's car 0)*";
System.out.println(
REGEX_PATTERN.matcher(input).replaceAll("")
); // prints " Lisa's car "
}
@Bohemian has a good idea, but word boundaries are called for instead of lookarounds:
string.replaceAll("([^A-Za-z']|\\B'|'\\B)+", " ");
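A quick sketch for trying the word-boundary pattern on the inputs from this thread. Note that in a Java string literal the backslash in \B has to be doubled. The trailing trim() is an addition for illustration, and the class name ApostropheFilter and the method keepWords are hypothetical:

```java
class ApostropheFilter {
    // Hypothetical helper: keeps letters and in-word apostrophes, everything else becomes one space.
    static String keepWords(String s) {
        // \B' = apostrophe with no word character before it; '\B = none after it
        return s.replaceAll("([^A-Za-z']|\\B'|'\\B)+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(keepWords("I'm a ' 123 & 'test'"));
        System.out.println(keepWords("*&* Lisa's car 0)*"));
        System.out.println(keepWords("car's34"));
    }
}
```

Because the space is not in the kept character class, runs of spaces are folded into the same replacement, so no separate space-collapsing step is needed.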
