Decipher this regex

Decipher this regex - java

I came back to a project I was working on several months ago. One problem I solved back then was extracting a certain part of a String. The String used both parentheses and quotation marks, so I couldn't split it like normal text.
Example of how the String might look:
Word_Object("id"): preword:subword
Now say I wanted to only grab what's after the ("id"):, that is
'preword:subword'
I found that regex helped me out, and it took quite some time to find an example that was applicable to what I wanted. I had to settle for an example because I tried to find sources for learning this incredibly complex system and failed. The regex that solved it looks like this: "Word_Object(\\(\"" + "id" + "\")\\): "
I was content then that it seemed to work, but when I came back to the project and tried it on a word that used an underscore _, the underscore and the word(s) following it were left out.
Example: splitting the text Word_Object("id"): preword:subword_underscoreword using the regex (complete line now) idSplit = subTemp.split("Word_Object(\\(\"" + "id" + "\")\\): "); would simply return preword:subword instead of the wanted preword:subword_underscoreword.
Did I somehow instruct this regex to ignore anything after the second special character (it does accept :, but apparently _ breaks everything)?

public static void main(String[] args) {
final String[] split = "Word_Object(\"id\"): preword:subword_underscoreword".split("Word_Object(\\(\"" + "id" + "\")\\): ");
System.out.println("split = " + split[1]);
}
Leads to
split = preword:subword_underscoreword

Since you might need to keep the id dynamic, here is a replaceAll solution:
String s = "Word_Object(\"id\"): preword:subword_underscoreword";
System.out.println(s.replaceAll("Word_Object(\\(\"" + "id" + "\")\\):\\s*",""));
See IDEONE demo
Output: preword:subword_underscoreword
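If the id really is dynamic, it may be worth quoting it so that any regex metacharacters it contains are matched literally. A minimal sketch, assuming a hypothetical helper named stripPrefix (not part of the answer above) and using the standard Pattern.quote:

```java
import java.util.regex.Pattern;

public class DynamicIdStrip {
    // Hypothetical helper: removes the leading Word_Object("<id>"): prefix.
    static String stripPrefix(String line, String id) {
        // Pattern.quote makes the id match literally, even if it contains
        // characters such as '.' or '(' that are special in a regex.
        String regex = "^Word_Object\\(\"" + Pattern.quote(id) + "\"\\):\\s*";
        return line.replaceFirst(regex, "");
    }

    public static void main(String[] args) {
        String s = "Word_Object(\"id\"): preword:subword_underscoreword";
        System.out.println(stripPrefix(s, "id"));
    }
}
```

With the quoted id, an id such as a.b only matches the literal text a.b, not axb.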

You should match instead of replacing or splitting:
private static final Pattern PRE_SUB_WORD_EXTRACT = Pattern.compile("Word_Object\\(\"\\w+\"\\): (\\w+):(\\w+)");
public static void main(String[] args) {
String test = "Word_Object(\"id\"): preword:subword_underscorewordusing";
Matcher testMatcher = PRE_SUB_WORD_EXTRACT.matcher(test);
if (!testMatcher.matches()) {
System.out.println("Bollocks");
System.exit(1);
}
System.out.printf("%s : %s%n", testMatcher.group(1), testMatcher.group(2));
}

As mentioned in the comments, there's no need to use .split(): it gives you an array of Strings rather than the exact one you want. Just use .replace() with an empty string and you will get the result you need:
String str = "Word_Object(\"id\"): preword:subword_underscoreword";
String str2 = str.replace("Word_Object(\"id\"): ", "");
This is a DEMO that will give you preword:subword_underscoreword in output.

Related

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers having the same issue. It's probably because I can't phrase it properly that I haven't been able to find anything yet to solve my issue... I'll modify the title, or just close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the spaces in between are some \t first, and some whitespaces then).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that in the second part, we only have numbers and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and whitespaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete and runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?

Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on the whitespace (which should now only be in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
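The trim-and-split idea can be tried on the sample inputs from the question. A minimal sketch, assuming the input always has exactly two whitespace-separated parts; the class and method names are made up for illustration:

```java
public class TrimSplitDemo {
    // Returns {version, buildId}; assumes exactly two whitespace-separated parts.
    static String[] parse(String data) {
        // trim() removes the trailing spaces, \\s+ covers tabs, spaces and newlines.
        return data.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] samples = {
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
            "v3.1.build.dev.12345.team \n12345-12345-12345-12345 "
        };
        for (String sample : samples) {
            String[] parts = parse(sample);
            System.out.println("Version: " + parts[0] + ", build id: " + parts[1]);
        }
    }
}
```

If the input could contain more or fewer than two parts, checking the array length before indexing would be a sensible addition.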

(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this:
(^v\w.+) ensures that the string starts with a v followed by a word character; .+ then greedily takes the rest of the line (it does not cross a newline) and backtracks until the remainder of the pattern matches, so this group can end with trailing whitespace.
\s+ matches one or more whitespace characters (spaces, tabs, newlines, etc.).
(\d+-\d+-\w+-\d+) reads the build id, ensuring that it conforms to your specified format. Note that this keeps the dashes, making it easier for you to split the string afterwards to get the information you need. If you want, you could even make the parts their own capture groups, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
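In Java the pattern could be used roughly as follows. This is a sketch rather than the answerer's own code: the class name and the extract helper are assumptions, and group 1 is trimmed because the greedy .+ can leave tabs or spaces at its end.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoGroupDemo {
    private static final Pattern UNIT =
            Pattern.compile("(^v\\w.+)\\s+(\\d+-\\d+-\\w+-\\d+)\\s*");

    // Returns {version, buildId}, or null if the input does not match.
    static String[] extract(String input) {
        Matcher m = UNIT.matcher(input);
        if (!m.find()) {
            return null;
        }
        // trim() drops whitespace that the greedy .+ may keep at the end of group 1.
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = extract("v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345");
        System.out.println("Version: " + parts[0]);
        System.out.println("Build id: " + parts[1]);
    }
}
```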

I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the order from left to right represents group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can have both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
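Without that, your regex is saying: match one whitespace character followed by one or more characters that are not line terminators. The trailing spaces after the version satisfy this on their own (a space followed by more spaces, stopping at the \n), so you first match just whitespace, which becomes "" once the whitespace is stripped.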
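The difference between the two build-id patterns can be checked with a short sketch. The input is an assumption: it uses three trailing spaces before the newline, which the question says are present but not visible in the samples. The helper name matches is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BuildIdPatternDemo {
    // Collects every match of regex in input, with all whitespace stripped.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group().replaceAll("\\s", ""));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed input: several trailing spaces before the newline.
        String sample = "v3.1.build.dev.12345.team   \n12345-12345-cici-12345 ";
        System.out.println(matches("[\\s].+", sample));   // original pattern
        System.out.println(matches("[\\s]+.+", sample));  // pattern with the extra +
    }
}
```

On this input the original pattern yields two matches, the first of which is empty after stripping, while the pattern with the extra + yields only the build id.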

TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the Matcher class interprets the . character: it DOES NOT match newlines, so it stops matching just before the \n. To change that, you need to add the flag Pattern.DOTALL using Pattern.compile(String pattern, int flags).
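The effect of the flag can be seen in isolation with a small sketch. The pattern team.+cici and the input are made up for illustration; only Pattern.DOTALL and Pattern.compile(String, int) come from the text above.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // True if "team", at least one arbitrary character, then "cici" is found.
    static boolean found(String input, int flags) {
        return Pattern.compile("team.+cici", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "team \n12345-cici";
        // Without flags, . cannot step over the \n, so nothing is found.
        System.out.println(found(input, 0));
        // With DOTALL, . also matches line terminators.
        System.out.println(found(input, Pattern.DOTALL));
    }
}
```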
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

How do i split a string using delimiters in java?

What regex pattern would I need to pass to the String.split() method to split a string into an array of substrings, using whitespace as well as the following characters as delimiters?
(" ! ", " , " , " ? " , " . " , " \ " , " _ " , " # " , " ' " ) and it can also be the combination of the above characters with whitespace. I've tried something like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.*;
class StringWordCount {
public static void main(String[] args) throws IOException {
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
String string = bufferedReader.readLine();
String delimiter = "[,\\s]+|\\[!\\s]+|\\[?\\s]+|\\[.\\s]+|\\[_\\s]+|\\[_\\s]+|\\['\\s]+|\\[#\\s]+|\\!|\\,|\\?|\\.|\\_|\\'|\\#";
String[] words = string.split(delimiter);
System.out.println(words.length);
for(int i = 0; i<words.length; i++) {
System.out.println(words[i]);
}
}
}
The above code only generates correct output for some test cases; in other cases, it doesn't. For example,
Consider the below string where it failed to get the expected output.
It generates the output:
23
Hello
thanks
for
attempting
this
problem
Hope
it
will
help
you
to
learn
java
Good
luck
and
have
a
nice
day
Instead of this one:
21
Hello
thanks
for
attempting
this
problem
Hope
it
will
help
you
to
learn
java
Good
luck
and
have
a
nice
day
As you can see in the first output, it's leaving a space on the combination of " ! " and [space], and the delimiter for that combination is \\[!\\s], right?

You can try this one:
String str = "Hello, thanks for attempting this problem! Hope it will help you to learn java! Good luck and have a nice day!";
//String[] split = str.split("[\\p{Punct}\\s+]");
String[] split = str.split("[\\p{Punct}\\p{Blank}]+");
System.out.println("Arrays.toString(split) = " + Arrays.toString(split));
Result is:
Arrays.toString(split) = [Hello, thanks, for, attempting, this, problem, Hope, it, will, help, you, to, learn, java, Good, luck, and, have, a, nice, day]
Edited: edited line below
String[] split = str.split("[\\p{Punct}\\p{Blank}]+");

In this line:
String delimiter = "[,\\s]+|\\[!\\s]+|\\[?\\s]+|\\[.\\s]+|\\[_\\s]+|\\[_\\s]+|\\['\\s]+|\\[#\\s]+|\\!|\\,|\\?|\\.|\\_|\\'|\\#";
you have \\[ in the string literal, which means the pattern has two characters \[ in it. In the pattern matcher, this causes the matcher to look for the [ character. This isn't what you want.
When a \ character appears in a pattern string:
If the following character is a letter or digit, the combination has some special meaning (for example, you're using \s in the string to mean whitespace), but:
If the following character is something other than a letter or a digit, this means to treat the following character as itself. Any special meaning the character may have had is canceled.
It looks like you're trying to use [!\s]+ (in the pattern; of course you had to double the backslash in the string literal) to match one or more characters in the set of ! and whitespace. Here, [ and ] have a special meaning, to match any character in a set. But putting \ before the [ cancels the special meaning of [, and causes the matcher to look for a [ in the input, which it doesn't find.
See this javadoc for more information.
I'm not sure, but I think getting rid of all the \\ before each [ will make things work. The pattern will still be more complicated than necessary (and I'm not 100% clear on what the requirements are, so it's hard for me to suggest an improvement).
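Taking that one step further, the alternatives can be merged into a single character class. A minimal sketch, assuming the delimiter set listed in the question (whitespace and ! , ? . \ _ # '); the class and constant names are illustrative:

```java
import java.util.Arrays;

public class DelimiterSplitDemo {
    // One character class listing every delimiter; + merges adjacent delimiters.
    // Inside [...] the characters ? and . are literal; \\\\ is one literal backslash.
    static final String DELIMITERS = "[\\s!,?.\\\\_#']+";

    static String[] words(String line) {
        return line.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String line = "Hello, thanks for attempting this problem! Hope it will help you "
                + "to learn java! Good luck and have a nice day!";
        String[] words = words(line);
        System.out.println(words.length);
        System.out.println(Arrays.toString(words));
    }
}
```

One caveat: if the input starts with a delimiter, split returns an empty first element, so such input may need extra handling.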

Just do matching instead of splitting..
ArrayList<String> lst = new ArrayList<String>();
Matcher m = Pattern.compile("\\w+").matcher(s);
while(m.find()) {
lst.add(m.group());
}

Regular expression for string with apostrophes

I'm trying to build a regex which will filter from a string all non-alphabetical characters; if the string contains single quotes, I want to keep them as an exception to the rule.
So for example when I enter
car's34
as a result I want to get
car's
when I enter
*&* Lisa's car 0)*
I want to get
Lisa's
at the moment I use this:
string.replaceAll("[^A-Za-z]", "")
however, it gives me only letters, and removes the desired single quotes.

This will also remove apostrophes that are not "part of words":
string = string.replaceAll("[^A-Za-z' ]+|(?<=^|\\W)'|'(?=\\W|$)", "")
.replaceAll(" +", " ").trim();
This first simply adds an apostrophe to the list of chars you want to keep, but uses look arounds to find apostrophes not within words, so
I'm a ' 123 & 'test'
would become
I'm a test
Note how the solitary apostrophe was removed, as well as the apostrophes wrapping test, but I'm was preserved.
The subsequent replaceAll() is to replace multiple spaces with a single space, which will result if there's a solitary apostrophe in the input. A further call to trim() was added in case it occurs at the end of the input.
Here's a test:
String string = "I'm a ' 123 & 'test'";
string = string.replaceAll("[^A-Za-z' ]+|(?<=^|\\W)'|'(?=\\W|$)", "").replaceAll(" +", " ").trim();
System.out.println(string);
Output:
I'm a test

Isn't this working ?
[^A-Za-z']

The obvious solution would be:
string.replaceAll("[^A-Za-z']", "")
I suspect you want something more.

You can try the regular expression:
[^\p{L}' ]
\p{L} denote the category of Unicode letters.
On the other hand, you should keep the Pattern in a constant to avoid recompiling the expression every time, something like this:
private static final Pattern REGEX_PATTERN =
Pattern.compile("[^\\p{L}' ]");
public static void main(String[] args) {
String input = "*&* Lisa's car 0)*";
System.out.println(
REGEX_PATTERN.matcher(input).replaceAll("")
); // prints " Lisa's car "
}

@Bohemian has a good idea, but word boundaries are called for instead of lookaround:
string.replaceAll("([^A-Za-z']|\\B'|'\\B)+", " ");

How to replace last dot in a string using a regular expression?

I'm trying to replace the last dot in a String using a regular expression.
Let's say I have the following String:
String string = "hello.world.how.are.you!";
I want to replace the last dot with an exclamation mark such that the result is:
"hello.world.how.are!you!"
I have tried various expressions using the method String.replaceAll(String, String) without any luck.

One way would be:
string = string.replaceAll("^(.*)\\.(.*)$","$1!$2");
Alternatively you can use negative lookahead as:
string = string.replaceAll("\\.(?!.*\\.)","!");
Regex in Action
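Both expressions can be compared side by side. A small sketch with made-up method names, using only String.replaceAll:

```java
public class LastDotDemo {
    // Greedy first group: (.*) grabs as much as possible, so \. lands on the last dot.
    static String withGroups(String s) {
        return s.replaceAll("^(.*)\\.(.*)$", "$1!$2");
    }

    // Negative lookahead: match a dot that has no other dot after it.
    static String withLookahead(String s) {
        return s.replaceAll("\\.(?!.*\\.)", "!");
    }

    public static void main(String[] args) {
        String input = "hello.world.how.are.you!";
        System.out.println(withGroups(input));
        System.out.println(withLookahead(input));
    }
}
```

Both calls print hello.world.how.are!you! for this input, and a string without any dot is returned unchanged.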

Although you can use a regex, it's sometimes best to step back and just do it the old-fashioned way. I've always been of the belief that, if you can't think of a regex to do it in about two minutes, it's probably not suited to a regex solution.
No doubt you'll get some wonderful regex answers here. Some of them may even be readable :-)
You can use lastIndexOf to get the last occurrence and substring to build a new string. This complete program shows how:
public class testprog {
public static String morph (String s) {
int pos = s.lastIndexOf(".");
if (pos >= 0)
return s.substring(0,pos) + "!" + s.substring(pos+1);
return s;
}
public static void main(String args[]) {
System.out.println (morph("hello.world.how.are.you!"));
System.out.println (morph("no dots in here"));
System.out.println (morph(". first"));
System.out.println (morph("last ."));
}
}
The output is:
hello.world.how.are!you!
no dots in here
! first
last !

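The regex you need is \\.(?=[^.]*$). The ?= is a lookahead assertion. Note that it has to go through replaceAll, since replace treats its first argument as a literal string:
"hello.world.how.are.you!".replaceAll("\\.(?=[^.]*$)", "!")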

Try this:
string = string.replaceAll("[.]$", "");

How do I delete specific characters from a particular String in Java?

For example I'm extracting a text String from a text file and I need those words to form an array. However, when I do all that some words end with comma (,) or a full stop (.) or even have brackets attached to them (which is all perfectly normal).
What I want to do is to get rid of those characters. I've been trying to do that using those predefined String methods in Java but I just can't get around it.

Reassign the variable to a substring:
s = s.substring(0, s.length() - 1)
Also an alternative way of solving your problem: you might also want to consider using a StringTokenizer to read the file and set the delimiters to be the characters you don't want to be part of words.
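The StringTokenizer suggestion might look like this. A sketch under assumptions: the delimiter set (whitespace, comma, full stop, parentheses) is a guess at what the question needs, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    static List<String> words(String text) {
        // Every character in the second argument is treated as a delimiter.
        StringTokenizer tokenizer = new StringTokenizer(text, " \t\n,.()");
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("However, when I do (all) that."));
    }
}
```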

Use:
String str = "whatever";
str = str.replaceAll("[,.]", "");
replaceAll takes a regular expression. This:
[,.]
...looks for each comma and/or period.
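The same character class can be extended to the brackets mentioned in the question and combined with split to get the array of words. A minimal sketch, assuming "brackets" means round parentheses; the names are illustrative:

```java
import java.util.Arrays;

public class StripPunctuationDemo {
    static String[] words(String text) {
        // [,.()] matches any single comma, period or parenthesis.
        String cleaned = text.replaceAll("[,.()]", "");
        // Split what is left on runs of whitespace.
        return cleaned.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Hello, world (again).")));
    }
}
```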

To remove the last character do as Mark Byers said
s = s.substring(0, s.length() - 1);
Additionally, another way to remove the characters you don't want would be to use the .replace(oldCharacter, newCharacter) method.
as in:
s = s.replace(",","");
and
s = s.replace(".","");

You can't modify a String in Java. They are immutable. All you can do is create a new string that is substring of the old string, minus the last character.
In some cases a StringBuffer might help you instead.
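As a rough illustration of the mutable alternative, a StringBuffer can drop its last character in place. The helper name dropLastChar is hypothetical:

```java
public class StringBufferDemo {
    // Returns word without its last character; empty input is returned as-is.
    static String dropLastChar(String word) {
        if (word.isEmpty()) {
            return word;
        }
        StringBuffer buffer = new StringBuffer(word);
        // deleteCharAt modifies the buffer itself instead of creating a new String.
        buffer.deleteCharAt(buffer.length() - 1);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(dropLastChar("word,"));
    }
}
```

The original String is still untouched; only the buffer changes, and toString() produces a new String from it.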

The best method is what Mark Byers explains:
s = s.substring(0, s.length() - 1)
For example, if we want to replace \ with a space " " using replaceAll, it doesn't work well:
String.replaceAll("\\", "");
or
String.replaceAll("\\$", ""); //if it is a path

Note that word boundaries also depend on the Locale. I think the best way to do it is to use the standard java.text.BreakIterator. Here is an example from the java.sun.com tutorial.
import java.text.BreakIterator;
import java.util.Locale;
// Wrapper class added so the example compiles; the class name is arbitrary.
public class WordExtractor {
public static void main(String[] args) {
String text = "\n" +
"\n" +
"For example I'm extracting a text String from a text file and I need those words to form an array. However, when I do all that some words end with comma (,) or a full stop (.) or even have brackets attached to them (which is all perfectly normal).\n" +
"\n" +
"What I want to do is to get rid of those characters. I've been trying to do that using those predefined String methods in Java but I just can't get around it.\n" +
"\n" +
"Every help appreciated. Thanx";
BreakIterator wordIterator = BreakIterator.getWordInstance(Locale.getDefault());
extractWords(text, wordIterator);
}
// Prints every segment between word boundaries that starts with a letter or digit.
static void extractWords(String target, BreakIterator wordIterator) {
wordIterator.setText(target);
int start = wordIterator.first();
int end = wordIterator.next();
while (end != BreakIterator.DONE) {
String word = target.substring(start, end);
// Skip segments that are only whitespace or punctuation.
if (Character.isLetterOrDigit(word.charAt(0))) {
System.out.println(word);
}
start = end;
end = wordIterator.next();
}
}
}
Source: http://java.sun.com/docs/books/tutorial/i18n/text/word.html

You can use the replaceAll() method on your String s, reassigning the result each time:
s = s.replaceAll(",", "");
s = s.replaceAll("\\.", "");
s = s.replaceAll("\\(", "");
etc..
