Java regular expression: starts with, ends with, and contains

I have a file in which I need to replace a specific character using a regex.
I have strings of the following format:
1234 4215 "aaa.bbb" 5215 1524
and I need to replace the periods with colons.
I know that these periods are always contained within quotation marks, so I need a regex that finds a substring that starts with '"', ends with '"', and contains "." and replace the "." with ":". Could someone shed some light?

You can use:
str = str.replaceAll("\\.(?!(([^\"]*\"){2})*[^\"]*$)", ":");
This regex finds dots only when they are inside double quotes (assuming the quotes are balanced): the negative lookahead makes sure the dot is NOT followed by an even number of quotes.
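To see the lookahead at work, here is a minimal sketch. The class name `QuotedDots`, the method name `replaceQuotedDots`, and the sample input with an unquoted `1.5` are made up for illustration; only the regex comes from the answer above, and it assumes the quotes in the input are balanced.

```java
class QuotedDots {
    // Hypothetical helper: replaces a dot only when an odd number of
    // double quotes follows it, i.e. when the dot sits inside a quoted part.
    static String replaceQuotedDots(String s) {
        return s.replaceAll("\\.(?!(([^\"]*\"){2})*[^\"]*$)", ":");
    }

    public static void main(String[] args) {
        String input = "1234 4215 \"aaa.bbb\" 5215 1.5";
        // The dot in "aaa.bbb" is replaced; the dot in 1.5 is left alone.
        System.out.println(replaceQuotedDots(input));
    }
}
```

The dot in `1.5` survives because no quotes follow it, which is an even count (zero), so the negative lookahead rejects it.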

Update
After thinking about it, your question says "periods", so there may be more than one period inside the double quotes.
Here's a way to cover that scenario:
public static void main(String[] args) throws Exception {
String str = "1234 \"aaa.bbb\" \"a.aa.b.bb\" 5215 1524 \"12.345.123\" \".sage.\" \".afwe\" \"....\"";
// Find all substrings in double quotes
Matcher matcher = Pattern.compile("\"(.*?)\"").matcher(str);
while (matcher.find()) {
// Extract the match
String match = matcher.group(1);
// Replace all the periods with colons
match = match.replaceAll("\\.", ":");
// Replace the original matched group with the new string
str = str.replace(matcher.group(1), match);
}
System.out.println(str);
}
Results:
1234 "aaa:bbb" "a:aa:b:bb" 5215 1524 "12:345:123" ":sage:" ":afwe" "::::"
And after testing @anubhava's pattern, it produces the same results, so more credit to him for simplicity (+1).
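One thing to watch with the loop above: `str.replace(matcher.group(1), match)` replaces every occurrence of the quoted text, so the same text appearing outside quotes would be changed too. A possible alternative, sketched here on the assumption that Java 9 or later is available, is `Matcher.replaceAll` with a function, which only touches the matched spans. The class name `QuotedPeriods` and method name `colonize` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPeriods {
    // A double quote, any run of non-quote characters, and a closing double quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    // Hypothetical helper: rewrites periods to colons inside each quoted span only.
    static String colonize(String s) {
        return QUOTED.matcher(s).replaceAll(
                // quoteReplacement keeps any '$' or '\' in the text from being
                // interpreted as a group reference or escape.
                mr -> Matcher.quoteReplacement(mr.group().replace('.', ':')));
    }

    public static void main(String[] args) {
        System.out.println(colonize("1.5 \"a.aa.b.bb\" \"....\" 2.5"));
    }
}
```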
OLD ANSWER
You can try this pattern in a String.replaceAll()
"\"([^\\.]*?)(\\.)([^\\.]*?)\""
With a replacement of
"\"$1:$3\""
This essentially captures the contents, between double quotes, into groups (1-3).
Group 1 ($1) - Any characters, possibly none (*?), that are not a period
Group 2 ($2) - The period
Group 3 ($3) - Any characters, possibly none (*?), that are not a period
and replaces them with "{Group 1}:{Group 3}"
public static void main(String[] args) throws Exception {
String str = "1234 4215 \"aaa.bbb\" 5215 1524 \"12345.123\" \"sage.\" \".afwe\" \".\"";
System.out.println(str.replaceAll("\"([^\\.]*?)(\\.)([^\\.]*?)\"", "\"$1:$3\""));
}
Results:
1234 4215 "aaa:bbb" 5215 1524 "12345:123" "sage:" ":afwe" ":"

Related

Java String keep numeric characters only at the end of a String

what is the regular expression so I can keep only the LAST numbers at the END of a String?
For example
Test123 -> 123
T34est56 -> 56
123Test89 -> 89
Thanks
I tried
str.replaceAll("[^A-Za-z\\s]", ""); but this removes all the numbers of the String.
I also tried str.replaceAll("\\d*$", ""); but this returns the following:
Test123 -> Test
T34est56 -> T34est
123Test89 -> 123Test
I want exactly the opposite.
Capturing the trailing digits in a group and then replacing the whole string with that group seems to work:
String str = "123Test89";
String result = str.replaceAll(".*[^\\d](\\d+$)", "$1");
System.out.println(result);
This outputs:
89
You can use replaceFirst() to remove everything (.*) up to the last non-digit (\\D):
s = s.replaceFirst(".*\\D", "");
Complete example:
public class C {
public static void main(String args[]) {
String s = "T34est56";
s = s.replaceFirst(".*\\D", "");
System.out.println(s); // 56
}
}
You could use a regex like this:
String result = str.replaceFirst(".*?(\\d+$)", "$1");
Explanation:
.*: Any amount of leading characters
?: Makes the preceding .* lazy (reluctant), so it matches as few characters as possible. This makes sure the regex part after it ((\\d+$)) has priority over the .*. Without the ?, every test case would only return the very last digit (e.g. 123Test89 would return 9 instead of 89).
\\d+: One or more digits
$: At the very end of the string
(...): Captured in a capture group
Which is then replaced with:
$1: The match of the first capture group (so the trailing digits)
To perhaps make it slightly more clear, you could add a leading ^ to the regex: "^.*?(\\d+$)", although it's not really necessary because .* already matches every leading character.
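The difference the ? makes can be checked side by side. This is a small sketch; the class name `TrailingDigits` and the method names `lazy` and `greedy` are made up for illustration.

```java
class TrailingDigits {
    // Lazy .*? gives up characters to the group, so all trailing digits are captured.
    static String lazy(String s) {
        return s.replaceFirst(".*?(\\d+$)", "$1");
    }

    // Greedy .* keeps as much as it can, leaving only the final digit for the group.
    static String greedy(String s) {
        return s.replaceFirst(".*(\\d+$)", "$1");
    }

    public static void main(String[] args) {
        System.out.println(lazy("123Test89"));   // 89
        System.out.println(greedy("123Test89")); // 9
    }
}
```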
I like to use the Pattern and Matcher API:
// One or more digits (0-9) at the very end of the input
Pattern pattern = Pattern.compile("\\d+$");
Matcher matcher = pattern.matcher("Test123");
if (matcher.find()) {
System.out.println(matcher.group()); // 123
}
I think /.*?(\d+)$/ will work.

Splitting a string on whitespaces

I'm currently trying to split a string into a multi-line string.
The regex should select whitespace characters that are preceded by at least 13 characters.
The problem is that the 13-character count does not reset after the previously selected whitespace. So, after the first 13 characters, the regex selects every whitespace.
I'm using the following regex with a positive look-behind of 13 characters:
(?<=.{13})
(there is a whitespace at the end)
You can test the regex with the following code:
import java.util.ArrayList;
public class HelloWorld{
public static void main(String []args){
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
for (String string : str.split("(?<=.{13}) ")) {
System.out.println(string);
}
}
}
The output of this code is as follows:
This is a test.
The
app
should
break
this
string
in
substring
on
whitespaces
after
13
characters
But it should be:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
You may actually use a lazy limiting quantifier to match the lines and then replace with $0\n:
.{13,}?[ ]
IDEONE demo:
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
System.out.println(str.replaceAll(".{13,}?[ ]", "$0\n"));
Note that the pattern matches:
.{13,}? - any character that is not a newline (if you need to match any character, use the DOTALL modifier, though I doubt you need it in the current scenario), at least 13 times but as few as possible, so the match extends only up to the first space found after the 13th character
[ ] - a literal space (a character class is redundant, but it helps visualize the pattern).
The replacement pattern - "$0\n" - is re-inserting the whole matched value (it is stored in Group 0) and appends a newline after it.
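The same idea can be wrapped in a small helper that returns the lines as a list. This is only a sketch: the class name `LineWrap`, the method name `wrap`, and the `minWidth` parameter are made up for illustration. It captures the text before the space in group 1, so the space itself is replaced by the newline rather than kept at the end of each line.

```java
import java.util.Arrays;
import java.util.List;

class LineWrap {
    // Hypothetical helper: breaks text at the first space that follows
    // at least minWidth characters on the current line.
    static List<String> wrap(String text, int minWidth) {
        // Group 1 is the line content; the space after it becomes a newline.
        String withBreaks = text.replaceAll("(.{" + minWidth + ",}?) ", "$1\n");
        return Arrays.asList(withBreaks.split("\n"));
    }

    public static void main(String[] args) {
        String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
        wrap(str, 13).forEach(System.out::println);
    }
}
```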
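You can just match and capture at least 13 characters before whitespace (or the end of the string) rather than splitting.
Java code:
// Lazily capture 13 or more characters, up to the next run of spaces or the end of the input
Pattern p = Pattern.compile( "(.{13,}?)(?: +|$)" );
Matcher m = p.matcher( text );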
List<String> matches = new ArrayList<>();
while(m.find()) {
matches.add(m.group(1));
}
It will produce:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
You can do this with .split and a regular expression, like this:
line.split("\\s+");
This splits the string at every run of one or more whitespace characters, so each word becomes its own element.

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir@dur:~/NetBeansProjects/regex$
thufir@dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir@dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
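The question asks for the pattern only, but for anyone who wants to try it from Java, here is a minimal sketch. The class name `LastWord` and the method name `lastWord` are made up for illustration; the pattern is the `\p{Punct}` variant above with the redundant inner group dropped.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWord {
    // Word characters that are followed by optional punctuation and the end of the input.
    private static final Pattern LAST_WORD = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Hypothetical helper: returns the last word without trailing punctuation, if any.
    static Optional<String> lastWord(String sentence) {
        Matcher m = LAST_WORD.matcher(sentence);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi").orElse("(no match)"));  // hi
        System.out.println(lastWord("a b cd efg hi.").orElse("(no match)")); // hi
    }
}
```

Because the punctuation sits inside the lookahead, it is checked but never becomes part of the match, so `matcher.group()` is already the bare word.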
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) would return the full match. So your regex is almost correct. One improvement concerns your use of $, which matches only at the end of the line. Use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1, which means matcher.group(0) holds the word with the punctuation and matcher.group(1) holds the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
the capture group will give the correct match.
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?

Currency values string split by comma

I have a String which contains formatted currency values like 45,890.00, with multiple values separated by commas, like 45,890.00,12,345.00,23,765.34,56,908.50 ..
I want to extract and process all the currency values, but could not figure out the correct regular expression for this. This is what I have tried:
public static void main(String[] args) {
String currencyValues = "45,890.00,12,345.00,23,765.34,56,908.50";
String regEx = "\\.[0-9]{2}[,]";
String[] results = currencyValues.split(regEx);
//System.out.println(Arrays.toString(results));
for(String res : results) {
System.out.println(res);
}
}
The output of this is:
45,890 //removing the decimals as the reg ex is exclusive
12,345
23,765
56,908.50
Could someone please help me with this one?
You need a regex "look behind" (?<=regex), which matches, but does not consume:
String regEx = "(?<=\\.[0-9]{2}),";
Here's your test case now working:
public static void main(String[] args) {
String currencyValues = "45,890.00,12,345.00,23,765.34,56,908.50";
String regEx = "(?<=\\.[0-9]{2}),"; // Using the regex with the look-behind
String[] results = currencyValues.split(regEx);
for (String res : results) {
System.out.println(res);
}
}
Output:
45,890.00
12,345.00
23,765.34
56,908.50
You could also use a different regular expression to match the pattern that you're searching for (then it doesn't really matter what the separator is):
String currencyValues = "45,890.00,12,345.00,23,765.34,56,908.50,55.00,345,432.00";
Pattern pattern = Pattern.compile("(\\d{1,3},)?\\d{1,3}\\.\\d{2}");
Matcher m = pattern.matcher(currencyValues);
while (m.find()) {
System.out.println(m.group());
}
prints
45,890.00
12,345.00
23,765.34
56,908.50
55.00
345,432.00
Explanation of the regex:
\\d matches a digit
\\d{1,3} matches 1-3 digits
(\\d{1,3},)? optionally matches 1-3 digits followed by a comma.
\\. matches a dot
\\d{2} matches 2 digits.
However, I would also say that having comma as a separator is probably not the best design and would probably lead to confusion.
EDIT:
As @tobias_k points out: \\d{1,3}(,\\d{3})*\\.\\d{2} would be a better regex, as it would correctly match:
1,000,000,000.00
and it won't incorrectly match:
1,00.00
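A quick way to try the improved pattern is sketched below. The class name `CurrencyValues` and the method names `extract` and `isValid` are made up for illustration. Note that the "won't match 1,00.00" claim is checked here with `matches()`, which tests the whole string; `find()` would still pick up a substring of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CurrencyValues {
    // 1-3 digits, any number of ",ddd" groups, then a dot and two decimals.
    private static final Pattern VALUE = Pattern.compile("\\d{1,3}(,\\d{3})*\\.\\d{2}");

    // Hypothetical helper: collects every currency value found in the input, left to right.
    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(input);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    // Hypothetical helper: true only if the whole string is one well-formed value.
    static boolean isValid(String value) {
        return VALUE.matcher(value).matches();
    }

    public static void main(String[] args) {
        extract("45,890.00,12,345.00,23,765.34,56,908.50,55.00,345,432.00")
                .forEach(System.out::println);
        System.out.println(isValid("1,000,000,000.00")); // true
        System.out.println(isValid("1,00.00"));          // false
    }
}
```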
All of the above solutions assume that every value in the string is a decimal value with a comma. What if the currency value string looks like this:
String str = "1,123.67aed,34,234.000usd,1234euro";
Here not all values are decimals. There should be a way to decide if the currency is in decimal or integer.

How do I make a regex match for measurement units?

I'm building a small Java library which has to match units in strings. For example, if I have "300000000 m/s^2", I want it to match against "m" and "s^2".
So far, I have tried most imaginable (by me) configurations resembling (I hope it's a good start)
"[[a-zA-Z]+[\\^[\\-]?[0-9]+]?]+"
To clarify, I need something that will match letters[^[-]numbers] (where [ ] denotes non obligatory parts). That means: letters, possibly followed by an exponent which is possibly negative.
I have studied regex a little bit, but I'm really not fluent, so any help will be greatly appreciated!
Thank you very much,
EDIT:
I have just tried the first 3 replies
String regex1 = "([a-zA-Z]+)(?:\\^(-?\\d+))?";
String regex2 = "[a-zA-Z]+(\\^-?[0-9]+)?";
String regex3 = "[a-zA-Z]+(?:\\^-?[0-9]+)?";
and it doesn't work... I know the code which tests the patterns works, because if I try something simple, like matching "[0-9]+" in "12345", it will match the whole string. So, I don't get what's still wrong. I'm trying changing my brackets to parentheses where needed at the moment...
CODE USED TO TEST:
public static void main(String[] args) {
String input = "30000 m/s^2";
// String input = "35345";
String regex1 = "([a-zA-Z]+)(?:\\^(-?\\d+))?";
String regex2 = "[a-zA-Z]+(\\^-?[0-9]+)?";
String regex3 = "[a-zA-Z]+(?:\\^-?[0-9]+)?";
String regex10 = "[0-9]+";
String regex = "([a-zA-Z]+)(?:\\^\\-?[0-9]+)?";
Pattern pattern = Pattern.compile(regex3);
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
System.out.println("MATCHES");
do {
int start = matcher.start();
int end = matcher.end();
// System.out.println(start + " " + end);
System.out.println(input.substring(start, end));
} while (matcher.find());
}
}
([a-zA-Z]+)(?:\^(-?\d+))?
You don't need to use the character class [...] if you're matching a single character. (...) here is a capturing bracket for you to extract the unit and exponent later. (?:...) is non-capturing grouping.
You're using square brackets, which denote character classes, where you need parentheses to group. Try this instead:
[a-zA-Z]+(\^-?[0-9]+)?
In many regular expression dialects you can use \d to mean any digit instead of [0-9].
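A likely reason the test code in the question prints nothing even with these patterns: `Matcher.matches()` succeeds only when the entire input matches, and "30000 m/s^2" as a whole does not (whereas "12345" as a whole does match "[0-9]+"). Looping over `find()` instead scans for each occurrence. A minimal sketch, where the class name `UnitFinder` and the method name `findUnits` are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnitFinder {
    // Letters, optionally followed by ^ and a possibly negative integer exponent.
    private static final Pattern UNIT = Pattern.compile("([a-zA-Z]+)(?:\\^(-?\\d+))?");

    // Hypothetical helper: returns each unit (with its exponent, if present) found in the input.
    static List<String> findUnits(String input) {
        List<String> units = new ArrayList<>();
        Matcher m = UNIT.matcher(input);
        // find() advances through the input; matches() would require the whole input to match.
        while (m.find()) {
            units.add(m.group());
        }
        return units;
    }

    public static void main(String[] args) {
        System.out.println(findUnits("30000 m/s^2")); // [m, s^2]
    }
}
```

Inside the loop, `m.group(1)` gives the unit letters and `m.group(2)` gives the exponent, or null when there is none.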
Try
"[a-zA-Z]+(?:\\^-?[0-9]+)?"
