regex seems to be off for special characters (e.g. +-.,!##$%^&*;)

I am using a regex to print a string, inserting a newline once a character limit is reached. I don't want to split a word that hits the limit (the word should start on the next line instead), unless a single run of concatenated characters is itself longer than the limit, in which case the rest of it simply continues on the next line. However, when the text contains special characters (e.g. +-.,!##$%^&*;), as you'll see when I test my code below, the line ends up one character over the limit. Why is this?
My function is:
public static String limiter(String str, int lim) {
    str = str.trim().replaceAll(" +", " ");
    str = str.replaceAll("\n +", "\n");
    Matcher mtr = Pattern.compile("(.{1," + lim + "}(\\W|$))|(.{0," + lim + "})").matcher(str);
    String newStr = "";
    int ctr = 0;
    while (mtr.find()) {
        if (ctr == 0) {
            newStr += (mtr.group());
            ctr++;
        } else {
            newStr += ("\n") + (mtr.group());
        }
    }
    return newStr;
}
So my input is:
String str = " The 123456789 456789 +-.,!##$%^&*();\\/|<>\"\' fox jumpeded over the uf\n 2 3456 green fence ";
With a character line limit of 7.
It outputs:
456789 +
-.,!##$%
^&*();\/
|<>"
When the correct output should be:
456789
+-.,!##
$%^&*()
;\/|<>"
My code is linked to an online compiler you can run here:
https://ideone.com/9gckP1

You need to replace the (\W|$) with \b as your intention is to match whole words (and \b provides this functionality). Also, since you do not need trailing whitespace on newly created lines, you need to also use \s*.
So, use
Matcher mtr = Pattern.compile("(?U)(.{1," + lim + "}\\b\\s*)|(.{0," + lim + "})").matcher(str);
See demo
Note that (?U) is used here to "fix" the word boundary behavior to keep it in sync with \w (so that diacritics were not considered word characters).

In your pattern, \\W is part of the first capturing group, so this one (non-word) character is added on top of whatever .{1,limit} matched.
Try with: "(.{1," + lim + "})(\\W|$)|(.{0," + lim + "})"
(I can't currently use your online compiler.)
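To make the diagnosis concrete, here is a minimal sketch that runs the question's pattern on a run of special characters with a limit of 7. The class name BoundaryDemo, the helper firstMatch and the lookahead variant are assumptions made for illustration; the lookahead version only shows that a zero-width check does not count against the limit, and it is not a drop-in fix (whitespace at line breaks would still need separate handling).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Returns the first match of the given regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        int lim = 7;
        String input = "+-.,!##$%^&*";

        // Shape used in the question: \W follows .{1,lim} and is consumed as well.
        String consuming = "(.{1," + lim + "}(\\W|$))|(.{0," + lim + "})";
        // Hypothetical variant: the lookahead tests for \W or end of input without consuming.
        String lookahead = "(.{1," + lim + "}(?=\\W|$))|(.{0," + lim + "})";

        System.out.println(firstMatch(consuming, input)); // +-.,!##$  (8 characters)
        System.out.println(firstMatch(lookahead, input)); // +-.,!##   (7 characters)
    }
}
```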

Related

How can I add a character inside a Regular Expression which changes each time?

String s = scan.nextLine();
s = s.replaceAll(" ", "");
for (int i = 0; i < s.length(); i++) {
    System.out.print(s.charAt(i) + "-");
    int temp = s.length();
    // this line is the problem
    s = s.replaceAll("[s.charAt(i)]", '');
    System.out.print((temp - s.length()) + "\n");
    i = -1;
}
I was using the method above to count the occurrences of each character.
I wanted to use s.charAt(i) inside the regular expression so that it counts and prints the result as shown below, but I know that the marked line doesn't work.
If it is possible, how can I do it?
Example:
MALAYALAM (input)
M-2
A-4
L-2
Y-1

Java does not have string interpolation, so code written inside a string literal is not executed; it is just part of the string. You would need something like "[" + s.charAt(i) + "]" instead, to build the regex programmatically.
But this is problematic when the character is a regex special character, for example ^. In that case the character class would be [^], which does not mean "the character ^" (Java rejects it as an unclosed character class). You could escape regex special characters while building the regex, but this is overly complicated.
Since you just want to replace occurrences of an exact substring, it is simpler to use the replace method, which does not take a regex. Don't be fooled by the names replace vs. replaceAll; both methods replace all occurrences. The real difference is that replaceAll takes a regex, whereas replace takes an exact substring. For example:
> "ababa".replace("a", "")
"bb"
> "ababa".replace("a", "c")
"cbcbc"

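Applied to the counting loop from the question, the replace-based approach could look like the sketch below. The class name CharCount and the method countChars are made up for illustration, and the loop always takes the first remaining character instead of resetting an index.

```java
class CharCount {
    // Builds one "char-count" line per distinct character, in order of first appearance.
    static String countChars(String input) {
        String s = input.replace(" ", "");
        StringBuilder out = new StringBuilder();
        while (!s.isEmpty()) {
            String ch = String.valueOf(s.charAt(0));
            int before = s.length();
            // replace() treats ch literally, so characters such as '^' or '.' are safe.
            s = s.replace(ch, "");
            out.append(ch).append('-').append(before - s.length()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(countChars("MALAYALAM")); // prints M-2, A-4, L-2, Y-1 on separate lines
    }
}
```
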
Use regex to un camelCase Java String

This code seems to work perfectly, but I'd love to clean it up with regex.
public static void main(String args[]) {
    String s = "IAmASentenceInCamelCaseWithNumbers500And1And37";
    System.out.println(unCamelCase(s));
}
public static String unCamelCase(String string) {
    StringBuilder newString = new StringBuilder(string.length() * 2);
    newString.append(string.charAt(0));
    for (int i = 1; i < string.length(); i++) {
        if (Character.isUpperCase(string.charAt(i)) && string.charAt(i - 1) != ' '
                || Character.isDigit(string.charAt(i)) && !Character.isDigit(string.charAt(i - 1))) {
            newString.append(' ');
        }
        newString.append(string.charAt(i));
    }
    return newString.toString();
}
Input:
IAmASentenceInCamelCaseWithNumbers500And1And37
Output:
I Am A Sentence In Camel Case With Numbers 500 And 1 And 37
I'm not a fan of that ugly if statement, and I'm hoping there's a way to do this in a single line of code using regex. I tried for a bit, but my attempts failed on words with 1 or 2 letters.
My failing attempt:
return string.replaceAll("(.)([A-Z0-9]\\w)", "$1 $2");

The following regex and code do the job.
String s = "IAmASentenceInCamelCaseWithNumbers500And1And37";
System.out.println("Output: " + s.replaceAll("[A-Z]|\\d+", " $0").trim());
This outputs:
Output: I Am A Sentence In Camel Case With Numbers 500 And 1 And 37
Edit, in response to the OP's question in the comments:
If the input string is
ThisIsAnABBRFor1Abbreviation
the regex needs a small modification to handle abbreviations and becomes [A-Z]+(?![a-z])|[A-Z]|\\d+.
This code,
String s = "ThisIsAnABBRFor1Abbreviation";
System.out.println("Output: " + s.replaceAll("[A-Z]+(?![a-z])|[A-Z]|\\d+", " $0").trim());
gives the output the OP (ZeekAran) expected in the comment:
Output: This Is An ABBR For 1 Abbreviation

You may use this lookaround based regex solution:
final String result = string.replaceAll(
"(?<=\\S)(?=[A-Z])|(?<=[^\\s\\d])(?=\\d)", " ");
//=> I Am A Sentence In Camel Case With Numbers 500 And 1 And 37
RegEx Demo
RegEx Details:
The regex matches either of two zero-width positions and inserts a space there. Spaces already present in the input are left alone.
(?<=\\S)(?=[A-Z]): the previous character is a non-space and the next character is an uppercase letter
|: OR
(?<=[^\\s\\d])(?=\\d): the previous character is neither a digit nor a space, and the next one is a digit
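As a quick check of the lookaround pattern above, the following sketch applies it to the question's input and to a string that already contains spaces. The class name LookaroundDemo, the helper split and the second input are made up for illustration.

```java
class LookaroundDemo {
    // Inserts a space at every position matched by one of the two zero-width alternatives.
    static String split(String s) {
        return s.replaceAll("(?<=\\S)(?=[A-Z])|(?<=[^\\s\\d])(?=\\d)", " ");
    }

    public static void main(String[] args) {
        System.out.println(split("IAmASentenceInCamelCaseWithNumbers500And1And37"));
        // Existing spaces are left alone: both lookbehinds require a non-space character.
        System.out.println(split("Already Spaced 42Text")); // Already Spaced 42 Text
    }
}
```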

I think you can try this
let str = "IAmASentenceInCamelCaseWithNumbers500And1And37";
function unCamelCase(str){
    return str.replace(/(?:[A-Z]|[0-9]+)/g, (m)=>' '+m.toUpperCase()).trim();
}
console.log(unCamelCase(str));
Explanation
(?:[A-Z]|[0-9]+)
?: - Non capturing group.
[A-Z] - Matches any one capital character.
'|' - Alternation (This works same as Logical OR).
[0-9]+ - Matches any digit from 0-9 one or more time.
P.S. Sorry for the example in JavaScript, but the same logic can be achieved in Java pretty easily.
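One possible Java rendering of the JavaScript snippet is sketched below. It is an illustration, not the answerer's code: the class name UnCamelCase and the constant TOKEN are assumptions, and the callback is replaced by an explicit find/appendReplacement loop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnCamelCase {
    // Same alternation as the JavaScript version: one capital letter or a run of digits.
    private static final Pattern TOKEN = Pattern.compile("[A-Z]|[0-9]+");

    static String unCamelCase(String s) {
        Matcher m = TOKEN.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' from being read as replacement syntax.
            m.appendReplacement(sb, Matcher.quoteReplacement(" " + m.group()));
        }
        m.appendTail(sb);
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(unCamelCase("IAmASentenceInCamelCaseWithNumbers500And1And37"));
    }
}
```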

Regex display with arrays

So I have a regex question. When running this code
if (str1.trim().contains(search2)){
    String str3 = str1;
    str3 = str3.replaceAll("[^-?0-9]+", " ");
    System.out.println("location: " + Arrays.asList(str3.trim().split(" ")));
    System.out.println(" ");
}
it produces
location: [290, -70]
Is it possible to print the values as "x x" instead of "[x, x]", that is, without the brackets and the comma, so that they just appear within quotes?
location: "290 -70"
I'm fairly new to regex, so I tried things like .replace("[", " "); but that did not work.
EDIT ----
Here's my entire code.
public static void main (String [] args) throws IOException {
    BufferedReader in = new BufferedReader (new FileReader ("/Users/Dannybwee/Documents/workspace/csc199/src/csc199/test.txt"));
    String str;
    List<String> finallist = new ArrayList<String>();
    while ((str = in.readLine()) != null){
        finallist.add(str);
    }
    String search = "node";
    String search2 = "position";
    for (String str1: finallist) {
        if (str1.trim().contains(search)){
            System.out.print("{ key " + str1+ ",\n" +
                "name: " + str1 + ",\n" +
                "Truth: 'Tainted'," + "\n" +
                "False: 'NotTainted, \n");
        }
        if (str1.trim().contains(search2)){
            String str3 = str1;
            str3 = str3.replaceAll("[^-?0-9]+", " ");
            System.out.println("location: " + Arrays.asList(str3.trim().split(" ")));
            System.out.println("}");
        }
    }
}
What I'm trying to do is take a text file and change the formatting of its text. I thought it would be easiest to read the file and scan for what needs to change. For instance, all I want is to change the brackets printed above to braces.
So basically I want it to output location: "290 -70" instead of location: [290, -70], without the comma and brackets.

I'm splitting because the line is positions = (number number); What I'm trying to do is just extract the number from that index
Then if you split, you get ["(number", "number)"].
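You want to remove the round brackets, not the square ones, and you have already done that: [^-?0-9]+ matches every run of characters other than 0-9, - and ?, and replaceAll turns each run into a single space. The square brackets and the comma in your output come from printing the List returned by Arrays.asList.
You don't need to split anything.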
if (str1.trim().contains(search2)){
    str1 = str1.replaceAll("[^-?0-9]+", " ").trim();
    System.out.println("location: \"" + str1 + "\"");
    System.out.println("}");
}
You could also forget the regex entirely and use str1.substring(1, str1.length() - 1)
By the way, if you are trying to produce JSON, it isn't valid. The keys need to be quoted
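A self-contained sketch of the no-split approach follows. The input line "position = (290 -70);" is an assumed shape based on the OP's description, and the names LocationLine and numbersOnly are hypothetical.

```java
class LocationLine {
    // Replaces every run of characters other than digits, '-' and '?' with one space,
    // then trims the spaces left at both ends.
    static String numbersOnly(String line) {
        return line.replaceAll("[^-?0-9]+", " ").trim();
    }

    public static void main(String[] args) {
        String line = "position = (290 -70);"; // assumed shape of an input line
        System.out.println("location: \"" + numbersOnly(line) + "\""); // location: "290 -70"
    }
}
```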

You can match a literal bracket by escaping it with a backslash: \[. The same works for the other characters that have a special meaning in a regex:
\\ , \. , \( ... etc
It is important to note that in a Java string literal the backslash itself must be escaped, so every backslash above has to be doubled:
\\[, \\\\, \\., \\( ... etc
You can implement this into your existing code, or you could make your life a little easier by using a pattern matcher.
Pattern p = Pattern.compile("\\D+?(-?\\d++)\\D+?(-?\\d++)\\D*");
Matcher m = p.matcher(STRING);
if (m.matches()) { // group() may only be called after a successful match
    String results = "location: "+m.group(1)+" "+m.group(2);
}
\\D+? skips non-digit characters reluctantly, which spares a '-' when one is found.
(-?\\d++) is a capturing group, read back with m.group(n), that possessively takes as many consecutive digits as it can find. Since the '-' was spared earlier, it is included in the capture when present.
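Put together as a runnable sketch, the matcher approach might look like this. The class name LocationMatcher, the method location and the sample lines are assumptions for illustration; the method returns null when a line does not contain two numbers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocationMatcher {
    private static final Pattern TWO_NUMBERS =
            Pattern.compile("\\D+?(-?\\d++)\\D+?(-?\\d++)\\D*");

    // Returns "location: a b", or null when the whole line does not fit the pattern.
    static String location(String line) {
        Matcher m = TWO_NUMBERS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return "location: " + m.group(1) + " " + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(location("position = (290 -70);")); // location: 290 -70
        System.out.println(location("node A"));                // null
    }
}
```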

Regular Expression That Contains All Of The Specific Letters In Java

I have a regular expression that selects all words containing all (not just any) of the given letters. It works fine in Notepad++.
Regular expression pattern:
^(?=.*B)(?=.*T)(?=.*L).+$
Input text file:
AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
And the output of the regular expression in Notepad++:
LABAT
BALAT
LATAB
Since it works in Notepad++, I tried the same regular expression in Java, but it simply failed.
Here is my test code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.lev.kelimelik.resource.*;
public class Test {
    public static void main(String[] args) {
        String patternString = "^(?=.*B)(?=.*T)(?=.*L).+$";
        String dictionary =
            "AL" + "\n"
            +"BAL" + "\n"
            +"BAK" + "\n"
            +"LABAT" + "\n"
            +"TAL" + "\n"
            +"LAT" + "\n"
            +"BALAT" + "\n"
            +"LA" + "\n"
            +"AB" + "\n"
            +"LATAB" + "\n"
            +"TAB" + "\n";
        Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
        Matcher m = p.matcher(dictionary);
        while(m.find())
        {
            System.out.println("Match: " + m.group());
        }
    }
}
The output is erroneous, as shown below:
Match: AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
My question is simply, what is the java-compatible version of this regular expression?

Java-specific answer
In real life, we rarely need to validate a whole block of lines at once, and I see that you are in fact just using the input as an array of test data. The most common scenario is to read the input line by line and check each line. In Notepad++ the solution would be a bit different, but in Java each line should be checked separately.
That said, you should not copy the same approach across platforms: what works well in Notepad++ is not necessarily good in Java.
I suggest this almost regex-free approach (String#split() still uses a regex):
String dictionary_str =
    "AL" + "\n"
    +"BAL" + "\n"
    +"BAK" + "\n"
    +"LABAT" + "\n"
    +"TAL" + "\n"
    +"LAT" + "\n"
    +"BALAT" + "\n"
    +"LA" + "\n"
    +"AB" + "\n"
    +"LATAB" + "\n"
    +"TAB" + "\n";
String[] dictionary = dictionary_str.split("\n"); // Split into lines
for (int i=0; i<dictionary.length; i++) // Iterate through lines
{
    if(dictionary[i].indexOf("B") > -1 && // There must be B
        dictionary[i].indexOf("T") > -1 && // There must be T
        dictionary[i].indexOf("L") > -1) // There must be L
    {
        System.out.println("Match: " + dictionary[i]); // No need matching, print the whole line
    }
}
See IDEONE demo
Original regex-based answer
Avoid relying on .* where you can: on long inputs it can cause a lot of needless backtracking. In this case, you can easily optimize it with negated character classes and possessive quantifiers:
^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)
The regex breakdown:
^ - start of string
(?=[^B]*+B) - right at the start of the string, check for at least one B presence that may be preceded with 0 or more characters other than B
(?=[^T]*+T) - still right at the start of the string, check for at least one T presence that may be preceded with 0 or more characters other than T
(?=[^L]*+L)- still right at the start of the string, check for at least one L presence that may be preceded with 0 or more characters other than L
See Java demo:
String patternString = "^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)";
String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT", "BALAT", "LA", "AB", "LATAB", "TAB"};
for (int i=0; i<dictionary.length; i++)
{
    Pattern p = Pattern.compile(patternString);
    Matcher m = p.matcher(dictionary[i]);
    if(m.find())
    {
        System.out.println("Match: " + dictionary[i]);
    }
}
Output:
Match: LABAT
Match: BALAT
Match: LATAB

Change your Pattern to:
String patternString = ".*(?=.*B)(?=.*L)(?=.*T).*";
Output
Match: LABAT
Match: BALAT
Match: LATAB

I did not debug your situation, but I think your problem is caused by matching the entire string rather than individual words.
You're matching "AL\nBAL\nBAK\nLABAT\n" plus some more. Of course that string has all the required characters. You can see it in the fact that your output only contains one Match: prefix.
Please have a look at this answer. You need to use Pattern.MULTILINE.
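As a sketch of that suggestion, the original pattern can be kept as is if it is compiled with Pattern.MULTILINE instead of Pattern.DOTALL. The class name MultilineDemo and the method matches are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // MULTILINE makes ^ and $ match at line boundaries. DOTALL is left off so that
    // '.' stops at '\n' and each lookahead only inspects the current line.
    private static final Pattern ALL_OF_BTL =
            Pattern.compile("^(?=.*B)(?=.*T)(?=.*L).+$", Pattern.MULTILINE);

    static List<String> matches(String dictionary) {
        List<String> found = new ArrayList<>();
        Matcher m = ALL_OF_BTL.matcher(dictionary);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String dictionary = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        for (String word : matches(dictionary)) {
            System.out.println("Match: " + word); // LABAT, BALAT, LATAB
        }
    }
}
```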

Getting the "context" text of a matched group

I'm using Java's Matcher class to find some strings, and for each match I also get its begin index and end index. What I want now is the x characters preceding and following each match.
So I just call the substring method on the string, from {begin index minus x} to {end index plus x}, but it seems a little heavy: for every match, I have to go back to the string for its context.
I wanted to know whether there's a better way to do that.
Here is what I've done so far.
The part that bothers me is text.substring: how expensive is it?
String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("\\d{2}").matcher(text);
int x = 5;
while (matcher.find()) {
    String match = matcher.group();
    int start = matcher.start();
    int end = matcher.end();
    String pretext = text.substring(start - x, start);
    String postext = text.substring(end, end + x);
    System.out.println(pretext + " - " + match + " - " + postext);
}
Suggested answer, using grouping to solve this:
use the regex (.{5})(\d{2})(.{5}).
First of all, this can't capture matches with fewer than 5 characters before them. The fix for that is (.{0,5})(\d{2})(.{0,5}), which works nicely for a simple regex like (\d{2}). But for a regex like "c?at" and the text "cat", the leading context group takes the c, so the groups come out as
c
at

String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("(.{5})(\\d{2})(.{5})").matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group(1) + " - " + matcher.group(2) + " - " + matcher.group(3));
}
output :
Some - 22 - text
with - 44 - char
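For comparison, here is a sketch of the substring approach from the question with the indices clamped, so that matches near either end of the text do not throw and the searched regex itself is left untouched. The names MatchContext and withContext and the "|" separator are assumptions for illustration. Each substring call only covers the few characters around a match, so it does not walk the whole string again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchContext {
    // Returns "pre|match|post" for every match, with up to x characters of context per side.
    static List<String> withContext(String text, String regex, int x) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            int from = Math.max(0, m.start() - x);         // clamp at the start of the text
            int to = Math.min(text.length(), m.end() + x); // clamp at the end of the text
            out.add(text.substring(from, m.start()) + "|" + m.group() + "|"
                    + text.substring(m.end(), to));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withContext("Some 22 text with 44 characters", "\\d{2}", 5));
        // The optional 'c' stays in the match because no context group competes for it.
        System.out.println(withContext("cat", "c?at", 5)); // [|cat|]
    }
}
```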
