Removing Certain Characters inside a String, Java - java

My problem is that I want to remove a character from some parts of a String, but I do not know how to restrict the removal.
Example:
A computer is a general purpose device that can be\n
programmed to carry out a finite set of\n
millions to billions of times more capable.\n
\n
In this era mechanical analog computers were used\n
for military applications.\n
1.1 Limited-function early computers\n
1.2 First general-purpose computers\n
1.3 Stored-program architecture\n
1.4 Semiconductors and\n
This example is the content of my string. What I want is to remove the \n of lines 1 and 2 above, but not the \n from line 5 onwards. How do I remove those \n without removing the others? My goal is to turn the wrapped lines into a paragraph with no \n after each line. In the example, the first 3 lines can form a paragraph, and the later lines are in bullet form. In other words, I do not want to remove the \n after bulleted lines.
The real contents of the string is dynamic.
I have tried String.replaceAll("\n", " "), but that clearly does not work because it removes every \n. I have thought of using a regex to detect alphanumeric characters, but that would also remove some letters after the \n.

Try using this regex: -
str = str.replaceAll("(.+)(?<!\\.)\n(?!\\d)", "$1 ");
System.out.println(str);
This will replace your \n if it is not preceded by a dot - termination of a paragraph, and it is not followed by a digit, for when it is followed by a bulleted point. (like, your \n in first bullet point is followed by a 1.2. So, it will not be replaced.).
(.+) at the start, ensures that you are not replacing a blank line.
This will work for the string you have shown.
Explanation: -
(.+) -> A capture group, capturing anything, occurring at least once.
(?<!\\.) -> This is called negative-look-behind. It matches the string following it, only if that string is not preceded by a dot(.) given in the negative-look-behind pattern.
For e.g.: - You don't need to replace \n after the line: - millions to billions of times more capable.\n.
(?!\\d) -> This is called negative -look-ahead. It matches string behind it, only if that string is not followed by a digit (\\d) given in the negative-look-ahead pattern.
For e.g.: - In your bulleted points, computers\n is followed by 1.2. where 1 is a digit. So, you don't want to replace that \n.
Now, $1 represents the group captured in the pattern match. Since you just want to replace the "\n", we keep the rest of the match as it is and replace only the "\n" with a space.
So, $1 stands for the 1st (and only) capturing group - (.+)
Note, look-ahead and look-behind regexes are non-capturing groups.
For More Details, follow these links: -
http://docs.oracle.com/javase/tutorial/essential/regex/
http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
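As a quick check of the regex above, here is a minimal sketch that runs it on a shortened version of the text from the question. The class and method names (ParagraphJoiner, joinWrappedLines) are made up for illustration and the sample string is abbreviated, so treat it as a demonstration under those assumptions rather than a general solution.

```java
class ParagraphJoiner {

    // Replaces a newline with a space unless the line ends with a dot,
    // the next line starts with a digit, or the current line is empty.
    static String joinWrappedLines(String text) {
        return text.replaceAll("(.+)(?<!\\.)\n(?!\\d)", "$1 ");
    }

    public static void main(String[] args) {
        String text = "A computer is a device that can be\n"
                + "programmed to carry out operations\n"
                + "millions of times more capable.\n"
                + "\n"
                + "1.1 Limited-function early computers\n"
                + "1.2 First general-purpose computers";
        // The first three lines become one paragraph; the blank line
        // and the numbered lines keep their line breaks.
        System.out.println(joinWrappedLines(text));
    }
}
```

Note that the last line of a paragraph has to end with a dot for its \n to survive, which is the same assumption the answer makes.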

I suspect your requirement is to remove the \n of lines 1 and 2.
What you can do is as below:
split your string into segments,
String[] array = yourString.split("\n");
concatenate the segments, adding \n back after every segment except lines 1 and 2 (Java arrays are zero-based, so these are array[0] and array[1]; a space keeps the joined words apart)
array[0] + " " + array[1] + " " + array[2] + '\n' + array[3] + '\n' ... // and so forth

Related

Regex Pattern Need Help Java toString() method

I have a java toString on code generated from XML . We as a company are logging the toString() to logs and I am having trouble making a good regex to mask all the data effectively .
Here is the sample to String
String input="com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
expected output
com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=****************, clientId=12345]
Can someone help me with a regex that will mask everything up until the last comma(,) before the next equal =
here is what I tried
maskPatterns.add("clientName=(.*?)=");
This ends up masking until the next =. I can't seem to figure out how to have it backtrack to the last comma (,) before the next equals sign (=).
Also if anyone has better regex for it I am all ears
You can use
clientName=(.*?)(?=\s*,\s*\w+=|\])
See the regex demo
Details
clientName= - a literal string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
(?=\s*,\s*\w+=|\]) - a positive lookahead that requires, immediately to the right of the current location, either a comma enclosed with zero or more whitespaces on both ends (\s*,\s*) followed by one or more word chars and =, or (|) a ] char (\]).
Or, if you need the same amount of asterisks, use
String result = text.replaceAll("(\\G(?!^)|clientName=).(?=.*?,\\s*\\w+=|\\])", "$1*");
See this regex demo.
Details
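(\\G(?!^)|clientName=) - Group 1: either the end of the previous successful match (\G(?!^)) or (|) the literal string clientName=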
. - any char but a line break char
(?=.*?,\s*\w+=|\]) - a positive lookahead that requires, immediately to the right of the current location, either
.*?,\s*\w+= - any zero or more chars other than line break chars as few as possible, a comma, zero or more whitespaces, one or more word chars and a =
| - or
\] - a ] char.
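To see both patterns from this answer side by side, here is a small sketch. The class name ClientNameMasker, its method names and the fixed "****" mask are assumptions for illustration. It is only exercised on the sample string from the question, where clientId is the last field; other field layouts may behave differently, in particular with the same-length pattern.

```java
class ClientNameMasker {

    // Replaces the whole clientName value with a fixed-width mask.
    static String maskFixed(String text) {
        return text.replaceAll(
                "clientName=(.*?)(?=\\s*,\\s*\\w+=|\\])", "clientName=****");
    }

    // Replaces each character of the clientName value with one asterisk.
    static String maskSameLength(String text) {
        return text.replaceAll(
                "(\\G(?!^)|clientName=).(?=.*?,\\s*\\w+=|\\])", "$1*");
    }

    public static void main(String[] args) {
        String input = "com.example.sensitive.info.UserInfo#15b1534"
                + "[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
        System.out.println(maskFixed(input));
        System.out.println(maskSameLength(input));
    }
}
```

On the sample input, the first call leaves name and clientId untouched and the second produces one asterisk per character of "HARVARD LAW SCHOOL, THE", including the inner comma and spaces.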
Use String#replaceAll here:
String input = "com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
String output = input.replaceAll("\\bclientName=.*?(\\s*)(?=\\w+=|\\])", "clientName=****************$1");
System.out.println(input);
System.out.println(output);
This prints:
com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]
com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=**************** clientId=12345]
Note that the number of asterisks probably should not exactly match the number of original characters in the clientName. Doing so would partially reveal the original content, insofar as it would reveal at least the original length of the clientName string.
According to your example of maskPatterns.add("clientName=(.*?)="); I assume that you want the value in capture group 1.
If it should be agnostic of the square brackets for marking the end of the value, but you don't want to match them either, you might use:
\bclientName=([^\r\n,=\[\]]+(?:,(?!\h*\w+=)[^\r\n,=\[\]]*)*)
Explanation
\bclientName= A word boundary, then match clientName=
( Capture group 1
[^\r\n,=\[\]]+ Match 1+ times any char except , = [ ] or a newline
(?: Non capture group
,(?!\h*\w+=) Match a comma asserting what is directly to the right is not 0+ horizontal whitespace chars, 1+ word chars and an = sign
[^\r\n,=\[\]]* Optionally match any char except a newline , = [ ]
)* Close non capture group and repeat 0+ times to get all occurrences of a comma
) Close group 1
Regex demo
If the [ and ] can also be part of the clientName, you can omit them from the character classes.
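A minimal sketch of how the pattern above might be used from Java to read capture group 1. The class name ClientNameExtractor and the method extract are hypothetical, and the pattern is copied from the answer with backslashes doubled for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClientNameExtractor {

    private static final Pattern CLIENT_NAME = Pattern.compile(
            "\\bclientName=([^\\r\\n,=\\[\\]]+(?:,(?!\\h*\\w+=)[^\\r\\n,=\\[\\]]*)*)");

    // Returns the clientName value (group 1), or null when there is none.
    static String extract(String text) {
        Matcher m = CLIENT_NAME.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "com.example.sensitive.info.UserInfo#15b1534"
                + "[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
        System.out.println(extract(input)); // HARVARD LAW SCHOOL, THE
    }
}
```

The comma inside the value is kept because it is not followed by a key and an = sign, while the comma before clientId= ends the match.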

What is the functionality of this regex? [duplicate]

I am recently learning regex and i am not quite sure how the following regex works:
str.replaceAll("(\\w)(\\w*)", "$2$1ay");
This allows us to do the following:
input string: "Hello World !"
return string: "elloHay orldWay !"
From what I know: w is supposed to match all word characters including 0-9 and underscore and $ matches stuff at the end of string.
In the replaceAll method, the first parameter can be a regex. It matches all words in the string with the regex and changes them to the second parameter.
In simple cases replaceAll works like this:
str = "I,am,a,person"
str.replaceAll(",", " ") // I am a person
It matched all the commas and replaced them with a space.
In your case, the match is a single word character (\w: a letter, digit or underscore), followed by zero or more word characters (\w*).
The () around \w and \w* group them. So you have two groups, the first character and the remaining part. If you use regex101 or some similar website you can see a visualization of this.
Your replacement is $2 -> the second group (the remaining part), followed by $1 (the first character), followed by ay.
Hope this clears it up for you.
Enclosing a regex expression in brackets () will make it a Capturing group.
Here you have 2 capturing groups , (\w) captures a single word character, and (\w*) catches zero or more.
$1 and $2 are used to refer to the captured groups, first and second respectively.
Also replaceAll takes each word individually.
So in this example in 'Hello' , 'H' is the first captured groups and 'ello' is the second. It's replaced by a reordered version - $2$1 which is basically swapping the captured groups.
So you get '$2$1ay' as 'elloHay'
The same for the next word also.
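The explanation above can be tried out with a short sketch. The class name PigLatinDemo and the method toPigLatin are made up for illustration; the regex and replacement are the ones from the question.

```java
class PigLatinDemo {

    // Group 1 is the first word character, group 2 the rest of the word;
    // the replacement writes them in swapped order and appends "ay".
    static String toPigLatin(String s) {
        return s.replaceAll("(\\w)(\\w*)", "$2$1ay");
    }

    public static void main(String[] args) {
        System.out.println(toPigLatin("Hello World !")); // elloHay orldWay !
    }
}
```

The space and the "!" are not word characters, so they are never part of a match and pass through unchanged.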

Regex to split String on pattern but with a minimum number of characters

I want to split a long text stored in a String variable following those rules:
Split on a dot (.)
The Substrings should have a minimum length of 30 (for example).
Take this example:
"The boy ate the apple. The sun is shining high in the sky. The answer to life the universe and everything is forty two, said the big computer."
let's say the minimum length I want is 30.
The result splits obtained would be:
"The boy ate the apple. The sun is shining high in the sky."
"The answer to life the universe and everything is forty two, said the big computer."
I don't want to take "The boy ate the apple." as a split because it's less than 30 characters.
2 ways I thought of:
Loop through all the characters and add them to a String builder. And whenever I reach a dot (.) I check if my String builder is more than the minimum I split it, otherwise I continue.
Split on all dots (.), and then loop through the splits. if one of the Splitted strings is smaller than the minimum, I concatenate it with the one after.
But I am looking if this can be done directly by using a Regex to split and test the minimum number of characters before a match.
Thanks
Instead of using split, you could also match your values using a capturing group.
To make the dot also match a newline you could use Pattern.DOTALL
\s*(.{30}[^.]*\.|.+$)
In Java:
String regex = "\\s*(.{30}[^.]*\\.|.+$)";
Explanation
\s* Match 0+ times a whitespace character
( Capturing group
.{30} Match any character 30 times
[^.]* Match 0+ times not a dot using a negated character class
\. Match a literal dot
| Or
.+$ Match 1+ times any character until the end of the string.
) Close capturing group
Regex demo | Java demo
Instead of using the split method, try matching with the following regexp: \S.{29,}?[.]
Demo
This should do the job:
"\W*+(.{30,}?)\W*\."
Test: https://regex101.com/r/aavcme/3
\W*+ takes as much as non-word character to trim spaces between sentences
. matches any character (I guess you want to match any kind of character in your sentences)
{30,} asserts the minimum length of the match (30)
? means "as few as possible"
\. matches the dot separating the sentences (assuming that you always have a dot at the end of a sentence, even the last one)
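As a rough illustration of the matching approach, the sketch below collects group 1 of the first answer's regex (\s*(.{30}[^.]*\.|.+$)) with a Matcher loop. The class name MinLengthSplitter and the method split are hypothetical, and the minimum length is hard-coded to 30 as in the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MinLengthSplitter {

    // Either 30 characters followed by everything up to and including the
    // next dot, or whatever is left at the end of the input.
    private static final Pattern CHUNK =
            Pattern.compile("\\s*(.{30}[^.]*\\.|.+$)", Pattern.DOTALL);

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        while (m.find()) {
            parts.add(m.group(1)); // leading whitespace stays outside the group
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, "
                + "said the big computer.";
        for (String part : split(text)) {
            System.out.println(part);
        }
    }
}
```

For the example text this prints the two expected pieces, because the first dot falls inside the first 30 characters and is therefore not treated as a split point.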

Regexp for finding second space within 20 characters

I'm trying to write a Java regexp that matches a string in this format:
AXXXXYYYYB
Where XXXX is a string that terminates at the 20th character or the second space, whichever comes first, and YYYY is a string that terminates at the 20th character or the first space, whichever comes first.
And I need XXXX and YYYY to be the first and second capture groups.
I can get it to work terminating at the first space in XXXX with this:
^A([^ ]{1,20}) ?([^ ]{1,20})B$
But I can't figure out the rule that would terminate at the 20th character or the second space.
Also, I don't care if either capture group ends up with an extra leading or trailing space.
Sample input -> output:
MR SMITH BROOKLYN -> "MR SMITH" and "BROOKLYN" (separated at second space)
MR SMYTHE-JONES BRONX -> "MR SMYTHE-JONES" and "BRONX" (separated at second space)
12345678901234567890QUEENS -> "12345678901234567890" and "QUEENS" (separated at 20th character)
1234567890 1234567890QUEENS -> "1234567890 123456789" and "0QUEENS" (separated at 20th character)
1234567890 1234567890STATEN ISLAND -> "1234567890 123456789" and "0STATEN" (separated at 20th character, then separated at space)
^([^ ]+[ ][^ ]+)[ ](.*)$|(.{20})(.*)$
You can try this.Grab the captures.
1)([^ ]+[ ][^ ]+)[ ](.*) will break on second space
2)(.{20})(.*) will break on 20 characters.
See demo.
http://regex101.com/r/gT6kI4/4
This is my solution, which makes use of lookbehind:
"([^ ]*(?:[ ][^ ]*)?)(?<!.{21})[ ]?([^ ]{0,20})"
([^ ]*(?:[ ][^ ]*)?)(?<!.{21}) matches and captures the first part, which must be strictly less than 21 characters long and contains at most one space. Due to the greedy quantifiers, it will always try for the longest possible string first (always match past the first space first) and reduce its length when limited by the look-behind. The lookbehind only allows the matcher to proceed when it can't find 21 characters to match, which means the part in front is at most 20 characters long.
Since the first part can end with space, I need to match it with [ ]?.
Then, since the second part can't contain any space (since it breaks at the first space), it can simply be matched and captured by ([^ ]{0,20}).
Note that this solution assumes there is no line separator character in the input string.
There is a caveat: the first part may contain a trailing space, if it is the first space and it is the 20th character. You can prevent that by making a small change, turning the * inside the optional group into a +:
"([^ ]*(?:[ ][^ ]+)?)(?<!.{21})[ ]?([^ ]{0,20})"
Demo on ideone
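The following sketch runs the first lookbehind pattern of this answer against inputs from the question. The class name NameCitySplitter and the method split are assumptions for illustration; the inputs are used exactly as listed in the question, without the A and B markers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameCitySplitter {

    private static final Pattern PARTS = Pattern.compile(
            "([^ ]*(?:[ ][^ ]*)?)(?<!.{21})[ ]?([^ ]{0,20})");

    // Returns {first part, second part}, or null if nothing matches.
    static String[] split(String input) {
        Matcher m = PARTS.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "MR SMITH BROOKLYN",
            "12345678901234567890QUEENS",
            "1234567890 1234567890STATEN ISLAND"
        };
        for (String s : samples) {
            String[] parts = split(s);
            System.out.println("\"" + parts[0] + "\" and \"" + parts[1] + "\"");
        }
    }
}
```

The first sample breaks at the second space, the second at the 20th character, and the third at the 20th character followed by the first space of the remainder.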
I don't think this could be done using one regex pattern.
I suggest running this pattern first:
^(.{20})(.*)$
if sub-pattern no. 1 contains more than one space then fail it and run this pattern instead
^(\S+\s\S+)\s(.*)$

Analysing a more complex regex

In a previous question that i asked,
String split in java using advanced regex
someone gave me a fantastic answer to my problem (as described on the above link)
but I never managed to fully understand it. Can somebody help me? The regex I was given is this:
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+"
I can understand some basic things, but there are parts of this regex that even after
thoroughly searching google i could not find, like the question mark preceding the s in the
start, or how exactly the second parenthesis works with the question mark and the equation in the start. Is it possible also to expand it and make it able to work with other types of quotes, like “ ” for example?
Any help is really appreciated.
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+" Explained;
(?s) # This equals a DOTALL flag in regex, which allows the `.` to match newline characters. As far as I can tell from your regex, it's superfluous.
(?= # Start of a lookahead, it checks ahead in the regex, but matches "an empty string"(1) read more about that [here][1]
(([^\"]+\"){2})* # This group is repeated any amount of times, including none. I will explain the content in more detail.
([^\"]+\") # This is looking for one or more occurrences of a character that is not `"`, followed by a `"`.
{2} # Repeat 2 times. When combined with the previous group, it is looking for 2 occurrences of text followed by a quote. In effect, this means it is looking for an even number of `"`.
[^\"]* # Matches any character which is not a double quote sign. This means literally _any_ character, including newline characters without enabling the DOTALL flag
$ # The lookahead actually inspects until end of string.
) # End of lookahead
\\s+ # Matches one or more whitespace characters, including spaces, tabs and so on
The lookahead with that complicated repeated group makes \\s+ match only the whitespace that is not in between two " characters, as in this string:
text that has a "string in it".
When used with String.split, splitting the string into; [text, that, has, a, "string in it".]
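It will only match where an even number of " lies ahead. With a single unmatched quote, the spaces before the quote have one " ahead and are not split points, while the spaces after it have none ahead and are:
text that nearly has a "string in it.
Splitting the string into [text that nearly has a "string, in, it.]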
(1) When I say that the lookahead matches "an empty string", I mean that it actually consumes nothing: it only looks ahead from the current point in the input and checks a condition. The actual match is done by \\s+, which follows the lookahead.
The (?s) part is an embedded flag expression, enabling the DOTALL mode, which means the following:
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
The (?=expr) is a look-ahead expression. This means that the regex looks to match expr, but then moves back to the same point before continuing with the rest of the evaluation.
In this case, it means that the regex matches any \\s+ occurence, that is followed by any even number of ", then followed by non-" until the end ($). In other words, it checks that there are an even number of " ahead.
It can definitely be expanded to other quotes too. The only problem is the ([^\"]+\"){2} part, that will probably have to be made to use a back-reference (\n) instead of the {2}.
This is fairly simple..
Concept
It splits at \s+ whenever there is an even number of " ahead.
For example:
Hello hi "Hi World"
     ^  ^   ^
     |  |   |->will not split here since there are odd number of "
     ----
       |
       |->split here because there are even number of " ahead
Grammar
\s matches a \n or \r or space or \t
+ is a quantifier which matches previous character or group 1 to many times
[^\"] would match anything except "
(x){2} would match x 2 times
a(?=bc) would match if a is followed by bc
(?=ab)a would first check for ab from the current position and then return to that position. It then matches a. (?=ab)c would not match c
With (?s) (singleline mode), . would match newlines. So in this case there is no need for (?s), since the regex contains no .
I would use
\s+(?=([^"]*"[^"]*")*[^"]*$)
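To make the behaviour concrete, here is a small sketch that passes both the regex from the question and the shorter one from the last answer to String.split, using the balanced-quote example string from above. The class name QuoteAwareSplit and its constant names are made up for illustration.

```java
import java.util.Arrays;

class QuoteAwareSplit {

    // The regex from the question, as a Java string literal.
    static final String ORIGINAL = "(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+";

    // The shorter variant suggested in the last answer.
    static final String SHORTER = "\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String text = "text that has a \"string in it\".";
        // Both print: [text, that, has, a, "string in it".]
        System.out.println(Arrays.toString(split(text, ORIGINAL)));
        System.out.println(Arrays.toString(split(text, SHORTER)));
    }
}
```

Spaces inside the quoted part have exactly one " ahead of them, so the lookahead fails there and the quoted text stays in one piece.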
