Split by space but not newline - java

I am trying to convert all links in a given string to clickable a tags using the following code :
String[] parts = comment.split("\\s");
String newComment = null;
for (String item : parts) {
    try {
        URL url = new URL(item);
        // If the item parses as a URL, wrap it in an anchor...
        if (newComment == null) {
            newComment = "<a href=\"" + url + "\">" + url + "</a> ";
        } else {
            newComment = newComment + "<a href=\"" + url + "\">" + url + "</a> ";
        }
    } catch (MalformedURLException e) {
        // The item was not a URL, so append it unchanged...
        if (newComment == null) {
            newComment = item + " ";
        } else {
            newComment = newComment + item + " ";
        }
    }
}
It works fine for
Hi there, click here http://www.google.com ok?
converting it to
Hi there, click here <a href="http://www.google.com">http://www.google.com</a> ok?
But when the string is this:
Hi there, click
here http://www.google.com
ok?
it still converts it to:
Hi there, click here <a href="http://www.google.com">http://www.google.com</a> ok?
Whereas I want the final result to be:
Hi there, click
here <a href="http://www.google.com">http://www.google.com</a>
ok?
I think it's including the newline character as well when making the split.
How do I preserve the newline character in this case?

I would suggest a different approach:
String noNewLines = "Hi there, click here http://www.google.com ok?";
String newLines = "Hi there, \r\nclick here \nhttp://www.google.com ok?";
// This is a String format with two String placeholders.
// They will be replaced with the desired values once the "format" method is called.
String replacementFormat = "<a href=\"%s\">%s</a>";
// The first pair of round brackets defines a group that captures anything
// starting with "http(s)://". The second pair is a lookahead that ends that
// group just before the next whitespace character.
String pattern = "(http(s)?://.+?)(?=\\s)";
noNewLines = noNewLines.replaceAll(
    pattern,
    // The "$1" literals are group back-references.
    // In our instance, they reference the group enclosed between the first
    // round brackets in the "pattern" String.
    new Formatter().format(replacementFormat, "$1", "$1")
        .toString()
);
System.out.println(noNewLines);
System.out.println();
newLines = newLines.replaceAll(
    pattern,
    new Formatter().format(replacementFormat, "$1", "$1")
        .toString()
);
System.out.println(newLines);
Output:
Hi there, click here <a href="http://www.google.com">http://www.google.com</a> ok?
Hi there,
click here
<a href="http://www.google.com">http://www.google.com</a> ok?
This will replace all your http(s) links with anchor elements, whether or not you have newlines (Windows or *nix) in your text.
Edit
For best coding practices you should declare the replacementFormat and pattern variables as constants (so, final static String REPLACEMENT_FORMAT and so on).
Edit II
Actually, grouping the URL pattern isn't really necessary, as the whitespace lookahead is sufficient. But well, I'm leaving it as is, it works.
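A minimal sketch of the point made in Edit II, under a few assumptions: with no capture group, the back-reference $0 stands for the whole match. The class name Linkify, the anchor markup and the simplified pattern https?://\S+ are illustrative choices, not part of the answer above. Unlike the lookahead version, this pattern also matches a URL at the very end of the text, where no whitespace follows.

```java
class Linkify {
    // Assumed pattern: "http://" or "https://" followed by non-whitespace characters.
    static final String PATTERN = "https?://\\S+";
    // Assumed anchor markup; "$0" refers to the entire match, so no group is needed.
    static final String REPLACEMENT = "<a href=\"$0\">$0</a>";

    static String linkify(String text) {
        return text.replaceAll(PATTERN, REPLACEMENT);
    }

    public static void main(String[] args) {
        // Newlines are untouched because only the URL itself is replaced.
        System.out.println(linkify("Hi there, \r\nclick here \nhttp://www.google.com ok?"));
    }
}
```

A trailing punctuation mark directly after a URL (for example a full stop) would be swallowed by \S+, so real input may need a stricter pattern.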

You could just use
String [] parts = comment.split("\\ ");
instead of
String [] parts = comment.split("\\s");
As eldris said, "\s" matches every white-space character, so "\ ", which matches just the space character itself, should do for you. A plain " " works as well, since a space needs no escaping in a regex.
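To see what splitting on a single space actually produces, here is a small sketch (the class name SplitOnSpace and the sample text are assumptions for illustration). It also shows a caveat: the newline stays glued to its neighbouring words, so a URL that is directly followed by a newline ends up in the same token as the next word.

```java
import java.util.Arrays;

class SplitOnSpace {
    static String[] tokens(String comment) {
        // Only the space character separates tokens; newlines stay inside them.
        return comment.split(" ");
    }

    public static void main(String[] args) {
        String comment = "Hi there, click\nhere http://www.google.com\nok?";
        // The last token here is "http://www.google.com\nok?", newline included.
        System.out.println(Arrays.toString(tokens(comment)));
    }
}
```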

I would suggest the following solution to your problem:
First, split by the newline character
For each line, do the processing that you have mentioned above
Join all the processed lines again
That way the newline characters will be retained, and you will still be able to do in each line what you are currently doing.
Hope this helps.
Cheers !!
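The three steps above can be sketched roughly as follows. This is only an illustration: the class name LinkifyByLine, the helper toAnchor and the anchor markup are assumptions, while the URL check with new URL(...) is taken from the question.

```java
import java.net.MalformedURLException;
import java.net.URL;

class LinkifyByLine {
    static String linkify(String comment) {
        StringBuilder out = new StringBuilder();
        // Step 1: split by newline (limit -1 keeps trailing empty lines).
        String[] lines = comment.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n'); // Step 3: put the newline back between lines.
            }
            // Step 2: process each line word by word, as in the question.
            String[] parts = lines[i].split(" ", -1);
            for (int j = 0; j < parts.length; j++) {
                if (j > 0) {
                    out.append(' ');
                }
                out.append(toAnchor(parts[j]));
            }
        }
        return out.toString();
    }

    // Hypothetical helper: wraps the item in an anchor if it parses as a URL.
    static String toAnchor(String item) {
        try {
            new URL(item);
            return "<a href=\"" + item + "\">" + item + "</a>";
        } catch (MalformedURLException e) {
            return item; // Not a URL, keep it unchanged.
        }
    }
}
```

Calling LinkifyByLine.linkify(comment) keeps every line break of the input in place, because the newline is only ever used as a separator and is written back between the processed lines.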

Related

Regular expression to find everything except a pattern

I'm pretty new to regular expressions and am looking for one that matches anything except all that matches a given regex. I've found ways to find anything except a specific string, but I need it to not match a regex. Also it has to work in Java.
Background: I am working with Ansi-colored strings. I want to take a string that has some text that may be formatted with Ansi color codes and remove anything except those color codes. This should give me the current color formatting for any character appended onto the string.
A formatted string may look like this:
Hello \u001b[31;44mWorld\u001b[0m!
which would display as Hello World! where the World would be colored red on a blue background.
My regex to find the codes is
\u001b\[\d+(;\d+)*m
Now I want a regex that matches everything but the color codes, so it matches
Hello \u001b[31;44m World \u001b[0m !
Your regex in context:
public static void main(String[] args) {
String input = "Hello \u001b[31;44mWorld\u001b[0m!";
String result = Pattern.compile("\u001b\\[\\d+(;\\d+)*m").matcher(input).replaceAll("");
System.out.println("Output: '" + result + "'");
}
Output:
Output: 'Hello World!'
A regex isn't really meant to give you 'everything but' what it matches. The easiest general approach is to match the part you want to single out (the color codes in your case) and then remove those matches from the string; what is left is 'everything but' the match.
Quick sample (very untested)
String everythingBut = "string that has regex matches".replaceAll("r[eg]+x ", "");
Should result in string that has matches i.e. the inverse of your regex
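String text = "Hello \u001b[31;44mWorld\u001b[0m!";
// Split on the codes to get the plain-text pieces, then delete each piece
// from the original text; what remains is the codes. The split pattern does
// not include the ESC character, so it is removed together with the text.
String codes = text;
for (String s : text.split("\\[([;0-9]+)m")) {
    codes = codes.replace(s, "");
}
System.out.println(codes);
OUTPUT:
[31;44m[0m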
You can do it like this. It simply finds all the matches and puts them in an array which can be joined to a String if desired.
String pat = "\u001b\\[\\d+(;\\d+)*m";
String html = "Hello \u001b[31;44mWorld\u001b[0m!";
Matcher m = Pattern.compile(pat).matcher(html);
String[] s = m.results().map(mr->mr.group()).toArray(String[]::new);
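If a single String is wanted instead of an array, the matches can be joined directly. A minimal sketch, assuming Java 9 or later (for Matcher.results()); the class and method names are made up for the example.

```java
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AnsiCodes {
    // The color-code pattern from the question, including the leading ESC character.
    private static final Pattern CODE = Pattern.compile("\u001b\\[\\d+(;\\d+)*m");

    // Returns only the color codes of the text, concatenated in order.
    static String codesOnly(String text) {
        return CODE.matcher(text).results()
                .map(MatchResult::group)
                .collect(Collectors.joining());
    }
}
```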

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers having the same issue. Probably it's just because I can't phrase it properly that I haven't been able to find anything to solve my issue yet... I promise to edit the title, or just close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the spaces in between are some \t first, and some whitespaces then).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that in the second part we only have digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and whitespaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (reporting only the relevant part here; please visit the fiddle link for a complete and runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on the whitespace (which should now occur only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
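As a quick check of the trim-and-split idea against the inputs from the question, here is a small sketch (the class name VersionAndBuild and the method parse are assumptions for illustration):

```java
class VersionAndBuild {
    // Returns {version, buildId}; assumes the input holds exactly two
    // whitespace-separated parts, as in the four samples.
    static String[] parse(String data) {
        return data.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
        String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
        for (String s : new String[] {str1, str3}) {
            String[] arr = parse(s);
            System.out.println("Version: \"" + arr[0] + "\", build-id: \"" + arr[1] + "\"");
        }
    }
}
```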
(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) anchors the match at the start of the string, requires a v followed by a word character, and then .+ greedily captures any characters except line terminators. Because .+ also matches spaces and tabs, group 1 can end with some of the separating whitespace, so either trim it or use \S+ in place of .+ to stop at the first whitespace. \s+ matches one or more spaces, tabs or newlines. (\d+-\d+-\w+-\d+) reads the build id, ensuring that it conforms to your specified format. Note that this will still read in the dashes, making it easier for you to split the string afterwards to get the information you need. If you want, you could even make these parts their own capture groups, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the order from left to right represents group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can have both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
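Without that, your regex is saying: match a single whitespace character followed by one or more characters that are not line terminators. In samples 3 and 4 the whitespace after the version is followed only by more whitespace before the newline, so that run of spaces is matched on its own and becomes an empty string once you strip the whitespace from it.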
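A small sketch that makes the difference between the two build-id patterns visible. The input used below assumes two trailing spaces after the version and after the build id (the question only says there are "some"); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuildIdMatches {
    // Collects every match of the regex, with whitespace stripped as in the question.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group().replaceAll("\\s", ""));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed sample: two spaces before the newline and at the end.
        String str3 = "v3.1.build.dev.12345.team  \n12345-12345-cici-12345  ";
        // Original pattern: the run of spaces before the newline matches by itself.
        System.out.println(matches("[\\s].+", str3));
        // With '+': the leading whitespace, newline included, is consumed in one go.
        System.out.println(matches("[\\s]+.+", str3));
    }
}
```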
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the Matcher class interprets the . character: it DOES NOT match newlines, so it stops matching just before the \n. To change that you need to add the flag Pattern.DOTALL using Pattern.compile(String pattern, int flags).
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

Get a specific word out of a String with regex

I have a String
String string = "-minY:50 -maxY:100 -minVein:8 -maxVein:10 -meta:0 perChunk:5;";
And I want to somehow get the -meta:0 out of it with regex (replace everything except -meta:0). I made a regex which deletes -meta:0, but I can't make it delete everything except -meta:0
I tried using some other regex but it was ignoring whole line when I had -meta:[0-9] in it, and like you can see I have one line for everything.
This is how it has been deleting -meta:0 from the String:
String meta = string.replaceAll("( -meta:[0-9])", "");
System.out.println(meta);
I just somehow want to reverse that and delete everything except -meta:[0-9]
I couldn't find anything on the page about my issue because everything was ignoring whole line after it found the word, so sorry if there's something similar to this.
You should capture your match in a capturing group and use its reference in the replacement:
String meta = string.replaceAll("^.*(-meta:\\d+).*$", "$1");
System.out.println(meta);
//=> "-meta:0"
RegEx Demo
As I understand your requirement, you want to:
a) extract meta* from the string
b) replace everything else with ""
You could do something like:
String string = "-minY:50 -maxY:100 -minVein:8 -maxVein:10 -meta:0 perChunk:5;";
Pattern p = Pattern.compile(".*(-meta:[0-9]).*");
Matcher m = p.matcher(string);
if (m.find()) {
    // replace() treats the matched text literally; replaceAll() would
    // re-interpret it as a regex and could misbehave on special characters.
    string = string.replace(m.group(0), m.group(1));
    System.out.println("After keeping only meta* : " + string);
}
What this code does is find -meta:[0-9], keep it, and remove everything else that was matched.
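An alternative sketch that skips the replacing step altogether: search for the -meta:N token itself and return just that. The class name MetaExtractor, the method findMeta and the use of Optional are assumptions for illustration; \d+ is used so that multi-digit values are accepted too.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaExtractor {
    private static final Pattern META = Pattern.compile("-meta:\\d+");

    // Returns the first "-meta:N" token, or an empty Optional if there is none.
    static Optional<String> findMeta(String s) {
        Matcher m = META.matcher(s);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String string = "-minY:50 -maxY:100 -minVein:8 -maxVein:10 -meta:0 perChunk:5;";
        System.out.println(findMeta(string).orElse("(no meta found)"));
    }
}
```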

Regular expression with URL Encoded Strings

I have strings that contain URL encoding (%22) and other characters [!##$%^&*]. I need to use a RegEx to check if the string contains a character within that group but exclude the URL encoded quote (%22). I can't get the negative look ahead to work properly nor am I able to get an excluded string (or negation) working either. Can someone help? Here is code so far that doesn't work:
Pattern p = Pattern.compile("[!##$%^&*]");
String[] tokens = {"%22Hobo%22", "Shoe*", "Rail%6Road", "Sugar"};
for (String string : tokens) {
    Matcher m = p.matcher(string);
    boolean b = m.find();
    System.out.println(string + ": " + b);
}
The desired output should be false, true, true, false.
(?!%22)[!##$%^&*]
Try this. See demo.
https://regex101.com/r/mS3tQ7/16
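Plugged into the Java code from the question, the lookahead pattern could be used roughly like this. The class name SpecialCharCheck and the method hasSpecial are made up for the sketch; the character class is copied from the question as is.

```java
import java.util.regex.Pattern;

class SpecialCharCheck {
    // (?!%22) rejects a position where the text "%22" starts; otherwise any
    // single character from the class counts as a hit.
    private static final Pattern SPECIAL = Pattern.compile("(?!%22)[!##$%^&*]");

    static boolean hasSpecial(String s) {
        return SPECIAL.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] tokens = {"%22Hobo%22", "Shoe*", "Rail%6Road", "Sugar"};
        for (String token : tokens) {
            System.out.println(token + ": " + hasSpecial(token));
        }
    }
}
```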
export const uriParser = (x) =>
//replace/regex exclude-negated [set-of-tokens], doesn't work/parse for (%[A-Fa-f0-9]{2})+
//decodeURI() does the same I believe, but this will always return a string,
//without an error object
//a-z or A-Z includes underscore '_' but not space whitespace, nor (\r\n|\r|\n)+
x.replace(/(%[A-Fa-f0-9]{2})+[^a-zA-Z0-9-+ ]+/g, "_");
https://www.ietf.org/rfc/rfc3986.txt#:~:text=2.4.%20%20When%20to%20Encode%20or%20Decode%0A
For my purposes I make my URI link fragments go through (%[A-Fa-f0-9]{2})+ on mount, so I use .replace("_"," ") for the UI but uriParser() for outgoing links, to avoid redundancy as much as possible. The choice between the two comes down to always getting a string back versus putting specifications for other characters before this. "Why do you not use a URLEncoder?" – Jens' comment on the question

removing white spaces from string value

I have a link http://localhost:8080/reporting/pvsUsageAction.do?form_action=inline_audit_view&days=7&projectStatus=scheduled&justificationId=5&justificationName= No Technicians in Area in my Struts-based web application.
The variable justificationName in the URL has some spaces before its value, as shown. When I get the value of justificationName using request.getParameter("justificationName"), it gives me the value with the spaces as given in the URL. I want to remove those spaces. I tried trim() and I tried str = str.replace(" ", ""); but neither of them removed those spaces. Can anyone suggest some other way to remove them?
One more thing I noticed: when I right-clicked the link and opened it in a new tab, the link looked like this:
http://localhost:8080/reporting/pvsUsageAction.do?form_action=inline_audit_view&days=7&projectStatus=scheduled&justificationId=5&justificationName=%A0%A0%A0%A0%A0%A0%A0%A0No%20Technicians%20in%20Area
The notable point is that the address bar shows %A0 for the leading white spaces and %20 for the other spaces. Please look at the link and explain the difference, if anyone has an idea about it.
EDIT
Here is my code
String justificationCode = "";
if (request.getParameter("justificationName") != null) {
justificationCode = request.getParameter("justificationName");
}
justificationCode = justificationCode.replace(" ", "");
Note: the replace function removes the spaces inside the string but does not remove the leading spaces.
e.g. if my string is " This is string", after using replace it becomes " Thisisstring"
Thanks in advance
Strings are immutable in Java, so the method doesn't change the string you pass but returns a new one. You must use the returned value :
str = str.replace(" ", "");
Manual trim
You need to remove the spaces from the string. This will remove any number of consecutive spaces.
String trimmed = str.replaceAll(" +", "");
If you want to replace all whitespace characters:
String trimmed = str.replaceAll("\\s+", "");
URL Encoding
You could also use an URLEncoder, which sounds like a more appropriate way to go:
import java.net.URLEncoder;
String url = "http://localhost:8080/reporting/" + URLEncoder.encode("pvsUsageAction.do?form_action=inline_audit_view&days=7&projectStatus=scheduled&justificationId=5&justificationName= No Technicians in Area", "ISO-8859-1");
You have to assign the result of the replaceAll(String regex, String replacement) operation to a variable. See the Javadoc for the replaceAll(String regex, String replacement) method. It returns a brand new String object, because Strings in Java are immutable. (The similarly named replace(CharSequence target, CharSequence replacement) treats its first argument literally, not as a regex.) In your case, you can simply do the following
String noSpacesString = str.replaceAll("\\s+", "");
You can use replaceAll("\\s","") It will remove all white space.
If you are trying to remove the leading and trailing white spaces, then
s = s.trim();
Or if you want to remove all the spaces, then use:
s = s.replace(" ","");
There are two ways of doing this: one is based on a regular expression, the other is implementing the logic yourself
replaceAll("\\s","")
or
if (text.contains(" ") || text.contains("\t") || text.contains("\r")
        || text.contains("\n")) {
    //code goes here
}
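One detail from the question is worth a sketch of its own: the address bar shows %A0 for the leading characters, which is the no-break space (U+00A0 in ISO-8859-1), not an ordinary space. trim(), replace(" ", "") and the regex class \s (with default flags) do not match that character, which would explain why the leading "spaces" survive. Assuming that is what arrives in the parameter, it can be named explicitly in the pattern; the class and method names below are made up for the example.

```java
class NbspTrim {
    // Removes leading and trailing whitespace, treating the no-break space
    // (U+00A0) as whitespace too; inner spaces are left alone.
    static String stripLeadingAndTrailing(String s) {
        return s.replaceAll("^[\\s\\u00A0]+|[\\s\\u00A0]+$", "");
    }

    public static void main(String[] args) {
        // Simulated parameter value with no-break spaces in front.
        String justificationName = "\u00A0\u00A0\u00A0No Technicians in Area";
        System.out.println("[" + justificationName.trim() + "]");
        System.out.println("[" + stripLeadingAndTrailing(justificationName) + "]");
    }
}
```

To remove every such character anywhere in the string instead, the same character class can be used without the anchors: s.replaceAll("[\\s\\u00A0]+", "").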
