How can I push regex matches to array in java? - java

I currently have a string from which I want to extract certain parts. With these parts I want to do various things, like pushing them to an array or showing them in a text area.
First I tried the split method. It deletes my regex matches and prints the rest of the string. I want the opposite: drop the rest and print the regex match.
How can I do this?
For example:
There are a lot of YouTube links like this:
https://www.youtube.com/watch?v=qJuoXM7G322&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7
I want to take only the plain video link, using this expression:
"https:\\/\\/www.youtube.com\\/watch\\?v=.{11}"
When I use this code:
String ytLink = linkArea.getText();
String regexp = "https:\\/\\/www.youtube.com\\/watch\\?v=.{11}";
String[] tokenVal;
tokenVal = ytLink.split(regexp);
System.out.println("Count of Links : "+tokenVal.length);
for (String t : tokenVal) {
System.out.println(t);
}
It prints
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7"
I want the output to be like this:
"https://www.youtube.com/watch?v=SATL2mTfZO0"

"when I Right this code :"
You are splitting the string with that regular expression, which is not the correct tool for the job.
It is dividing your example string into:
"" // The bit before the separator.
"https://www.youtube.com/watch?v=qJuoXM7G322" // The separator
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7" // The bit after the separator
but then discarding the separator, so you'd get back a 2-element array containing:
"" // The bit before the separator.
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7" // The bit after the separator
If you want to get the thing that matches the regex, you'd need to use Pattern and Matcher:
Pattern pattern = Pattern.compile("https:\\/\\/www.youtube.com\\/watch\\?v=.{11}");
Matcher matcher = pattern.matcher(ytLink);
if (matcher.find()) {
System.out.println(matcher.group());
}
(I don't entirely trust your escaped backslashes in your regular expression; however the pattern is not really important to the principle)
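Since the question asks about pushing the matches to an array, here is a minimal sketch that loops over find() and collects every match. The class name LinkExtractor and the method extractLinks are made up for illustration, and the pattern is a lightly cleaned-up variant of the one in the question (dots escaped, slashes left unescaped), so adapt it to your needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
final class LinkExtractor {
    // Variant of the question's pattern: the dots are escaped so they only match a literal '.'.
    private static final Pattern VIDEO_LINK =
            Pattern.compile("https://www\\.youtube\\.com/watch\\?v=.{11}");

    // Returns every substring of text that matches the pattern, in order of appearance.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = VIDEO_LINK.matcher(text);
        while (matcher.find()) {          // find() advances to the next match on each call
            links.add(matcher.group());   // group() is the text of the current match
        }
        return links;
    }

    public static void main(String[] args) {
        String input = "https://www.youtube.com/watch?v=qJuoXM7G322"
                + "&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7";
        // Convert to an array if an array is really what you need.
        String[] tokenVal = extractLinks(input).toArray(new String[0]);
        System.out.println("Count of Links : " + tokenVal.length);
        for (String t : tokenVal) {
            System.out.println(t);
        }
    }
}
```

With the example input this should print a count of 1 followed by the bare video link, without the &list=... tail.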

You can negate your regex using a negative lookahead: (?!pattern)
See also : How to negate the whole regex?

Related

String split method returning first element as empty using regex

I'm trying to get the digits from the expression [1..1] using Java's split method. I'm using the regex ^\\[|\\.{2}|\\]$ inside split, but split returns a String array whose first element is empty, with "1" at index 1 and index 2. Could anyone tell me what I'm doing wrong in this regex, so that the array returned by split contains only the digits?
You should use matching. Change your expression to:
`^\[(.*?)\.\.(.*)\]$`
And get your results from the two captured groups.
As for why split acts this way, it's simple: you asked it to split on the [ character, but there's still an "empty string" between the start of the string and the first [ character.
Your regex matches [, .. and ], so the string is split at each of these occurrences.
You should not use a split but match each number in your string using regex.
You've set it up such that [, ] and .. are delimiters. Split will return an empty first index because the first character in your string [1..1] is a delimiter. I would strip delimiters from the front and end of your string, as suggested here.
So, something like
input.replaceFirst("^\\[", "").split("^\\[|\\.{2}|\\]$");
Or, use regex and regex groups (such as the other answers in this question) more directly rather than through split.
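To make the split behaviour concrete, here is a small sketch comparing the raw split with the strip-then-split idea. SplitDemo and its method names are invented for this example. Note that the opening bracket has to be escaped in replaceFirst, because an unescaped [ starts a character class.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are for illustration only.
final class SplitDemo {
    // The split from the question: the leading "[" is a delimiter at index 0,
    // so the piece in front of it is an empty string.
    static String[] splitRaw(String input) {
        return input.split("^\\[|\\.{2}|\\]$");
    }

    // Strip the leading "[" first, then split on ".." and the trailing "]".
    // split() drops trailing empty strings, so the "]" leaves no extra element.
    static String[] splitStripped(String input) {
        return input.replaceFirst("^\\[", "").split("\\.{2}|\\]$");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRaw("[1..1]")));      // prints [, 1, 1]
        System.out.println(Arrays.toString(splitStripped("[1..1]"))); // prints [1, 1]
    }
}
```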
Why not use a regex to capture the numbers? This is more effective and less error-prone. In that case the regex looks like:
^\[(\d+)\.{2}(\d+)\]$
And you can capture them with:
Pattern pattern = Pattern.compile("^\\[(\\d+)\\.{2}(\\d+)\\]$");
Matcher matcher = pattern.matcher(text);
if(matcher.find()) { //we've found a match
int range_from = Integer.parseInt(matcher.group(1));
int range_to = Integer.parseInt(matcher.group(2));
}
with range_from and range_to being the integers you can now work with.
The advantage is that the pattern will fail on strings that don't make much sense, like ..3[4, etc.
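For completeness, a self-contained sketch of the capture-group approach. RangeParser and parse are hypothetical names, and returning null for malformed input is just one possible design choice. Very long digit runs would still make Integer.parseInt throw a NumberFormatException, which this sketch does not handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the capture-group approach.
final class RangeParser {
    private static final Pattern RANGE = Pattern.compile("^\\[(\\d+)\\.{2}(\\d+)\\]$");

    // Returns {from, to}, or null when the text is not of the form [from..to].
    static int[] parse(String text) {
        Matcher matcher = RANGE.matcher(text);
        if (!matcher.matches()) {
            return null; // e.g. "..3[4" ends up here instead of producing garbage
        }
        int rangeFrom = Integer.parseInt(matcher.group(1));
        int rangeTo = Integer.parseInt(matcher.group(2));
        return new int[] { rangeFrom, rangeTo };
    }

    public static void main(String[] args) {
        int[] range = parse("[1..1]");
        System.out.println(range[0] + " to " + range[1]); // prints 1 to 1
        System.out.println(parse("..3[4") == null);       // prints true
    }
}
```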

How to remove the # in a string using Pattern in java

I need to remove a part of the string which starts with #.
My sample code works for one string and fails for another.
Failed one: Not able to remove #news4buffalo:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Success one:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Output:
couldn't agree more. Good crowd last night. #LetsGoFish
From your question it looks like this regex can work for you:
rawContent = rawContent.replaceAll("#\\S*", "");
You can try in this way as well.
String s = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
System.out.println(s.replaceAll("#[^\\s]*\\s+", ""));
// #[^\\s]* matches from # up to the next whitespace; \\s+ removes the spaces that follow as well
The regex is only considering word characters whereas your input String contains a colon :. You can solve this by replacing \\w with \\S (any non-whitespace character) in your regex. Also there is no need for two patterns.
String regex = "#\\S*";
You don't need to escape #, so don't add \ before it as in "\\#" (it confuses people).
Don't use a matcher to check whether the string contains the part to be replaced and then call replaceAll, because the string will be scanned a second time. Just call replaceAll directly; if there is nothing to replace, it leaves the string unchanged. By the way, use replaceAll from the Matcher instance to avoid recompiling the Pattern.
A regex of the form foo||bar doesn't seem right. Regex uses a single pipe | to represent OR, so such a regex means foo OR empty string OR bar. Since the empty string is special (every string contains it at the start, at the end, and even between characters), it can cause problems: for example, "foo".replaceAll("|foo", "x") returns xfxoxox instead of, for instance, "xxx", because the empty match before f is consumed first, which prevents f from being used as the first character of foo.
Anyway, it seems that you would like to accept any #xxxx word, so consider something like "#\\w+" if you want to make sure that there is at least one character after #.
You can also add the condition that # must be the first character of a word (in case you don't want to remove the part after # from e-mail addresses). To do this, use a look-behind like (?<=\\s|^)#, which checks that # is preceded by whitespace or is placed at the start of the string.
You can also remove the space after the word you removed (if there is one).
So you can try with
String regex = "(?<=\\s|^)#\\w*\\s?";
which for data like
RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
will return
RT : Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
But if you would also like to remove characters other than the letters, digits and underscore covered by \\w, such as :, you can simply use \\S, which represents non-whitespace characters, so your regex can look like
String regex = "(?<=\\s|^)#\\S*\\s?";
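Putting the advice together, a minimal sketch might look like this. HashtagStripper and strip are hypothetical names. The pattern is compiled once and the replacement goes through the Matcher, with no separate find() check beforehand.

```java
import java.util.regex.Pattern;

// Hypothetical helper; compiles the pattern once and reuses it.
final class HashtagStripper {
    // '#' at the start of the input or right after whitespace, followed by any
    // run of non-whitespace characters and at most one trailing whitespace.
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#\\S*\\s?");

    // Removes every tag; input without tags is returned unchanged.
    static String strip(String rawContent) {
        return TAG.matcher(rawContent).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("RT #news4buffalo: Police say a shooter fired into a crowd"));
        System.out.println(strip("#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish"));
        System.out.println(strip("contact user#example.com")); // '#' inside a word is left alone
    }
}
```

Because \\s? removes only the whitespace after a tag, a tag at the very end of the input leaves the space before it in place; call trim() on the result if that matters.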

Java Repeat regular expression

I have the following regex, which should match, for example, some IDs in brackets:
[swpf_02-7679, swpf_02-7622, ...]
Pattern p = Pattern.compile("[\\[\\s]*?[a-z]{1,8}[0-9]*?_[0-9]{2,}\\-[0-9]+[\\s]*?\\]");
The goal is now to combine this pattern with "split" at "," to fit the string [swpf_02-7679, swpf_02-7622] and not only [swpf_02-7679] like the posted RegEx above.
Can someone give me a hint?
Just remove the [ and ] from the string then split at the ,
The easiest way to do what you want, I think, is to remove the '[' and ']' at the front and back (use String.substring()), then split on the comma with String.split() and apply the regex to each individual string returned (adjust the regex to remove the brackets, of course).
OK, assuming the IDs you want look like "swpf_02-7622", split on the comma and loop through the pieces, trimming as you go. Something like:
List<String> cleanIds = new ArrayList<String>();
for(String id : ids.split(","))
cleanIds.add(id.trim());
If you want to get rid of the "swpf_" bits, use id.substring(5).
Finally, to get rid of the square brackets, check for them with id.startsWith("[") and id.endsWith("]").
Why don't you use the Java StringTokenizer class and then just use the regex on the tokens you get out of this? You can post-process them to include the brackets you need or modify the regex slightly.
As #was and #garyh already mentioned, the simplest way is to remove the [ and ], then split your list using String.split("\\s*,\\s*"), then match each member using your pattern.
You can also match your string multiple times, starting each search at the end position of the previous match:
Pattern p = .... // your pattern in capturing brackets ()
Matcher m = p.matcher(str);
for (int start = 0; m.find(start); start = m.end()) {
String element = m.group(1);
// do what you need with the element.
}
If you simply want to extract all the codes in your list, you could use this regular expression:
[^,\s\[\]]+
Getting all the matches from the following string:
[swpf_02-7679, swpf_02-762342, swpf_02-7633 , swpf_02-723422]
Would give you the following results:
swpf_02-7679
swpf_02-762342
swpf_02-7633
swpf_02-723422
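In Java that could be sketched as follows. IdExtractor and extractIds are made-up names, and the backslashes are doubled because the regex lives in a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that collects every code from a bracketed, comma-separated list.
final class IdExtractor {
    // One or more characters that are not a comma, whitespace, or a square bracket.
    private static final Pattern CODE = Pattern.compile("[^,\\s\\[\\]]+");

    static List<String> extractIds(String list) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = CODE.matcher(list);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        String input = "[swpf_02-7679, swpf_02-762342, swpf_02-7633 , swpf_02-723422]";
        for (String id : extractIds(input)) {
            System.out.println(id);
        }
    }
}
```

This version does not validate the shape of each code. If that matters, each extracted string could still be checked against the stricter pattern from the question with the bracket parts removed.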

java Regex - split but ignore text inside quotes?

Using only regular expression methods, the method String.replaceAll, and ArrayList,
how can I split a String into tokens but ignore delimiters that appear inside quotes?
The delimiter is any character that is not alphanumeric or quoted text.
for example:
The string :
hello^world'this*has two tokens'
should output:
hello
worldthis*has two tokens
I know there is a good, accepted answer already, but I would like to add another regex-based (and, may I say, simpler) approach: split the given text on any non-alphanumeric delimiter that is not inside single quotes, using
Regex:
/(?=(([^']+'){2})*[^']*$)[^a-zA-Z\d]+/
Which basically means: match non-alphanumeric text if it is followed by an even number of single quotes; in other words, match non-alphanumeric text if it is outside single quotes.
Code:
String string = "hello^world'this*has two tokens'#2ndToken";
System.out.println(Arrays.toString(
string.split("(?=(([^']+'){2})*[^']*$)[^a-zA-Z\\d]+"))
);
Output:
[hello, world'this*has two tokens', 2ndToken]
Demo:
Here is a live working Demo of the above code.
Use a Matcher to identify the parts you want to keep, rather than the parts you want to split on:
String s = "hello^world'this*has two tokens'";
Pattern pattern = Pattern.compile("([a-zA-Z0-9]+|'[^']*')+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
See it working online: ideone
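The question also wants the quotes gone from the tokens and only allows regex methods, String.replaceAll and ArrayList. A sketch that extends the Matcher approach under those constraints could look like this. QuoteAwareTokenizer and tokenize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built on the Matcher approach above.
final class QuoteAwareTokenizer {
    // A token is a run of alphanumeric chunks and/or single-quoted chunks.
    private static final Pattern TOKEN = Pattern.compile("([a-zA-Z0-9]+|'[^']*')+");

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(s);
        while (matcher.find()) {
            // Drop the quote characters but keep the quoted text itself.
            tokens.add(matcher.group().replaceAll("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("hello^world'this*has two tokens'")) {
            System.out.println(token);
        }
    }
}
```

For the example string this should give the two tokens asked for in the question: hello and worldthis*has two tokens. An unbalanced quote is simply treated as a delimiter here, which may or may not be what you want.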
You cannot in any reasonable way. You are posing a problem that regular expressions aren't good at.
Do not use a regular expression for this. It won't work. Use / write a parser instead.
You should use the right tool for the right task.

Java replaceAll regex With Similar Result

Alright folks, my brain is fried. I'm trying to fix up some EMLs with bad boundaries by replacing the incorrect
--Boundary_([ArbitraryName])
lines with more proper
--Boundary_([ArbitraryName])--
lines, while leaving already correct
--Boundary_([ThisOneWasFine])--
lines alone. I've got the whole message in-memory as a String (yes, it's ugly, but JavaMail dies if it tries to parse these), and I'm trying to do a replaceAll on it. Here's the closest I can get.
//Identify boundary lines that do not end in --
String regex = "^--Boundary_\\([^\\)]*\\)$";
Pattern pattern = Pattern.compile(regex,
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher matcher = pattern.matcher(targetString);
//Store all of our unique results.
HashSet<String> boundaries = new HashSet<String>();
while (matcher.find())
boundaries.add(matcher.group());
//Add "--" at the end of the Strings we found.
for (String boundary : boundaries)
targetString = targetString.replaceAll(Pattern.quote(boundary),
boundary + "--");
This has the obvious problem of replacing all of the valid
--Boundary_([WasValid])--
lines with
--Boundary_([WasValid])----
However, this is the only setup I've gotten to even perform the replacement. If I try changing Pattern.quote(boundary) to Pattern.quote(boundary) + "$", nothing is replaced. If I try just using matcher.replaceAll("$0--") instead of the two loops, nothing is replaced. What's an elegant way to achieve my aim and why does it work?
There's no need to iterate through the matches with find(); that's part of what replaceAll() does.
s = s.replaceAll("(?im)^--Boundary_\\([^\\)]*\\)$", "$0--");
The $0 in the replacement string is a placeholder for whatever the regex matched in this iteration.
The (?im) at the beginning of the regex turns on CASE_INSENSITIVE and MULTILINE modes.
You can try something like this:
String regex = "^--Boundary_\\([^\\)]*\\)(--)?$";
then see if the string ends with -- and replace only ones that don't.
Assuming all the strings are on their own line, this works:
"(?im)^--Boundary_\\([^)]*\\)$"
Example script:
String str = "--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n";
System.out.println(str.replaceAll("(?im)^--Boundary_\\([^)]*\\)$", "$0--"));
Edit: changed from JavaScript to Java, must have read too fast.(Thanks for pointing it out)
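A self-contained version of the single replaceAll idea, as a sketch. BoundaryFixer and closeBoundaries are hypothetical names. It assumes each boundary sits on its own line, and it returns the new string, since Java strings are immutable and the result of replaceAll has to be used or assigned.

```java
import java.util.regex.Pattern;

// Hypothetical helper; appends "--" only to boundary lines that lack it.
final class BoundaryFixer {
    // (?im): case-insensitive and multiline, so ^ and $ anchor at every line.
    // A line that already ends in "--" does not match, because $ must come
    // directly after the closing parenthesis.
    private static final Pattern OPEN_BOUNDARY =
            Pattern.compile("(?im)^--Boundary_\\([^)]*\\)$");

    static String closeBoundaries(String message) {
        // $0 stands for the whole text matched by the pattern.
        return OPEN_BOUNDARY.matcher(message).replaceAll("$0--");
    }

    public static void main(String[] args) {
        String message = "--Boundary_([ArbitraryName])\n"
                + "--Boundary_([ThisOneWasFine])--\n"
                + "some body text\n";
        System.out.print(closeBoundaries(message));
    }
}
```

In the example, only the first line should gain a trailing --; the already correct boundary and the body text are left untouched.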
