Given _<A_>_<B_>_<Z_>, I want to extract A, B, Z into an array.
Basically _< is the starting delimiter and _> is the ending delimiter.
You can use lookaround assertions to match only the content of the tags.
String text = "_<A_>_<B_>_<Z_>";
List<String> Result = new ArrayList<String>();
Pattern p = Pattern
.compile("(?<=_<)" + // Lookbehind assertion to ensure the opening tag comes before
".*?" + // Match as little as possible until the lookahead succeeds
"(?=_>)" // Lookahead assertion to ensure the closing tag ahead
);
Matcher m = p.matcher(text);
while(m.find()){
Result.add(m.group(0));
}
That's simple: strip the first opening delimiter and the last closing delimiter, then split on the closing-opening pair.
string.replaceFirst( "^_<(.*)_>$", "$1" ).split( "_>_<" );
You extract them using capture groups.
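A minimal sketch of the capture-group approach, assuming the delimiters are exactly _< and _> and that the content never contains _> itself. The class and method names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // "_<", then as few characters as possible (captured), then "_>".
    private static final Pattern TAG = Pattern.compile("_<(.*?)_>");

    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            result.add(m.group(1)); // group 1 is the text between the delimiters
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("_<A_>_<B_>_<Z_>")); // prints [A, B, Z]
    }
}
```

Unlike the lookaround version, the delimiters are part of each match here, so you read group 1 rather than group 0.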
split by _< to get 2 elements, take the 2nd and split it by _> to get 2 elements, take the 1st and split it by _>_< to get A, B, C
As the title says, I have a string and I want to extract some data from it.
This is my String:
text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
and I want to extract all the data between the pipes: tab_PRO, 1, 1, and so on.
I've tried:
Pattern p = Pattern.compile("\\|(.*?)\\|");
Matcher m = p.matcher(text);
while(m.find())
{
for(int i = 1; i< 10; i++) {
test = m.group(i);
System.out.println(test);
}
}
With this I get the first group, tab_PRO, but I also get an error:
java.lang.IndexOutOfBoundsException: No group 2
Now, I probably didn't quite understand how groups work, but I thought that with this I could get the remaining data that I need. I can't work out what I'm missing.
Thanks in advance
Use String.split(). Note that it expects a regex as its argument, and | is a regex metacharacter (alternation), so you need to escape it with a \. In a Java string literal the backslash itself must also be escaped, so write \\|; a single \| would be rejected as an invalid escape sequence:
String[] parts = text.split("\\|");
See it working here:
https://ideone.com/WibjUm
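One thing worth knowing before relying on split() here: because the string starts and ends with a pipe, the result is not just the fields. A small sketch (class and method names are mine, not from the question) showing what comes back:

```java
import java.util.Arrays;

class PipeSplitDemo {
    static String[] fields(String text) {
        // "\\|" in Java source is the regex \| (a literal pipe).
        return text.split("\\|");
    }

    public static void main(String[] args) {
        String text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
        // The leading pipe yields an empty first element; trailing empty
        // strings are dropped by split() with the default limit.
        System.out.println(Arrays.toString(fields(text)));
        // prints [, tab_PRO, 1, 1, #tRecordType#, , 0, tab_PRO]
    }
}
```

If the empty first element is unwanted, skip index 0 or strip the leading pipe before splitting.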
If you want to go with your regex approach, you'll need to group and capture every repetition of characters after every | and restrict them to be anything except |, possibly using a regex like \\|([^\\|]*).
In your loop, iterate with m.find() and use only capture group 1, because it's the only group every match will have.
String text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
Pattern p = Pattern.compile("\\|([^\\|]*)");
Matcher m = p.matcher(text);
while(m.find()){
System.out.println(m.group(1));
}
https://ideone.com/RNjZRQ
Try using .split() or .substring()
As mentioned in the comments, this is easier done with String.split.
As for your own code, you are unnecessarily using the inner loop, and that's leading to that exception. You only have one group, but the for loop will cause you to query more than one group. Your loop should be as simple as:
Pattern p = Pattern.compile("(?<=\\|)(.*?)\\|");
Matcher m = p.matcher(text);
while (m.find()) {
String test = m.group(1);
System.out.println(test);
}
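And that prints the following (the empty field between the two adjacent pipes also produces an empty line, between #tRecordType# and 0):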
tab_PRO
1
1
#tRecordType#
0
tab_PRO
Note that I had to use a look-behind assertion in your regex.
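To illustrate why the look-behind is needed: without it, each match consumes both the opening and the closing pipe, so the next search starts after the closing pipe and every other field is skipped. A small comparison sketch (the helper name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Collects capture group 1 of every match of the given regex.
    static List<String> groups(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
        // Opening pipe is consumed, so alternate fields are lost.
        System.out.println(groups("\\|(.*?)\\|", text));
        // prints [tab_PRO, 1, , tab_PRO]
        // Opening pipe is only asserted, so every field is found.
        System.out.println(groups("(?<=\\|)(.*?)\\|", text));
        // prints [tab_PRO, 1, 1, #tRecordType#, , 0, tab_PRO]
    }
}
```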
I want to split a string based on a text qualifier, for example:
"1","10411721","MikeTison","08/11/2009","21/11/2009","2800.00","002934538","051","New York","10411720-002",".\Images\b.jpg",".\RTF\b.rtf"
Qualifier = "
Splitter = ,
I want to split the string on the splitter, but if the splitter appears inside the qualifier ", then ignore it and return the string including the splitter.
The regular expression I am using is (?:|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
but this regular expression only returns commas. Please help, as I am new to regular expressions.
Please note that if there are newline characters in the string, i.e. \r\n, they should be ignored:
"1","10411","Muis","a","21/11/2009","2800.06","0029683778","03005136851","Awan","10411720-001",".\Images\a.jpg",".\RTF\a.rtf"
"2","08/10/2009","07:32","Call","On-Net","030092343242342376543","Monk","00:00","1.500","0.000","10.000","0.200"
"2","08/10/2009","02:50","Call","Off-Net","030092343242342376543","Une","08:00","1.500","2.000","20.000","3.500"
"2","09/10/2009","03:55","SMS","On-Net","030092343242342376543","Mink","00:00","1.500","0.000","5.000","100.500"
"2","09/10/2009","12:30","Call","Off-Net","030092343242342376543","Zog","01:01","3.500","3.000","70.000","6.500"
"2","09/10/2009","09:11","Call","On-Net","030092343242342376543","Monk","02:30","2.00","2.000","90.000","4.000"
Probably the easiest solution is not to search for places to split, but to find the elements you want to return. In your case these elements
start with "
end with "
have no " inside.
So you can try something like
String data = "\"1\",\"10411721\",\"MikeTison\",\"08/11/2009\",\"21/11/2009\",\"2800.00\",\"002934538\",\"051\",\"New York\",\"10411720-002\",\".\\Images\\b.jpg\",\".\\RTF\\b.rtf\"";
Pattern p = Pattern.compile("\"([^\"]+)\"");
Matcher m = p.matcher(data);
while(m.find()){
System.out.println(m.group(1));
}
Output:
1
10411721
MikeTison
08/11/2009
21/11/2009
2800.00
002934538
051
New York
10411720-002
.\Images\b.jpg
.\RTF\b.rtf
You can split using this regex:
String[] arr = input.split( "(?=(([^\"]*\"){2})*[^\"]*$),+" );
This regex splits on commas only when they are outside double quotes, using a lookahead to make sure there is an even number of quotes after the comma.
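Here is a small sketch of the same idea on input that actually has a comma inside quotes. It uses a variant of the regex with the lookahead placed after the comma and non-capturing groups, and it assumes quotes are balanced and never escaped. The sample data and names are invented for illustration:

```java
import java.util.Arrays;

class QuoteAwareSplitDemo {
    static String[] splitOutsideQuotes(String line) {
        // Split on a comma only if an even number of quotes follows it,
        // i.e. the comma is not inside a quoted field.
        return line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    }

    public static void main(String[] args) {
        String line = "\"New York, NY\",\"051\",plain";
        System.out.println(Arrays.toString(splitOutsideQuotes(line)));
        // prints ["New York, NY", "051", plain]
    }
}
```

The quotes stay in the result, so strip them afterwards if you only want the field contents.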
Remove the first and the last character of the whole string. Then split with ","
String test = "\"1\",\"10411721\",\"MikeTison\",\"08/11/2009\",\"21/11/2009\",\"2800.00\",\"002934538\",\"051\",\"New York\",\"10411720-002\",\".\\Images\\b.jpg\",\".\\RTF\\b.rtf\"";
if (test.length() >= 2) // need both the leading and the trailing quote
test = test.substring(1, test.length()-1);
System.out.println(Arrays.toString(test.split("\",\"")));
This works even if the string contains newline characters. Try it out:
String str="\"1\",\"10411721\",\"MikeTison\",\"08/11/2009\",\"21/11/2009\",\"2800.00\",\"002934538\",\"051\",\"New York\",\"10411720-002\",\".\\Images\\b.jpg\",\".\\RTF\\b.rtf\"";
System.out.println(Arrays.toString(str.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")));
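For the multi-line data in the question, one possible approach is to split into lines first and then split each line into fields. This is only a sketch: it assumes every field is quoted, that no field contains the sequence "," and that records are separated by line breaks. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultiLineRecordsDemo {
    static List<String[]> parse(String data) {
        List<String[]> rows = new ArrayList<>();
        // \R matches any line break, including \r\n (Java 8 and later).
        for (String line : data.split("\\R")) {
            if (line.length() < 2) {
                continue; // skip blank lines
            }
            // Drop the first and last quote, then split on the "," between fields.
            String inner = line.substring(1, line.length() - 1);
            rows.add(inner.split("\",\""));
        }
        return rows;
    }

    public static void main(String[] args) {
        String data = "\"1\",\"a\"\r\n\"2\",\"b,c\"";
        for (String[] row : parse(data)) {
            System.out.println(Arrays.toString(row));
        }
        // prints [1, a] and then [2, b,c]
    }
}
```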
I have one string
5,(5,5),C'A,B','A,B',',B','A,',"A,B",C"A,B"
I want to split it on commas, but need to exclude commas within parentheses and quotes (both single and double quotes).
Like this:
5 (5,5) C'A,B' 'A,B' ',B' 'A,' "A,B" C"A,B"
How can I achieve this using a Java regular expression?
You can use this regex:
String input = "5,(5,5),C'A,B','A,B',',B','A,',\"A,B\",C\"A,B\"";
String[] toks = input.split(
",(?=(([^']*'){2})*[^']*$)(?=(([^\"]*\"){2})*[^\"]*$)(?![^()]*\\))" );
for (String tok: toks)
System.out.printf("<%s>%n", tok);
Output:
<5>
<(5,5)>
<C'A,B'>
<'A,B'>
<',B'>
<'A,'>
<"A,B">
<C"A,B">
Explanation:
, # Match literal comma
(?=(([^']*'){2})*[^']*$) # Lookahead to ensure comma is followed by even number of '
(?=(([^"]*"){2})*[^"]*$) # Lookahead to ensure comma is followed by even number of "
(?![^()]*\)) # Negative lookahead to ensure the comma is not followed by a )
# with only non-parenthesis characters in between (i.e. it is not inside parentheses)
,(?![^(]*\))(?![^"']*["'](?:[^"']*["'][^"']*["'])*[^"']*$)
Try this.
For Java (inside a string literal, so the backslash and the double quotes are escaped):
",(?![^(]*\\))(?![^\"']*[\"'](?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)"
Instead of splitting the string, consider matching instead.
String s = "5,(5,5),C'A,B','A,B',',B','A,',\"A,B\",C\"A,B\"";
Pattern p = Pattern.compile("(?:[^,]*(['\"])[^'\"]*\\1|\\([^)]*\\))|[^,]+");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
Output
5
(5,5)
C'A,B'
'A,B'
',B'
'A,'
"A,B"
C"A,B"
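If the regex gets hard to maintain, a plain character scan is another option. To be clear, this is a different technique from the answers above, not a regex: a hand-written splitter that tracks whether it is inside quotes or parentheses. It assumes quotes are balanced and not escaped, and all names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelCommaSplitter {
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;   // parenthesis nesting depth
        char quote = 0;  // active quote character, or 0 when outside quotes
        for (char c : input.toCharArray()) {
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // closing quote
                }
            } else if (c == '\'' || c == '"') {
                quote = c; // opening quote
            } else if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            } else if (c == ',' && depth == 0) {
                // Top-level comma: finish the current token, drop the comma.
                tokens.add(current.toString());
                current.setLength(0);
                continue;
            }
            current.append(c);
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        String input = "5,(5,5),C'A,B','A,B',',B','A,',\"A,B\",C\"A,B\"";
        split(input).forEach(System.out::println);
    }
}
```

For the sample input this should produce the same eight tokens as the regex answers.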
I have the input string of the following form "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]" and I need to extract the tokens "Animal rights" , "Anthropocentrism" and so on etc.
I tried using the split method in the String library but I am not able to find the appropriate regular expression to get the tokens, it would be great if someone could help.
I am basically trying to parse the internal links in a Wikipedia XML file you can check out the format here.
You probably shouldn't be using split() here but instead a Matcher:
String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
Matcher m = Pattern.compile("\\[\\[(.*?)\\]\\]").matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
Animal rights
Anthropocentrism
Anthropology
A pattern like this should work:
\[\[(.*?)\]\]
This will match a literal [[ followed by zero or more of any character, non-greedily, captured in group 1, followed by a literal ]].
Don't forget to escape the \ in the Java string literal:
Pattern.compile("\\[\\[(.*?)\\]\\]");
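To see why the non-greedy quantifier matters, here is a small sketch comparing the lazy (.*?) with a greedy (.*) on the question's input. The helper name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Collects capture group 1 of every match of the given regex.
    static List<String> links(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
        // Lazy: stops at the first "]]" after each "[[", giving three matches.
        System.out.println(links("\\[\\[(.*?)\\]\\]", input).size()); // prints 3
        // Greedy: runs to the last "]]", giving one match spanning all links.
        System.out.println(links("\\[\\[(.*)\\]\\]", input).size()); // prints 1
    }
}
```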
It's pretty easy with regex.
\[\[(.+?)\]\]
I recommend using .+ to make sure there is actually something inside the brackets, so you don't end up with an empty string when you put the result in your array.
String[] output = new String[10]; // fixed size: assumes at most 10 links
String pattern = "\\[\\[(.+?)\\]\\]"; // backslashes doubled for the Java string literal
string input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
Matcher m = Pattern.compile(pattern).matcher(input);
int increment= 0;
while (m.find()) {
output[increment] = m.group(1);
increment++;
}
Since you said you also wanted to learn regex, I'll break it down.
\[ twice matches the two [ brackets; you need the \ because [ is a regex special character
. matches any character except newlines
+ means one or more of the preceding item
? after + makes the quantifier lazy: the engine first matches the preceding item only once, then extends the match one character at a time until the rest of the pattern matches.
\] matches a literal ]
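If you don't know how many links there will be, a List avoids guessing an array size. A minimal sketch, assuming Java 9 or later (for Matcher.results()); the class and method names are mine:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WikiLinkList {
    private static final Pattern LINK = Pattern.compile("\\[\\[(.+?)\\]\\]");

    static List<String> links(String input) {
        return LINK.matcher(input)
                .results()              // Stream<MatchResult>, available since Java 9
                .map(r -> r.group(1))   // keep only the text inside the brackets
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
        System.out.println(links(input));
        // prints [Animal rights, Anthropocentrism, Anthropology]
    }
}
```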
Try the following:
String str = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
str = str.replaceAll("(^\\[\\[|\\]\\]$)", "");
String[] array = str.split("\\]\\] \\[\\[");
System.out.println(Arrays.toString(array));
// prints "[Animal rights, Anthropocentrism, Anthropology]"
Would anyone be able to assist me with a regex?
I want to split the following string into a number, a string and a number:
"810LN15"
One method requires 810 to be returned, another requires LN, and another should return 15.
The only real solution to this is using a regex, as the numbers will grow in length.
What regex can I use to accommodate this?
String.split won't give you the desired result, which I guess would be "810", "LN", "15", since it would have to look for a token to split at and would strip that token.
Try Pattern and Matcher instead, using this regex: (\d+)|([a-zA-Z]+), which matches any run of digits or any run of letters and so yields distinct number/text tokens (i.e. "AA810LN15QQ12345" would result in the tokens "AA", "810", "LN", "15", "QQ" and "12345").
Example:
Pattern p = Pattern.compile("(\\d+)|([a-zA-Z]+)");
Matcher m = p.matcher("810LN15");
List<String> tokens = new LinkedList<String>();
while(m.find())
{
String token = m.group(); // group 0 is always the entire match; group(1) would be null for letter runs
tokens.add(token);
}
//now iterate through 'tokens' and check whether you have a number or text
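One way to do that check, sketched under the assumption that the same two-group regex is used: for each match exactly one of the two groups is non-null, so the group itself tells you whether the token is a number or text. The names below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenKindDemo {
    private static final Pattern TOKEN = Pattern.compile("(\\d+)|([a-zA-Z]+)");

    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 1 is set for digit runs, group 2 for letter runs.
            String kind = (m.group(1) != null) ? "number" : "text";
            out.add(kind + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("810LN15"));
        // prints [number:810, text:LN, number:15]
    }
}
```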
In Java, as in most regex flavors (Python before 3.7 being a notable exception), the split() regex isn't required to consume any characters when it finds a match. Here I've used lookaheads and lookbehinds to match any position that has a digit on one side of it and a non-digit on the other:
String source = "810LN15";
String[] parts = source.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
System.out.println(Arrays.toString(parts));
output:
[810, LN, 15]
(\\d+)([a-zA-Z]+)(\\d+) should do the trick. The first capture group will be the first number, the second capture group will be the letters in between and the third capture group will be the second number. The double backslashes are for the Java string literal.
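A minimal sketch of using that pattern, assuming the whole input always has the shape number, letters, number. The class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreePartCode {
    private static final Pattern CODE = Pattern.compile("(\\d+)([a-zA-Z]+)(\\d+)");

    static String[] parts(String code) {
        Matcher m = CODE.matcher(code);
        // matches() requires the entire string to fit the pattern.
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected format: " + code);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("810LN15")));
        // prints [810, LN, 15]
    }
}
```

Each of the three methods from the question could then return one element of this array.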
This gives you the exact thing you guys are looking for
Pattern p = Pattern.compile("(([a-zA-Z]+)|(\\d+))|((\\d+)|([a-zA-Z]+))");
Matcher m = p.matcher("810LN15");
List<Object> tokens = new LinkedList<Object>();
while(m.find())
{
String token = m.group( 1 );
tokens.add(token);
}
System.out.println(tokens);