Find substring that surrounds a pattern using java - java

I have a long string variable X and another string (a word or two in length) Y. I want to find the 50 words before and after Y where it appears in X. How can I achieve this using a regex?

Why does this have to be a regexp? What if there aren't 50 words surrounding it, because the match is at the beginning of the string?
Consider just locating the match, then separately finding an appropriate "snippet" surrounding it, without trying to cram it all into one magic, unmaintainable regular expression.
There is nothing wrong with doing it explicitly: find the match, grow the snippet to the desired size, return the snippet. Make that a well-documented method "extractSnippet" instead of trying to do it in a single regular expression.
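The explicit approach described above could be sketched roughly as follows. This is only an illustration: the class name, the parameter names, and the choice to locate Y with a plain indexOf (which also matches inside longer words) are assumptions, not something the answer specifies.

```java
import java.util.Arrays;

class SnippetDemo {
    // Hypothetical helper: returns up to `radius` words on each side of the
    // first occurrence of `target` in `text`, or null if `target` is absent.
    static String extractSnippet(String text, String target, int radius) {
        int at = text.indexOf(target);
        if (at < 0) {
            return null;
        }
        String before = text.substring(0, at).trim();
        String after = text.substring(at + target.length()).trim();
        // Splitting an empty string would yield one empty "word", so guard against it.
        String[] b = before.isEmpty() ? new String[0] : before.split("\\s+");
        String[] a = after.isEmpty() ? new String[0] : after.split("\\s+");
        // Keep the last `radius` words before the match and the first `radius` after it.
        String left = String.join(" ", Arrays.copyOfRange(b, Math.max(0, b.length - radius), b.length));
        String right = String.join(" ", Arrays.copyOfRange(a, 0, Math.min(radius, a.length)));
        return (left + " " + target + " " + right).trim();
    }

    public static void main(String[] args) {
        System.out.println(extractSnippet("a b c d e f g", "d", 2));
    }
}
```

Because the word counts are clamped with Math.max and Math.min, a match near the start or end of the text simply produces a shorter snippet.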

This code generates a string of 300 words (Word0 .. Word299), defines the target to search for as "Word12 Word13" and then finds up to 50 words before that string and up to 50 words after it.
final StringBuilder b = new StringBuilder();
final String matchWords = "Word12 Word13";
for (int i = 0; i < 300; i++) b.append("Word").append(i).append(" ");
final Matcher m =
Pattern.compile(
"((?:\\S+\\s+){0,50})" + Pattern.quote(matchWords) + "((?:\\s+\\S+){0,50})"
).matcher(b.toString());
if (m.find()) System.out.println("Words before: " + m.group(1) +
"\nAfter: " + m.group(2));

Check this PHP regex out, I'm pretty sure it'll work for Java too:
php > preg_match_all("/([a-z]+ ){4}donkey( [a-z]+){4}/","summer donna summer donna summer donkey hop hop hop hop bzzp",$matches); print_r($matches);
Array
(
[0] => Array
(
[0] => donna summer donna summer donkey hop hop hop hop
)
[1] => Array
(
[0] => summer
)
[2] => Array
(
[0] => hop
)
)

Java needs the java.util.regex.* package (the trailing wildcard pulls in the dependencies) to perform that. Import it and create an instance such as:
Pattern p = Pattern.compile("(\\d+)");
Matcher m = p.matcher(name);
StringBuffer sb = new StringBuffer();
while(m.find()){
sb.append(m.group()); // appends the text matched by Pattern p to sb
}
In the Pattern, regular regex syntax can be invoked.
I would think you could run into issues where there may not be 50 words preceding or following the Y string.
Roughly, I would say first check for existence by matching a pattern for Y against X. Then go to the expense of counting words with a split operation and a " " space delimiter. From there, it's a counting problem.

Related

Is there an easier way to find a certain format within a longer String?

I'm building a program and ran into a problem, I'm not sure how to conquer it most efficiently.
I need to write an algorithm that takes a String in this format:
12/05/2014 PROJ Assignment 4 20/20 100 4
and it will remove everything but
20/20
so I can then substring that and parse it to an integer value. This is what I've tried, but I'm not sure what the best way to do this would be. My while loop works, going from each / to the next, but the loop will only stop when the string has 20 100 4 left, I need the 20/20, but not the 100 or 4.
String line = "12/05/2014 PROJ Assignment 4 20/20 100 4";
int slashIndex = line.indexOf("/");
String temp = line.substring((slashIndex+1));
System.out.println(temp);
while(temp.indexOf("/") != -1){
slashIndex = temp.indexOf("/");
temp = temp.substring((slashIndex+1));
System.out.println(temp);
}
If I do it the way I'm doing, I could potentially use the slashIndex of the last slash, and then make a substring from the original String- however the score may vary. It could be 20/20 or it could be 100/200, or 10/100, so how could I make the program dynamic enough to decide where to cut it up?
Any thoughts or ideas would be great, thanks.
Connor
Just split the input on one or more whitespaces (\\s+). The 5th field will have index 4 of the parts.
String t = "12/05/2014 PROJ Assignment 4 20/20 100 4";
String[] parts = t.split("\\s+");
System.out.println(parts[4]);
Output:
20/20
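Since the goal is to end up with integer values, the split approach can be taken one step further. The following is a minimal sketch, assuming the score is always the fifth whitespace-separated field (so the assignment name is exactly two tokens); the class and method names are made up for illustration.

```java
class ScoreDemo {
    // Hypothetical helper: takes the 5th whitespace-separated field
    // ("earned/possible") and parses both halves as integers.
    static int[] parseScore(String line) {
        String[] parts = line.trim().split("\\s+");
        String[] score = parts[4].split("/");
        return new int[] { Integer.parseInt(score[0]), Integer.parseInt(score[1]) };
    }

    public static void main(String[] args) {
        int[] s = parseScore("12/05/2014 PROJ Assignment 4 20/20 100 4");
        System.out.println(s[0] + " out of " + s[1]);
    }
}
```

If the assignment name can contain a varying number of words, the fixed index no longer holds and one of the regex answers is the safer choice.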
try this
str = str.replaceAll(".* (\\d+/\\d+) .*", "$1");
String line = "12/05/2014 PROJ Assignment 4 20/20 100 4";
// Match the whole line so that replaceAll keeps only the captured score.
String pattern = ".* ([0-9]{1,3}/[0-9]{1,3}) .*";
String numbers = line.replaceAll(pattern, "$1");
System.out.println(numbers);
If you want to do it with a regex, this one also ensures that the input has exactly the expected format.
I created a regexplain entry as an explanation of the regex.
Pattern mypat = Pattern.compile("\\d{2}/\\d{2}/\\d{4} [A-Z]{4} \\w+ \\d? (\\d+/\\d+) \\d+ \\d?");
// ...
Matcher m = mypat.matcher("12/05/2014 PROJ Assignment 4 20/20 100 4");
if (m.matches()) {
String value = m.group(1);
}
Create a regex that matches the string you want (you may find something more sophisticated for the actual regex; I am not a regex expert), for example:
Pattern pattern = Pattern.compile("([2][0]\\/[2][0])");
then create a matcher using the pattern
Matcher m = pattern.matcher("12/05/2014 PROJ Assignment 4 20/20 100 4");
and finally, if m.find() succeeds (m.matches() would only succeed if the pattern matched the entire input), get the matched text:
m.group(0)

split a string in java into equal length substrings while maintaining word boundaries

How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
// Give the list the right capacity to start with. You could use an array
// instead if you wanted.
List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
for (int start = 0; start < text.length(); start += size) {
ret.add(text.substring(start, Math.min(text.length(), start + size)));
}
return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific I need the string splitting algorithm to take into account the word boundary provided by spaces and not solely rely on character length while splitting the string although that also needs to be taken into account but more like a max range of characters rather than a hardcoded length of characters.
If I understand your problem correctly, then this code should do what you need (but it assumes that maxLenght is greater than or equal to the length of the longest word):
String data = "Hello there, my name is not importnant right now."
+ " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
Short (or not) explanation of "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(let's just remember that in Java \ is special not only in regex but also in String literals, so to use predefined character classes like \d we need to write "\\d", because the \ must also be escaped in the string literal)
\G - an anchor representing the end of the previously found match or, if there is no match yet (when we have just started searching), the beginning of the string (same as ^)
\s* - represents zero or more whitespaces (\s represents whitespace, * "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - let's split it into parts (at runtime maxLenght will hold some numeric value like 10, so the regex will see it as .{1,10})
. represents any character (by default any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it can now represent any character; you may drop this flag argument if you want to start splitting each sentence separately, since its start will be printed on a new line anyway)
{1,10} - a quantifier that lets the previously described element appear 1 to 10 times (by default it tries to find the maximal number of matching repetitions),
.{1,10} - so based on what we said just now, it simply represents "1 to 10 of any characters"
( ) - parentheses create groups, structures that let us hold specific parts of the match (here we added the parentheses after \\s* because we only want to use the part after the whitespace)
(?=\\s|$) - a look-ahead that makes sure the text matched by .{1,10} is followed by:
whitespace (\\s)
OR (written as |)
the end of the string ($).
So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
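To tie the explanation back to the original question, the same regex can be wrapped in a method that returns the chunks as a list. This is a sketch only: the class and method names are invented here, and it keeps the answer's assumption that no word is longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapDemo {
    // Hypothetical helper: splits `text` into chunks of at most `maxLength`
    // characters, breaking only at whitespace. Matching stops at the first
    // word that is longer than `maxLength`.
    static List<String> wrap(String text, int maxLength) {
        Pattern p = Pattern.compile("\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> lines = new ArrayList<>();
        while (m.find()) {
            lines.add(m.group(1)); // group 1 excludes the leading whitespace
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(wrap("hello world", 7));
    }
}
```

For the "hello world" example with a limit of 7, this yields the two chunks "hello" and "world" (without the trailing space the question shows, because the whitespace is consumed outside the group).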
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
StringBuilder justifiedText = new StringBuilder();
StringBuilder justifiedLine = new StringBuilder();
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++) {
justifiedLine.append(words[i]).append(" ");
if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
justifiedLine.deleteCharAt(justifiedLine.length() - 1);
justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
justifiedLine = new StringBuilder();
}
}
return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version, which just stops processing when it finds supercalifragilisticexpialidocious).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

How to find all occurrences of a substring (with wildcards allowed) in a given String

I'm searching for an efficient way for a wildcard-enabled search in Java. My first approach was of course to use regex. However this approach does NOT find ALL possible matches!
Here's the code:
public static ArrayList<StringOccurrence> matchesWildcard(String string, String pattern, boolean printToConsole) {
Pattern p = Pattern.compile(normalizeWildcards(pattern));
Matcher m = p.matcher(string);
ArrayList<StringOccurrence> res = new ArrayList<StringOccurrence>();
int count = 0;
while (m.find()){
res.add(new StringOccurrence(m.start(), m.end(), count, m.group()));
if(printToConsole)
System.out.println(count + ") " + m.group() + ", " + m.start() + ", " + m.end());
count +=1;
}
return res;
}
For a query q: ab*b and a String str: abbccabbccbbb I get the output:
0) abb, 0, 3
1) abb, 5, 8
But the whole String should be also a result, because it matches the pattern. It seems that the Java-implementation of regex starts each new search after the last match...
Any ideas how this could work (or suggestions for frameworks...)?
If you really need all possible matches, this answer is not useful for you (though other users may find it useful).
If the widest match is sufficient for you, then use a greedy quantifier (I guess you're using a reluctant one; showing your pattern would be useful).
Google for greedy vs reluctant quantifiers for regex.
Cheers.
ab*b means "a" followed by zero or more "b" followed by a "b". The minimum match would be "ab". Sounds like you're looking for something like: a[a-z]*b where [a-z]* indicates zero or more of any lowercase letter. You may also want to bound it so that the start of the "word" must be an "a" and the end must be a "b": \ba[a-z]*b\b
You are expecting * to mean .* and .*? at the same time (and more).
You should reconsider what you really need. Let's extend your example:
abbccabbccbbbcabb
Do you really want all possibilities?
To achieve what you want you'll have to
iterate p1 over all occurrences of "ab"
from p1+2 on
iterate p2 over all occurrences of "b"
output substring between p1 and p2+1
This is the corresponding Java code:
public static void main( String[] args ){
String s = "abbccabbccbbb";
int f1 = 0;
int p1;
while( (p1 = s.indexOf( "ab", f1 )) >= 0 ){
int f2 = p1 + 2;
int p2;
while( (p2 = s.indexOf( "b", f2 )) >= 0 ){
System.out.println( s.substring( p1, p2 + 1 ) + " " + p1 + ":" + (p2 + 1) );
f2 = p2 + 1;
}
f1 = p1 + 2;
}
}
Below is the output. You may be surprised - maybe that's more than you expect, but then you'll need to refine your specification.
abb 0:3
abbccab 0:7
abbccabb 0:8
abbccabbccb 0:11
abbccabbccbb 0:12
abbccabbccbbb 0:13
abb 5:8
abbccb 5:11
abbccbb 5:12
abbccbbb 5:13
Later
Why is a single regular expression not capable of doing it?
The basic mechanism of pattern matching is to try and match the regex against a string, starting at some position, initially 0. If a match is found, this position is advanced according to the matched string. The pattern matcher never looks back.
A pattern ab.*?b will try and find the next 'b' after an "ab". This means that no match is possible beginning with the same "ab" and ending at some 'b' following that previously found "next 'b'".
In other words: one regex cannot find overlapping substrings.
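A single find() loop cannot return overlapping matches, but a regex can still be used as a yes/no test on every candidate substring. The sketch below is a brute-force illustration of that idea, not something proposed in the answer: the class and method names are invented, and it assumes the wildcard pattern has already been translated to a regex such as ab.*b. It tries every start/end pair, so the number of match attempts grows quadratically with the input length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesDemo {
    // Hypothetical helper: returns every substring of `s` that the regex
    // matches in full, including overlapping ones.
    static List<String> allMatches(String s, String regex) {
        Matcher m = Pattern.compile(regex).matcher(s);
        List<String> out = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int end = start + 1; end <= s.length(); end++) {
                // region() restricts the matcher to [start, end); matches()
                // then requires the whole region to match the pattern.
                if (m.region(start, end).matches()) {
                    out.add(s.substring(start, end));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String match : allMatches("abbccabbccbbb", "ab.*b")) {
            System.out.println(match);
        }
    }
}
```

For the example string this finds the same ten substrings as the indexOf loops above, in the same order.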

Discard the leading and trailing series of a character, but retain the same character otherwise

I have to process a string with the following rules:
It may or may not start with a series of '.
It may or may not end with a series of '.
Whatever is enclosed between the above should be extracted. However, the enclosed string also may or may not contain a series of '.
For example, I can get following strings as input:
''''aa''''
''''aa
aa''''
''''aa''bb''cc''''
For the above examples, I would like to extract the following from them (respectively):
aa
aa
aa
aa''bb''cc
I tried the following code in Java:
Pattern p = Pattern.compile("[^']+(.+'*.+)[^']*");
Matcher m = p.matcher("''''aa''bb''cc''''");
while (m.find()) {
int count = m.groupCount();
System.out.println("count = " + count);
for (int i = 0; i <= count; i++) {
System.out.println("-> " + m.group(i));
}
}
But I get the following output:
count = 1
-> aa''bb''cc''''
-> ''bb''cc''''
Any pointers?
EDIT: Never mind, I was using a * at the end of my regex, instead of +. Doing this change gives me the desired output. But I would still welcome any improvements for the regex.
This one works for me.
String str = "''''aa''bb''cc''''";
Pattern p = Pattern.compile("^'*(.*?)'*$");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
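To check this pattern against all four inputs from the question, it can be wrapped in a small helper. The class and method names below are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripQuotesDemo {
    private static final Pattern EDGE_QUOTES = Pattern.compile("^'*(.*?)'*$");

    // Hypothetical helper: drops leading and trailing runs of ' but keeps
    // any ' characters that appear between other characters.
    static String strip(String s) {
        Matcher m = EDGE_QUOTES.matcher(s);
        return m.find() ? m.group(1) : s;
    }

    public static void main(String[] args) {
        String[] inputs = { "''''aa''''", "''''aa", "aa''''", "''''aa''bb''cc''''" };
        for (String input : inputs) {
            System.out.println(strip(input));
        }
    }
}
```

The reluctant (.*?) is what keeps the trailing quotes out of the group: it grows only until the remaining '*$ can match the rest of the input.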
Have a look at the boundary matchers of Java's Pattern class (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html). Especially $ (= end of a line) might be interesting. I also recommend the following Eclipse plugin for regex testing: http://sourceforge.net/projects/quickrex/ It lets you see exactly what the match and the groups of your regex will be for a given test string.
E.g. try the following pattern: [^']+(.+'*.+)+[^'$]
I'm not that good in Java, so I hope the regex is sufficient. For your examples, it works well.
s/^'*(.+?)'*$/$1/gm

This RegEx captures wrong number of groups

I have to parse a string and capture some values:
FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE
I want to capture 2 groups:
grp 1: 2, 2
grp 2: TU, WE
The Numbers represents intervals. TU, WE represents weekdays. I need both.
I'm using this code:
private final static java.util.regex.Pattern regBYDAY = java.util.regex.Pattern.compile(".*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*");
String rrule = "FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE";
java.util.regex.Matcher result = regBYDAY.matcher(rrule);
if (result.matches())
{
int grpCount = result.groupCount();
for (int i = 1; i < grpCount; i++)
{
String g = result.group(i);
...
}
}
grpCount == 2 - why? If I read the java documentation correctly (that little bit) I should get 5? 0 = the whole expression, 1,2,3,4 = my captures 2,2,TU and WE.
result.group(1) == "2";
I'm a C# Programmer with very little java experience so I tested the RegEx in the "Regular Expression Workbench" - a great C# Program for testing RegEx. There my RegEx works fine.
https://code.msdn.microsoft.com/RegexWorkbench
RegExWB:
.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*
Matching:
FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR
1 => 22
1 => -2
1 => +223
2 => TU
2 => WE
2 => FR
You may also use this approach to improve readability and, up to a certain point, independence from the implementation by using a more common regexp subset:
final Pattern re1 = Pattern.compile(".*;BYDAY=(.*)");
final Pattern re2 = Pattern.compile("(?:([+-]?[0-9]*)([A-Z]{2}),?)");
final Matcher matcher1 = re1.matcher(rrule);
if ( matcher1.matches() ) {
final String group1 = matcher1.group(1);
Matcher matcher2 = re2.matcher(group1);
while(matcher2.find()) {
System.out.println("group: " + matcher2.group(1) + " " +
matcher2.group(2));
}
}
Your regex works the same in Java as it does in C#; it's just that in Java you can only access the final capture for each group. In fact, .NET is one of only two regex flavors I know of that let you retrieve intermediate captures (Perl 6 being the other).
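The "final capture only" behaviour can be seen directly with the pattern and input from the question. The following is a small sketch; the class and method names are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Hypothetical helper: reports what Java exposes for the question's
    // pattern after a successful matches() call.
    static String describe(String rrule) {
        Pattern p = Pattern.compile(".*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*");
        Matcher m = p.matcher(rrule);
        if (!m.matches()) {
            return "no match";
        }
        // groupCount() counts the capturing groups in the pattern (group 0
        // excluded), not how many times they participated in the match.
        return "groupCount=" + m.groupCount()
                + ", group(1)=" + m.group(1)
                + ", group(2)=" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE"));
    }
}
```

Each repetition of the outer group overwrites groups 1 and 2, so only the values from the last repetition (here 2 and WE) remain available.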
This is probably the simplest way to do what you want in Java:
String s= "FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR";
Pattern p = Pattern.compile("(?:;BYDAY=|,)([+-]?[0-9]+)([A-Z]{2})");
Matcher m = p.matcher(s);
while (m.find())
{
System.out.printf("Interval: %5s, Day of Week: %s%n",
m.group(1), m.group(2));
}
Here's the equivalent C# code, in case you're interested:
string s = "FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR";
Regex r = new Regex(@"(?:;BYDAY=|,)([+-]?[0-9]+)([A-Z]{2})");
foreach (Match m in r.Matches(s))
{
Console.WriteLine("Interval: {0,5}, Day of Week: {1}",
m.Groups[1], m.Groups[2]);
}
I'm a bit rusty, but I'll propose two caveats. First of all, regexps come in various dialects. There is a fantastic O'Reilly book about this, but there is a chance that your C# utility applies slightly different rules.
As an example, I used a similar (but different) tool and discovered that it parsed things differently...
First of all, it rejected your regexp (maybe a typo?): the initial "*" does not make sense unless you put a dot (.) in front of it. Like this:
.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*
Now it was accepted, but it "matched" only the 2/WE part, and "skipped" the 2/TU pair.
(I suggest you read about greedy and non-greedy matching to understand this a bit better.)
Therefore I updated your pattern as follows:
.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?),(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*
And now it works and correctly captures 2,TU,2 and WE.
Maybe this helps?
