How can I find overlapping sets of words with regex expression? - java

Right now I have a regex that looks like "\\w+ \\w+" to find two-word phrases, but the matches do not overlap. For example, if my sentence is "The dog ran inside", the output is "The dog", "ran inside" when I need "The dog", "dog ran", "ran inside". I know there's a way to do this, but I'm too new to regular expressions to know how.
Thanks!

You can do this with a lookahead, a capturing group and a word boundary anchor:
Pattern regex = Pattern.compile("\\b(?=(\\w+ \\w+))");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group(1));
}
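For reference, a self-contained version of the lookahead approach might look like the sketch below. The class name `OverlappingPairs` and the method `overlappingPairs` are made up for illustration; only `Pattern` and `Matcher` come from `java.util.regex`. Because the lookahead matches without consuming any characters, every match is zero-length and the captured group is free to overlap the previous one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class OverlappingPairs {
    // \b anchors at a word boundary; the lookahead captures two words without consuming them.
    private static final Pattern PAIR = Pattern.compile("\\b(?=(\\w+ \\w+))");

    static List<String> overlappingPairs(String text) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            pairs.add(m.group(1)); // group 1 is the text seen by the lookahead
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints [The dog, dog ran, ran inside]
        System.out.println(overlappingPairs("The dog ran inside"));
    }
}
```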

A regex match can't consume the same characters twice ("dog" can't be part of two separate matches), so a plain pattern like yours can't return overlapping phrases without lookaround tricks. Something like this doesn't need regex at all; you can simply split the string by spaces and combine the words however you like:
>>> words = "The dog ran inside".split(" ")
>>> [" ".join(words[i:i+2]) for i in range(len(words)-1)]
['The dog', 'dog ran', 'ran inside']
If that doesn't solve your problem please provide more details about what exactly you're trying to accomplish.

Use a lookahead to get the second word, then concatenate the non-lookahead part with the lookahead part.
# This is Perl. The important bits:
#
# $1 is what the first parens captured.
# $2 is what the second parens captured.
# . is the concatenation operator (like Java's "+").
while (/(\w+)(?=(\s+\w+))/g) {
    my $phrase = $1 . $2;
    ...
}
Sorry, don't know enough Java, but this should be easy enough to do in Java too.
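A rough Java port of the Perl loop above could look like this. It is only a sketch: the class `LookaheadConcat` and the method `phrases` are illustrative names, and the pattern is the same `(\w+)(?=(\s+\w+))` with Java string escaping. Group 1 is the consumed word and group 2 is the whitespace plus the next word seen by the lookahead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; the regex is the one from the Perl snippet.
class LookaheadConcat {
    private static final Pattern WORD_THEN_NEXT = Pattern.compile("(\\w+)(?=(\\s+\\w+))");

    static List<String> phrases(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD_THEN_NEXT.matcher(text);
        while (m.find()) {
            // Only group 1 was consumed, so the next find() starts right after it
            // and the word inside group 2 can be matched again as group 1.
            result.add(m.group(1) + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [The dog, dog ran, ran inside]
        System.out.println(phrases("The dog ran inside"));
    }
}
```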

The easy way (and the faster one for a big String) is to use split:
final String[] arrStr = "The dog ran inside".split(" ");
for (int i = 0, n = arrStr.length - 1; i < n; i++) {
    System.out.format("%s %s%n", arrStr[i], arrStr[i + 1]);
}
Output:
The dog
dog ran
ran inside
I did not find a trick for this with regex.

Related

Java - matcher re-reading words

I'm trying to create a lexical analyzer for Delphi using java. Here's the sample code:
String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"};
String[] relation={"=","<>","<",">","<=",">="};
String[] logical={"and","not","or","xor"};
Matcher matcher = null;
for(int i=0;i<keywords.length;i++){
    matcher=Pattern.compile(keywords[i]).matcher(line);
    if(matcher.find()){
        System.out.println("Keyword"+"\t\t"+matcher.group());
    }
}
for(int i1=0;i1<logical.length;i1++){
    matcher=Pattern.compile(logical[i1]).matcher(line);
    if(matcher.find()){
        System.out.println("logic_op"+"\t\t"+matcher.group());
    }
}
for(int i2=0;i2<relation.length;i2++){
    matcher=Pattern.compile(relation[i2]).matcher(line);
    if(matcher.find()){
        System.out.println("relational_op"+"\t\t"+matcher.group());
    }
}
So, when I run the program, it works, but it re-reads certain words and reports two tokens for them. For example, record is a keyword, but the program also finds the logical operator or inside it (rec"or"d). How can I stop words from being re-read? Thanks!

Add \b to your regular expressions for breaks between words. So:
Pattern.compile("\\b" + keywords[i] + "\\b")
will ensure that the characters on either side of your keyword aren't word characters (letters, digits or underscores).
This way the pattern for "or" will only match the standalone word "or", not the "or" inside "record".
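To see the effect of the word boundary in isolation, here is a minimal sketch. The class `WordBoundaryDemo` and the method `containsWord` are illustrative names, and wrapping the keyword in `Pattern.quote` is an extra precaution that is not part of the answer above. Note that `\b` only helps for word-like tokens such as keywords, not for operators like `<>`.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class WordBoundaryDemo {
    // True if 'word' occurs in 'line' as a whole word, not as part of a longer one.
    static boolean containsWord(String line, String word) {
        // Pattern.quote is an added safety measure (assumption): it keeps any
        // regex metacharacters in 'word' from being interpreted.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("record r;", "or"));     // prints false
        System.out.println(containsWord("a or b", "or"));        // prints true
        System.out.println(containsWord("record r;", "record")); // prints true
    }
}
```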

As mentioned in answer by EvanM, you need to add a \b word boundary matcher before and after the keyword, to prevent substring matching within a word.
For better performance, you should also use the | logical regex operator to match one of many values, instead of creating multiple matchers, so you only have to scan the line once, and only have to compile one regex.
You can even combine the 3 different kinds of token you are looking for in a single regex, and use capture groups to differentiate them, so you only have to scan the line once in total.
Like this:
String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" +
"|(=|<[>=]?|>=?)" +
"|\\b(and|not|or|xor)\\b";
for (Matcher m = Pattern.compile(regex).matcher(line); m.find(); ) {
    if (m.start(1) != -1) {
        System.out.println("Keyword\t\t" + m.group(1));
    } else if (m.start(2) != -1) {
        // Group 2 holds the relational operators (=, <>, <, >, <=, >=).
        System.out.println("relational_op\t\t" + m.group(2));
    } else {
        // Group 3 holds the logical operators (and, not, or, xor).
        System.out.println("logic_op\t\t" + m.group(3));
    }
}
You can even optimize it further by combining keywords with common prefixes, e.g. as|asm could become asm?, i.e. as optionally followed by m. This makes the keyword list less readable, but it would perform better.
In the code above, I did that for the relational operators, to show how, and also to fix a matching error in the original code, where >= in the line would show up 3 times as =, >, >= in that order, a problem similar to the sub-keyword problem asked about in the question.
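The relational-operator alternation can be tried on its own. The sketch below uses the same pattern, `=|<[>=]?|>=?`; the class `RelationalOps` and the method `operators` are names invented for this example. Each operator is reported once, as its longest form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; only the pattern comes from the answer above.
class RelationalOps {
    // '<' optionally followed by '>' or '=', and '>' optionally followed by '=',
    // so "<=", "<>" and ">=" are each matched as a single token.
    private static final Pattern RELATION = Pattern.compile("=|<[>=]?|>=?");

    static List<String> operators(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = RELATION.matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [>=, <>]
        System.out.println(operators("if a >= b and c <> d"));
    }
}
```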

Java - Regex Match Multiple Words

Let's say that you want to match a string with the following regex:
".when is (\w+)." - I am trying to get the event after 'when is'.
I can get the event with matcher.group(index), but this doesn't work if the event is something like Veteran's Day, since it is two words. I only get the first word after 'when is'.
What regex should I use to get all of the words after 'when is'?
Also, let's say I want to capture someone's birthday, like
'when is * birthday
How do I capture all of the text between is and birthday with regex?

You could try this:
^when is (.*)$
This will find a string that starts with when is and capture everything else to the end of the line.
The regex has one capturing group: group 0 is the whole match and group 1 is the captured text. You can access them like so:
String line = "when is Veteran's Day.";
Pattern pattern = Pattern.compile("^when is (.*)$");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println("group 0: " + matcher.group(0));
    System.out.println("group 1: " + matcher.group(1));
}
And the output should be:
group 0: when is Veteran's Day.
group 1: Veteran's Day.
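The second part of the question, capturing the text between "is" and "birthday", can be handled in a similar way. The following is one possible sketch, not the only way to do it: the class `BirthdayCapture`, the method `whoseBirthday`, and the sample sentence are all made up for illustration. It uses a reluctant quantifier so that the capture stops at the first " birthday".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class BirthdayCapture {
    // .+? is reluctant, so it stops at the first " birthday" that follows.
    private static final Pattern WHOSE = Pattern.compile("when is (.+?) birthday");

    static Optional<String> whoseBirthday(String line) {
        Matcher m = WHOSE.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints my mom's
        System.out.println(whoseBirthday("when is my mom's birthday?").orElse("(no match)"));
    }
}
```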

If you want to allow whitespace to be matched, you should explicitly allow whitespace.
([\w\s]+)
However, roydukkey's solution will work if you want to capture everything after when is.

Don't use regular expressions when you don't need to! Although the idea of having a pattern string do the work for you is appealing, a regex is often heavier than necessary for simple use cases.
If you are trying to get the word after "when is", ending at a space, you could do something like this:
String start = "when is ";
String end = " ";
int startLocation = fullString.indexOf(start) + start.length();
String afterStart = fullString.substring(startLocation, fullString.length());
String word = afterStart.substring(0, afterStart.indexOf(end));
If you know the last word is Day, you can set end = "Day" and add the length of that string to the end index of the second substring.
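A runnable variant of the indexOf approach might look like the sketch below. The class `AfterPrefix` and the method `wordAfter` are hypothetical names, and the two guards (prefix not found, no space after the word) are additions that the snippet above does not have.

```java
import java.util.Optional;

// Hypothetical helper; a sketch of the indexOf/substring idea with guards added.
class AfterPrefix {
    // Returns the word that follows 'start' in 'fullString', up to the next space.
    static Optional<String> wordAfter(String fullString, String start) {
        int at = fullString.indexOf(start);
        if (at < 0) {
            return Optional.empty(); // prefix not present
        }
        int from = at + start.length();
        int to = fullString.indexOf(' ', from);
        // If no space follows, the word runs to the end of the string.
        return Optional.of(to < 0 ? fullString.substring(from) : fullString.substring(from, to));
    }

    public static void main(String[] args) {
        // prints Christmas
        System.out.println(wordAfter("when is Christmas this year", "when is ").orElse("(none)"));
    }
}
```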

You can express this as a character class and include spaces in it: when is ([\w ]+).

\w only includes word characters, which doesn't include spaces. Use [\w ]+ instead.
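One caveat for the Veteran's Day example: the apostrophe is not a word character either, so it has to be added to the class as well. The sketch below compares the two classes; `EventAfterWhenIs` and `capture` are illustrative names, and `[\w' ]+` is a suggested extension rather than something taken from the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing two character classes.
class EventAfterWhenIs {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String capture(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "when is Veteran's Day.";
        // [\w ]+ stops at the apostrophe.
        System.out.println(capture("when is ([\\w ]+)", line));  // prints Veteran
        // Adding ' to the class keeps the whole event name.
        System.out.println(capture("when is ([\\w' ]+)", line)); // prints Veteran's Day
    }
}
```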

Match exactly N repetitions of the same character

How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expressions should match aaaa and cccc and not bbbb.
I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.

Use negative lookahead and negative lookbehind.
This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1) except that Python's re module is broken (see this link).
English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
Unfortunately, the re module (and most regular expression engines) are broken, in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, like in this case, the backref is of constant length). We have to handhold the regex compiler through this, as so:
The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"
This works around that bug in the re module by using (?=\1).. instead of \1. (these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.
Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]

I suspect you want to be using negative lookahead: (.)\1{N-1}(?!\1).
But that said... I suspect the simplest cross-language solution is to just write it yourself without using regexes.
UPDATE:
^(.)\\1{3}(?!\\1)|(.)(?<!(?=\\2)..)\\2{3}(?!\\2) works for me more generally, including matches starting at the beginning of the string.

It is easy to put too much burden onto regular expressions and try to get them to do everything, when just nearly everything will do!
Use a regex to find all substrings consisting of a single character, and then check their length separately, like this:
use strict;
use warnings;
my $str = 'xxaaaayyybbbbbzzccccxx';
while ( $str =~ /((.)\2*)/g ) {
    next unless length $1 == 4;
    my $substr = $1;
    print "$substr\n";
}
Output:
aaaa
cccc

Perl’s regex engine does not support variable-length lookbehind, so we have to be deliberate about it.
sub runs_of_length {
    my($n,$str) = @_;
    my $n_minus_1 = $n - 1;
    my $_run_pattern = qr/
        (?:
            # In the middle of the string, we have to force the
            # run being matched to start on a new character.
            # Otherwise, the regex engine will give a false positive
            # by starting in the middle of a run.
            (.) ((?!\1).) (\2{$n_minus_1}) (?!\2) |
            #$1 $2 $3
            # Don't forget about a potential run that starts at
            # the front of the target string.
            ^(.) (\4{$n_minus_1}) (?!\4)
            # $4 $5
        )
    /x;
    my @runs;
    while ($str =~ /$_run_pattern/g) {
        push @runs, defined $4 ? "$4$5" : "$2$3";
    }
    @runs;
}
A few test cases:
my @tests = (
    "xxaaaayyybbbbbzzccccxx",
    "aaaayyybbbbbzzccccxx",
    "xxaaaa",
    "aaaa",
    "",
);
$" = "][";
for (@tests) {
    my @runs = runs_of_length 4, $_;
    print qq<"$_":\n>,
        " - [@runs]\n";
}
Output:
"xxaaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"aaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"xxaaaa":
- [aaaa]
"aaaa":
- [aaaa]
"":
- []
It’s a fun puzzle, but your regex-averse colleagues will likely be unhappy if such a construction shows up in production code.

How about this in Python?
def match(string, n):
    parts = []
    current = None
    for c in string:
        if not current:
            current = c
        else:
            if c == current[-1]:
                current += c
            else:
                parts.append(current)
                current = c
    # Append the final run, which the loop itself never flushes.
    if current:
        parts.append(current)
    result = []
    for part in parts:
        if len(part) == n:
            result.append(part)
    return result
Testing with your string with various sizes:
match("xxaaaayyybbbbbzzccccxx", 6) = []
match("xxaaaayyybbbbbzzccccxx", 5) = ['bbbbb']
match("xxaaaayyybbbbbzzccccxx", 4) = ['aaaa', 'cccc']
match("xxaaaayyybbbbbzzccccxx", 3) = ['yyy']
match("xxaaaayyybbbbbzzccccxx", 2) = ['xx', 'zz', 'xx']
Explanation:
The first loop basically splits the text into parts, like so: ["xx", "aaaa", "yyy", "bbbbb", "zz", "cccc", "xx"]. Then the second loop tests those parts for their length. In the end the function only returns the parts that have the current length. I'm not the best at explaining code, so anyone is free to enhance this explanation if needed.
Anyways, I think this'll do!

Why not leave the regex engine to do what it does best, finding the longest run of the same symbol, and then check the length yourself?
In Perl:
my $str = 'xxaaaayyybbbbbzzccccxx';
while($str =~ /(.)\1{3,}/g){
    if(($+[0] - $-[0]) == 4){ # compute the full match length here in whatever way your language provides
        print (($1 x 4), "\n")
    }
}

>>> import itertools
>>> zz = 'xxaaaayyybbbbbzzccccxxaa'
>>> z = [''.join(grp) for key, grp in itertools.groupby(zz)]
>>> z
['xx', 'aaaa', 'yyy', 'bbbbb', 'zz', 'cccc', 'xx', 'aa']
From there you can iterate through the list and check for occasions when N==4 very easily, like this:
>>> [item for item in z if len(item)==4]
['aaaa', 'cccc']

In Java we can do it like this:
String test = "xxaaaayyybbbbbzzccccxx uuuuuutttttttt";
int trimLength = 4; // required length of the run of identical characters
Pattern p = Pattern.compile("(\\w)\\1+", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher m = p.matcher(test);
while (m.find()) {
    if (m.group().length() == trimLength) {
        System.out.println("Same Characters String " + m.group());
    }
}

Java regex split on whitespace not preceded or followed by single or double quotes

I can't get this to work..
I have a String which I want to split on spaces. However, I do not want to split inside string literals, that is, text which is inside double or single quotes.
Example
Splitting the following string:
private String words = " Hello, today is nice " ;
..should produce the following tokens:
private
String
words
=
" Hello, today is nice "
;
What kind of regex can I use for this?

The regex ([^ "]+)|("[^"]*") should match all the tokens. Drawing on my limited knowledge of Java and http://www.regular-expressions.info/java.html, you should be able to do something like this:
// Please excuse any syntax errors, I'm used to C#
Pattern pattern = Pattern.compile("([^ \"]+)|(\"[^\"]*\")");
Matcher matcher = pattern.matcher(theString);
while (matcher.find())
{
    // do something with matcher.group();
}

Have you tried this?
((['"]).*?\2|\S+)
Here is what it does:
( <= Group everything
(['"]) <= Find a single or double quote
.*? <= Capture everything after the quote (ungreedy)
\2 <= Find the single or double quote (same as we had before)
| <= Or
\S+ <= Non space characters (one at least)
)
On another note, if you want to create a parser, write a parser and don't use regexes.
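For a quick check in Java, the pattern above can be driven with `Matcher.find()` instead of `split`. This is only a sketch under a few assumptions: the class `QuoteAwareSplit` and the method `tokens` are invented names, the outer group is dropped so the backreference becomes `\1`, and escaped quotes inside a literal are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; the pattern is adapted from the answer above.
class QuoteAwareSplit {
    // Either a quoted run (the closing quote must equal the opening one)
    // or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("(['\"]).*?\\1|\\S+");

    static List<String> tokens(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the six tokens from the question, one per line.
        for (String t : tokens("private String words = \" Hello, today is nice \" ;")) {
            System.out.println(t);
        }
    }
}
```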

Extract words out of a text file

Let's say you have a text file like this one:
http://www.gutenberg.org/files/17921/17921-8.txt
Does anyone have a good algorithm, or open-source code, to extract words from a text file?
How can I get all the words while avoiding special characters and keeping things like "it's"?
I'm working in Java.
Thanks

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:
String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);
while ( m.find() ) {
    System.out.println(input.substring(m.start(), m.end()));
}
The pattern [\w']+ matches all word characters, and the apostrophe, multiple times. The example string would be printed word-by-word. Have a look at the Java Pattern class documentation to read more.

Here's one approach to your problem:
This function receives your text as input and returns a list of all the words in it. Note that it treats every character that is not a letter or digit as a separator, so "it's" is split into "it" and "s".
private ArrayList<String> get_Words(String SInput){
    StringBuilder stringBuffer = new StringBuilder(SInput);
    ArrayList<String> all_Words_List = new ArrayList<String>();
    String SWord = "";
    for(int i=0; i<stringBuffer.length(); i++){
        Character charAt = stringBuffer.charAt(i);
        if(Character.isAlphabetic(charAt) || Character.isDigit(charAt)){
            SWord = SWord + charAt;
        }
        else{
            if(!SWord.isEmpty()) all_Words_List.add(new String(SWord));
            SWord = "";
        }
    }
    // Flush the last word when the text ends with a letter or digit.
    if(!SWord.isEmpty()) all_Words_List.add(SWord);
    return all_Words_List;
}

Pseudocode would look like this:
create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right
The Python code would be something like this:
words = input.split()
words = [word.strip(PUNCTUATION) for word in words]
where
PUNCTUATION = ",. \n\t\\\"'][#*:"
or any other characters you want to remove.
I believe Java has equivalent functions in the String class, such as String.split().
Output of running this code on the text you provided in your link:
>>> print words[:100]
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis',
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for',
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and',
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may',
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under',
... etc etc.
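A Java counterpart of the split-and-strip idea could look like the sketch below. The class `StripPunctuation` and the method `words` are hypothetical names. Java's `String` has no direct equivalent of Python's `strip(chars)`, so this sketch swaps in `replaceAll` with the `\p{Punct}` class (ASCII punctuation) instead of the custom PUNCTUATION string used above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a sketch of "split on whitespace, then trim punctuation".
class StripPunctuation {
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Remove punctuation only at the start and end, so "it's" and "re-use" survive.
            String word = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Well, it's, rather, short]
        System.out.println(words("Well, it's rather short."));
    }
}
```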

Basically, you want to match
([A-Za-z])+('([A-Za-z])*)?
right?

You could try a regex: write a pattern for a word, run it over the text, and count the number of times the pattern is found.
