I'm making a Lexer, and have chosen to use Regex to split my tokens.
I'm working through all the different token types, but the one that really bugs me is words and identifiers.
You see, the rules I have in place are the following:
Words cannot start with or end with an underscore.
Words can be one or more characters in length.
Underscores can only be used between letters, and can appear multiple times.
Example of what I want:
_foo <- Invalid.
foo_ <- Invalid.
_foo_ <- Invalid.
foo_foo <- Valid.
foo_foo_foo <- Valid.
foo_foo_ <- Partially Valid. Only "foo_foo" should be picked up.
_foo_foo <- Partially Valid. Only "foo_foo" should be picked up.
I'm getting close, as this is what I currently have:
([a-zA-Z]+_[a-zA-Z]+|[a-zA-Z]+)
Except it only detects the first occurrence of an underscore. I want all of them.
Personal Request:
I would rather the answer be contained inside a single group, as I have structured my tokeniser around groups, though I would be more than happy to change my design if you can think of a better way of handling it. This is what I currently use:
private void tokenise(String regex, String[] data) {
    Set<String> tokens = new LinkedHashSet<String>();
    Pattern pattern = Pattern.compile(regex);
    // First pass. Uses regular expressions to split data and catalog token types.
    for (String line : data) {
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            // Capture groups are numbered from 1; group 0 is the whole match.
            for (int i = 1; i < matcher.groupCount() + 1; i++) {
                if (matcher.group(i) != null) {
                    switch (i) {
                        case (1):
                            // Example group.
                            // Normally I would structure like:
                            // 1: Identifiers
                            // 2: Strings
                            // 3-?: So on so forth.
                            tokens.add("FOO:" + matcher.group(i));
                            break;
                    }
                }
            }
        }
    }
}
Try ([a-zA-Z]+(?:_[a-zA-Z]+)*)
The first part of the pattern, [a-zA-Z]+, matches one or more letters.
The second part of the pattern, (?:_[a-zA-Z]+), matches an underscore only if it is followed by one or more letters.
The * at the end means the second part can be repeated zero or more times.
The (?: ) is a non-capturing group: it groups like plain (), but does not create a numbered capture group.
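As a quick check of this pattern against the examples from the question, here is a minimal sketch. The class name IdentifierDemo and the helper identifiers are made up for illustration; they are not part of the asker's tokeniser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierDemo {
    // Letters, optionally followed by any number of "_letters" segments.
    private static final Pattern IDENTIFIER =
            Pattern.compile("([a-zA-Z]+(?:_[a-zA-Z]+)*)");

    // Returns every identifier found in the line, in order of appearance.
    static List<String> identifiers(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(line);
        while (m.find()) {
            found.add(m.group(1)); // the single capture group holds the whole identifier
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {"_foo", "foo_", "foo_foo_foo", "foo_foo_", "_foo_foo"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + identifiers(sample));
        }
    }
}
```

Note that for _foo and foo_ the pattern still picks up foo: it extracts the valid core of the input rather than rejecting it. If such input must be reported as an error, that check has to happen elsewhere in the lexer.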
I'm trying to create a lexical analyzer for Delphi using Java. Here's the sample code:
String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"};
String[] relation={"=","<>","<",">","<=",">="};
String[] logical={"and","not","or","xor"};
Matcher matcher = null;
for (int i = 0; i < keywords.length; i++) {
    matcher = Pattern.compile(keywords[i]).matcher(line);
    if (matcher.find()) {
        System.out.println("Keyword" + "\t\t" + matcher.group());
    }
}
for (int i1 = 0; i1 < logical.length; i1++) {
    matcher = Pattern.compile(logical[i1]).matcher(line);
    if (matcher.find()) {
        System.out.println("logic_op" + "\t\t" + matcher.group());
    }
}
for (int i2 = 0; i2 < relation.length; i2++) {
    matcher = Pattern.compile(relation[i2]).matcher(line);
    if (matcher.find()) {
        System.out.println("relational_op" + "\t\t" + matcher.group());
    }
}
When I run the program it works, but it re-reads certain words and reports them as two tokens. For example, record is a keyword, but the program scans it again and also reports the logical operator or, taken from rec"or"d. How can I prevent words from being re-read like this? Thanks!
Add \b to your regular expressions for breaks between words. So:
Pattern.compile("\\b" + keywords[i] + "\\b")
will ensure that the characters on either side of your word aren't word characters (letters, digits, or underscores).
This way "record" will only match "record", not "or".
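To see the effect of \b in isolation, here is a small sketch. The class and method names are hypothetical, and Pattern.quote is used only as a precaution in case the word contains regex metacharacters.

```java
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // True if `word` occurs in `line` as a whole word, not as part of a longer one.
    static boolean containsWholeWord(String line, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b")
                .matcher(line)
                .find();
    }

    public static void main(String[] args) {
        // Without boundaries, "or" is found inside "record".
        System.out.println(Pattern.compile("or").matcher("record").find()); // true
        // With boundaries it is not.
        System.out.println(containsWholeWord("record", "or"));  // false
        System.out.println(containsWholeWord("a or b", "or"));  // true
        // An underscore is a word character, so there is no boundary after "or" here.
        System.out.println(containsWholeWord("or_else", "or")); // false
    }
}
```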
As mentioned in the answer by EvanM, you need to add a \b word-boundary matcher before and after the keyword to prevent substring matching within a word.
For better performance, you should also use the | regex alternation operator to match one of many values instead of creating multiple matchers, so you only have to scan the line once and only have to compile one regex.
You can even combine the 3 different kinds of token you are looking for in a single regex, and use capture groups to differentiate them, so you only have to scan the line once in total.
Like this:
String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" +
"|(=|<[>=]?|>=?)" +
"|\\b(and|not|or|xor)\\b";
for (Matcher m = Pattern.compile(regex).matcher(line); m.find(); ) {
    if (m.start(1) != -1) {
        System.out.println("Keyword\t\t" + m.group(1));
    } else if (m.start(2) != -1) {
        // Group 2 is the relational-operator alternative.
        System.out.println("relational_op\t\t" + m.group(2));
    } else {
        // Group 3 is the logical-operator alternative.
        System.out.println("logic_op\t\t" + m.group(3));
    }
}
You can optimize further by merging keywords with common prefixes, e.g. as|asm could become asm?, i.e. as optionally followed by m. That makes the keyword list less readable, but may perform better.
In the code above, I did that for the relational ops, to show how, and also to fix a matching error in the original code, where >= in the line would show up 3 times, as =, > and >= in that order, a problem similar to the sub-keyword problem in the question.
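If hand-merging the alternatives feels error-prone, the combined regex can also be built from the arrays. The following is only a sketch under a few assumptions: the keyword list is shortened, the class name and group names (keyword, logic, relation) are invented for this example, and the relational operators are listed longest first so that >= is tried before >.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class DelphiTokenSketch {
    // Shortened lists, for illustration only.
    static final String[] KEYWORDS = {"begin", "end", "if", "record", "then"};
    static final String[] LOGICAL = {"and", "not", "or", "xor"};
    // Longer operators first, so ">=" is tried before ">".
    static final String[] RELATIONAL = {"<>", "<=", ">=", "=", "<", ">"};

    static final Pattern TOKEN = Pattern.compile(
            "\\b(?<keyword>" + String.join("|", KEYWORDS) + ")\\b"
            + "|\\b(?<logic>" + String.join("|", LOGICAL) + ")\\b"
            + "|(?<relation>" + quoted(RELATIONAL) + ")");

    // Quotes each literal so that characters like '<' or '=' are taken literally.
    static String quoted(String[] literals) {
        return Arrays.stream(literals)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
    }

    // Scans the line once and labels each token by the named group that matched.
    static List<String> classify(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group("keyword") != null) {
                out.add("Keyword " + m.group());
            } else if (m.group("logic") != null) {
                out.add("logic_op " + m.group());
            } else {
                out.add("relational_op " + m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(classify("record >= or"));
    }
}
```

Named groups avoid the bookkeeping of group numbers, which is where labels are easy to mix up when alternatives are reordered.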
I have two regexes. I want to delete all matches of second one if they are placed inside matches of first one. Basically, nothing can be matched in what was already matched. Example:
First regex (bold) - c\w+ finds words beginning with c
Second regex (underlined) - me finds me
Result: cam̲e̲l crim̲e̲ care cool m̲e̲dium m̲e̲lt hom̲e̲
The me in the c-words is matched too. What I want is: camel crime care cool m̲e̲dium m̲e̲lt hom̲e̲
Two results of the second regex lie inside results of the first regex. I want to delete them, or better, not match them at all. Here's what I tried:
String text = "camel crime care cool medium melt home";
static final Pattern PATTERN_FIRST = Pattern.compile("c\\w+");
static final Pattern PATTERN_SECOND = Pattern.compile("me");
// Save all matches
List<int[]> firstRegexMatches = new ArrayList<>();
for (Matcher m = PATTERN_FIRST.matcher(text); m.find();) {
    firstRegexMatches.add(new int[]{m.start(), m.end()});
}
List<int[]> secondRegexMatches = new ArrayList<>();
for (Matcher m = PATTERN_SECOND.matcher(text); m.find();) {
    secondRegexMatches.add(new int[]{m.start(), m.end()});
}
// Remove matches of second inside matches of first
for (int[] pos : firstRegexMatches) {
    Iterables.removeIf(secondRegexMatches, p -> p[0] > pos[0] && p[1] < pos[1]);
}
In this code I store all matches of both regexes in lists, then try to remove from the second list the matches that lie inside matches from the first list.
Not only does this not work, but I'm not sure it's very efficient. Note that this is a simplified version of my situation, which involves more regexes and a large text. Iterables is from Guava.
First of all, you can achieve this by merging both expressions into one:
(^c\w+)|\s(c\w+)|(\w*me\w*)
If you match against this regex every match will be either a word starting with "c" followed by some word-characters or a word containing "me". For every match you then either get the group:
(1) or (2) indicating a word starting with "c" or
(3) indicating a word containing "me"
However, note that this only works if you know the delimiter between the words, in this case a \s character.
Example code:
String text = "camel crime care cool medium melt home";
final Pattern PATTERN = Pattern.compile("(^c\\w+)|\\s(c\\w+)|(\\w*me\\w*)");
// Save all matches
List<String> wordsStartingWithC = new ArrayList<>();
List<String> wordsIncludingMe = new ArrayList<>();
for (Matcher m = PATTERN.matcher(text); m.find();) {
    if (m.group(1) != null) {
        wordsStartingWithC.add(m.group(1));
    } else if (m.group(2) != null) {
        wordsStartingWithC.add(m.group(2));
    } else if (m.group(3) != null) {
        wordsIncludingMe.add(m.group(3));
    }
}
System.out.println(wordsStartingWithC);
System.out.println(wordsIncludingMe);
I'd recommend simplifying this by taking a somewhat different approach.
As you seem to know the word delimiter, namely the whitespace character, you can get a collection of all words simply by splitting the original string.
String[] words = "camel crime care cool medium melt home".split(" ");
You then simply iterate over all of these.
for (String word : words) {
    if (word.startsWith("c")) {
        // put in your list for words starting with "c"
    } else if (word.contains("me")) {
        // put in your list for words containing "me"
    }
}
This will result in two lists without duplicate entries, as the second if statement will only be executed in case the first one fails.
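If the real patterns cannot be merged or reduced to string checks, the interval idea from the question can also be made to work without Guava. This is a rough sketch, not a tuned implementation: the class and method names are invented here, and the nested loop is quadratic in the number of matches. The key difference from the question's code is the inclusive comparison, so that a match touching the edge of an enclosing match (the me at the end of crime) still counts as inside.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonOverlappingMatches {
    // Collects the [start, end) offsets of every match of `pattern` in `text`.
    static List<int[]> spans(Pattern pattern, String text) {
        List<int[]> result = new ArrayList<>();
        for (Matcher m = pattern.matcher(text); m.find(); ) {
            result.add(new int[] {m.start(), m.end()});
        }
        return result;
    }

    // Keeps only the candidates that do not lie inside any blocker span.
    static List<int[]> outside(List<int[]> candidates, List<int[]> blockers) {
        List<int[]> kept = new ArrayList<>();
        for (int[] c : candidates) {
            boolean inside = false;
            for (int[] b : blockers) {
                // Inclusive on both ends: sharing a start or end offset still means "inside".
                if (c[0] >= b[0] && c[1] <= b[1]) {
                    inside = true;
                    break;
                }
            }
            if (!inside) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "camel crime care cool medium melt home";
        List<int[]> first = spans(Pattern.compile("c\\w+"), text);
        List<int[]> second = spans(Pattern.compile("me"), text);
        for (int[] span : outside(second, first)) {
            System.out.println(span[0] + ".." + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

For a large text, sorting the spans and walking both lists in step would avoid the nested loop; that refinement is left out here to keep the sketch short.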
Isn't it possible to combine the two regexes? For example, a me that follows a c by at most five word characters can be found with a single regex:
((?<=c)|(?<=c\w)|(?<=c\w{2})|(?<=c\w{3})|(?<=c\w{4})|(?<=c\w{5}))me
Check it out here: https://regex101.com/r/bfNkvF/2
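Since the goal in the question is the opposite (skipping the me inside c-words), the same idea can be turned into a negative lookbehind. Java accepts bounded quantifiers inside a lookbehind, so the chain of alternatives collapses into one assertion. A sketch, assuming no word is longer than 20 characters after its leading c (the bound is arbitrary) and using an invented class name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindSketch {
    // "me" that is NOT preceded, within the same word, by a word-initial 'c'.
    // The {0,20} bound keeps the lookbehind finite, which Java requires.
    private static final Pattern ME_OUTSIDE_C_WORDS =
            Pattern.compile("(?<!\\bc\\w{0,20})me");

    // Returns the start offset of every accepted "me".
    static List<Integer> starts(String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = ME_OUTSIDE_C_WORDS.matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // The accepted matches are the "me" in medium, melt and home.
        System.out.println(starts("camel crime care cool medium melt home"));
    }
}
```

This only works when the first pattern can be expressed as a bounded prefix; for arbitrary first patterns, comparing match offsets is the more general route.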
this is my regex:
([+-]*)(\\d+)\\s*([a-zA-Z]+)
group no.1 = sign
group no.2 = multiplier
group no.3 = time unit
The thing is, I would like to match given input but it can be "chained". So my input should be valid if and only if the whole pattern is repeating without anything between those occurrences (except of whitespaces). (Only one match or multiple matches next to each other with possible whitespaces between them).
valid examples:
1day
+1day
-1 day
+1day-1month
+1day +1month
+1day +1month
invalid examples:
###+1day+1month
+1day###+1month
+1day+1month###
###+1day+1month###
###+1day+1month###
In my case I can use the matcher.find() method, which would do the trick, but it will also accept input like +1day###+1month, which is not valid for me.
Any ideas? This can be solved with multiple IF conditions and multiple checks for start and end indexes but I'm searching for elegant solution.
EDIT
The regex suggested in the comments below, ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$, partially does the trick, but if I use it in the code below it returns a different result from the one I'm looking for.
The problem is that I cannot use (*my regex*)+ because it will match the whole thing.
The solution could be to match the whole input with ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ and then use ([+-]*)(\\d+)\\s*([a-zA-Z]+) with matcher.find() and matcher.group(i) to extract each match and its groups. But I was looking for a more elegant solution.
This should work for you:
^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$
First, by adding the beginning and ending anchors (^ and $), the pattern will not allow invalid characters to occur anywhere before or after the match.
Next, I included optional whitespace before and after the repeated pattern (\s*).
Finally, the entire pattern is enclosed in a repeater so that it can occur multiple times in a row ((...)+).
On a side note, I'd also recommend changing [+-]* to [+-]? so that the sign can occur at most once.
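One caveat with the repeated group: after a successful match, each capture group holds only the values from its last repetition, so the anchored regex validates the input but cannot hand back every term. A two-step sketch (validate the whole input, then extract the terms with find()) could look like this. The class and method names are made up for the example, and it uses [+-]? as recommended above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationTerms {
    // One term: optional sign, digits, optional whitespace, unit letters.
    private static final String TERM = "([+-]?)(\\d+)\\s*([a-zA-Z]+)";
    // The whole input: one or more terms separated only by whitespace.
    private static final Pattern WHOLE = Pattern.compile("\\s*(?:" + TERM + "\\s*)+");
    private static final Pattern ONE = Pattern.compile(TERM);

    static boolean isValid(String input) {
        // matches() requires the entire input to match, so no ^ or $ is needed.
        return WHOLE.matcher(input).matches();
    }

    // Returns {sign, multiplier, unit} for each term, or throws on invalid input.
    static List<String[]> parse(String input) {
        if (!isValid(input)) {
            throw new IllegalArgumentException("Invalid format: " + input);
        }
        List<String[]> terms = new ArrayList<>();
        Matcher m = ONE.matcher(input);
        while (m.find()) {
            terms.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String[] term : parse("+1day -1 month")) {
            System.out.println(String.join(" | ", term));
        }
        System.out.println(isValid("+1day###+1month"));
    }
}
```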
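You could use ^ and $ for that, to match the start and end of the string (note that [a-z] below only accepts uppercase units if the case-insensitive flag is set):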
^\s*(?:([+-]?)(\d+)\s*([a-z]+)\s*)+$
https://regex101.com/r/lM7dZ9/2
See the Unit Tests for your examples. Basically, you just need to allow the pattern to repeat and force that nothing besides whitespace occurs in between the matches.
Combined with line start/end matching and you're done.
You can use String.matches or Matcher.matches in Java to match the entire region.
Java Example:
import java.util.regex.Pattern;
import org.junit.Test; // assuming JUnit 4
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class RegTest {
    public static final Pattern PATTERN = Pattern.compile(
            "(\\s*([+-]?)(\\d+)\\s*([a-zA-Z]+)\\s*)+");

    @Test
    public void testDays() throws Exception {
        assertTrue(valid("1 day"));
        assertTrue(valid("-1 day"));
        assertTrue(valid("+1day-1month"));
        assertTrue(valid("+1day -1month"));
        assertTrue(valid(" +1day +1month "));
        assertFalse(valid("+1day###+1month"));
        assertFalse(valid(""));
        assertFalse(valid("++1day-1month"));
    }

    private static boolean valid(String s) {
        // matches() must consume the entire input, so no anchors are needed.
        return PATTERN.matcher(s).matches();
    }
}
You can proceed like this:
String p = "\\G\\s*(?:([-+]?)(\\d+)\\s*([a-z]+)|\\z)";
Pattern RegexCompile = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
String s = "+1day 1month";
ArrayList<HashMap<String, String>> results = new ArrayList<HashMap<String, String>>();
Matcher m = RegexCompile.matcher(s);
boolean validFormat = false;
while (m.find()) {
    if (m.group(1) == null) {
        // if the capture group 1 (or 2 or 3) is null, it means that the second
        // branch of the pattern has succeeded (the \z branch) and that the end
        // of the string has been reached.
        validFormat = true;
    } else {
        // otherwise, this is not the end of the string and the match result is
        // temporarily stored in the ArrayList 'results'
        HashMap<String, String> result = new HashMap<String, String>();
        result.put("sign", m.group(1));
        result.put("multiplier", m.group(2));
        result.put("time_unit", m.group(3));
        results.add(result);
    }
}
if (validFormat) {
    for (HashMap<String, String> item : results) {
        System.out.println("sign: " + item.get("sign")
                + "\nmultiplier: " + item.get("multiplier")
                + "\ntime_unit: " + item.get("time_unit") + "\n");
    }
} else {
    results.clear();
    System.out.println("Invalid Format");
}
The \G anchor matches at the start of the string or at the position where the previous match ended. In this pattern, it ensures that all matches are contiguous. If the \z branch then succeeds, the end of the string has been reached through contiguous matches only, which proves that the string is valid from start to end.
Examples:
rythm&blues -> Rythm&Blues
.. DON&apos;T WEAR WHITE/LIVE -> Don&apos;t Wear White/Live
First I convert the whole string to lowercase (because I want to have only Uppercase at the start of a word).
I currently do this by using a split pattern: [&/\\.\\s-]
And then I convert the parts' first letter to Uppercase.
It works well, except, that it also converts HTML entities of course:
E.g. don&apos;t is converted to don&Apos;t but that entity should be left alone.
While writing this I discovered an additional problem: the initial conversion to lowercase potentially messes up some HTML entities as well. So the entities should be left alone entirely. (E.g. &Ccedil; is not the same as &ccedil;)
An HTML entity is probably matched like this: &[a-z][A-Z][a-z]{1,5};
I am thinking of doing something with groups, but unfortunately I find it very hard to figure out.
This pattern seems to handle your situation
"\\w+|&#?\\w+;\\w*"
There may be some edge cases, but we can adjust accordingly as they come up.
Pattern Breakdown:
\\w+ - Match any word
&#?\\w+;\\w* - Match an HTML entity, plus any word characters directly after it
Code Sample:
public static void main(String[] args) throws Exception {
    String[] lines = {
        "rythm&blues",
        ".. DON&apos;T WEAR WHITE/LIVE"
    };
    Pattern pattern = Pattern.compile("\\w+|&#?\\w+;\\w*");
    for (int i = 0; i < lines.length; i++) {
        Matcher matcher = pattern.matcher(lines[i]);
        while (matcher.find()) {
            if (matcher.group().startsWith("&")) {
                // Handle HTML entities
                // There are letters after the semi-colon that
                // need to be lower case
                if (!matcher.group().endsWith(";")) {
                    String htmlEntity = matcher.group();
                    int semicolonIndex = htmlEntity.indexOf(";");
                    lines[i] = lines[i].replace(htmlEntity,
                            htmlEntity.substring(0, semicolonIndex) +
                            htmlEntity.substring(semicolonIndex + 1)
                                    .toLowerCase());
                }
            } else {
                // Uppercase the first letter of the word and lowercase
                // the rest of the word
                lines[i] = lines[i].replace(matcher.group(),
                        Character.toUpperCase(matcher.group().charAt(0)) +
                        matcher.group().substring(1).toLowerCase());
            }
        }
    }
    System.out.println(Arrays.toString(lines));
}
Results:
[Rythm&Blues, .. Don&apos;t Wear White/Live]
The solution here is probably a lookahead assertion: the split pattern should match an & character only if it is not the start of an entity. The problem is that I am not sure whether your data can contain text that could be mistaken for an entity (basically anything ending with ;). Let's assume for now that it cannot. This is how such a split pattern with a negative lookahead could look:
/(?!&apos;)[&/\.\s-]/
Note that this handles only the &apos; entity. You would probably want to extend the list of entities or provide a pattern that matches all valid entities.
Here's a fiddle (JS, but should work in Java as well): http://refiddle.com/refiddles/55a5078c75622d15bb010000
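Another way to leave entities completely untouched is to rebuild the string token by token instead of lowercasing first and calling String.replace, which rewrites every occurrence of the matched text. The sketch below rests on some assumptions: anything of the form &name; or &#digits; is treated as an entity, and a word that directly follows another token with no separator in between is considered a continuation of the same word. The class and method names are invented for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityAwareTitleCase {
    // Either an HTML entity (kept verbatim) or a run of word characters.
    private static final Pattern TOKEN = Pattern.compile("&#?\\w+;|\\w+");

    static String titleCase(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(input);
        int copied = 0;   // everything before this offset is already in `out`
        int prevEnd = -1; // end offset of the previous token, -1 if there is none
        while (m.find()) {
            out.append(input, copied, m.start()); // separators are copied unchanged
            String token = m.group();
            if (token.startsWith("&")) {
                out.append(token); // entity: never change its case
            } else if (m.start() == prevEnd) {
                // Directly follows the previous token (e.g. the T in DON&apos;T).
                out.append(token.toLowerCase(Locale.ROOT));
            } else {
                out.append(Character.toUpperCase(token.charAt(0)))
                   .append(token.substring(1).toLowerCase(Locale.ROOT));
            }
            prevEnd = m.end();
            copied = m.end();
        }
        out.append(input, copied, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(titleCase("rythm&blues"));
        System.out.println(titleCase(".. DON&apos;T WEAR WHITE/LIVE"));
    }
}
```

As noted above, text such as rock&roll; would be mistaken for an entity; a whitelist of entity names would be needed to rule that out.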
I'm having trouble figuring out the proper regex.
Here is some sample code:
@Test
public void testFindEasyNaked() {
    System.out.println("Naked_find");
    String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >foo<name>Fred</name></us-patent-grant> extra stuff";
    String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
    System.out.println(nakedPat);
    Pattern naked = Pattern.compile(nakedPat, Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = naked.matcher(arg);
    if (m.find()) {
        System.out.println("found naked");
        for (int i = 0; i <= m.groupCount(); i++) {
            System.out.printf("%d: %s\n", i, m.group(i));
        }
    } else {
        System.out.println("can't find naked either");
    }
    System.out.flush();
}
My regex matches the string, but I am not able to pull the repeated pattern.
What I want is to have
seq=\"002\" image=\"D000001\"
pulled out as a group. Here is what the program shows when I execute it.
Naked_find
<(us-patent-grant)((\s*[\S&&[^>]])*)*\s*>(.+?)</\1>
found naked
0: <us-patent-grant seq="002" image="D000001" >foo<name>Fred</name></us-patent-grant>
1: us-patent-grant
2:
3: "
4: foo<name>Fred</name>
The group #4 is fine, but where is the data for #2 and #3, and why is there a double quote in #3?
Thanks
Pat
Even if using an XML parser would be sound, I think I can explain the error in your regular expression:
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
You try to match the parameters in the part ((\\s*[\\S&&[^>]])*)*. Look at your innermost group: you have \s* ("zero or more whitespace characters") followed by [\\S&&[^>]] ("one non-space character which is not >"). It means that each iteration of that group matches zero or more spaces followed by a single non-space character.
So this will match any non-space character between "us-patent-grant" and >. Every time the regular expression engine matches the group again, it overwrites the value of group 3, so the values matched previously are lost. That's why you end up with the last character of the tag, that is ". Group 2 is most likely empty for a similar reason: the outer * makes one more attempt, which matches the empty string and overwrites the capture.
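The "last iteration wins" behaviour is easy to reproduce on a tiny input. A minimal sketch (the class and method names are made up for the demonstration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // Returns what capture group 1 holds after matching the entire input, or null.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group is repeated, so it only remembers its last iteration.
        System.out.println(groupOne("(\\w)+", "abc"));      // c
        // The repetition is inside the group, so the group spans all of it.
        System.out.println(groupOne("((?:\\w)+)", "abc"));  // abc
    }
}
```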
You can improve it a bit by adding a + after [\\S&&[^>]], so that it matches a complete run of non-space characters, but you would still only obtain the last tag attribute in your group. There is a better and simpler way:
Since your goal is to pull out seq="002" image="D000001" as one group, simply match the sequence of all characters that are not > after "us-patent-grant":
"<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>"
This way, you have the following values in your groups:
Group 1: us-patent-grant
Group 2: seq=\"002\" image=\"D000001\" (followed by the space before >, because the greedy [^>]* also consumes it)
Group 3: foo<name>Fred</name>
Here is the test on Regexplanet: http://fiddle.re/ezfd6
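If the individual attributes are needed rather than the raw attribute text, a second pattern can be run over the captured group. The sketch below assumes attributes always have the form name="value" with double quotes; the class name, the ATTR pattern and the lazy [^>]*? (used to drop the trailing space) are choices made for this example, not part of the answer above. For real XML, a parser remains the safer option.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagAttributes {
    // Group 2 captures the raw attribute text; the lazy quantifier leaves
    // trailing whitespace to the following \s*.
    private static final Pattern TAG = Pattern.compile(
            "<(us-patent-grant)\\s*([^>]*?)\\s*>(.+?)</\\1>", Pattern.DOTALL);
    // Assumed attribute shape: name="value".
    private static final Pattern ATTR = Pattern.compile("([\\w-]+)=\"([^\"]*)\"");

    // Returns the attributes of the first us-patent-grant element, in document order.
    static Map<String, String> attributes(String input) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(input);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                attrs.put(attr.group(1), attr.group(2));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >"
                + "foo<name>Fred</name></us-patent-grant> extra stuff";
        System.out.println(attributes(arg));
    }
}
```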