I am trying to do a sort of nested Neo4j query in Java, which first labels a subset of nodes and then tries to match certain patterns among them. More specifically it is like combining 2 queries of this type:
1 - MATCH (n)-[r:RELATIONSHIP*1..3]->(m) set m:LABEL
2 - MATCH (p:LABEL)-[r2:RELATIONSHIP]->(q:OTHERLABEL) where r2.time<100 return p,r2,q
Is there a way I can merge these two queries into one using the Java function engine.execute()?
'p' in query #2 will, in general, correspond to a superset of 'm' in query #1. If that is your intention, then the following should work. Notice that the 2 MATCH statements have no common variables, but a WITH is required by the Cypher syntax, so I arbitrarily picked the variable 'm' to pass to the second MATCH (even though it will be ignored).
MATCH (n)-[r:RELATIONSHIP*1..3]->(m)
SET m:LABEL
WITH m
MATCH (p:LABEL)-[r2:RELATIONSHIP]->(q:OTHERLABEL)
WHERE r2.time<100
RETURN p,r2,q;
If you intend 'm' and 'p' to be exactly the same, then just replace '(p:LABEL)' with '(m)':
MATCH (n)-[r:RELATIONSHIP*1..3]->(m)
SET m:LABEL
WITH m
MATCH (m)-[r2:RELATIONSHIP]->(q:OTHERLABEL)
WHERE r2.time<100
RETURN m,r2,q;
I am trying to parse a query which I need to modify to replace a specific property and its value with another property and different values. I am struggling to write a regex that will match the specific property and its value that I need.
Here are some examples to illustrate my point. test:property is the property name that we need to match.
Property with a single value: test:property:schema:Person
Property with multiple values (there is no limit on how many values there can be - this example uses 3): test:property:(schema:Person OR schema:Organization OR schema:Place)
Property with a single value in brackets: test:property:(schema:Person)
Property with another property in the query string (i.e. there are other parts of the string that I'm not interested in): test:property:schema:Person test:otherProperty:anotherValue
Also note that other combinations are possible such as other properties being before the property I need to capture, my property having multiple values with another property present in the query.
I want to match on the entire test:property section with each value captured within that match. Given the examples above these are the results I am looking for:
# | Match | Groups
1 | test:property:schema:Person | schema:Person
2 | test:property:(schema:Person OR schema:Organization OR schema:Place) | schema:Person, schema:Organization, schema:Place
3 | test:property:(schema:Person) | schema:Person
4 | test:property:schema:Person | schema:Person
Note: #1 and #4 produce the same output. I wanted to illustrate that the rest of the string should be ignored (I only need to change the test:property key and value).
The pattern of schema:Person is defined as \w+\:\w+, i.e. one or more word characters, followed by a colon, followed by one or more word characters.
If we define the known parts of the string with names I think I can express what I want to match.
schema:Person - <TypeName> - note that the first part, schema in this case, is not fixed and can be different
test:property - <MatchProperty>
<MatchProperty>: // property name (which is known and the same - in the examples this is `test:property`) followed by a colon
( // optional open bracket
<TypeName>
(OR <TypeName>)* // optional additional TypeNames separated by an OR
) // optional close bracket
Every example I've found has had simple alphanumeric characters in the repeating section but my repeating pattern contains the colon which seems to be tripping me up. The closest I've got is this:
(test\:property:(?:\(([\w+\:\w+]+ [OR [\w+\:\w+]+)\))|[\w+\:\w+]+)
Which works okayish when there are no other properties (although the match for example #2 contains the entire property and value as the first group result, and a second group with the property value) but goes crazy when other properties are included.
Also, putting that regex through https://regex101.com/ I know it's not right as the backslash characters in the square brackets are being matched exactly. I started to have a go with capturing and non-capturing groups but got as far as this before giving up!
(?:(\w+\:\w+))(?:(\sOR\s))*(?:(\w+\:\w+))*
This isn't a complete solution if you want pure regex because there are some limitations to regex and Java regex in particular, but the regexes I came up with seem to work.
If you're looking to match the entire sequence, the following regex will work.
test:property:(?:\((\w+:\w+)(?:\sOR\s(\w+:\w+))*\)|(\w+:\w+))
Unfortunately, the repeated capture groups will only capture the last match, so in queries with multiple values (like example 2), groups 1 and 2 will be the first and last values (schema:Person and schema:Place). In queries without parentheses, the value will be in group 3.
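To make the "last match only" behaviour concrete, here is a small Java sketch that runs the regex above and returns groups 1 to 3. The class and method names (RepeatedGroupDemo, groupsOf) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // The full-match regex from the answer, with backslashes doubled for a Java string.
    static final Pattern FULL = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+)(?:\\sOR\\s(\\w+:\\w+))*\\)|(\\w+:\\w+))");

    // Returns groups 1..3 of the first match, or null if the property is absent.
    static String[] groupsOf(String query) {
        Matcher m = FULL.matcher(query);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }
}
```

For example 2, group 1 holds schema:Person and group 2 holds schema:Place; schema:Organization is overwritten because the repeated group keeps only its final iteration.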
If you know the maximum number of values, you could just generate a massive regex that will have enough groups, but this might not be ideal depending on your application.
The other regex to find values in groups of arbitrary length uses regex's positive lookbehind to match valid values. You can then generate an array of matches.
(?<=test:property:(?:(?:\((?:\w+:\w+\sOR\s)+)|\(?))\w+:\w+
The issue with this method is that it looks like Java lookbehind has some limitations, specifically, not allowing unbounded or complex quantifiers. I'm not a Java person so I haven't tried things out for myself, but it seems like this wouldn't work either. If someone else has another solution, please post another answer!
With this in mind, I would probably suggest going with a combination regex + string parsing method. You can use regex to parse out the value or multiple values (separated by OR), then split the string to get your final values.
To match either the entire part inside the parentheses or the single value without parentheses, you can use this regex:
test:property:(?:\((\w+:\w+(?:\sOR\s\w+:\w+)*)\)|(\w+:\w+))
It's still split into two groups where one matches values with parentheses and the other matches values without (to avoid matching unpaired parentheses), but it should be usable.
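A minimal Java sketch of the regex + string parsing combination, assuming the values are always separated by a single whitespace character, OR, and another whitespace character, as in the examples. PropertyValueExtractor and extractValues are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyValueExtractor {
    // Group 1: the whole list inside parentheses; group 2: a single bare value.
    private static final Pattern PROPERTY = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+(?:\\sOR\\s\\w+:\\w+)*)\\)|(\\w+:\\w+))");

    // Returns every value of test:property, or an empty list if it is absent.
    static List<String> extractValues(String query) {
        Matcher m = PROPERTY.matcher(query);
        if (!m.find()) {
            return new ArrayList<>();
        }
        String raw = m.group(1) != null ? m.group(1) : m.group(2);
        // Plain string splitting takes over where the regex groups fall short.
        return Arrays.asList(raw.split("\\sOR\\s"));
    }
}
```

m.start() and m.end() on the same Matcher give the span of the whole test:property section, which is what you would need in order to replace it in the original query.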
If you want to play around with these regexes or learn more, here's a regexr: https://regexr.com/65kma
I have a String being generated by concatenating a set of Strings with a comma delimiter. Now I want to write a unit test covering this code, and I want to check that all Strings in the set made it into the concatenated String. The problem is that sets are not ordered, so I can't know for sure what the concatenated String will be. And I can't change the Set to an ordered Set or a List as I don't own that bit of code.
As an example, if my set was {"VAL1", "VAL2"}, my test currently looks like this:
assertTrue("VAL1,VAL2".equals(concString) || "VAL2,VAL1".equals(concString));
This is fine, but if my set had 5, or even 10 values, this will become impractical. So I considered changing it to:
assertTrue(concString.matches("VAL[1-2],VAL[1-2]"));
However this could also match the incorrect case "VAL1,VAL1". Is there a way in regex to say "use this set of values, but don't match a value that was matched already"?
In general no, but in this case, yes.
Pattern.compile("^VAL([12]),VAL(?!\\1\\b)([12])$")
This matches
VAL
followed by [12] with the matching text stored in group 1
followed by ,VAL
followed by text that is not the same as group 1 followed by a word-break
followed by [12] with the matching text stored in group 2
The "is not" is handled by the negative lookahead operator (?!...) and \1 is a back-reference to the content stored in group 1.
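As a quick check of the breakdown above, this sketch wraps the pattern in a helper; the names DistinctPairCheck and isValid are invented for the example.

```java
import java.util.regex.Pattern;

class DistinctPairCheck {
    // (?!\1\b) rejects a second value that repeats whatever group 1 captured.
    static final Pattern TWO_DISTINCT =
        Pattern.compile("^VAL([12]),VAL(?!\\1\\b)([12])$");

    static boolean isValid(String s) {
        return TWO_DISTINCT.matcher(s).matches();
    }
}
```

With this, "VAL1,VAL2" and "VAL2,VAL1" are accepted while "VAL1,VAL1" and "VAL2,VAL2" are rejected.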
This is a little complicated for a unit test.
Unit test code should be as simple as possible so that you're not confused about what you're testing.
If the number of variants is small,
ImmutableSet.of("VAL1,VAL2", "VAL2,VAL1").contains(...)
is simpler and readable.
If the number of variants is not that small, then splitting, sorting, and joining can help you get a canonical value to test against.
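A minimal sketch of the split, sort, and join idea, assuming none of the values contains a comma itself. CanonicalJoin and canonicalize are hypothetical names.

```java
import java.util.Arrays;

class CanonicalJoin {
    // Splits on commas, sorts the parts, and joins them back together,
    // so any ordering of the same values yields the same string.
    static String canonicalize(String joined) {
        String[] parts = joined.split(",");
        Arrays.sort(parts);
        return String.join(",", parts);
    }
}
```

The test can then compare one fixed string, for example assertEquals("VAL1,VAL2", CanonicalJoin.canonicalize(concString)), regardless of the order in which the set was iterated.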
I use neo4j-rest-binding API to develop, but I face a problem when using parameters of RestCypherQueryEngine.
QueryResult<Map<String,Object>> result = engine.query("MATCH (n:{label}) RETURN n", MapUtil.map("label", label));
label is the parameter I assign in the map structure, but it has an error:
org.neo4j.rest.graphdb.RestResultException: Invalid input '{': expected whitespace or an identifier (line 1, column 10)
"MATCH (n:{label}) RETURN n"
^ at
SyntaxException
org.neo4j.cypher.internal.compiler.v2_0.parser.CypherParser$$anonfun$parse$1.apply(CypherParser.scala:51)
org.neo4j.cypher.internal.compiler.v2_0.parser.CypherParser$$anonfun$parse$1.apply(CypherParser.scala:41)
...
I can use another method to solve this problem:
QueryResult<Map<String,Object>> result = engine.query("MATCH (n:" + label +") RETURN n", null);
But I think the above method is not appropriate when I want to pass multiple parameters.
:{ is a syntactical error. As the exception tells you, Cypher expects an identifier after a colon - namely, the name of a label - and an identifier (as in most languages) cannot contain a bracket.
It sounds like you're confused about the difference between labels and parameters:
The following would be valid: MATCH (n:employee{name:"foo"}) Here, employee is a label. You can apply an arbitrary number of labels delimited by colons. {name:"foo"} is a parameter block - note that it contains both the field you want to match and the value. So, this query will return all nodes labelled employee with a name value of "foo". MATCH (n:employee:custodian{name:"foo"}) will give you all employees who are custodians named "foo".
If you want all nodes with a name value of "foo", use MATCH (n {name:"foo"}) (note the absence of a colon).
Edit (responding to your comment): There are two differences between your query and the one in the example you're referring to. First, start n=node({id}) return n is, obviously, a START clause, which does very different things and has different syntactical rules from a MATCH clause: the id in ({id}) is simply a value to look up in an index. In a MATCH clause, what goes inside a { } block are key-value pairs, as explained above. Inside a parameter block (i.e. a set of braces), colons are used to separate keys from values. A colon outside the braces in a MATCH clause is used to separate labels, which are different things entirely.
The second difference is that, if you look more closely at the START clause, there is a parenthesis separating the colon from the bracket. :{ is never okay, which is what your error message is telling you.
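Since Cypher does not accept a parameter in the label position, one possible workaround is to keep concatenating the label (after validating it) while passing property values as ordinary parameters. The sketch below only builds the query string; LabelQueryBuilder, matchByLabel, and the name property are assumptions made for illustration, and the {name} placeholder follows the legacy parameter syntax of the 2.0 parser shown in the stack trace.

```java
class LabelQueryBuilder {
    // Builds a query with the label inlined and the property value left as a
    // parameter. The label is validated first so that concatenation cannot
    // smuggle arbitrary Cypher into the query.
    static String matchByLabel(String label) {
        if (!label.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a plain label: " + label);
        }
        return "MATCH (n:" + label + ") WHERE n.name = {name} RETURN n";
    }
}
```

The resulting string could then be passed to engine.query together with a map such as MapUtil.map("name", someValue), in the same way as in the question, so that any number of values can still be supplied as parameters.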
tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
list of regular expressions
list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern matches the string, add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is (regex1)|(regex2)|(regex3)|...|(regexN), but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manner of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in Java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.
Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups: the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named one: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot actually be part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
I hope this gave you a pointer in the right direction.
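As a starting point for the scanning step described above, here is a rough Java sketch that only locates capturing groups: it skips escaped characters, ignores parentheses inside character classes, and treats (?<name> as capturing while skipping the other (? constructs. It does not rename anything, and it makes simplifying assumptions: no \Q...\E quoting, no comments mode, and no unescaped ] used as a literal at the start of a character class. GroupCounter and its methods are hypothetical names.

```java
class GroupCounter {
    // Counts capturing groups (numbered and named) by walking the pattern once.
    static int countCapturingGroups(String regex) {
        int count = 0;
        int classDepth = 0; // > 0 while inside [...]; Java allows nested classes
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, whatever it is
            } else if (c == '[') {
                classDepth++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            } else if (c == '(' && classDepth == 0 && isCapturing(regex, open(i))) {
                count++;
            }
        }
        return count;
    }

    private static int open(int index) {
        return index; // named only to make the call above read more clearly
    }

    // A parenthesis captures unless it starts with "(?", except for "(?<name>".
    static boolean isCapturing(String regex, int open) {
        if (open + 1 >= regex.length() || regex.charAt(open + 1) != '?') {
            return true;
        }
        if (open + 3 < regex.length() && regex.charAt(open + 2) == '<') {
            char next = regex.charAt(open + 3);
            return next != '=' && next != '!'; // (?<= and (?<! are lookbehinds
        }
        return false;
    }
}
```

A cheap sanity check is to compare the result with Pattern.compile(regex).matcher("").groupCount() for patterns you know to be valid.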
My question is fairly straightforward, even if the purpose it will serve is pretty complicated. I will use a simple example:
AzzAyyAxxxxByyBzzB
So normally I would want to get everything between A and B. However, because some of the content between the first A and the last B (one pair) contains additional AB pairs I need to push back the end of the match. (Not sure if that last part made sense).
So what I'm looking for is some RegEx that would allow me to have the following output:
Match 1
Group 1: AzzAyyAxxxxByyBzzB
Group 2: zzAyyAxxxxByyBzz
Then I would match it again to get:
Match 2
Group 1: AyyAxxxxByyB
Group 2: yyAxxxxByy
Then finally again to get:
Match 3
Group 1: AxxxxB
Group 2: xxxx
Obviously if I try (A(.*?)B) on the whole input I get:
Match x
Group 1: AzzAyyAxxxxB
Group 2: zzAyyAxxxx
Which is not what I'm looking for :)
I hope this makes sense. I understand if this can't be done in RegEx, but I thought I would ask some of you regex wizards before I give up on it and try something else. Thanks!
Additional Info:
The project I'm working on is written in Java.
One other problem is that I'm parsing a document which could contain something like this:
AzzAyyAxxxxByyBzzB
Here is some unrelated stuff
AzzAyyAxxxxByyBzzB
AzzzBxxArrrBAssssB
And the top AB pairs need to be separate from the bottom AB pairs
You made your regex explicitly ungreedy by using the ?. Just leave it out and the regex will consume as much as possible before matching the B:
(A(.*)B)
However, in general nested structures are beyond the scope of regular expressions. In a case like this:
AxxxByyyAzzzB
You would now also match from the first A to the last B. If this is possible in your scenario, you might be better off going through the string yourself character-by-character and counting As and Bs to figure out which ones belong together.
EDIT:
Now that you have updated the question and we figured this out in the comments, you do have the problem of multiple consecutive pairs. In this case, this cannot be done with a regex engine that does not support recursion.
However you can switch to matching from the inside out.
A([^AB]*)B
This will only get innermost pairs, because there can be neither an A nor a B between the delimiters. If you find it, you can then remove the pair and continue with your next match.
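The inside-out loop could look roughly like this in Java. InsideOutMatcher and peel are made-up names, and the sketch follows the "remove the pair and continue" idea literally, so the content reported for an outer pair no longer includes the inner pairs that were already removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsideOutMatcher {
    private static final Pattern INNERMOST = Pattern.compile("A([^AB]*)B");

    // Repeatedly finds an innermost A...B pair, records its content,
    // removes the pair from the string, and starts over.
    static List<String> peel(String input) {
        List<String> contents = new ArrayList<>();
        String remaining = input;
        Matcher m = INNERMOST.matcher(remaining);
        while (m.find()) {
            contents.add(m.group(1));
            remaining = remaining.substring(0, m.start()) + remaining.substring(m.end());
            m = INNERMOST.matcher(remaining);
        }
        return contents;
    }
}
```

If the outer matches need to keep their full original text, replacing each removed pair with a placeholder (or recording offsets) instead of deleting it would be one way to extend this.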
Use word boundary if you use multiline mode:
\bA(.*)B\b #for matches that do not run from the beginning of the line to the end
or
^A(.*)B$ #for matches that run from the beginning of the line to the end
You won't be able to do this with regular expressions alone. What you're describing is more context-free than regular. In order to parse something like this, you need to push a new context onto a stack every time you encounter an 'A' and pop the stack every time you encounter a 'B'. You need something more like a pushdown automaton than a regular expression.
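A minimal Java sketch of that push/pop idea, assuming 'A' and 'B' are the only delimiters and that an unmatched 'B' can simply be ignored. NestedPairs and findPairs are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NestedPairs {
    // Returns every balanced A...B pair, ordered by the position of its 'A',
    // so outer pairs come before the pairs nested inside them.
    static List<String> findPairs(String input) {
        Deque<Integer> openPositions = new ArrayDeque<>();
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == 'A') {
                openPositions.push(i); // new context
            } else if (c == 'B' && !openPositions.isEmpty()) {
                spans.add(new int[] {openPositions.pop(), i + 1}); // close it
            }
        }
        spans.sort((x, y) -> Integer.compare(x[0], y[0]));
        List<String> result = new ArrayList<>();
        for (int[] span : spans) {
            result.add(input.substring(span[0], span[1]));
        }
        return result;
    }
}
```

For AzzAyyAxxxxByyBzzB this yields the three matches from the question (the whole string, then AyyAxxxxByyB, then AxxxxB), and for AzzzBxxArrrBAssssB it yields the three consecutive pairs separately.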