Combine multiple tokenizers in Solr - java

I'm trying to combine LetterTokenizerFactory with WhitespaceTokenizerFactory, but I can't find a way to do it without copying content using copyField.
Let me describe my idea:
I have two entries in text, e.g. H&M and Hewlett-Packard
User should be able to find H&M entering h&m - I use WhitespaceTokenizerFactory for this purpose, no need to split tokens on special chars
User should be able to find Hewlett-Packard entering 'packard' - LetterTokenizerFactory serves this case, tokens are split on special characters
Now I want to combine both of these tokenizers
How can I achieve it without declaring 2 different types with different tokenizer factories and then copying value to field with second type?

You can use the WhitespaceTokenizerFactory as the main tokenizer, and then add the WordDelimiterGraphFilter to split your tokens further into smaller tokens.
From the example for the WordDelimiterGraphFilter (previously named WordDelimiterFilter, but that's deprecated now - so the name will depend on which Solr version you're using):
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
That would allow packard to match Hewlett-Packard. Be advised that this will also allow 'm' to match h&m, since you're splitting on non-alphanumeric characters. You can either use the protected setting for the filter to specify a list of words that should not be touched or, even better, if you want everything with & to remain untouched, use the types parameter to redefine what type & should be considered as.
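To make the intended token output concrete, here is a small plain-Java simulation of the idea. It is not Solr or Lucene code and does not call the real filter: it only mimics "split on whitespace, then split on non-alphanumeric characters, except that & is treated as a letter". The class and method names are made up for illustration, and the lowercasing step is an assumption (the question implies case-insensitive matching).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class DelimiterSketch {
    /**
     * Splits on whitespace first, then splits each token on any character
     * that is neither an ASCII letter/digit nor '&'. Tokens are lowercased
     * (assumption: the real analyzer chain would include a lowercase filter).
     */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            for (String part : word.split("[^\\p{Alnum}&]+")) {
                if (!part.isEmpty()) {
                    tokens.add(part.toLowerCase(Locale.ROOT));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [h&m, hewlett, packard]
        System.out.println(tokenize("H&M Hewlett-Packard"));
    }
}
```

The real filter does more than this by default (for example splitting on case changes), so treat the sketch only as a picture of the delimiter rule being discussed.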


Match custom pattern in regex multiple times

I am trying to parse a query which I need to modify to replace a specific property and its value with another property and different values. I am struggling to write a regex that will match the specific property and its value that I need.
Here are some examples to illustrate my point. test:property is the property name that we need to match.
Property with a single value: test:property:schema:Person
Property with multiple values (there is no limit on how many values there can be - this example uses 3): test:property:(schema:Person OR schema:Organization OR schema:Place)
Property with a single value in brackets: test:property:(schema:Person)
Property with another property in the query string (i.e. there are other parts of the string that I'm not interested in): test:property:schema:Person test:otherProperty:anotherValue
Also note that other combinations are possible such as other properties being before the property I need to capture, my property having multiple values with another property present in the query.
I want to match on the entire test:property section with each value captured within that match. Given the examples above these are the results I am looking for:
#  | Match | Groups
1  | test:property:schema:Person | schema:Person
2  | test:property:(schema:Person OR schema:Organization OR schema:Place) | schema:Person, schema:Organization, schema:Place
3  | test:property:(schema:Person) | schema:Person
4  | test:property:schema:Person | schema:Person
Note: #1 and #4 produce the same output. I wanted to illustrate that the rest of the string should be ignored (I only need to change the test:property key and value).
The pattern of schema:Person is defined as \w+\:\w+, i.e. one or more word characters, followed by a colon, followed by one or more word characters.
If we define the known parts of the string with names I think I can express what I want to match.
schema:Person - <TypeName> - note that the first part, schema in this case, is not fixed and can be different
test:property - <MatchProperty>
<MatchProperty>: // property name (which is known and the same - in the examples this is `test:property`) followed by a colon
( // optional open bracket
<TypeName>
(OR <TypeName>)* // optional additional TypeNames separated by an OR
) // optional close bracket
Every example I've found has had simple alphanumeric characters in the repeating section but my repeating pattern contains the colon which seems to be tripping me up. The closest I've got is this:
(test\:property:(?:\(([\w+\:\w+]+ [OR [\w+\:\w+]+)\))|[\w+\:\w+]+)
Which works okayish when there are no other properties (although the match for example #2 contains the entire property and value as the first group result, and a second group with the property value) but goes crazy when other properties are included.
Also, putting that regex through https://regex101.com/ I know it's not right as the backslash characters in the square brackets are being matched exactly. I started to have a go with capturing and non-capturing groups but got as far as this before giving up!
(?:(\w+\:\w+))(?:(\sOR\s))*(?:(\w+\:\w+))*
This isn't a complete solution if you want pure regex because there are some limitations to regex and Java regex in particular, but the regexes I came up with seem to work.
If you're looking to match the entire sequence, the following regex will work.
test:property:(?:\((\w+:\w+)(?:\sOR\s(\w+:\w+))*\)|(\w+:\w+))
Unfortunately, the repeated capture groups will only capture the last match, so in queries with multiple values (like example 2), groups 1 and 2 will be the first and last values (schema:Person and schema:Place). In queries without parentheses, the value will be in group 3.
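A short Java sketch of that behaviour, using the regex above unchanged. The class and method names are only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    static final Pattern FULL = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+)(?:\\sOR\\s(\\w+:\\w+))*\\)|(\\w+:\\w+))");

    /** Returns {whole match, group 1, group 2, group 3}, or null if nothing matched. */
    static String[] groups(String query) {
        Matcher m = FULL.matcher(query);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] g = groups("test:property:(schema:Person OR schema:Organization OR schema:Place)");
        System.out.println(g[1]); // schema:Person (first value)
        System.out.println(g[2]); // schema:Place (last repetition only)
        System.out.println(g[3]); // null (the no-parentheses branch did not take part)
    }
}
```

The middle value, schema:Organization, is not available from any group, which is the limitation described above.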
If you know the maximum number of values, you could just generate a massive regex that will have enough groups, but this might not be ideal depending on your application.
The other regex to find values in groups of arbitrary length uses regex's positive lookbehind to match valid values. You can then generate an array of matches.
(?<=test:property:(?:(?:\((?:\w+:\w+\sOR\s)+)|\(?))\w+:\w+
The issue with this method is that it looks like Java lookbehind has some limitations, specifically, not allowing unbounded or complex quantifiers. I'm not a Java person so I haven't tried things out for myself, but it seems like this wouldn't work either. If someone else has another solution, please post another answer!
With this in mind, I would probably suggest going with a combination regex + string parsing method. You can use regex to parse out the value or multiple values (separated by OR), then split the string to get your final values.
To match the entire part inside parentheses, or the single value without parentheses, you can use this regex:
test:property:(?:\((\w+:\w+(?:\sOR\s\w+:\w+)*)\)|(\w+:\w+))
It's still split into two groups where one matches values with parentheses and the other matches values without (to avoid matching unpaired parentheses), but it should be usable.
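Here is a minimal Java sketch of the regex + string splitting combination, assuming the regex above is used as-is. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyValueExtractor {
    // Group 1: everything inside the parentheses; group 2: a single bare value.
    private static final Pattern PROPERTY = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+(?:\\sOR\\s\\w+:\\w+)*)\\)|(\\w+:\\w+))");

    /** Returns the values of the first test:property clause, or an empty list if there is none. */
    static List<String> extractValues(String query) {
        Matcher m = PROPERTY.matcher(query);
        List<String> values = new ArrayList<>();
        if (!m.find()) {
            return values;
        }
        if (m.group(1) != null) {
            // Parenthesised form: split the captured list on the OR separators.
            values.addAll(Arrays.asList(m.group(1).split("\\sOR\\s")));
        } else {
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // prints [schema:Person, schema:Organization, schema:Place]
        System.out.println(extractValues(
            "test:property:(schema:Person OR schema:Organization OR schema:Place)"));
        // prints [schema:Person]
        System.out.println(extractValues(
            "test:property:schema:Person test:otherProperty:anotherValue"));
    }
}
```

m.group(0) gives the whole test:property section if you need to replace it in the original query.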
If you want to play around with these regexes or learn more, here's a regexr: https://regexr.com/65kma

Best way to validate non-printable ascii characters in XML

The application needs to validate different input XML messages for non-printable ASCII characters. We currently know of two options to do this.
Change the XSD to include the restriction.
Validate the input XML string in the Java application using a regular expression
Which approach is better in terms of performance as our application has to return the response within a few seconds? Is there any other option available to do this?
It's mainly a matter of opinion, but if you have an XSD, that seems to be the natural place to include the validations. The only thing you may need to consider is that via XSD the input will either fail or pass, whereas with ad-hoc Java validation you can ignore non-printable characters, replace them, or take some other action without failing the input completely.
The only characters that are (a) ASCII control characters below 0x20 and (b) allowed in XML 1.0 documents are CR, LF, and TAB (DEL, 0x7F, is also permitted, though discouraged). I find it hard to see why excluding those three characters is especially important, but if you already have an XSD schema, then it makes sense to add the restriction there.
The usual approach is not to make these three characters invalid, but to treat them as equivalent to space characters, which you can do by using a data type that has the whiteSpace facet value "replace" or "collapse".
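For the second option from the question (checking the string in Java), a minimal sketch could look like the following. It assumes TAB, LF and CR are tolerated and every other ASCII control character plus DEL counts as "non-printable". Adjust the character class if your rules differ. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class NonPrintableCheck {
    // ASCII control characters except TAB (0x09), LF (0x0A) and CR (0x0D), plus DEL (0x7F).
    private static final Pattern DISALLOWED =
        Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]");

    /** True if the input contains at least one disallowed character. */
    static boolean hasNonPrintable(String xml) {
        return DISALLOWED.matcher(xml).find();
    }

    /** Alternative to rejecting the message: drop the disallowed characters. */
    static String stripNonPrintable(String xml) {
        return DISALLOWED.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(hasNonPrintable("<a>ok\t\n</a>"));       // false
        System.out.println(hasNonPrintable("<a>bad\u0007</a>"));    // true
        System.out.println(stripNonPrintable("<a>bad\u0007</a>"));  // <a>bad</a>
    }
}
```

The pattern is compiled once and reused, which matters if many messages are checked. Note that an XML parser will already reject most of these characters in an XML 1.0 document, so this check only makes sense on the raw string before parsing.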

Java's named capturing groups and trying to see if they are optional

I am currently using named capturing groups in a regex applied to a URL. The client feeds in the regex, but I need to get:
list of capturing group names
which of the names are required
which of the names are optional
Currently, I am cheating by translating {id} or {someVar} into a capture group, and everything is required. Now, however, because of add/edit, some URLs look like this
/postadd
/postedit/someIdHere
so the regex is ONE route matching both. I believe it would look something like this
/postadd|/postedit/(?<id>[^/]+)
I would really, really prefer not to use a regex on the regex to find out whether a group is optional (as code like that is hard to read and reverse engineer). Is there any way instead to list the capturing groups and find out whether each one is optional?
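To make the route above concrete, here is a minimal sketch of how that single regex behaves at match time. It only shows that the id group comes back as null when the /postadd branch matched. It does not inspect the pattern itself to decide which groups are optional, which is what the question is actually asking for. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RouteDemo {
    static final Pattern ROUTE = Pattern.compile("/postadd|/postedit/(?<id>[^/]+)");

    /** True if the whole URL matches the route. */
    static boolean matchesRoute(String url) {
        return ROUTE.matcher(url).matches();
    }

    /** The captured id, or null if the URL did not match or matched without an id. */
    static String idOf(String url) {
        Matcher m = ROUTE.matcher(url);
        return m.matches() ? m.group("id") : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesRoute("/postadd"));        // true
        System.out.println(idOf("/postadd"));                // null
        System.out.println(idOf("/postedit/someIdHere"));    // someIdHere
    }
}
```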

Solr - Match sentence beginning with a particular word

Any tips on how this is done?
I've tried using the PatternTokenizerFactory, but it's not working as expected.
Is it possible to do this without writing a custom tokenizer?
You can tokenize the field in question using KeywordTokenizerFactory and then do a wildcard search
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
provided that you are not doing any other operation which does not work with the above tokenizer.
Another, more roundabout way is to create a copyField whose spaces are stripped out using the following technique (or some other):
What is the regular expression to remove spaces in SOLR
You can then tokenize that copyField using WhitespaceTokenizerFactory (which essentially creates only one token, since the copyField values contain no spaces) and then do a wildcard search on it.
The second approach might fail in some cases (e.g. "wor them" will match "worth*" after the spaces are stripped).
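The false positive is easy to see in a plain-Java simulation. This is not Solr code: it just treats the whole field value as one token and models the wildcard query "worth*" as a prefix test. The names are made up for illustration.

```java
class PrefixMatchSketch {
    /** Approach 1: the whole field value is a single token, spaces included. */
    static boolean keywordPrefixMatch(String fieldValue, String prefix) {
        return fieldValue.startsWith(prefix);
    }

    /** Approach 2: spaces are stripped from the copied value before it is indexed. */
    static boolean strippedPrefixMatch(String fieldValue, String prefix) {
        return fieldValue.replaceAll("\\s+", "").startsWith(prefix);
    }

    public static void main(String[] args) {
        System.out.println(keywordPrefixMatch("wor them", "worth"));   // false
        System.out.println(strippedPrefixMatch("wor them", "worth"));  // true (false positive)
    }
}
```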

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
list of regular expressions
list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern matches the string, add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is (regex1)|(regex2)|(regex3)|...|(regexN) but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manner of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in Java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.
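For reference, the naive implementation described above might look roughly like this. It is a sketch with invented names, and it assumes "matched" means the pattern is found anywhere in the string (find() rather than matches()).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UnmatchedFilter {
    /** Returns the inputs that are not matched by any of the given regexes. */
    static List<String> unmatched(List<String> regexes, List<String> inputs) {
        // Compile each pattern once, not once per input string.
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            boolean hit = false;
            for (Pattern p : patterns) {
                if (p.matcher(input).find()) {
                    hit = true;
                    break; // one match is enough to exclude this string
                }
            }
            if (!hit) {
                result.add(input);
            }
        }
        return result;
    }
}
```

For example, with the regexes ^a and \d+ and the strings apple, b12 and cat, only cat is returned.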
Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. That way you can ensure that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named one: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot actually be part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Note that a Java group name must start with a letter and may contain only letters and digits, so put a letter in front of the number. Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
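The steps above can be sketched in Java roughly as follows. This is a simplified illustration under several assumptions, not a complete implementation: every input pattern is assumed to compile on its own; generated names are a per-pattern prefix plus the group's original number instead of a hash; it works in a single pass, so forward references are not handled; and \Q...\E quoting and comments mode are ignored. GroupRenamer, rename and combine are made-up names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class GroupRenamer {
    /** Gives every capturing group in regex the name prefix + number and rewrites backreferences. */
    static String rename(String regex, String prefix) {
        Map<String, Integer> numberByName = new HashMap<>();
        StringBuilder out = new StringBuilder();
        int groups = 0;     // capturing groups opened so far
        int classDepth = 0; // > 0 while inside a character class [...]
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(i + 1);
                if (classDepth == 0 && next >= '1' && next <= '9') {
                    // Numbered backreference: take more digits only while such a group exists.
                    int j = i + 1, num = 0;
                    while (j < regex.length() && Character.isDigit(regex.charAt(j))) {
                        int candidate = num * 10 + (regex.charAt(j) - '0');
                        if (num != 0 && candidate > groups) break;
                        num = candidate;
                        j++;
                    }
                    out.append("\\k<").append(prefix).append(num).append('>');
                    i = j;
                } else if (classDepth == 0 && regex.startsWith("k<", i + 1)) {
                    // Named backreference: map the old name to the group's number.
                    int end = regex.indexOf('>', i);
                    String name = regex.substring(i + 3, end);
                    out.append("\\k<").append(prefix).append(numberByName.get(name)).append('>');
                    i = end + 1;
                } else {
                    out.append(c).append(next); // any other escape: copy both characters
                    i += 2;
                }
            } else if (c == '[') {
                classDepth++; out.append(c); i++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--; out.append(c); i++;
            } else if (c == '(' && classDepth == 0) {
                boolean special = regex.startsWith("(?", i);
                boolean named = regex.startsWith("(?<", i) && i + 3 < regex.length()
                        && Character.isLetter(regex.charAt(i + 3));
                if (named) {
                    int end = regex.indexOf('>', i);
                    numberByName.put(regex.substring(i + 3, end), ++groups);
                    out.append("(?<").append(prefix).append(groups).append('>');
                    i = end + 1;
                } else if (!special) {
                    out.append("(?<").append(prefix).append(++groups).append('>');
                    i++;
                } else {
                    out.append(c); i++; // (?:, (?=, (?!, (?<=, (?<!, (?>, flag groups
                }
            } else {
                out.append(c); i++;
            }
        }
        return out.toString();
    }

    /** ORs the renamed patterns together, each wrapped in its own non-capturing group. */
    static Pattern combine(List<String> regexes) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < regexes.size(); k++) {
            if (k > 0) sb.append('|');
            sb.append("(?:").append(rename(regexes.get(k), "p" + k + "g")).append(')');
        }
        return Pattern.compile(sb.toString());
    }
}
```

For instance, combining (a)\1 and (b)\1 this way gives a pattern that matches both aa and bb, whereas the plain (a)\1|(b)\1 does not match bb, because the second \1 still points at the first pattern's group.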
I hope this gave you a pointer in the right direction.
