English contraction catastrophes

English contraction catastrophes - java

Background
Writing a straight quote to curly quote converter and am looking to separate substitution into a few different steps. The first step is to replace contractions in text using a lexicon of known contractions. This won't solve ambiguities, but should convert straight quote usages in common contractions.
Problem
In Java, \b and \w don't include apostrophes as part of a word, which makes this problem a bit finicky. The issue is in matching words that:
contain one or more apostrophes, but do not start or end with one (inner);
begin with an apostrophe, may contain one or more, but do not end with one (began);
end with an apostrophe, may contain one or more, but do not start with one (ended); and
begin and end with an apostrophe, but may not contain one (outer).
Code
Given some nonsensical text:
'Twas---Wha'? Wouldn'tcha'? 'Twas, or 'twasn't, 'tis what's 'tween dawn 'n' dusk 'n stuff. Cookin'? 'Sams' place, 'yo''
the regexes should capture the following words:
inner: what's
began: 'Twas, 'Twas, 'twasn't, 'tis, 'tween, 'n
ended: Wha', Wouldn'tcha', Cookin'
outer: 'n', 'Sams', 'yo'
Here are non-working expressions, a mix-mash of maladroit thoughts:
inner: \p{L}+'\p{L}*\p{L}
began: ((?<=[^\p{L}])|^)'\p{L}+('\p{L}|\p{L})?
ended: (\p{L}|\p{L}')+'(?=[^\p{L}]|$)
This one appears to work:
outer: ((?<=[^\p{L}])|^)'\p{L}+'(?!\p{L})
Question
What regular expressions would categorize this quartet of contractions correctly?

This regex should do what you want. It uses named capture groups to categorise the words with appropriate lookarounds to ensure that we match the whole words with the required outer quotes:
(?<inner>(?<![\p{L}'])(?:\p{L}+')+\p{L}+(?![\p{L}']))|
(?<began>(?<!\p{L})(?:'\p{L}+)+(?![\p{L}']))|
(?<ended>(?<![\p{L}'])(?:\p{L}+')+(?!\p{L}))|
(?<outer>(?<!\p{L})'\p{L}+'(?!\p{L}))
Group inner looks for a string with some number of groups of letters followed by a quote (?:\p{L}+')+ followed by some number of letters \p{L}+.
Group began looks for a string with some number of groups of a quote followed by some number of letters (?:'\p{L}+)+.
Group ended looks for a string with some number of groups of letters followed by a quote (?:\p{L}+')+.
Group outer looks for a string with quotes on either end and some number of letters in the middle '\p{L}+'.
Demo on regex101

Related

Regex format for a particular Match

I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?

The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
Character classes [...] are saying match any single character in the list, so your first capture group (([^-]+)-([^_]+)) is looking for anything that isn't a dash any number of times followed by a dash (which is fine) followed by anything that isn't an underscore (again fine). Having the extra set of parentheses around that creates another capture group (probably group 1 as it's the first parentheses reached by the regex engine)... that part is OK but probably makes interpreting the answer less intuitive in this case.
In the re-write however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes" followed by a dash, followed by any number of underscores followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside embedded character classes is also making things complicated. The first part of your first one is looking for, well, I'm not actually sure how [^[PA]-]+ is interpreted in practice but I suspect it's something like "not either a P or an A or a dash any number of times". The second one is looking for the opposite, I think. But you don't want any of that, you just want to start without anything other than the actual sequence of characters you care about, which is just PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, then replace {x} with + to just capture numbers regardless of max length.

according to your comment this would be appropriate PA-\d{6}-\d{3}_TY
EDIT: if you want to match a line use it with anchors: ^PA-\d{6}-\d{3}_TY$

Regex to Split String on Even Number of Preceding Characters

I'm trying to develop a regex that will split a string on a single quote only if the single quote is preceded by zero question marks, or an even number of question marks. For example, the following string:
ABC??'DEF?'GHI'JKL????'MNO'
would result in:
ABC??
DEF?'GHI
JKL????
MNO
I've tried using this negative lookbehind:
(?<!\?\?)*\'
But that results in:
ABC??
DEF?
GHI
JKL????
MNO
I've also tried the following
(?<!(\?\?)*)\' results in runtime error
(?:\?\?)*\'
(?!\?\?)+\'
Any ideas would be greatly appreciated.

This isn't handy to use the split method in this kind situations. A workaround consists to describe all that isn't the delimiter and to use the find method:
[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+
demo
pattern details:
[^?']* # zero or more characters that aren't a `?` or a `'`
(?: # open a non-capturing group
\? . # a question mark followed by a character (that can be a `?` or a `'`)
[^?']* #
)* # close the non-capturing group and repeat it zero or more times
[^?']*(?:\?.[^?']*)* describes all that isn't the delimiter including the empty string. To avoid empty matches, I use 2 branches of an alternation: [^?']+(?:\?.[^?']*)* and (?:\?.[^?']*)+ to ensure there's at least one character.
(If you want to allow the empty string at the start of the string, add |^ at the end of the pattern)
You can also use the split method but the pattern to do it isn't efficient since it needs to look backward for each position (and is limited since the lookbehind in java allows only limited quantifiers):
(?<=(?<!\?)(?:\?\?){0,100})'
or perhaps more efficient like this:
'(?<=(?<!\?)(?:\?\?){0,100}')

Have you tried positive lookbehind
(?<=.')
Regex101

This regex will do it:
[A-Z]+(\?\?)*'

If you only need to handle a single question mark, not three, five, etc., you could use this:
(?<![^\?]\?)'
You could expand on this concept to match other specific odd numbers of question marks. For example, this will properly not split on a quote preceded by one, three, or five question marks:
(?<![^\?]\?|[^\?]\?{3}|[^\?]\?{5})'
Working example. Lookbehinds must be fixed-width, but some engines allow an OR of the entire lookbehind. Others do not, and would require it be written as three separate lookbehinds:
(?<![^\?]\?)(?<![^\?]\?{3})(?<![^\?]\?{5})'
Obviously this is getting a bit messy, though. And it can't handle an arbitrary odd number of ?.

multiple regular expressions vs search algorithm

I have a text file where every line is a random combination of any of the following groups
Numbers - English Letters - Arabic Letters - Punctuation
\w which is composed of a-zA-Z0-9_ for the first 2 groups
\p{InArabic} for the third group
\p{Punct} which is composed of !"#$%&'()*+,-./:;<=>?#[]^_`{|}~ for the fifth group
I got this info from here
i read a line. The ONLY time I do something to this line is if the line contains Arabic letters AND (English letters OR Unicode Symbols)
After reading this post and this post I came up with the following expression. Obviously it's wrong as my output is all wrong >.<
pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");
Here's the input
1
1a
a!
aش
شa
ششa
aشش
شaش
aشa
!aش
The first three shouldn't be matched but my output shows that NONE are a match.
Edit: sorry I just realized that I forgot to change my title. But if any of you feel that searching is better performance wise then please suggest a search algorithm. Using search algo instead of regex looks ugly but I'd go with it if it performed better. Thanks to the posts I read, I learned that I can make regex faster if I put this in the constructor so that it'd be executed once only instead of including them in my loop thereby being executed everytime
pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");
matcher = pattern.matcher("");

To follow your idea, the correct pattern is:
pattern = Pattern.compile("(?=.*\\p{InArabic})(?=.*[a-zA-Z\\p{Punct}])");
The same position in a string can not be followed by an arabic letter and a punctuation character or a latin letter at the same time. In other words, you have written an always false condition. Adding .* allows characters to be anywhere in the string.
If you want a more optimised pattern, you can use Jason C idea but with negative character classes to reduce the backtracking:
pattern = Pattern.compile("\\p{inArabic}[^a-zA-Z\\p{Punct}]*[a-zA-Z\\p{Punct}]|[a-zA-Z\\p{Punct}]\\P{inArabic}*\\p{inArabic}");

If you want to find a line with a mix, all you really need are 2 boundry condition checks.
A sucessfull match indicates a mix.
# "\\p{InArabic}(?=[\\w\\p{Punct}])|(?<=[\\w\\p{Punct}])\\p{InArabic}"
\p{InArabic}
(?= [\w\p{Punct}] )
|
(?<= [\w\p{Punct}] )
\p{InArabic}

How to combine these regex for javascript

Hi I am trying to use regEx in JS for identifying 3 identical consecutive characters (could be alphabets,numbers and also all non alpha numeric characters)
This identifies 3 identical consecutive alphabets and numbers : '(([0-9a-zA-Z])\1\1)'
This identifies 3 identical consecutive non alphanumerics : '(([^0-9a-zA-Z])\1\1)'
I am trying to combine both, like this : '(([0-9a-zA-Z])\1\1)|(([^0-9a-zA-Z])\1\1)'
But I am doing something wrong and its not working..(returns true for '88aa3BBdd99##')
Edit : And to find NO 3 identical characters, this seems to be wrong /(^([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1)/ --> RegEx in JS to find No 3 Identical consecutive characters
thanks
Nohsib

The problem is that backreferences are counted from left to right throughout the whole regex. So if you combine them your numbers change:
(([0-9a-zA-Z])\2\2)|(([^0-9a-zA-Z])\4\4)
You could also remove the outer parens:
([0-9a-zA-Z])\1\1|([^0-9a-zA-Z])\2\2
Or you could just capture the alternatives in one set of parens together and append one back-reference to the end:
([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1
But since your character classes match all characters anyway you can have that like this as well:
([\s\S])\1\1
And if you activate the DOTALL or SINGLELINE option, you can use a . instead:
(.)\1\1

It's actually much simpler:
(.)\1\1
The (.) matches any character, and each \1 is a back reference that matches the exact string that was matched by the first capturing group. You should be aware of what the . actually matches and then modify the group (in the parentheses) to fit your exact needs.

Java regex mix two patterns

How can i get this pattern to work:
Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");
Basically, this will split my String[] sentence by any kind of punctuation character (p{P} or any kind of whitespace (p{Z}). But i want to exclude the following case:
(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])
pattern explained here: Java regex patterns
which are the hyphened words like this: "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". SO, i can i do that?

Unfortunately it seems like you can't merge both expressions, at least as far as I know.
However, maybe you can reformulate your problem.
If, for example, you want to split between words (which can contain hyphens), try this expression:
(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)
This should match any sequence of non-letter characters that are not a minus or any minus that is followed my a non-letter character or that is the first or last character in the input.
Using this expression for a split should result in this:
input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
ouput={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}
The regex might need some additional optimization but it is a start.
Edit: This expression should get rid of the empty string in the split:
(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)
The first part would now read as "any non-character which is not a minus followed by any number of non-character characters" and should match .-- as well.
Edit: in case you want to match words that could potentially contain hyphens, try this expression:
(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)
This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one more letters ((?:-\p{L}+)*+). That sequence must be preceeded by either the start or anything not a letter or minus ((?>(?<=[^-\p{L}])|^)) and be followed by anything that is not a letter or minus or the end of the input ((?>(?=[^-\p{L}])|$))".

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

English contraction catastrophes - java

Related

Regex format for a particular Match

Regex to Split String on Even Number of Preceding Characters

multiple regular expressions vs search algorithm

How to combine these regex for javascript

Java regex mix two patterns

Categories

Resources