Java Regex intersection (&&) is not commutative

The character class intersection operator &&, by definition of its function, should be commutative. [a&&b] should match exactly the same characters as [b&&a] for any a and b. I've found that the following patterns all satisfy this criterion.
[a-z&&abcd] same as [abcd&&a-z]
[a-z&&ab[cd]] same as [ab[cd]&&a-z]
[a-z&&[ab][cd]] same as [[ab][cd]&&a-z]
They are all equivalent to [abcd]. However, if the pattern is written as [a-z&&[ab]cd], this no longer holds: that expression matches only c and d, not a and b. The flipped version [[ab]cd&&a-z], on the other hand, matches all four characters like the other patterns. In other words
[[ab]cd&&a-z] not same as [a-z&&[ab]cd]
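A small harness makes the comparison easy to repeat. This is only a sketch: matchedLetters is a helper name made up for illustration (it is not part of java.util.regex), and it simply lists which lowercase ASCII letters a character class accepts. The result for [a-z&&[ab]cd] may differ between JDK versions (the behavior described here was observed on Java 1.8.0_60), so the sketch prints it rather than assuming an outcome.

```java
import java.util.regex.Pattern;

class IntersectionCheck {
    // Hypothetical helper: returns the lowercase ASCII letters accepted by charClass.
    static String matchedLetters(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] classes = {
            "[a-z&&abcd]", "[abcd&&a-z]",
            "[a-z&&[ab]cd]", "[[ab]cd&&a-z]"
        };
        // Print each class next to the letters it accepts on the running JDK.
        for (String cc : classes) {
            System.out.println(cc + " -> " + matchedLetters(cc));
        }
    }
}
```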
I went into the sources of Pattern to find out why this is, and I found that this is how intersection is implemented (Java 1.8.0_60 JDK)
case '&':
// ...
ch = next();
if (ch == '&') {
ch = next();
CharProperty rightNode = null;
while (ch != ']' && ch != '&') {
if (ch == '[') {
if (rightNode == null)
rightNode = clazz(true);
else
rightNode = union(rightNode, clazz(true));
} else { // abc&&def
unread();
rightNode = clazz(false); // here is what happens
}
ch = peek();
}
Notice that the marked line is
rightNode = clazz(false);
and not
rightNode = union(rightNode, clazz(true));
In other words, on the right side of &&, whenever the first character that is not inside a nested character class is encountered, the pattern parser assumes there is nothing before it. So after &&, the parser reads [ab] into rightNode, then reads cd, but instead of merging with [ab], it just overwrites it.
I know that practically no one writes a regex like [a-z&&[ab]cd], but still, the documentation implies that it should work. Is this a bug in the implementation, or is it actually supposed to work this way?

Related

confusion in behavior of capturing groups in java regex

In this answer I recommended using
s.replaceFirst("\\.0*$|(\\.\\d*?)0+$", "$1");
but two people complained that the result contained the string "null", e.g., 23.null. This could be explained by $1 (i.e., group(1)) being null, which could be transformed via String.valueOf to the string "null". However, I always get the empty string. My testcase covers it and
assertEquals("23", removeTrailingZeros("23.00"));
passes. Is the exact behavior undefined?

The documentation of the Matcher class in the reference implementation doesn't specify what appendReplacement does when the replacement string refers to a capturing group that didn't capture anything (one for which group returns null). The behavior of the group method is clear, but nothing is said about appendReplacement.
Below are three exhibits showing how implementations differ in this case:
The reference implementation appends nothing (or, equivalently, an empty string).
GNU Classpath and Android's implementation append null.
Some code has been omitted for the sake of brevity; omissions are indicated by ....
1) Sun/Oracle JDK, OpenJDK (Reference implementation)
For the reference implementation (Sun/Oracle JDK and OpenJDK), the code for appendReplacement doesn't seem to have changed from Java 6, and it will not append anything when a capturing group doesn't capture anything:
} else if (nextChar == '$') {
// Skip past $
cursor++;
// The first number is always a group
int refNum = (int)replacement.charAt(cursor) - '0';
if ((refNum < 0)||(refNum > 9))
throw new IllegalArgumentException(
"Illegal group reference");
cursor++;
// Capture the largest legal group string
...
// Append group
if (start(refNum) != -1 && end(refNum) != -1)
result.append(text, start(refNum), end(refNum));
} else {
Reference
jdk6/98e143b44620
jdk8/687fd7c7986d
2) GNU Classpath
GNU Classpath, which is a complete reimplementation of the Java Class Library, has a different implementation of appendReplacement for this case. In Classpath, the classes in the java.util.regex package are just wrappers for classes in gnu.java.util.regex.
Matcher.appendReplacement calls RE.getReplacement to process replacement for the matched portion:
public Matcher appendReplacement (StringBuffer sb, String replacement)
throws IllegalStateException
{
assertMatchOp();
sb.append(input.subSequence(appendPosition,
match.getStartIndex()).toString());
sb.append(RE.getReplacement(replacement, match,
RE.REG_REPLACE_USE_BACKSLASHESCAPE));
appendPosition = match.getEndIndex();
return this;
}
RE.getReplacement calls REMatch.substituteInto to get the content of the capturing group and appends its result directly:
case '$':
int i1 = i + 1;
while (i1 < replace.length () &&
Character.isDigit (replace.charAt (i1)))
i1++;
sb.append (m.substituteInto (replace.substring (i, i1)));
i = i1 - 1;
break;
REMatch.substituteInto appends the result of REMatch.toString(int) directly without checking whether the capturing group has captured anything:
if ((input.charAt (pos) == '$')
&& (Character.isDigit (input.charAt (pos + 1))))
{
// Omitted code parses the group number into val
...
if (val < start.length)
{
output.append (toString (val));
}
}
And REMatch.toString(int) returns null when the capturing group doesn't capture (irrelevant code has been omitted).
public String toString (int sub)
{
if ((sub >= start.length) || sub < 0)
throw new IndexOutOfBoundsException ("No group " + sub);
if (start[sub] == -1)
return null;
...
}
So in GNU Classpath's case, null will be appended to the string when a capturing group which fails to capture anything is specified in the replacement string.
3) Android Open Source Project - Java Core Libraries
In Android, Matcher.appendReplacement calls private method appendEvaluated, which in turn directly appends the result of group(int) to the replacement string.
public Matcher appendReplacement(StringBuffer buffer, String replacement) {
buffer.append(input.substring(appendPos, start()));
appendEvaluated(buffer, replacement);
appendPos = end();
return this;
}
private void appendEvaluated(StringBuffer buffer, String s) {
boolean escape = false;
boolean dollar = false;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '\\' && !escape) {
escape = true;
} else if (c == '$' && !escape) {
dollar = true;
} else if (c >= '0' && c <= '9' && dollar) {
buffer.append(group(c - '0'));
dollar = false;
} else {
buffer.append(c);
dollar = false;
escape = false;
}
}
// This seemingly stupid piece of code reproduces a JDK bug.
if (escape) {
throw new ArrayIndexOutOfBoundsException(s.length());
}
}
Since Matcher.group(int) returns null for a capturing group that fails to capture, Matcher.appendReplacement appends null when that group is referred to in the replacement string.
It is most likely that the two people complaining to you are running their code on Android.
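One way to sidestep the difference is to avoid $1 in the replacement string and read group(1) directly, turning null into an empty string yourself. The following is a minimal sketch under that assumption; the class name is made up for illustration, and it relies only on Matcher.find, group, start, and end.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingZeros {
    // Same pattern as in the question.
    private static final Pattern TRAILING =
        Pattern.compile("\\.0*$|(\\.\\d*?)0+$");

    static String removeTrailingZeros(String s) {
        Matcher m = TRAILING.matcher(s);
        if (!m.find()) {
            return s;                       // nothing to strip
        }
        String kept = m.group(1);           // null when the first alternative matched
        if (kept == null) {
            kept = "";
        }
        // Rebuild the string by hand instead of relying on "$1" handling.
        return s.substring(0, m.start()) + kept + s.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingZeros("23.00"));
        System.out.println(removeTrailingZeros("23.10"));
    }
}
```

Because the null check is explicit, the result no longer depends on how a given library treats an unmatched group in the replacement string.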

Having had a careful look at the Javadoc, I conclude that:
$1 is equivalent to calling group(1), which is specified to return null when the group didn't get captured.
The handling of nulls in the replacement expression is unspecified.
The wording of the relevant parts of the Javadoc is on the whole surprisingly vague (emphasis mine):
Dollar signs may be treated as references to captured subsequences as described above...

You have two alternatives | or-ed together, but only the second is between ( ) hence if the first alternative is matched, group 1 is null.
In general place the parentheses around all alternatives
In your case you want to replace
"xxx.00000" by "xxx" or else
"xxx.yyy00" by "xxx.yyy"
Better do that in two steps, as that is more readable:
"xxx.y*00" by "xxx.y*" then
"xxx." by "xxx"
This does a bit extra, changing an initial "1." to "1".
So:
.replaceFirst("(\\.\\d*?)0+$", "$1").replaceFirst("\\.$", "");
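Wrapped in a method, the two-step version might look like the sketch below (the class and method names are only for illustration). Group 1 takes part in every match of the first pattern, so $1 is never an unmatched group here.

```java
class TwoStepTrim {
    static String removeTrailingZeros(String s) {
        return s.replaceFirst("(\\.\\d*?)0+$", "$1")  // strip trailing zeros after the dot
                .replaceFirst("\\.$", "");            // then drop a dangling dot
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingZeros("23.00"));
        System.out.println(removeTrailingZeros("1.50"));
    }
}
```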

Constructor throwing runtime exception

I have a constructor that takes in a string as a parameter. I want to throw a runtime exception every time the string that is passed into the constructor contains anything that is not either "A", "C", "G", or "T". Currently this is what my code looks like:
public DNAStrandNovice(String strand) {
passedStrand = strand;
if (passedStrand.contains("a") || passedStrand.contains("c")
|| passedStrand.contains("g") || passedStrand.contains("t")) {
throw new RuntimeException("Illegal DNA strand");
} else if (passedStrand.contains("1") || passedStrand.contains("2")
|| passedStrand.contains("3") || passedStrand.contains("4")
|| passedStrand.contains("5") || passedStrand.contains("6")
|| passedStrand.contains("7") || passedStrand.contains("8")
|| passedStrand.contains("9") || passedStrand.contains("0")) {
throw new RuntimeException("Illegal DNA Strand");
} else if (passedStrand.contains(",") || passedStrand.contains(".")
|| passedStrand.contains("?") || passedStrand.contains("/")
|| passedStrand.contains("<") || passedStrand.contains(">")) {
throw new RuntimeException("Illegal DNA Strand");
}
}
I feel like this could be implemented in a much more concise way, but I don't know how. Right now I'm just checking for every character that is not the capital letters "A", "C", "G", or "T" and throwing a run time exception but I feel like it's too tedious and bad programming style. Anyone have any ideas?

Check negatively, instead of positively.
for (int i = 0; i < str.length(); i++) {
if (str.charAt(i) != 'A' && str.charAt(i) != 'C'
&& str.charAt(i) != 'G' && str.charAt(i) != 'T') {
throw new IllegalArgumentException("Bad character " + str.charAt(i));
}
}
...or, even shorter,
for (int i = 0; i < str.length(); i++) {
if ("ACGT".indexOf(str.charAt(i)) < 0) { // String.contains does not accept a char
throw new IllegalArgumentException("Bad character " + str.charAt(i));
}
}

You can achieve this using regex (regular expressions):
public DNAStrandNovice(String strand) {
if (!strand.matches("[ACGT]+")) { //or [ACGT] <-- see note below
throw new RuntimeException("Illegal DNA strand");
}
passedStrand = strand;
}
The regular expression [ACGT]+ means the string must have one or more characters, and each of them must be one of A, C, G or T. The ! in front of strand.matches reverses the boolean value returned by matches, essentially meaning if the string does not match the regex, then throw RuntimeException.
Note: If you need the string to have exactly one character, use the regex [ACGT]. If you need to allow spaces, you can use [ACGT ]+ (then trim and check for empty) or [ACGT][ACGT ]+ (which ensures the first character is not a space).
You can even do much more complex and powerful regex checks such as patterns that should contain exactly four characters repeated with spaces in between (example ATCG TACG) or even where only certain characters appear in certain places, like only A and C can appear as first two characters, and only G and T can appear following it (example ACTG is correct while AGTC is wrong). I will leave all that as an exercise.
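For readers who want a starting point for those two exercises, here is one possible pair of patterns. They are a sketch of one reading of the requirements (groups of exactly four bases separated by single spaces; the first two characters from A and C, the rest from G and T), and the class and method names are made up.

```java
class DnaPatterns {
    // Groups of exactly four bases separated by single spaces, e.g. "ATCG TACG".
    static boolean isSpacedQuads(String s) {
        return s.matches("[ACGT]{4}( [ACGT]{4})*");
    }

    // First two characters from {A, C}, all following characters from {G, T}.
    static boolean isAcThenGt(String s) {
        return s.matches("[AC]{2}[GT]+");
    }

    public static void main(String[] args) {
        System.out.println(isSpacedQuads("ATCG TACG"));
        System.out.println(isAcThenGt("ACTG"));
        System.out.println(isAcThenGt("AGTC"));
    }
}
```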

Recommend against using an exception. Define an Enum and pass that.
public enum DnaCode { A, C, G, T }
...
public DNAStrandNovice(List<DnaCode> strand) {
...
}
Or make it a DnaCode[] if you prefer. You can control the input and avoid dealing with interrupted control flow. Exceptions are rather expensive to throw and are not really intended for use as a method of flow control.
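If the input still arrives as a String somewhere, the conversion to the enum has to happen at that boundary. A rough sketch of one way to do it without throwing is shown below; the parse helper and its Optional return type are assumptions for illustration, not part of the suggestion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class DnaParsing {
    enum DnaCode { A, C, G, T }

    // Hypothetical helper: returns an empty Optional for invalid input instead of throwing.
    static Optional<List<DnaCode>> parse(String strand) {
        List<DnaCode> codes = new ArrayList<>();
        for (char c : strand.toCharArray()) {
            switch (c) {
                case 'A': codes.add(DnaCode.A); break;
                case 'C': codes.add(DnaCode.C); break;
                case 'G': codes.add(DnaCode.G); break;
                case 'T': codes.add(DnaCode.T); break;
                default:  return Optional.empty();   // anything else is rejected
            }
        }
        return Optional.of(codes);
    }

    public static void main(String[] args) {
        System.out.println(parse("ACGT"));
        System.out.println(parse("ACxT"));
    }
}
```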

You can make the code slightly more efficient by manually looping through the characters and checking each one, either with ifs or with a Set.
But honestly, unless performance is a problem, it's good as it is. Very obvious and easy to maintain.

I was going to jump in with a possibility...
public boolean validateLetter(String letter){
HashMap<String, String> dna = new HashMap<String, String>();
dna.put("A", "A");
dna.put("C", "C");
dna.put("G", "G");
dna.put("T", "T");
if(dna.get(letter) == null){
System.out.println("fail");
return false;
} else {
return true;
}
}
I would also not put that code in the constructor, rather put it in its own method and call from the constructor.

public DNAStrandNovice(String strand){
if(strand.matches("^[A-Za-z]*[0-9]+[A-Za-z]*$") || strand.matches("^[a-zA-Z]*[^a-zA-Z0-9][a-zA-Z]*$") || strand.matches("^[A-Za-z]*[acgt]+[A-Za-z]*$")){
throw new RuntimeException("Illegal DNA strand");
}
}

Why won't my code compile which checks if a string begins with vowel?

if (flipped.charAt(0) = "a" || "e" || "i" || "o" || "u"){
paren = "(" + flipped;
String firstpart = paren.substring(0,5);
String rest = paren.substring(5);
System.out.println(rest+firstpart);
}
In this code, I'm looking to check if the first character of String flipped is a vowel. If it is, I'm adding a parenthesis to the beginning and moving the first 5 characters to the end of the string. Eclipse is giving me java.lang.NullPointerException and saying that "The left-hand side of an assignment must be a variable." What can I do to fix this?

Your code has the following issues:
Use the equality operator == instead of the assignment operator = in the if statement.
Use single quotes ' instead of double quotes " for char literals.
Make a separate method for the vowel check.
boolean isVowel(char ch){
ch=Character.toLowerCase(ch);
return ch=='a' || ch=='e' || ch=='i' || ch=='o' || ch=='u';
}

Another very simple solution I often use:
if ("aeiou".indexOf(Character.toLowerCase(text.charAt(0))) >= 0) {
// text starts with a vowel.
}

You can also use regular expression matching:
if (text.matches("^[aeiou].*")) {

Use a collection that holds all of these values.
Set<Character> myList = new HashSet<Character>(Arrays.asList('a', 'e', 'i', 'o', 'u'));
if(myList.contains(Character.toLowerCase(flipped.charAt(0)))) {
// Do work
}
This line of code (while wrong: = will assign, == will compare)
if (flipped.charAt(0) == "a" || "e" || "i" || "o" || "u"){
will first compare flipped.charAt(0) == "a" which returns a boolean. Then it will continue with boolean || "e" || "i" || "o" || "u".
boolean || "e" is not valid code.

The accepted answer explained the problem but didn't quite show a fix for the check as the asker wrote it. So I figured I'd show a corrected version as well as offer my own solution to such a problem.
Someone who doesn't yet understand Boolean comparison syntax isn't going to understand all those special classes. Not to mention that some of them need imports the asker may not have, which leads to new errors to figure out. I assume this person has come to a resolution by now, given it's been 5 years, but I'm posting this in the event that someone else, or even this person, is still unsure about something.
Your original code, updated (I removed the contents inside as I don't know what they do or whether they were accurate):
char c = flipped.charAt(0);
if (c == 'a' || c == 'A' || c == 'e' || c == 'E' || c == 'i' ||
c == 'I' || c == 'o' || c == 'O' || c == 'U' || c == 'u')
{
Now this checks whether flipped.charAt(0) is a vowel, whether lowercase or uppercase. As you can see, we do a Boolean check for each case by comparing c with a character. You only wrote the comparison once, so the syntax error arose because you were applying || to non-Boolean values. Each operand of || must be false, true, or a Boolean expression such as somethingA == somethingB. If the operands are objects, you typically have to write somethingA.equals(somethingB) instead. For example, byte, int, short, long, float, and double all work fine with ==, but String requires the second form.
Below are some tips to reduce this further.
We can force char "c" to lowercase by doing any of the below methods.
char c = Character.toLowerCase(flipped.charAt(0));
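Or, more cleverly, we can set bit 5 (value 32), which maps ASCII uppercase letters to lowercase. The cast is required because the result of | is an int.
char c = (char) (flipped.charAt(0) | 32);
As such, we now only need to do the following to check if it's a vowel.
char c = (char) (flipped.charAt(0) | 32);
if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')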
{
However we can take this a step further.
We can reduce the code even more!
if (((1 << flipped.charAt(0)) & 2130466) != 0)
{
Here is how the final solution works. It's fairly involved to explain, but I'll try my best.
Java has the integer types byte, short, int, and long, which are 8, 16, 32, and 64 bits wide respectively.
When you compute 1 << N you get 2^N, a single bit set at position N. However, the shift distance N is reduced to fit the width of the type being shifted.
So (keep in mind that different languages handle this differently; in Java, byte and short operands are promoted to int before shifting):
A shift on an int uses a distance of 0-31.
A shift on a long uses a distance of 0-63.
Letters have the values A-Z = 65-90 and a-z = 97-122, so 1 << letter is actually 1 << (1-26), because those values modulo 32 (the remainder after dividing by 32) are 1-26 in both cases.
You can see this by doing the following.
A = 65.
65-32=33.
33-32=1. Stop.
So now we know A will equal 1 in this situation.
So now we do 1 << 1 or 2^1 = 2. So the letter A gives us the value "2".
Repeat this for all the vowels and we get a sum of bit values: 2 + 32 + 512 + 32768 + 2097152 = 2130466. Bit values are just powers of two added together. I can't go into much more depth here, but hopefully you have the general idea.
What the check does is take the bit for the character and AND it with 2130466, which already contains the bits for A, E, I, O, and U. If that bit is present in 2130466, the character must be one of A, E, I, O, U, and so it's a vowel.
The result of the AND is either 0 or the bit value, so we simply check that it isn't 0.
Please keep in mind that this assumes the character is known to be in A-Z or a-z. For example, "!" (value 33) would be reported as a vowel because it maps to the same bit as "A". You can solve this by checking first whether the value is below "A" or above "u" and returning early.
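To make the arithmetic concrete, the mask can be built from the vowels themselves and combined with a range check. This is a sketch with made-up names (VowelMask, VOWEL_MASK, isVowel); it relies on Java using only the low five bits of the shift distance when shifting an int.

```java
class VowelMask {
    // 'a' is 97 and 97 % 32 == 1, so (1 << 'a') sets bit 1, and likewise for the other vowels.
    static final int VOWEL_MASK =
        (1 << 'a') | (1 << 'e') | (1 << 'i') | (1 << 'o') | (1 << 'u');

    static boolean isVowel(char c) {
        // Guard against non-letters such as '!' that would alias onto a vowel bit.
        boolean isAsciiLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        return isAsciiLetter && ((1 << c) & VOWEL_MASK) != 0;
    }

    public static void main(String[] args) {
        System.out.println(VOWEL_MASK);
        System.out.println(isVowel('E') + " " + isVowel('x') + " " + isVowel('!'));
    }
}
```

The letter check here is slightly stricter than the range check described above; either one removes the false positive for "!".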

Tokenizing an algebraic expression in string format

I'm trying to take a string that represents a full algebraic expression, such as x = 15 * 6 / 3, and tokenize it into its individual components. So the first would be x, then =, then 15, then *, 6, / and finally 3.
The problem I am having is actually parsing through the string and looking at the individual characters. I can't think of a way to do this without a massive number of if statements. Surely there has to be a better way than specifically defining each individual case and testing for it.

For each type of token, you'll want to figure out how to identify:
when you're starting to read a particular token
if you're continuing to read the same token, or if you've started a different one
Let's take your example: x=15*6/3. If you could rely on there being spaces between tokens, it would be trivial: a new token starts whenever you reach a space. Let's assume that you cannot rely on that.
You can break down the character types into letters, digits, and symbols. Let's call the token types Variable, Operator, and Number.
A letter indicates a Variable token has started. It continues until you read a non-letter.
A symbol indicates the start of an Operator token. I only see single symbols, but you can have groups of symbols correspond to different Operator tokens.
A digit indicates the start of a Number token. (Let's assume integers for now.) The Number token continues until you read a non-digit.
Basically, that's how a simple symbolic parser works. Now, if you add in negative numbers (where the '-' symbol can have multiple meanings), or parentheses, or function names (like sin(x)) then things get more complicated, but it amounts to the same set of rules, now just with more choices.
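The three rules above translate fairly directly into a loop. The sketch below is one possible rendering, with made-up names, and it makes two simplifying assumptions: whitespace is skipped, and every symbol is a single-character Operator token.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                    // whitespace only separates tokens
            } else if (Character.isLetter(c)) {
                int start = i;                          // Variable: runs until a non-letter
                while (i < input.length() && Character.isLetter(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else if (Character.isDigit(c)) {
                int start = i;                          // Number: runs until a non-digit
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                tokens.add(String.valueOf(c));          // Operator: a single symbol
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x=15*6/3"));
    }
}
```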

1. Create a regular expression for each possible element: integer, variable, operator, parentheses.
2. Combine them using the | regular expression operator into one big regular expression with capture groups to identify which one matched.
3. In a loop, match the head of the remaining string and break off the matched part as a token. The type of the token depends on which sub-expression matched, as described in 2.
or
Use a lexer library, such as the one in ANTLR or JavaCC.
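The regex-based approach can be sketched roughly as follows. The group names, the operator set, and the "type:text" output format are assumptions chosen for illustration; the sketch uses named groups and Matcher.lookingAt on a moving region to match the head of the remaining input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {
    // One alternative per token type, each in its own named group.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<number>\\d+)|(?<variable>[A-Za-z]+)|(?<operator>[=+\\-*/])|(?<paren>[()]))");

    static List<String> tokenize(String input) {
        String s = input.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            m.region(pos, s.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected input at index " + pos);
            }
            // The group that is non-null tells us which sub-expression matched.
            if (m.group("number") != null) {
                tokens.add("number:" + m.group("number"));
            } else if (m.group("variable") != null) {
                tokens.add("variable:" + m.group("variable"));
            } else if (m.group("operator") != null) {
                tokens.add("operator:" + m.group("operator"));
            } else {
                tokens.add("paren:" + m.group("paren"));
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 15 * 6 / 3"));
    }
}
```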

This is from my early expression evaluator that takes an infix expression like yours and turns it into postfix to evaluate. There are methods that help the parser but I think they're pretty self documenting. Mine uses symbol tables to check tokens against. It also allows for user defined symbols and nested assignments and other things you may not need/want. But it shows how I handled your issue without using niceties like regex which would simplify this task tremendously. In addition everything shown is of my own implementation - stack and queue as well - everything. So if anything looks abnormal (unlike Java imps) that's because it is.
This section of code is important not to answer your immediate question but to show the necessary work to determine the type of token you're dealing with. In my case I had three different types of operators and two different types of operands. Based on either the known rules or rules I chose to enforce (when appropriate) it was easy to know when something was a number (starts with a number), variable/user symbol/math function (starts with a letter), or math operator (is: /,*,-,+) . Note that it only takes seeing the first char to know the correct extraction rules. From your example, if all your cases are as simple, you'd only have to handle two types, operator or operand. Nonetheless the same logic will apply.
protected Queue<Token> inToPostParse(String exp) {
// local vars
inputExp = exp;
offset = 0;
strLength = exp.length();
String tempHolder = "";
char c;
// the program runs in a loop so make sure you're dealing
// with an empty queue
q1.reset();
for (int i = offset; tempHolder != null && i < strLength; ++i) {
c = exp.charAt(i);
// Spaces are useless so skip them
if (c == ' ') { continue; }
// If c is a letter
if ((c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')) {
// Here we know it must be a user symbol possibly undefined
// at this point or an function like SIN, ABS, etc
// We extract, based on obvious rules, the op
tempHolder = extractPhrase(i); // Used to be append sequence
if (ut.isTrigOp(tempHolder) || ut.isAdditionalOp(tempHolder)) {
s1.push(new Operator(tempHolder, "Function"));
} else {
// If not some math function it is a user defined symbol
q1.insert(new Token(tempHolder, "User"));
}
i += tempHolder.length() - 1;
tempHolder = "";
// if c begins with a number
} else if (c >= '0' && c <= '9') {
try {
// Here we know that it must be a number
// so we extract until we reach a non number
tempHolder = extractNumber(i);
q1.insert(new Token(tempHolder, "Number"));
i += tempHolder.length() - 1;
tempHolder = "";
}
catch (NumberFormatException nfe) {
return null;
}
// if c is in the math symbol table
} else if (ut.isMathOp(String.valueOf(c))) {
String C = String.valueOf(c);
try {
// This is where the magic happens
// Here we determine the "intersection" of the
// current C and the top of the stack
// Based on the intersection we take action
// i.e., in math do you want to * or + first?
// Depending on the state you may have to move
// some tokens to the queue before pushing onto the stack
takeParseAction(C, ut.findIntersection
(C, s1.showTop().getSymbol()));
}
catch (NullPointerException npe) {
s1(C);
}
// it must be an invalid expression
} else {
return null;
}
}
u2();
s1.reset();
return q1;
}
Basically I have a stack (s1) and a queue (q1). All variables or numbers go into the queue. Any operators trig, math, parens, etc.. go on the stack. If the current token is to be put on the stack you have to check the state (top) to determine what parsing action to take (i.e., what to do based on math precedence). Sorry if this seems like useless information. I imagine if you're parsing a math expression it's because at some point you plan to evaluate it. IMHO, postfix is the easiest so I, regardless of input format, change it to post and evaluate with one method. If your O is different - do what you like.
Edit: Implementations
The extract phrase and number methods, which you may be most interested in, are as follows:
protected String extractPhrase(int it) {
String phrase = new String();
char c;
for ( ; it < inputExp.length(); ++it) {
c = inputExp.charAt(it);
if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
|| (c >= '0' && c <= '9')) {
phrase += String.valueOf(c);
} else {
break;
}
}
return phrase;
}
protected String extractNumber(int it) throws NumberFormatException {
String number = new String();
int decimals = 0;
char c;
for ( ; it < strLength; ++it) {
c = inputExp.charAt(it);
if (c >= '0' && c <= '9') {
number += String.valueOf(c);
} else if (c == '.') {
++decimals;
if (decimals < 2) {
number += ".";
} else {
throw new NumberFormatException();
}
} else {
break;
}
}
return number;
}
Remember - By the time they enter these methods I've already been able to deduce what type it is. This allows you to avoid the seemingly endless while-if-else chain.

Are components always separated by a space character, as in your question? If so, use algebraicExpression.split(" ") to get a String[] of components.
If no such restriction can be assumed, a possible solution is to iterate over the input and switch on the Character.getType() of the character at the current index, something like this:
ArrayList<String> getExpressionComponents(String exp) {
ArrayList<String> components = new ArrayList<String>();
String current = "";
int currentSequenceType = Character.UNASSIGNED;
for (int i = 0 ; i < exp.length() ; i++) {
if (currentSequenceType != Character.getType(exp.charAt(i))) {
if (current.length() > 0) components.add(current);
current = "";
currentSequenceType = Character.getType(exp.charAt(i));
}
switch (Character.getType(exp.charAt(i))) {
case Character.DECIMAL_DIGIT_NUMBER:
case Character.MATH_SYMBOL:
case Character.START_PUNCTUATION:
case Character.END_PUNCTUATION:
case Character.LOWERCASE_LETTER:
case Character.UPPERCASE_LETTER:
// add other required types
current = current.concat(new String(new char[] {exp.charAt(i)}));
currentSequenceType = Character.getType(exp.charAt(i));
break;
default:
current = "";
currentSequenceType = Character.UNASSIGNED;
break;
}
}
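// flush the last component, which is otherwise lost
if (current.length() > 0) components.add(current);
return components;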
}
You can easily change the cases to meet with other requirements, such as split non-digit chars to separate components etc.

What's the correct algorithm to determine number of user-perceived-characters?

I have the task of counting the number of perceived characters in an input. The input is a group of ints (we can think of it as an int[]) which represents Unicode code points.
java.text.BreakIterator.getCharacterInstance() is not allowed. (I mean their formula is allowed and is what I wanted, but weaving through their source code and state tables got me nowhere >.<)
I was wondering what's the correct algorithm to count the number of grapheme-clusters given some code points?
Initially, I'd thought that all I have to do is to combine all occurrences of:
U+0300 – U+036F (combining diacritical marks)
U+1DC0 – U+1DFF (combining diacritical marks supplement)
U+20D0 – U+20FF (combining diacritical marks for symbols)
U+FE20 - U+FE2F (combining half marks)
into the previous non-diacritic-mark.
However I've realised that prior to that operation, I have to first remove all non-characters as well.
This includes:
U+FDD0 - U+FDEF
The last two code points of every plane
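The noncharacter test described by these two rules is small enough to write down directly. The sketch below uses a made-up helper name and assumes the input is a plain int code point.

```java
class Noncharacters {
    // True for U+FDD0..U+FDEF and for the last two code points of every plane (U+xFFFE, U+xFFFF).
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            return false;                                // not a Unicode code point at all
        }
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        System.out.println(isNoncharacter(0xFFFE) + " " + isNoncharacter('A'));
    }
}
```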
But there seem to be more things to do. Unicode.org states we need to include U+200C (zero width non-joiner) and U+200D (zero width joiner) as part of the set of continuing characters (source).
Besides that, it talks about a couple more things, but the entire topic is treated in an abstract way. For example, what are the code point ranges for spacing combining marks, or for the Hangul jamo characters that form Hangul syllables?
Does anyone know the correct algorithm to count the number of grapheme-clusters given an int[] of code points?

There's not a single canonical method appropriate to all uses, but a good starting point is the Unicode Grapheme Cluster Boundary algorithm on the Unicode.org page you link to. Basically, Unicode provides a database of each code point's grapheme break property, and then describes an algorithm to decide if a grapheme break is allowed between two code points based on their assigned grapheme break properties.
Here's part of an implementation (in C++) I played around with a while ago:
bool BoundaryAllowed(char32_t cp, char32_t cp2) {
// lbp: left break property; rbp: right break property
auto lbp = get_property_for_codepoint(cp),
rbp = get_property_for_codepoint(cp2);
// Do not break between a CR and LF. Otherwise, break before and after
// controls.
if ((CR == lbp && LF == rbp)) {
// The Unicode grapheme boundary algorithm does not handle LFCR new lines
return false;
}
if (Control == lbp || CR == lbp || LF == lbp || Control == rbp || CR == rbp ||
LF == rbp) {
return true;
}
// Do not break Hangul syllable sequences.
if ((L == lbp && (L == rbp || V == rbp || LV == rbp || LVT == rbp)) ||
((LV == lbp || V == lbp) && (V == rbp || T == rbp)) ||
((LVT == lbp || T == lbp) && (T == rbp))) {
return false;
}
// Do not break before extending characters.
if (Extend == rbp) {
return false;
}
// Do not break before SpacingMarks, or after Prepend characters.
if (Prepend == lbp || SpacingMark == rbp) {
return false;
}
return true; // Otherwise, break everywhere.
}
In order to obtain the ranges for the different types of codepoints you'll just have to look at the Unicode Character Database. The file with the grapheme break properties, which describes them in terms of the ranges, is about 1200 lines long: http://www.unicode.org/Public/6.1.0/ucd/auxiliary/
I'm not really sure how much value there is in ignoring non-character code points, but if your use requires it then you'll add that in to your implementation.
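For the int[] input described in the question, counting then reduces to one plus the number of allowed boundaries between adjacent code points. The Java sketch below shows only that counting loop. It is not the full Unicode algorithm: instead of the Grapheme_Cluster_Break data it approximates "extending" code points with Character.getType general categories plus U+200C and U+200D, and it leaves out the Hangul and Prepend rules entirely. All names in it are made up.

```java
class GraphemeCount {
    // Approximation (assumption): Mn/Me/Mc marks and ZWNJ/ZWJ continue the current cluster.
    static boolean continuesCluster(int cp) {
        if (cp == 0x200C || cp == 0x200D) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    static boolean boundaryAllowed(int left, int right) {
        if (left == '\r' && right == '\n') {
            return false;                               // do not break between CR and LF
        }
        if (Character.getType(left) == Character.CONTROL) {
            return true;                                // always break after controls
        }
        return !continuesCluster(right);                // do not break before extenders
    }

    static int countClusters(int[] codePoints) {
        if (codePoints.length == 0) {
            return 0;
        }
        int count = 1;
        for (int i = 1; i < codePoints.length; i++) {
            if (boundaryAllowed(codePoints[i - 1], codePoints[i])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a'
        System.out.println(countClusters(new int[] {'e', 0x0301, 'a'}));
    }
}
```

A real implementation would replace continuesCluster with a lookup into the grapheme break property file mentioned above.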
