Related
I need to write a Java Comparator class that compares Strings, however with one twist. If the two strings it is comparing are the same at the beginning and end of the string are the same, and the middle part that differs is an integer, then compare based on the numeric values of those integers. For example, I want the following strings to end up in order they're shown:
aaa
bbb 3 ccc
bbb 12 ccc
ccc 11
ddd
eee 3 ddd jpeg2000 eee
eee 12 ddd jpeg2000 eee
As you can see, there might be other integers in the string, so I can't just use regular expressions to break out any integer. I'm thinking of just walking the strings from the beginning until I find a bit that doesn't match, then walking in from the end until I find a bit that doesn't match, and then comparing the bit in the middle to the regular expression "[0-9]+", and if it compares, then doing a numeric comparison, otherwise doing a lexical comparison.
Is there a better way?
Update I don't think I can guarantee that the other numbers in the string, the ones that may match, don't have spaces around them, or that the ones that differ do have spaces.
The Alphanum Algorithm
From the website
"People sort strings with numbers differently than software. Most sorting algorithms compare ASCII values, which produces an ordering that is inconsistent with human logic. Here's how to fix it."
Edit: Here's a link to the Java Comparator Implementation from that site.
Interesting little challenge, I enjoyed solving it.
Here is my take at the problem:
String[] strs =
{
"eee 5 ddd jpeg2001 eee",
"eee 123 ddd jpeg2000 eee",
"ddd",
"aaa 5 yy 6",
"ccc 555",
"bbb 3 ccc",
"bbb 9 a",
"",
"eee 4 ddd jpeg2001 eee",
"ccc 11",
"bbb 12 ccc",
"aaa 5 yy 22",
"aaa",
"eee 3 ddd jpeg2000 eee",
"ccc 5",
};
Pattern splitter = Pattern.compile("(\\d+|\\D+)");
public class InternalNumberComparator implements Comparator
{
public int compare(Object o1, Object o2)
{
// I deliberately use the Java 1.4 syntax,
// all this can be improved with 1.5's generics
String s1 = (String)o1, s2 = (String)o2;
// We split each string as runs of number/non-number strings
ArrayList sa1 = split(s1);
ArrayList sa2 = split(s2);
// Nothing or different structure
if (sa1.size() == 0 || sa1.size() != sa2.size())
{
// Just compare the original strings
return s1.compareTo(s2);
}
int i = 0;
String si1 = "";
String si2 = "";
// Compare beginning of string
for (; i < sa1.size(); i++)
{
si1 = (String)sa1.get(i);
si2 = (String)sa2.get(i);
if (!si1.equals(si2))
break; // Until we find a difference
}
// No difference found?
if (i == sa1.size())
return 0; // Same strings!
// Try to convert the different run of characters to number
int val1, val2;
try
{
val1 = Integer.parseInt(si1);
val2 = Integer.parseInt(si2);
}
catch (NumberFormatException e)
{
return s1.compareTo(s2); // Strings differ on a non-number
}
// Compare remainder of string
for (i++; i < sa1.size(); i++)
{
si1 = (String)sa1.get(i);
si2 = (String)sa2.get(i);
if (!si1.equals(si2))
{
return s1.compareTo(s2); // Strings differ
}
}
// Here, the strings differ only on a number
return val1 < val2 ? -1 : 1;
}
ArrayList split(String s)
{
ArrayList r = new ArrayList();
Matcher matcher = splitter.matcher(s);
while (matcher.find())
{
String m = matcher.group(1);
r.add(m);
}
return r;
}
}
Arrays.sort(strs, new InternalNumberComparator());
This algorithm need much more testing, but it seems to behave rather nicely.
[EDIT] I added some more comments to be clearer. I see there are much more answers than when I started to code this... But I hope I provided a good starting base and/or some ideas.
Ian Griffiths of Microsoft has a C# implementation he calls Natural Sorting. Porting to Java should be fairly easy, easier than from C anyway!
UPDATE: There seems to be a Java example on eekboom that does this, see the "compareNatural" and use that as your comparer to sorts.
The implementation I propose here is simple and efficient. It does not allocate any extra memory, directly or indirectly by using regular expressions or methods such as substring(), split(), toCharArray(), etc.
This implementation first goes across both strings to search for the first characters that are different, at maximal speed, without doing any special processing during this. Specific number comparison is triggered only when these characters are both digits.
public static final int compareNatural (String s1, String s2)
{
// Skip all identical characters
int len1 = s1.length();
int len2 = s2.length();
int i;
char c1, c2;
for (i = 0, c1 = 0, c2 = 0; (i < len1) && (i < len2) && (c1 = s1.charAt(i)) == (c2 = s2.charAt(i)); i++);
// Check end of string
if (c1 == c2)
return(len1 - len2);
// Check digit in first string
if (Character.isDigit(c1))
{
// Check digit only in first string
if (!Character.isDigit(c2))
return(1);
// Scan all integer digits
int x1, x2;
for (x1 = i + 1; (x1 < len1) && Character.isDigit(s1.charAt(x1)); x1++);
for (x2 = i + 1; (x2 < len2) && Character.isDigit(s2.charAt(x2)); x2++);
// Longer integer wins, first digit otherwise
return(x2 == x1 ? c1 - c2 : x1 - x2);
}
// Check digit only in second string
if (Character.isDigit(c2))
return(-1);
// No digits
return(c1 - c2);
}
I came up with a quite simple implementation in Java using regular expressions:
public static Comparator<String> naturalOrdering() {
final Pattern compile = Pattern.compile("(\\d+)|(\\D+)");
return (s1, s2) -> {
final Matcher matcher1 = compile.matcher(s1);
final Matcher matcher2 = compile.matcher(s2);
while (true) {
final boolean found1 = matcher1.find();
final boolean found2 = matcher2.find();
if (!found1 || !found2) {
return Boolean.compare(found1, found2);
} else if (!matcher1.group().equals(matcher2.group())) {
if (matcher1.group(1) == null || matcher2.group(1) == null) {
return matcher1.group().compareTo(matcher2.group());
} else {
return Integer.valueOf(matcher1.group(1)).compareTo(Integer.valueOf(matcher2.group(1)));
}
}
}
};
}
Here is how it works:
final List<String> strings = Arrays.asList("x15", "xa", "y16", "x2a", "y11", "z", "z5", "x2b", "z");
strings.sort(naturalOrdering());
System.out.println(strings);
[x2a, x2b, x15, xa, y11, y16, z, z, z5]
I realize you're in java, but you can take a look at how StrCmpLogicalW works. It's what Explorer uses to sort filenames in Windows. You can look at the WINE implementation here.
Split the string into runs of letters and numbers, so "foo 12 bar" becomes the list ("foo", 12, "bar"), then use the list as the sort key. This way the numbers will be ordered in numerical order, not alphabetical.
Here is the solution with the following advantages over Alphanum Algorithm:
3.25x times faster (tested on the data from 'Epilogue' chapter of Alphanum description)
Does not consume extra memory (no string splitting, no numbers parsing)
Processes leading zeros correctly (e.g. "0001" equals "1", "01234" is less than "4567")
public class NumberAwareComparator implements Comparator<String>
{
#Override
public int compare(String s1, String s2)
{
int len1 = s1.length();
int len2 = s2.length();
int i1 = 0;
int i2 = 0;
while (true)
{
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (Character.isDigit(ch1) && Character.isDigit(ch2))
{
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && Character.isDigit(s1.charAt(end1)))
end1++;
while (end2 < len2 && Character.isDigit(s2.charAt(end2)))
end2++;
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
// if the lengths are different, then the longer number is bigger
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1)
{
if (s1.charAt(i1) != s2.charAt(i2))
return s1.charAt(i1) - s2.charAt(i2);
i1++;
i2++;
}
}
else
{
// plain characters comparison
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
}
Instead of reinventing the wheel, I'd suggest to use a locale-aware Unicode-compliant string comparator that has built-in number sorting from the ICU4J library.
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
public class CollatorExample {
public static void main(String[] args) {
// Make sure to choose correct locale: in Turkish uppercase of "i" is "İ", not "I"
RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.US);
collator.setNumericCollation(true); // Place "10" after "2"
collator.setStrength(Collator.PRIMARY); // Case-insensitive
List<String> strings = Arrays.asList("10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a"
);
strings.sort(collator);
System.out.println(String.join(", ", strings));
// Output: _1, _01, _2, _200, 01, 001, 1,
// 2, 02, 10, 10, 010, 20, 100, A 02, A01,
// a2, A20, t1A, t1a, t1ab, t1aB, t1Ab, t1AB,
// T010T01, T0010T01
}
}
The Alphanum algrothim is nice, but it did not match requirements for a project I'm working on. I need to be able to sort negative numbers and decimals correctly. Here is the implementation I came up. Any feedback would be much appreciated.
public class StringAsNumberComparator implements Comparator<String> {
public static final Pattern NUMBER_PATTERN = Pattern.compile("(\\-?\\d+\\.\\d+)|(\\-?\\.\\d+)|(\\-?\\d+)");
/**
* Splits strings into parts sorting each instance of a number as a number if there is
* a matching number in the other String.
*
* For example A1B, A2B, A11B, A11B1, A11B2, A11B11 will be sorted in that order instead
* of alphabetically which will sort A1B and A11B together.
*/
public int compare(String str1, String str2) {
if(str1 == str2) return 0;
else if(str1 == null) return 1;
else if(str2 == null) return -1;
List<String> split1 = split(str1);
List<String> split2 = split(str2);
int diff = 0;
for(int i = 0; diff == 0 && i < split1.size() && i < split2.size(); i++) {
String token1 = split1.get(i);
String token2 = split2.get(i);
if((NUMBER_PATTERN.matcher(token1).matches() && NUMBER_PATTERN.matcher(token2).matches()) {
diff = (int) Math.signum(Double.parseDouble(token1) - Double.parseDouble(token2));
} else {
diff = token1.compareToIgnoreCase(token2);
}
}
if(diff != 0) {
return diff;
} else {
return split1.size() - split2.size();
}
}
/**
* Splits a string into strings and number tokens.
*/
private List<String> split(String s) {
List<String> list = new ArrayList<String>();
try (Scanner scanner = new Scanner(s)) {
int index = 0;
String num = null;
while ((num = scanner.findInLine(NUMBER_PATTERN)) != null) {
int indexOfNumber = s.indexOf(num, index);
if (indexOfNumber > index) {
list.add(s.substring(index, indexOfNumber));
}
list.add(num);
index = indexOfNumber + num.length();
}
if (index < s.length()) {
list.add(s.substring(index));
}
}
return list;
}
}
PS. I wanted to use the java.lang.String.split() method and use "lookahead/lookbehind" to keep the tokens, but I could not get it to work with the regular expression I was using.
interesting problem, and here my proposed solution:
import java.util.Collections;
import java.util.Vector;
public class CompareToken implements Comparable<CompareToken>
{
int valN;
String valS;
String repr;
public String toString() {
return repr;
}
public CompareToken(String s) {
int l = 0;
char data[] = new char[s.length()];
repr = s;
valN = 0;
for (char c : s.toCharArray()) {
if(Character.isDigit(c))
valN = valN * 10 + (c - '0');
else
data[l++] = c;
}
valS = new String(data, 0, l);
}
public int compareTo(CompareToken b) {
int r = valS.compareTo(b.valS);
if (r != 0)
return r;
return valN - b.valN;
}
public static void main(String [] args) {
String [] strings = {
"aaa",
"bbb3ccc",
"bbb12ccc",
"ccc 11",
"ddd",
"eee3dddjpeg2000eee",
"eee12dddjpeg2000eee"
};
Vector<CompareToken> data = new Vector<CompareToken>();
for(String s : strings)
data.add(new CompareToken(s));
Collections.shuffle(data);
Collections.sort(data);
for (CompareToken c : data)
System.out.println ("" + c);
}
}
Prior to discovering this thread, I implemented a similar solution in javascript. Perhaps my strategy will find you well, despite different syntax. Similar to above, I parse the two strings being compared, and split them both into arrays, dividing the strings at continuous numbers.
...
var regex = /(\d+)/g,
str1Components = str1.split(regex),
str2Components = str2.split(regex),
...
I.e., 'hello22goodbye 33' => ['hello', 22, 'goodbye ', 33]; Thus, you can walk through the arrays' elements in pairs between string1 and string2, do some type coercion (such as, is this element really a number?), and compare as you walk.
Working example here: http://jsfiddle.net/F46s6/3/
Note, I currently only support integer types, though handling decimal values wouldn't be too hard of a modification.
My 2 cents.Is working well for me. I am mainly using it for filenames.
private final boolean isDigit(char ch)
{
return ch >= 48 && ch <= 57;
}
private int compareNumericalString(String s1,String s2){
int s1Counter=0;
int s2Counter=0;
while(true){
if(s1Counter>=s1.length()){
break;
}
if(s2Counter>=s2.length()){
break;
}
char currentChar1=s1.charAt(s1Counter++);
char currentChar2=s2.charAt(s2Counter++);
if(isDigit(currentChar1) &&isDigit(currentChar2)){
String digitString1=""+currentChar1;
String digitString2=""+currentChar2;
while(true){
if(s1Counter>=s1.length()){
break;
}
if(s2Counter>=s2.length()){
break;
}
if(isDigit(s1.charAt(s1Counter))){
digitString1+=s1.charAt(s1Counter);
s1Counter++;
}
if(isDigit(s2.charAt(s2Counter))){
digitString2+=s2.charAt(s2Counter);
s2Counter++;
}
if((!isDigit(s1.charAt(s1Counter))) && (!isDigit(s2.charAt(s2Counter)))){
currentChar1=s1.charAt(s1Counter);
currentChar2=s2.charAt(s2Counter);
break;
}
}
if(!digitString1.equals(digitString2)){
return Integer.parseInt(digitString1)-Integer.parseInt(digitString2);
}
}
if(currentChar1!=currentChar2){
return currentChar1-currentChar2;
}
}
return s1.compareTo(s2);
}
I created a project to compare the different implementations. It is far from complete, but it is a starting point.
Adding on to the answer made by #stanislav.
A few problems I faced while using the answer provided was:
Capital and small letters are separated by the characters between their ASCII codes. This breaks the flow when the strings being sorted have _ or other characters which are between small letters and capital letters in ASCII.
If two strings are the same except for the leading zeroes count being different, the function returns 0 which will make the sort depend on the original positions of the string in the list.
These two issues have been fixed in the new code. And I made a few function instead of a few repetitive set of code. The differentCaseCompared variable keeps track of whether if two strings are the same except for the cases being different. If so the value of the first different case characters subtracted is returned. This is done to avoid the issue of having two strings differing by case returned as 0.
public class NaturalSortingComparator implements Comparator<String> {
#Override
public int compare(String string1, String string2) {
int lengthOfString1 = string1.length();
int lengthOfString2 = string2.length();
int iteratorOfString1 = 0;
int iteratorOfString2 = 0;
int differentCaseCompared = 0;
while (true) {
if (iteratorOfString1 == lengthOfString1) {
if (iteratorOfString2 == lengthOfString2) {
if (lengthOfString1 == lengthOfString2) {
// If both strings are the same except for the different cases, the differentCaseCompared will be returned
return differentCaseCompared;
}
//If the characters are the same at the point, returns the difference between length of the strings
else {
return lengthOfString1 - lengthOfString2;
}
}
//If String2 is bigger than String1
else
return -1;
}
//Check if String1 is bigger than string2
if (iteratorOfString2 == lengthOfString2) {
return 1;
}
char ch1 = string1.charAt(iteratorOfString1);
char ch2 = string2.charAt(iteratorOfString2);
if (Character.isDigit(ch1) && Character.isDigit(ch2)) {
// skip leading zeros
iteratorOfString1 = skipLeadingZeroes(string1, lengthOfString1, iteratorOfString1);
iteratorOfString2 = skipLeadingZeroes(string2, lengthOfString2, iteratorOfString2);
// find the ends of the numbers
int endPositionOfNumbersInString1 = findEndPositionOfNumber(string1, lengthOfString1, iteratorOfString1);
int endPositionOfNumbersInString2 = findEndPositionOfNumber(string2, lengthOfString2, iteratorOfString2);
int lengthOfDigitsInString1 = endPositionOfNumbersInString1 - iteratorOfString1;
int lengthOfDigitsInString2 = endPositionOfNumbersInString2 - iteratorOfString2;
// if the lengths are different, then the longer number is bigger
if (lengthOfDigitsInString1 != lengthOfDigitsInString2)
return lengthOfDigitsInString1 - lengthOfDigitsInString2;
// compare numbers digit by digit
while (iteratorOfString1 < endPositionOfNumbersInString1) {
if (string1.charAt(iteratorOfString1) != string2.charAt(iteratorOfString2))
return string1.charAt(iteratorOfString1) - string2.charAt(iteratorOfString2);
iteratorOfString1++;
iteratorOfString2++;
}
} else {
// plain characters comparison
if (ch1 != ch2) {
if (!ignoreCharacterCaseEquals(ch1, ch2))
return Character.toLowerCase(ch1) - Character.toLowerCase(ch2);
// Set a differentCaseCompared if the characters being compared are different case.
// Should be done only once, hence the check with 0
if (differentCaseCompared == 0) {
differentCaseCompared = ch1 - ch2;
}
}
iteratorOfString1++;
iteratorOfString2++;
}
}
}
private boolean ignoreCharacterCaseEquals(char character1, char character2) {
return Character.toLowerCase(character1) == Character.toLowerCase(character2);
}
private int findEndPositionOfNumber(String string, int lengthOfString, int end) {
while (end < lengthOfString && Character.isDigit(string.charAt(end)))
end++;
return end;
}
private int skipLeadingZeroes(String string, int lengthOfString, int iteratorOfString) {
while (iteratorOfString < lengthOfString && string.charAt(iteratorOfString) == '0')
iteratorOfString++;
return iteratorOfString;
}
}
The following is a unit test I used.
public class NaturalSortingComparatorTest {
private int NUMBER_OF_TEST_CASES = 100000;
#Test
public void compare() {
NaturalSortingComparator naturalSortingComparator = new NaturalSortingComparator();
List<String> expectedStringList = getCorrectStringList();
List<String> testListOfStrings = createTestListOfStrings();
runTestCases(expectedStringList, testListOfStrings, NUMBER_OF_TEST_CASES, naturalSortingComparator);
}
private void runTestCases(List<String> expectedStringList, List<String> testListOfStrings,
int numberOfTestCases, Comparator<String> comparator) {
for (int testCase = 0; testCase < numberOfTestCases; testCase++) {
Collections.shuffle(testListOfStrings);
testListOfStrings.sort(comparator);
Assert.assertEquals(expectedStringList, testListOfStrings);
}
}
private List<String> getCorrectStringList() {
return Arrays.asList(
"1", "01", "001", "2", "02", "10", "10", "010",
"20", "100", "_1", "_01", "_2", "_200", "A 02",
"A01", "a2", "A20", "t1A", "t1a", "t1AB", "t1Ab",
"t1aB", "t1ab", "T010T01", "T0010T01");
}
private List<String> createTestListOfStrings() {
return Arrays.asList(
"10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a");
}
}
Suggestions welcome! I am not sure whether adding the functions changes anything other than the readability part of things.
P.S: Sorry to add another answer to this question. But I don't have enough reps to comment on the answer which I modified for my use.
modification of this answer
case insensitive order (1000a is less than 1000X)
nulls handling
implementation:
import static java.lang.Math.pow;
import java.util.Comparator;
public class AlphanumComparator implements Comparator<String> {
public static final AlphanumComparator ALPHANUM_COMPARATOR = new AlphanumComparator();
private static char[] upperCaseCache = new char[(int) pow(2, 16)];
private boolean nullIsLess;
public AlphanumComparator() {
}
public AlphanumComparator(boolean nullIsLess) {
this.nullIsLess = nullIsLess;
}
#Override
public int compare(String s1, String s2) {
if (s1 == s2)
return 0;
if (s1 == null)
return nullIsLess ? -1 : 1;
if (s2 == null)
return nullIsLess ? 1 : -1;
int i1 = 0;
int i2 = 0;
int len1 = s1.length();
int len2 = s2.length();
while (true) {
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (isDigit(ch1) && isDigit(ch2)) {
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && isDigit(s1.charAt(end1)))
end1++;
while (end2 != len2 && isDigit(s2.charAt(end2)))
end2++;
// if the lengths are different, then the longer number is bigger
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1) {
ch1 = s1.charAt(i1);
ch2 = s2.charAt(i2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
} else {
ch1 = toUpperCase(ch1);
ch2 = toUpperCase(ch2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
private boolean isDigit(char ch) {
return ch >= 48 && ch <= 57;
}
private char toUpperCase(char ch) {
char cached = upperCaseCache[ch];
if (cached == 0) {
cached = Character.toUpperCase(ch);
upperCaseCache[ch] = cached;
}
return cached;
}
}
I think you'll have to do the comparison on a character-by-character fashion. Grab a character, if it's a number character, keep grabbing, then reassemble to characters into a single number string and convert it into an int. Repeat on the other string, and only then do the comparison.
Short answer: based on the context, I can't tell whether this is just some quick-and-dirty code for personal use, or a key part of Goldman Sachs' latest internal accounting software, so I'll open by saying: eww. That's a rather funky sorting algorithm; try to use something a bit less "twisty" if you can.
Long answer:
The two issues that immediately come to mind in your case are performance, and correctness. Informally, make sure it's fast, and make sure your algorithm is a total ordering.
(Of course, if you're not sorting more than about 100 items, you can probably disregard this paragraph.) Performance matters, as the speed of the comparator will be the largest factor in the speed of your sort (assuming the sort algorithm is "ideal" to the typical list). In your case, the comparator's speed will depend mainly on the size of the string. The strings seem to be fairly short, so they probably won't dominate as much as the size of your list.
Turning each string into a string-number-string tuple and then sorting this list of tuples, as suggested in another answer, will fail in some of your cases, since you apparently will have strings with multiple numbers appearing.
The other problem is correctness. Specifically, if the algorithm you described will ever permit A > B > ... > A, then your sort will be non-deterministic. In your case, I fear that it might, though I can't prove it. Consider some parsing cases such as:
aa 0 aa
aa 23aa
aa 2a3aa
aa 113aa
aa 113 aa
a 1-2 a
a 13 a
a 12 a
a 2-3 a
a 21 a
a 2.3 a
Although the question asked a java solution, for anyone who wants a scala solution:
object Alphanum {
private[this] val regex = "((?<=[0-9])(?=[^0-9]))|((?<=[^0-9])(?=[0-9]))"
private[this] val alphaNum: Ordering[String] = Ordering.fromLessThan((ss1: String, ss2: String) => (ss1, ss2) match {
case (sss1, sss2) if sss1.matches("[0-9]+") && sss2.matches("[0-9]+") => sss1.toLong < sss2.toLong
case (sss1, sss2) => sss1 < sss2
})
def ordering: Ordering[String] = Ordering.fromLessThan((s1: String, s2: String) => {
import Ordering.Implicits.infixOrderingOps
implicit val ord: Ordering[List[String]] = Ordering.Implicits.seqDerivedOrdering(alphaNum)
s1.split(regex).toList < s2.split(regex).toList
})
}
My problem was that I have lists consisting of a combination of alpha numeric strings (eg C22, C3, C5 etc), alpha strings (eg A, H, R etc) and just digits (eg 99, 45 etc) that need sorting in the order A, C3, C5, C22, H, R, 45, 99. I also have duplicates that need removing so I only get a single entry.
I'm also not just working with Strings, I'm ordering an Object and using a specific field within the Object to get the correct order.
A solution that seems to work for me is :
SortedSet<Code> codeSet;
codeSet = new TreeSet<Code>(new Comparator<Code>() {
private boolean isThereAnyNumber(String a, String b) {
return isNumber(a) || isNumber(b);
}
private boolean isNumber(String s) {
return s.matches("[-+]?\\d*\\.?\\d+");
}
private String extractChars(String s) {
String chars = s.replaceAll("\\d", "");
return chars;
}
private int extractInt(String s) {
String num = s.replaceAll("\\D", "");
return num.isEmpty() ? 0 : Integer.parseInt(num);
}
private int compareStrings(String o1, String o2) {
if (!extractChars(o1).equals(extractChars(o2))) {
return o1.compareTo(o2);
} else
return extractInt(o1) - extractInt(o2);
}
#Override
public int compare(Code a, Code b) {
return isThereAnyNumber(a.getPrimaryCode(), b.getPrimaryCode())
? isNumber(a.getPrimaryCode()) ? 1 : -1
: compareStrings(a.getPrimaryCode(), b.getPrimaryCode());
}
});
It 'borrows' some code that I found here on Stackoverflow plus some tweaks of my own to get it working just how I needed it too.
Due to trying to order Objects, needing a comparator as well as duplicate removal, one negative fudge I had to employ was I first have to write my Objects to a TreeMap before writing them to a Treeset. It may impact performance a little but given that the lists will be a max of about 80 Codes, it shouldn't be a problem.
I had a similar problem where my strings had space-separated segments inside. I solved it in this way:
public class StringWithNumberComparator implements Comparator<MyClass> {
#Override
public int compare(MyClass o1, MyClass o2) {
if (o1.getStringToCompare().equals(o2.getStringToCompare())) {
return 0;
}
String[] first = o1.getStringToCompare().split(" ");
String[] second = o2.getStringToCompare().split(" ");
if (first.length == second.length) {
for (int i = 0; i < first.length; i++) {
int segmentCompare = StringUtils.compare(first[i], second[i]);
if (StringUtils.isNumeric(first[i]) && StringUtils.isNumeric(second[i])) {
segmentCompare = NumberUtils.compare(Integer.valueOf(first[i]), Integer.valueOf(second[i]));
if (0 != segmentCompare) {
// return only if uneven numbers in case there are more segments to be checked
return segmentCompare;
}
}
if (0 != segmentCompare) {
return segmentCompare;
}
}
} else {
return StringUtils.compare(o1.getDenominazione(), o2.getDenominazione());
}
return 0;
}
As you can see I have used Apaches StringUtils.compare() and NumberUtils.compere() as a standard help.
In your given example, the numbers you want to compare have spaces around them while the other numbers do not, so why would a regular expression not work?
bbb 12 ccc
vs.
eee 12 ddd jpeg2000 eee
If you're writing a comparator class, you should implement your own compare method that will compare two strings character by character. This compare method should check if you're dealing with alphabetic characters, numeric characters, or mixed types (including spaces). You'll have to define how you want a mixed type to act, whether numbers come before alphabetic characters or after, and where spaces fit in etc.
On Linux glibc provides strverscmp(), it's also available from gnulib for portability. However truly "human" sorting has lots of other quirks like "The Beatles" being sorted as "Beatles, The". There is no simple solution to this generic problem.
I need to write a Java Comparator class that compares Strings, however with one twist. If the two strings it is comparing are the same at the beginning and end of the string are the same, and the middle part that differs is an integer, then compare based on the numeric values of those integers. For example, I want the following strings to end up in order they're shown:
aaa
bbb 3 ccc
bbb 12 ccc
ccc 11
ddd
eee 3 ddd jpeg2000 eee
eee 12 ddd jpeg2000 eee
As you can see, there might be other integers in the string, so I can't just use regular expressions to break out any integer. I'm thinking of just walking the strings from the beginning until I find a bit that doesn't match, then walking in from the end until I find a bit that doesn't match, and then comparing the bit in the middle to the regular expression "[0-9]+", and if it compares, then doing a numeric comparison, otherwise doing a lexical comparison.
Is there a better way?
Update I don't think I can guarantee that the other numbers in the string, the ones that may match, don't have spaces around them, or that the ones that differ do have spaces.
The Alphanum Algorithm
From the website
"People sort strings with numbers differently than software. Most sorting algorithms compare ASCII values, which produces an ordering that is inconsistent with human logic. Here's how to fix it."
Edit: Here's a link to the Java Comparator Implementation from that site.
Interesting little challenge, I enjoyed solving it.
Here is my take at the problem:
String[] strs =
{
"eee 5 ddd jpeg2001 eee",
"eee 123 ddd jpeg2000 eee",
"ddd",
"aaa 5 yy 6",
"ccc 555",
"bbb 3 ccc",
"bbb 9 a",
"",
"eee 4 ddd jpeg2001 eee",
"ccc 11",
"bbb 12 ccc",
"aaa 5 yy 22",
"aaa",
"eee 3 ddd jpeg2000 eee",
"ccc 5",
};
Pattern splitter = Pattern.compile("(\\d+|\\D+)");
public class InternalNumberComparator implements Comparator
{
public int compare(Object o1, Object o2)
{
// I deliberately use the Java 1.4 syntax,
// all this can be improved with 1.5's generics
String s1 = (String)o1, s2 = (String)o2;
// We split each string as runs of number/non-number strings
ArrayList sa1 = split(s1);
ArrayList sa2 = split(s2);
// Nothing or different structure
if (sa1.size() == 0 || sa1.size() != sa2.size())
{
// Just compare the original strings
return s1.compareTo(s2);
}
int i = 0;
String si1 = "";
String si2 = "";
// Compare beginning of string
for (; i < sa1.size(); i++)
{
si1 = (String)sa1.get(i);
si2 = (String)sa2.get(i);
if (!si1.equals(si2))
break; // Until we find a difference
}
// No difference found?
if (i == sa1.size())
return 0; // Same strings!
// Try to convert the different run of characters to number
int val1, val2;
try
{
val1 = Integer.parseInt(si1);
val2 = Integer.parseInt(si2);
}
catch (NumberFormatException e)
{
return s1.compareTo(s2); // Strings differ on a non-number
}
// Compare remainder of string
for (i++; i < sa1.size(); i++)
{
si1 = (String)sa1.get(i);
si2 = (String)sa2.get(i);
if (!si1.equals(si2))
{
return s1.compareTo(s2); // Strings differ
}
}
// Here, the strings differ only on a number
return val1 < val2 ? -1 : 1;
}
ArrayList split(String s)
{
ArrayList r = new ArrayList();
Matcher matcher = splitter.matcher(s);
while (matcher.find())
{
String m = matcher.group(1);
r.add(m);
}
return r;
}
}
Arrays.sort(strs, new InternalNumberComparator());
This algorithm need much more testing, but it seems to behave rather nicely.
[EDIT] I added some more comments to be clearer. I see there are much more answers than when I started to code this... But I hope I provided a good starting base and/or some ideas.
Ian Griffiths of Microsoft has a C# implementation he calls Natural Sorting. Porting to Java should be fairly easy, easier than from C anyway!
UPDATE: There seems to be a Java example on eekboom that does this, see the "compareNatural" and use that as your comparer to sorts.
The implementation I propose here is simple and efficient. It does not allocate any extra memory, directly or indirectly by using regular expressions or methods such as substring(), split(), toCharArray(), etc.
This implementation first goes across both strings to search for the first characters that are different, at maximal speed, without doing any special processing during this. Specific number comparison is triggered only when these characters are both digits.
public static final int compareNatural (String s1, String s2)
{
// Skip all identical characters
int len1 = s1.length();
int len2 = s2.length();
int i;
char c1, c2;
for (i = 0, c1 = 0, c2 = 0; (i < len1) && (i < len2) && (c1 = s1.charAt(i)) == (c2 = s2.charAt(i)); i++);
// Check end of string
if (c1 == c2)
return(len1 - len2);
// Check digit in first string
if (Character.isDigit(c1))
{
// Check digit only in first string
if (!Character.isDigit(c2))
return(1);
// Scan all integer digits
int x1, x2;
for (x1 = i + 1; (x1 < len1) && Character.isDigit(s1.charAt(x1)); x1++);
for (x2 = i + 1; (x2 < len2) && Character.isDigit(s2.charAt(x2)); x2++);
// Longer integer wins, first digit otherwise
return(x2 == x1 ? c1 - c2 : x1 - x2);
}
// Check digit only in second string
if (Character.isDigit(c2))
return(-1);
// No digits
return(c1 - c2);
}
I came up with a quite simple implementation in Java using regular expressions:
public static Comparator<String> naturalOrdering() {
final Pattern compile = Pattern.compile("(\\d+)|(\\D+)");
return (s1, s2) -> {
final Matcher matcher1 = compile.matcher(s1);
final Matcher matcher2 = compile.matcher(s2);
while (true) {
final boolean found1 = matcher1.find();
final boolean found2 = matcher2.find();
if (!found1 || !found2) {
return Boolean.compare(found1, found2);
} else if (!matcher1.group().equals(matcher2.group())) {
if (matcher1.group(1) == null || matcher2.group(1) == null) {
return matcher1.group().compareTo(matcher2.group());
} else {
return Integer.valueOf(matcher1.group(1)).compareTo(Integer.valueOf(matcher2.group(1)));
}
}
}
};
}
Here is how it works:
final List<String> strings = Arrays.asList("x15", "xa", "y16", "x2a", "y11", "z", "z5", "x2b", "z");
strings.sort(naturalOrdering());
System.out.println(strings);
[x2a, x2b, x15, xa, y11, y16, z, z, z5]
I realize you're in java, but you can take a look at how StrCmpLogicalW works. It's what Explorer uses to sort filenames in Windows. You can look at the WINE implementation here.
Split the string into runs of letters and numbers, so "foo 12 bar" becomes the list ("foo", 12, "bar"), then use the list as the sort key. This way the numbers will be ordered in numerical order, not alphabetical.
Here is the solution with the following advantages over Alphanum Algorithm:
3.25x times faster (tested on the data from 'Epilogue' chapter of Alphanum description)
Does not consume extra memory (no string splitting, no numbers parsing)
Processes leading zeros correctly (e.g. "0001" equals "1", "01234" is less than "4567")
public class NumberAwareComparator implements Comparator<String>
{
#Override
public int compare(String s1, String s2)
{
int len1 = s1.length();
int len2 = s2.length();
int i1 = 0;
int i2 = 0;
while (true)
{
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (Character.isDigit(ch1) && Character.isDigit(ch2))
{
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && Character.isDigit(s1.charAt(end1)))
end1++;
while (end2 < len2 && Character.isDigit(s2.charAt(end2)))
end2++;
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
// if the lengths are different, then the longer number is bigger
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1)
{
if (s1.charAt(i1) != s2.charAt(i2))
return s1.charAt(i1) - s2.charAt(i2);
i1++;
i2++;
}
}
else
{
// plain characters comparison
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
}
Instead of reinventing the wheel, I'd suggest to use a locale-aware Unicode-compliant string comparator that has built-in number sorting from the ICU4J library.
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
public class CollatorExample {
public static void main(String[] args) {
// Make sure to choose correct locale: in Turkish uppercase of "i" is "İ", not "I"
RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.US);
collator.setNumericCollation(true); // Place "10" after "2"
collator.setStrength(Collator.PRIMARY); // Case-insensitive
List<String> strings = Arrays.asList("10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a"
);
strings.sort(collator);
System.out.println(String.join(", ", strings));
// Output: _1, _01, _2, _200, 01, 001, 1,
// 2, 02, 10, 10, 010, 20, 100, A 02, A01,
// a2, A20, t1A, t1a, t1ab, t1aB, t1Ab, t1AB,
// T010T01, T0010T01
}
}
The Alphanum algrothim is nice, but it did not match requirements for a project I'm working on. I need to be able to sort negative numbers and decimals correctly. Here is the implementation I came up. Any feedback would be much appreciated.
public class StringAsNumberComparator implements Comparator<String> {
public static final Pattern NUMBER_PATTERN = Pattern.compile("(\\-?\\d+\\.\\d+)|(\\-?\\.\\d+)|(\\-?\\d+)");
/**
* Splits strings into parts sorting each instance of a number as a number if there is
* a matching number in the other String.
*
* For example A1B, A2B, A11B, A11B1, A11B2, A11B11 will be sorted in that order instead
* of alphabetically which will sort A1B and A11B together.
*/
public int compare(String str1, String str2) {
if(str1 == str2) return 0;
else if(str1 == null) return 1;
else if(str2 == null) return -1;
List<String> split1 = split(str1);
List<String> split2 = split(str2);
int diff = 0;
for(int i = 0; diff == 0 && i < split1.size() && i < split2.size(); i++) {
String token1 = split1.get(i);
String token2 = split2.get(i);
if((NUMBER_PATTERN.matcher(token1).matches() && NUMBER_PATTERN.matcher(token2).matches()) {
diff = (int) Math.signum(Double.parseDouble(token1) - Double.parseDouble(token2));
} else {
diff = token1.compareToIgnoreCase(token2);
}
}
if(diff != 0) {
return diff;
} else {
return split1.size() - split2.size();
}
}
/**
* Splits a string into strings and number tokens.
*/
private List<String> split(String s) {
List<String> list = new ArrayList<String>();
try (Scanner scanner = new Scanner(s)) {
int index = 0;
String num = null;
while ((num = scanner.findInLine(NUMBER_PATTERN)) != null) {
int indexOfNumber = s.indexOf(num, index);
if (indexOfNumber > index) {
list.add(s.substring(index, indexOfNumber));
}
list.add(num);
index = indexOfNumber + num.length();
}
if (index < s.length()) {
list.add(s.substring(index));
}
}
return list;
}
}
PS. I wanted to use the java.lang.String.split() method and use "lookahead/lookbehind" to keep the tokens, but I could not get it to work with the regular expression I was using.
interesting problem, and here my proposed solution:
import java.util.Collections;
import java.util.Vector;
public class CompareToken implements Comparable<CompareToken>
{
int valN;
String valS;
String repr;
public String toString() {
return repr;
}
public CompareToken(String s) {
int l = 0;
char data[] = new char[s.length()];
repr = s;
valN = 0;
for (char c : s.toCharArray()) {
if(Character.isDigit(c))
valN = valN * 10 + (c - '0');
else
data[l++] = c;
}
valS = new String(data, 0, l);
}
public int compareTo(CompareToken b) {
int r = valS.compareTo(b.valS);
if (r != 0)
return r;
return valN - b.valN;
}
public static void main(String [] args) {
String [] strings = {
"aaa",
"bbb3ccc",
"bbb12ccc",
"ccc 11",
"ddd",
"eee3dddjpeg2000eee",
"eee12dddjpeg2000eee"
};
Vector<CompareToken> data = new Vector<CompareToken>();
for(String s : strings)
data.add(new CompareToken(s));
Collections.shuffle(data);
Collections.sort(data);
for (CompareToken c : data)
System.out.println ("" + c);
}
}
Prior to discovering this thread, I implemented a similar solution in javascript. Perhaps my strategy will find you well, despite different syntax. Similar to above, I parse the two strings being compared, and split them both into arrays, dividing the strings at continuous numbers.
...
var regex = /(\d+)/g,
str1Components = str1.split(regex),
str2Components = str2.split(regex),
...
I.e., 'hello22goodbye 33' => ['hello', 22, 'goodbye ', 33]; Thus, you can walk through the arrays' elements in pairs between string1 and string2, do some type coercion (such as, is this element really a number?), and compare as you walk.
Working example here: http://jsfiddle.net/F46s6/3/
Note, I currently only support integer types, though handling decimal values wouldn't be too hard of a modification.
My 2 cents.Is working well for me. I am mainly using it for filenames.
private final boolean isDigit(char ch)
{
return ch >= 48 && ch <= 57;
}
private int compareNumericalString(String s1,String s2){
int s1Counter=0;
int s2Counter=0;
while(true){
if(s1Counter>=s1.length()){
break;
}
if(s2Counter>=s2.length()){
break;
}
char currentChar1=s1.charAt(s1Counter++);
char currentChar2=s2.charAt(s2Counter++);
if(isDigit(currentChar1) &&isDigit(currentChar2)){
String digitString1=""+currentChar1;
String digitString2=""+currentChar2;
while(true){
if(s1Counter>=s1.length()){
break;
}
if(s2Counter>=s2.length()){
break;
}
if(isDigit(s1.charAt(s1Counter))){
digitString1+=s1.charAt(s1Counter);
s1Counter++;
}
if(isDigit(s2.charAt(s2Counter))){
digitString2+=s2.charAt(s2Counter);
s2Counter++;
}
if((!isDigit(s1.charAt(s1Counter))) && (!isDigit(s2.charAt(s2Counter)))){
currentChar1=s1.charAt(s1Counter);
currentChar2=s2.charAt(s2Counter);
break;
}
}
if(!digitString1.equals(digitString2)){
return Integer.parseInt(digitString1)-Integer.parseInt(digitString2);
}
}
if(currentChar1!=currentChar2){
return currentChar1-currentChar2;
}
}
return s1.compareTo(s2);
}
I created a project to compare the different implementations. It is far from complete, but it is a starting point.
Adding on to the answer made by #stanislav.
A few problems I faced while using the answer provided was:
Capital and small letters are separated by the characters between their ASCII codes. This breaks the flow when the strings being sorted have _ or other characters which are between small letters and capital letters in ASCII.
If two strings are the same except for the leading zeroes count being different, the function returns 0 which will make the sort depend on the original positions of the string in the list.
These two issues have been fixed in the new code. And I made a few function instead of a few repetitive set of code. The differentCaseCompared variable keeps track of whether if two strings are the same except for the cases being different. If so the value of the first different case characters subtracted is returned. This is done to avoid the issue of having two strings differing by case returned as 0.
public class NaturalSortingComparator implements Comparator<String> {
#Override
public int compare(String string1, String string2) {
int lengthOfString1 = string1.length();
int lengthOfString2 = string2.length();
int iteratorOfString1 = 0;
int iteratorOfString2 = 0;
int differentCaseCompared = 0;
while (true) {
if (iteratorOfString1 == lengthOfString1) {
if (iteratorOfString2 == lengthOfString2) {
if (lengthOfString1 == lengthOfString2) {
// If both strings are the same except for the different cases, the differentCaseCompared will be returned
return differentCaseCompared;
}
//If the characters are the same at the point, returns the difference between length of the strings
else {
return lengthOfString1 - lengthOfString2;
}
}
//If String2 is bigger than String1
else
return -1;
}
//Check if String1 is bigger than string2
if (iteratorOfString2 == lengthOfString2) {
return 1;
}
char ch1 = string1.charAt(iteratorOfString1);
char ch2 = string2.charAt(iteratorOfString2);
if (Character.isDigit(ch1) && Character.isDigit(ch2)) {
// skip leading zeros
iteratorOfString1 = skipLeadingZeroes(string1, lengthOfString1, iteratorOfString1);
iteratorOfString2 = skipLeadingZeroes(string2, lengthOfString2, iteratorOfString2);
// find the ends of the numbers
int endPositionOfNumbersInString1 = findEndPositionOfNumber(string1, lengthOfString1, iteratorOfString1);
int endPositionOfNumbersInString2 = findEndPositionOfNumber(string2, lengthOfString2, iteratorOfString2);
int lengthOfDigitsInString1 = endPositionOfNumbersInString1 - iteratorOfString1;
int lengthOfDigitsInString2 = endPositionOfNumbersInString2 - iteratorOfString2;
// if the lengths are different, then the longer number is bigger
if (lengthOfDigitsInString1 != lengthOfDigitsInString2)
return lengthOfDigitsInString1 - lengthOfDigitsInString2;
// compare numbers digit by digit
while (iteratorOfString1 < endPositionOfNumbersInString1) {
if (string1.charAt(iteratorOfString1) != string2.charAt(iteratorOfString2))
return string1.charAt(iteratorOfString1) - string2.charAt(iteratorOfString2);
iteratorOfString1++;
iteratorOfString2++;
}
} else {
// plain characters comparison
if (ch1 != ch2) {
if (!ignoreCharacterCaseEquals(ch1, ch2))
return Character.toLowerCase(ch1) - Character.toLowerCase(ch2);
// Set a differentCaseCompared if the characters being compared are different case.
// Should be done only once, hence the check with 0
if (differentCaseCompared == 0) {
differentCaseCompared = ch1 - ch2;
}
}
iteratorOfString1++;
iteratorOfString2++;
}
}
}
private boolean ignoreCharacterCaseEquals(char character1, char character2) {
return Character.toLowerCase(character1) == Character.toLowerCase(character2);
}
private int findEndPositionOfNumber(String string, int lengthOfString, int end) {
while (end < lengthOfString && Character.isDigit(string.charAt(end)))
end++;
return end;
}
private int skipLeadingZeroes(String string, int lengthOfString, int iteratorOfString) {
while (iteratorOfString < lengthOfString && string.charAt(iteratorOfString) == '0')
iteratorOfString++;
return iteratorOfString;
}
}
The following is a unit test I used.
public class NaturalSortingComparatorTest {
private int NUMBER_OF_TEST_CASES = 100000;
#Test
public void compare() {
NaturalSortingComparator naturalSortingComparator = new NaturalSortingComparator();
List<String> expectedStringList = getCorrectStringList();
List<String> testListOfStrings = createTestListOfStrings();
runTestCases(expectedStringList, testListOfStrings, NUMBER_OF_TEST_CASES, naturalSortingComparator);
}
private void runTestCases(List<String> expectedStringList, List<String> testListOfStrings,
int numberOfTestCases, Comparator<String> comparator) {
for (int testCase = 0; testCase < numberOfTestCases; testCase++) {
Collections.shuffle(testListOfStrings);
testListOfStrings.sort(comparator);
Assert.assertEquals(expectedStringList, testListOfStrings);
}
}
private List<String> getCorrectStringList() {
return Arrays.asList(
"1", "01", "001", "2", "02", "10", "10", "010",
"20", "100", "_1", "_01", "_2", "_200", "A 02",
"A01", "a2", "A20", "t1A", "t1a", "t1AB", "t1Ab",
"t1aB", "t1ab", "T010T01", "T0010T01");
}
private List<String> createTestListOfStrings() {
return Arrays.asList(
"10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a");
}
}
Suggestions welcome! I am not sure whether adding the functions changes anything other than the readability part of things.
P.S: Sorry to add another answer to this question. But I don't have enough reps to comment on the answer which I modified for my use.
modification of this answer
case insensitive order (1000a is less than 1000X)
nulls handling
implementation:
import static java.lang.Math.pow;
import java.util.Comparator;
public class AlphanumComparator implements Comparator<String> {
public static final AlphanumComparator ALPHANUM_COMPARATOR = new AlphanumComparator();
private static char[] upperCaseCache = new char[(int) pow(2, 16)];
private boolean nullIsLess;
public AlphanumComparator() {
}
public AlphanumComparator(boolean nullIsLess) {
this.nullIsLess = nullIsLess;
}
#Override
public int compare(String s1, String s2) {
if (s1 == s2)
return 0;
if (s1 == null)
return nullIsLess ? -1 : 1;
if (s2 == null)
return nullIsLess ? 1 : -1;
int i1 = 0;
int i2 = 0;
int len1 = s1.length();
int len2 = s2.length();
while (true) {
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (isDigit(ch1) && isDigit(ch2)) {
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && isDigit(s1.charAt(end1)))
end1++;
while (end2 != len2 && isDigit(s2.charAt(end2)))
end2++;
// if the lengths are different, then the longer number is bigger
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1) {
ch1 = s1.charAt(i1);
ch2 = s2.charAt(i2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
} else {
ch1 = toUpperCase(ch1);
ch2 = toUpperCase(ch2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
private boolean isDigit(char ch) {
return ch >= 48 && ch <= 57;
}
private char toUpperCase(char ch) {
char cached = upperCaseCache[ch];
if (cached == 0) {
cached = Character.toUpperCase(ch);
upperCaseCache[ch] = cached;
}
return cached;
}
}
I think you'll have to do the comparison on a character-by-character fashion. Grab a character, if it's a number character, keep grabbing, then reassemble to characters into a single number string and convert it into an int. Repeat on the other string, and only then do the comparison.
Short answer: based on the context, I can't tell whether this is just some quick-and-dirty code for personal use, or a key part of Goldman Sachs' latest internal accounting software, so I'll open by saying: eww. That's a rather funky sorting algorithm; try to use something a bit less "twisty" if you can.
Long answer:
The two issues that immediately come to mind in your case are performance, and correctness. Informally, make sure it's fast, and make sure your algorithm is a total ordering.
(Of course, if you're not sorting more than about 100 items, you can probably disregard this paragraph.) Performance matters, as the speed of the comparator will be the largest factor in the speed of your sort (assuming the sort algorithm is "ideal" to the typical list). In your case, the comparator's speed will depend mainly on the size of the string. The strings seem to be fairly short, so they probably won't dominate as much as the size of your list.
Turning each string into a string-number-string tuple and then sorting this list of tuples, as suggested in another answer, will fail in some of your cases, since you apparently will have strings with multiple numbers appearing.
The other problem is correctness. Specifically, if the algorithm you described will ever permit A > B > ... > A, then your sort will be non-deterministic. In your case, I fear that it might, though I can't prove it. Consider some parsing cases such as:
aa 0 aa
aa 23aa
aa 2a3aa
aa 113aa
aa 113 aa
a 1-2 a
a 13 a
a 12 a
a 2-3 a
a 21 a
a 2.3 a
Although the question asked a java solution, for anyone who wants a scala solution:
object Alphanum {
private[this] val regex = "((?<=[0-9])(?=[^0-9]))|((?<=[^0-9])(?=[0-9]))"
private[this] val alphaNum: Ordering[String] = Ordering.fromLessThan((ss1: String, ss2: String) => (ss1, ss2) match {
case (sss1, sss2) if sss1.matches("[0-9]+") && sss2.matches("[0-9]+") => sss1.toLong < sss2.toLong
case (sss1, sss2) => sss1 < sss2
})
def ordering: Ordering[String] = Ordering.fromLessThan((s1: String, s2: String) => {
import Ordering.Implicits.infixOrderingOps
implicit val ord: Ordering[List[String]] = Ordering.Implicits.seqDerivedOrdering(alphaNum)
s1.split(regex).toList < s2.split(regex).toList
})
}
My problem was that I have lists consisting of a combination of alpha numeric strings (eg C22, C3, C5 etc), alpha strings (eg A, H, R etc) and just digits (eg 99, 45 etc) that need sorting in the order A, C3, C5, C22, H, R, 45, 99. I also have duplicates that need removing so I only get a single entry.
I'm also not just working with Strings, I'm ordering an Object and using a specific field within the Object to get the correct order.
A solution that seems to work for me is :
SortedSet<Code> codeSet;
codeSet = new TreeSet<Code>(new Comparator<Code>() {
private boolean isThereAnyNumber(String a, String b) {
return isNumber(a) || isNumber(b);
}
private boolean isNumber(String s) {
return s.matches("[-+]?\\d*\\.?\\d+");
}
private String extractChars(String s) {
String chars = s.replaceAll("\\d", "");
return chars;
}
private int extractInt(String s) {
String num = s.replaceAll("\\D", "");
return num.isEmpty() ? 0 : Integer.parseInt(num);
}
private int compareStrings(String o1, String o2) {
if (!extractChars(o1).equals(extractChars(o2))) {
return o1.compareTo(o2);
} else
return extractInt(o1) - extractInt(o2);
}
#Override
public int compare(Code a, Code b) {
return isThereAnyNumber(a.getPrimaryCode(), b.getPrimaryCode())
? isNumber(a.getPrimaryCode()) ? 1 : -1
: compareStrings(a.getPrimaryCode(), b.getPrimaryCode());
}
});
It 'borrows' some code that I found here on Stackoverflow plus some tweaks of my own to get it working just how I needed it too.
Due to trying to order Objects, needing a comparator as well as duplicate removal, one negative fudge I had to employ was I first have to write my Objects to a TreeMap before writing them to a Treeset. It may impact performance a little but given that the lists will be a max of about 80 Codes, it shouldn't be a problem.
I had a similar problem where my strings had space-separated segments inside. I solved it in this way:
public class StringWithNumberComparator implements Comparator<MyClass> {
#Override
public int compare(MyClass o1, MyClass o2) {
if (o1.getStringToCompare().equals(o2.getStringToCompare())) {
return 0;
}
String[] first = o1.getStringToCompare().split(" ");
String[] second = o2.getStringToCompare().split(" ");
if (first.length == second.length) {
for (int i = 0; i < first.length; i++) {
int segmentCompare = StringUtils.compare(first[i], second[i]);
if (StringUtils.isNumeric(first[i]) && StringUtils.isNumeric(second[i])) {
segmentCompare = NumberUtils.compare(Integer.valueOf(first[i]), Integer.valueOf(second[i]));
if (0 != segmentCompare) {
// return only if uneven numbers in case there are more segments to be checked
return segmentCompare;
}
}
if (0 != segmentCompare) {
return segmentCompare;
}
}
} else {
return StringUtils.compare(o1.getDenominazione(), o2.getDenominazione());
}
return 0;
}
As you can see I have used Apaches StringUtils.compare() and NumberUtils.compere() as a standard help.
In your given example, the numbers you want to compare have spaces around them while the other numbers do not, so why would a regular expression not work?
bbb 12 ccc
vs.
eee 12 ddd jpeg2000 eee
If you're writing a comparator class, you should implement your own compare method that will compare two strings character by character. This compare method should check if you're dealing with alphabetic characters, numeric characters, or mixed types (including spaces). You'll have to define how you want a mixed type to act, whether numbers come before alphabetic characters or after, and where spaces fit in etc.
On Linux glibc provides strverscmp(), it's also available from gnulib for portability. However truly "human" sorting has lots of other quirks like "The Beatles" being sorted as "Beatles, The". There is no simple solution to this generic problem.
Encoding format: introduce * to indicate "repeat from beginning". Example. Input-{a,b,a,b,c,a,b,a,b,c,d} can be written as {a , b, * ,c, * , d}. Output:5; E.g 2: ABCABCE, output- 5.
Here * means repeat from beginning. For example if given String is ABCABCABCABC , it will return ABC**, another example is if String is ABCABCABC, it will return ABC*ABC.
I have the below code but this code assumes that the string will contain the repetitive pattern only and no other characters, I want to modify it to check :
1. Which pattern is repeating
2. Ignore non repeating patterns
2. encode that pattern according to the problem statement
import java.util.Scanner;
public class Magicpotion {
public static void main(String args[]) {
Scanner sc = new Scanner(System.in);
System.out.println("Enter the string:");
String str = sc.nextLine();
int len = str.length();
if (len != 0) {
int lenby3 = len / 3;
int starcount = ( int).(Math.log(lenby3) / Math.log(2));
int leftstring = (lenby3 - (int) Math.pow(2, starcount));
int resultlen = (1 * 3) + starcount + (leftstring * 3);
System.out.println("ResultLength: " + resultlen);
System.out.print("ABC");
for (int i = 0; i < starcount; i++) {
System.out.print("*");
}
for (int i = 0; i < leftstring; i++) {
System.out.print("ABC");
}
} else
System.out.println("ResultLength: " + 0);
}
}
Here my assumption is that ABC will always be repeating pattern , hence I have divided the length by 3. I want to generalise it such that I find the repeating pattern which can be a AB or BC or ABCD and proceed accordingly.
This looks like homework. So instead of a full solution just some hints:
You can process the input string character by character and encode as you go. If you have at some point already read k characters and the next k characters are exactly the same, output a * and advance to position 2k.
Otherwise, output the next input character and advance position to k+1.
As mentioned by dyukha this algorithm does not always result in the shortest possible encoding. If this is required some more effort has to be put into the search.
This problem can be solved using dynamic programming.
Assume that you processed your stay at some position i. You want to understand what it the minimal length of encoding of str[0..i]. Let's call it ans[i]. You have two options:
Just add i-th character to the encoding. So the length is ans[i-1] + 1.
You may write *, when possible. In this case the length is ans[i / 2] + 1 or something like this.
The final length is in ans[n-1]. You can store how you obtained ans[i] to recover the encoding itself.
Checking whether you can write * can be optimized, using some hashing (to obtain O(n) solution instead of O(n^2)).
The difference with Henry's solution is that he always applies * when it's possible. It's not clear to me that it results into the minimal length (if I understood correctly, aaaaaa is a counterexample), so I'm giving a solution I'm sure about.
/**
* #author mohamed ali
* https://www.linkedin.com/in/oo0shaheen0oo/
*/
public class Magic_potion_encoding
{
private static int minimalSteps( String ingredients )
{
StringBuilder sb = new StringBuilder(ingredients);
for(int i =0;i<sb.length();i++)
{
char startChar = sb.charAt(i);
int walkingIndex1=i;
int startIndex2 =sb.toString().indexOf(startChar,i+1);
int walkingIndex2=startIndex2;
while(walkingIndex2 !=-1 && walkingIndex2<sb.length() && sb.charAt(walkingIndex1) == sb.charAt(walkingIndex2) )
{
if(walkingIndex1+1==startIndex2)
{
String subStringToBeEncoded = sb.substring(i,walkingIndex2+1);//substring the string found and the original "substring does not include the last index hence the +1
int matchStartIndex = sb.indexOf(subStringToBeEncoded,walkingIndex2+1);// look for first match for the whole string matched
int matchEndeIndex= matchStartIndex+subStringToBeEncoded.length();
int origStartIndex=i;
int origEndIndex = i+subStringToBeEncoded.length();
if (matchStartIndex!=-1 )
{
if(origEndIndex==matchStartIndex)
{
sb.replace(matchStartIndex,matchEndeIndex,"*");
}
else
{
while(matchStartIndex!=-1 && matchEndeIndex<sb.length() && sb.charAt(origEndIndex) == sb.charAt(matchEndeIndex) )
{
if(origEndIndex==matchStartIndex-1)// if the index of the 2 strings are right behind one another
{
sb.replace(matchStartIndex,matchEndeIndex+1,"*");
}
else
{
origEndIndex++;
matchEndeIndex++;
}
}
}
}
sb.replace(startIndex2,walkingIndex2+1,"*");
break;
}
walkingIndex1++;
walkingIndex2++;
}
}
System.out.println("orig= " + ingredients + " encoded = " + sb);
return sb.length();
}
public static void main( String[] args )
{
if ( minimalSteps("ABCABCE") == 5 &&
minimalSteps("ABCABCEA") == 6 &&
minimalSteps("abbbbabbbb") == 5 &&
minimalSteps("abcde") == 5 &&
minimalSteps("abcbcbcbcd") == 6 &&
minimalSteps("ababcababce") == 6 &&
minimalSteps("ababababxx") == 6 &&
minimalSteps("aabbccbbccaabbccbbcc") == 8)
{
System.out.println( "Pass" );
}
else
{
System.out.println( "Fail" );
}
}
}
Given that the repetitions are from the beginning, every such repeating substring will have the very first character of the given string. [Every repetition needs to be represented by a "star". (i.e ABCABCABC ans = ABC** ) . If all sequential repetitions are to be represented with one "star". (i.e ABCABCABC and = ABC* ), a slight modification to (2) will do the thing (i.e remove the if case where the just a star is added)]
Divide the given string to substrings based on the first character.
Eg. Given String = "ABABCABD"
Sub Strings = {"AB", "ABC", "AB", "ABD"}
Just traverse through the list of substrings and get the required result. I've used a map here, to make the search easy.
Just a rough write up.
SS = {"AB", "ABC", "AB", "ABD"};
result = SS[0];
Map<string, bool> map;
map.put(SS[0],true);
for (i = 1; i < SS.length; i++){
if (map.hasKey(SS[i])){
result += "*";
}
else {
res = nonRepeatingPart(SS[i], map);
result += "*" + res;
map.put(SS[i], true);
}
}
String nonRepeatingPart(str, map){
for (j = str.length-1; j >= 0; j--){
if (map.hasKey(str.subString(0, j))){
return str.subString(j, str.length-1);
}
}
return throwException("Wrong Input");
}
string getCompressed(string str){
string res;
res += str[0];
int i=1;
while(i<str.size()){
//check if current char is the first char in res
char curr = str[i];
if(res[0]==curr){
if(str.substr(0,i)==str.substr(i,i)){
res += '*';
i+=i; continue;
}else{
res += curr;
i++; continue;
}
}else {
res += curr;
i++; continue;
}
}
return res;
}
int main()
{
string s = "ABCABCABC";
string res = getCompressed(s);
cout<<res.size();
return 0;
}
I'm just starting out and I'm completely lost on how to do this.
I want to be able to check a string for a smaller string and return true if the string contains the letters of the string in order.
I'm not sure how to go about making sure the letters of the second string are in order, even if there are other letters between them.
An example would be that "chemistry" would return true for the string "hit".
It would return false for the string "him" though.
Any help would be greatly appreciated.
EDIT: Thank you, I changed the word "substring" to string. As I said, I'm just beginning and wasn't aware that meant something else. I really appreciate all the help. It should get me moving in the right direction.
The general approach is to iterate over the characters of the longer string ("chemistry"), always keeping track of the index of the next required character from the shorter string ("hit" — first 0, then 1 once you find h, then 2 once you find i, and then when you find t you're done). For example:
public static boolean containsSubsequence(
final String sequence, final String subsequence) {
if (subsequence.isEmpty()) {
return true;
}
int subsequenceIndex = 0;
for (int i = 0; i < sequence.length(); ++i) {
if (sequence.charAt(i) == subsequence.charAt(subsequenceIndex)) {
++subsequenceIndex;
if (subsequenceIndex == subsequence.length()) {
return true;
}
}
}
return false;
}
SInce you haven't posted any code, I will just explain what would I do.
Iterate over all the substring letter by letter, on your example "hit"
Check if the current letter (iterate 0 is h) is on the string, if/when it find you remove the ocurrences before it and let the string from that (emistry)
Do this process for all left substrings
use a control boolean variable to see if it has found or not.
if in any pass of the iteration you did not find you return false.
Well, you could go through both strings simultaneously, advancing your index to the "substring" (the correct term is subsequence - "mist" is a substring of "chemistry", but "hit" is only a subsequence) string only if its current character matches the current character in the outer string. I.e., for "chemistry" and "hit", you start with indices i = 0, j = 0. You increase the index i into the first string until you encounter s1.charAt(i) == s2.charAt(j), which is the case for i = 1 (second character in chemistry is h). Then you increase j, and are now increasing i again until you hit an "i" (i = 4). The second string is contained as a subsequence in the first if at the end, j == s2.length() holds. Note that here - unlike for more complex problems, such as testing if the second string really is a substring - a greedy strategy works, meaning you don't have to worry about which of multiple occurrences of the same character you match against the current in one in the second string; you can always "greedily" choose the first one you see.
Alternatively, you can use regexes: convert the second (search) string into the regex pattern String pat = ".*h.*i.*t.*", and test s1.matches(pat).
You could do the following (not sure how efficient it is):
Consider your search string "hit" as an array of char: ["h","i","t"]
Use indexOf(c) to determine whether the first character in the array can be found.
Repeat the search on the remaining substring.
Here's the code:
public class SearchString {
public static void main(String[] args) {
String searchSpace = "this is where to search?";
String needle = "tweus?";
char[] chars = needle.toCharArray();
int index = 0;
boolean found = true;
int startIndex = 0;
while (found && index < chars.length){
searchSpace = searchSpace.substring(startIndex);
startIndex = searchSpace.indexOf(chars[index]);
found = (startIndex != -1);
index++;
}
if (index==chars.length && found){
System.out.println("Found it");
} else {
System.out.println("Nothing here");
}
}
}
A slightly modified answer based on #ruakh's solution:
public static boolean containsSubsequence(final String sequence, final String subsequence) {
if (Objects.requireNonNull(sequence).isEmpty() || Objects.requireNonNull(subsequence).isEmpty() || subsequence.length() > sequence.length()) {
return false;
}
int index = 0;
for (int i = 0; i < sequence.length(); i++) {
if (sequence.charAt(i) == subsequence.charAt(index) && ++index == subsequence.length()) {
return true;
}
}
return false;
}
Objects.requireNonNull() is from Java 7, remember to substitute for something similar (from Apache Commons's StringUtils?) if you're not on Java 7. The validation assumes returning false is suitable for an empty sequence or subsequence, or you may want to consider throwing something like IllegalArgumentException.
The two if statements have been combined into a single clause for compactness.
edit:
If one is mathematically inclined, or following #ruakh's original solution, any sequence should contain an empty subsequence. The only reason why my code above is doing it differently is because I prefer to imagine an empty argument as a form of invalid argument, thus returning false. It really depends on how this method is used, and how 'severe' an empty argument is.
I know this was asked as a Java question. But just for reference, here is a version of it in C.\
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
int find_str_in_str(const char* const base, const char* const sub)
{
int base_len = strlen(base);
int sub_len = strlen(sub);
char *tmp_sub = NULL;
/* allocate enough mem for the max string length */
if(base_len > sub_len) {
tmp_sub = malloc(base_len + 1);
} else {
tmp_sub = malloc(sub_len + 1);
}
if(NULL == tmp_sub) {
fprintf(stderr, "Runtime error (malloc)\n");
exit(1);
}
int i = 0;
int j = 0;
for(; i < sub_len; i++) {
for(; j < base_len; j++) {
if(base[j] == sub[i]) {
tmp_sub[i] = base[j];
/* the first occurance was found */
break;
}
}
}
tmp_sub[i++] = '\0';
if(0 == strcmp(sub, tmp_sub)) {
free(tmp_sub);
return 1;
} else {
free(tmp_sub);
return 0;
}
}
int main(int argc, char **argv)
{
if(argc < 3) {
fprintf(stderr, "Usage: %s %s %s\n", argv[0], "base", "derived");
return EXIT_FAILURE;
}
if(1 == find_str_in_str(argv[1], argv[2])) {
printf("true\n");
} else {
printf("false\n");
}
return EXIT_SUCCESS;
}
to compile: gcc -Wall -Wextra main.c -o main
main self elf
main chemistry try
main chemistry hit
main chemistry tim
So I am trying to figure out if two strings when combined together are a substring of a permutation of another string.
I have what I believe to be a working solution but it is failing some of the JUnit test cases and I dont have access to the ones that it is failing on.
here is my code with one test case
String a="tommarvoloriddle";
String b="lord";
String c="voldemort";
String b= b+c;
char[] w= a.toCharArray();
char[] k= b.toCharArray();
Arrays.sort(k);
Arrays.sort(w);
pw.println(isPermuation(w,k)?"YES":"NO");
static boolean isPermuation(char[] w, char[] k)
{
boolean found=false;
for(int i=0; i<k.length; i++)
{
for(int j=i; j<w.length; j++)
{
if(k[i]==w[j])
{
j=w.length;
found=true;
}
else
found=false;
}
}
return found;
}
any help getting this to always produce the correct answer would be awesome and help making it more efficient would be great too
What you have is not a working solution. However, you don't explain why you thought it might be, so it's hard to figure out what you intended. I will point out that your code updates found unconditionally for each inner loop, so isPermutation() will always return the result of the last comparison (which is certainly not what you want).
You did the right thing in sorting the two arrays in the first place -- this is a classic step which should allow you to efficiently evaluate them in one pass. But then, instead of a single pass, you use a nested loop -- what did you intend here?
A single pass implementation might be something like:
static boolean isPermutation(char[] w, char[] k) {
int k_idx=0;
for(w_idx=0; w_idx < w.length; ++w_idx) {
if(k_idx == k.length)
return true; // all characters in k are present in w
if( w[w_idx] > k[k_idx] )
return false; // found character in k not present in w
if( w[w_idx] == k[k_idx] )
++k_idx; // character from k corresponds to character from w
}
// any remaining characters in k are not present in w
return k_idx == k.length;
}
So we are only interested in whether the two combined strings are a subset of a permutation of another string, meaning that the lengths can in fact differ. So let's say we have:
String a = "tommarvoloriddle";
String b = "lord";
String c = "voldemort";
char[] master = a.ToCharArray();
char[] combined = (b + c).ToCharArray();
Arrays.Sort(master);
Arrays.Sort(combined);
System.out.println(IsPermutation(master, combined) ? "YES" : "NO");
Then our method is:
static boolean IsPermutation(char[] masterString, char[] combinedString)
{
int combinedStringIndex = 0;
int charsFound = 0;
int result = 0;
for (int i = 0; i < masterString.Length; ++i) {
result = combinedString[combinedStringIndex].CompareTo(masterString[i]);
if (result == 0) {
charsFound++;
combinedStringIndex++;
}
else if (result < 0) {
return false;
}
}
return (charsFound == combinedString.Length);
}
What the above method does: it starts comparing characters of the two strings. If we have a mismatch, that is, the character at the current masterString index does not match the character at the current combinedString index, then we simply look at the next character of masterString and see if that matches. At the end, we tally the total number of characters matched from our combinedString, and, if they are equal to the total number of characters in combinedString (its length), then we have established that it is indeed a permutation of masterString. If at any point, the current character in masterString is numerically greater than the current character in combinedString then it means that we will never be able to match the current character, so we give up. Hope that helps.
If two Strings are a permuation of the other you should be able to do this
public static boolean isPermuted(Strign s1, String s2) {
if (s1.length() != s2.length()) return false;
char[] chars1 = s1.toCharArray();
char[] chars2 = s2.toCharArray();
Arrays.sort(chars1);
Arrays.sort(chars2);
return Arrays.equals(chars1, chars2);
}
This means that when sorted the characters are the same, in the same number.