I need to write a Java Comparator class that compares Strings, with one twist. If the two strings being compared are the same at the beginning and at the end, and the middle part that differs is an integer, then compare based on the numeric values of those integers. For example, I want the following strings to end up in the order they're shown:
aaa
bbb 3 ccc
bbb 12 ccc
ccc 11
ddd
eee 3 ddd jpeg2000 eee
eee 12 ddd jpeg2000 eee
As you can see, there might be other integers in the string, so I can't just use regular expressions to break out any integer. I'm thinking of just walking the strings from the beginning until I find a bit that doesn't match, then walking in from the end until I find a bit that doesn't match, and then comparing the bit in the middle to the regular expression "[0-9]+", and if it matches, then doing a numeric comparison, otherwise doing a lexical comparison.
Is there a better way?
Update: I don't think I can guarantee that the other numbers in the string, the ones that may match, don't have spaces around them, or that the ones that differ do have spaces.
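For reference, the prefix/suffix idea described in the question can be sketched roughly as below. This is only a sketch under assumptions: the class name MiddleIntegerComparator is made up for illustration, digits are taken to be ASCII 0-9, and the middle is widened to whole digit runs so that "12" vs "2" is not reduced to "1" vs "" by the shared trailing digit. BigInteger is used only to avoid overflow on long digit runs.

```java
import java.math.BigInteger;
import java.util.Comparator;

// Hypothetical name; a sketch of the prefix/suffix approach from the question.
final class MiddleIntegerComparator implements Comparator<String> {

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    @Override
    public int compare(String a, String b) {
        // Walk forward over the common prefix.
        int limit = Math.min(a.length(), b.length());
        int start = 0;
        while (start < limit && a.charAt(start) == b.charAt(start)) {
            start++;
        }
        // Walk backward over the common suffix without crossing the prefix.
        int endA = a.length();
        int endB = b.length();
        while (endA > start && endB > start
                && a.charAt(endA - 1) == b.charAt(endB - 1)) {
            endA--;
            endB--;
        }
        // Widen the middle so it covers whole digit runs on both sides.
        while (start > 0 && isAsciiDigit(a.charAt(start - 1))) {
            start--;
        }
        while (endA < a.length() && isAsciiDigit(a.charAt(endA))) {
            endA++;
            endB++;
        }
        String midA = a.substring(start, endA);
        String midB = b.substring(start, endB);
        if (midA.matches("[0-9]+") && midB.matches("[0-9]+")) {
            int byValue = new BigInteger(midA).compareTo(new BigInteger(midB));
            if (byValue != 0) {
                return byValue;
            }
        }
        // Lexical fallback, also used when the numbers are equal ("01" vs "1").
        return a.compareTo(b);
    }
}
```

With the sample list from the question this puts "bbb 3 ccc" before "bbb 12 ccc" and leaves the other strings in plain lexical order. Whether this pairwise rule stays a consistent ordering for arbitrary inputs is not established by the sketch.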
The Alphanum Algorithm
From the website
"People sort strings with numbers differently than software. Most sorting algorithms compare ASCII values, which produces an ordering that is inconsistent with human logic. Here's how to fix it."
Edit: Here's a link to the Java Comparator Implementation from that site.
Interesting little challenge, I enjoyed solving it.
Here is my take at the problem:
String[] strs =
{
"eee 5 ddd jpeg2001 eee",
"eee 123 ddd jpeg2000 eee",
"ddd",
"aaa 5 yy 6",
"ccc 555",
"bbb 3 ccc",
"bbb 9 a",
"",
"eee 4 ddd jpeg2001 eee",
"ccc 11",
"bbb 12 ccc",
"aaa 5 yy 22",
"aaa",
"eee 3 ddd jpeg2000 eee",
"ccc 5",
};
Pattern splitter = Pattern.compile("(\\d+|\\D+)");
public class InternalNumberComparator implements Comparator
{
public int compare(Object o1, Object o2)
{
// I deliberately use the Java 1.4 syntax,
// all this can be improved with 1.5's generics
String s1 = (String)o1, s2 = (String)o2;
// We split each string as runs of number/non-number strings
ArrayList sa1 = split(s1);
ArrayList sa2 = split(s2);
// Nothing or different structure
if (sa1.size() == 0 || sa1.size() != sa2.size())
{
// Just compare the original strings
return s1.compareTo(s2);
}
int i = 0;
String si1 = "";
String si2 = "";
// Compare beginning of string
for (; i < sa1.size(); i++)
{
si1 = (String)sa1.get(i);
si2 = (String)sa2.get(i);
if (!si1.equals(si2))
break; // Until we find a difference
}
// No difference found?
if (i == sa1.size())
return 0; // Same strings!
// Try to convert the different run of characters to number
int val1, val2;
try
{
val1 = Integer.parseInt(si1);
val2 = Integer.parseInt(si2);
}
catch (NumberFormatException e)
{
return s1.compareTo(s2); // Strings differ on a non-number
}
// Compare remainder of string
for (i++; i < sa1.size(); i++)
{
si1 = (String)sa1.get(i);
si2 = (String)sa2.get(i);
if (!si1.equals(si2))
{
return s1.compareTo(s2); // Strings differ
}
}
// Here, the strings differ only on a number
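// Equal values (e.g. "07" vs "7") fall back to string order so the result stays symmetric
return val1 != val2 ? (val1 < val2 ? -1 : 1) : s1.compareTo(s2);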
}
ArrayList split(String s)
{
ArrayList r = new ArrayList();
Matcher matcher = splitter.matcher(s);
while (matcher.find())
{
String m = matcher.group(1);
r.add(m);
}
return r;
}
}
Arrays.sort(strs, new InternalNumberComparator());
This algorithm needs much more testing, but it seems to behave rather nicely.
[EDIT] I added some more comments to be clearer. I see there are many more answers than when I started to code this... But I hope I provided a good starting base and/or some ideas.
Ian Griffiths of Microsoft has a C# implementation he calls Natural Sorting. Porting to Java should be fairly easy, easier than from C anyway!
UPDATE: There seems to be a Java example on eekboom that does this; see "compareNatural" and use that as your comparator for sorting.
The implementation I propose here is simple and efficient. It does not allocate any extra memory, directly or indirectly by using regular expressions or methods such as substring(), split(), toCharArray(), etc.
This implementation first goes across both strings to search for the first characters that are different, at maximal speed, without doing any special processing during this. Specific number comparison is triggered only when these characters are both digits.
public static final int compareNatural (String s1, String s2)
{
// Skip all identical characters
int len1 = s1.length();
int len2 = s2.length();
int i;
char c1, c2;
for (i = 0, c1 = 0, c2 = 0; (i < len1) && (i < len2) && (c1 = s1.charAt(i)) == (c2 = s2.charAt(i)); i++);
// Check end of string
if (c1 == c2)
return(len1 - len2);
// Check digit in first string
if (Character.isDigit(c1))
{
// Check digit only in first string
if (!Character.isDigit(c2))
return(1);
// Scan all integer digits
int x1, x2;
for (x1 = i + 1; (x1 < len1) && Character.isDigit(s1.charAt(x1)); x1++);
for (x2 = i + 1; (x2 < len2) && Character.isDigit(s2.charAt(x2)); x2++);
// Longer integer wins, first digit otherwise
return(x2 == x1 ? c1 - c2 : x1 - x2);
}
// Check digit only in second string
if (Character.isDigit(c2))
return(-1);
// No digits
return(c1 - c2);
}
I came up with a quite simple implementation in Java using regular expressions:
public static Comparator<String> naturalOrdering() {
final Pattern compile = Pattern.compile("(\\d+)|(\\D+)");
return (s1, s2) -> {
final Matcher matcher1 = compile.matcher(s1);
final Matcher matcher2 = compile.matcher(s2);
while (true) {
final boolean found1 = matcher1.find();
final boolean found2 = matcher2.find();
if (!found1 || !found2) {
return Boolean.compare(found1, found2);
} else if (!matcher1.group().equals(matcher2.group())) {
if (matcher1.group(1) == null || matcher2.group(1) == null) {
return matcher1.group().compareTo(matcher2.group());
} else {
return Integer.valueOf(matcher1.group(1)).compareTo(Integer.valueOf(matcher2.group(1)));
}
}
}
};
}
Here is how it works:
final List<String> strings = Arrays.asList("x15", "xa", "y16", "x2a", "y11", "z", "z5", "x2b", "z");
strings.sort(naturalOrdering());
System.out.println(strings);
[x2a, x2b, x15, xa, y11, y16, z, z, z5]
I realize you're in Java, but you can take a look at how StrCmpLogicalW works. It's what Explorer uses to sort filenames in Windows. You can look at the WINE implementation here.
Split the string into runs of letters and numbers, so "foo 12 bar" becomes the list ("foo", 12, "bar"), then use the list as the sort key. This way the numbers will be ordered in numerical order, not alphabetical.
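A rough Java sketch of that idea follows. The names (RunKeyComparator, runs) are invented for illustration, digits are assumed to be ASCII 0-9, and unlike the ("foo", 12, "bar") description the runs keep their spaces. For brevity it splits on every comparison; a real sort key would be computed once per string.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares strings by their runs of digits / non-digits.
final class RunKeyComparator implements Comparator<String> {
    private static final Pattern RUN = Pattern.compile("[0-9]+|[^0-9]+");

    // "foo 12 bar" -> ["foo ", "12", " bar"]
    static List<String> runs(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = RUN.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Runs are never empty, and a run is all digits or all non-digits.
    private static boolean isNumber(String run) {
        char c = run.charAt(0);
        return c >= '0' && c <= '9';
    }

    @Override
    public int compare(String a, String b) {
        List<String> ka = runs(a);
        List<String> kb = runs(b);
        int n = Math.min(ka.size(), kb.size());
        for (int i = 0; i < n; i++) {
            String x = ka.get(i);
            String y = kb.get(i);
            // Two numeric runs compare by value, anything else as text.
            int c = (isNumber(x) && isNumber(y))
                    ? new BigInteger(x).compareTo(new BigInteger(y))
                    : x.compareTo(y);
            if (c != 0) {
                return c;
            }
        }
        // All shared runs are equal: fewer runs first, then plain string order.
        int bySize = Integer.compare(ka.size(), kb.size());
        return bySize != 0 ? bySize : a.compareTo(b);
    }
}
```

Sorting "foo 12 bar", "foo 2 baz" and "foo 2 bar" with this comparator gives "foo 2 bar", "foo 2 baz", "foo 12 bar".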
Here is the solution with the following advantages over Alphanum Algorithm:
3.25 times faster (tested on the data from 'Epilogue' chapter of Alphanum description)
Does not consume extra memory (no string splitting, no numbers parsing)
Processes leading zeros correctly (e.g. "0001" equals "1", "01234" is less than "4567")
public class NumberAwareComparator implements Comparator<String>
{
@Override
public int compare(String s1, String s2)
{
int len1 = s1.length();
int len2 = s2.length();
int i1 = 0;
int i2 = 0;
while (true)
{
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (Character.isDigit(ch1) && Character.isDigit(ch2))
{
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && Character.isDigit(s1.charAt(end1)))
end1++;
while (end2 < len2 && Character.isDigit(s2.charAt(end2)))
end2++;
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
// if the lengths are different, then the longer number is bigger
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1)
{
if (s1.charAt(i1) != s2.charAt(i2))
return s1.charAt(i1) - s2.charAt(i2);
i1++;
i2++;
}
}
else
{
// plain characters comparison
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
}
Instead of reinventing the wheel, I'd suggest using a locale-aware, Unicode-compliant string comparator with built-in number sorting from the ICU4J library.
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
public class CollatorExample {
public static void main(String[] args) {
// Make sure to choose correct locale: in Turkish uppercase of "i" is "İ", not "I"
RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.US);
collator.setNumericCollation(true); // Place "10" after "2"
collator.setStrength(Collator.PRIMARY); // Case-insensitive
List<String> strings = Arrays.asList("10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a"
);
strings.sort(collator);
System.out.println(String.join(", ", strings));
// Output: _1, _01, _2, _200, 01, 001, 1,
// 2, 02, 10, 10, 010, 20, 100, A 02, A01,
// a2, A20, t1A, t1a, t1ab, t1aB, t1Ab, t1AB,
// T010T01, T0010T01
}
}
The Alphanum algorithm is nice, but it did not match the requirements for a project I'm working on. I need to be able to sort negative numbers and decimals correctly. Here is the implementation I came up with. Any feedback would be much appreciated.
public class StringAsNumberComparator implements Comparator<String> {
public static final Pattern NUMBER_PATTERN = Pattern.compile("(\\-?\\d+\\.\\d+)|(\\-?\\.\\d+)|(\\-?\\d+)");
/**
* Splits strings into parts sorting each instance of a number as a number if there is
* a matching number in the other String.
*
* For example A1B, A2B, A11B, A11B1, A11B2, A11B11 will be sorted in that order instead
* of alphabetically which will sort A1B and A11B together.
*/
public int compare(String str1, String str2) {
if(str1 == str2) return 0;
else if(str1 == null) return 1;
else if(str2 == null) return -1;
List<String> split1 = split(str1);
List<String> split2 = split(str2);
int diff = 0;
for(int i = 0; diff == 0 && i < split1.size() && i < split2.size(); i++) {
String token1 = split1.get(i);
String token2 = split2.get(i);
if(NUMBER_PATTERN.matcher(token1).matches() && NUMBER_PATTERN.matcher(token2).matches()) {
diff = (int) Math.signum(Double.parseDouble(token1) - Double.parseDouble(token2));
} else {
diff = token1.compareToIgnoreCase(token2);
}
}
if(diff != 0) {
return diff;
} else {
return split1.size() - split2.size();
}
}
/**
* Splits a string into strings and number tokens.
*/
private List<String> split(String s) {
List<String> list = new ArrayList<String>();
try (Scanner scanner = new Scanner(s)) {
int index = 0;
String num = null;
while ((num = scanner.findInLine(NUMBER_PATTERN)) != null) {
int indexOfNumber = s.indexOf(num, index);
if (indexOfNumber > index) {
list.add(s.substring(index, indexOfNumber));
}
list.add(num);
index = indexOfNumber + num.length();
}
if (index < s.length()) {
list.add(s.substring(index));
}
}
return list;
}
}
PS. I wanted to use the java.lang.String.split() method and use "lookahead/lookbehind" to keep the tokens, but I could not get it to work with the regular expression I was using.
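For plain digit runs, a lookaround split can be sketched as follows. This is only an illustration with made-up names (LookaroundSplit, tokens), and it does not handle the negative or decimal numbers that NUMBER_PATTERN above matches: "-1.5" would come back as "-", "1", ".", "5".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits at the boundaries between digits and non-digits.
final class LookaroundSplit {
    // Zero-width match wherever a digit is followed by a non-digit or vice versa,
    // so no characters are consumed and every token is kept.
    private static final String BOUNDARY =
            "(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])";

    static List<String> tokens(String s) {
        return Arrays.asList(s.split(BOUNDARY));
    }
}
```

For example, tokens("A11B2") yields A, 11, B, 2.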
Interesting problem, and here is my proposed solution:
import java.util.Collections;
import java.util.Vector;
public class CompareToken implements Comparable<CompareToken>
{
int valN;
String valS;
String repr;
public String toString() {
return repr;
}
public CompareToken(String s) {
int l = 0;
char data[] = new char[s.length()];
repr = s;
valN = 0;
for (char c : s.toCharArray()) {
if(Character.isDigit(c))
valN = valN * 10 + (c - '0');
else
data[l++] = c;
}
valS = new String(data, 0, l);
}
public int compareTo(CompareToken b) {
int r = valS.compareTo(b.valS);
if (r != 0)
return r;
return valN - b.valN;
}
public static void main(String [] args) {
String [] strings = {
"aaa",
"bbb3ccc",
"bbb12ccc",
"ccc 11",
"ddd",
"eee3dddjpeg2000eee",
"eee12dddjpeg2000eee"
};
Vector<CompareToken> data = new Vector<CompareToken>();
for(String s : strings)
data.add(new CompareToken(s));
Collections.shuffle(data);
Collections.sort(data);
for (CompareToken c : data)
System.out.println ("" + c);
}
}
Prior to discovering this thread, I implemented a similar solution in JavaScript. Perhaps my strategy will serve you well, despite the different syntax. Similar to the above, I parse the two strings being compared and split them both into arrays, dividing the strings at runs of consecutive digits.
...
var regex = /(\d+)/g,
str1Components = str1.split(regex),
str2Components = str2.split(regex),
...
I.e., 'hello22goodbye 33' => ['hello', 22, 'goodbye ', 33]; Thus, you can walk through the arrays' elements in pairs between string1 and string2, do some type coercion (such as, is this element really a number?), and compare as you walk.
Working example here: http://jsfiddle.net/F46s6/3/
Note, I currently only support integer types, though handling decimal values wouldn't be too hard of a modification.
My 2 cents. It is working well for me. I am mainly using it for filenames.
private final boolean isDigit(char ch)
{
return ch >= 48 && ch <= 57;
}
private int compareNumericalString(String s1,String s2){
int s1Counter=0;
int s2Counter=0;
while(true){
if(s1Counter>=s1.length()){
break;
}
if(s2Counter>=s2.length()){
break;
}
char currentChar1=s1.charAt(s1Counter++);
char currentChar2=s2.charAt(s2Counter++);
if(isDigit(currentChar1) &&isDigit(currentChar2)){
String digitString1=""+currentChar1;
String digitString2=""+currentChar2;
// Extend each digit run on its own, stopping at the end of its string
while(s1Counter<s1.length() && isDigit(s1.charAt(s1Counter))){
digitString1+=s1.charAt(s1Counter++);
}
while(s2Counter<s2.length() && isDigit(s2.charAt(s2Counter))){
digitString2+=s2.charAt(s2Counter++);
}
if(!digitString1.equals(digitString2)){
return Integer.parseInt(digitString1)-Integer.parseInt(digitString2);
}
}
if(currentChar1!=currentChar2){
return currentChar1-currentChar2;
}
}
return s1.compareTo(s2);
}
I created a project to compare the different implementations. It is far from complete, but it is a starting point.
Adding on to the answer by @stanislav.
A few problems I faced while using that answer were:
Capital and small letters are separated by the characters between their ASCII codes. This breaks the flow when the strings being sorted have _ or other characters which are between small letters and capital letters in ASCII.
If two strings are the same except for the leading zeroes count being different, the function returns 0 which will make the sort depend on the original positions of the string in the list.
These two issues have been fixed in the new code, and I extracted a few helper functions in place of repeated code. The differentCaseCompared variable keeps track of whether two strings are the same except for case. If so, the difference between the first pair of characters that differ only in case is returned. This avoids returning 0 for two strings that differ only by case.
public class NaturalSortingComparator implements Comparator<String> {
@Override
public int compare(String string1, String string2) {
int lengthOfString1 = string1.length();
int lengthOfString2 = string2.length();
int iteratorOfString1 = 0;
int iteratorOfString2 = 0;
int differentCaseCompared = 0;
while (true) {
if (iteratorOfString1 == lengthOfString1) {
if (iteratorOfString2 == lengthOfString2) {
if (lengthOfString1 == lengthOfString2) {
// If both strings are the same except for the different cases, the differentCaseCompared will be returned
return differentCaseCompared;
}
//If the characters are the same at the point, returns the difference between length of the strings
else {
return lengthOfString1 - lengthOfString2;
}
}
//If String2 is bigger than String1
else
return -1;
}
//Check if String1 is bigger than string2
if (iteratorOfString2 == lengthOfString2) {
return 1;
}
char ch1 = string1.charAt(iteratorOfString1);
char ch2 = string2.charAt(iteratorOfString2);
if (Character.isDigit(ch1) && Character.isDigit(ch2)) {
// skip leading zeros
iteratorOfString1 = skipLeadingZeroes(string1, lengthOfString1, iteratorOfString1);
iteratorOfString2 = skipLeadingZeroes(string2, lengthOfString2, iteratorOfString2);
// find the ends of the numbers
int endPositionOfNumbersInString1 = findEndPositionOfNumber(string1, lengthOfString1, iteratorOfString1);
int endPositionOfNumbersInString2 = findEndPositionOfNumber(string2, lengthOfString2, iteratorOfString2);
int lengthOfDigitsInString1 = endPositionOfNumbersInString1 - iteratorOfString1;
int lengthOfDigitsInString2 = endPositionOfNumbersInString2 - iteratorOfString2;
// if the lengths are different, then the longer number is bigger
if (lengthOfDigitsInString1 != lengthOfDigitsInString2)
return lengthOfDigitsInString1 - lengthOfDigitsInString2;
// compare numbers digit by digit
while (iteratorOfString1 < endPositionOfNumbersInString1) {
if (string1.charAt(iteratorOfString1) != string2.charAt(iteratorOfString2))
return string1.charAt(iteratorOfString1) - string2.charAt(iteratorOfString2);
iteratorOfString1++;
iteratorOfString2++;
}
} else {
// plain characters comparison
if (ch1 != ch2) {
if (!ignoreCharacterCaseEquals(ch1, ch2))
return Character.toLowerCase(ch1) - Character.toLowerCase(ch2);
// Set a differentCaseCompared if the characters being compared are different case.
// Should be done only once, hence the check with 0
if (differentCaseCompared == 0) {
differentCaseCompared = ch1 - ch2;
}
}
iteratorOfString1++;
iteratorOfString2++;
}
}
}
private boolean ignoreCharacterCaseEquals(char character1, char character2) {
return Character.toLowerCase(character1) == Character.toLowerCase(character2);
}
private int findEndPositionOfNumber(String string, int lengthOfString, int end) {
while (end < lengthOfString && Character.isDigit(string.charAt(end)))
end++;
return end;
}
private int skipLeadingZeroes(String string, int lengthOfString, int iteratorOfString) {
while (iteratorOfString < lengthOfString && string.charAt(iteratorOfString) == '0')
iteratorOfString++;
return iteratorOfString;
}
}
The following is a unit test I used.
public class NaturalSortingComparatorTest {
private int NUMBER_OF_TEST_CASES = 100000;
@Test
public void compare() {
NaturalSortingComparator naturalSortingComparator = new NaturalSortingComparator();
List<String> expectedStringList = getCorrectStringList();
List<String> testListOfStrings = createTestListOfStrings();
runTestCases(expectedStringList, testListOfStrings, NUMBER_OF_TEST_CASES, naturalSortingComparator);
}
private void runTestCases(List<String> expectedStringList, List<String> testListOfStrings,
int numberOfTestCases, Comparator<String> comparator) {
for (int testCase = 0; testCase < numberOfTestCases; testCase++) {
Collections.shuffle(testListOfStrings);
testListOfStrings.sort(comparator);
Assert.assertEquals(expectedStringList, testListOfStrings);
}
}
private List<String> getCorrectStringList() {
return Arrays.asList(
"1", "01", "001", "2", "02", "10", "10", "010",
"20", "100", "_1", "_01", "_2", "_200", "A 02",
"A01", "a2", "A20", "t1A", "t1a", "t1AB", "t1Ab",
"t1aB", "t1ab", "T010T01", "T0010T01");
}
private List<String> createTestListOfStrings() {
return Arrays.asList(
"10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a");
}
}
Suggestions welcome! I am not sure whether adding the functions changes anything other than the readability part of things.
P.S: Sorry to add another answer to this question. But I don't have enough reps to comment on the answer which I modified for my use.
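As a side note on the second problem (a comparator that returns 0 for strings differing only in leading zeros), one alternative to tracking the tie inside the comparator is to chain a fallback with Comparator.thenComparing. The sketch below is an assumption-laden illustration, not the code above: BY_VALUE is a toy comparator for digit-only strings, and all names are invented.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical example of breaking ties with a fallback comparator.
final class TieBreakExample {
    // Compares digit-only strings by value, so "1", "01" and "001" tie.
    static final Comparator<String> BY_VALUE =
            Comparator.comparing((String s) -> new BigInteger(s));

    // Ties fall through to plain String ordering, which makes the result
    // independent of the original positions in the list.
    static final Comparator<String> BY_VALUE_THEN_TEXT =
            BY_VALUE.thenComparing(Comparator.naturalOrder());

    static List<String> sorted(List<String> input) {
        String[] copy = input.toArray(new String[0]);
        Arrays.sort(copy, BY_VALUE_THEN_TEXT);
        return Arrays.asList(copy);
    }
}
```

Here sorted(Arrays.asList("2", "1", "01", "001")) gives 001, 01, 1, 2 whatever the input order. The order among the tied strings is simply what the fallback defines, so it differs from the "1", "01", "001" order expected by the unit test above.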
A modification of this answer that adds:
case-insensitive ordering (1000a is less than 1000X)
null handling
Implementation:
import static java.lang.Math.pow;
import java.util.Comparator;
public class AlphanumComparator implements Comparator<String> {
public static final AlphanumComparator ALPHANUM_COMPARATOR = new AlphanumComparator();
private static char[] upperCaseCache = new char[(int) pow(2, 16)];
private boolean nullIsLess;
public AlphanumComparator() {
}
public AlphanumComparator(boolean nullIsLess) {
this.nullIsLess = nullIsLess;
}
@Override
public int compare(String s1, String s2) {
if (s1 == s2)
return 0;
if (s1 == null)
return nullIsLess ? -1 : 1;
if (s2 == null)
return nullIsLess ? 1 : -1;
int i1 = 0;
int i2 = 0;
int len1 = s1.length();
int len2 = s2.length();
while (true) {
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (isDigit(ch1) && isDigit(ch2)) {
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && isDigit(s1.charAt(end1)))
end1++;
while (end2 != len2 && isDigit(s2.charAt(end2)))
end2++;
// if the lengths are different, then the longer number is bigger
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1) {
ch1 = s1.charAt(i1);
ch2 = s2.charAt(i2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
} else {
ch1 = toUpperCase(ch1);
ch2 = toUpperCase(ch2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
private boolean isDigit(char ch) {
return ch >= 48 && ch <= 57;
}
private char toUpperCase(char ch) {
char cached = upperCaseCache[ch];
if (cached == 0) {
cached = Character.toUpperCase(ch);
upperCaseCache[ch] = cached;
}
return cached;
}
}
I think you'll have to do the comparison in a character-by-character fashion. Grab a character; if it's a digit, keep grabbing, then reassemble the characters into a single number string and convert it into an int. Repeat on the other string, and only then do the comparison.
Short answer: based on the context, I can't tell whether this is just some quick-and-dirty code for personal use, or a key part of Goldman Sachs' latest internal accounting software, so I'll open by saying: eww. That's a rather funky sorting algorithm; try to use something a bit less "twisty" if you can.
Long answer:
The two issues that immediately come to mind in your case are performance, and correctness. Informally, make sure it's fast, and make sure your algorithm is a total ordering.
(Of course, if you're not sorting more than about 100 items, you can probably disregard this paragraph.) Performance matters, as the speed of the comparator will be the largest factor in the speed of your sort (assuming the sort algorithm is "ideal" to the typical list). In your case, the comparator's speed will depend mainly on the size of the string. The strings seem to be fairly short, so they probably won't dominate as much as the size of your list.
Turning each string into a string-number-string tuple and then sorting this list of tuples, as suggested in another answer, will fail in some of your cases, since you apparently will have strings with multiple numbers appearing.
The other problem is correctness. Specifically, if the algorithm you described will ever permit A > B > ... > A, then your sort will be non-deterministic. In your case, I fear that it might, though I can't prove it. Consider some parsing cases such as:
aa 0 aa
aa 23aa
aa 2a3aa
aa 113aa
aa 113 aa
a 1-2 a
a 13 a
a 12 a
a 2-3 a
a 21 a
a 2.3 a
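One way to probe a candidate comparator against cases like these is a brute-force check of the ordering rules over a sample list. The following is a minimal sketch with invented names (ComparatorContractCheck, findViolation). It can only find violations among the samples it is given, so a clean result proves nothing about other inputs.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: brute-force check of comparator consistency on samples.
final class ComparatorContractCheck {
    // Returns a description of the first violation found, or null if none.
    static <T> String findViolation(List<T> samples, Comparator<? super T> cmp) {
        for (T a : samples) {
            for (T b : samples) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                // compare(a, b) and compare(b, a) must have opposite signs.
                if (ab != -ba) {
                    return "not antisymmetric: " + a + " / " + b;
                }
                for (T c : samples) {
                    int bc = Integer.signum(cmp.compare(b, c));
                    int ac = Integer.signum(cmp.compare(a, c));
                    // a > b and b > c must imply a > c.
                    if (ab > 0 && bc > 0 && ac <= 0) {
                        return "not transitive: " + a + " / " + b + " / " + c;
                    }
                    // If a ties with b, both must compare the same way against c.
                    if (ab == 0 && ac != bc) {
                        return "inconsistent tie: " + a + " / " + b + " / " + c;
                    }
                }
            }
        }
        return null;
    }
}
```

The check is cubic in the number of samples, which is fine for a few dozen hand-picked strings such as the ones listed above.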
Although the question asked for a Java solution, here is one for anyone who wants a Scala solution:
object Alphanum {
private[this] val regex = "((?<=[0-9])(?=[^0-9]))|((?<=[^0-9])(?=[0-9]))"
private[this] val alphaNum: Ordering[String] = Ordering.fromLessThan((ss1: String, ss2: String) => (ss1, ss2) match {
case (sss1, sss2) if sss1.matches("[0-9]+") && sss2.matches("[0-9]+") => sss1.toLong < sss2.toLong
case (sss1, sss2) => sss1 < sss2
})
def ordering: Ordering[String] = Ordering.fromLessThan((s1: String, s2: String) => {
import Ordering.Implicits.infixOrderingOps
implicit val ord: Ordering[List[String]] = Ordering.Implicits.seqDerivedOrdering(alphaNum)
s1.split(regex).toList < s2.split(regex).toList
})
}
My problem was that I have lists consisting of a combination of alpha numeric strings (eg C22, C3, C5 etc), alpha strings (eg A, H, R etc) and just digits (eg 99, 45 etc) that need sorting in the order A, C3, C5, C22, H, R, 45, 99. I also have duplicates that need removing so I only get a single entry.
I'm also not just working with Strings, I'm ordering an Object and using a specific field within the Object to get the correct order.
A solution that seems to work for me is :
SortedSet<Code> codeSet;
codeSet = new TreeSet<Code>(new Comparator<Code>() {
private boolean isThereAnyNumber(String a, String b) {
return isNumber(a) || isNumber(b);
}
private boolean isNumber(String s) {
return s.matches("[-+]?\\d*\\.?\\d+");
}
private String extractChars(String s) {
String chars = s.replaceAll("\\d", "");
return chars;
}
private int extractInt(String s) {
String num = s.replaceAll("\\D", "");
return num.isEmpty() ? 0 : Integer.parseInt(num);
}
private int compareStrings(String o1, String o2) {
if (!extractChars(o1).equals(extractChars(o2))) {
return o1.compareTo(o2);
} else
return extractInt(o1) - extractInt(o2);
}
@Override
public int compare(Code a, Code b) {
return isThereAnyNumber(a.getPrimaryCode(), b.getPrimaryCode())
? isNumber(a.getPrimaryCode()) ? 1 : -1
: compareStrings(a.getPrimaryCode(), b.getPrimaryCode());
}
});
It 'borrows' some code that I found here on Stack Overflow plus some tweaks of my own to get it working just how I needed it to.
Because I am ordering Objects and need both a comparator and duplicate removal, one fudge I had to employ was to first write my Objects to a TreeMap before writing them to a TreeSet. It may impact performance a little, but given that the lists will have at most about 80 Codes, it shouldn't be a problem.
I had a similar problem where my strings had space-separated segments inside. I solved it in this way:
public class StringWithNumberComparator implements Comparator<MyClass> {
@Override
public int compare(MyClass o1, MyClass o2) {
if (o1.getStringToCompare().equals(o2.getStringToCompare())) {
return 0;
}
String[] first = o1.getStringToCompare().split(" ");
String[] second = o2.getStringToCompare().split(" ");
if (first.length == second.length) {
for (int i = 0; i < first.length; i++) {
int segmentCompare = StringUtils.compare(first[i], second[i]);
if (StringUtils.isNumeric(first[i]) && StringUtils.isNumeric(second[i])) {
segmentCompare = NumberUtils.compare(Integer.valueOf(first[i]), Integer.valueOf(second[i]));
if (0 != segmentCompare) {
// return only if uneven numbers in case there are more segments to be checked
return segmentCompare;
}
}
if (0 != segmentCompare) {
return segmentCompare;
}
}
} else {
return StringUtils.compare(o1.getStringToCompare(), o2.getStringToCompare());
}
return 0;
}
}
As you can see, I have used Apache's StringUtils.compare() and NumberUtils.compare() as standard helpers.
In your given example, the numbers you want to compare have spaces around them while the other numbers do not, so why would a regular expression not work?
bbb 12 ccc
vs.
eee 12 ddd jpeg2000 eee
If you're writing a comparator class, you should implement your own compare method that will compare two strings character by character. This compare method should check if you're dealing with alphabetic characters, numeric characters, or mixed types (including spaces). You'll have to define how you want a mixed type to act, whether numbers come before alphabetic characters or after, and where spaces fit in etc.
On Linux glibc provides strverscmp(), it's also available from gnulib for portability. However truly "human" sorting has lots of other quirks like "The Beatles" being sorted as "Beatles, The". There is no simple solution to this generic problem.
System.out.println(strings);
[x2a, x2b, x15, xa, y11, y16, z, z, z5]
I realize you're in java, but you can take a look at how StrCmpLogicalW works. It's what Explorer uses to sort filenames in Windows. You can look at the WINE implementation here.
Split the string into runs of letters and numbers, so "foo 12 bar" becomes the list ("foo", 12, "bar"), then use the list as the sort key. This way the numbers will be ordered in numerical order, not alphabetical.
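The idea above can be sketched roughly as follows. This is only an illustration under a few assumptions: the class name RunKeyComparator and the helper runs() are made up for this example, only ASCII digits count as numbers, and runs of different kinds at the same position are simply compared as text.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical name; not taken from any answer in this thread.
final class RunKeyComparator implements Comparator<String> {

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Split into maximal runs of digits and non-digits:
    // "foo 12 bar" -> ["foo ", "12", " bar"]
    static List<String> runs(String s) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            if (i == s.length() || isAsciiDigit(s.charAt(i)) != isAsciiDigit(s.charAt(start))) {
                out.add(s.substring(start, i));
                start = i;
            }
        }
        return out;
    }

    @Override
    public int compare(String a, String b) {
        List<String> ra = runs(a);
        List<String> rb = runs(b);
        int n = Math.min(ra.size(), rb.size());
        for (int i = 0; i < n; i++) {
            String x = ra.get(i);
            String y = rb.get(i);
            boolean bothNumeric = isAsciiDigit(x.charAt(0)) && isAsciiDigit(y.charAt(0));
            // BigInteger avoids overflow on very long digit runs.
            int c = bothNumeric
                    ? new BigInteger(x).compareTo(new BigInteger(y))
                    : x.compareTo(y);
            if (c != 0) {
                return c;
            }
        }
        if (ra.size() != rb.size()) {
            return Integer.compare(ra.size(), rb.size());
        }
        // Same key (e.g. "01" vs "1"): fall back to plain text order.
        return a.compareTo(b);
    }
}
```

It can be passed to Arrays.sort(strs, new RunKeyComparator()). Note that it allocates a list per comparison, so it trades speed for simplicity compared with the character-walking answers.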
Here is the solution with the following advantages over Alphanum Algorithm:
3.25 times faster (tested on the data from the 'Epilogue' chapter of the Alphanum description)
Does not consume extra memory (no string splitting, no numbers parsing)
Processes leading zeros correctly (e.g. "0001" equals "1", "01234" is less than "4567")
public class NumberAwareComparator implements Comparator<String>
{
@Override
public int compare(String s1, String s2)
{
int len1 = s1.length();
int len2 = s2.length();
int i1 = 0;
int i2 = 0;
while (true)
{
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (Character.isDigit(ch1) && Character.isDigit(ch2))
{
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && Character.isDigit(s1.charAt(end1)))
end1++;
while (end2 < len2 && Character.isDigit(s2.charAt(end2)))
end2++;
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
// if the lengths are different, then the longer number is bigger
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1)
{
if (s1.charAt(i1) != s2.charAt(i2))
return s1.charAt(i1) - s2.charAt(i2);
i1++;
i2++;
}
}
else
{
// plain characters comparison
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
}
Instead of reinventing the wheel, I'd suggest to use a locale-aware Unicode-compliant string comparator that has built-in number sorting from the ICU4J library.
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
public class CollatorExample {
public static void main(String[] args) {
// Make sure to choose correct locale: in Turkish uppercase of "i" is "İ", not "I"
RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.US);
collator.setNumericCollation(true); // Place "10" after "2"
collator.setStrength(Collator.PRIMARY); // Case-insensitive
List<String> strings = Arrays.asList("10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a"
);
strings.sort(collator);
System.out.println(String.join(", ", strings));
// Output: _1, _01, _2, _200, 01, 001, 1,
// 2, 02, 10, 10, 010, 20, 100, A 02, A01,
// a2, A20, t1A, t1a, t1ab, t1aB, t1Ab, t1AB,
// T010T01, T0010T01
}
}
The Alphanum algorithm is nice, but it did not match the requirements of a project I'm working on. I need to be able to sort negative numbers and decimals correctly. Here is the implementation I came up with. Any feedback would be much appreciated.
public class StringAsNumberComparator implements Comparator<String> {
public static final Pattern NUMBER_PATTERN = Pattern.compile("(\\-?\\d+\\.\\d+)|(\\-?\\.\\d+)|(\\-?\\d+)");
/**
* Splits strings into parts sorting each instance of a number as a number if there is
* a matching number in the other String.
*
* For example A1B, A2B, A11B, A11B1, A11B2, A11B11 will be sorted in that order instead
* of alphabetically which will sort A1B and A11B together.
*/
public int compare(String str1, String str2) {
if(str1 == str2) return 0;
else if(str1 == null) return 1;
else if(str2 == null) return -1;
List<String> split1 = split(str1);
List<String> split2 = split(str2);
int diff = 0;
for(int i = 0; diff == 0 && i < split1.size() && i < split2.size(); i++) {
String token1 = split1.get(i);
String token2 = split2.get(i);
if(NUMBER_PATTERN.matcher(token1).matches() && NUMBER_PATTERN.matcher(token2).matches()) {
diff = (int) Math.signum(Double.parseDouble(token1) - Double.parseDouble(token2));
} else {
diff = token1.compareToIgnoreCase(token2);
}
}
if(diff != 0) {
return diff;
} else {
return split1.size() - split2.size();
}
}
/**
* Splits a string into strings and number tokens.
*/
private List<String> split(String s) {
List<String> list = new ArrayList<String>();
try (Scanner scanner = new Scanner(s)) {
int index = 0;
String num = null;
while ((num = scanner.findInLine(NUMBER_PATTERN)) != null) {
int indexOfNumber = s.indexOf(num, index);
if (indexOfNumber > index) {
list.add(s.substring(index, indexOfNumber));
}
list.add(num);
index = indexOfNumber + num.length();
}
if (index < s.length()) {
list.add(s.substring(index));
}
}
return list;
}
}
PS. I wanted to use the java.lang.String.split() method and use "lookahead/lookbehind" to keep the tokens, but I could not get it to work with the regular expression I was using.
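For what it's worth, a lookaround-based split does work when the numbers are plain unsigned integers. The sketch below assumes exactly that; it does not handle the signs and decimal points that NUMBER_PATTERN above is meant to cover, and the class name LookaroundSplit is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; handles unsigned integer runs only.
final class LookaroundSplit {
    // Zero-width boundary between a digit and a non-digit, in either direction.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");

    // Splits at every digit/non-digit boundary, keeping all characters.
    static String[] tokens(String s) {
        return BOUNDARY.split(s);
    }
}
```

Because the pattern matches only positions and consumes no characters, every token is kept: tokens("A11B2") yields the four pieces A, 11, B and 2.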
Interesting problem. Here is my proposed solution:
import java.util.Collections;
import java.util.Vector;
public class CompareToken implements Comparable<CompareToken>
{
int valN;
String valS;
String repr;
public String toString() {
return repr;
}
public CompareToken(String s) {
int l = 0;
char data[] = new char[s.length()];
repr = s;
valN = 0;
for (char c : s.toCharArray()) {
if(Character.isDigit(c))
valN = valN * 10 + (c - '0');
else
data[l++] = c;
}
valS = new String(data, 0, l);
}
public int compareTo(CompareToken b) {
int r = valS.compareTo(b.valS);
if (r != 0)
return r;
return valN - b.valN;
}
public static void main(String [] args) {
String [] strings = {
"aaa",
"bbb3ccc",
"bbb12ccc",
"ccc 11",
"ddd",
"eee3dddjpeg2000eee",
"eee12dddjpeg2000eee"
};
Vector<CompareToken> data = new Vector<CompareToken>();
for(String s : strings)
data.add(new CompareToken(s));
Collections.shuffle(data);
Collections.sort(data);
for (CompareToken c : data)
System.out.println ("" + c);
}
}
Prior to discovering this thread, I implemented a similar solution in JavaScript. Perhaps the strategy is still useful to you despite the different syntax. Similar to the answers above, I parse the two strings being compared and split them both into arrays, dividing the strings at runs of consecutive digits.
...
var regex = /(\d+)/g,
str1Components = str1.split(regex),
str2Components = str2.split(regex),
...
I.e., 'hello22goodbye 33' => ['hello', '22', 'goodbye ', '33', '']. The pieces are still strings, and a trailing empty string appears when the input ends in digits. You can then walk through the arrays' elements in pairs between string1 and string2, do some type coercion (such as, is this element really a number?), and compare as you walk.
Working example here: http://jsfiddle.net/F46s6/3/
Note, I currently only support integer types, though handling decimal values wouldn't be too hard of a modification.
My 2 cents. It is working well for me; I mainly use it for filenames.
private final boolean isDigit(char ch)
{
return ch >= 48 && ch <= 57;
}
private int compareNumericalString(String s1,String s2){
int s1Counter=0;
int s2Counter=0;
while(true){
if(s1Counter>=s1.length()){
break;
}
if(s2Counter>=s2.length()){
break;
}
char currentChar1=s1.charAt(s1Counter++);
char currentChar2=s2.charAt(s2Counter++);
if(isDigit(currentChar1) &&isDigit(currentChar2)){
String digitString1=""+currentChar1;
String digitString2=""+currentChar2;
// Collect the remaining digits of each run, checking bounds before every read
while(s1Counter<s1.length() && isDigit(s1.charAt(s1Counter))){
digitString1+=s1.charAt(s1Counter++);
}
while(s2Counter<s2.length() && isDigit(s2.charAt(s2Counter))){
digitString2+=s2.charAt(s2Counter++);
}
if(!digitString1.equals(digitString2)){
return Integer.parseInt(digitString1)-Integer.parseInt(digitString2);
}
}
if(currentChar1!=currentChar2){
return currentChar1-currentChar2;
}
}
return s1.compareTo(s2);
}
I created a project to compare the different implementations. It is far from complete, but it is a starting point.
Adding on to the answer made by @stanislav.
A few problems I faced while using that answer were:
Capital and small letters are separated by the characters between their ASCII codes. This breaks the flow when the strings being sorted have _ or other characters which are between small letters and capital letters in ASCII.
If two strings are the same except for the leading zeroes count being different, the function returns 0, which makes the sort depend on the original positions of the strings in the list.
These two issues have been fixed in the new code, and I extracted a few helper functions to replace repeated blocks of code. The differentCaseCompared variable keeps track of whether two strings are the same except for differing case. If so, the difference between the first pair of characters that differ only by case is returned. This avoids returning 0 for two strings that differ only by case.
public class NaturalSortingComparator implements Comparator<String> {
@Override
public int compare(String string1, String string2) {
int lengthOfString1 = string1.length();
int lengthOfString2 = string2.length();
int iteratorOfString1 = 0;
int iteratorOfString2 = 0;
int differentCaseCompared = 0;
while (true) {
if (iteratorOfString1 == lengthOfString1) {
if (iteratorOfString2 == lengthOfString2) {
if (lengthOfString1 == lengthOfString2) {
// If both strings are the same except for the different cases, the differentCaseCompared will be returned
return differentCaseCompared;
}
//If the characters are the same at the point, returns the difference between length of the strings
else {
return lengthOfString1 - lengthOfString2;
}
}
//If String2 is bigger than String1
else
return -1;
}
//Check if String1 is bigger than string2
if (iteratorOfString2 == lengthOfString2) {
return 1;
}
char ch1 = string1.charAt(iteratorOfString1);
char ch2 = string2.charAt(iteratorOfString2);
if (Character.isDigit(ch1) && Character.isDigit(ch2)) {
// skip leading zeros
iteratorOfString1 = skipLeadingZeroes(string1, lengthOfString1, iteratorOfString1);
iteratorOfString2 = skipLeadingZeroes(string2, lengthOfString2, iteratorOfString2);
// find the ends of the numbers
int endPositionOfNumbersInString1 = findEndPositionOfNumber(string1, lengthOfString1, iteratorOfString1);
int endPositionOfNumbersInString2 = findEndPositionOfNumber(string2, lengthOfString2, iteratorOfString2);
int lengthOfDigitsInString1 = endPositionOfNumbersInString1 - iteratorOfString1;
int lengthOfDigitsInString2 = endPositionOfNumbersInString2 - iteratorOfString2;
// if the lengths are different, then the longer number is bigger
if (lengthOfDigitsInString1 != lengthOfDigitsInString2)
return lengthOfDigitsInString1 - lengthOfDigitsInString2;
// compare numbers digit by digit
while (iteratorOfString1 < endPositionOfNumbersInString1) {
if (string1.charAt(iteratorOfString1) != string2.charAt(iteratorOfString2))
return string1.charAt(iteratorOfString1) - string2.charAt(iteratorOfString2);
iteratorOfString1++;
iteratorOfString2++;
}
} else {
// plain characters comparison
if (ch1 != ch2) {
if (!ignoreCharacterCaseEquals(ch1, ch2))
return Character.toLowerCase(ch1) - Character.toLowerCase(ch2);
// Set a differentCaseCompared if the characters being compared are different case.
// Should be done only once, hence the check with 0
if (differentCaseCompared == 0) {
differentCaseCompared = ch1 - ch2;
}
}
iteratorOfString1++;
iteratorOfString2++;
}
}
}
private boolean ignoreCharacterCaseEquals(char character1, char character2) {
return Character.toLowerCase(character1) == Character.toLowerCase(character2);
}
private int findEndPositionOfNumber(String string, int lengthOfString, int end) {
while (end < lengthOfString && Character.isDigit(string.charAt(end)))
end++;
return end;
}
private int skipLeadingZeroes(String string, int lengthOfString, int iteratorOfString) {
while (iteratorOfString < lengthOfString && string.charAt(iteratorOfString) == '0')
iteratorOfString++;
return iteratorOfString;
}
}
The following is a unit test I used.
public class NaturalSortingComparatorTest {
private int NUMBER_OF_TEST_CASES = 100000;
@Test
public void compare() {
NaturalSortingComparator naturalSortingComparator = new NaturalSortingComparator();
List<String> expectedStringList = getCorrectStringList();
List<String> testListOfStrings = createTestListOfStrings();
runTestCases(expectedStringList, testListOfStrings, NUMBER_OF_TEST_CASES, naturalSortingComparator);
}
private void runTestCases(List<String> expectedStringList, List<String> testListOfStrings,
int numberOfTestCases, Comparator<String> comparator) {
for (int testCase = 0; testCase < numberOfTestCases; testCase++) {
Collections.shuffle(testListOfStrings);
testListOfStrings.sort(comparator);
Assert.assertEquals(expectedStringList, testListOfStrings);
}
}
private List<String> getCorrectStringList() {
return Arrays.asList(
"1", "01", "001", "2", "02", "10", "10", "010",
"20", "100", "_1", "_01", "_2", "_200", "A 02",
"A01", "a2", "A20", "t1A", "t1a", "t1AB", "t1Ab",
"t1aB", "t1ab", "T010T01", "T0010T01");
}
private List<String> createTestListOfStrings() {
return Arrays.asList(
"10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a");
}
}
Suggestions welcome! I am not sure whether adding the functions changes anything other than the readability part of things.
P.S: Sorry to add another answer to this question. But I don't have enough reps to comment on the answer which I modified for my use.
modification of this answer
case insensitive order (1000a is less than 1000X)
nulls handling
implementation:
import static java.lang.Math.pow;
import java.util.Comparator;
public class AlphanumComparator implements Comparator<String> {
public static final AlphanumComparator ALPHANUM_COMPARATOR = new AlphanumComparator();
private static char[] upperCaseCache = new char[(int) pow(2, 16)];
private boolean nullIsLess;
public AlphanumComparator() {
}
public AlphanumComparator(boolean nullIsLess) {
this.nullIsLess = nullIsLess;
}
@Override
public int compare(String s1, String s2) {
if (s1 == s2)
return 0;
if (s1 == null)
return nullIsLess ? -1 : 1;
if (s2 == null)
return nullIsLess ? 1 : -1;
int i1 = 0;
int i2 = 0;
int len1 = s1.length();
int len2 = s2.length();
while (true) {
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (isDigit(ch1) && isDigit(ch2)) {
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && isDigit(s1.charAt(end1)))
end1++;
while (end2 != len2 && isDigit(s2.charAt(end2)))
end2++;
// if the lengths are different, then the longer number is bigger
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1) {
ch1 = s1.charAt(i1);
ch2 = s2.charAt(i2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
} else {
ch1 = toUpperCase(ch1);
ch2 = toUpperCase(ch2);
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
private boolean isDigit(char ch) {
return ch >= 48 && ch <= 57;
}
private char toUpperCase(char ch) {
char cached = upperCaseCache[ch];
if (cached == 0) {
cached = Character.toUpperCase(ch);
upperCaseCache[ch] = cached;
}
return cached;
}
}
I think you'll have to do the comparison character by character. Grab a character; if it's a digit, keep grabbing, then reassemble the characters into a single number string and convert it into an int. Repeat on the other string, and only then do the comparison.
Short answer: based on the context, I can't tell whether this is just some quick-and-dirty code for personal use, or a key part of Goldman Sachs' latest internal accounting software, so I'll open by saying: eww. That's a rather funky sorting algorithm; try to use something a bit less "twisty" if you can.
Long answer:
The two issues that immediately come to mind in your case are performance, and correctness. Informally, make sure it's fast, and make sure your algorithm is a total ordering.
(Of course, if you're not sorting more than about 100 items, you can probably disregard this paragraph.) Performance matters, as the speed of the comparator will be the largest factor in the speed of your sort (assuming the sort algorithm is "ideal" to the typical list). In your case, the comparator's speed will depend mainly on the size of the string. The strings seem to be fairly short, so they probably won't dominate as much as the size of your list.
Turning each string into a string-number-string tuple and then sorting this list of tuples, as suggested in another answer, will fail in some of your cases, since you apparently will have strings with multiple numbers appearing.
The other problem is correctness. Specifically, if the algorithm you described will ever permit A > B > ... > A, then your sort will be non-deterministic. In your case, I fear that it might, though I can't prove it. Consider some parsing cases such as:
aa 0 aa
aa 23aa
aa 2a3aa
aa 113aa
aa 113 aa
a 1-2 a
a 13 a
a 12 a
a 2-3 a
a 21 a
a 2.3 a
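One way to gain some confidence here is to brute-force the comparator contract over a list of awkward inputs like the ones above. The following is only a sketch: the class and method names are made up, and passing the check on a sample does not prove the comparator is a total ordering, it only fails to find a counterexample.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for spot-checking a comparator on sample data.
final class ComparatorContractCheck {
    // Returns a description of the first violation found, or null if none.
    static <T> String findViolation(Comparator<? super T> cmp, List<T> samples) {
        for (T a : samples) {
            for (T b : samples) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                // compare(a, b) and compare(b, a) must have opposite signs.
                if (ab != -ba) {
                    return "asymmetric: " + a + " vs " + b;
                }
                for (T c : samples) {
                    // a > b and b > c must imply a > c.
                    if (ab > 0 && cmp.compare(b, c) > 0 && cmp.compare(a, c) <= 0) {
                        return "not transitive: " + a + " > " + b + " > " + c;
                    }
                }
            }
        }
        return null;
    }
}
```

The check is cubic in the number of samples, so it is meant for a few dozen hand-picked strings rather than real data sets.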
Although the question asked for a Java solution, here is one for anyone who wants a Scala solution:
object Alphanum {
private[this] val regex = "((?<=[0-9])(?=[^0-9]))|((?<=[^0-9])(?=[0-9]))"
private[this] val alphaNum: Ordering[String] = Ordering.fromLessThan((ss1: String, ss2: String) => (ss1, ss2) match {
case (sss1, sss2) if sss1.matches("[0-9]+") && sss2.matches("[0-9]+") => sss1.toLong < sss2.toLong
case (sss1, sss2) => sss1 < sss2
})
def ordering: Ordering[String] = Ordering.fromLessThan((s1: String, s2: String) => {
import Ordering.Implicits.infixOrderingOps
implicit val ord: Ordering[List[String]] = Ordering.Implicits.seqDerivedOrdering(alphaNum)
s1.split(regex).toList < s2.split(regex).toList
})
}
My problem was that I have lists consisting of a combination of alpha numeric strings (eg C22, C3, C5 etc), alpha strings (eg A, H, R etc) and just digits (eg 99, 45 etc) that need sorting in the order A, C3, C5, C22, H, R, 45, 99. I also have duplicates that need removing so I only get a single entry.
I'm also not just working with Strings, I'm ordering an Object and using a specific field within the Object to get the correct order.
A solution that seems to work for me is :
SortedSet<Code> codeSet;
codeSet = new TreeSet<Code>(new Comparator<Code>() {
private boolean isThereAnyNumber(String a, String b) {
return isNumber(a) || isNumber(b);
}
private boolean isNumber(String s) {
return s.matches("[-+]?\\d*\\.?\\d+");
}
private String extractChars(String s) {
String chars = s.replaceAll("\\d", "");
return chars;
}
private int extractInt(String s) {
String num = s.replaceAll("\\D", "");
return num.isEmpty() ? 0 : Integer.parseInt(num);
}
private int compareStrings(String o1, String o2) {
if (!extractChars(o1).equals(extractChars(o2))) {
return o1.compareTo(o2);
} else
return extractInt(o1) - extractInt(o2);
}
@Override
public int compare(Code a, Code b) {
return isThereAnyNumber(a.getPrimaryCode(), b.getPrimaryCode())
? isNumber(a.getPrimaryCode()) ? 1 : -1
: compareStrings(a.getPrimaryCode(), b.getPrimaryCode());
}
});
It 'borrows' some code that I found here on Stack Overflow, plus some tweaks of my own to get it working just how I needed it to.
Because I am ordering Objects and need both a comparator and duplicate removal, one fudge I had to employ was to first write my Objects to a TreeMap before writing them to a TreeSet. It may impact performance a little, but given that the lists will hold at most about 80 Codes, it shouldn't be a problem.
I had a similar problem where my strings had space-separated segments inside. I solved it in this way:
public class StringWithNumberComparator implements Comparator<MyClass> {
@Override
public int compare(MyClass o1, MyClass o2) {
if (o1.getStringToCompare().equals(o2.getStringToCompare())) {
return 0;
}
String[] first = o1.getStringToCompare().split(" ");
String[] second = o2.getStringToCompare().split(" ");
if (first.length == second.length) {
for (int i = 0; i < first.length; i++) {
int segmentCompare = StringUtils.compare(first[i], second[i]);
if (StringUtils.isNumeric(first[i]) && StringUtils.isNumeric(second[i])) {
segmentCompare = NumberUtils.compare(Integer.valueOf(first[i]), Integer.valueOf(second[i]));
if (0 != segmentCompare) {
// return only if uneven numbers in case there are more segments to be checked
return segmentCompare;
}
}
if (0 != segmentCompare) {
return segmentCompare;
}
}
} else {
return StringUtils.compare(o1.getStringToCompare(), o2.getStringToCompare());
}
return 0;
}
}
As you can see, I have used Apache's StringUtils.compare() and NumberUtils.compare() as helpers.
In your given example, the numbers you want to compare have spaces around them while the other numbers do not, so why would a regular expression not work?
bbb 12 ccc
vs.
eee 12 ddd jpeg2000 eee
If you're writing a comparator class, you should implement your own compare method that will compare two strings character by character. This compare method should check if you're dealing with alphabetic characters, numeric characters, or mixed types (including spaces). You'll have to define how you want a mixed type to act, whether numbers come before alphabetic characters or after, and where spaces fit in etc.
On Linux glibc provides strverscmp(), it's also available from gnulib for portability. However truly "human" sorting has lots of other quirks like "The Beatles" being sorted as "Beatles, The". There is no simple solution to this generic problem.
Here is my code for checking whether two strings are anagrams:
static boolean isAnagram(String a, String b) {
if (a.length() != b.length()) return false;
a = a.toLowerCase();
b = b.toLowerCase();
int m1=0;
for(int i=0;i<a.length();i++){
m1 += (int)a.charAt(i);
m1 -= (int)b.charAt(i);
}
return m1==0;
}
My code fails for two test cases:
case 1: String a="xyzw"; and String b="xyxy";
case 2: String a="bbcc"; and String b="dabc";
Can anyone help me pass the above two cases?
I think your code doesn't work because you sum the character codes, and the difference of the sums can be zero even when the strings are not anagrams, for example "ad" and "bc".
A better way is to sort the characters of both strings; if the sorted arrays have the same length and the same contents, the two strings are anagrams.
static boolean isAnagram(String str1, String str2) {
int[] str1Chars = str1.toLowerCase().chars().sorted().toArray();
int[] str2Chars = str2.toLowerCase().chars().sorted().toArray();
return Arrays.equals(str1Chars, str2Chars);
}
I hope this helps. (It may look a little unfamiliar because I use a stream to create and sort the array of characters.)
Try this:
import java.io.*;
class GFG{
/* function to check whether two strings are
anagram of each other */
static boolean areAnagram(char[] str1, char[] str2)
{
// Get lengths of both strings
int n1 = str1.length;
int n2 = str2.length;
// If length of both strings is not same,
// then they cannot be anagram
if (n1 != n2)
return false;
// Sort both strings
quickSort(str1, 0, n1 - 1);
quickSort(str2, 0, n2 - 1);
// Compare sorted strings
for (int i = 0; i < n1; i++)
if (str1[i] != str2[i])
return false;
return true;
}
// Following functions (exchange and partition
// are needed for quickSort)
static void exchange(char A[],int a, int b)
{
char temp;
temp = A[a];
A[a] = A[b];
A[b] = temp;
}
static int partition(char A[], int si, int ei)
{
char x = A[ei];
int i = (si - 1);
int j;
for (j = si; j <= ei - 1; j++)
{
if(A[j] <= x)
{
i++;
exchange(A, i, j);
}
}
exchange (A, i+1 , ei);
return (i + 1);
}
/* Implementation of Quick Sort
A[] --> Array to be sorted
si --> Starting index
ei --> Ending index
*/
static void quickSort(char A[], int si, int ei)
{
int pi; /* Partitioning index */
if(si < ei)
{
pi = partition(A, si, ei);
quickSort(A, si, pi - 1);
quickSort(A, pi + 1, ei);
}
}
/* Driver program to test areAnagram */
public static void main(String args[])
{
char str1[] = {'t','e','s','t'};
char str2[] = {'t','t','e','w'};
if (areAnagram(str1, str2))
System.out.println("The two strings are"+
" anagram of each other");
else
System.out.println("The two strings are not"+
" anagram of each other");
}
}
The implementation isn't correct. While a pair of anagrams will always have the same length and the same sum of characters, this is not a sufficient condition. There are many pairs of strings that have the same length and the same sum of characters and are not anagrams. E.g., "ad" and "bc".
A better implementation would count the number of times each character appears in each string and compare them. E.g.:
public static boolean isAnagram(String a, String b) {
return charCounts(a).equals(charCounts(b));
}
private static Map<Integer, Long> charCounts(String s) {
return s.chars()
.boxed()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
static boolean isAnagram(String a, String b) {
if (a.length() != b.length())
return false;
a = a.toLowerCase();
b = b.toLowerCase();
HashMap<Integer, Integer> m1 = new HashMap<>(); // Key is ascii number, Value is count. For String a
HashMap<Integer, Integer> m2 = new HashMap<>(); // Key is ascii number, Value is count. For String b
for (int i = 0; i < a.length(); i++) {
int an = (int) (a.charAt(i));
int bn = (int) (b.charAt(i));
// Add 1 to current ascii number. String a.
if (m1.containsKey(an)) {
m1.put(an, m1.get(an) + 1);
}else {
m1.put(an, 1);
}
// Add 1 to current ascii number. String b.
if (m2.containsKey(bn)) {
m2.put(bn, m2.get(bn) + 1);
}else {
m2.put(bn, 1);
}
}
//Check both count equals().
return m1.equals(m2);
}
You should check every letter individually.
If (ascii of a[0] == ascii of b[0] + 1) and (ascii of a[1] == ascii of b[1] - 1), your code returns true because 1 - 1 is zero.
Sorry for the rather complex code.
Adding character values is error-prone logic, because A+C and B+B produce the same number. The best option in this case is to use arrays. Look at the code below:
static boolean isAnagram(String a, String b) {
if (a.length() != b.length()) return false;
a = a.toLowerCase();
b = b.toLowerCase();
char[] charA = a.toCharArray();
Arrays.sort(charA);
char[] charB = b.toCharArray();
Arrays.sort(charB);
return Arrays.equals(charA, charB);
}
This should give you what you want.
Try this. It runs in O(word.length) time.
public boolean checkForAnagram(String str1, String str2) {
if (str1 == null || str2 == null || str1.length() != str2.length()) {
return false;
}
return Arrays.equals(getCharFrequencyTable(str1), getCharFrequencyTable(str2));
}
private int[] getCharFrequencyTable(String str) {
int[] frequencyTable = new int[256]; // An array instead of a HashMap makes it clear that each update is constant time; this assumes every character code is below 256.
char[] charArrayOfStr = str.toLowerCase().toCharArray();
for(char c : charArrayOfStr) {
frequencyTable[c] = frequencyTable[c]+1;
}
return frequencyTable;
}
Check out below methods :
/**
* Java program - String Anagram Example.
* This program checks if two Strings are anagrams or not
*/
public class AnagramCheck {
/*
* One way to find if two Strings are anagram in Java. This method
* assumes both arguments are not null and in lowercase.
*
* @return true, if both String are anagram
*/
public static boolean isAnagram(String word, String anagram){
if(word.length() != anagram.length()){
return false;
}
char[] chars = word.toCharArray();
for(char c : chars){
int index = anagram.indexOf(c);
if(index != -1){
anagram = anagram.substring(0,index) + anagram.substring(index +1, anagram.length());
}else{
return false;
}
}
return anagram.isEmpty();
}
/*
* Another way to check if two Strings are anagram or not in Java
* This method assumes that both word and anagram are not null and lowercase
* @return true, if both Strings are anagram.
*/
public static boolean iAnagram(String word, String anagram){
char[] charFromWord = word.toCharArray();
char[] charFromAnagram = anagram.toCharArray();
Arrays.sort(charFromWord);
Arrays.sort(charFromAnagram);
return Arrays.equals(charFromWord, charFromAnagram);
}
public static boolean checkAnagram(String first, String second){
char[] characters = first.toCharArray();
StringBuilder sbSecond = new StringBuilder(second);
for(char ch : characters){
int index = sbSecond.indexOf("" + ch);
if(index != -1){
sbSecond.deleteCharAt(index);
}else{
return false;
}
}
return sbSecond.length() == 0;
}
}
You are adding the ascii values of characters in given strings and comparing them, which will not always give you correct results. Consider this:
String a="acd" and String b="ccb"
both of them will give you a sum of 296 but these are not anagrams.
Instead, you can count the occurrences of each character in both strings and compare the counts. In the above example, this gives {"a":1,"c":1,"d":1} and {"c":2,"b":1}.
Alternatively, you can associate a prime number with each character in [a-z], where 'a' maps to 2, 'b' maps to 3, 'c' maps to 5, and so on.
Next, you can calculate the product of the prime numbers associated with the characters in the given string. Multiplication is commutative (xy = yx), so the order of the characters does not affect the result.
Example:
abc --> 2*3*5 = 30
cba --> 5*3*2 = 30
Note: If the string size is huge, this might not be the best approach as you might encounter overflow issues.
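A minimal sketch of the prime-product idea, assuming the input contains only the letters a-z (case-insensitive). The class name PrimeProductAnagram is made up for illustration, and BigInteger is used here specifically to sidestep the overflow issue mentioned in the note, at the cost of slower arithmetic.

```java
import java.math.BigInteger;
import java.util.Locale;

// Hypothetical helper; supports the letters a-z only.
final class PrimeProductAnagram {
    // The first 26 primes, one per letter 'a'..'z'.
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    // Product of the primes assigned to each letter; BigInteger cannot overflow.
    static BigInteger product(String s) {
        BigInteger p = BigInteger.ONE;
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported: " + c);
            }
            p = p.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
        }
        return p;
    }

    static boolean isAnagram(String a, String b) {
        return a.length() == b.length() && product(a).equals(product(b));
    }
}
```

With a fixed-width type such as long, the product can wrap around after roughly a dozen letters, so two different strings could collide; that is the trade-off BigInteger avoids here.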
I have a program that shows you whether two words are anagrams of one another. There are a few examples that will not work properly and I would appreciate any help, although if it were not advanced that would be great, as I am a 1st year programmer. "schoolmaster" and "theclassroom" are anagrams of one another, however when I change "theclassroom" to "theclafsroom" it still says they are anagrams, what am I doing wrong?
import java.util.ArrayList;
public class AnagramCheck {
public static void main(String args[]) {
String phrase1 = "tbeclassroom";
phrase1 = (phrase1.toLowerCase()).trim();
char[] phrase1Arr = phrase1.toCharArray();
String phrase2 = "schoolmaster";
phrase2 = (phrase2.toLowerCase()).trim();
ArrayList<Character> phrase2ArrList = convertStringToArraylist(phrase2);
if (phrase1.length() != phrase2.length()) {
System.out.print("There is no anagram present.");
} else {
boolean isFound = true;
for (int i = 0; i < phrase1Arr.length; i++) {
for (int j = 0; j < phrase2ArrList.size(); j++) {
if (phrase1Arr[i] == phrase2ArrList.get(j)) {
System.out.print("There is a common element.\n");
isFound = true;
phrase2ArrList.remove(j);
}
}
if (isFound == false) {
System.out.print("There are no anagrams present.");
return;
}
}
System.out.printf("%s is an anagram of %s", phrase1, phrase2);
}
}
public static ArrayList<Character> convertStringToArraylist(String str) {
ArrayList<Character> charList = new ArrayList<Character>();
for (int i = 0; i < str.length(); i++) {
charList.add(str.charAt(i));
}
return charList;
}
}
Two words are anagrams of each other if they contain the same number of characters and the same characters. You should only need to sort the characters in lexicographic order, and determine if all the characters in one string are equal to and in the same order as all of the characters in the other string.
Here's a code example. Look into Arrays in the API to understand what's going on here.
public boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.replaceAll("[\\s]", "").toCharArray();
char[] word2 = secondWord.replaceAll("[\\s]", "").toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
Fastest algorithm would be to map each of the 26 English characters to a unique prime number. Then calculate the product of the string. By the fundamental theorem of arithmetic, 2 strings are anagrams if and only if their products are the same.
If you sort either array, the solution becomes O(n log n), but if you use a hash map, it's O(n). Tested and working.
char[] word1 = "test".toCharArray();
char[] word2 = "tes".toCharArray();
Map<Character, Integer> lettersInWord1 = new HashMap<Character, Integer>();
for (char c : word1) {
int count = 1;
if (lettersInWord1.containsKey(c)) {
count = lettersInWord1.get(c) + 1;
}
lettersInWord1.put(c, count);
}
for (char c : word2) {
int count = -1;
if (lettersInWord1.containsKey(c)) {
count = lettersInWord1.get(c) - 1;
}
lettersInWord1.put(c, count);
}
for (char c : lettersInWord1.keySet()) {
if (lettersInWord1.get(c) != 0) {
return false;
}
}
return true;
Here's a simple fast O(n) solution without using sorting or multiple loops or hash maps. We increment the count of each character in the first array and decrement the count of each character in the second array. If the resulting counts array is full of zeros, the strings are anagrams. Can be expanded to include other characters by increasing the size of the counts array.
class AnagramsFaster{
private static boolean compare(String a, String b){
char[] aArr = a.toLowerCase().toCharArray(), bArr = b.toLowerCase().toCharArray();
if (aArr.length != bArr.length)
return false;
int[] counts = new int[26]; // An array to hold the number of occurrences of each character
for (int i = 0; i < aArr.length; i++){
counts[aArr[i]-97]++; // Increment the count of the character at i
counts[bArr[i]-97]--; // Decrement the count of the character at i
}
// If the strings are anagrams, the counts array will be full of zeros
for (int i = 0; i<26; i++)
if (counts[i] != 0)
return false;
return true;
}
public static void main(String[] args){
System.out.println(compare(args[0], args[1]));
}
}
Lots of people have presented solutions, but I just want to talk about the algorithmic complexity of some of the common approaches:
The simple "sort the characters using Arrays.sort()" approach is going to be O(N log N).
If you use radix sorting, that reduces to O(N) with O(M) space, where M is the number of distinct characters in the alphabet. (That is 26 in English ... but in theory we ought to consider multi-lingual anagrams.)
The "count the characters" using an array of counts is also O(N) ... and faster than radix sort because you don't need to reconstruct the sorted string. Space usage will be O(M).
A "count the characters" using a dictionary, hashmap, treemap, or equivalent will be slower than the array approach, unless the alphabet is huge.
The elegant "product-of-primes" approach is unfortunately O(N^2) in the worst case. This is because for long-enough words or phrases, the product of the primes won't fit into a long. That means that you'd need to use BigInteger, and N times multiplying a BigInteger by a small constant is O(N^2).
For a hypothetical large alphabet, the scaling factor is going to be large. The worst-case space usage to hold the product of the primes as a BigInteger is (I think) O(N*logM).
A hashcode based approach is usually O(N) if the words are not anagrams. If the hashcodes are equal, then you still need to do a proper anagram test. So this is not a complete solution.
I would also like to highlight that most of the posted answers assume that each code-point in the input string is represented as a single char value. This is not a valid assumption for code-points outside of the BMP (plane 0); e.g. if an input string contains emojis.
The solutions that make the invalid assumption will probably work most of the time anyway. A code-point outside of the BMP will be represented in the string as two char values: a high surrogate followed by a low surrogate. If the strings contain only one such code-point, we can get away with treating the char values as if they were code-points. However, we can get into trouble when the strings being tested contain 2 or more such code-points. Then the faulty algorithms will fail to distinguish some cases. For example, [SH1, SL1, SH2, SL2] versus [SH1, SL2, SH2, SL1], where SH<n> and SL<n> denote high and low surrogates respectively. The net result will be false anagrams.
Alex Salauyou's answer gives a couple of solutions that will work for all valid Unicode code-points.
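To make the surrogate problem concrete, here is a small sketch. The four code points (U+1F600, U+10400, U+1F400, U+10600) were picked for this illustration only because their surrogate pairs are D83D DE00, D801 DC00, D83D DC00 and D801 DE00, i.e. exactly the [SH1, SL1, SH2, SL2] versus [SH1, SL2, SH2, SL1] pattern described above. The class and method names are hypothetical.

```java
import java.util.Arrays;

class SurrogateAnagramDemo {
    // Char-based check: sorts UTF-16 code units, so surrogate halves get mixed up.
    static boolean charBased(String a, String b) {
        char[] x = a.toCharArray();
        char[] y = b.toCharArray();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    // Code-point-based check: sorts whole code points.
    static boolean codePointBased(String a, String b) {
        return Arrays.equals(a.codePoints().sorted().toArray(),
                             b.codePoints().sorted().toArray());
    }

    // Builds a string from the given code points.
    static String of(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String first = of(0x1F600, 0x10400);   // chars: D83D DE00 D801 DC00
        String second = of(0x1F400, 0x10600);  // chars: D83D DC00 D801 DE00
        System.out.println(charBased(first, second));      // true (a false anagram)
        System.out.println(codePointBased(first, second)); // false
    }
}
```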
O(n) solution without any kind of sorting and using only one map.
public boolean isAnagram(String leftString, String rightString) {
if (leftString == null || rightString == null) {
return false;
} else if (leftString.length() != rightString.length()) {
return false;
}
Map<Character, Integer> occurrencesMap = new HashMap<>();
for(int i = 0; i < leftString.length(); i++){
char charFromLeft = leftString.charAt(i);
int nrOfCharsInLeft = occurrencesMap.containsKey(charFromLeft) ? occurrencesMap.get(charFromLeft) : 0;
occurrencesMap.put(charFromLeft, ++nrOfCharsInLeft);
char charFromRight = rightString.charAt(i);
int nrOfCharsInRight = occurrencesMap.containsKey(charFromRight) ? occurrencesMap.get(charFromRight) : 0;
occurrencesMap.put(charFromRight, --nrOfCharsInRight);
}
for(int occurrencesNr : occurrencesMap.values()){
if(occurrencesNr != 0){
return false;
}
}
return true;
}
And a less generic but slightly faster solution. You have to place your alphabet here:
public boolean isAnagram(String leftString, String rightString) {
if (leftString == null || rightString == null) {
return false;
} else if (leftString.length() != rightString.length()) {
return false;
}
char letters[] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'};
Map<Character, Integer> occurrencesMap = new HashMap<>();
for (char l : letters) {
occurrencesMap.put(l, 0);
}
for(int i = 0; i < leftString.length(); i++){
char charFromLeft = leftString.charAt(i);
Integer nrOfCharsInLeft = occurrencesMap.get(charFromLeft);
occurrencesMap.put(charFromLeft, ++nrOfCharsInLeft);
char charFromRight = rightString.charAt(i);
Integer nrOfCharsInRight = occurrencesMap.get(charFromRight);
occurrencesMap.put(charFromRight, --nrOfCharsInRight);
}
for(Integer occurrencesNr : occurrencesMap.values()){
if(occurrencesNr != 0){
return false;
}
}
return true;
}
We're walking two equal length strings and tracking the differences between them. We don't care what the differences are, we just want to know if they have the same characters or not. We can do this in O(n/2) without any post processing (or a lot of primes).
public class TestAnagram {
public static boolean isAnagram(String first, String second) {
String positive = first.toLowerCase();
String negative = second.toLowerCase();
if (positive.length() != negative.length()) {
return false;
}
int[] counts = new int[26];
int diff = 0;
for (int i = 0; i < positive.length(); i++) {
int pos = (int) positive.charAt(i) - 97; // convert the char into an array index
if (counts[pos] >= 0) { // the other string doesn't have this
diff++; // an increase in differences
} else { // it does have it
diff--; // a decrease in differences
}
counts[pos]++; // track it
int neg = (int) negative.charAt(i) - 97;
if (counts[neg] <= 0) { // the other string doesn't have this
diff++; // an increase in differences
} else { // it does have it
diff--; // a decrease in differences
}
counts[neg]--; // track it
}
return diff == 0;
}
public static void main(String[] args) {
System.out.println(isAnagram("zMarry", "zArmry")); // true
System.out.println(isAnagram("basiparachromatin", "marsipobranchiata")); // true
System.out.println(isAnagram("hydroxydeoxycorticosterones", "hydroxydesoxycorticosterone")); // true
System.out.println(isAnagram("hydroxydeoxycorticosterones", "hydroxydesoxycorticosterons")); // false
System.out.println(isAnagram("zArmcy", "zArmry")); // false
}
}
Yes this code is dependent on the ASCII English character set of lowercase characters but it shouldn't be hard to modify to other languages. You can always use a Map[Character, Int] to track the same information, it'll just be slower.
By using more memory (a HashMap of at most N/2 elements), we do not need to sort the strings.
public static boolean areAnagrams(String one, String two) {
if (one.length() == two.length()) {
String s0 = one.toLowerCase();
String s1 = two.toLowerCase();
HashMap<Character, Integer> chars = new HashMap<Character, Integer>(one.length());
Integer count;
for (char c : s0.toCharArray()) {
count = chars.get(c);
count = Integer.valueOf(count != null ? count + 1 : 1);
chars.put(c, count);
}
for (char c : s1.toCharArray()) {
count = chars.get(c);
if (count == null) {
return false;
} else {
count--;
chars.put(c, count);
}
}
for (Integer i : chars.values()) {
if (i != 0) {
return false;
}
}
return true;
} else {
return false;
}
}
This function is actually running in O(N) ... instead of O(NlogN) for the solution that sorts the strings. If I were to assume that you are going to use only alphabetic characters I could only use an array of 26 ints (from a to z without accents or decorations) instead of the hashmap.
If we define that :
N = |one| + |two|
we do one iteration over N (once over one to increment the counters, and once to decrement them over two).
Then to check the totals we iterate over at most N/2 entries.
The other algorithms described have one advantage: they do not use extra memory assuming that Arrays.sort uses inplace versions of QuickSort or merge sort. But since we are talking about anagrams I will assume that we are talking about human languages, thus words should not be long enough to give memory issues.
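The 26-int array variant mentioned above might look roughly like this. It is a sketch under the stated assumption that both inputs contain only the unaccented letters a-z (in either case); the class name and the choice to throw on any other character are assumptions of this example.

```java
import java.util.Locale;

class ArrayCountAnagram {
    // Assumes both inputs consist only of the letters a-z / A-Z.
    static boolean areAnagrams(String one, String two) {
        String s0 = one.toLowerCase(Locale.ROOT);
        String s1 = two.toLowerCase(Locale.ROOT);
        if (s0.length() != s1.length()) {
            return false;
        }
        int[] counts = new int[26]; // one counter per letter, replaces the HashMap
        for (int i = 0; i < s0.length(); i++) {
            counts[index(s0.charAt(i))]++; // letter seen in the first word
            counts[index(s1.charAt(i))]--; // letter seen in the second word
        }
        for (int count : counts) {
            if (count != 0) {
                return false;
            }
        }
        return true;
    }

    // Maps 'a'..'z' to 0..25 and rejects everything else.
    private static int index(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("unsupported character: " + c);
        }
        return c - 'a';
    }
}
```

The final check always touches 26 counters regardless of the input, so the extra memory is constant.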
package Algorithms;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import javax.swing.JOptionPane;
/**
*
* @author Mokhtar
*/
public class Anagrams {
//Write a program to check if two words are anagrams
public static void main(String[] args) {
Anagrams an=new Anagrams();
ArrayList<String> l=new ArrayList<String>();
String result=JOptionPane.showInputDialog("How many words to test anagrams");
if(Integer.parseInt(result) >1)
{
for(int i=0;i<Integer.parseInt(result);i++)
{
String word=JOptionPane.showInputDialog("Enter word #"+i);
l.add(word);
}
System.out.println(an.isanagrams(l));
}
else
{
JOptionPane.showMessageDialog(null, "Can not be tested, \nYou can test two words or more");
}
}
private static String sortString( String w )
{
char[] ch = w.toCharArray();
Arrays.sort(ch);
return new String(ch);
}
public boolean isanagrams(ArrayList<String> l)
{
boolean isanagrams=true;
ArrayList<String> anagrams = null;
HashMap<String, ArrayList<String>> map = new HashMap<String, ArrayList<String>>();
for(int i=0;i<l.size();i++)
{
String word = l.get(i);
String sortedWord = sortString(word);
anagrams = map.get( sortedWord );
if( anagrams == null ) anagrams = new ArrayList<String>();
anagrams.add(word);
map.put(sortedWord, anagrams);
}
for(int h=0;h<l.size();h++)
{
if(!anagrams.contains(l.get(h)))
{
isanagrams=false;
break;
}
}
return isanagrams;
//}
}
}
I am a C++ developer and the code below is in C++. I believe the fastest and easiest way to go about it would be the following:
Create a vector of ints of size 26, with all slots initialized to 0, and place each character of the string into the appropriate position in the vector. Remember, the vector is in alphabetical order and so if the first letter in the string is z, it would go in myvector[25]. Note: This can be done using ASCII characters so essentially your code will look something like this:
string s = "zadg";
for(int i =0; i < s.size(); ++i){
myvector[s[i] - 'a'] = myvector[s[i] - 'a'] + 1;
}
So inserting all the elements would take O(n) time as you would only traverse the list once. You can now do the exact same thing for the second string and that too would take O(n) time. You can then compare the two vectors by checking to see if the counters in each slot are the same. If they are, that means you had the same number of EACH character in both the strings and thus they are anagrams. The comparing of the two vectors should also take O(n) time as you are only traversing through it once.
Note: The code only works for a single word of characters. If you have spaces, and numbers and symbols, you can just create a vector of size 96 (ASCII characters 32-127) and instead of saying - 'a' you would say - ' ' as the space character is the first one in the ASCII list of characters.
I hope that helps. If i have made a mistake somewhere, please leave a comment.
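For readers following the thread in Java, the printable-ASCII variant from the note above (96 slots, offset by the space character) could be sketched like this. The class name, the constants and the decision to throw on characters outside 32-127 are assumptions of this example.

```java
import java.util.Arrays;

class AsciiCountAnagram {
    private static final int FIRST = ' '; // ASCII 32, the first slot
    private static final int SIZE = 96;   // covers ASCII 32..127

    // Counts how often each character in the 32..127 range occurs.
    static int[] counts(String s) {
        int[] result = new int[SIZE];
        for (int i = 0; i < s.length(); i++) {
            int index = s.charAt(i) - FIRST;
            if (index < 0 || index >= SIZE) {
                throw new IllegalArgumentException("character outside ASCII 32..127");
            }
            result[index]++;
        }
        return result;
    }

    // Two strings are anagrams when every slot holds the same count.
    static boolean isAnagram(String a, String b) {
        return Arrays.equals(counts(a), counts(b));
    }
}
```

Comparing the two count arrays always takes 96 steps, independent of the string lengths.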
So far all proposed solutions work with separate char items, not code points. I'd like to propose two solutions to properly handle surrogate pairs as well (those are characters from U+10000 to U+10FFFF, composed of two char items).
1) One-line O(n logn) solution which utilizes Java 8 CharSequence.codePoints() stream:
static boolean areAnagrams(CharSequence a, CharSequence b) {
return Arrays.equals(a.codePoints().sorted().toArray(),
b.codePoints().sorted().toArray());
}
2) Less elegant O(n) solution (in fact, it will be faster only for long strings with low chances to be anagrams):
static boolean areAnagrams(CharSequence a, CharSequence b) {
int len = a.length();
if (len != b.length())
return false;
// collect codepoint occurrences in "a"
Map<Integer, Integer> ocr = new HashMap<>(64);
a.codePoints().forEach(c -> ocr.merge(c, 1, Integer::sum));
// for each codepoint in "b", look for matching occurrence
for (int i = 0, c = 0; i < len; i += Character.charCount(c)) {
int cc = ocr.getOrDefault((c = Character.codePointAt(b, i)), 0);
if (cc == 0)
return false;
ocr.put(c, cc - 1);
}
return true;
}
Thanks for pointing out to make comment, while making comment I found that there was incorrect logic. I corrected the logic and added comment for each piece of code.
// Time complexity: O(N) where N is number of character in String
// Required space :constant space.
// will work for string that contains ASCII chars
private static boolean isAnagram(String s1, String s2) {
// if length of both string's are not equal then they are not anagram of each other
if(s1.length() != s2.length())return false;
// array to store the presence of a character with number of occurrences.
int []seen = new int[256];
// initialize the array with zero. Do not need to initialize specifically since by default element will initialized by 0.
// Added this is just increase the readability of the code.
Arrays.fill(seen, 0);
// convert each string to lower case if you want to make ABC and aBC as anagram, other wise no need to change the case.
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
// iterate through the first string and count the occurrences of each character
for(int i =0; i < s1.length(); i++){
seen[s1.charAt(i)] = seen[s1.charAt(i)] +1;
}
// iterate through second string and if any char has 0 occurrence then return false, it mean some char in s2 is there that is not present in s1.
// other wise reduce the occurrences by one every time .
for(int i =0; i < s2.length(); i++){
if(seen[s2.charAt(i)] ==0)return false;
seen[s2.charAt(i)] = seen[s2.charAt(i)]-1;
}
// now if both string have same occurrence of each character then the seen array must contains all element as zero. if any one has non zero element return false mean there are
// some character that either does not appear in one of the string or/and mismatch in occurrences
for(int i = 0; i < 256; i++){
if(seen[i] != 0)return false;
}
return true;
}
IMHO, the most efficient solution was provided by @Siguza. I have extended it to cover strings with spaces, e.g. "William Shakespeare", "I am a weakish speller", "School master", "The classroom"
public int getAnagramScore(String word, String anagram) {
if (word == null || anagram == null) {
throw new NullPointerException("Both, word and anagram, must be non-null");
}
char[] wordArray = word.trim().toLowerCase().toCharArray();
char[] anagramArray = anagram.trim().toLowerCase().toCharArray();
int[] alphabetCountArray = new int[26];
int reference = 'a';
for (int i = 0; i < wordArray.length; i++) {
if (!Character.isWhitespace(wordArray[i])) {
alphabetCountArray[wordArray[i] - reference]++;
}
}
for (int i = 0; i < anagramArray.length; i++) {
if (!Character.isWhitespace(anagramArray[i])) {
alphabetCountArray[anagramArray[i] - reference]--;
}
}
for (int i = 0; i < 26; i++)
if (alphabetCountArray[i] != 0)
return 0;
return word.length();
}
// When this method returns 0 means strings are Anagram, else Not.
public static int isAnagram(String str1, String str2) {
int value = 0;
if (str1.length() == str2.length()) {
for (int i = 0; i < str1.length(); i++) {
value = value + str1.charAt(i);
value = value - str2.charAt(i);
}
} else {
value = -1;
}
return value;
}
Many complicated answers here. Based on the accepted answer and the comment mentioning the 'ac'-'bb' issue assuming A=65 B=66 C=67, we could simply use the square of each integer that represents a char and solve the problem:
public boolean anagram(String s, String t) {
if(s.length() != t.length())
return false;
int value = 0;
for(int i = 0; i < s.length(); i++){
value += s.charAt(i) * s.charAt(i); // square the char code; note that ^ is XOR in Java, not a power operator
value -= t.charAt(i) * t.charAt(i);
}
return value == 0;
}
A similar answer may have been posted in C++; here it is again in Java. Note that the most elegant way would be to use a Trie to store the characters in sorted order; however, that's a more complex solution. One way is to use a hashset to store all the words we are comparing and then compare them one by one. To compare them, make an array of characters with the index representing the ASCII value of the characters (using a normalizer, since e.g. the ASCII value of 'a' is 97) and the value representing the occurrence count of that character. This will run in O(n) time and use O(m*z) space where m is the size of the currentWord and z the size for the storedWord, both for which we create a Char[].
// Assumed constants (not defined in the original snippet): lowercase 'a'..'z' only
private static final int totalAlphabets = 26;
private static final int indexNormalizer = 'a'; // 97
public static boolean makeAnagram(String currentWord, String storedWord){
if(currentWord.length() != storedWord.length()) return false;//words must be same length
// int[] is zero-initialized; an Integer[] would hold nulls and fail on += 1
int[] currentWordChars = new int[totalAlphabets];
int[] storedWordChars = new int[totalAlphabets];
//create a temp Arrays to compare the words
storeWordCharacterInArray(currentWordChars, currentWord);
storeWordCharacterInArray(storedWordChars, storedWord);
for(int i = 0; i < totalAlphabets; i++){
//compare the new word to the current charList to see if anagram is possible
if(currentWordChars[i] != storedWordChars[i]) return false;
}
return true;//and store this word in the HashSet of word in the Heap
}
//for each word store its characters
public static void storeWordCharacterInArray(int[] characterList, String word){
char[] charCheck = word.toCharArray();
for(char c: charCheck){
int index = c - indexNormalizer;
characterList[index] += 1;
}
}
How a mathematician might think about the problem before writing any code:
The relation "are anagrams" between strings is an equivalence relation, so partitions the set of all strings into equivalence classes.
Suppose we had a rule to choose a representative (crib) from each class, then it's easy to test whether two classes are the same by comparing their representatives.
An obvious representative for a set of strings is "the smallest element by lexicographic order", which is easy to compute from any element by sorting. For example, the representative of the anagram class containing 'hat' is 'aht'.
In your example "schoolmaster" and "theclassroom" are anagrams because they are both in the anagram class with crib "acehlmoorsst".
In pseudocode:
>>> def crib(word):
... return sorted(word)
...
>>> crib("schoolmaster") == crib("theclassroom")
True
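The same crib idea can be sketched in Java. The names Crib, crib and classes are made up for this illustration; classes groups a list of words into their anagram classes, keyed by crib.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Crib {
    // The representative of a word's anagram class: its characters in sorted order.
    static String crib(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Two words are anagrams exactly when their cribs are equal.
    static boolean areAnagrams(String a, String b) {
        return crib(a).equals(crib(b));
    }

    // Partitions the words into anagram classes, keyed by crib.
    static Map<String, List<String>> classes(List<String> words) {
        Map<String, List<String>> byCrib = new TreeMap<>();
        for (String word : words) {
            byCrib.computeIfAbsent(crib(word), k -> new ArrayList<>()).add(word);
        }
        return byCrib;
    }

    public static void main(String[] args) {
        System.out.println(crib("schoolmaster"));                        // acehlmoorsst
        System.out.println(areAnagrams("schoolmaster", "theclassroom")); // true
        System.out.println(classes(Arrays.asList("hat", "tab", "bat"))); // {abt=[tab, bat], aht=[hat]}
    }
}
```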
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
/**
* Check if Anagram by Prime Number Logic
* @author Pallav
*
*/
public class Anagram {
public static void main(String args[]) {
System.out.println(isAnagram(args[0].toUpperCase(),
args[1].toUpperCase()));
}
/**
*
* @param word : The String 1
* @param anagram_word : The String 2 with which Anagram to be verified
* @return true or false based on Anagram
*/
public static Boolean isAnagram(String word, String anagram_word) {
//If length is different return false
if (word.length() != anagram_word.length()) {
return false;
}
char[] words_char = word.toCharArray();//Get the Char Array of First String
char[] anagram_word_char = anagram_word.toCharArray();//Get the Char Array of Second String
int words_char_num = 1;//Initialize Multiplication Factor to 1
int anagram_word_num = 1;//Initialize Multiplication Factor to 1 for String 2
Map<Character, Integer> wordPrimeMap = wordPrimeMap();//Get the Prime numbers Mapped to each alphabets in English
for (int i = 0; i < words_char.length; i++) {
words_char_num *= wordPrimeMap.get(words_char[i]);//get Multiplication value for String 1
}
for (int i = 0; i < anagram_word_char.length; i++) {
anagram_word_num *= wordPrimeMap.get(anagram_word_char[i]);//get Multiplication value for String 2
}
return anagram_word_num == words_char_num;
}
/**
* Get the Prime numbers Mapped to each alphabets in English
* @return
*/
public static Map<Character, Integer> wordPrimeMap() {
List<Integer> primes = primes(26);
int k = 65;
Map<Character, Integer> map = new TreeMap<Character, Integer>();
for (int i = 0; i < primes.size(); i++) {
Character character = (char) k;
map.put(character, primes.get(i));
k++;
}
// System.out.println(map);
return map;
}
/**
* get first N prime Numbers where Number is greater than 2
* @param N : Number of Prime Numbers
* @return
*/
public static List<Integer> primes(Integer N) {
List<Integer> primes = new ArrayList<Integer>();
primes.add(2);
primes.add(3);
int n = 5;
int k = 0;
do {
boolean is_prime = true;
for (int i = 2; i <= Math.sqrt(n); i++) {
if (n % i == 0) {
is_prime = false;
break;
}
}
if (is_prime == true) {
primes.add(n);
}
n++;
// System.out.println(k);
} while (primes.size() < N);
// }
return primes;
}
}
Here is my solution. First explode the strings into char arrays, then sort them, and then compare whether they are equal. For strings of lengths a and b, the sorting step dominates, so the time complexity is O(a log a + b log b) rather than O(a+b).
public boolean isAnagram(String s1, String s2) {
StringBuilder sb1 = new StringBuilder();
StringBuilder sb2 = new StringBuilder();
if (s1.length() != s2.length())
return false;
char arr1[] = s1.toCharArray();
char arr2[] = s2.toCharArray();
Arrays.sort(arr1);
Arrays.sort(arr2);
for (char c : arr1) {
sb1.append(c);
}
for (char c : arr2) {
sb2.append(c);
}
System.out.println(sb1.toString());
System.out.println(sb2.toString());
if (sb1.toString().equals(sb2.toString()))
return true;
else
return false;
}
There are 3 solutions I can think of:
Using sorting
# O(NlogN) + O(MlogM) time, O(N+M) space (sorted() builds new lists)
def solve_by_sort(word1, word2):
return sorted(word1) == sorted(word2)
Using letter frequency count
# O(N+M) time, O(N+M) space
def solve_by_letter_frequency(word1, word2):
from collections import Counter
return Counter(word1) == Counter(word2)
Using the concept of prime factorization. (assign primes to each letter)
import operator
from functools import reduce
# O(N) time, O(1) space - prime factorization
def solve_by_prime_number_hash(word1, word2):
return get_prime_number_hash(word1) == get_prime_number_hash(word2)
def get_prime_number_hash(word):
letter_code = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13, 'g': 17, 'h': 19, 'i': 23, 'j': 29, 'k': 31,'l': 37, 'm': 41, 'n': 43,'o': 47, 'p': 53, 'q': 59, 'r': 61, 's': 67, 't': 71, 'u': 73, 'v': 79, 'w': 83, 'x': 89, 'y': 97,'z': 101}
return 0 if not word else reduce(operator.mul, [letter_code[letter] for letter in word])
I have put more detailed analysis of these in my medium story.
Sorting approach is not the best one. It takes O(n) space and O(nlogn) time. Instead, make a hash map of characters and count them (increment characters that appear in the first string and decrement characters that appear in the second string). When some count reaches zero, remove it from hash. Finally, if two strings are anagrams, then the hash table will be empty in the end - otherwise it will not be empty.
Couple of important notes: (1) Ignore letter case and (2) Ignore white space.
Here is the detailed analysis and implementation in C#: Testing If Two Strings are Anagrams
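Since the linked implementation is in C#, here is a rough Java sketch of the same idea: increment for the first string, decrement for the second, remove an entry as soon as its count returns to zero, and ignore case and white space. The class and method names are hypothetical, and the sketch works on char values, so it does not handle surrogate pairs.

```java
import java.util.HashMap;
import java.util.Map;

class RemovingCountAnagram {
    static boolean areAnagrams(String first, String second) {
        Map<Character, Integer> counts = new HashMap<>();
        adjust(counts, first, 1);   // increment for the first string
        adjust(counts, second, -1); // decrement for the second string
        return counts.isEmpty();    // every count cancelled out
    }

    // Adds delta to the count of every non-whitespace character, ignoring case.
    private static void adjust(Map<Character, Integer> counts, String s, int delta) {
        for (char c : s.toLowerCase().toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue;
            }
            // Map.merge removes the entry when the remapping function returns null.
            counts.merge(c, delta, (old, d) -> {
                int sum = old + d;
                return sum == 0 ? null : Integer.valueOf(sum);
            });
        }
    }
}
```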
Some other solution without sorting.
public static boolean isAnagram(String s1, String s2){
//case insensitive anagram
StringBuffer sb = new StringBuffer(s2.toLowerCase());
for (char c: s1.toLowerCase().toCharArray()){
if (Character.isLetter(c)){
int index = sb.indexOf(String.valueOf(c));
if (index == -1){
//char does not exist in other s2
return false;
}
sb.deleteCharAt(index);
}
}
for (char c: sb.toString().toCharArray()){
//only allow whitespace as left overs
if (!Character.isWhitespace(c)){
return false;
}
}
return true;
}
A simple method to figure out whether the testString is an anagram of the baseString.
private static boolean isAnagram(String baseString, String testString){
//Assume that there are no empty spaces in either string.
if(baseString.length() != testString.length()){
System.out.println("The 2 given words cannot be anagram since their lengths are different");
return false;
}
else{
if(baseString.length() == testString.length()){
if(baseString.equalsIgnoreCase(testString)){
System.out.println("The 2 given words are anagram since they are identical.");
return true;
}
else{
List<Character> list = new ArrayList<>();
for(Character ch : baseString.toLowerCase().toCharArray()){
list.add(ch);
}
System.out.println("List is : "+ list);
for(Character ch : testString.toLowerCase().toCharArray()){
if(list.contains(ch)){
list.remove(ch);
}
}
if(list.isEmpty()){
System.out.println("The 2 words are anagrams");
return true;
}
}
}
}
return false;
}
Sorry, the solution is in C#, but I think the different elements used to arrive at the solution is quite intuitive. Slight tweak required for hyphenated words but for normal words it should work fine.
internal bool isAnagram(string input1,string input2)
{
Dictionary<char, int> outChars = AddToDict(input2.ToLower().Replace(" ", ""));
input1 = input1.ToLower().Replace(" ","");
foreach(char c in input1)
{
if (outChars.ContainsKey(c))
{
if (outChars[c] > 1)
outChars[c] -= 1;
else
outChars.Remove(c);
}
}
return outChars.Count == 0;
}
private Dictionary<char, int> AddToDict(string input)
{
Dictionary<char, int> inputChars = new Dictionary<char, int>();
foreach(char c in input)
{
if(inputChars.ContainsKey(c))
{
inputChars[c] += 1;
}
else
{
inputChars.Add(c, 1);
}
}
return inputChars;
}
I saw that no one has used the "hashcode" approach to find out the anagrams. I found my approach little different than the approaches discussed above hence thought of sharing it. I wrote the below code to find the anagrams which works in O(n).
/**
* This class performs the logic of finding anagrams
* @author ripudam
*
*/
public class AnagramTest {
public static boolean isAnagram(final String word1, final String word2) {
if (word1 == null || word2 == null || word1.length() != word2.length()) {
return false;
}
if (word1.equals(word2)) {
return true;
}
final AnagramWrapper word1Obj = new AnagramWrapper(word1);
final AnagramWrapper word2Obj = new AnagramWrapper(word2);
if (word1Obj.equals(word2Obj)) {
return true;
}
return false;
}
/*
* Inner class to wrap the string received for anagram check to find the
* hash
*/
static class AnagramWrapper {
String word;
public AnagramWrapper(final String word) {
this.word = word;
}
@Override
public boolean equals(final Object obj) {
return hashCode() == obj.hashCode();
}
@Override
public int hashCode() {
final char[] array = word.toCharArray();
int hashcode = 0;
for (final char c : array) {
hashcode = hashcode + (c * c);
}
return hashcode;
}
}
}
Here is another approach using HashMap in Java
public static boolean isAnagram(String first, String second) {
if (first == null || second == null) {
return false;
}
if (first.length() != second.length()) {
return false;
}
return doCheckAnagramUsingHashMap(first.toLowerCase(), second.toLowerCase());
}
private static boolean doCheckAnagramUsingHashMap(final String first, final String second) {
Map<Character, Integer> counter = populateMap(first, second);
return validateMap(counter);
}
private static boolean validateMap(Map<Character, Integer> counter) {
for (int val : counter.values()) {
if (val != 0) {
return false;
}
}
return true;
}
Here is the test case
@Test
public void anagramTest() {
assertTrue(StringUtil.isAnagram("keep" , "PeeK"));
assertFalse(StringUtil.isAnagram("Hello", "hell"));
assertTrue(StringUtil.isAnagram("SiLeNt caT", "LisTen cat"));
}
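The populateMap helper called in doCheckAnagramUsingHashMap is not shown in the answer. One plausible shape, offered purely as an assumption, is a single counter map that is incremented for each character of the first string and decremented for each character of the second, so that validateMap sees only zeros for anagrams:

```java
import java.util.HashMap;
import java.util.Map;

class PopulateMapSketch {
    // Hypothetical reconstruction of the missing helper, not the author's code.
    static Map<Character, Integer> populateMap(String first, String second) {
        Map<Character, Integer> counter = new HashMap<>();
        for (char c : first.toCharArray()) {
            counter.merge(c, 1, Integer::sum);  // +1 for each character of the first string
        }
        for (char c : second.toCharArray()) {
            counter.merge(c, -1, Integer::sum); // -1 for each character of the second string
        }
        return counter;
    }
}
```

With this shape, the lowercased inputs from the test case ("keep" and "peek") leave every counter at zero.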
private static boolean checkAnagram(String s1, String s2) {
if (s1 == null || s2 == null) {
return false;
} else if (s1.length() != s2.length()) {
return false;
}
char[] a1 = s1.toCharArray();
char[] a2 = s2.toCharArray();
int length = s2.length();
int s1Count = 0;
int s2Count = 0;
for (int i = 0; i < length; i++) {
s1Count+=a1[i];
s2Count+=a2[i];
}
return s2Count == s1Count;
}
The simplest solution with complexity O(N) is using Map.
public static Boolean checkAnagram(String string1, String string2) {
Boolean anagram = true;
Map<Character, Integer> map1 = new HashMap<>();
Map<Character, Integer> map2 = new HashMap<>();
char[] chars1 = string1.toCharArray();
char[] chars2 = string2.toCharArray();
for(int i=0; i<chars1.length; i++) {
if(map1.get(chars1[i]) == null) {
map1.put(chars1[i], 1);
} else {
map1.put(chars1[i], map1.get(chars1[i])+1);
}
if(map2.get(chars2[i]) == null) {
map2.put(chars2[i], 1);
} else {
map2.put(chars2[i], map2.get(chars2[i])+1);
}
}
Set<Map.Entry<Character, Integer>> entrySet1 = map1.entrySet();
Set<Map.Entry<Character, Integer>> entrySet2 = map2.entrySet();
for(Map.Entry<Character, Integer> entry:entrySet1) {
if(!entry.getValue().equals(map2.get(entry.getKey()))) { // compare Integer values, not references
anagram = false;
break;
}
}
return anagram;
}
Let's take a question: Given two strings s and t, write a function to determine if t is an anagram of s.
For example,
s = "anagram", t = "nagaram", return true.
s = "rat", t = "car", return false.
Method 1(Using HashMap ):
public class Method1 {
public static void main(String[] args) {
String a = "protijayi";
String b = "jayiproti";
System.out.println(isAnagram(a, b ));// output => true
}
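private static boolean isAnagram(String a, String b) {
// Without this check, "ab" and "a" would be reported as anagrams.
if (a.length() != b.length()) { return false; }
Map<Character ,Integer> map = new HashMap<>();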
for( char c : a.toCharArray()) {
map.put(c, map.getOrDefault(c, 0 ) + 1 );
}
for(char c : b.toCharArray()) {
int count = map.getOrDefault(c, 0);
if(count == 0 ) {return false ; }
else {map.put(c, count - 1 ) ; }
}
return true;
}
}
Method 2 :
public class Method2 {
public static void main(String[] args) {
String a = "protijayi";
String b = "jayiproti";
System.out.println(isAnagram(a, b));// output=> true
}
private static boolean isAnagram(String a, String b) {
int[] alphabet = new int[26];
for(int i = 0 ; i < a.length() ;i++) {
alphabet[a.charAt(i) - 'a']++ ;
}
for (int i = 0; i < b.length(); i++) {
alphabet[b.charAt(i) - 'a']-- ;
}
for( int w : alphabet ) {
if(w != 0 ) {return false;}
}
return true;
}
}
Method 3 :
public class Method3 {
public static void main(String[] args) {
String a = "protijayi";
String b = "jayiproti";
System.out.println(isAnagram(a, b ));// output => true
}
private static boolean isAnagram(String a, String b) {
char[] ca = a.toCharArray() ;
char[] cb = b.toCharArray();
Arrays.sort( ca );
Arrays.sort( cb );
return Arrays.equals(ca , cb );
}
}
Method 4 :
public class AnagramsOrNot {
public static void main(String[] args) {
String a = "Protijayi";
String b = "jayiProti";
isAnagram(a, b);
}
private static void isAnagram(String a, String b) {
Map<Integer, Integer> map = new LinkedHashMap<>();
a.codePoints().forEach(code -> map.put(code, map.getOrDefault(code, 0) + 1));
System.out.println(map);
b.codePoints().forEach(code -> map.put(code, map.getOrDefault(code, 0) - 1));
System.out.println(map);
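// Every count must be back at zero; contains(0) would only check that at least one is.
if (map.values().stream().allMatch(count -> count == 0)) {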
System.out.println("Anagrams");
} else {
System.out.println("Not Anagrams");
}
}
}
In Python:
def areAnagram(a, b):
    if len(a) != len(b): return False
    count1 = [0] * 256
    count2 = [0] * 256
    for i in a: count1[ord(i)] += 1
    for i in b: count2[ord(i)] += 1
    for i in range(256):
        if count1[i] != count2[i]: return False
    return True

str1 = "Giniiii"
str2 = "Protijayi"
print(areAnagram(str1, str2))
Let's take another famous Interview Question: Group the Anagrams from a given String:
public class GroupAnagrams {
public static void main(String[] args) {
String a = "Gini Gina Protijayi iGin aGin jayiProti Soudipta";
Map<String, List<String>> map = Arrays.stream(a.split(" ")).collect(Collectors.groupingBy(GroupAnagrams::sortedString));
System.out.println("MAP => " + map);
map.forEach((k,v) -> System.out.println(k +" and the anagrams are =>" + v ));
/*
Look at the Map output:
MAP => {Giin=[Gini, iGin], Paiijorty=[Protijayi, jayiProti], Sadioptu=[Soudipta], Gain=[Gina, aGin]}
As we can see, there are multiple Lists. Hence, we have to use a flatMap(List::stream)
Now, Look at the output:
Paiijorty and the anagrams are =>[Protijayi, jayiProti]
Now, look at this output:
Sadioptu and the anagrams are =>[Soudipta]
List contains only word. No anagrams.
That means we have to work with map.values(). List contains all the anagrams.
*/
String stringFromMapHavingListofLists = map.values().stream().flatMap(List::stream).collect(Collectors.joining(" "));
System.out.println(stringFromMapHavingListofLists);
}
public static String sortedString(String a) {
String sortedString = a.chars().sorted()
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append).toString();
return sortedString;
}
/*
* The output : Gini iGin Protijayi jayiProti Soudipta Gina aGin
* All the anagrams are side by side.
*/
}
Grouping anagrams in Python is just as easy.
Sort the letters of each word, then build a dictionary keyed by the sorted word. Each value of the dictionary is the list of indices (into the original list) of the words that are anagrams of each other.
def groupAnagrams(words):
    # sort each word in the list
    A = [''.join(sorted(word)) for word in words]
    dict = {}
    for indexofsamewords, names in enumerate(A):
        dict.setdefault(names, []).append(indexofsamewords)
    print(dict)
    #{'AOOPR': [0, 2, 5, 11, 13], 'ABTU': [1, 3, 4], 'Sorry': [6], 'adnopr': [7], 'Sadioptu': [8, 16], ' KPaaehiklry': [9], 'Taeggllnouy': [10], 'Leov': [12], 'Paiijorty': [14, 18], 'Paaaikpr': [15], 'Saaaabhmryz': [17], ' CNaachlortttu': [19], 'Saaaaborvz': [20]}
    for index in dict.values():
        print([words[i] for i in index])

if __name__ == '__main__':
    # list of words
    words = ["ROOPA","TABU","OOPAR","BUTA","BUAT" , "PAROO","Soudipta",
             "Kheyali Park", "Tollygaunge", "AROOP","Love","AOORP", "Protijayi","Paikpara","dipSouta","Shyambazaar",
             "jayiProti", "North Calcutta", "Sovabazaar"]
    groupAnagrams(words)
The Output :
['ROOPA', 'OOPAR', 'PAROO', 'AROOP', 'AOORP']
['TABU', 'BUTA', 'BUAT']
['Soudipta', 'dipSouta']
['Kheyali Park']
['Tollygaunge']
['Love']
['Protijayi', 'jayiProti']
['Paikpara']
['Shyambazaar']
['North Calcutta']
['Sovabazaar']
Another important anagram question: find the anagram group that occurs the maximum number of times.
In the example, ROOPA is the word whose anagrams occur the maximum number of times.
Hence, ['ROOPA' 'OOPAR' 'PAROO' 'AROOP' 'AOORP'] will be the final output.
from statistics import mode
import numpy as np
# list of words
words = ["ROOPA","TABU","OOPAR","BUTA","BUAT" , "PAROO","Soudipta",
"Kheyali Park", "Tollygaunge", "AROOP","Love","AOORP",
"Protijayi","Paikpara","dipSouta","Shyambazaar",
"jayiProti", "North Calcutta", "Sovabazaar"]
print(".....Method 1....... ")
sortedwords = [''.join(sorted(word)) for word in words]
print(sortedwords)
print("...........")
LongestAnagram = np.array(words)[np.array(sortedwords) == mode(sortedwords)]
# Longest anagram
print("Longest anagram by Method 1:")
print(LongestAnagram)
print(".....................................................")
print(".....Method 2....... ")
A = [''.join(sorted(word)) for word in words]
dict = {}
for indexofsamewords,samewords in enumerate(A):
    dict.setdefault(samewords,[]).append(samewords)
#print(dict)
#{'AOOPR': ['AOOPR', 'AOOPR', 'AOOPR', 'AOOPR', 'AOOPR'], 'ABTU': ['ABTU', 'ABTU', 'ABTU'], 'Sadioptu': ['Sadioptu', 'Sadioptu'], ' KPaaehiklry': [' KPaaehiklry'], 'Taeggllnouy': ['Taeggllnouy'], 'Leov': ['Leov'], 'Paiijorty': ['Paiijorty', 'Paiijorty'], 'Paaaikpr': ['Paaaikpr'], 'Saaaabhmryz': ['Saaaabhmryz'], ' CNaachlortttu': [' CNaachlortttu'], 'Saaaaborvz': ['Saaaaborvz']}
aa = max(dict.items() , key = lambda x : len(x[1]))
print("aa => " , aa)
word, anagrams = aa
print("Longest anagram by Method 2:")
print(" ".join(anagrams))
The Output :
.....Method 1.......
['AOOPR', 'ABTU', 'AOOPR', 'ABTU', 'ABTU', 'AOOPR', 'Sadioptu', ' KPaaehiklry', 'Taeggllnouy', 'AOOPR', 'Leov', 'AOOPR', 'Paiijorty', 'Paaaikpr', 'Sadioptu', 'Saaaabhmryz', 'Paiijorty', ' CNaachlortttu', 'Saaaaborvz']
...........
Longest anagram by Method 1:
['ROOPA' 'OOPAR' 'PAROO' 'AROOP' 'AOORP']
.....................................................
.....Method 2.......
aa => ('AOOPR', ['AOOPR', 'AOOPR', 'AOOPR', 'AOOPR', 'AOOPR'])
Longest anagram by Method 2:
AOOPR AOOPR AOOPR AOOPR AOOPR
This can be a single simple function that mixes functional and imperative styles of code:
static boolean isAnagram(String a, String b) {
String sortedA = "";
Object[] aArr = a.toLowerCase().chars().sorted().mapToObj(i -> (char) i).toArray();
for (Object o: aArr) {
sortedA = sortedA.concat(o.toString());
}
String sortedB = "";
Object[] bArr = b.toLowerCase().chars().sorted().mapToObj(i -> (char) i).toArray();
for (Object o: bArr) {
sortedB = sortedB.concat(o.toString());
}
return sortedA.equals(sortedB);
}
I need to concatenate two strings into another one without repeating their intersection (the overlap between the last words of the first string and the first words of the second).
For example:
"Some little d" + "little dogs are so pretty" = "Some little dogs are so pretty"
"I love you" + "love" = "I love youlove"
What is the most efficient way to do this in Java?
Here we go - if the first doesn't even contain the first letter of the second string, just return the concatenation. Otherwise, go from longest to shortest on the second string, seeing if the first ends with it. If so, return the non-overlapping parts, otherwise try one letter shorter.
public static String docat(String f, String s) {
if (!f.contains(s.substring(0,1)))
return f + s;
int idx = s.length();
try {
while (!f.endsWith(s.substring(0, idx--))) ;
} catch (Exception e) { }
return f + s.substring(idx + 1);
}
docat("Some little d", "little dogs are so pretty");
-> "Some little dogs are so pretty"
docat("Hello World", "World")
-> "Hello World"
docat("Hello", "World")
-> "HelloWorld"
EDIT: In response to the comment, here is a method using arrays. I don't know how to stress test these properly, but none of them took over 1ms in my testing.
public static String docat2(String first, String second) {
if (second.isEmpty())
return first;
char[] f = first.toCharArray();
char[] s = second.toCharArray();
if (!first.contains("" + s[0]))
return first + second;
// Start where the remaining suffix of first is no longer than second,
// so matches() never reads past the end of s.
int idx = Math.max(0, f.length - s.length);
while (idx < f.length && !matches(f, s, idx)) idx++;
return first.substring(0, idx) + second;
}
// True if f[idx..end] equals the beginning of s.
private static boolean matches(char[] f, char[] s, int idx) {
for (int i = idx; i < f.length; i++) {
if (f[i] != s[i - idx])
return false;
}
return true;
}
Easiest: iterate over the first string taking suffixes ("Some little d", "ome little d", "me little d"...) and test the second string with .startsWith. When you find a match, concatenate the prefix of the first string with the second string.
Here's the code:
String overlappingConcat(String a, String b) {
int i;
int l = a.length();
for (i = 0; i < l; i++) {
if (b.startsWith(a.substring(i))) {
return a.substring(0, i) + b;
}
}
return a + b;
}
The biggest efficiency problem here is the creation of new strings at substring. Implementing a custom stringMatchFrom(a, b, aOffset) should improve it, and is trivial.
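A minimal sketch of such a helper, assuming that stringMatchFrom(a, b, aOffset) is meant to answer "is the suffix of a starting at aOffset a prefix of b?". The class name OverlapConcat is made up for illustration; only the method name comes from the sentence above.

```java
final class OverlapConcat {
    // Returns true if a.substring(aOffset) is a prefix of b, without allocating a substring.
    static boolean stringMatchFrom(String a, String b, int aOffset) {
        int suffixLength = a.length() - aOffset;
        if (suffixLength > b.length()) {
            return false; // the suffix is longer than b, so it cannot be a prefix of b
        }
        for (int i = 0; i < suffixLength; i++) {
            if (a.charAt(aOffset + i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Same loop as overlappingConcat above, but using the allocation-free helper.
    static String overlappingConcat(String a, String b) {
        for (int i = 0; i < a.length(); i++) {
            if (stringMatchFrom(a, b, i)) {
                return a.substring(0, i) + b;
            }
        }
        return a + b;
    }
}
```

With the inputs from the question, overlappingConcat("Some little d", "little dogs are so pretty") gives "Some little dogs are so pretty" and overlappingConcat("I love you", "love") gives "I love youlove".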
You can avoid creating unnecessary substrings with the regionMatches() method.
public static String intersecting_concatenate(String a, String b) {
// Concatenate two strings, but if there is overlap at the intersection,
// include the intersection/overlap only once.
// find length of maximum possible match
int len_a = a.length();
int len_b = b.length();
int max_match = (len_a > len_b) ? len_b : len_a;
// search down from maximum match size, to get longest possible intersection
for (int size=max_match; size>0; size--) {
if (a.regionMatches(len_a - size, b, 0, size)) {
return a + b.substring(size, len_b);
}
}
// Didn't find any intersection. Fall back to straight concatenation.
return a + b;
}
isBlank(CharSequence), join(T...) and left(String, int) are methods from Apache Commons.
public static String joinOverlap(String s1, String s2) {
if(isBlank(s1) || isBlank(s2)) { //empty or null input -> normal join
return join(s1, s2);
}
int start = Math.max(0, s1.length() - s2.length());
for(int i = start; i < s1.length(); i++) { //this loop is for start point
for(int j = i; s1.charAt(j) == s2.charAt(j-i); j++) { //iterate until mismatch
if(j == s1.length() - 1) { //was it s1's last char?
return join(left(s1, i), s2);
}
}
}
return join(s1, s2); //no overlapping; do normal join
}
Create a suffix tree of the first String, then traverse the tree from the root taking characters from the beginning of the second String and keeping track of the longest suffix found.
This should be the longest suffix of the first String that is a prefix of the second String. Remove the suffix, then append the second String.
This should all be possible in linear time instead of the quadratic time required to loop through and compare all suffixes.
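A suffix tree is fairly heavy to build by hand. As a swapped-in alternative that reaches the same linear bound, here is a rough sketch based on the Knuth-Morris-Pratt failure table rather than a suffix tree: build the table for the second String, feed the first String through the matcher, and the state left at the end is the overlap length. The class and method names are hypothetical.

```java
final class LinearOverlap {
    // Concatenates a and b, keeping the longest suffix of a that is also a prefix of b only once.
    static String concat(String a, String b) {
        if (b.isEmpty()) {
            return a;
        }
        // fail[i] = length of the longest proper prefix of b[0..i] that is also a suffix of it
        int[] fail = new int[b.length()];
        for (int i = 1, k = 0; i < b.length(); i++) {
            while (k > 0 && b.charAt(i) != b.charAt(k)) {
                k = fail[k - 1];
            }
            if (b.charAt(i) == b.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        // Feed a through the matcher; after the last character, matched is the overlap length.
        int matched = 0;
        for (int i = 0; i < a.length(); i++) {
            if (matched == b.length()) {
                matched = fail[matched - 1]; // b matched completely inside a; fall back
            }
            while (matched > 0 && a.charAt(i) != b.charAt(matched)) {
                matched = fail[matched - 1];
            }
            if (a.charAt(i) == b.charAt(matched)) {
                matched++;
            }
        }
        return a + b.substring(matched);
    }
}
```

Both loops advance through their input once and only ever fall back as far as they previously advanced, so the total work is proportional to the combined length of the two strings.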
The following code seems to work for the first example. I did not test it extensively, but you get the point. It searches for all occurrences of the first char of secondString in firstString, since these are the only possible places where overlap can occur. Then it checks whether the rest of the first string is the start of the second string. If no overlap is found, it falls back to plain concatenation. It is meant more as an illustration of my answer than as finished code.
String firstString = "Some little d";
String secondString = "little dogs are so pretty";
String startChar = secondString.substring( 0, 1 );
int index = Math.max( 0, firstString.length() - secondString.length() );
int length = firstString.length();
int searchedIndex = -1;
while ( searchedIndex == -1 && ( index = firstString.indexOf( startChar, index ) )!= -1 ){
if ( secondString.startsWith( firstString.substring( index, length ) ) ){
searchedIndex = index;
} else {
index++; // no overlap here; continue searching after this occurrence
}
}
// searchedIndex stays -1 when there is no overlap, so keep the whole first string in that case
String result = ( searchedIndex == -1 ? firstString : firstString.substring( 0, searchedIndex ) ) + secondString;
I have two strings that contain version information, as shown below:
str1 = "1.2"
str2 = "1.1.2"
Can anyone tell me an efficient way to compare these version strings in Java and return 0 if they're equal, -1 if str1 < str2, and 1 if str1 > str2?
Requires commons-lang3-3.8.1.jar for string operations.
/**
* Compares two version strings.
*
* Use this instead of String.compareTo() for a non-lexicographical
* comparison that works for version strings. e.g. "1.10".compareTo("1.6").
*
* @param v1 a string of alpha numerals separated by decimal points.
* @param v2 a string of alpha numerals separated by decimal points.
* @return The result is 1 if v1 is greater than v2.
* The result is 2 if v2 is greater than v1.
* The result is -1 if the version format is unrecognized.
* The result is zero if the strings are equal.
*/
public int VersionCompare(String v1,String v2)
{
int v1Len=StringUtils.countMatches(v1,".");
int v2Len=StringUtils.countMatches(v2,".");
if(v1Len!=v2Len)
{
int count=Math.abs(v1Len-v2Len);
if(v1Len>v2Len)
for(int i=1;i<=count;i++)
v2+=".0";
else
for(int i=1;i<=count;i++)
v1+=".0";
}
if(v1.equals(v2))
return 0;
String[] v1Str=StringUtils.split(v1, ".");
String[] v2Str=StringUtils.split(v2, ".");
for(int i=0;i<v1Str.length;i++)
{
String str1="",str2="";
for (char c : v1Str[i].toCharArray()) {
if(Character.isLetter(c))
{
int u=c-'a'+1;
if(u<10)
str1+=String.valueOf("0"+u);
else
str1+=String.valueOf(u);
}
else
str1+=String.valueOf(c);
}
for (char c : v2Str[i].toCharArray()) {
if(Character.isLetter(c))
{
int u=c-'a'+1;
if(u<10)
str2+=String.valueOf("0"+u);
else
str2+=String.valueOf(u);
}
else
str2+=String.valueOf(c);
}
v1Str[i]="1"+str1;
v2Str[i]="1"+str2;
int num1=Integer.parseInt(v1Str[i]);
int num2=Integer.parseInt(v2Str[i]);
if(num1!=num2)
{
if(num1>num2)
return 1;
else
return 2;
}
}
return -1;
}
As others have pointed out, String.split() is a very easy way to do the comparison you want, and Mike Deck makes the excellent point that with such (likely) short strings, it probably won't matter much, but what the hey! If you want to make the comparison without manually parsing the string, and have the option of quitting early, you could try the java.util.Scanner class.
public static int versionCompare(String str1, String str2) {
try ( Scanner s1 = new Scanner(str1);
Scanner s2 = new Scanner(str2);) {
s1.useDelimiter("\\.");
s2.useDelimiter("\\.");
while (s1.hasNextInt() && s2.hasNextInt()) {
int v1 = s1.nextInt();
int v2 = s2.nextInt();
if (v1 < v2) {
return -1;
} else if (v1 > v2) {
return 1;
}
}
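// Check every remaining component, not just the next one ("1.0.1" is newer than "1")
while (s1.hasNextInt()) {
if (s1.nextInt() != 0)
return 1; //str1 has an additional non-zero lower-level version number
}
while (s2.hasNextInt()) {
if (s2.nextInt() != 0)
return -1; //str2 has an additional non-zero lower-level version number
}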
return 0;
} // end of try-with-resources
}
This is almost certainly not the most efficient way to do it, but given that version number strings will almost always be only a few characters long I don't think it's worth optimizing further:
public static int compareVersions(String v1, String v2) {
String[] components1 = v1.split("\\.");
String[] components2 = v2.split("\\.");
int length = Math.min(components1.length, components2.length);
for(int i = 0; i < length; i++) {
int result = Integer.compare(Integer.parseInt(components1[i]), Integer.parseInt(components2[i]));
if(result != 0) {
return result;
}
}
return Integer.compare(components1.length, components2.length);
}
I was looking to do this myself and I see three different approaches to doing this, and so far pretty much everyone is splitting the version strings. I do not see doing that as being efficient, though code size wise it reads well and looks good.
Approaches:
1. Assume an upper limit to the number of sections (ordinals) in a version string as well as a limit to the value represented there. Often 4 dots max, and 999 maximum for any ordinal. You can see where this is going, and it's going towards transforming the version to fit into a string like: "1.0" => "001000000000" with string format or some other way to pad each ordinal. Then do a string compare.
2. Split the strings on the ordinal separator ('.') and iterate over them and compare a parsed version. This is the approach demonstrated well by Alex Gitelman.
3. Compare the ordinals as you parse them out of the version strings in question. If all strings were really just pointers to arrays of characters as in C, then this would be the clear approach (where you'd replace a '.' with a null terminator as it's found and move some 2 or 4 pointers around).
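A rough sketch of the first approach, assuming the limits mentioned there (at most 4 sections, each at most 999). The class name, method names and constants are hypothetical and only serve the illustration.

```java
final class PaddedVersion {
    private static final int SECTIONS = 4; // assumed upper limit on the number of sections
    private static final int WIDTH = 3;    // assumed maximum of 999 per section

    // Turns "1.0" into "001000000000" so that plain string comparison orders versions.
    static String toKey(String version) {
        String[] parts = version.split("\\.");
        if (parts.length > SECTIONS) {
            throw new IllegalArgumentException("too many sections: " + version);
        }
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < SECTIONS; i++) {
            // Missing sections are padded as 0
            int value = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            if (value < 0 || value > 999) {
                throw new IllegalArgumentException("section out of range: " + version);
            }
            key.append(String.format("%0" + WIDTH + "d", value));
        }
        return key.toString();
    }

    // Compares the padded keys lexicographically and normalizes the result to -1, 0 or 1.
    static int compare(String left, String right) {
        return Integer.signum(toKey(left).compareTo(toKey(right)));
    }
}
```

Note that this sketch still splits the string to build the key, and it rejects anything outside the assumed limits instead of guessing.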
Thoughts on the three approaches:
1. A linked blog post showed how to go with approach 1. The limitations are in version string length, number of sections and maximum value of a section. I don't think it's crazy to have a version string in which a section exceeds 10,000 at some point. Additionally, most implementations still end up splitting the string.
2. Splitting the strings in advance is clear to read and think about, but we go through each string about twice to do this. I'd like to compare how it times against the next approach.
3. Comparing the string as you split it gives you the advantage of being able to stop splitting very early in a comparison of "2.1001.100101.9999998" to "1.0.0.0.0.0.1.0.0.0.1". If this were C and not Java, the advantages could go on to limit the amount of memory allocated for new strings for each section of each version, but it is not.
I didn't see anyone giving an example of this third approach, so I'd like to add it here as an answer going for efficiency.
public class VersionHelper {
/**
* Compares one version string to another version string by dotted ordinals.
* e.g. "1.0" > "0.09" ; "0.9.5" < "0.10",
* also "1.0" == "1.0.0" (trailing zero sections are ignored) and "1.0" == "01.00"
*
* @param left the left hand version string
* @param right the right hand version string
* @return 0 if equal, -1 if left < right and 1 otherwise.
*/
public static int compare(@NotNull String left, @NotNull String right) {
if (left.equals(right)) {
return 0;
}
int leftStart = 0, rightStart = 0, result;
do {
int leftEnd = left.indexOf('.', leftStart);
int rightEnd = right.indexOf('.', rightStart);
Integer leftValue = Integer.parseInt(leftEnd < 0
? left.substring(leftStart)
: left.substring(leftStart, leftEnd));
Integer rightValue = Integer.parseInt(rightEnd < 0
? right.substring(rightStart)
: right.substring(rightStart, rightEnd));
result = leftValue.compareTo(rightValue);
leftStart = leftEnd + 1;
rightStart = rightEnd + 1;
} while (result == 0 && leftStart > 0 && rightStart > 0);
if (result == 0) {
if (leftStart > rightStart) {
return containsNonZeroValue(left, leftStart) ? 1 : 0;
}
if (leftStart < rightStart) {
return containsNonZeroValue(right, rightStart) ? -1 : 0;
}
}
return result;
}
private static boolean containsNonZeroValue(String str, int beginIndex) {
for (int i = beginIndex; i < str.length(); i++) {
char c = str.charAt(i);
if (c != '0' && c != '.') {
return true;
}
}
return false;
}
}
Unit test demonstrating expected output.
public class VersionHelperTest {
@Test
public void testCompare() throws Exception {
assertEquals(1, VersionHelper.compare("1", "0.9"));
assertEquals(1, VersionHelper.compare("0.0.0.2", "0.0.0.1"));
assertEquals(1, VersionHelper.compare("1.0", "0.9"));
assertEquals(1, VersionHelper.compare("2.0.1", "2.0.0"));
assertEquals(1, VersionHelper.compare("2.0.1", "2.0"));
assertEquals(1, VersionHelper.compare("2.0.1", "2"));
assertEquals(1, VersionHelper.compare("0.9.1", "0.9.0"));
assertEquals(1, VersionHelper.compare("0.9.2", "0.9.1"));
assertEquals(1, VersionHelper.compare("0.9.11", "0.9.2"));
assertEquals(1, VersionHelper.compare("0.9.12", "0.9.11"));
assertEquals(1, VersionHelper.compare("0.10", "0.9"));
assertEquals(0, VersionHelper.compare("0.10", "0.10"));
assertEquals(-1, VersionHelper.compare("2.10", "2.10.1"));
assertEquals(-1, VersionHelper.compare("0.0.0.2", "0.1"));
assertEquals(1, VersionHelper.compare("1.0", "0.9.2"));
assertEquals(1, VersionHelper.compare("1.10", "1.6"));
assertEquals(0, VersionHelper.compare("1.10", "1.10.0.0.0.0"));
assertEquals(1, VersionHelper.compare("1.10.0.0.0.1", "1.10"));
assertEquals(0, VersionHelper.compare("1.10.0.0.0.0", "1.10"));
assertEquals(1, VersionHelper.compare("1.10.0.0.0.1", "1.10"));
}
}
Split the String on "." or whatever your delimiter will be, then parse each of those tokens to its Integer value and compare.
int compareStringIntegerValue(String s1, String s2, String delimiter)
{
// String.split takes a regular expression, so quote the delimiter
// (an unquoted "." would match every character). Needs java.util.regex.Pattern.
String[] s1Tokens = s1.split(Pattern.quote(delimiter));
String[] s2Tokens = s2.split(Pattern.quote(delimiter));
// Only compare the tokens that both strings have
int common = Math.min(s1Tokens.length, s2Tokens.length);
for(int i = 0; i<common; i++)
{
int s1Value = Integer.parseInt(s1Tokens[i]);
int s2Value = Integer.parseInt(s2Tokens[i]);
int returnValue = Integer.compare(s1Value, s2Value);
if(returnValue != 0)
{
return returnValue; //end execution
}
}
return 0; //the shared tokens are equal
}
I will leave the case where the two strings have a different number of tokens as an exercise.
Comparing version strings can be a mess; you're getting unhelpful answers because the only way to make this work is to be very specific about what your ordering convention is. I've seen one relatively short and complete version comparison function on a blog post, with the code placed in the public domain- it isn't in Java but it should be simple to see how to adapt this.
Adapted from Alex Gitelman's answer.
int compareVersions( String str1, String str2 ){
if( str1.equals(str2) ) return 0; // Short circuit when you shoot for efficiency
String[] vals1 = str1.split("\\.");
String[] vals2 = str2.split("\\.");
int i=0;
// Most efficient way to skip past equal version subparts
while( i<vals1.length && i<vals2.length && vals1[i].equals(vals2[i]) ) i++;
// If we didn't reach the end,
if( i<vals1.length && i<vals2.length )
// have to use integer comparison to avoid the "10"<"1" problem
return Integer.valueOf(vals1[i]).compareTo( Integer.valueOf(vals2[i]) );
if( i<vals1.length ){ // end of str2, check if str1 is all 0's
boolean allZeros = true;
for( int j = i; allZeros & (j < vals1.length); j++ )
allZeros &= ( Integer.parseInt( vals1[j] ) == 0 );
return allZeros ? 0 : 1; // str1 has extra non-zero parts, so str1 > str2
}
if( i<vals2.length ){ // end of str1, check if str2 is all 0's
boolean allZeros = true;
for( int j = i; allZeros & (j < vals2.length); j++ )
allZeros &= ( Integer.parseInt( vals2[j] ) == 0 );
return allZeros ? 0 : -1; // str2 has extra non-zero parts, so str1 < str2
}
return 0; // Should never happen (identical strings.)
}
So as you can see, it is not so trivial. This also fails when you allow leading 0's, but I've never seen a version like "1.04.5". You would need to use integer comparison in the while loop to fix that. It gets even more complex when you mix letters with numbers in the version strings.
Split them into arrays and then compare.
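// Check whether the two strings are equal first; if they are, return 0.
String[] a1 = str1.split("\\."); // str1 and str2 are the version strings from the question
String[] a2 = str2.split("\\.");
int i = 0;
while (true) {
if (i == a1.length && i == a2.length) return 0;
if (i == a1.length) return -1;
if (i == a2.length) return 1;
if (a1[i].equals(a2[i])) {
i++;
continue;
}
// compare numerically, because as strings "10" would sort before "9"
return Integer.compare(Integer.parseInt(a1[i]), Integer.parseInt(a2[i]));
}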
I would divide the problem in two: formatting and comparing. If you can assume that the format is correct, then comparing numbers-only versions is very simple:
final int versionA = Integer.parseInt( "01.02.00".replaceAll( "\\.", "" ) );
final int versionB = Integer.parseInt( "01.12.00".replaceAll( "\\.", "" ) );
Then both versions can be compared as integers. So the "big problem" is the format, which can have many rules. In my case I just pad every part to two digits, so the format is always "99.99.99", and then I do the above conversion; the program logic is in the formatting, not in the version comparison. If you are doing something very specific and can trust the origin of the version string, you may be able to just check the length of the version string and then do the int conversion, but I think it's best practice to make sure the format is as expected.
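The formatting step could look roughly like the following sketch. It assumes at most three sections of at most two digits each, as in the "99.99.99" format described above; the class and method names are hypothetical.

```java
final class FixedFormatVersion {
    // Normalizes "1.2" to "01.02.00"; rejects anything that does not fit the 99.99.99 format.
    static String normalize(String version) {
        String[] parts = version.split("\\.");
        if (parts.length > 3) {
            throw new IllegalArgumentException("more than three sections: " + version);
        }
        String[] padded = new String[3];
        for (int i = 0; i < 3; i++) {
            // Missing sections become 00
            int value = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            if (value < 0 || value > 99) {
                throw new IllegalArgumentException("section out of range: " + version);
            }
            padded[i] = String.format("%02d", value);
        }
        return String.join(".", padded);
    }

    // Drops the dots and parses the remaining digits, as in the conversion shown above.
    static int toInt(String normalized) {
        return Integer.parseInt(normalized.replace(".", ""));
    }
}
```

For example, normalize("1.2") yields "01.02.00", and toInt("01.02.00") is 10200, which is smaller than toInt("01.12.00"), 11200.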
Step 1: Split each version string on the dot. You can use StringTokenizer in Java with dot as the delimiter:
StringTokenizer(String str, String delimiters)
Or you can use String.split() or Pattern.split() and split on the dot (escaped as "\\.", because split takes a regular expression), then convert each String to an integer using Integer.parseInt(String str).
Step 2: Compare the integers from left to right.
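The two steps can be sketched as follows with StringTokenizer. Treating a missing component as 0 (so that "1.2" equals "1.2.0") is an assumption of this sketch, not something the steps above specify, and the class name is made up.

```java
import java.util.StringTokenizer;

final class TokenizedVersionCompare {
    // Returns -1, 0 or 1 depending on whether v1 is lower than, equal to or higher than v2.
    static int compare(String v1, String v2) {
        // Step 1: tokenize both strings on the dot
        StringTokenizer t1 = new StringTokenizer(v1, ".");
        StringTokenizer t2 = new StringTokenizer(v2, ".");
        // Step 2: compare the integers from left to right
        while (t1.hasMoreTokens() || t2.hasMoreTokens()) {
            // Assumption: a missing component counts as 0
            int n1 = t1.hasMoreTokens() ? Integer.parseInt(t1.nextToken()) : 0;
            int n2 = t2.hasMoreTokens() ? Integer.parseInt(t2.nextToken()) : 0;
            if (n1 != n2) {
                return n1 < n2 ? -1 : 1;
            }
        }
        return 0;
    }
}
```

For the strings from the question, compare("1.2", "1.1.2") returns 1 because the second components differ (2 > 1) before the lengths matter.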