Performance intensive string splitting and manipulation in java - java

What is the most efficient way to split a string by a very simple separator?
Some background:
I am porting a function I wrote in C with a bunch of pointer arithmetic to Java, and it is incredibly slow (after some optimisation it is still 5x slower).
Having profiled it, it turns out a lot of that overhead is in String.split.
The function in question takes a host name or ip address and makes it generic:
123.123.123.123->*.123.123.123
a.b.c.example.com->*.example.com
This can be run over several million items on a regular basis, so performance is an issue.
Edit: the rules for converting are thus:
If it's an ip address, replace the first part
Otherwise, find the main domain name, and make the preceding part generic.
foo.bar.com-> *.bar.com
foo.bar.co.uk-> *.bar.co.uk
I have now rewritten it using lastIndexOf and substring to work my way in from the back, and the performance has improved by leaps and bounds.
I'll leave the question open for another 24 hours before settling on the best answer for future reference.
Here's what I've come up with now (the IP part is an insignificant check before calling this function):
private static String hostConvert(String in) {
    final String[] subs = { "ac", "co", "com", "or", "org", "ne", "net", "ad", "gov", "ed" };
    int dotPos = in.lastIndexOf('.');
    if (dotPos == -1)
        return in;
    int prevDotPos = in.lastIndexOf('.', dotPos - 1);
    if (prevDotPos == -1)
        return in;
    CharSequence cs = in.subSequence(prevDotPos + 1, dotPos);
    for (String cur : subs) {
        if (cur.contentEquals(cs)) {
            int start = in.lastIndexOf('.', prevDotPos - 1);
            if (start == -1 || start == 0)
                return in;
            return "*" + in.substring(start);
        }
    }
    return "*" + in.substring(prevDotPos);
}
If there's any space for further improvement it would be good to hear.

Something like this is about as fast as you can make it:
static String starOutFirst(String s) {
    final int K = s.indexOf('.');
    return "*" + s.substring(K);
}
static String starOutButLastTwo(String s) {
    final int K = s.lastIndexOf('.', s.lastIndexOf('.') - 1);
    return "*" + s.substring(K);
}
Then you can do:
System.out.println(starOutFirst("123.123.123.123"));
// prints "*.123.123.123"
System.out.println(starOutButLastTwo("a.b.c.example.com"));
// prints "*.example.com"
You may need to use a regex to see which of the two methods is applicable for any given string.
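A regex is not strictly required for that check. As a minimal sketch, assuming an IPv4 address can be recognised as "only digits and dots" (the question says the real IP check happens elsewhere), the two methods could be dispatched like this. The class name HostGeneralizer, the helper looksLikeIpv4 and the guards for strings with too few dots are illustrative additions, not part of the code above:

```java
final class HostGeneralizer {
    // Hypothetical check: true if the string contains only digits and dots.
    static boolean looksLikeIpv4(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '.' && (c < '0' || c > '9')) {
                return false;
            }
        }
        return true;
    }

    // Replaces everything before the first dot with "*".
    static String starOutFirst(String s) {
        int k = s.indexOf('.');
        return k < 0 ? s : "*" + s.substring(k);
    }

    // Keeps the last two labels and replaces the rest with "*".
    static String starOutButLastTwo(String s) {
        int last = s.lastIndexOf('.');
        if (last <= 0) {
            return s; // no dot, or only a leading dot
        }
        int k = s.lastIndexOf('.', last - 1);
        return k < 0 ? s : "*" + s.substring(k);
    }

    static String generalize(String host) {
        return looksLikeIpv4(host) ? starOutFirst(host) : starOutButLastTwo(host);
    }
}
```

Strings with fewer than two dots are returned unchanged here, which is an assumption about the desired behaviour; it does not handle second-level suffixes such as co.uk.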

I'd try using .indexOf(".") and .substring(index).
You didn't elaborate on the exact pattern you wanted to match but if you can avoid split(), it should cut down on the number of new strings it allocates (1 instead of several).

It's unclear from your question exactly what the code is supposed to do. Does it find the first '.' and replace everything up to it with a '*'? Or is there some fancier logic behind it? Maybe everything up to the nth '.' gets replaced by '*'?
If you're trying to find an instance of a particular string, use something like the Boyer-Moore algorithm. It should be able to find the match for you and you can then replace what you want.
Keep in mind that String in Java is immutable. It might be faster to change the sequence in-place. Check out other CharSequence implementations to see what you can do, e.g. StringBuffer and CharBuffer. If concurrency is not needed, StringBuilder might be an option.
By using a mutable CharSequence instead of the methods on String, you avoid a bunch of object churn. If all you're doing is replacing some slice of the underlying character array with a shorter array (i.e. {'*'}), this is likely to yield a speedup since such array copies are fairly optimized. You'll still be doing an array copy at the end of the day, but it may be faster than new String allocations.
UPDATE
All the above is pretty much hogwash. Sure, maybe you can implement your own CharSequence that gives you better slicing and lazily resizes the array (aka doesn't actually truncate anything until it absolutely must), returning Strings based on offsets and whatnot. But StringBuffer and StringBuilder, at least directly, do not perform as well as the solution poly posted. CharBuffer is entirely inapplicable; I didn't realize it was an nio class earlier: it's meant for other things entirely.
There are some interesting things about poly's code, which I wonder whether he/she knew before posting it, namely that changing the "*" on the final lines of the methods to a '*' results in a significant slowdown.
Nevertheless, here is my benchmark. I found one small optimization: declaring the '.' and "*" expressions as constants adds a bit of a speedup, as does using a locally-scoped StringBuilder instead of the binary infix string concatenation operator.
I know the gc() is at best advisory and at worst a no-op, but I figured adding it with a bit of sleep time might let the VM do some cleanup after creating 1M Strings. Someone may correct me if this is totally naïve.
Simple Benchmark
import java.util.ArrayList;
import java.util.Arrays;
public class StringSplitters {
private static final String PREFIX = "*";
private static final char DOT = '.';
public static String starOutFirst(String s) {
final int k = s.indexOf(DOT);
return PREFIX + s.substring(k);
}
public static String starOutFirstSb(String s) {
StringBuilder sb = new StringBuilder();
final int k = s.indexOf(DOT);
return sb.append(PREFIX).append(s.substring(k)).toString();
}
public static void main(String[] args) throws InterruptedException {
double[] firstRates = new double[10];
double[] firstSbRates = new double[10];
double firstAvg = 0;
double firstSbAvg = 0;
double firstMin = Double.POSITIVE_INFINITY;
double firstMax = Double.NEGATIVE_INFINITY;
double firstSbMin = Double.POSITIVE_INFINITY;
double firstSbMax = Double.NEGATIVE_INFINITY;
for (int i = 0; i < 10; i++) {
firstRates[i] = testFirst();
firstAvg += firstRates[i];
if (firstRates[i] < firstMin)
firstMin = firstRates[i];
if (firstRates[i] > firstMax)
firstMax = firstRates[i];
Thread.sleep(100);
System.gc();
Thread.sleep(100);
}
firstAvg /= 10.0d;
for (int i = 0; i < 10; i++) {
firstSbRates[i] = testFirstSb();
firstSbAvg += firstSbRates[i];
if (firstSbRates[i] < firstSbMin)
firstSbMin = firstSbRates[i];
if (firstSbRates[i] > firstSbMax)
firstSbMax = firstSbRates[i];
Thread.sleep(100);
System.gc();
Thread.sleep(100);
}
firstSbAvg /= 10.0d;
System.out.printf("First:\n\tMin:\t%07.3f\tMax:\t%07.3f\tAvg:\t%07.3f\n\tRates:\t%s\n\n", firstMin, firstMax,
firstAvg, Arrays.toString(firstRates));
System.out.printf("FirstSb:\n\tMin:\t%07.3f\tMax:\t%07.3f\tAvg:\t%07.3f\n\tRates:\t%s\n\n", firstSbMin,
firstSbMax, firstSbAvg, Arrays.toString(firstSbRates));
}
private static double testFirst() {
ArrayList<String> strings = new ArrayList<String>(1000000);
for (int i = 0; i < 1000000; i++) {
int first = (int) (Math.random() * 128);
int second = (int) (Math.random() * 128);
int third = (int) (Math.random() * 128);
int fourth = (int) (Math.random() * 128);
strings.add(String.format("%d.%d.%d.%d", first, second, third, fourth));
}
long before = System.currentTimeMillis();
for (String s : strings) {
starOutFirst(s);
}
long after = System.currentTimeMillis();
return 1000000000.0d / (after - before);
}
private static double testFirstSb() {
ArrayList<String> strings = new ArrayList<String>(1000000);
for (int i = 0; i < 1000000; i++) {
int first = (int) (Math.random() * 128);
int second = (int) (Math.random() * 128);
int third = (int) (Math.random() * 128);
int fourth = (int) (Math.random() * 128);
strings.add(String.format("%d.%d.%d.%d", first, second, third, fourth));
}
long before = System.currentTimeMillis();
for (String s : strings) {
starOutFirstSb(s);
}
long after = System.currentTimeMillis();
return 1000000000.0d / (after - before);
}
}
Output
First:
Min: 3802281.369 Max: 5434782.609 Avg: 5185796.131
Rates: [3802281.3688212926, 5181347.150259067, 5291005.291005291, 5376344.086021505, 5291005.291005291, 5235602.094240838, 5434782.608695652, 5405405.405405405, 5434782.608695652, 5405405.405405405]
FirstSb:
Min: 4587155.963 Max: 5747126.437 Avg: 5462087.511
Rates: [4587155.963302752, 5747126.436781609, 5617977.528089887, 5208333.333333333, 5681818.181818182, 5586592.17877095, 5586592.17877095, 5524861.878453039, 5524861.878453039, 5555555.555555556]
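One caveat about the benchmark above: the timed loops discard the return value, so the JIT is in principle free to optimise some of the work away, and System.currentTimeMillis() is coarse. A hedged sketch of a slightly more defensive harness follows. It assumes a warm-up phase plus a checksum is enough for a rough comparison (a dedicated harness such as JMH would be more trustworthy). The names MicroBench, measure and sink are made up for this example:

```java
final class MicroBench {
    // Accumulates result lengths so the computed strings are observably used.
    static long sink;

    static String starOutFirst(String s) {
        return "*" + s.substring(s.indexOf('.'));
    }

    // Runs warmupRounds untimed passes, then one timed pass.
    // Returns the measured rate in operations per second.
    static double measure(String[] inputs, int warmupRounds) {
        for (int r = 0; r < warmupRounds; r++) {
            for (String s : inputs) {
                sink += starOutFirst(s).length();
            }
        }
        long start = System.nanoTime();
        for (String s : inputs) {
            sink += starOutFirst(s).length();
        }
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return inputs.length * 1e9 / elapsed;
    }
}
```

Every input is assumed to contain at least one dot, as in the generated test data above.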


Java converting a string binary to integer without using maths pow [duplicate]

Main:
public class Main{
public static void main(String[] args){
System.out.println(Convert.BtoI("101101010101"));
System.out.println(Convert.BtoI("1011110"));
}
}
Sub:
public class Convert{
public static int BtoI(String value){
int no = 0;
for(int i=value.length()-1;i>=0;i--){
if(value.charAt(i)=='1')
no += (???) ;
++;
}
return no;
}
}
How can I convert a binary string to an integer without using Math.pow, using only +, -, * and /? Should I add another for loop, something like for (int j = 1; i <= example; i *= 2) {? I am quite confused and want to learn how to do this without Math.pow or anything similar.
Working from the end of the string back to the beginning, you can add the matching power of two for each '1' character; the total is the base-ten value:
public static int binaryToInt(String binaryNum) {
    int pow = 1;
    int total = 0;
    for (int i = binaryNum.length(); i > 0; i--) {
        if (binaryNum.charAt(i - 1) == '1') {
            total += pow;
        }
        pow *= 2;
    }
    return total;
}
But I think this is way more elegant:
String binaryNum = "1001";
int decimalValue = Integer.parseInt(binaryNum, 2);
How about Integer.parseInt(yourString, 2)? The 2 means you are parsing a base-2 number.
Starting from your code + some vital style changes:
public class Convert {
    public static int binaryToInt(String value) {
        int no = 0;
        // visit every character, including the last one
        for (int i = 0; i < value.length(); i++) {
            no = no * 2; // or no *= 2;
            if (value.charAt(i) == '1') {
                no = no + 1; // or no++
            }
        }
        return no;
    }
}
The meaning of the code I added should be self-evident. If not, I would encourage you to work out what it does as an exercise. (If you are not sure, use a pencil and paper to "hand execute" it ...)
The style changes (roughly in order of importance) are:
Don't start a method name with an uppercase character.
Use camel-case.
Avoid cute / obscure / unnecessary abbreviations. The next guy reading your code should not have to use a magic decoder ring to understand your method names.
Put spaces around operators.
4 character indentation.
Put spaces between ) and { and before / after keywords.
Use curly brackets around the "then" and "else" parts of an if statement, even if they are not strictly necessary. (This is to avoid a problem where indentation mistakes cause you to misread the code.)
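To make the "hand execute" suggestion concrete, here is a hedged sketch of the same doubling idea with the loop running over every character and an input check added. The class name BinaryTrace and the IllegalArgumentException are choices made for this illustration, not part of the answer above:

```java
final class BinaryTrace {
    // Reads the digits left to right: doubling shifts the running value
    // one binary place, then the current digit (0 or 1) is added.
    static int binaryToInt(String value) {
        int no = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("not a binary digit: " + c);
            }
            no = no * 2 + (c - '0');
        }
        return no;
    }
}
```

For the input "101" the running value goes 1, 2, 5. Overflow for inputs longer than 31 digits is not handled.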

Reverse long array to string algorithm

I need to reverse the following algorithm, which converts a long array into a string:
import java.io.UnsupportedEncodingException;
import java.util.Random;
public final class LongConverter {
private final long[] l;
public LongConverter(long[] paramArrayOfLong) {
this.l = paramArrayOfLong;
}
private void convertLong(long paramLong, byte[] paramArrayOfByte, int paramInt) {
int i = Math.min(paramArrayOfByte.length, paramInt + 8);
while (paramInt < i) {
paramArrayOfByte[paramInt] = ((byte) (int) paramLong);
paramLong >>= 8;
paramInt++;
}
}
public final String toString() {
int i = this.l.length;
byte[] arrayOfByte = new byte[8 * (i - 1)];
long l1 = this.l[0];
Random localRandom = new Random(l1);
for (int j = 1; j < i; j++) {
long l2 = localRandom.nextLong();
convertLong(this.l[j] ^ l2, arrayOfByte, 8 * (j - 1));
}
String str;
try {
str = new String(arrayOfByte, "UTF8");
} catch (UnsupportedEncodingException localUnsupportedEncodingException) {
throw new AssertionError(localUnsupportedEncodingException);
}
int k = str.indexOf(0);
if (-1 == k) {
return str;
}
return str.substring(0, k);
}
}
So when I do the following call
System.out.println(new LongConverter(new long[]{-6567892116040843544L, 3433539276790523832L}).toString());
it prints 400 as the result.
It would be great if anyone could say what algorithm this is or how I could reverse it.
Thanks for your help
This is not a solvable problem as stated because:
you only use l[0], so any additional long values could be anything.
if there are N solutions for the internal seed, there are guaranteed to be N << 16 solutions to this problem. While the seed for Random is 64-bit, in reality the value used internally is 48-bit. This means that if there is any solution, there are at least 64K (1 << 16) solutions for a long seed.
What you can do is:
find the smallest seed which would generate the string using brute force. For short strings this won't take long; however, if you have 5-6 characters this will take a while, and for 7+ characters there might not be a solution.
instead of generating 8-bit characters where all 8-bit values are equally likely, you could restrict the range to, say, space, A-Z, a-z and 0-9. This means you can have ~6 bits of randomness per character, shorter seeds and slightly longer Strings.
BTW You might find this post interesting where I use contrived random seeds to generate specific sequences. http://vanillajava.blogspot.co.uk/2011/10/randomly-no-so-random.html
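The brute-force idea can be sketched as follows. This is a simplified variant, not the LongConverter scheme from the question: it assumes the text consists of lowercase letters a-z drawn with nextInt(26), and it simply tries seeds from 0 upwards. The class SeedSearch and its methods are hypothetical names:

```java
import java.util.Random;

final class SeedSearch {
    // Spells out `length` letters a-z from the generator seeded with `seed`.
    static String generate(long seed, int length) {
        Random random = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }

    // Tries seeds 0, 1, 2, ... up to maxSeed and returns the first seed
    // whose generated text equals the target, or -1 if none is found.
    static long findSeed(String target, long maxSeed) {
        for (long seed = 0; seed <= maxSeed; seed++) {
            if (generate(seed, target.length()).equals(target)) {
                return seed;
            }
        }
        return -1;
    }
}
```

Each extra character multiplies the expected search effort by about 26, which is why this only stays practical for very short strings.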
If you want a process which ensures you can always re-create the original longs from a String or a byte[], I suggest using encryption. You can encrypt a String which has been UTF-8 encoded or a byte[] into another byte[] which can be base64 encoded to be readable as text. (Or you could skip the encryption and use base64 alone)
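As a rough sketch of the "base64 alone" option, assuming Java 8 or later (java.util.Base64) and UTF-8 text; the class name Base64RoundTrip is invented for this example, and no encryption is shown:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64RoundTrip {
    // Text -> UTF-8 bytes -> Base64 text.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 text -> bytes -> original text.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

Unlike the seeded-Random scheme in the question, this mapping is reversible by construction: decode(encode(s)) returns s.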

Convert hashcode to limited set of string

I know it's a one-way function, but I want to convert a hashcode back to a string from a limited character set (using chars between 32 and 126). Is there an efficient way to do this?
It's not only feasible - it's actually pretty simple, given the definition for String.hashCode. You can just create a string of "base 31" characters with some arbitrary starting point to keep everything in the right range, subtracting an offset based on that starting point.
This isn't necessarily the shortest string with the given hash code, but 7 characters is pretty short :)
public class Test {
    public static void main(String[] args) {
        int hash = 100000;
        String sample = getStringForHashCode(hash);
        System.out.println(sample); // ASD^TYQ
        System.out.println(sample.hashCode()); // 100000
    }
    private static final int OFFSET = "AAAAAAA".hashCode();
    private static String getStringForHashCode(int hash) {
        hash -= OFFSET;
        // Treat it as an unsigned long, for simplicity.
        // This avoids having to worry about negative numbers anywhere.
        long longHash = (long) hash & 0xFFFFFFFFL;
        System.out.println(longHash);
        char[] c = new char[7];
        for (int i = 0; i < 7; i++)
        {
            c[6 - i] = (char) ('A' + (longHash % 31));
            longHash /= 31;
        }
        return new String(c);
    }
}
Actually, I only need one string from that hashcode. I want to make Minecraft seed shortener.
The simplest way to turn an int value into a short string is to use
String s = Integer.toString(n, 36); // uses base 36.
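A small sketch of the round trip, in case it helps: Integer.parseInt with the same radix turns the short string back into the number. The class and method names (Base36, shorten, expand) are made up here:

```java
final class Base36 {
    // int -> short text using digits 0-9 and letters a-z.
    static String shorten(int n) {
        return Integer.toString(n, 36);
    }

    // Short text -> the original int.
    static int expand(String s) {
        return Integer.parseInt(s, 36);
    }
}
```

For example, shorten(100000) gives "255s", and expand("255s") gives 100000 again. Negative values keep a leading minus sign.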

Java: Enum vs. Int

When using flags in Java, I have seen two main approaches. One uses int values and a line of if-else statements. The other is to use enums and case-switch statements.
I was wondering if there was a difference in terms of memory usage and speed between using enums vs ints for flags?
Both ints and enums can use both switch or if-then-else, and memory usage is also minimal for both, and speed is similar - there's no significant difference between them on the points you raised.
However, the most important difference is the type checking. Enums are checked, ints are not.
Consider this code:
public class SomeClass {
    public static final int RED = 1;
    public static final int BLUE = 2;
    public static final int YELLOW = 3;
    public static final int GREEN = 3; // sic
    private int color;
    public void setColor(int color) {
        this.color = color;
    }
}
While many clients will use this properly,
new SomeClass().setColor(SomeClass.RED);
There is nothing stopping them from writing this:
new SomeClass().setColor(999);
There are three main problems with using the public static final pattern:
The problem occurs at runtime, not compile time, so it's going to be more expensive to fix, and harder to find the cause
You have to write code to handle bad input - typically an if-then-else with a final else throw new IllegalArgumentException("Unknown color " + color); - again expensive
There is nothing preventing a collision of constants - the above class code will compile even though YELLOW and GREEN both have the same value 3
If you use enums, you address all these problems:
Your code won't compile unless you pass valid values in
No need for any special "bad input" code - the compiler handles that for you
Enum values are unique
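For comparison, a minimal sketch of the same class with an enum. The names Paint and Color are chosen for this example only; the null check is an extra, since null is the one "bad value" the compiler still lets through:

```java
final class Paint {
    enum Color { RED, BLUE, YELLOW, GREEN }

    private Color color;

    // Only a Color constant (or null) can be passed; setColor(999) no longer compiles.
    void setColor(Color color) {
        if (color == null) {
            throw new IllegalArgumentException("color must not be null");
        }
        this.color = color;
    }

    Color getColor() {
        return color;
    }
}
```

YELLOW and GREEN are now distinct constants, so the accidental collision from the int version cannot happen.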
Memory usage and speed aren't the considerations that matter. You would not be able to measure a difference either way.
I think enums should be preferred when they apply, because they emphasize the fact that the chosen values go together and comprise a closed set. Readability is improved a great deal, too. Code using enums is more self-documenting than stray int values scattered throughout your code.
Prefer enums.
You may even use Enums to replace those bitwise combined flags like int flags = FLAG_1 | FLAG_2;
Instead you can use a typesafe EnumSet:
Set<FlagEnum> flags = EnumSet.of(FlagEnum.FLAG_1, FlagEnum.FLAG_2);
// then simply test with contains()
if(flags.contains(FlagEnum.FLAG_1)) ...
The documentation states that those classes are internally optimized as bit vectors and that the implementation should perform well enough to replace the int-based flags.
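A slightly fuller sketch of the EnumSet approach, assuming a FlagEnum with three constants (the third constant, the Flags class and its methods are invented for illustration):

```java
import java.util.EnumSet;

final class Flags {
    enum FlagEnum { FLAG_1, FLAG_2, FLAG_3 }

    // Equivalent of: int flags = FLAG_1 | FLAG_2;
    static EnumSet<FlagEnum> defaults() {
        return EnumSet.of(FlagEnum.FLAG_1, FlagEnum.FLAG_2);
    }

    // Equivalent of: ~flags (restricted to the defined flags).
    static EnumSet<FlagEnum> inverse(EnumSet<FlagEnum> flags) {
        return EnumSet.complementOf(flags);
    }
}
```

Setting and clearing a single flag then becomes add() and remove() on the set rather than |= and &= ~ on an int.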
One of the reasons you will see some code using int flags instead of an enum is that Java did not have enums until Java 1.5.
So if you are looking at code that was originally written for an older version of Java, then the int pattern was the only option available.
There are a very small number of places where using int flags is still preferable in modern Java code, but in most cases you should prefer to use an enum, due to the type safety and expressiveness that they offer.
In terms of efficiency, it will depend on exactly how they are used. The JVM handles both types very efficiently, but the int method would likely be slightly more efficient for some use cases (because ints are handled as primitives rather than objects), while in other cases the enum would be more efficient (because it doesn't need to go through boxing/unboxing).
You would be hard pressed to find a situation in which the efficiency difference would be in any way noticeable in a real world application, so you should make the decision based on the quality of the code (readability and safety), which should lead you to use an enum 99% of the time.
Bear in mind that enums are type-safe, and you can't mix values from one enum with another. That's a good reason to prefer enums over ints for flags.
On the other hand, if you use ints for your constants, you can mix values from unrelated constants, like this:
public static final int SUNDAY = 1;
public static final int JANUARY = 1;
...
// even though this works, it's a mistake:
int firstMonth = SUNDAY;
The memory usage of enums over ints is negligible, and the type safety enums provide makes the minimal overhead acceptable.
Yes, there is a difference. Under modern 64-bit Java, enum values are essentially pointers to objects, and they either take 64 bits (non-compressed oops) or use additional CPU (compressed oops).
My test showed about 10% performance degradation for enums (1.8u25, AMD FX-4100): 13k ns vs 14k ns.
Test source below:
public class Test {
public static enum Enum {
ONE, TWO, THREE
}
static class CEnum {
public Enum e;
}
static class CInt {
public int i;
}
public static void main(String[] args) {
CEnum[] enums = new CEnum[8192];
CInt[] ints = new CInt[8192];
for (int i = 0 ; i < 8192 ; i++) {
enums[i] = new CEnum();
ints[i] = new CInt();
ints[i].i = 1 + (i % 3);
if (i % 3 == 0) {
enums[i].e = Enum.ONE;
} else if (i % 3 == 1) {
enums[i].e = Enum.TWO;
} else {
enums[i].e = Enum.THREE;
}
}
int k=0; //calculate something to prevent tests to be optimized out
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
k+=test1(enums);
System.out.println();
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
k+=test2(ints);
System.out.println(k);
}
private static int test2(CInt[] ints) {
long t;
int k = 0;
for (int i = 0 ; i < 1000 ; i++) {
k+=test(ints);
}
t = System.nanoTime();
k+=test(ints);
System.out.println((System.nanoTime() - t)/100 + "ns");
return k;
}
private static int test1(CEnum[] enums) {
int k = 0;
for (int i = 0 ; i < 1000 ; i++) {
k+=test(enums);
}
long t = System.nanoTime();
k+=test(enums);
System.out.println((System.nanoTime() - t)/100 + "ns");
return k;
}
private static int test(CEnum[] enums) {
int i1 = 0;
int i2 = 0;
int i3 = 0;
for (int j = 100 ; j != 0 ; --j)
for (int i = 0 ; i < 8192 ; i++) {
CEnum c = enums[i];
if (c.e == Enum.ONE) {
i1++;
} else if (c.e == Enum.TWO) {
i2++;
} else {
i3++;
}
}
return i1 + i2*2 + i3*3;
}
private static int test(CInt[] enums) {
int i1 = 0;
int i2 = 0;
int i3 = 0;
for (int j = 100 ; j != 0 ; --j)
for (int i = 0 ; i < 8192 ; i++) {
CInt c = enums[i];
if (c.i == 1) {
i1++;
} else if (c.i == 2) {
i2++;
} else {
i3++;
}
}
return i1 + i2*2 + i3*3;
}
}
Answer to your question: no. After a negligible time to load the enum class, the performance is the same.
As others have stated both types can be used in switch or if else statements. Also, as others have stated, you should favor Enums over int flags, because they were designed to replace that pattern and they provide added safety.
HOWEVER, there is a better pattern that you should consider: providing whatever value your switch statement/if statement was supposed to produce as a property of the enum.
Look at this link: http://docs.oracle.com/javase/1.5.0/docs/guide/language/enums.html Notice the pattern provided for giving the planets masses and radii. Providing the property in this manner ensures that you won't forget to cover a case if you add an enum constant.
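A small sketch of that pattern, using a made-up Coin enum rather than the planets from the linked page (Coin, cents and Wallet are all hypothetical names):

```java
enum Coin {
    PENNY(1), NICKEL(5), DIME(10), QUARTER(25);

    // The value a switch statement would otherwise have to look up.
    private final int cents;

    Coin(int cents) {
        this.cents = cents;
    }

    int cents() {
        return cents;
    }
}

final class Wallet {
    // No switch needed: each constant carries its own value.
    static int total(Coin... coins) {
        int sum = 0;
        for (Coin c : coins) {
            sum += c.cents();
        }
        return sum;
    }
}
```

Adding a new constant forces you to supply its value in the constructor call, so there is no separate switch to forget to update.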
I like using enums when possible, but I had a situation where I had to compute millions of file offsets for different file types, which I had defined in an enum, and I had to execute a switch statement tens of millions of times to compute the offset based on the enum type. I ran the following test:
import java.util.Random;
public class switchTest
{
public enum MyEnum
{
Value1, Value2, Value3, Value4, Value5
};
public static void main(String[] args)
{
final String s1 = "Value1";
final String s2 = "Value2";
final String s3 = "Value3";
final String s4 = "Value4";
final String s5 = "Value5";
String[] strings = new String[]
{
s1, s2, s3, s4, s5
};
Random r = new Random();
long l = 0;
long t1 = System.currentTimeMillis();
for(int i = 0; i < 10_000_000; i++)
{
String s = strings[r.nextInt(5)];
switch(s)
{
case s1:
// make sure the compiler can't optimize the switch out of existence by making the work of each case it does different
l = r.nextInt(5);
break;
case s2:
l = r.nextInt(10);
break;
case s3:
l = r.nextInt(15);
break;
case s4:
l = r.nextInt(20);
break;
case s5:
l = r.nextInt(25);
break;
}
}
long t2 = System.currentTimeMillis();
for(int i = 0; i < 10_000_000; i++)
{
MyEnum e = MyEnum.values()[r.nextInt(5)];
switch(e)
{
case Value1:
// make sure the compiler can't optimize the switch out of existence by making the work of each case it does different
l = r.nextInt(5);
break;
case Value2:
l = r.nextInt(10);
break;
case Value3:
l = r.nextInt(15);
break;
case Value4:
l = r.nextInt(20);
break;
case Value5:
l = r.nextInt(25);
break;
}
}
long t3 = System.currentTimeMillis();
for(int i = 0; i < 10_000_000; i++)
{
// note: nextInt(5) yields 0..4, so case 5 below is never reached and 0 matches no case
int xx = r.nextInt(5);
switch(xx)
{
case 1:
// make sure the compiler can't optimize the switch out of existence by making the work of each case it does different
l = r.nextInt(5);
break;
case 2:
l = r.nextInt(10);
break;
case 3:
l = r.nextInt(15);
break;
case 4:
l = r.nextInt(20);
break;
case 5:
l = r.nextInt(25);
break;
}
}
long t4 = System.currentTimeMillis();
System.out.println("strings:" + (t2 - t1));
System.out.println("enums :" + (t3 - t2));
System.out.println("ints :" + (t4 - t3));
}
}
and got the following results:
strings:442
enums :455
ints :362
So from this I decided that for me enums were efficient enough. When I decreased the loop counts to 1M from 10M the string and enums took about twice as long as the int which indicates that there was some overhead to using strings and enums for the first time as compared to ints.
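One thing that may contribute to the enum numbers above: MyEnum.values() returns a fresh copy of the constants array on every call, so the enum loop allocates on each iteration while the other two loops do not. A sketch of caching the array once (the CachedValues class, the VALUES field and the pick method are names invented here):

```java
import java.util.Random;

final class CachedValues {
    enum MyEnum { Value1, Value2, Value3, Value4, Value5 }

    // values() clones its backing array on each call, so keep one copy.
    private static final MyEnum[] VALUES = MyEnum.values();

    static MyEnum pick(Random r) {
        return VALUES[r.nextInt(VALUES.length)];
    }
}
```

Whether this changes the measured timings noticeably would need to be re-run; it only removes one known difference between the loops.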
Even though this question is old, I'd like to point out what you can't do with ints:
public interface AttributeProcessor {
public void process(AttributeLexer attributeLexer, char c);
}
public enum ParseArrayEnd implements AttributeProcessor {
State1{
public void process(AttributeLexer attributeLexer, char c) {
.....}},
State2{
public void process(AttributeLexer attributeLexer, char c) {
.....}}
}
And what you can do is make a map of what value is expected as a Key, and the enum as a value,
Map<String, AttributeProcessor> map
map.getOrDefault(key, ParseArrayEnd.State1).process(this, c);
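Since the snippets above are abbreviated, here is a self-contained sketch of the same idea with placeholder names (TokenHandler, BracketState and Dispatch are hypothetical and stand in for AttributeProcessor, ParseArrayEnd and the surrounding lexer):

```java
import java.util.HashMap;
import java.util.Map;

interface TokenHandler {
    String handle(char c);
}

// Each enum constant supplies its own implementation of the interface.
enum BracketState implements TokenHandler {
    OPEN {
        public String handle(char c) {
            return "open:" + c;
        }
    },
    CLOSE {
        public String handle(char c) {
            return "close:" + c;
        }
    }
}

final class Dispatch {
    private static final Map<String, TokenHandler> HANDLERS = new HashMap<>();

    static {
        HANDLERS.put("[", BracketState.OPEN);
        HANDLERS.put("]", BracketState.CLOSE);
    }

    // Unknown keys fall back to OPEN, mirroring getOrDefault above.
    static String dispatch(String key, char c) {
        return HANDLERS.getOrDefault(key, BracketState.OPEN).handle(c);
    }
}
```

With int constants the behaviour would have to live in a separate switch; here the lookup and the behaviour travel together.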

Adaptation of LCS algorithm

New programmer here. I watched a video which showed a recursive algorithm for LCS (longest common substring). The program only returned an int, which was the length of the LCS between the two strings. I decided as an exercise to adapt the algorithm to return the string itself. Here is what I came up with. It seems to be right, but I need to ask others more experienced if there are any bugs:
const int mAX=1001; //max size for the two strings to be compared
string soFar[mAX][mAX]; //keeps results of strings generated along way to solution
bool Get[mAX][mAX]; //marks what has been seen before(pairs of indexes)
class LCS{ //recursive version,use of global arrays not STL maps
private:
public:
string _getLCS(string s0,int k0, string s1,int k1){
if(k0<=0 || k1<=0){//base case
return "";
}
if(!Get[k0][k1]){ //checking bool memo to see if pair of indexes has been seen before
Get[k0][k1]=true; //mark seen pair of string indexs
if(s0[k0-1]==s1[k1-1]){
soFar[k0][k1]=s0[k0-1]+_getLCS(s0,k0-1,s1,k1-1);//if the char in positions k0 and k1 are equal add common char and move on
}
else{
string a=_getLCS(s0,k0-1,s1,k1);//this string is the result from keeping the k1 position the same and decrementing the k0 position
string b=_getLCS(s0,k0,s1,k1-1);//this string is the result from decrementing the k1 position keeping k0 the same
if(a.length()> b.length())soFar[k0][k1]=a;//the longer string is the one we are interested in
else
soFar[k0][k1]=b;
}
}
return soFar[k0][k1];
}
string LCSnum(string s0,string s1){
memset(Get,0,sizeof(Get));//memset works fine for zero, so no complaints please
string a=_getLCS(s0,s0.length(),s1,s1.length());
reverse(a.begin(),a.end());//because I start from the end of the strings, the result need to be reversed
return a;
}
};
I have only been programming for 6 months, so I can't really tell if there are bugs or cases where this algorithm will not work. It seems to work for two strings of size up to 1001 chars each.
What are the bugs and would the equivalent dynamic programming solution be faster or use less memory for the same result?
Thanks
Your program is not correct. What does it return for LCSnum("aba", "abba")?
string soFar[mAX][mAX] should be a hint that this is not a great solution. A simple dynamic programming solution (which has logic that you almost follow) has an array of size_t which is m*n in size, and no bool Get[mAX][mAX] either. (A better dynamic programming algorithm only has an array of 2*min(m, n).)
Edit: by the way, here is the space-efficient dynamic programming solution in Java. Complexity: time is O(m*n), space is O(min(m, n)), where m and n are the lengths of the strings. The result set is given in alphabetical order.
import java.util.Set;
import java.util.TreeSet;
class LCS {
public static void main(String... args) {
System.out.println(lcs(args[0], args[1]));
}
static Set<String> lcs(String s1, String s2) {
final Set<String> result = new TreeSet<String>();
final String shorter, longer;
if (s1.length() <= s2.length()) {
shorter = s1;
longer = s2;
}else{
shorter = s2;
longer = s1;
}
final int[][] table = new int[2][shorter.length()];
int maxLen = 0;
for (int i = 0; i < longer.length(); i++) {
int[] last = table[i % 2]; // alternate
int[] current = table[(i + 1) % 2];
for (int j = 0; j < shorter.length(); j++) {
if (longer.charAt(i) == shorter.charAt(j)) {
current[j] = (j > 0? last[j - 1] : 0) + 1;
if (current[j] > maxLen) {
maxLen = current[j];
result.clear();
}
if (current[j] == maxLen) {
result.add(shorter.substring(j + 1 - maxLen, j + 1));
}
} else {
current[j] = 0; // reset: the row is reused, so a stale count must not survive a mismatch
}
}
}
return result;
}
}
