I wrote a small program to find the first 5 Taxicab numbers (so far only 6 are known) by checking each integer from 2 to 5E+15. The definition of Taxicab numbers is here.
However, my program took 8 minutes just to reach 3E+7. Since Taxicab(3) is in the order of 8E+7, I hesitate to let it run any further without optimizing it first.
I'm using NetBeans 8 on Ubuntu 16.10 on an HP 8560w, i7 2600qm quad core, 16GB RAM. However, Java only uses one core, to a maximum of 25% total CPU power, even when given Very High Priority. How do I fix this?
public class Ramanujan
{
    public static void main(String[] args)
    {
        long limit;
        //limit = 20;
        limit = 500000000000000000L;
        int order = 1;
        for (long testCase = 2; testCase < limit; testCase++)
        {
            if (isTaxicab(testCase, order))
            {
                System.out.printf("Taxicab(%d) = %d*****************************\n",
                        order, testCase);
                order++;
            }
            else
            {
                if (testCase % 100000 == 0) //Print every 100,000 iterations to track progress
                {
                    System.out.printf("%d \n", testCase);
                }
            }
        }
    }

    public static boolean isTaxicab(long testCase, int order)
    {
        int way = 0; //Number of ways that testCase can be expressed as a sum of 2 cube numbers
        long i = 1;
        long iUpperBound = (long) (1 + Math.cbrt(testCase/2));
        //If testCase = i*i*i + j*j*j AND i<=j
        //then i*i*i can't be > testCase/2
        //No need to test beyond that
        while (i < iUpperBound)
        {
            if (isSumOfTwoCubes(testCase, i))
            {
                way++;
            }
            i++;
        }
        return (way >= order);
    }

    public static boolean isSumOfTwoCubes(long testCase, long i)
    {
        boolean isSum = false;
        long jLowerBound = (long) Math.cbrt(testCase - i*i*i);
        for (long j = jLowerBound; j < jLowerBound + 2; j++)
        {
            long sumCubes = i*i*i + j*j*j;
            if (sumCubes == testCase)
            {
                isSum = true;
                break;
            }
        }
        return isSum;
    }
}
The program itself will only ever use one core until you parallelize it.
You need to learn how to use Threads.
Your problem is embarrassingly parallel. Parallelizing too much (i.e. creating too many threads) will be detrimental because each thread creates an overhead, so you need to be careful regarding exactly how you parallelize.
If it was up to me, I would initialize a list of worker threads where each thread effectively performs isTaxicab() and simply assign a single testCase to each worker as it becomes available.
You would want to code such that you can easily experiment with the number of workers.
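To make that concrete, here is a sketch of one way to do it (my own code, not the asker's): the search range is split into chunks, each chunk is checked by a task on a fixed thread pool, and the per-chunk hits are merged at the end. countWays is a simplified stand-in for the isTaxicab/isSumOfTwoCubes pair above.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelTaxicab {
    // Number of ways n = i^3 + j^3 with 1 <= i <= j.
    static int countWays(long n) {
        int ways = 0;
        for (long i = 1; 2 * i * i * i <= n; i++) {
            long rest = n - i * i * i;
            long j = Math.round(Math.cbrt(rest)); // candidate for the larger cube root
            if (j >= i && j * j * j == rest) ways++;
        }
        return ways;
    }

    // Scan [2, limit) in chunks, one task per chunk; each task returns
    // the numbers in its chunk with at least two representations.
    static List<Long> findTaxicabs(long limit, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long chunk = Math.max(1, limit / (threads * 8)); // more chunks than threads, for balance
        List<Future<List<Long>>> futures = new ArrayList<>();
        for (long lo = 2; lo < limit; lo += chunk) {
            final long from = lo, to = Math.min(lo + chunk, limit);
            futures.add(pool.submit(() -> {
                List<Long> hits = new ArrayList<>();
                for (long n = from; n < to; n++)
                    if (countWays(n) >= 2) hits.add(n);
                return hits;
            }));
        }
        List<Long> result = new ArrayList<>();
        for (Future<List<Long>> f : futures) result.addAll(f.get());
        pool.shutdown();
        Collections.sort(result); // chunks may finish out of order
        return result;
    }

    public static void main(String[] args) throws Exception {
        // 1729 and 4104 are the first two numbers with two representations
        System.out.println(findTaxicabs(100000, 4));
    }
}
```

The chunk size is the knob to experiment with: too few chunks and the load balances badly, too many and the task-submission overhead dominates.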
Related
Say I want to go through a loop a billion times how could I optimize the loop to get my results faster?
As an example:
double randompoint;
for (long count = 0; count < 1000000000; count++) {
    randompoint = (Math.random() * 1) + 0; // generate a random point
    if (randompoint <= .75) {
        var++;
    }
}
I was reading up on vectorization, but I'm not quite sure how to go about it. Any ideas?
Since Java is cross-platform, you pretty much have to rely on the JIT to vectorize. In your case it can't, since each iteration depends heavily on the previous one (due to how the RNG works).
However, there are two other major ways to improve your computation.
The first is that this work is very amenable to parallelization. The technical term is embarrassingly parallel. This means that multithreading will give a perfectly linear speedup over the number of cores.
The second is that Math.random() is written to be multithreading safe, which also means that it's slow because it needs to use atomic operations. This isn't helpful, so we can skip that overhead by using a non-threadsafe RNG.
I haven't written much Java since 1.5, but here's a dumb implementation:
import java.util.*;
import java.util.concurrent.*;

class Foo implements Runnable {
    private long count;
    private double threshold;
    private long result;

    public Foo(long count, double threshold) {
        this.count = count;
        this.threshold = threshold;
    }

    public void run() {
        ThreadLocalRandom rand = ThreadLocalRandom.current();
        for (long l = 0; l < count; l++) {
            if (rand.nextDouble() < threshold)
                result++;
        }
    }

    public static void main(String[] args) throws Exception {
        long count = 1000000000;
        double threshold = 0.75;
        int cores = Runtime.getRuntime().availableProcessors();
        long sum = 0;
        List<Foo> list = new ArrayList<Foo>();
        List<Thread> threads = new ArrayList<Thread>();
        for (int i = 0; i < cores; i++) {
            // TODO: account for count%cores!=0
            Foo t = new Foo(count/cores, threshold);
            list.add(t);
            Thread thread = new Thread(t);
            thread.start();
            threads.add(thread);
        }
        for (Thread t : threads) t.join();
        for (Foo f : list) sum += f.result;
        System.out.println(sum);
    }
}
You can also optimize and inline the random generator, to avoid going via doubles. Here it is with code taken from the ThreadLocalRandom docs:
public void run() {
    long seed = new Random().nextLong();
    long limit = (long) ((1L << 48) * threshold);
    for (long i = 0; i < count; i++) {   // count is a long, so use a long counter
        seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
        if (seed < limit) ++result;
    }
}
However, the best approach is to work smarter, not harder. As the number of events increases, the distribution of the count tends towards a normal distribution. This means that for your huge range, you can generate a normally distributed random number and scale it:
import java.util.Random;

class StayInSchool {
    public static void main(String[] args) {
        System.out.println(coinToss(1000000000, 0.75));
    }

    static long coinToss(long iterations, double threshold) {
        double mean = threshold * iterations;
        double stdDev = Math.sqrt(threshold * (1-threshold) * iterations);
        double p = new Random().nextGaussian();
        return (long) (p*stdDev + mean);
    }
}
Here are the timings on my 4 core system (including VM startup) for these approaches:
Your baseline: 20.9s
Single threaded ThreadLocalRandom: 6.51s
Single threaded optimized random: 1.75s
Multithreaded ThreadLocalRandom: 1.67s
Multithreaded optimized random: 0.89s
Generating a gaussian: 0.14s
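For reference, on current JDKs the multithreaded ThreadLocalRandom variant can also be written as a parallel stream, which hands the range splitting to the common fork/join pool for you. A sketch (my addition, not included in the timings above):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

public class ParallelCount {
    // Parallel stream version: the range is split across the common
    // fork/join pool, and each worker draws from its own ThreadLocalRandom.
    static long countBelow(long iterations, double threshold) {
        return LongStream.range(0, iterations)
                .parallel()
                .filter(i -> ThreadLocalRandom.current().nextDouble() < threshold)
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countBelow(1_000_000_000L, 0.75));
    }
}
```

This trades a little control (chunk sizes, thread count) for much less code than the hand-rolled Runnable version.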
I'm having trouble with a multithreaded Java program.
The program computes the sum of an array of integers by splitting the array into slices, summing each slice in its own thread, and then adding up the partial sums.
The problem is that the computation time does not decrease as the number of threads increases. (I know there is a limit beyond which more threads make things slower, but I expected to see execution time decrease before reaching that limit, from the benefits of parallel execution.) I use the variable fake in the run method to make the times "readable".
public class MainClass {
    private final int MAX_THREAD = 8;
    private final int ARRAY_SIZE = 1000000;
    private int[] array;
    private SimpleThread[] threads;
    private int numThread = 1;
    private int[] sum;
    private int start = 0;
    private int totalSum = 0;
    long begin, end;
    int fake;

    MainClass() {
        fillArray();
        for (int i = 0; i < MAX_THREAD; i++) {
            threads = new SimpleThread[numThread];
            sum = new int[numThread];
            begin = System.currentTimeMillis();
            for (int j = 0; j < numThread; j++) {
                threads[j] = new SimpleThread(start, ARRAY_SIZE/numThread, j);
                threads[j].start();
                start += ARRAY_SIZE/numThread;
            }
            for (int k = 0; k < numThread; k++) {
                try {
                    threads[k].join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            end = System.currentTimeMillis();
            for (int g = 0; g < numThread; g++) {
                totalSum += sum[g];
            }
            System.out.printf("Result with %d thread-- Sum = %d Time = %d\n", numThread, totalSum, end-begin);
            numThread++;
            start = 0;
            totalSum = 0;
        }
    }

    public static void main(String args[]) {
        new MainClass();
    }

    private void fillArray() {
        array = new int[ARRAY_SIZE];
        for (int i = 0; i < ARRAY_SIZE; i++)
            array[i] = 1;
    }

    private class SimpleThread extends Thread {
        int start;
        int size;
        int index;

        public SimpleThread(int start, int size, int sumIndex) {
            this.start = start;
            this.size = size;
            this.index = sumIndex;
        }

        public void run() {
            for (int i = start; i < start+size; i++)
                sum[index] += array[i];
            for (long i = 0; i < 1000000000; i++) {
                fake++;
            }
        }
    }
}
(Screenshot of the unexpected benchmark results omitted.)
As a general rule, you won't get a speedup from multi-threading if the "work" performed by each thread is less than the overheads of using the threads.
One of the overheads is the cost of starting a new thread. This is surprisingly high. Each time you start a thread the JVM needs to perform syscalls to allocate the thread stack memory segment and the "red zone" memory segment, and initialize them. (The default thread stack size is typically 500KB or 1MB.) Then there are further syscalls to create the native thread and schedule it.
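You can get a feel for this overhead with a crude measurement (my sketch; completely unscientific, with no JIT warmup, and the numbers vary a lot by machine and OS):

```java
public class ThreadStartCost {
    // Crude measurement of thread start+join overhead: start n threads
    // with empty bodies and report the average cost per thread in microseconds.
    static long avgStartJoinMicros(int n) {
        long t0 = System.nanoTime();
        try {
            for (int i = 0; i < n; i++) {
                Thread t = new Thread(() -> { }); // empty body: pure start/join cost
                t.start();
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - t0) / n / 1000;
    }

    public static void main(String[] args) {
        System.out.println("~" + avgStartJoinMicros(1000) + " us per thread start+join");
    }
}
```

Typical results are on the order of tens to hundreds of microseconds per thread, which is a lot compared to summing a few hundred thousand ints.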
In this example, you have 1,000,000 elements to sum and you divide this work among N threads. As N increases, the amount of work performed by each thread decreases.
It is not hard to see that the time taken to sum 1,000,000 elements is going to be less than the time needed to start 4 threads ... just based on counting the memory read and write operations. Then you need to take into account that the child threads are created one at a time by the parent thread.
If you do the analysis completely, it is clear that there is a point at which adding more threads actually slows down the computation, even if you have enough cores to run all threads in parallel. And your benchmarking seems to suggest¹ that that point is at around 2 threads.
By the way, there is a second reason why you may not get as much speedup as you expect for a benchmark like this one. The "work" that each thread is doing is basically scanning a large array. Reading and writing arrays will generate requests to the memory system. Ideally, these requests will be satisfied by the (fast) on-chip memory caches. However, if you try to read/write an array that is larger than the memory cache, then many or most of those requests turn into (slow) main memory requests. Worse still, if you have N cores all doing this, you can find that the number of main memory requests is too much for the memory system to keep up with, and the threads slow down.
The bottom line is that multi-threading does not automatically make an application faster, and it certainly won't if you do it the wrong way.
In your example:
the amount of work per thread is too small compared with the overheads of creating and starting threads, and
memory bandwidth effects are likely to be a problem even if you can "factor out" the thread creation overheads
1 - I don't understand the point of the "fake" computation. It probably invalidates the benchmark, though it is possible that the JIT compiler optimizes it away.
Why is the sum sometimes wrong?
Because ARRAY_SIZE/numThread is integer division and the remainder is discarded (e.g. 1000000/3 = 333333), so the slices handed to the threads do not always cover the whole array, and the sum can come out less than 1000000, depending on the divisor.
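A simple way to avoid losing the remainder is to give the last slice whatever is left over. A sketch (hypothetical helper, not from the question):

```java
public class ChunkSplit {
    // Split `total` items into `parts` contiguous [start, end) ranges,
    // giving the final range the remainder so nothing is lost.
    static long[][] split(long total, int parts) {
        long[][] ranges = new long[parts][2];
        long base = total / parts; // rounded down; remainder goes to the last range
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long end = (i == parts - 1) ? total : start + base;
            ranges[i][0] = start;
            ranges[i][1] = end;
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1_000_000, 3))
            System.out.println(r[0] + " .. " + r[1]);
    }
}
```

With 1000000 items and 3 parts this yields [0, 333333), [333333, 666666), [666666, 1000000), so the full array is covered.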
Why does the time taken increase as the number of threads increases?
Because in the run function of each thread you do this:
for (long i = 0; i < 1000000000; i++) {
    fake++;
}
which I do not understand, even with your explanation that "I use the variable fake in run method to make time 'readable'". Either way, every thread has to increment your fake variable 1,000,000,000 times, and that dominates the measured time.
As a side note: for what you're trying to do, there is the Fork/Join framework. It allows you to easily split tasks recursively and implements an algorithm that distributes your workload automatically.
There is a guide available here; its example is very similar to your case, which boils down to a RecursiveTask like this:
class Adder extends RecursiveTask<Integer>
{
    private int[] toAdd;
    private int from;
    private int to;

    /** Add the numbers in the given array */
    public Adder(int[] toAdd)
    {
        this(toAdd, 0, toAdd.length);
    }

    /** Add the numbers in the given array between the given indices;
        internal constructor to split work */
    private Adder(int[] toAdd, int fromIndex, int upToIndex)
    {
        this.toAdd = toAdd;
        this.from = fromIndex;
        this.to = upToIndex;
    }

    /** This is the work method */
    @Override
    protected Integer compute()
    {
        int amount = to - from;
        int result = 0;
        if (amount < 500)
        {
            // base case: add ints and return the result
            for (int i = from; i < to; i++)
            {
                result += toAdd[i];
            }
        }
        else
        {
            // array too large: split it into two parts and distribute the actual adding
            int newEndIndex = from + (amount / 2);
            Collection<Adder> invokeAll = invokeAll(Arrays.asList(
                    new Adder(toAdd, from, newEndIndex),
                    new Adder(toAdd, newEndIndex, to)));
            for (Adder a : invokeAll)
            {
                result += a.join(); // invokeAll has already run the tasks; join collects results
            }
        }
        return result;
    }
}
To actually run this, you can use
RecursiveTask<Integer> adder = new Adder(fillArray(ARRAY_LENGTH));
int result = ForkJoinPool.commonPool().invoke(adder);
Starting threads is heavy, and you'll only see a benefit from it when the work per thread is large and the threads don't compete for the same resources (neither of which applies here).
If I have a simple program that counts, in parallel, how many times a random integer from 0 - 9 comes up as 1 over a large number of iterations, how do I reduce the per-thread counting variable (numOnes) with a sum function, so that I can use the total sum later on in my program?
This is equivalent to the reduction directive in OpenMP.
public void run() {
    long work = total_iterations / threads;
    long numOnes = 0;
    for (long i = 0; i < work; i++) {
        int randomNum = rand.nextInt(9);
        if (randomNum == 1) {
            numOnes += 1;
        }
    }
}
Once each thread is done executing, I want to be able to use numOnes containing the aggregate result.
In Java, you have to sit down and manage things manually. In other words: assuming your data is partitioned, you kick off those 10 threads and let them do their work.
At the end, you join all those threads: only when all threads have "joined" are all of them done, and you are ready to go forward and process their results.
Alternatively, you could look into more "abstract" tools like ExecutorService and Future, to get away from dealing with "bare metal" threads directly.
Of course that is pretty generic; but well, so is your question.
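As a sketch of that approach (names hypothetical): each worker returns its private numOnes as the result of a Callable, and the main thread performs the reduction by summing the Futures. Note this uses nextInt(10) for the range 0 - 9 inclusive; the snippet in the question used nextInt(9), which excludes 9.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class OnesReduction {
    static long countOnes(long iterations, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> futures = new ArrayList<>();
        final long work = iterations / threads; // per-thread share (remainder ignored here)
        for (int t = 0; t < threads; t++) {
            futures.add(pool.submit(() -> {
                ThreadLocalRandom rand = ThreadLocalRandom.current();
                long numOnes = 0;                         // thread-private partial count
                for (long i = 0; i < work; i++) {
                    if (rand.nextInt(10) == 1) numOnes++; // 0..9 inclusive
                }
                return numOnes;
            }));
        }
        long total = 0;
        for (Future<Long> f : futures) total += f.get(); // the "reduction" step
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countOnes(10_000_000, 4));
    }
}
```

Future.get() blocks until the worker is done, so the loop over futures doubles as the join step.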
You could use streams for this.
public class ParallelInts
{
    public static void main(String[] args) {
        int count = new Random().ints( 1_000_000, 0, 10 ).parallel()
                .reduce( 0, (sum, i) -> sum + ((i==1)?1:0) );
        System.out.println( "count = " + count );
    }
}
I've been playing around with the Project Euler challenges to help improve my knowledge of Java. In particular, I wrote the following code for problem 14, which asks you to find the longest Collatz chain which starts at a number below 1,000,000. It works on the assumption that subchains are incredibly likely to arise more than once, and by storing them in a cache, no redundant calculations are done.
Collatz.java:
import java.util.HashMap;

public class Collatz {
    private HashMap<Long, Integer> chainCache = new HashMap<Long, Integer>();

    public void initialiseCache() {
        chainCache.put((long) 1, 1);
    }

    private long collatzOp(long n) {
        if (n % 2 == 0) {
            return n/2;
        }
        else {
            return 3*n + 1;
        }
    }

    public int collatzChain(long n) {
        if (chainCache.containsKey(n)) {
            return chainCache.get(n);
        }
        else {
            int count = 1 + collatzChain(collatzOp(n));
            chainCache.put(n, count);
            return count;
        }
    }
}
ProjectEuler14.java:
public class ProjectEuler14 {
    public static void main(String[] args) {
        Collatz col = new Collatz();
        col.initialiseCache();
        long limit = 1000000;
        long temp = 0;
        long longestLength = 0;
        long index = 1;
        for (long i = 1; i < limit; i++) {
            temp = col.collatzChain(i);
            if (temp > longestLength) {
                longestLength = temp;
                index = i;
            }
        }
        System.out.println(index + " has the longest chain, with length " + longestLength);
    }
}
This works. And according to the "measure-command" command from Windows Powershell, it takes roughly 1708 milliseconds (1.708 seconds) to execute.
However, after reading through the forums, I noticed that some people, who had written seemingly naive code, which calculate each chain from scratch, seemed to be getting much better execution times than me. I (conceptually) took one of the answers, and translated it into Java:
NaiveProjectEuler14.java:
public class NaiveProjectEuler14 {
    public static void main(String[] args) {
        int longest = 0;
        int numTerms = 0;
        int i;
        long j;
        for (i = 1; i <= 10000000; i++) {
            j = i;
            int currentTerms = 1;
            while (j != 1) {
                currentTerms++;
                if (currentTerms > numTerms) {
                    numTerms = currentTerms;
                    longest = i;
                }
                if (j % 2 == 0) {
                    j = j / 2;
                }
                else {
                    j = 3 * j + 1;
                }
            }
        }
        System.out.println("Longest: " + longest + " (" + numTerms + ").");
    }
}
On my machine, this also gives the correct answer, but in 502 milliseconds (0.502 seconds) - roughly a third of the time taken by my original program. At first I thought that maybe there was a small overhead in creating a HashMap, and that the times taken were too small to draw any conclusions. However, if I increase the upper limit from 1,000,000 to 10,000,000 in both programs, NaiveProjectEuler14 takes 4709 milliseconds (4.709 seconds), whilst ProjectEuler14 takes a whopping 25324 milliseconds (25.324 seconds)!
Why does ProjectEuler14 take so long? The only explanation I can fathom is that storing huge amounts of pairs in the HashMap data structure is adding a huge overhead, but I can't see why that should be the case. I've also tried recording the number of (key, value) pairs stored during the course of the program (2,168,611 pairs for the 1,000,000 case, and 21,730,849 pairs for the 10,000,000 case) and supplying a little over that number to the HashMap constructor so that it only has to resize itself at most once, but this does not seem to affect the execution times.
Does anyone have any rationale for why the memoized version is a lot slower?
There are several reasons for that unfortunate reality:
Instead of containsKey followed by get, do a single get and check for null - the current code does two lookups per cache hit
The cached lookup adds an extra method call per step
The map stores boxed objects (Integer, Long) in place of primitive types
The JIT compiler can optimize plain calculations much better than map lookups
Unlike the Fibonacci case, the caching does not save a large percentage of the computation
A comparable version would be:
public static void main(String[] args) {
    int longest = 0;
    int numTerms = 0;
    int i;
    long j;
    Map<Long, Integer> map = new HashMap<>();
    for (i = 1; i <= 10000000; i++) {
        j = i;
        Integer terms = map.get((long) i); // the key must be a Long, not an auto-boxed Integer
        if (terms != null) {
            continue;
        }
        int currentTerms = 1;
        while (j != 1) {
            currentTerms++;
            if (currentTerms > numTerms) {
                numTerms = currentTerms;
                longest = i;
            }
            if (j % 2 == 0) {
                j = j / 2;
                // Maybe check the map only here
                Integer m = map.get(j);
                if (m != null) {
                    currentTerms += m;
                    break;
                }
            }
            else {
                j = 3 * j + 1;
            }
        }
        map.put((long) i, currentTerms); // memoize the chain length of i, not of j
    }
    System.out.println("Longest: " + longest + " (" + numTerms + ").");
}
This does not really do adequate memoization. Not checking the map after the 3*j+1 step reduces the number of lookups, but it can also skip over memoized values.
Memoization pays off when there is heavy calculation per call. If the function takes long because of deep recursion rather than per-call computation, the memoization overhead per function call counts against you.
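If you still want memoization, you can keep it cheap by caching chain lengths in a plain int[] indexed by the number itself, which avoids boxing and HashMap overhead entirely. A sketch (my code, not the original):

```java
public class CollatzArrayCache {
    // cache[n] holds the chain length of n; 0 means "not yet computed".
    // Only values below the array length are cached; larger intermediate
    // values in a chain are simply recomputed.
    static int chainLength(long n, int[] cache) {
        if (n == 1) return 1;
        if (n < cache.length && cache[(int) n] != 0) return cache[(int) n];
        long next = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        int len = 1 + chainLength(next, cache);
        if (n < cache.length) cache[(int) n] = len;
        return len;
    }

    // Starting number below `limit` with the longest Collatz chain.
    static int longestUnder(int limit) {
        int[] cache = new int[limit];
        int best = 0, bestStart = 1;
        for (int i = 1; i < limit; i++) {
            int len = chainLength(i, cache);
            if (len > best) { best = len; bestStart = i; }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        System.out.println(longestUnder(1_000_000)); // Project Euler 14
    }
}
```

The array lookup is a single bounds-checked load, so the "cache miss penalty" that dominates the HashMap version largely disappears.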
import java.io.*;
import java.util.ArrayList;

public class Ristsumma {
    static long numberFromFile;
    static long sum1, sum2;
    static long number, number2;
    static long variable, variable2;
    static long counter;

    public static void main(String args[]) throws IOException {
        try {
            BufferedReader br = new BufferedReader(new FileReader("ristsis.txt"));
            numberFromFile = Long.parseLong(br.readLine());
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        variable = numberFromFile;
        ArrayList<Long> numbers = new ArrayList<Long>();
        while (variable > 0) {
            number = variable % 10;
            variable /= 10;
            numbers.add(number);
        }
        for (int i = 0; i < numbers.size(); i++) {
            sum1 += numbers.get(i);
        }
        ArrayList<Long> numbers2 = new ArrayList<Long>();
        for (long s = 1; s < numberFromFile; s++) {
            variable2 = s;
            number2 = 0;
            sum2 = 0;
            while (variable2 > 0) {
                number2 = variable2 % 10;
                variable2 /= 10;
                numbers2.add(number2);
            }
            for (int i = 0; i < numbers2.size(); i++) {
                sum2 += numbers2.get(i);
            }
            if (sum1 == sum2) {
                counter += 1;
            }
            numbers2.clear();
        }
        PrintWriter pw = new PrintWriter("ristval.txt", "UTF-8");
        pw.println(counter);
        pw.close();
    }
}
So I have this code. It reads a number from a file and adds up its digits (for example, if the number is 123 it computes 1+2+3=6). In the second half it goes through all numbers from 1 up to the number from the file and counts how many of them give the same digit sum. If the number is 123, the sum is 6 and the answer the code writes is 9 (because 6, 15, 24, 33, 42, 51, 60, 105 and 114 also give the same sum). The code works, but my problem is that when the number in the file is, for example, 2 222 222 222, it takes almost half an hour to get the answer. How can I make this run faster?
Remove unnecessary creation of lists
You are unnecessarily creating lists
ArrayList<Long> numbers = new ArrayList<Long>();
while (variable > 0) {
    number = variable % 10;
    variable /= 10;
    numbers.add(number);
}
for (int i = 0; i < numbers.size(); i++) {
    sum1 += numbers.get(i);
}
Here you create an ArrayList just to temporarily hold the Longs; you can eliminate the entire list:
while (variable > 0) {
    number = variable % 10;
    variable /= 10;
    sum1 += number;
}
The same goes for the other ArrayList, numbers2.
Presize ArrayLists
We have already eliminated the ArrayLists, but if we hadn't, we could improve speed by presizing them:
ArrayList<Long> numbers = new ArrayList<Long>(someGuessAsToSize);
Your guess doesn't need to be correct - the ArrayList will still resize automatically - but if it is approximately right, you will speed up the code, because the ArrayList will not have to resize itself periodically.
General style
You are holding lots of (what should be) method variables as fields
static long numberFromFile;
static long sum1, sum2;
static long number, number2;
static long variable, variable2;
static long counter;
This is unlikely to affect performance but is an unusual thing to do and makes the code less readable with the potential for "hidden effects"
Your problem is intriguing - it got me wondering how much faster it would run with threads.
Here is a threaded implementation that splits the task of calculating the second problem across threads. My laptop only has two cores so I have set the threads to 4.
public static void main(String[] args) throws Exception {
    final long in = 222222222;
    final long target = calcSum(in);
    final ExecutorService executorService = Executors.newFixedThreadPool(4);
    final Collection<Future<Integer>> futures = new LinkedList<>(); // plain java.util.LinkedList
    final int chunk = 100;
    for (long i = in; i > 0; i -= chunk) {
        futures.add(executorService.submit(new Counter(i > chunk ? i - chunk : 0, i, target)));
    }
    long res = 0;
    for (final Future<Integer> f : futures) {
        res += f.get();
    }
    System.out.println(res);
    executorService.shutdown();
    executorService.awaitTermination(1, TimeUnit.DAYS);
}

public static final class Counter implements Callable<Integer> {
    private final long start;
    private final long end;
    private final long target;

    public Counter(long start, long end, long target) {
        this.start = start;
        this.end = end;
        this.target = target;
    }

    @Override
    public Integer call() throws Exception {
        int count = 0;
        for (long i = start; i < end; ++i) {
            if (calcSum(i) == target) {
                ++count;
            }
        }
        return count;
    }
}

public static long calcSum(long num) {
    long sum = 0;
    while (num > 0) {
        sum += num % 10;
        num /= 10;
    }
    return sum;
}
It calculates the solution with 222 222 222 as an input in a few seconds.
I optimised the calculation of the sum to remove all the Lists that you were using.
EDIT
I added some timing code using Stopwatch and tried with and without @Ingo's optimisation, using 222222222 * 100 as the input number.
Without the optimisation the code takes 35 seconds. Changing the calc method to:
public static long calcSum(long num, final long limit) {
    long sum = 0;
    while (num > 0) {
        sum += num % 10;
        if (limit > 0 && sum > limit) {
            break;
        }
        num /= 10;
    }
    return sum;
}
With the optimisation added, the code takes 28 seconds.
Note that this is a highly non-scientific benchmark, as I didn't warm up the JIT or run multiple trials (partly because I'm lazy and partly because I'm busy).
EDIT
Fiddling with the chunk size gives fairly different results too. With a chunk of 1000 time drops to around 17 seconds.
EDIT
If you want to be really fancy you can use a ForkJoinPool:
public static void main(String[] args) throws Exception {
    final long in = 222222222;
    final long target = calcSum(in);
    final ForkJoinPool forkJoinPool = new ForkJoinPool();
    final ForkJoinTask<Integer> result = forkJoinPool.submit(new Counter(0, in, target));
    System.out.println(result.get());
    forkJoinPool.shutdown();
    forkJoinPool.awaitTermination(1, TimeUnit.DAYS);
}

public static final class Counter extends RecursiveTask<Integer> {
    private static final long THRESHOLD = 1000;
    private final long start;
    private final long end;
    private final long target;

    public Counter(long start, long end, long target) {
        this.start = start;
        this.end = end;
        this.target = target;
    }

    @Override
    protected Integer compute() {
        if (end - start < THRESHOLD) {
            return computeDirectly();
        }
        long mid = start + (end - start) / 2;
        final Counter low = new Counter(start, mid, target);
        final Counter high = new Counter(mid, end, target);
        low.fork();
        final int highResult = high.compute();
        final int lowResult = low.join();
        return highResult + lowResult;
    }

    private Integer computeDirectly() {
        int count = 0;
        for (long i = start; i < end; ++i) {
            if (calcSum(i) == target) {
                ++count;
            }
        }
        return count;
    }
}

public static long calcSum(long num) {
    long sum = 0;
    while (num > 0) {
        sum += num % 10;
        num /= 10;
    }
    return sum;
}
On a different (much faster) computer this runs in under a second, as compared to 2.8 seconds for the original approach.
You spend most of the time checking numbers that are failing the test. However, as Ingo observed, if you have a number ab, then (a-1)(b+1) has the same sum as ab. Instead of checking all numbers, you can generate them:
Let's say our number is 2 222; the sum is 8.
Approach #1: bottom up
We now generate the number starting with the smallest (we pad with zeroes for reading convenience): 0008. The next one is 0017, the next are 0026, 0035, 0044, 0053, 0062, 0071, 0080, 0107 and so on. The problematic part is finding the first number that has this sum.
Approach #2: top down
We start at 2222, the next lower number is 2213, then 2204, 2150, 2141, and so on. Here you don't have the problem that you need to find the lowest number.
I don't have time to write code now, but there should be an algorithm to realize both approaches, that does not involve trying out all numbers.
For a number abc, (a)(b-1)(c+1) is the next lower number, while (a)(b+1)(c-1) is the next higher number. The only interesting/difficult part is when a carry is needed because b==9 or c==9, or b==0, c==0. The next bigger number if b==9 is (a+1)(9)(c-1) if c>0, and (a)(8)(0) if c==0. Now go make your algorithm - these examples should be enough.
Observe that you don't need to store the individual digits at all.
Instead, all you're interested in is the actual sum of the digits.
Considering this, a method like
static int diagsum(long number) { ... }
would be great. If it is easy enough, the JIT could inline it, or at least optimize it better than your spaghetti code.
Then again, you could benefit from another method that stops computing the digit sum at some limit. For example, when you have
22222222
the sum is 20, and that means that you need not compute any other sum that is greater than 20. For example:
45678993
Instead, you could just stop after you have the last 3 digits (which you get first by your division method), because 9+9+3 is 21 and this is already greater than 20.
===================================================================
Another optimization:
If you have some number:
123116
it is immediately clear that all unique permutations of those 6 digits have the same digit sum, thus
321611, 231611, ... are solutions
Then, for any pair of individual digits ab, a transformed number would contain (a+1)(b-1) and (a-1)(b+1) in the same place, as long as a+1, ... are still in the range 0..9. Apply recursively to get even more numbers.
You can then turn to numbers with fewer digits. Obviously, to have the same digit sum, you must combine 2 digits of the original number, if possible, for example
5412 => 912, 642, 741, 552, 561, 543
etc.
Apply the same algorithm recursively as above, until no transformations and combinations are possible.
=========
It must be said, though, that the above idea would take lots of memory, because one must maintain a Set-like data structure to take care of duplicates. However, for 987_654_321 we already get 39_541_589 results, and probably many more with even greater numbers. Thus it is questionable whether the effort to actually do it the combinatorial way is worth it.
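For what it's worth, the counting can also be done without generating or storing any numbers at all, using a standard digit-by-digit dynamic programming approach (this is an alternative technique I'm sketching here, not the combinatorial generation described above): count the integers up to n whose digit sum equals the target, processing the decimal digits left to right and tracking whether the prefix is still "tight" against n.

```java
import java.util.Arrays;

public class DigitSumCount {
    // Count integers in [0, n] whose decimal digit sum equals target.
    static long countUpTo(long n, int target) {
        if (n < 0 || target < 0) return 0;
        String s = Long.toString(n);
        // memo[pos][remaining] caches the "not tight" subproblems only;
        // the tight path is visited at most once per position anyway.
        long[][] memo = new long[s.length()][target + 1];
        for (long[] row : memo) Arrays.fill(row, -1);
        return count(s, 0, target, true, memo);
    }

    static long count(String s, int pos, int remaining, boolean tight, long[][] memo) {
        if (remaining < 0) return 0;
        if (pos == s.length()) return remaining == 0 ? 1 : 0;
        if (!tight && memo[pos][remaining] != -1) return memo[pos][remaining];
        int max = tight ? s.charAt(pos) - '0' : 9; // digit cap when still matching n's prefix
        long total = 0;
        for (int d = 0; d <= max; d++)
            total += count(s, pos + 1, remaining - d, tight && d == max, memo);
        if (!tight) memo[pos][remaining] = total;
        return total;
    }

    public static void main(String[] args) {
        // Numbers in [1, 123) with digit sum 6: the question's example answer is 9.
        System.out.println(countUpTo(123 - 1, 6));
    }
}
```

This runs in time proportional to (number of digits) x (target sum) x 10, so even inputs like 2 222 222 222 are answered instantly, with no Set of generated numbers to keep around.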