How to calculate the mutual index of coincidence of two strings in Java

I am trying to calculate the mutual index of coincidence of two strings, A and B. I have calculated the frequency of each letter in each string. However, I do not know how to continue from there. Any help is appreciated. The expected output is supposed to be some decimal value. Thanks!
public class MutualIndexOfAB
{
    public double calculateMutual(String a, String b)
    {
        int NA = 0;
        int NB = 0;
        int[] freqA = new int[26]; // separate frequency tables (the original wrote both counts into one undeclared array)
        int[] freqB = new int[26];
        a = a.toUpperCase();
        b = b.toUpperCase();
        // calculate frequency of each letter in String a
        for (int i = 0; i < a.length(); i++)
        {
            int chA = a.charAt(i) - 65; // was "ch", which did not compile
            if (chA >= 0 && chA < 26)
            {
                freqA[chA]++;
                NA++;
            }
        }
        // calculate frequency of each letter in String b
        for (int j = 0; j < b.length(); j++)
        {
            int chB = b.charAt(j) - 65;
            if (chB >= 0 && chB < 26)
            {
                freqB[chB]++;
                NB++;
            }
        }
        // TODO: combine freqA, freqB, NA and NB into the mutual index of coincidence
        return 0.0; // placeholder so the class compiles
    }

    public static void main(String[] args)
    {
        MutualIndexOfAB test = new MutualIndexOfAB();
        String textA = "cyber security is about how we develop secure computers and computer networks, to ensure that the data stored and transmitted through them is protected from unauthorized access or to combat digital security threats and hazards. as we conduct more of our social, consumer and business activities online, there is a corresponding increase in the demand for ict professionals to manage our digital environment and economy.";
        String textB = "cyber security has been identified as one of the strategic priorities in australia to meet the demands of law enforcement, national and state governments, defense, security and finance industries. jobs of the future will be in all of these areas ensuring there is national capability to maintain and build our essential services and stop them from being disrupted, destroyed, or threatened, and that our personal information is not communicated, shared, visualized or analysed without our permission.";
        System.out.println("Mutual Index of Coincidence of Texts A and B: " + test.calculateMutual(textA, textB));
    }
}

Simply create a single loop with one index (instead of two), compare the characters of the two strings at each position, and increase a counter whenever they match. Then divide that count by the number of positions compared.
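A minimal sketch of that suggestion, assuming both strings have already been upper-cased and walking only the overlapping length (the method name is illustrative):
// Count the positions at which the two strings hold the same character,
// then divide by the number of positions compared.
public static double positionalCoincidence(String a, String b)
{
    int length = Math.min(a.length(), b.length());
    int matches = 0;
    for (int i = 0; i < length; i++)
    {
        if (a.charAt(i) == b.charAt(i))
        {
            matches++;
        }
    }
    return (double) matches / length;
}
Note that many references instead define the mutual index of coincidence directly from the two frequency tables the question already builds: the sum of freqA[i] * freqB[i] over the 26 letters, divided by NA * NB. Either way the result is the decimal value the question expects.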

Related

Using parallel arrays: show the highest sales value and the month in which it occurred

I'm trying to display, in a JOptionPane message dialog, the highest sales value from my array of sales.
I also want to show the month in which it occurred, but I'm failing to find a way to display the month.
public static void main(String[] args) {
    int[] CarSales = {1234, 2343, 1456, 4567, 8768, 2346, 9876, 4987, 7592, 9658, 7851, 2538};
    String[] Months = {"January", "February", "March", "April", "May", "June",
                       "July ", "August", "September", "October", "November", "December"};
    int HighNum = CarSales[0];
    for (int i = 0; i < CarSales.length; i++) {
        if (CarSales[i] > HighNum) {
            HighNum = CarSales[i];
        }
    }
    JOptionPane.showMessageDialog(null, "The highest car sales value is :" + HighNum +
        "-which happened in the month of");
}
Use the Power of Objects
Avoid using parallel arrays in Java. They bring unnecessary complexity and make the code brittle and less maintainable.
Your code doesn't automatically become object-oriented just because you're using an object-oriented language.
Objects give you an easy way of structuring your data and organizing the code: if you need to implement some functionality related to a particular piece of data, you know where it should go; its place is in the class representing that data.
So, to begin with, I advise you to implement a class, let's call it CarSale:
public static class CarSale {
    private Month month;
    private int amount;
    // getters, constructor, etc.
}
Or, if you don't need it to be mutable, it can be implemented as a Java 16 record. In a nutshell, a record is a specialized form of class whose instances are meant to be transparent carriers of data (you cannot change their properties after instantiation).
One of the greatest things about records is their concise syntax. The line below is equivalent to a fully fledged class with getters, constructor, equals/hashCode and toString (all of these are generated for you by the compiler):
public record CarSale(Month month, int amount) {}
java.time.Month
You've probably noticed that in the code above the month property is not a String. It's the standard enum Month that resides in the java.time package.
When a property can take only a limited set of values, an enum is usually the preferred choice: contrary to a plain String, an enum guards you against typos, and enums have extensive language support. So you don't need the array filled with month names.
Here is how your code might look:
CarSale[] carSales = {
    new CarSale(Month.JANUARY, 1234),
    new CarSale(Month.FEBRUARY, 2343),
    new CarSale(Month.MARCH, 1456),
    // ...
};

// you might want to check that carSales is not empty before accessing its first element
CarSale best = carSales[0];
for (CarSale sale : carSales) {
    if (sale.getAmount() > best.getAmount()) best = sale;
}
JOptionPane.showMessageDialog(null,
    "The highest car sales value is :" + best.getAmount() +
    " which happened in the month of " + best.getMonth());
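If you go the record route, the maximum can also be found with a stream. A small self-contained sketch (keeping in mind that a record's generated accessor is amount(), not getAmount()):
import java.time.Month;
import java.util.Arrays;
import java.util.Comparator;

public class BestMonthDemo {
    public record CarSale(Month month, int amount) {}

    public static void main(String[] args) {
        CarSale[] carSales = {
            new CarSale(Month.JANUARY, 1234),
            new CarSale(Month.FEBRUARY, 2343),
            new CarSale(Month.MARCH, 1456)
        };
        // max() yields an Optional<CarSale>; orElseThrow() fails fast on an empty array
        CarSale best = Arrays.stream(carSales)
                .max(Comparator.comparingInt(CarSale::amount))
                .orElseThrow();
        System.out.println(best.amount() + " in " + best.month());
    }
}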
Note
Try to keep your code aligned with Java naming conventions. Only the names of classes and interfaces should start with a capital letter.
In case you've heard from someone that parallel arrays can improve memory consumption, I would advise you to examine the question deeper and take a look at questions like Why use parallel arrays in Java? With arrays this tiny, the only thing you get is the disadvantage of fragile code.
There are multiple solutions, but I'll give you the simplest one based on your structure.
Just declare one String variable and assign it a value whenever you update the highest number.
public static void main(String[] args) {
    int[] CarSales = {1234, 2343, 1456, 4567, 8768, 2346, 9876, 4987, 7592, 9658, 7851, 2538};
    String[] Months = {"January", "February", "March", "April", "May", "June",
                       "July ", "August", "September", "October", "November", "December"};
    int HighNum = CarSales[0];
    String month = Months[0];
    for (int i = 0; i < CarSales.length; i++) {
        if (CarSales[i] > HighNum) {
            HighNum = CarSales[i];
            month = Months[i];
        }
    }
    JOptionPane.showMessageDialog(null, "The highest car sales value is :" + HighNum +
        "-which happened in the month of " + month);
}
Keep an index. Whenever you change the highest found, update the index.
public static void main(String[] args) {
    int[] CarSales = {1234, 2343, 1456, 4567, 8768, 2346,
                      9876, 4987, 7592, 9658, 7851, 2538};
    String[] Months = {"January", "February", "March", "April", "May", "June",
                       "July ", "August", "September", "October", "November", "December"};
    int HighNum = CarSales[0];
    int highMonth = 0;
    for (int i = 0; i < CarSales.length; i++) {
        if (CarSales[i] > HighNum) {
            HighNum = CarSales[i];
            highMonth = i;
        }
    }
    JOptionPane.showMessageDialog(null, "The highest car sales value is :" + HighNum +
        "-which happened in the month of " + Months[highMonth]);
}

Neuron results are a little off

Hi, I coded a single neuron to predict a student's mark for subject D based on the marks they got for subjects A, B and C.
After training my neuron with some historical data that contains the three marks as well as the actual mark for subject D, I then fed in test data to see how closely the predicted mark would match the actual one.
Below is my Neuron class
public class Neuron
{
    double[] Weights = new double[3];

    public Neuron(double W1, double W2, double W3)
    {
        Weights[0] = W1;
        Weights[1] = W2;
        Weights[2] = W3;
    }

    public double FnetLinear(int Z1, int Z2, int Z3)
    {
        return (Z1 * Weights[0] + Z2 * Weights[1] + Z3 * Weights[2]);
    }

    public void UpdateWeight(int i, double Wi)
    {
        Weights[i] = Wi;
    }
}
And here is my main class
public class Main
{
    public int t;
    public Neuron neuron;
    double LearningRate = 0.00001;
    public ArrayList<Marks> TrainingSet, TestSet;

    public static void main(String[] args) throws IOException
    {
        Main main = new Main();
        main.run();
    }

    public void run()
    {
        TrainingSet = ReadCSV("G:\\EVOS\\EVO_Assignemnt1\\resources\\Streamdata.csv");
        TestSet = ReadCSV("G:\\EVOS\\EVO_Assignemnt1\\resources\\Test.csv");
        Random ran = new Random();
        neuron = new Neuron(ran.nextDouble(), ran.nextDouble(), ran.nextDouble());
        train();
        Test();
    }

    public void train()
    {
        t = 0;
        while (t < 1000000)
        {
            for (Marks mark : TrainingSet)
            {
                // note: yp is recomputed inside the weight loop, so each weight
                // update after the first sees a partially updated prediction
                for (int i = 0; i < neuron.Weights.length; i++)
                {
                    double yp = neuron.FnetLinear(mark.marks[0], mark.marks[1], mark.marks[2]);
                    double wi = neuron.Weights[i] - LearningRate * (-2 * (mark.marks[3] - yp)) * mark.marks[i];
                    neuron.UpdateWeight(i, wi);
                }
            }
            t++;
        }
    }

    public void Test()
    {
        System.out.println("Test Set results:");
        int count = 1;
        for (Marks mark : TestSet)
        {
            double fnet = neuron.FnetLinear(mark.marks[0], mark.marks[1], mark.marks[2]);
            System.out.println("Mark " + count + ": " + fnet);
            count++;
        }
    }

    public static ArrayList<Marks> ReadCSV(String csv)
    {
        ArrayList<Marks> temp = new ArrayList<>();
        String line;
        BufferedReader br;
        try
        {
            br = new BufferedReader(new FileReader(csv));
            while ((line = br.readLine()) != null)
            {
                String[] n = line.split(",");
                Marks stud = new Marks(Integer.valueOf(n[0]), Integer.valueOf(n[1]),
                        Integer.valueOf(n[2]), Integer.valueOf(n[3]));
                temp.add(stud);
            }
        }
        catch (Exception e)
        {
            System.out.println("ERROR");
        }
        return temp;
    }
}
This is the test data, with the last number being the actual mark.
After running the test data I get results around these:
As you can see, the first four predicted marks are way off from the actual mark.
I followed the textbook explanation in Computational Intelligence: An Introduction (Chapter 2, if you are curious).
However, I would like to know what I am doing wrong. How can I get more accurate results?
Neural networks are very black-box-esque; because of this, it's pretty hard to say exactly why your mark predictions are way off.
That being said, here are some of the main methods of increasing the accuracy of your neural network:
Adjust the number of layers and neurons; I notice you're only using a single neuron. A single neuron in a neural network is typically just... bad. You're never going to get any good results like that. Neural networks need enough complexity in the form of layering and neuron count in order to calculate or predict whatever it is you're trying to teach it to do. A single neuron by itself really can't learn anything useful. This is also probably a big reason why your network accuracy is so bad.
Train for longer; I notice you're only training your network 1 million times; this is not always enough. For reference, the last time I trained a neural network, I used over 30 million sets of input/output.
Retrain your network with different starting weights; Randomized starting weights are great, but sometimes you just get a bad batch of starting weights. In the same project where I used 30 million input/output sets, I also tried over 25 different configurations of initial starting weights across 15 different layouts of nodes and layers.
Pick a different activation function; linear activation functions are usually not that useful. I usually default to a sigmoid function to start off, unless there are specific other functions that fit the use case I'm trying to train (see the sketch after this list).
A common pitfall that can cause low accuracy is bad training data; Make sure the training data you're using is correct and is internally consistent with whatever it is you're trying to teach.
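On the activation-function point, here is a minimal sigmoid, assuming the standard logistic definition; this is illustrative and not code from the question's classes:
// Standard logistic sigmoid: maps any real input into (0, 1)
public static double sigmoid(double z)
{
    return 1.0 / (1.0 + Math.exp(-z));
}
For a regression output such as a predicted mark, inputs and targets would typically need to be scaled into a comparable range before a sigmoid becomes useful.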
As a final note, I find myself having some trouble understanding exactly what kind of neural network you're trying to write; I've assumed this is an attempt at a feed-forward, back-propagation neural network with a single neuron in it, but most of the advice here should still apply.

How to prevent genetic algorithm from converging on local minima?

I am trying to build a 4 x 4 sudoku solver using a genetic algorithm, and I have some issues with values converging to local minima. I am using a ranked approach: I remove the two lowest-ranked answer candidates and replace them with a crossover between the two highest-ranked candidates. To further help avoid local minima, I am also using mutation. If an answer is not found within a specific number of generations, the population is refilled with completely new and random state values. However, my algorithm still seems to get stuck in local minima. As a fitness function, I am using:
(total number of open squares * 7 (the possible violations at each square: row, column, and box)) - total violations
population is an ArrayList of integer arrays in which each array is a possible end state for sudoku based on the input. Fitness is determined for each array in the population.
Would someone be able to assist me in determining why my algorithm converges on local minima or perhaps recommend a technique to use to avoid local minima. Any help is greatly appreciated.
Fitness Function:
public int[] fitnessFunction(ArrayList<int[]> population)
{
    int emptySpaces = this.blankData.size();
    int maxError = emptySpaces * 7;
    int[] fitness = new int[populationSize];
    for (int i = 0; i < population.size(); i++)
    {
        int[] temp = population.get(i);
        int value = evaluationFunc(temp);
        fitness[i] = maxError - value;
        System.out.println("Fitness(i)" + fitness[i]);
    }
    return fitness;
}
Crossover Function:
public void crossover(ArrayList<int[]> population, int indexWeakest, int indexStrong, int indexSecStrong, int indexSecWeak)
{
    int[] tempStrong = population.get(indexStrong);
    int[] tempSecStrong = population.get(indexSecStrong);
    int[] tempWeak = population.get(indexWeakest);
    int[] tempSecWeak = population.get(indexSecWeak);
    // note: after the first remove the later indices shift by one,
    // so the second remove may delete the wrong individual
    population.remove(indexWeakest);
    population.remove(indexSecWeak);
    int crossoverSite = random.nextInt(14) + 1;
    for (int i = 0; i < tempWeak.length; i++)
    {
        if (i < crossoverSite)
        {
            tempWeak[i] = tempStrong[i];
            tempSecWeak[i] = tempSecStrong[i];
        }
        else
        {
            tempWeak[i] = tempSecStrong[i];
            tempSecWeak[i] = tempStrong[i];
        }
    }
    mutation(tempWeak);
    mutation(tempSecWeak);
    population.add(tempWeak);
    population.add(tempSecWeak);
    for (int j = 0; j < tempWeak.length; j++)
    {
        System.out.print(tempWeak[j] + ", ");
    }
    for (int j = 0; j < tempSecWeak.length; j++)
    {
        System.out.print(tempSecWeak[j] + ", ");
    }
}
Mutation Function:
public void mutation(int[] mutate)
{
    if (this.blankData.size() > 2)
    {
        Blank blank = this.blankData.get(0);
        int x = blank.getPosition();
        Blank blank2 = this.blankData.get(1);
        int y = blank2.getPosition();
        Blank blank3 = this.blankData.get(2);
        int z = blank3.getPosition();
        int rando = random.nextInt(4) + 1;
        if (rando == 2)
        {
            int rando2 = random.nextInt(4) + 1;
            mutate[x] = rando2;
        }
        if (rando == 3)
        {
            int rando2 = random.nextInt(4) + 1;
            mutate[y] = rando2;
        }
        if (rando == 4)
        {
            int rando3 = random.nextInt(4) + 1;
            mutate[z] = rando3;
        }
    }
} // closing brace added; the original snippet was missing it
The reason you see rapid convergence is that your methodology for "mating" is not very good. You always produce two offspring from the "mating" of the top two scoring individuals. Imagine what happens when one of the new offspring is the same as your top individual (by chance, or because neither crossover nor mutation had an effect on the fitness). Once this occurs, the top two individuals are identical, which eliminates the effectiveness of crossover.
A more typical approach is to replace EVERY individual on every generation. There are lots of possible variations here, but you might, for example, make a random choice of two parents weighted by fitness, as in the sketch below.
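A minimal sketch of fitness-weighted ("roulette wheel") parent selection, assuming non-negative fitness values such as the ones your fitness function produces (the method name is illustrative):
import java.util.Random;

// Pick a parent index with probability proportional to its fitness.
public static int pickParent(int[] fitness, Random random)
{
    long total = 0;
    for (int f : fitness)
        total += f;
    long spin = (long) (random.nextDouble() * total);
    for (int i = 0; i < fitness.length; i++)
    {
        spin -= fitness[i];
        if (spin < 0)
            return i;
    }
    return fitness.length - 1; // guard against rounding at the upper edge
}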
Regarding population size: I don't know how hard of a problem sudoku is given your genetic representation and fitness function, but I suggest that you think about millions of individuals, not dozens.
If you are working on really hard problems, genetic algorithms are much more effective when you place your population on a 2-D grid and choose the "parents" for each point in the grid from the nearby individuals. You will get local convergence, but each locality will have converged on different solutions, and you get a huge amount of variation produced at the borders between the locally converged areas of the grid.
Another technique you might think about is running to convergence from random populations many times and storing the top individual from each run. After you have built up a bunch of different local-minimum genomes, build a new random population from those top individuals.
I think Sudoku is a permutation problem; therefore, I suggest initializing the population with random permutations and using a crossover method that is compatible with permutation problems.
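For illustration, here is a hedged sketch of permutation-based initialization for a 4 x 4 grid, where each row starts as a shuffled permutation of 1..4 (a real solver would also have to respect the puzzle's given clues, which this sketch ignores):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Build one candidate whose rows are each a permutation of 1..4, so row
// violations are impossible by construction; crossover then has to preserve
// that property (for example, an order-based crossover).
static int[] randomCandidate()
{
    int[] candidate = new int[16];
    for (int row = 0; row < 4; row++)
    {
        List<Integer> digits = new ArrayList<>(List.of(1, 2, 3, 4));
        Collections.shuffle(digits);
        for (int col = 0; col < 4; col++)
            candidate[row * 4 + col] = digits.get(col);
    }
    return candidate;
}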

Coefficient Correlation Over a Large Binary Image Data-Set - Slow Performance

I am trying to build an OCR by calculating the correlation coefficient between characters extracted from an image and every character I have pre-stored in a database. My implementation is based on Java, and the pre-stored characters are loaded into an ArrayList at application startup, i.e.
ArrayList<byte[]> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();

// Calculate the coefficient between every extracted character
// and every character in the database.
double maxCorr = -1;
for (byte[] extractedCharacter : extractedCharacters)
    for (byte[] storedCharacter : storedCharacters)
    {
        double corr = findCorrelation(extractedCharacter, storedCharacter);
        if (corr > maxCorr)
            maxCorr = corr;
    }
...
...
public double findCorrelation(byte[] extractedCharacter, byte[] storedCharacter)
{
    double mag1 = 0, mag2 = 0, corr = 0; // mag1 and mag2 must be initialized before use
    for (int i = 0; i < extractedCharacter.length; i++)
    {
        mag1 += extractedCharacter[i] * extractedCharacter[i];
        mag2 += storedCharacter[i] * storedCharacter[i];
        corr += extractedCharacter[i] * storedCharacter[i];
    } // for
    corr /= Math.sqrt(mag1 * mag2);
    return corr;
}
The number of extractedCharacters is around 100-150 per image, but the database has 15600 stored binary characters. Checking the correlation coefficient between every extracted character and every stored character has an impact on performance: it needs around 15-20 seconds per image on an Intel i5 CPU.
Is there a way to improve the speed of this program, or can you suggest another way of building this that brings similar results? (The results produced by comparing every character against such a large dataset are quite good.)
Thank you in advance
UPDATE 1
public static void run() {
    ArrayList<byte[]> storedCharacters, extractedCharacters;
    storedCharacters = load_all_characters_from_database();
    extractedCharacters = extract_characters_from_image();

    // Calculate the coefficient between every extracted character
    // and every character in the database.
    computeNorms(storedCharacters, extractedCharacters);
    double maxCorr = -1;
    for (int ext = 0; ext < extractedCharacters.size(); ext++)
        for (int str = 0; str < storedCharacters.size(); str++)
        {
            double corr = findCorrelation(extractedCharacters.get(ext), storedCharacters.get(str), str, ext);
            if (corr > maxCorr)
                maxCorr = corr;
        }
}

private static double[] storedNorms;
private static double[] extractedNorms;

// Correlation between two binary images
public static double findCorrelation(byte[] arr1, byte[] arr2, int strCharIndex, int extCharNo) {
    final int dotProduct = dotProduct(arr1, arr2);
    final double corr = dotProduct * storedNorms[strCharIndex] * extractedNorms[extCharNo];
    return corr;
}

public static void computeNorms(ArrayList<byte[]> storedCharacters, ArrayList<byte[]> extractedCharacters) {
    storedNorms = computeInvNorms(storedCharacters);
    extractedNorms = computeInvNorms(extractedCharacters);
}

private static double[] computeInvNorms(List<byte[]> a) {
    final double[] result = new double[a.size()];
    for (int i = 0; i < result.length; ++i)
        result[i] = 1 / Math.sqrt(dotProduct(a.get(i), a.get(i)));
    return result;
}

private static int dotProduct(byte[] arr1, byte[] arr2) {
    int dotProduct = 0;
    for (int i = 0; i < arr1.length; i++)
        dotProduct += arr1[i] * arr2[i];
    return dotProduct;
}
Nowadays it's hard to find a CPU with a single core (even in mobiles). As the tasks are nicely separated, you can parallelize them with only a few lines. So I'd go for it, though the gain is limited.
In case you really mean cross-correlation, then a transform like the DFT or DCT could help. They surely do for big images, but with yours at 12x16, I'm not sure.
Maybe you mean just a dot product? And maybe you should tell us?
Note that you actually don't need to compute the correlation; most of the time all you need is to find out whether it's bigger than a threshold:
corr = findCorrelation(extractedCharacter, storedCharacter)
..... more code to check if this is the best match ......
This may lead to some optimizations or not, depending on what the images look like.
Note also that a simple low-level optimization can give you nearly a factor of 4, as in this question of mine. Maybe you really should tell us what you're doing?
UPDATE 1
I guess that due to the computation of three products in the loop, there's enough instruction-level parallelism, so a manual loop unrolling like in my question above is not necessary.
However, I see that those three products get computed some 100 * 15600 times, while only one of them depends on both extractedCharacter and storedCharacter. So you can compute
100 + 15600 + 100 * 15600
dot products instead of
3 * 100 * 15600
This way you may get a factor of three pretty easily.
Or not: after this step there's a single sum computed in the relevant loop, so the problem linked above applies. And so does its solution (manual unrolling).
Factor 5.2
While byte[] is nicely compact, the computation involves extending the bytes to ints, which costs some time, as my benchmark shows. Converting the byte[]s to int[]s before all the correlations get computed saves time. Even better is to make use of the fact that this conversion can be done beforehand for the storedCharacters.
Manually unrolling the loop twice helps, but unrolling more doesn't.
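A sketch of those two ideas; the names are illustrative, and the quoted speedups come from the answerer's benchmark, not from this sketch:
// Widen each byte[] to an int[] once, up front, so the byte-to-int
// extension is not redone inside every correlation.
private static int[] widen(byte[] a) {
    int[] result = new int[a.length];
    for (int i = 0; i < a.length; i++)
        result[i] = a[i];
    return result;
}

// Dot product with the loop manually unrolled twice, using two accumulators.
private static int dotProduct(int[] a, int[] b) {
    int s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < a.length)
        s0 += a[i] * b[i]; // odd-length tail
    return s0 + s1;
}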

How do I know that my neural network is being trained correctly

I've written an Adaline neural network. Everything that I have compiles, so I know there isn't a syntax problem with what I've written, but how do I know that I have the algorithm correct? When I try training the network, my computer just says the application is running, and it keeps going; after about 2 minutes I stopped it.
Does training normally take this long (I have 10 parameters and 669 observations)?
Do I just need to let it run longer?
Here is my train method
public void trainNetwork()
{
    int good = 0;
    // train until all patterns are good
    // (note: good accumulates across passes and is never reset)
    while (good < trainingData.size())
    {
        for (int i = 0; i < trainingData.size(); i++)
        {
            this.setInputNodeValues(trainingData.get(i));
            adalineNode.run();
            if (nodeList.get(nodeList.size() - 1).getValue(Constants.NODE_VALUE) != adalineNode.getValue(Constants.NODE_VALUE))
            {
                adalineNode.learn();
            }
            else
            {
                good++;
            }
        }
    }
}
And here is my learn method
public void learn()
{
    Double nodeValue = value.get(Constants.NODE_VALUE);
    double nodeError = nodeValue * -2.0;
    error.put(Constants.NODE_ERROR, nodeError);
    int count = inLinks.size();
    double delta;
    for (int i = 0; i < count; i++)
    {
        BaseLink link = inLinks.get(i);
        Double learningRate = value.get(Constants.LEARNING_RATE);
        Double inValue = link.getInValue(Constants.NODE_VALUE);
        delta = learningRate * inValue * nodeError;
        link.updateWeight(delta);
    }
}
And here is my run method
public void run()
{
    double total = 0;
    // find out how many input links there are
    int count = inLinks.size();
    // note: the loop stops at count - 1, so the last in-link never contributes
    for (int i = 0; i < count - 1; i++)
    {
        // grab a specific link in sequence
        BaseLink specificInLink = inLinks.get(i);
        Double weightedValue = specificInLink.weightedInValue(Constants.NODE_VALUE);
        total += weightedValue;
    }
    this.setValue(Constants.NODE_VALUE, this.transferFunction(total));
}
These functions are part of a library that I'm writing. I have the entire thing on GitHub here. Now that everything is written, I just don't know how I should go about actually testing to make sure that the training method is correct.
I asked a similar question a few months ago.
Ten parameters with 669 observations is not a large data set, so there is probably an issue with your algorithm. There are two things you can do that will make debugging your algorithm much easier (a runnable sketch follows this list):
Print the sum of squared errors at the end of each iteration. This will help you determine if the algorithm is converging (at all), stuck at a local minimum, or just very slowly converging.
Test your code on a simple data set. Pick something easy, like a two-dimensional input that you know is linearly separable. Will your algorithm learn a simple AND function of two inputs? If so, will it learn an XOR function (2 inputs, 2 hidden nodes, 2 outputs)?
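To make both points concrete, here is a self-contained, hedged sketch of an Adaline-style trainer on an AND-like data set: it logs the sum of squared errors per epoch and stops on a tolerance rather than exact equality. All names are illustrative, not from the question's library:
import java.util.Random;

public class AdalineSketch {
    public static void main(String[] args) {
        // Tiny linearly separable AND-style data set: {x1, x2, target}
        double[][] data = { {0, 0, -1}, {0, 1, -1}, {1, 0, -1}, {1, 1, 1} };
        double[] w = new double[3]; // w[0] is the bias weight
        double rate = 0.1;
        Random random = new Random(42);
        for (int i = 0; i < w.length; i++)
            w[i] = random.nextDouble() - 0.5;

        double previousSse = Double.MAX_VALUE;
        for (int epoch = 0; epoch < 1000; epoch++) {
            double sse = 0;
            for (double[] row : data) {
                double net = w[0] + w[1] * row[0] + w[2] * row[1];
                double error = row[2] - net; // Adaline learns on the raw net value
                sse += error * error;
                w[0] += rate * error;        // LMS / delta-rule update
                w[1] += rate * error * row[0];
                w[2] += rate * error * row[1];
            }
            System.out.println("epoch " + epoch + " SSE " + sse);
            if (Math.abs(previousSse - sse) < 1e-9)
                break; // stop on convergence, not on exact equality
            previousSse = sse;
        }
    }
}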
You should add debug/test messages to watch whether the weights are saturating and converging. It is likely that good < trainingData.size() never becomes false.
Based on Double nodeValue = value.get(Constants.NODE_VALUE); I assume NODE_VALUE is of type Double? If that's the case, then the line nodeList.get(nodeList.size()-1).getValue(Constants.NODE_VALUE) != adalineNode.getValue(Constants.NODE_VALUE) may never converge, since it demands exact equality of doubles whose values depend on many other parameters (and if getValue returns boxed Double objects, != compares references rather than values). Typically, while training a neural network, you stop when the error is within an acceptable limit, not on a strict equality check.
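For instance, a tolerance-based comparison might look like this (the epsilon value is an assumption you would tune for your data):
// Compare two doubles within a tolerance instead of with != or ==
private static final double EPSILON = 1e-9; // illustrative tolerance
static boolean closeEnough(double a, double b)
{
    return Math.abs(a - b) < EPSILON;
}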
Hope this helps
