Learning the boolean AND function using a perceptron

I'm new to machine learning. I've written this code (http://ideone.com/t9VOag) to train a perceptron to learn the boolean AND function using the perceptron training rule.
The perceptron never learns the correct weights. Errors for the inputs (1, -1) and (-1, 1) make the weights oscillate between 0.7999999999999999, 0.20000000000000004 and 0.7, 0.300000000000000, which is to be expected because:
For input 1, -1
target output - output given = 0-1 = -1
w1 = w1 + n*(t-o)*1 = w1 - n
w2 = w2 + n*(t-o)*(-1) = w2 + n
for input -1, 1
t-o = 0-1 = -1
w1 = w1 + (n)(t-o)(-1) = w1 + n
w2 = w2 + (n)(t-o)(1) = w2 - n
The weights are increased and then decreased by the same amount, so they just oscillate.
If I also update the weight w0 during learning, it reaches a solution (but w0 isn't supposed to be updated?).
What is the correct implementation?

Take w0 out of the code altogether. Your perceptron should have 2 input nodes and 1 output node, with a single weight connecting each input node to the output node. Like this (excuse the bad ASCII art):
I1
  \
   \W1
    \
     Out
    /
   /W2
  /
I2
You are effectively feeding in a strong bias by setting W0 to 1.

Contrary to your statement "but w0 isn't supposed to be updated", w0 is supposed to be updated. What stays fixed is the input to w0, which is always your constant bias (for example 1).
Intuition: look at your problem. You have two inputs, each of which can be either 1 or -1, and they can swap positions without affecting the result. This is the nature of the "and" operator. Therefore, w1 should be equal to w2 and bias weight (w0) should be zero.
Briefly, your code is correct and you should just uncomment the line that updates w0.
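For illustration, here is a minimal Python sketch of the perceptron training rule with the bias weight w0 updated like any other weight. It is not the code from the ideone link. The zero initial weights, the learning rate of 0.1, the "output 1 if the weighted sum is greater than 0" threshold and the function names are all assumptions made for this example.

```python
def predict(w, x):
    # x already contains the constant bias input 1 at index 0
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 0

def train_and(eta=0.1, max_epochs=100):
    # (bias input, x1, x2) -> target, with inputs in {-1, 1} and targets in {0, 1}
    data = [((1, -1, -1), 0), ((1, -1, 1), 0), ((1, 1, -1), 0), ((1, 1, 1), 1)]
    w = [0.0, 0.0, 0.0]  # w0 (bias weight), w1, w2
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            o = predict(w, x)
            if o != t:
                errors += 1
                # perceptron rule: w_i = w_i + eta * (t - o) * x_i, including w0
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if errors == 0:  # a full pass without mistakes: training is done
            break
    return w

weights = train_and()
for a, b in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((a, b), predict(weights, (1, a, b)))
```

Because w0 takes part in the update, the (1, -1) and (-1, 1) corrections no longer simply cancel each other out.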

Related

How to find the point that gives the maximum value fast? Java or c++ code please

I need a fast way to find the maximum value when intervals overlap. Unlike finding the point covered by the most intervals, there is an "order" here. I have an int[][] data in which each int[] holds 2 values: the first number is the center and the second is the radius. The closer a point is to the center, the larger the value at that point. For example, if I am given data like:
int[][] data = new int[][]{
{1, 1},
{3, 3},
{2, 4}};
Then on a number line, this is how it looks:
x axis: -2 -1  0  1  2  3  4  5  6  7
1 1:           1  2  1
3 3:           1  2  3  4  3  2  1
2 4:     1  2  3  4  5  4  3  2  1
So for the value at my point to be as large as possible, I need to pick the point x = 2, which gives a total value of 1 + 3 + 5 = 9, the largest possible value. Is there a way to do it fast? Like a time complexity of O(n) or O(n log n)?

This can be done with a simple O(n log n) algorithm.
Consider the value function v(x), and then consider its discrete derivative dv(x) = v(x) - v(x-1). Suppose you only have one interval, say {3,3}. dv(x) is 0 from -infinity to -1, then 1 from 0 to 3, then -1 from 4 to 7, then 0 from 8 to infinity. That is, the derivative changes by 1 "just after" -1, by -2 just after 3, and by 1 just after 7.
For n intervals, there are 3*n derivative changes (some of which may occur at the same point). So find the list of all derivative changes (x, change), sort them by x, and then just iterate through the list.
Behold:
intervals = [(1,1), (3,3), (2,4)]
events = []
for mid, width in intervals:
    before_start = mid - width - 1
    at_end = mid + width + 1  # first x to the right of the peak where v(x) is back to 0
    events += [(before_start, 1), (mid, -2), (at_end, 1)]
events.sort()
prev_x = -1000
v = 0
dv = 0
best_v = -1000
best_x = None
for x, change in events:
    dx = x - prev_x
    v += dv * dx  # value at x, using the slope that held since the previous event
    if v > best_v:
        best_v = v
        best_x = x
    dv += change  # the new slope applies from x + 1 onwards
    prev_x = x
print(best_x, best_v)
And also the Java code:
import java.util.Map;
import java.util.TreeMap;
TreeMap<Integer, Integer> ts = new TreeMap<Integer, Integer>();
for (int i = 0; i < cows.size(); i++) {
    int center = cows.get(i)[0];
    int radius = cows.get(i)[1];
    // Slope changes: +1 where the triangle starts, -2 after the peak, +1 after it ends.
    ts.merge(center - radius, 1, Integer::sum);
    ts.merge(center + 1, -2, Integer::sum);
    ts.merge(center + radius + 2, 1, Integer::sum);
}
int value = 0;
int best = 0;
int change = 0;
int indexBefore = ts.isEmpty() ? 0 : ts.firstKey();
for (Map.Entry<Integer, Integer> e : ts.entrySet()) {
    int index = e.getKey();
    // Advance the running value by the current slope over the gap since the last event.
    // With these keys, value is the total at the point index - 1.
    value += (index - indexBefore) * change;
    best = Math.max(value, best);
    change += e.getValue();
    indexBefore = index;
}
where cows is the data
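A brute-force evaluation of the value function is a handy cross-check for the sweep-based answers above. The sketch below assumes, as in the question's number line, that an entry (center, radius) contributes radius + 1 at its center and one less for each step away from it, never going below 0. The function names are made up for this example.

```python
def value_at(x, intervals):
    # Each (center, radius) contributes radius + 1 at its center,
    # dropping by 1 per unit of distance, and nothing once it reaches 0.
    return sum(max(0, r + 1 - abs(x - c)) for c, r in intervals)

def brute_force_best(intervals):
    # Only integer points between the leftmost and rightmost covered
    # positions can have a non-zero value.
    lo = min(c - r for c, r in intervals)
    hi = max(c + r for c, r in intervals)
    # Returns (best value, point); ties go to the larger point.
    return max((value_at(x, intervals), x) for x in range(lo, hi + 1))

print(brute_force_best([(1, 1), (3, 3), (2, 4)]))  # (9, 2)
```

This is O(n * range), so it is only useful for testing a faster implementation on small inputs.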

Hmmm, a general O(n log n) or better solution would be tricky. It is probably solvable via linear programming, but that can get rather complex.
After a bit of wrangling, I think this can be solved via line intersections and summation of functions (represented by line segment intersections). Basically, think of each input as a triangle on top of a line. If the input is (C,R), the triangle is centered on C and has a radius of R. The points on the line are C-R (value 0), C (value R) and C+R (value 0). Each line segment of the triangle represents a value.
Consider any 2 such "triangles". The max value occurs in one of 2 places:
The peak of one of the triangles
The intersection point of the triangles, or the point where the two triangles overlap
Multiple triangles just mean more possible intersection points. Sadly, the number of possible intersections grows quadratically, so O(N log N) or better may be impossible with this method unless some good optimizations are found or the number of intersections is O(N) or less.
To find all the intersection points, we can use a standard algorithm, modified in one specific way: we add a vertical line that extends from each peak high enough to be above any other line, so basically from (C,R) to (C,Max_R). We then run the algorithm. Output-sensitive intersection-finding algorithms are O(N log N + k), where k is the number of intersections. Sadly, k can be as high as O(N^2): consider the case (1,100), (2,100), (3,100) and so on up to (50,100), where every line intersects every other line. Once you have the O(N + k) intersections, you can calculate the value at each one by summing all the lines within the queue. The running sum can be kept as a cached value so that it only changes O(k) times, though that might not be possible, in which case the cost would be O(N*k) instead, making it potentially O(N^3) in the worst case for k :(. That seems reasonable, though: for each intersection you need to sum up to O(N) lines to get the value for that point, and in practice the performance would likely be better.
There are optimizations that could be done, considering that you aim for the max and not just to find intersections. There are likely intersections not worth pursuing; however, I could also see a situation where the values are so close that you can't cut the candidates down. This reminds me of convex hull: in many cases you can easily discard 90% of the data, but there are cases where you hit the worst case (every point, or almost every point, is a hull point). For example, in practice there are certainly cases where you can be sure that the sum is going to be less than the current known max value.
Another optimization might be building an interval tree.

Why adding from biggest to smallest floating-point numbers is less accurate than adding from smallest to biggest?

My Java textbook states that adding from biggest to smallest is less accurate than adding from smallest to biggest when dealing with floating-point numbers. However, it doesn't go on to clearly explain why this is the case.

Floating point has a limited number of digits of precision (6 for float, 15 for double). The calculation
1.0e20d + 1
gives the result 1.0e20 because there is not enough precision to represent the number
100,000,000,000,000,000,001
If you start with the largest number then any numbers more than n orders of magnitude smaller (where n is 6 or 15 depending on type) will not contribute to the sum at all. Start with the smallest and you might sum several smaller numbers into one that will affect the final total.
Where it would make a difference is, for example
1.0e20 + 1.0e4 + 6.0e4 + 3.0e4
Assuming it's exactly 15 decimal digits precision (it's not, see the linked article below, but 15 is good enough for the example), if you start with the larger number, none of the others will make a difference because they're too small. If you start with the smaller ones, they add up to 1.0e5, which IS large enough to affect the final total.
Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic
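To see the effect with real IEEE 754 doubles, here is a small Python sketch (Python floats are doubles, like Java's double). The values 1e16 and 1.0 are chosen for this example and are not from the textbook: at 1e16 the spacing between neighbouring doubles is 2, so a single 1.0 added to it is rounded away.

```python
import math

big = 1e16
small = [1.0, 1.0]

# Largest first: each 1.0 is rounded away when it is added to 1e16.
large_first = big
for s in small:
    large_first += s

# Smallest first: the two 1.0 values combine into 2.0, which is big enough to register.
small_first = 0.0
for s in small:
    small_first += s
small_first += big

print(large_first == big)                        # True
print(small_first - large_first)                 # 2.0
print(math.fsum([big] + small) == small_first)   # True: fsum returns the correctly rounded sum
```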

An excellent explanation is available in section 4.2 of "Accuracy and Stability of Numerical Algorithms" by Nick Higham. Below is my casual interpretation of this:
The key property of floating point is that when the result of an individual operation cannot be exactly represented, it is rounded to the nearest value. This has many consequences, one of which is that addition (and multiplication) is no longer associative.
The other main thing to note is that the error (the difference between the true value and the rounded value) is relative. If we use square brackets ([]) to denote this rounding operation, then we have the following property for any number x:
|[x] - x| <= ϵ |[x]| / 2
Where ϵ is the machine epsilon.
So suppose that we want to sum up [x1, x2, x3, x4]. The obvious way to do it is via
s2 = x1 + x2
s3 = s2 + x3 = x1 + x2 + x3
s4 = s3 + x4 = x1 + x2 + x3 + x4
As noted above, we can't do this exactly, so we're actually doing:
t2 = [x1 + x2]
t3 = [t2 + x3] = [[x1 + x2] + x3]
t4 = [t3 + x4] = [[[x1 + x2] + x3] +x4]
So how big is the resulting error |t4 - s4|? Well we know that
|t2 - s2| = |[x1+x2] - (x1+x2)| <= ϵ/2 |t2|
Now by the Triangle inequality we can write
|t3 - s3| = |[t2+x3] - (t2+x3) + (t2+x3) - (s2+x3)|
<= |[t2+x3] - (t2+x3)| + |t2 - s2|
<= ϵ/2 (|t3| + |t2|)
And again:
|t4 - s4| = |[t3+x4] - (t3+x4) + (t3+x4) - (s3+x4)|
<= |[t3+x4] - (t3+x4)| + |t3 - s3|
<= ϵ/2 (|t4| + |t3| + |t2|)
This leads to Higham's general advice:
In designing or choosing a summation method to achieve high accuracy, the aim should be to minimize the absolute values of the intermediate sums ti.
So if you're doing sequential summation (like we did above), then you want to start with the smallest elements, as that will give you the smallest intermediate sums.
But that is not the only option! There is also pairwise summation, where you add up pairs in a tree form (e.g. [[x1 + x2] + [x3 + x4]]), though this requires allocating a work array. You can also utilise SIMD vectorisation, by storing the intermediate sum in a vector, which can give both speed and accuracy improvements.
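As a rough illustration of pairwise summation, here is a recursive Python sketch. The function name is made up, and this variant recurses over index ranges rather than allocating a work array, so it is one possible way to do it and not the method from Higham's book.

```python
def pairwise_sum(xs, lo=0, hi=None):
    # Sum xs[lo:hi] by splitting the range in half and adding the two partial sums.
    # The intermediate sums grow with the depth of the tree (about log2(n))
    # instead of with n, as they do in a left-to-right loop.
    if hi is None:
        hi = len(xs)
    if hi - lo == 0:
        return 0.0
    if hi - lo == 1:
        return xs[lo]
    mid = (lo + hi) // 2
    return pairwise_sum(xs, lo, mid) + pairwise_sum(xs, mid, hi)

print(pairwise_sum([1.0, 2.0, 3.0, 4.0]))  # 10.0
```

Practical implementations usually switch to a plain loop below some block size to avoid the recursion overhead.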

Directed probability graph - algorithm to reduce cycles?

Consider a directed graph which is traversed from first node 1 to some final nodes (which have no more outgoing edges). Each edge in the graph has a probability associated with it. Summing up the probabilities to take each possible path towards all possible final nodes returns 1. (Which means, we are guaranteed to arrive at one of the final nodes eventually.)
The problem would be simple if there were no loops in the graph. Unfortunately, rather convoluted loops can arise in the graph, and these can be traversed an infinite number of times (the probability decreases multiplicatively with each loop traversal, obviously).
Is there a general algorithm to find the probabilities to arrive at each of the final nodes?
A particularly nasty example:
We can represent the edges as a matrix (the probability of going from node x to node y is the entry in row x, column y):
{{0, 1/2, 0, 1/14, 1/14, 0, 5/14},
{0, 0, 1/9, 1/2, 0, 7/18, 0},
{1/8, 7/16, 0, 3/16, 1/8, 0, 1/8},
{0, 1, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0}}
Or as a directed graph:
The starting node 1 is blue, the final nodes 5,6,7 are green. All edges are labelled by the probability to traverse them when starting from the node where they originate.
This has eight different paths from starting node 1 to the final nodes:
{{1/14, {1, 5}}, {5/14, {1, 7}}, {7/36, {1, 2, 6}},
{1/144, {1, 2, 3, 5}}, {1/144, {1, 2, 3, 7}},
{1/36, {1, 4, 2, 6}}, {1/1008, {1, 4, 2, 3, 5}}, {1/1008, {1, 4, 2, 3, 7}}}
(The notation for each path is {probability,sequence of nodes visited})
And there are five distinct loops:
{{1/144, {2, 3, 1}}, {7/144, {3, 2}}, {1/2, {4, 2}},
{1/48, {3, 4, 2}}, {1/1008, {4, 2, 3, 1}}}
(Notation for loops is {probability to traverse loop once,sequence of nodes visited}).
If only these cycles could be resolved to obtain an effectively tree like graph, the problem would be solved.
Any hint on how to tackle this?
I'm familiar with Java, C++ and C, so suggestions in these languages are preferred.

I'm not an expert in the area of Markov chains, and although I think it's likely that algorithms are known for the kind of problem you present, I'm having difficulty finding them.
If no help comes from that direction, then you can consider rolling your own. I see at least two different approaches here:
Simulation.
Examine how the state of the system evolves over time by starting with the system in state 1 at 100% probability, and performing many iterations in which you apply your transition probabilities to compute the probabilities of the state obtained after taking a step. If at least one final ("absorbing") node can be reached (at non-zero probability) from every node, then over enough steps, the probability that the system is in anything other than a final state will decrease asymptotically toward zero. You can estimate the probability that the system ends in final state S as the probability that it is in state S after n steps, with an upper bound on the error in that estimate given by the probability that the system is in a non-final state after n steps.
As a practical matter, this is the same as computing Tr^n, where Tr is your transition probability matrix, augmented with self-edges at 100% probability for all the final states.
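A minimal Python sketch of this simulation approach, using the matrix from the question with 0-based node indices (so the final nodes 5, 6, 7 are indices 4, 5, 6). The function name, the tolerance and the step cap are assumptions for this example.

```python
def absorb_by_iteration(P, start=0, tol=1e-12, max_steps=10_000):
    n = len(P)
    # Final nodes are the rows with no outgoing probability.
    final = [i for i in range(n) if not any(P[i])]
    # Tr: P augmented with a self-edge of probability 1 on every final node.
    Tr = [row[:] for row in P]
    for f in final:
        Tr[f][f] = 1.0
    dist = [0.0] * n
    dist[start] = 1.0  # start in the first node with 100% probability
    for _ in range(max_steps):
        # One step: new_dist[j] = sum over i of dist[i] * Tr[i][j]
        dist = [sum(dist[i] * Tr[i][j] for i in range(n)) for j in range(n)]
        # Probability still sitting in non-final nodes bounds the error.
        in_transit = sum(dist[i] for i in range(n) if i not in final)
        if in_transit < tol:
            break
    return {f: dist[f] for f in final}

P = [
    [0, 1/2, 0, 1/14, 1/14, 0, 5/14],
    [0, 0, 1/9, 1/2, 0, 7/18, 0],
    [1/8, 7/16, 0, 3/16, 1/8, 0, 1/8],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
result = absorb_by_iteration(P)
print(result)
```

The loop stops once the probability mass left in non-final nodes drops below tol, which is exactly the error bound described above.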
Exact computation.
Consider a graph, G, such as you describe. Given two vertices i and f, such that there is at least one path from i to f, and f has no outgoing edges other than self-edges, we can partition the paths from i to f into classes characterized by the number of times they revisit i prior to reaching f. There may be an infinite number of such classes, which I will designate C_if(n), where n represents the number of times the paths in C_if(n) revisit node i. In particular, C_ii(0) contains all the simple loops in G that contain i (clarification: as well as other paths).
The total probability of ending at node f given that the system traverses graph G starting at node i is given by
Pr(f|i,G) = Pr(C_if(0)|G) + Pr(C_if(1)|G) + Pr(C_if(2)|G) + ...
Now observe that if n > 0 then each path in C_if(n) has the form of a union of two paths c and t, where c belongs to C_ii(n-1) and t belongs to C_if(0). That is, c is a path that starts at node i and ends at node i, passing through i n-1 times in between, and t is a path from i to f that does not pass through i again. We can use that to rewrite our probability formula:
Pr(f|i,G) = Pr(C_if(0)|G) + Pr(C_ii(0)|G) * Pr(C_if(0)|G) + Pr(C_ii(1)|G) * Pr(C_if(0)|G) + ...
But note that every path in C_ii(n) is a composition of n+1 paths belonging to C_ii(0). It follows that Pr(C_ii(n)|G) = Pr(C_ii(0)|G)^(n+1), so we get
Pr(f|i,G) = Pr(C_if(0)|G) + Pr(C_ii(0)|G) * Pr(C_if(0)|G) + Pr(C_ii(0)|G)^2 * Pr(C_if(0)|G) + ...
And now, a little algebra gives us
Pr(f|i,G) - Pr(C_if(0)|G) = Pr(C_ii(0)|G) * Pr(f|i,G)
which we can solve for Pr(f|i,G) to get
Pr(f|i,G) = Pr(C_if(0)|G) / (1 - Pr(C_ii(0)|G))
We've thus reduced the problem to one in terms of paths that do not return to the starting node, except possibly as their end node. These do not preclude paths that have loops that don't include the starting node, but we can nevertheless rewrite this problem in terms of several instances of the original problem, computed on a subgraph of the original graph.
In particular, let S(i, G) be the set of successors of vertex i in graph G -- that is, the set of vertices s such that there is an edge from i to s in G, and let X(G,i) be the subgraph of G formed by removing all edges that start at i. Furthermore, let p_is be the probability associated with edge (i, s) in G.
Pr(C_if(0)|G) = Sum over s in S(i, G) of p_is * Pr(f|s,X(G,i))
In other words, the probability of reaching f from i through G without revisiting i in between is the sum over all successors of i of the product of the probability of reaching s from i in one step with the probability of reaching f from s through G without traversing any edges outbound from i. That applies for all f in G, including i.
Now observe that S(i, G) and all the p_is are knowns, and that the problem of computing Pr(f|s,X(G,i)) is a new, strictly smaller instance of the original problem. Thus, this computation can be performed recursively, and such a recursion is guaranteed to terminate. It may nevertheless take a long time if your graph is complex, and it looks like a naive implementation of this recursive approach would scale exponentially in the number of nodes. There are ways you could speed the computation in exchange for higher memory usage (i.e. memoization).
There are likely other possibilities as well. For example, I'm suspicious that there may be a bottom-up dynamic programming approach to a solution, but I haven't been able to convince myself that loops in the graph don't present an insurmountable problem there.

Problem Clarification
The input data is a set of m rows of n columns of probabilities, essentially an m by n matrix, where m = n = number of vertices on a directed graph. Rows are edge origins and columns are edge destinations. We will assume, on the basis of the mention of cycles in the question, that the graph is cyclic, that is, that at least one cycle exists in the graph.
Let's define the starting vertex as s. Let's also define a terminal vertex as a vertex for which there are no exiting edges and the set of them as set T with size z. Therefore we have z sets of routes from s to a vertex in T, and the set sizes may be infinite due to cycles [1]. In such a scenario, one cannot conclude that a terminal vertex will be reached in an arbitrarily large number of steps.
In the input data, probabilities for rows that correspond with vertices not in T are normalized to total to 1.0. We shall assume the Markov property, that the probabilities at each vertex do not vary with time. This precludes the use of probability to prioritize routes in a graph search [2].
Finite math texts sometimes name example problems similar to this question as Drunken Random Walks to underscore the fact that the walker forgets the past, referring to the memory-free nature of Markovian chains.
Applying Probability to Routes
The probability of arriving at a terminal vertex can be expressed as an infinite series sum of products.
P_t = lim_{s -> ∞} Σ ∏ P_{i,j},
where s is the step index, t is a terminal vertex index, i ∈ [1 .. m] and j ∈ [1 .. n]
Reduction
When two or more cycles intersect (sharing one or more vertices), analysis is complicated by an infinite set of patterns involving them. It appears, after some analysis and review of relevant academic work, that arriving at an accurate set of terminal vertex arrival probabilities with today's mathematical tools may best be accomplished with a converging algorithm.
A few initial reductions are possible.
The first consideration is to enumerate the destination vertices, which is easy since the corresponding rows have probabilities of zero.
The next consideration is to differentiate any further reductions from what the academic literature calls irreducible sub-graphs. The below depth first algorithm remembers which vertices have already been visited while constructing a potential route, so it can be easily retrofitted to identify which vertices are involved in cycles. However it is recommended to use existing well tested, peer reviewed graph libraries to identify and characterize sub-graphs as irreducible.
Mathematical reduction of irreducible portions of the graph may or may not be plausible. Consider starting vertex A and sole terminating vertex B in the graph represented as {A->C, C->A, A->D, D->A, C->D, D->C, C->B, D->B}.
Although one can reduce the graph to probability relations absent of cycles through vertex A, the vertex A cannot be removed for further reduction without either modifying probabilities of vertices exiting C and D or allowing both totals of probabilities of edges exiting C and D to be less than 1.0.
Convergent Breadth First Traversal
A breadth first traversal that ignores revisiting and allows cycles can iterate step index s, not to some fixed s_max but to some sufficiently stable and accurate point in a convergent trend. This approach is especially called for if cycles overlap, creating bifurcations in the simpler periodicity caused by a single cycle.
Σ P_s Δs.
For the establishment of a reasonable convergence as s increases, one must determine the desired accuracy as a criterion for completing the convergence algorithm and a metric for measuring accuracy by looking at longer term trends in results at all terminal vertices. It may be important to provide a criterion where the sum of terminal vertex probabilities is close to unity in conjunction with the trend convergence metric, as both a sanity check and an accuracy criterion. Practically, four convergence criteria may be necessary [3].
Per terminal vertex probability trend convergence delta
Average probability trend convergence delta
Convergence of total probability on unity
Total number of steps (to cap depth for practical computing reasons)
Even beyond these four, the program may need to contain a trap for an interrupt that permits the writing and subsequent examination of output after a long wait without the satisfying of all four above criteria.
An Example Cycle Resistant Depth First Algorithm
There are more efficient algorithms than the following one, but it is fairly comprehensible, it compiles without warning with C++ -Wall, and it produces the desired output for all finite and legitimate directed graphs and start and destination vertices possible [4]. It is easy to load a matrix in the form given in the question using the addEdge method [5].
#include <iostream>
#include <list>
class DirectedGraph {
private:
    int miNodes;
    std::list<int> * mnpEdges;
    bool * mpVisitedFlags;
private:
    void initAlreadyVisited() {
        for (int i = 0; i < miNodes; ++ i)
            mpVisitedFlags[i] = false;
    }
    // Depth first search that records every route reaching iDestination.
    // A vertex is flagged while it is on the current route, so cycles are never re-entered.
    void recurse(int iCurrent, int iDestination,
            int route[], int index,
            std::list<std::list<int> *> * pnai) {
        mpVisitedFlags[iCurrent] = true;
        route[index ++] = iCurrent;
        if (iCurrent == iDestination) {
            auto pni = new std::list<int>;
            for (int i = 0; i < index; ++ i)
                pni->push_back(route[i]);
            pnai->push_back(pni);
        } else {
            auto it = mnpEdges[iCurrent].begin();
            auto itBeyond = mnpEdges[iCurrent].end();
            while (it != itBeyond) {
                if (! mpVisitedFlags[* it])
                    recurse(* it, iDestination,
                            route, index, pnai);
                ++ it;
            }
        }
        -- index;
        mpVisitedFlags[iCurrent] = false;
    }
public:
    DirectedGraph(int iNodes) {
        miNodes = iNodes;
        mnpEdges = new std::list<int>[iNodes];
        mpVisitedFlags = new bool[iNodes];
    }
    ~DirectedGraph() {
        // Arrays allocated with new[] must be released with delete[].
        delete[] mnpEdges;
        delete[] mpVisitedFlags;
    }
    void addEdge(int u, int v) {
        mnpEdges[u].push_back(v);
    }
    // The caller owns the returned list and each route list inside it.
    std::list<std::list<int> *> * findRoutes(int iStart,
            int iDestination) {
        initAlreadyVisited();
        auto route = new int[miNodes];
        auto pnpi = new std::list<std::list<int> *>();
        recurse(iStart, iDestination, route, 0, pnpi);
        delete[] route;
        return pnpi;
    }
};
int main() {
    DirectedGraph dg(5);
    dg.addEdge(0, 1);
    dg.addEdge(0, 2);
    dg.addEdge(0, 3);
    dg.addEdge(1, 3);
    dg.addEdge(1, 4);
    dg.addEdge(2, 0);
    dg.addEdge(2, 1);
    dg.addEdge(4, 1);
    dg.addEdge(4, 3);
    int startingNode = 2;
    int destinationNode = 3;
    auto pnai = dg.findRoutes(startingNode, destinationNode);
    std::cout
            << "Unique routes from "
            << startingNode
            << " to "
            << destinationNode
            << std::endl
            << std::endl;
    bool bFirst;
    std::list<int> * pi;
    auto it = pnai->begin();
    auto itBeyond = pnai->end();
    std::list<int>::iterator itInner;
    std::list<int>::iterator itInnerBeyond;
    while (it != itBeyond) {
        bFirst = true;
        pi = * it ++;
        itInner = pi->begin();
        itInnerBeyond = pi->end();
        while (itInner != itInnerBeyond) {
            if (bFirst)
                bFirst = false;
            else
                std::cout << ' ';
            std::cout << (* itInner ++);
        }
        std::cout << std::endl;
        delete pi;
    }
    delete pnai;
    return 0;
}
Notes
[1] Improperly handled cycles in a directed graph algorithm will hang in an infinite loop. (Note the trivial case where the number of routes from A to B for the directed graph represented as {A->B, B->A} is infinity.)
[2] Probabilities are sometimes used to reduce the CPU cycle cost of a search. Probabilities, in that strategy, are input values for meta rules in a priority queue to reduce the computational challenge of very tedious searches (even for a computer). The early literature in production systems termed the exponential character of unguided large searches Combinatory Explosions.
[3] It may be practically necessary to detect the breadth first probability trend at each vertex and specify satisfactory convergence in terms of four criteria:
Δ(Σ∏P)_t <= Δ_max ∀ t
Σ_{t=0}^{T} Δ(Σ∏P)_t / T <= Δ_ave
|Σ Σ∏P - 1| <= u_max, where u_max is the maximum allowable deviation from unity for the sum of final probabilities
s < S_max
[4] Provided there are enough computing resources available to support the data structures and ample time to arrive at an answer for the given computing system speed.
[5] You can load DirectedGraph dg(7) with the input data using two loops nested to iterate through the rows and columns enumerated in the question. The body of the inner loop would simply be a conditional edge addition.
if (prob != 0) dg.addEdge(i, j);
Variable prob is P_{m,n}. Route existence is only concerned with zero/nonzero status.

I found this question while researching directed cyclic graphs. The probability of reaching each of the final nodes can be calculated using absorbing Markov chains.
The video Markov Chains - Part 7 (+ parts 8 and 9) explains absorbing states in Markov chains and the math behind it.
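As a sketch of the absorbing Markov chain calculation: if Q holds the transition probabilities among the non-final (transient) nodes and R those from transient to final nodes, the absorption probabilities X solve (I - Q) X = R. The Python below solves that system exactly with fractions and Gauss-Jordan elimination. The function name and the 0-based indexing are assumptions for this example; the matrix is the one from the question.

```python
from fractions import Fraction as F

def absorption_probabilities(P, start=0):
    n = len(P)
    absorbing = [i for i in range(n) if not any(P[i])]  # rows with no outgoing edges
    transient = [i for i in range(n) if any(P[i])]
    t = len(transient)
    # Augmented system [I - Q | R], one row per transient node.
    aug = []
    for i in transient:
        row = [(F(1) if i == j else F(0)) - P[i][j] for j in transient]
        row += [F(P[i][f]) for f in absorbing]
        aug.append(row)
    # Gauss-Jordan elimination. next() raises StopIteration if I - Q is singular,
    # i.e. some transient node can never reach a final node.
    for col in range(t):
        pivot = next(r for r in range(col, t) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(t):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    s = transient.index(start)
    return dict(zip(absorbing, aug[s][t:]))

P = [
    [0, F(1, 2), 0, F(1, 14), F(1, 14), 0, F(5, 14)],
    [0, 0, F(1, 9), F(1, 2), 0, F(7, 18), 0],
    [F(1, 8), F(7, 16), 0, F(3, 16), F(1, 8), 0, F(1, 8)],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
probs = absorption_probabilities(P)
print(probs)
```

For the question's example this gives 13/142 for node 5, 112/213 for node 6 and 163/426 for node 7 (indices 4, 5 and 6 here), which sum to 1.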

I understand this as the following problem:
Given an initial distribution to be on each node as a vector b and a Matrix A that stores the probability to jump from node i to node j in each time step, somewhat resembling an adjacency matrix.
Then the distribution b_1 after one time step is A x b. The distribution b_2 after two time steps is A x b_1. Likewise, the distribution b_n is A^n x b.
For an approximation of b_infinite, we can do the following:
Vector final_probability(Matrix A, Vector b,
        Function Vector x Vector -> Scalar distance, Scalar threshold){
    b_old = b
    b_current = A x b
    // keep iterating until successive distributions are closer than the threshold
    while(distance(b_old,b_current) >= threshold){
        b_old = b_current
        b_current = A x b_current
    }
    return b_current
}
(I used mathematical variable names for convenience)
In other words, we assume that the sequence of distributions has converged once two successive distributions differ by less than the given threshold. That might not hold true, but it will usually work.
You might want to add a maximum number of iterations to that.
Euclidean distance should work well as the distance.
(This uses the concept of a Markov chain but is more of a pragmatic solution)

XOR Neural Network(FF) converges to 0.5

I've created a program that allows me to create flexible Neural networks of any size/length, however I'm testing it using the simple structure of an XOR setup(Feed forward, Sigmoid activation, back propagation, no batching).
EDIT: The following is a completely new approach to my original question which didn't supply enough information
EDIT 2: I started my weight between -2.5 and 2.5, and fixed a problem in my code where I forgot some negatives. Now it either converges to 0 for all cases or 1 for all, instead of 0.5
Everything works exactly the way that I THINK it should, however it is converging toward 0.5, instead of oscillating between outputs of 0 and 1. I've completely gone through and hand calculated an entire setup of feeding forward/calculating delta errors/back prop./ etc. and it matched what I got from the program. I have also tried optimizing it by changing learning rate/ momentum, as well as increase complexity in the network(more neurons/layers).
Because of this, I assume that either one of my equations is wrong, or I have some other sort of misunderstanding in my Neural Network. The following is the logic with equations that I follow for each step:
I have an input layer with two inputs and a bias, a hidden with 2 neurons and a bias, and an output with 1 neuron.
Take the input from each of the two input neurons and the bias neuron, then multiply them by their respective weights, and then add them together as the input for each of the two neurons in the hidden layer.
Take the input of each hidden neuron, pass it through the Sigmoid activation function (Reference 1) and use that as the neuron's output.
Take the outputs of each neuron in hidden layer (1 for the bias), multiply them by their respective weights, and add those values to the output neuron's input.
Pass the output neuron's input through the Sigmoid activation function, and use that as the output for the whole network.
Calculate the Delta Error(Reference 2) for the output neuron
Calculate the Delta Error(Reference 3) for each of the 2 hidden neurons
Calculate the Gradient(Reference 4) for each weight (starting from the end and working back)
Calculate the Delta Weight(Reference 5) for each weight, and add that to its value.
Start the process over by changing the inputs and expected output (Reference 6)
Here are the specifics of those references to equations/processes (This is probably where my problem is!):
1. x is the input of the neuron: (1/(1 + Math.pow(Math.E, (-1 * x))))
2. -1*(actualOutput - expectedOutput)*(Sigmoid(x) * (1 - Sigmoid(x))) //Same sigmoid used in reference 1
3. SigmoidDerivative(Neuron.input)*(The sum of(Neuron.Weights * the deltaError of the neuron they connect to))
4. ParentNeuron.output * NeuronItConnectsTo.deltaError
5. learningRate*(weight.gradient) + momentum*(Previous Delta Weight)
6. I have an arrayList with the values 0,1,1,0 in it in that order. It takes the first pair (0,1), and then expects a 1. For the second time through, it takes the second pair (1,1) and expects a 0. It just keeps iterating through the list for each new set. Perhaps training it in this systematic way causes the problem?
Like I said before, the reason I don't think it's a code problem is that it matched exactly what I had calculated with paper and pencil (which wouldn't have happened if there was a coding error).
Also when I initialize my weights the first time, I give them a random double value between 0 and 1. This article suggests that that may lead to a problem: Neural Network with backpropogation not converging
Could that be it? I used the n^(-1/2) rule but that did not fix it.
If I can be more specific or you want other code let me know, thanks!

This is wrong:
SigmoidDerivative(Neuron.input)*(The sum of(Neuron.Weights * the deltaError of the neuron they connect to))
The first function below is the sigmoid activation (g); the second is the derivative of the sigmoid activation, which takes the activation value g(z) as its argument, not the raw input z.
private double g(double z) {
    return 1 / (1 + Math.exp(-z));
}
private double gD(double gZ) {
    return gZ * (1 - gZ);
}
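A quick way to convince yourself that the derivative really can be computed from the activation value is to compare it with a finite-difference estimate. This is a small Python sketch; the function names, the test point z = 0.3 and the step h are arbitrary choices for this example.

```python
import math

def g(z):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

def g_prime_from_activation(a):
    # derivative of the sigmoid, expressed through its output a = g(z)
    return a * (1.0 - a)

z = 0.3
h = 1e-6
numeric = (g(z + h) - g(z - h)) / (2 * h)  # central finite difference
analytic = g_prime_from_activation(g(z))
print(abs(numeric - analytic) < 1e-8)
```

Passing the raw input z to a function that expects g(z), or the other way round, is an easy mistake to make and silently gives the wrong gradient.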
Unrelated note: your notation (-1*x) is really strange; just use -x.
From how you phrase the steps of your ANN, your implementation seems poor. Try to focus on implementing forward propagation and backpropagation, and then an UpdateWeights method.
Creating a matrix class
This is my Java implementation; it's very simple and somewhat rough. I use a Matrix class to make the math behind it appear very simple in code.
If you can code in C++, you can overload operators, which makes it even easier to write comprehensible code.
https://github.com/josephjaspers/ArtificalNetwork/blob/master/src/artificalnetwork/ArtificalNetwork.java
Here are the algorithms (C++)
All of this code can be found on my GitHub (the neural nets are simple and functional)
Each layer includes the bias nodes, which is why there are offsets
void NeuralNet::forwardPropagation(std::vector<double> data) {
    setBiasPropogation(); // sets every bias node's activation to 1
    a(0).set(1, Matrix(data)); // start at 1 to skip the bias unit (A = X)
    for (int i = 1; i < layers; ++i) {
        // set(1, ...) writes from index 1, skipping the bias unit
        z(i).set(1, w(i - 1) * a(i - 1));
        a(i) = g(z(i)); // g(z) is the sigmoid function
    }
}
void NeuralNet::setBiasPropogation() {
    for (int i = 0; i < activation.size(); ++i) {
        a(i).set(0, 0, 1);
    }
}
outLayer D = A - Y (y is the output data)
hiddenLayers d^l = (w^l(T) * d^l+1) *: gD(a^l)
d = derivative vector
W = weights matrix (Length = connections, width = features)
a = activation matrix
gD = derivative function
^l = NOT an exponent (this just means "at layer l")
* = dot product
*: = element-wise multiply (multiply each element "through")
cpy(n) returns a copy of the matrix offset by n (ignores n rows)
void NeuralNet::backwardPropagation(std::vector<double> output) {
    // output layer: D = A - Y
    d(layers - 1) = a(layers - 1) - Matrix(output);
    // hidden layers, from last to first; cpy(1) skips the bias row
    for (int i = layers - 2; i > -1; --i) {
        d(i) = (w(i).T() * d(i + 1).cpy(1)).x(gD(a(i)));
    }
}
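For reference, the two delta rules listed above can be written in more standard notation. This is only a restatement under a few assumptions: L is used here for the index of the output layer, \odot stands for the element-wise product written as "*:" above, gD(a) is expanded to a(1 - a) as in the Java snippet, and the bias-row offsets are ignored.

```latex
\begin{aligned}
\delta^{(L)} &= a^{(L)} - y \\
\delta^{(l)} &= \left( \left(W^{(l)}\right)^{\top} \delta^{(l+1)} \right) \odot a^{(l)} \odot \left(1 - a^{(l)}\right), \qquad l < L
\end{aligned}
```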
Explaining this code may be confusing without images, so here is a link that I think is a good source. It also contains an explanation of backpropagation, which may be better than my own.
http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
void NeuralNet::updateWeights() {
    // operator()(int l, int w) returns a double reference at that position in the matrix
    // operator[](int n) returns a reference to the nth double in the matrix (useful for vectors)
    for (int l = 0; l < layers - 1; ++l) {
        for (int i = 1; i < d(l + 1).length(); ++i) { // i starts at 1 to skip the bias unit
            for (int j = 0; j < a(l).length(); ++j) {
                w(l)(i - 1, j) -= (d(l + 1)[i] * a(l)[j]) * learningRate + m(l)(i - 1, j);
                m(l)(i - 1, j) = (d(l + 1)[i] * a(l)[j]) * learningRate * momentumRate;
            }
        }
    }
}
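For comparison, here is a minimal Java sketch of the momentum update in the form the question states it (learningRate * gradient + momentum * previous delta weight). It is not taken from the linked repository: the class MomentumUpdate, the flat double[] layout, and the choice to subtract the delta (that is, to treat the gradient as pointing toward higher error) are all assumptions made for illustration.

```java
class MomentumUpdate {
    /**
     * Applies one momentum step in place.
     * prevDelta holds the weight changes from the previous step and is
     * overwritten with the changes made in this step.
     */
    static void step(double[] weights, double[] gradients, double[] prevDelta,
                     double learningRate, double momentum) {
        for (int i = 0; i < weights.length; i++) {
            // new delta = plain gradient step + a fraction of the previous delta
            double delta = learningRate * gradients[i] + momentum * prevDelta[i];
            weights[i] -= delta;   // move against the gradient
            prevDelta[i] = delta;  // remember for the next step
        }
    }

    public static void main(String[] args) {
        double[] w = {1.0};
        double[] grad = {0.5};
        double[] prev = {0.0};
        step(w, grad, prev, 0.1, 0.9);
        step(w, grad, prev, 0.1, 0.9);
        System.out.println(w[0]);
    }
}
```

Unlike the C++ version above, this variant carries the whole previous delta forward, so repeated steps in the same direction accumulate speed.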

Find a sum equal or greater than given target using only numbers from set

Example 1:
A shop sells beer in packages of 6 and 10 units. The customer inputs 26 and the algorithm replies 26, because 26 = 10 + 10 + 6.
Example 2:
A shop sells spices in packages of 0.6, 1.5 and 3. The target value is 5. The algorithm returns 5.1, because it is the smallest total not below the target that can be achieved with the packages (3 + 1.5 + 0.6).
I need a Java method that will suggest that number.
A similar algorithm is described under the bin packing problem, but it doesn't suit me.
I tried it, and whenever it returned a number smaller than the target I ran it again with an increased target. But that is not efficient when the number of packages is huge.
I need almost the same algorithm, but one that returns the nearest number equal to or greater than the target.
Similar question: Find if a number is a possible sum of two or more numbers in a given set - python.

First let's reduce this problem to integers rather than real numbers, otherwise we won't get a fast optimal algorithm out of this. For example, let's multiply all numbers by 100 and then just round it to the next integer. So say we have item sizes x1, ..., xn and target size Y. We want to minimize the value
k1 x1 + ... + kn xn - Y
under the conditions
(1) ki is a non-negative integer for all 1 ≤ i ≤ n
(2) k1 x1 + ... + kn xn - Y ≥ 0
One simple algorithm for this would be to ask a series of questions like
Can we achieve k1 x1 + ... + kn xn = Y + 0?
Can we achieve k1 x1 + ... + kn xn = Y + 1?
Can we achieve k1 x1 + ... + kn xn = Y + z?
etc. with increasing z
until we get the answer "Yes". All of these problems are instances of the Knapsack problem with the weights set equal to the values of the items. The good news is that we can solve all those at once, if we can establish an upper bound for z. It's easy to show that there is a solution with z ≤ Y, unless all the xi are larger than Y, in which case the solution is just to pick the smallest xi.
So let's use the pseudopolynomial dynamic programming approach to solve Knapsack: Let f(i,j) be 1 iff we can reach total item size j with the first i items (x1, ..., xi). We have the recurrence
f(0,0) = 1
f(0,j) = 0 for all j > 0
f(i,j) = f(i - 1, j) or f(i - 1, j - x_i) or f(i - 1, j - 2 * x_i) ...
We can solve this DP array in O(n * Y) time and O(Y) space. The result will be the first j ≥ Y with f(n, j) = 1.
There are a few technical details that are left as an exercise to the reader:
How to implement this in Java
How to reconstruct the solution if needed. This can be done in O(n) time using the DP array (but then we need O(n * Y) space to remember the whole thing).
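A possible Java sketch of the dynamic programme described above, for integer package sizes with unlimited copies of each package. The class and method names are made up for illustration. It collapses f(i, j) into a single boolean array indexed by j, and it uses a tighter upper bound than the one in the answer: any total below the target can be topped up with the smallest package, so a solution smaller than target + min(sizes) always exists. Sizes are assumed to be positive, and real-valued sizes are assumed to have been scaled to integers beforehand.

```java
import java.util.Arrays;

class PackageSum {
    /**
     * Returns the smallest total >= target that can be formed from the
     * given package sizes, using each size any number of times.
     * Assumes sizes is non-empty and contains only positive integers.
     */
    static int nearestReachable(int[] sizes, int target) {
        if (target <= 0) {
            return 0; // the empty selection already meets the target
        }
        int smallest = Arrays.stream(sizes).min().getAsInt();
        int limit = target + smallest - 1; // a solution exists at or below this total

        // reachable[j] is true if some combination of packages sums to exactly j
        boolean[] reachable = new boolean[limit + 1];
        reachable[0] = true;
        for (int j = 1; j <= limit; j++) {
            for (int size : sizes) {
                if (size <= j && reachable[j - size]) {
                    reachable[j] = true;
                    break;
                }
            }
        }

        for (int j = target; j <= limit; j++) {
            if (reachable[j]) {
                return j;
            }
        }
        throw new IllegalStateException("unreachable for positive sizes");
    }

    public static void main(String[] args) {
        // Example 1 from the question
        System.out.println(nearestReachable(new int[] {6, 10}, 26));
        // Example 2 from the question, with all values multiplied by 100
        System.out.println(nearestReachable(new int[] {60, 150, 300}, 500));
    }
}
```

The running time is proportional to the number of package sizes times (target + smallest size), so it is only practical when the scaled target is not too large.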

You want to solve the integer programming problem min(c·t) s.t. c·t >= T, c >= 0, where T is your target weight, c is a non-negative integer vector specifying how many of each package to purchase, and t is the vector specifying the weight of each package. You can solve this with dynamic programming, as pointed out in another answer. If your weights and target weight are too large for that, you can use a general integer programming solver; these have been highly optimized over the years and perform well in practice.
