implementing unification algorithm - java

I have spent the last 5 days trying to understand how the unification algorithm works in Prolog.
Now I want to implement such an algorithm in Java.
I thought maybe the best way is to manipulate the string and decompose its parts using some data structure such as a stack.
To make it clear:
suppose the user input is:
a(X,c(d,X)) = a(2,c(d,Y)).
I already take it as one string and split it into two strings (Expression1 and Expression2).
Now, how can I tell whether the next char(s) represent a variable, a constant, etc.?
I can do it with nested ifs, but that doesn't seem like a good solution.
I tried to use inheritance, but the problem remains: how can I know the type of the chars being read?

First you need to parse the inputs and build expression trees. Then apply Milner's unification algorithm (or some other unification algorithm) to figure out the mapping of variables to constants and expressions.
A really good description of Milner's algorithm may be found in the Dragon Book: "Compilers: Principles, Techniques and Tools" by Aho, Sethi and Ullman. (Milner's algorithm can also cope with unification of cyclic graphs, and the Dragon Book presents it as a way to do type inference.) By the sound of it, you could benefit from learning a bit about parsing, which is also covered by the Dragon Book.
EDIT: Other answers have suggested using a parser generator, e.g. ANTLR. That's good advice, but (judging from your example) your grammar is so simple that you could also get by with a StringTokenizer and a hand-written recursive descent parser. In fact, if you've got the time (and inclination), it is worth implementing the parser both ways as a learning exercise.
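To give an idea of what the hand-written variant could look like, here is a rough sketch (not the one true implementation; the class and method names are mine, and it assumes the input has already been split on "="):

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Sketch of a hand-written recursive descent parser for terms like a(X,c(d,X)).
// StringTokenizer is asked to return the delimiters "()," as single-character tokens.
class TermParser {
    private final StringTokenizer tokens;
    private String lookahead;

    TermParser(String input) {
        tokens = new StringTokenizer(input, "(),", true);
        advance();
    }

    private void advance() {
        lookahead = tokens.hasMoreTokens() ? tokens.nextToken().trim() : null;
        if (lookahead != null && lookahead.isEmpty()) advance(); // skip whitespace-only tokens
    }

    // term := name [ '(' term (',' term)* ')' ]
    Node parseTerm() {
        String name = lookahead;
        advance();
        List<Node> args = new ArrayList<>();
        if ("(".equals(lookahead)) {
            advance();
            args.add(parseTerm());
            while (",".equals(lookahead)) {
                advance();
                args.add(parseTerm());
            }
            advance(); // consume ')'
        }
        // Prolog convention: identifiers starting with an upper-case letter are variables.
        return new Node(name, Character.isUpperCase(name.charAt(0)), args);
    }
}

// Minimal tree node for the sketch above.
class Node {
    final String name;
    final boolean isVariable;
    final List<Node> args;
    Node(String name, boolean isVariable, List<Node> args) {
        this.name = name; this.isVariable = isVariable; this.args = args;
    }
}

Calling new TermParser("a(X,c(d,X))").parseTerm() would then yield a tree whose root is a with two children, X and c(d,X), which is the input the unification step needs.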

It sounds like this problem is more to do with parsing than unification specifically. Using something like ANTLR might help in terms of turning the original string into some kind of tree structure.
(It's not quite clear what you mean by "do it by nested if", but if you mean that you're doing something like trying to read an expression, and recursing when meeting each "(", then that's actually one of the right ways to do it -- this is at heart what the code that ANTLR generates for you will do.)
If you are more interested in the mechanics of unifying things than you are in parsing, then one perfectly good way to do this is to construct the internal representation in code directly, and put off the parsing aspect for now. This can get a bit annoying during development, as your Prolog-style statements are now a rather verbose set of Java statements, but it lets you focus on one problem at a time, which is usually helpful.
(If you structure things this way, this should make it straightforward to insert a proper parser later, that will produce the same sort of tree as you have until then been constructing by hand. This will let you attack the two problems separately in a reasonably neat fashion.)

Before you get to do the semantics of the language, you have to convert the text into a form that's easy to operate on. This process is called parsing and the semantic representation is called an abstract syntax tree (AST).
A simple recursive descent parser for Prolog might be hand-written, but it's more common to use a parser toolkit such as Rats! or ANTLR.
In an AST for Prolog, you might have a class for Term, with CompoundTerm, Variable, and Atom all being Terms. Polymorphism allows the arguments of a compound term to be any kind of Term.
Your unification algorithm then amounts to unifying the names (and arities) of corresponding compound terms and recursively unifying each pair of arguments.
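As a deliberately simplified sketch of what that might look like (class and method names are mine; it omits the occurs check and compares variables by object identity):

import java.util.List;
import java.util.Map;

// Sketch of the Term hierarchy plus recursive unification with a substitution map.
abstract class Term {}

class Atom extends Term {                 // constants such as d or 2
    final String name;
    Atom(String name) { this.name = name; }
}

class Variable extends Term {             // variables such as X or Y
    final String name;
    Variable(String name) { this.name = name; }
}

class CompoundTerm extends Term {         // e.g. c(d, X)
    final String functor;
    final List<Term> args;
    CompoundTerm(String functor, List<Term> args) { this.functor = functor; this.args = args; }
}

class Unifier {
    // Returns true if s and t unify; bindings accumulates variable -> term.
    static boolean unify(Term s, Term t, Map<Variable, Term> bindings) {
        s = resolve(s, bindings);
        t = resolve(t, bindings);
        if (s instanceof Variable) { bindings.put((Variable) s, t); return true; }
        if (t instanceof Variable) { bindings.put((Variable) t, s); return true; }
        if (s instanceof Atom && t instanceof Atom)
            return ((Atom) s).name.equals(((Atom) t).name);
        if (s instanceof CompoundTerm && t instanceof CompoundTerm) {
            CompoundTerm cs = (CompoundTerm) s, ct = (CompoundTerm) t;
            if (!cs.functor.equals(ct.functor) || cs.args.size() != ct.args.size()) return false;
            for (int i = 0; i < cs.args.size(); i++)
                if (!unify(cs.args.get(i), ct.args.get(i), bindings)) return false;
            return true;
        }
        return false;
    }

    // Follow a chain of variable bindings to the most concrete term known so far.
    static Term resolve(Term t, Map<Variable, Term> bindings) {
        while (t instanceof Variable && bindings.containsKey(t)) t = bindings.get(t);
        return t;
    }
}

With these classes, unifying a(X,c(d,X)) against a(2,c(d,Y)) leaves the bindings X -> 2 and Y -> 2.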

Related

What would be the best way to build a Big-O runtime complexity analyzer for pseudocode in a text file?

I am trying to create a class that takes in a string input containing pseudocode and computes its worst-case runtime complexity. I will be using regex to split each line, analyze the worst case, and add up the complexities (based on the big-O rules) for each line to give a final worst-case runtime. The pseudocode written will follow a few rules for declaration, initialization, and operations on data structures. This is something I can control. How should I go about designing a class considering the rules of iterative and recursive analysis?
Any help in C++ or Java is appreciated. Thanks in advance.
class PseudocodeAnalyzer
{
public:
    std::string inputCode;
    std::string performIterativeAnalysis(std::string line);
    std::string performRecursiveAnalysis(std::string line);
    std::string analyzeTotalComplexity(std::string inputCode);
};
An example of an iterative algorithm: check whether every number in a grid is odd:
1. Array A = Array[N][N]
2. for i in 1 to N
3.   for j in 1 to N
4.     if A[i][j] % 2 == 0
5.       return false
6.     endif
7.   endloop
8. endloop
Worst-case Time-Complexity: O(n*n)
The concept: "I wish to write a program that analyses pseudocode in order to print out the algorithmic complexity of the algorithm it describes" is mathematically impossible!
Let me try to explain why that is, and what your options are for getting around this impossibility.
Your pseudocode has certain capabilities. You call it pseudocode, but given that you are now trying to parse it, it's still a 'real' language where terms have real meaning. This language is capable of expressing algorithms.
So, which algorithms can it express? Presumably, 'all of them'. There is this concept called a 'Turing machine': you can prove that anything a computer can do, a Turing machine can also do, and Turing machines are very simple things. Therefore, if you have some simplistic computer and you can use it to emulate a Turing machine, you can use it to emulate a complete computer. This is how, in fundamental computer science, you prove that a certain CPU or system is capable of computing all the stuff some other CPU or system can compute: use it to simulate a Turing machine. Any system that can be used to emulate a Turing machine is called 'Turing complete'.
Then we get to something very interesting: If your pseudocode can be used to express anything a real computer can do, then your pseudocode can be used to 'write'... your very pseudocode checker!
So let's say we do just that and stick the pseudocode that describes your pseudocode checker in a function we shall call pseudocodechecker. It takes as argument a string containing some pseudocode, and returns a string such as O(n^2).
You can then write this program in pseudocode:
1. if pseudocodechecker(this-very-program) == O(n^2)
2. If True runSomeAlgorithmThatIsO(1)
3. If False runSomeAlgorithmThatIsO(n^2)
And this is self-defeating: we have 'programmed' a paradox. It's like "This statement is a lie", or "the set of all sets that do not contain themselves". If it's false it is true, and if it is true it is false. [Insert GIF of exploding computer here].
Thus, we have mathematically proved that what you want is impossible, unless one of the following is true:
A. Your pseudocode-based checker is incorrect. As in, it will flat-out give a wrong answer sometimes, thus resolving the paradox: if you feed your program a paradox, it gives a wrong answer. But how useful is such an app, if you know the answer it gives may be incorrect?
B. Your pseudocode-based checker is incomplete: the official definition of your pseudocode language is restricted enough that you cannot even write a Turing machine in it.
That last one seems like a nice solution; but it is quite drastic. It pretty much means that your algorithm can only loop over constant ranges. It cannot loop until a condition is true, for example. Another nice solution appears to be: The program is capable of realizing that an answer cannot be given, and will then report 'no answer available', but unfortunately, with some more work, you can show that you can still use such a system to develop a paradox.
The answer by @rzwitserloot and the ones given in the link are correct. Let me just add that it is possible to compute an approximation both to the halting problem and to finding the time complexity of a piece of code (written in a Turing-complete language!). (Compare that to the existence of automated theorem provers for arithmetic and other second-order logics, which are undecidable!) A tool that under-approximates the complexity problem would output the correct time complexity for some inputs, and "don't know" for other inputs.
Indeed, the whole wide field of code analyzers, often built into the IDEs that we use every day, more often than not under-approximate decision problems that are uncomputable, e.g. reachability, nullability or value analyses.
If you really want to write such a tool: the basic idea is to identify heuristics, i.e., common patterns for which a solution is known, such as various patterns of nested for-loops with only very basic arithmetic operations manipulating the indices, or simple recursive functions where the recurrence relation can be spotted straight-away. It would actually be not too hard (though definitely not easy!) to write a tool that could solve most of the toy problems (such as the one you posted) that are given as homework to students, and that are often posted as questions here on SO, since they follow a rather small number of patterns.
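As a toy illustration of such a heuristic (nothing more than a sketch; the class name and the assumption that every loop runs N times are mine), one could recognise the for/endloop pattern from the question and report O(n^d) for the deepest nesting depth d:

import java.util.List;

// Toy heuristic: only understands the numbered "for ... / endloop" pattern from
// the question and assumes every loop iterates N times.
class LoopDepthHeuristic {
    static String estimate(List<String> pseudocodeLines) {
        int depth = 0, maxDepth = 0;
        for (String line : pseudocodeLines) {
            // strip a leading "3. " style line number, then normalise
            String t = line.trim().replaceFirst("^\\d+\\.\\s*", "").toLowerCase();
            if (t.startsWith("for ")) {
                depth++;
                maxDepth = Math.max(maxDepth, depth);
            } else if (t.startsWith("endloop")) {
                depth--;
            }
        }
        if (maxDepth == 0) return "O(1)";
        if (maxDepth == 1) return "O(n)";
        return "O(n^" + maxDepth + ")";
    }
}

For the grid example in the question this reports O(n^2), which matches the stated worst case; anything beyond such fixed patterns quickly runs into the impossibility argument above.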
If you wish to go beyond simple heuristics, the main theoretical concept underlying more powerful code analyzers is abstract interpretation. Applied to your use case, this would mean developing a mapping between code constructs in your language and code constructs in a different language (or simpler code constructs in the same language) for which it is easier to compute the time complexity. This mapping would have to conform to some constraints; in particular, the mapped constructs must have the same or worse time complexity as the original code. Actually, mapping a piece of code to a recurrence relation would be an example of abstract interpretation. So is replacing a line of code with something like "O(1)". So the task is just to formalize some of the things that we do in our heads anyway when we analyze the time complexity of code.

Determine the Plurality of a Noun/Verb

I have a program that randomly generates sentences based on a bunch of text documents of all the nouns, verbs, adjectives, and adverbs. Does anyone know a way to determine whether a noun/verb is plural or singular, or any text documents that contain lists of singular and plural nouns/verbs? I'm doing this all in Java, and I have a decent idea of how to get information off of a website, so if there are any websites that could do that, I'd also appreciate those.
I am afraid you cannot solve this by having a fixed list of words, especially for verbs. Consider the sentences:
You are free. We are free.
In the first one, are is singular; in the second, it is plural. Using a proper tagger, as @jdaz suggests, is the only way to do it reliably.
If you work with English or a few other supported languages, StanfordNLP is an excellent choice. If you need broad language coverage, you can use UDPipe, which is written in C++ but has a Java binding.
The first step would be to look it up in a list. For English you can reduce the size of the list by only including singular nouns, and then apply some basic string processing to find plurals: if your word ends in -s and is not in the list, cut off the -s and look again. If it now is in the list, it was a simple plural (car/cars). If not, continue. If it ends in -ies, remove that, append -y and look again. Now you will capture remedies/remedy. There are a number of such patterns you can use.
Some irregular nouns need to be in an exception list (ox/oxen), but there aren't that many. Some words of course are unspecified, like sheep, data, or police. Here you need to look at the context: if the noun is followed by a singular verb (e.g. eats, or is), then it is singular as well.
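A minimal sketch of that list-plus-suffix idea (the word sets here are tiny placeholders; a real list would be loaded from a file):

import java.util.Set;

// Sketch of the look-up-plus-suffix heuristic for nouns. The word sets are
// placeholders standing in for a real singular-noun list and an exception list.
class NounNumber {
    static final Set<String> SINGULAR_NOUNS = Set.of("car", "remedy", "ox");
    static final Set<String> IRREGULAR_PLURALS = Set.of("oxen", "geese", "mice");

    // Returns true if the word looks plural under these heuristics.
    static boolean isPlural(String word) {
        String w = word.toLowerCase();
        if (SINGULAR_NOUNS.contains(w)) return false;    // known singular form
        if (IRREGULAR_PLURALS.contains(w)) return true;   // ox/oxen and friends
        if (w.endsWith("ies")
                && SINGULAR_NOUNS.contains(w.substring(0, w.length() - 3) + "y"))
            return true;                                  // remedies -> remedy
        if (w.endsWith("s")
                && SINGULAR_NOUNS.contains(w.substring(0, w.length() - 1)))
            return true;                                  // cars -> car
        return false;                                     // unknown: fall back to context
    }
}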
With (English) verbs you can generally only identify the third person singular (with a similar procedure to the one used for nouns; you'd need a list of exceptions for verbs ending in -s, such as kiss). Forms of to be are more helpful, but the second person singular is an issue (are). However, unless you have direct speech in your texts, it will not be used very frequently.
Part of speech taggers can also only make these decisions on context, so I don't think they will be much of a help here. It's likely to be overkill. A couple of word lists and simple heuristic rules will probably give you equal or better accuracy using far fewer resources. This is the way these things were done before large amounts of annotated data were available.
In the end it depends on your circumstances. It might be quicker to simply use an existing tagger, but for this limited problem you might get better accuracy and speed with the rule-based approach (or even a combined one, for accuracy).

Java CFG parser that supports ambiguities

I'm looking for a CFG parser implemented in Java. The thing is, I'm trying to parse a natural language, and I need all possible parse trees (ambiguity), not only one of them. I have already researched many NLP parsers, such as the Stanford parser, but they mostly require statistical data (a treebank, which I don't have), and it is rather difficult, and poorly documented, to adapt them to a new language.
I found some parser generators such as ANTLR or JFlex, but I'm not sure they can handle ambiguities. So which parser generator or Java library is best for me?
Thanks in advance
You want a parser that uses the Earley algorithm. I haven't used either of these two libraries, but PEN and PEP appear to implement this algorithm in Java.
Another option is Bison, which implements GLR, an LR-type parsing algorithm that supports ambiguous grammars. Bison also generates Java code, in addition to C++.
Take a look at the related discussion here. In my last comment in that discussion I explain that you can make any parser generator produce all of the parse trees by cloning the parse tree derived so far before making the derivation fail.
If your grammar is:
G -> ...
You would augment it like this:
G' -> G {semantic:deal-with-complete-parse-tree} <NOT-VALID-TOKEN>.
The parsing engine will ultimately fail on all derivations, but your program will either have:
Saved clones of all the trees.
Dealt with the semantics of each of the trees as they were found.
Both ANTLR and JavaCC did well when I was teaching. My preference was for ANTLR because of its BNF lexical analysis, and its much less convoluted history, vision, and licensing.

Programmatical approach in Java for file comparison

What would be the best approach to compare two hexadecimal file signatures against each other for similarities?
More specifically, what I would like to do is take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. For this approach I plan to break the file's (exe) hex representation into individual groups of N chars (i.e. 10 hex chars) and do the same with the virus signature. I am aiming to perform some sort of heuristic and therefore statistically check whether this exe file has X% of similarity against the known virus signature.
The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore exe1[0,9] against virus1[0,9]. Each subset will be graded statistically.
As you can imagine, there would be a massive number of comparisons, and hence it would be very, very slow. So I thought to ask whether you can think of a better approach to such a comparison, for example combining different data structures.
This is for a project I am doing for my BSc, where I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other is based on genetic algorithms to evolve the static virus signature. Any advice, comments, or general information such as resources is very welcome.
Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version, while having an apparently different structure (variants). It achieves that by code obfuscation, thus altering its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting/removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. Much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.
Update: After the advice you gave me on what to read, I did just that, but it somewhat confused me more. I found several distance algorithms that could apply to my problem, such as:
Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer Moore algorithm
Aho Corasick algorithm
But now I don't know which to use; they all seem to do the same thing in different ways. I will continue to do research so that I can understand each one better, but in the meantime, could you give me your opinion on which might be most suitable, so that I can give it priority during my research and study it more deeply?
Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
There is a copy of the finished paper on GitHub
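For reference, the Levenshtein distance mentioned in the update is the standard dynamic-programming edit distance; a minimal Java sketch (my own, not taken from the linked paper, and ignoring the chunking into N-char groups) looks like this:

// Classic dynamic-programming Levenshtein (edit) distance between two strings,
// e.g. two chunks of a hex dump. O(|a|*|b|) time and space.
class Levenshtein {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}

A low distance relative to the chunk length would then count as a strong partial match between an exe chunk and a signature chunk.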
For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).
Also, with regard to polymorphic malware specifically, this field should offer you a lot, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximate searching/matching algorithms to point you to.)
One example from this direction would be to adapt something like the Aho Corasick algorithm in order to search for several malware signatures at the same time.
Similarly, algorithms like the Boyer Moore algorithm give you fantastic search runtimes especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
A number of papers have been published on finding near-duplicate documents in a large corpus of documents in the context of web search. I think you will find them useful; for example, see this presentation.
There has been a serious amount of research recently into automating the detection of duplicate bug reports in bug repositories. This is essentially the same problem you are facing. The difference is that you are using binary data. They are similar problems because you will be looking for strings that have the same basic pattern, even though the patterns may have some slight differences. A straight-up distance algorithm probably won't serve you well here.
This paper gives a good summary of the problem as well as some approaches in its citations that have been tried.
ftp://ftp.computer.org/press/outgoing/proceedings/Patrick/apsec10/data/4266a366.pdf
As somebody has pointed out, similarity to known strings and the bioinformatics problem might help. Longest common substring is very brittle, meaning that one difference can halve the length of such a string. You need a form of string alignment, but one more efficient than Smith-Waterman. I would try looking at programs such as BLAST, BLAT or MUMMER3 to see if they can fit your needs. Remember that the default parameters for these programs are based on a biology application (how much to penalize an insertion or a substitution, for instance), so you should probably look at re-estimating the parameters based on your application domain, possibly based on a training set. This is a known problem because even in biology different applications require different parameters (based, for instance, on the evolutionary distance of the two genomes being compared). It is also possible, though, that even at the defaults one of these algorithms might produce usable results. Best of all would be to have a generative model of how viruses change; that could guide you towards an optimal choice of distance and comparison algorithm.

Mathematical expression (string) to number in Java

I'm trying to find something like the Java Embedding Plugin (JEP) that can evaluate a mathematical formula (string) and give back the answer.
But it should also solve for a variable, for example: (25+36+x)*2 = 25 should give: x = -11.
A little like http://www.wolframalpha.com/, but it needn't be that versatile, and it should work offline.
Open source is preferred.
I need it for my little calculator project, http://sourceforge.net/projects/calex/.
This is called arithmetic expression evaluation. One of the easiest ways to implement it is using Edsger Dijkstra's shunting-yard algorithm.
The shunting-yard algorithm is a method for parsing mathematical equations specified in infix notation. It can be used to produce output in Reverse Polish notation (RPN) or as an abstract syntax tree (AST). The algorithm was invented by Edsger Dijkstra and named the "shunting yard" algorithm because its operation resembles that of a railroad shunting yard. Like the evaluation of RPN, the shunting-yard algorithm is stack-based. Infix expressions are the form of mathematical notation most people are used to, for instance 3+4 or 3+4*(2−1). For the conversion there are two text variables (strings), the input and the output. There is also a stack that holds operators not yet added to the output queue. To convert, the program reads each symbol in order and does something based on that symbol.
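To make that concrete, here is a minimal sketch (my own, not from any particular library) of the conversion plus RPN evaluation for the four basic operators and parentheses; tokenization, unary minus and solving for a variable are left out:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of shunting-yard conversion to RPN, followed by stack-based RPN evaluation.
class ShuntingYard {
    static int precedence(String op) {
        return ("*".equals(op) || "/".equals(op)) ? 2 : 1;
    }

    static boolean isOperator(String t) {
        return "+".equals(t) || "-".equals(t) || "*".equals(t) || "/".equals(t);
    }

    // Infix tokens (e.g. ["3", "+", "4", "*", "(", "2", "-", "1", ")"]) -> RPN tokens.
    static List<String> toRpn(List<String> infix) {
        List<String> output = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String t : infix) {
            if (isOperator(t)) {
                // pop operators of equal or higher precedence (all left-associative here)
                while (!ops.isEmpty() && isOperator(ops.peek())
                        && precedence(ops.peek()) >= precedence(t)) {
                    output.add(ops.pop());
                }
                ops.push(t);
            } else if ("(".equals(t)) {
                ops.push(t);
            } else if (")".equals(t)) {
                while (!"(".equals(ops.peek())) output.add(ops.pop());
                ops.pop(); // discard "("
            } else {
                output.add(t); // a number
            }
        }
        while (!ops.isEmpty()) output.add(ops.pop());
        return output;
    }

    // Evaluate an RPN token list with a simple operand stack.
    static double evalRpn(List<String> rpn) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String t : rpn) {
            if (isOperator(t)) {
                double b = stack.pop(), a = stack.pop();
                switch (t) {
                    case "+": stack.push(a + b); break;
                    case "-": stack.push(a - b); break;
                    case "*": stack.push(a * b); break;
                    default:  stack.push(a / b); break;
                }
            } else {
                stack.push(Double.parseDouble(t));
            }
        }
        return stack.pop();
    }
}

For example, toRpn applied to the tokens of 3+4*(2-1) yields 3 4 2 1 - * +, which evalRpn reduces to 7.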
I have also seen the exact solution you are looking for on some Stack Overflow user's blog, but I can't remember the address (it was something like 'Code Monkeyism'). It was a lightweight class that you could use in applets (you could also define constants and reset values).
Edit: Found it: http://tech.dolhub.com/Code/MathEval
A Linear-Recursive Math Evaluator
This math expression evaluator was born out of a need to have a small-footprint and efficient solution which could evaluate arbitrary expressions reasonably efficiently without requiring pre-compilation. I needed something which would do basic math with variables, expressions like: "Top+2", "Bottom-2" and "(Right+1-Left)/2".
Research on the Internet turned up a number of fairly good solutions, all of which revolved around creating parse trees (which makes sense). The problem was - they were all rather bulky, and I couldn't afford to add 100K to my applet size just for math. So I started wondering about a linear recursive solution to the problem. The end result is an acceptably performing single class with no external dependencies, weighing in under 10 KiB.
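Not an excerpt from MathEval, but a compact sketch of the same linear-recursive idea: evaluate while parsing, with a variable lookup, instead of building a parse tree first (whitespace, unary minus and error handling are omitted):

import java.util.Map;

// Tiny recursive-descent evaluator: walks the string once and computes values
// as it goes, looking up identifiers such as "Top" in a variable map.
class TinyEval {
    private final String src;
    private final Map<String, Double> vars;
    private int pos;

    TinyEval(String src, Map<String, Double> vars) { this.src = src; this.vars = vars; }

    double parse() { pos = 0; return expression(); }

    // expression := term (('+' | '-') term)*
    private double expression() {
        double value = term();
        while (peek() == '+' || peek() == '-') {
            value = (next() == '+') ? value + term() : value - term();
        }
        return value;
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double value = factor();
        while (peek() == '*' || peek() == '/') {
            value = (next() == '*') ? value * factor() : value / factor();
        }
        return value;
    }

    // factor := number | identifier | '(' expression ')'
    private double factor() {
        if (peek() == '(') { next(); double v = expression(); next(); return v; } // next() skips ')'
        StringBuilder sb = new StringBuilder();
        while (pos < src.length() && (Character.isLetterOrDigit(peek()) || peek() == '.')) {
            sb.append(next());
        }
        String token = sb.toString();
        return vars.containsKey(token) ? vars.get(token) : Double.parseDouble(token);
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
    private char next() { return src.charAt(pos++); }
}

With a map containing Right and Left, new TinyEval("(Right+1-Left)/2", vars).parse() evaluates the expression directly against those values.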
I released an expression evaluator based on Dijkstra's Shunting Yard algorithm, under the terms of the Apache License 2.0:
http://projects.congrace.de/exp4j/index.html
