Algorithm to generate a Turing Machine from a Regular Expression


I'm developing software that generates a Turing machine from a regular expression.
[ EDIT: To clarify, the OP wants to take a regular expression as input, and programmatically generate a Turing Machine to perform the same task. OP is seeking to perform the task of creating a TM from a regular expression, not using a regular expression. ]
First I'll explain what I have done so far, and then my specific problem:
I've modeled the regular expression as follows:
RegularExpression (interface): the classes below implement this interface.
Simple (e.g. "aaa", "bbb", "abcde"): this is a leaf class; it does not have any subexpressions.
ComplexWithoutOr (e.g. "a(ab)*", "(a(ab)c(b))*"): this class contains a list of RegularExpression.
ComplexWithOr (e.g. "a(a|b)", "(a((ab)|c(b)))"): this class contains an Or operation, which contains a list of RegularExpression. It represents the "a|b" part of the first example and the "(ab)|c(b)" part of the second one.
Variable (e.g. "awcw", where w ∈ {a,b}*): this is not yet implemented, but the idea is to model it as a leaf class with some logic different from Simple. It represents the "w" part of the example.
It is important that you understand and agree with the model above. If you have questions, leave a comment before reading on.
When it comes to Turing machine (MT) generation, I have different levels of complexity:
Simple: this type of expression already works. It generates a new state for each letter and moves right. If, in any state, the letter read is not the expected one, it starts a "rollback circuit" that returns the MT head to its initial position and halts in a non-final state.
ComplexWithoutOr: this is where my problem lies. The algorithm generates an MT for each subexpression and concatenates them. This works for some simple cases, but I have problems with the rollback mechanism.
Here is an example that does not work with my algorithm:
"(ab)*abac": this is a ComplexWithoutOr expression that contains a ComplexWithoutOr expression "(ab)*" (which has a Simple expression "ab" inside) and a Simple expression "abac".
My algorithm first generates MT1 for "ab". MT1 is used by MT2 for "(ab)*": if MT1 succeeds, MT2 enters MT1 again; otherwise MT1 rolls back and MT2 finishes successfully. In other words, MT2 cannot fail.
Then it generates MT3 for "abac". The output of MT2 is the input of MT3, and the output of MT3 is the result of the execution.
Now suppose the input string is "abac". As you can see, it matches the regular expression. Let's see what happens when the MT is executed.
MT1 succeeds the first time, on "ab". MT1 fails the second time, on "ac", and rolls back, leaving the MT head at the 3rd position, "a". MT2 finishes successfully and the input is forwarded to MT3. MT3 fails, because "ac" != "abac". So the MT does not recognize "abac".
Do you understand the problem? Do you know any solution for this?
I'm using Java to develop it, but the language is not important; I'd like to discuss the algorithm.

It is not entirely clear to me what exactly you are trying to implement. It looks like you want to build a Turing machine (or, more simply, a finite-state machine) that accepts exactly the strings matched by the regular expression. In effect, you want to convert a regular expression to an FSM.
That is essentially what an automaton-based regex matcher does under the hood. I think this series of articles by Russ Cox covers a lot of what you want to do.

Michael Sipser, in Introduction to the Theory of Computation, proves in chapter 1 that regular expressions are equivalent to finite automata in their descriptive power. Part of the proof involves constructing a nondeterministic finite automaton (NDFA) that recognizes the language described by a specific regular expression. I'm not about to copy half that chapter, which would be quite hard due to the notation used, so I suggest you borrow or purchase the book (or perhaps a Google search using these terms will turn up a similar proof) and use that proof as the basis for your algorithm.
As Turing machines can simulate an NDFA, I assume an algorithm to produce an NDFA is good enough.
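A minimal sketch of that idea, assuming a Thompson-style construction restricted to literals, concatenation, "|", "*" and parentheses. The class name RegexNfa and all of its members are hypothetical and are not taken from Sipser or from the question. The NFA is simulated by tracking the set of reachable states, so the "rollback" problem from the question does not arise: every way of splitting the input between "(ab)*" and "abac" is explored at once.

```java
import java.util.*;

/** Sketch: regex (literals, |, *, parentheses) -> epsilon-NFA -> set-of-states simulation. */
class RegexNfa {
    // A state has at most one labelled edge plus any number of epsilon edges.
    static final class State {
        char label;                         // meaningful only when next != null
        State next;                         // target of the labelled edge
        final List<State> eps = new ArrayList<>();
    }
    // An NFA fragment with a single entry state and a single exit state.
    static final class Frag {
        final State start, end;
        Frag(State s, State e) { start = s; end = e; }
    }

    private final String src;
    private int pos = 0;
    private final Frag nfa;

    RegexNfa(String regex) {
        src = regex;
        nfa = parseAlt();
        if (pos != src.length())
            throw new IllegalArgumentException("unexpected '" + src.charAt(pos) + "'");
    }

    // alt := concat ('|' concat)*
    private Frag parseAlt() {
        Frag left = parseConcat();
        while (pos < src.length() && src.charAt(pos) == '|') {
            pos++;
            Frag right = parseConcat();
            State s = new State(), e = new State();
            s.eps.add(left.start); s.eps.add(right.start);
            left.end.eps.add(e); right.end.eps.add(e);
            left = new Frag(s, e);
        }
        return left;
    }

    // concat := star*   (an empty sequence matches the empty string)
    private Frag parseConcat() {
        State s = new State();
        Frag result = new Frag(s, s);
        while (pos < src.length() && src.charAt(pos) != '|' && src.charAt(pos) != ')') {
            Frag f = parseStar();
            result.end.eps.add(f.start);
            result = new Frag(result.start, f.end);
        }
        return result;
    }

    // star := atom '*'*
    private Frag parseStar() {
        Frag f = parseAtom();
        while (pos < src.length() && src.charAt(pos) == '*') {
            pos++;
            State s = new State(), e = new State();
            s.eps.add(f.start); s.eps.add(e);           // enter the loop or skip it
            f.end.eps.add(f.start); f.end.eps.add(e);   // repeat or leave
            f = new Frag(s, e);
        }
        return f;
    }

    // atom := '(' alt ')' | literal character
    private Frag parseAtom() {
        char c = src.charAt(pos++);
        if (c == '(') {
            Frag f = parseAlt();
            if (pos >= src.length() || src.charAt(pos) != ')')
                throw new IllegalArgumentException("missing ')'");
            pos++;
            return f;
        }
        State s = new State(), e = new State();
        s.label = c; s.next = e;
        return new Frag(s, e);
    }

    // Adds s and everything reachable from it via epsilon edges.
    private static void closure(State s, Set<State> out) {
        if (out.add(s)) for (State t : s.eps) closure(t, out);
    }

    boolean matches(String input) {
        Set<State> current = new HashSet<>();
        closure(nfa.start, current);
        for (char c : input.toCharArray()) {
            Set<State> next = new HashSet<>();
            for (State s : current)
                if (s.next != null && s.label == c) closure(s.next, next);
            current = next;
        }
        return current.contains(nfa.end);
    }

    public static void main(String[] args) {
        RegexNfa r = new RegexNfa("(ab)*abac");
        System.out.println(r.matches("abac"));   // true
        System.out.println(r.matches("ababac")); // true
        System.out.println(r.matches("ac"));     // false
    }
}
```

Turning such an NFA into a Turing machine is then mostly mechanical: the machine only ever moves right and never writes, so each NFA (or determinized DFA) state can become one machine state.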

In the Chomsky hierarchy, regular expressions describe Type-3 (regular) languages, whereas Turing machines recognize Type-0 languages. This means that a TM can recognize any language described by a regex, but not vice versa.

Related

How to write a pseudocode of an Selenium automation test?

I need to write pseudocode, but I've never written any before. Searching around, I have found pseudocode examples for basic, simple algorithms, but I have no idea how to write pseudocode that involves Selenium methods.
Do you have an example of pseudocode for an automation test?
I have in mind Java and Selenium, with automation tests derived from Cucumber scenarios. I just need an example to guide me in writing my pseudocode.

Pseudocode
Pseudocode is written as annotations and informational text in plain English. Unlike a programming language, it has no formal syntax, so it cannot be compiled or interpreted.
Ways to write Pseudocode in Java
In order to write the Pseudocode in java, you can follow the steps below:
You need to maintain the arrangement of the sequence of the tasks and, based on that, write the pseudocode.
The pseudocode starts with the statement that establishes the aim or goal.
Points which we need to keep in mind while designing the pseudocode of a program in Java:
Use an appropriate naming convention. Simple, distinct names make the pseudocode much easier to understand.
Use appropriate casing: CamelCase for methods, upper case for constants, and lower case for variables.
The pseudocode should not be abstract; it should spell out what is going to happen in the actual code.
Use the standard programming structures (if-then, for, while, cases) in the same way as in real code.
Every section of the pseudocode should be complete, finite, and clear.
Keep the pseudocode simple enough to be understood by a layperson without much knowledge of technical terms.
Ensure that the pseudocode isn't written in a completely programmatic manner.
Sample Pseudocode
Initialize c to zero.
Initialize n to the number to be checked for the Armstrong property.
Initialize temp to n.
Repeat the following steps while the value of n is greater than zero:
    Find the remainder of n by using n % 10.
    Remove the last digit from the number by using n / 10.
    Find the cube of the remainder and add it to c.
If temp == c
    Print "Armstrong number"
else
    Print "Not an Armstrong number"
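For comparison, here is one way the sample pseudocode could translate to Java. This is only a sketch: the class and method names (ArmstrongCheck, isArmstrong) are made up for illustration, and, like the pseudocode, it always cubes the digits, so it only matches the usual definition of an Armstrong number for three-digit inputs.

```java
class ArmstrongCheck {
    /** Returns true if n equals the sum of the cubes of its decimal digits. */
    static boolean isArmstrong(int n) {
        int c = 0;       // running sum of the cubed digits
        int temp = n;    // keep the original value for the final comparison
        while (n > 0) {
            int remainder = n % 10;                   // last digit
            n = n / 10;                               // drop the last digit
            c += remainder * remainder * remainder;   // add its cube
        }
        return temp == c;
    }

    public static void main(String[] args) {
        int n = 153; // 1 + 125 + 27 = 153
        System.out.println(isArmstrong(n) ? "Armstrong number" : "Not an Armstrong number");
    }
}
```

Each line of the pseudocode maps to one statement here, which is a reasonable check that the pseudocode is neither too abstract nor too close to real code.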

Pseudocode is "pseudo" because it does not necessarily have to use existing methods. Just use common sense when writing it, for example:
elements = Selenium.find(locator)
for each element in elements
do:
assert that element.text is not empty
od

What would be the best way to build a Big-O runtime complexity analyzer for pseudocode in a text file?

I am trying to create a class that takes a string input containing pseudocode and computes its worst-case runtime complexity. I will be using regex to split each line, analyze its worst case, and add up the complexities (based on the big-O rules) for each line to give a final worst-case runtime. The pseudocode will follow a few rules for declaration, initialization, and operations on data structures. This is something I can control. How should I go about designing a class, considering the rules of iterative and recursive analysis?
Any help in C++ or Java is appreciated. Thanks in advance.
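#include <string>

// Declarations only; the question does not show any definitions.
class PseudocodeAnalyzer
{
public:
    std::string inputCode;
    std::string performIterativeAnalysis(std::string line);
    std::string performRecursiveAnalysis(std::string line);
    std::string analyzeTotalComplexity(std::string inputCode);
};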
An example of an iterative algorithm: check if a number in a grid is odd:
Array A = Array[N][N]
for i in 1 to N
    for j in 1 to N
        if A[i][j] % 2 == 0
            return false
        endif
    endloop
endloop
Worst-case Time-Complexity: O(n*n)

The concept: "I wish to write a program that analyses pseudocode in order to print out the algorithmic complexity of the algorithm it describes" is mathematically impossible!
Let me try to explain why that is, or how you get around the inevitability that you cannot write this.
Your pseudocode has certain capabilities. You call it pseudocode, but given that you are now trying to parse it, it's still a 'real' language where terms have real meaning. This language is capable of expressing algorithms.
So, which algorithms can it express? Presumably, 'all of them'. There is this concept called a 'Turing machine': you can prove that anything a computer can do, a Turing machine can also do. And Turing machines are very simple things. Therefore, if you have some simplistic computer and you can use that computer to emulate a Turing machine, you can use it to emulate a complete computer. This is how, in theoretical computer science, you can prove that a certain CPU or system is capable of computing all the stuff some other CPU or system is capable of computing: use it to emulate a Turing machine, thus proving you can run it all. Any system that can be used to emulate a Turing machine is called 'Turing complete'.
Then we get to something very interesting: If your pseudocode can be used to express anything a real computer can do, then your pseudocode can be used to 'write'... your very pseudocode checker!
So let's say we do just that and stick the pseudocode that describes your pseudocode checker in a function we shall call pseudocodechecker. It takes as argument a string containing some pseudocode, and returns a string such as O(n^2).
You can then write this program in pseudocode:
if pseudocodechecker(this-very-program) == O(n^2)
    If True runSomeAlgorithmThatIsO(1)
    If False runSomeAlgorithmThatIsO(n^2)
And this is self-defeating: we have 'programmed' a paradox. It's like "This statement is a lie", or "the set of all sets that do not contain themselves". If it's false it is true, and if it is true it is false. [Insert GIF of exploding computer here].
Thus, we have mathematically proved that what you want is impossible, unless one of the following is true:
A. Your pseudocode-based checker is incorrect. As in, it will flat out give a wrong answer sometimes, thus solving the paradox: If you feed your program a paradox, it gives a wrong answer. But how useful is such an app? An app where you know the answer it gives may be incorrect?
B. Your pseudocode-based checker is incomplete: The official definition of your pseudocode language is so incapable, you cannot even write a turing machine in it.
That last one seems like a nice solution; but it is quite drastic. It pretty much means that your algorithm can only loop over constant ranges. It cannot loop until a condition is true, for example. Another nice solution appears to be: The program is capable of realizing that an answer cannot be given, and will then report 'no answer available', but unfortunately, with some more work, you can show that you can still use such a system to develop a paradox.

The answer by @rzwitserloot and the ones given in the link are correct. Let me just add that it is possible to compute an approximation both to the halting problem and to the time complexity of a piece of code (written in a Turing-complete language!). (Compare that to the existence of automated theorem provers for arithmetic and other second order logics, which are undecidable!) A tool that under-approximates the complexity problem would output the correct time complexity for some inputs, and "don't know" for other inputs.
Indeed, the whole wide field of code analyzers, often built into the IDEs that we use every day, more often than not under-approximate decision problems that are uncomputable, e.g. reachability, nullability or value analyses.
If you really want to write such a tool: the basic idea is to identify heuristics, i.e., common patterns for which a solution is known, such as various patterns of nested for-loops with only very basic arithmetic operations manipulating the indices, or simple recursive functions where the recurrence relation can be spotted straight-away. It would actually be not too hard (though definitely not easy!) to write a tool that could solve most of the toy problems (such as the one you posted) that are given as homework to students, and that are often posted as questions here on SO, since they follow a rather small number of patterns.
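As a rough illustration of such a heuristic, the sketch below recognizes exactly one pattern: counted loops of the form "for x in 1 to N" closed by "endloop", as in the question's sample. It reports the maximum nesting depth as a polynomial bound and answers "don't know" for anything that looks like another kind of loop or a call. The class name LoopDepthHeuristic, the method estimate, and the assumed pseudocode conventions are all hypothetical, and the result is only an upper bound under those assumptions.

```java
class LoopDepthHeuristic {
    /** Returns "O(1)", "O(n)", "O(n^k)" for k >= 2, or "don't know". */
    static String estimate(String pseudocode) {
        int depth = 0;      // current loop nesting depth
        int maxDepth = 0;   // deepest nesting seen so far
        for (String raw : pseudocode.split("\\R")) {
            // Tolerate optional "3. " style line numbering.
            String line = raw.trim().replaceFirst("^\\d+\\.\\s*", "");
            if (line.isEmpty()) continue;
            if (line.matches("for \\w+ in 1 to [Nn]")) {
                depth++;
                maxDepth = Math.max(maxDepth, depth);
            } else if (line.equals("endloop")) {
                if (--depth < 0) return "don't know";   // unbalanced endloop
            } else if (line.startsWith("while") || line.startsWith("for") || line.contains("(")) {
                return "don't know";                    // unrecognised loop or a call
            }
            // Any other line is assumed to cost O(1).
        }
        if (depth != 0) return "don't know";            // a loop was never closed
        if (maxDepth == 0) return "O(1)";
        return maxDepth == 1 ? "O(n)" : "O(n^" + maxDepth + ")";
    }

    public static void main(String[] args) {
        String code = String.join("\n",
            "for i in 1 to N",
            "for j in 1 to N",
            "if A[i][j] % 2 == 0",
            "return false",
            "endif",
            "endloop",
            "endloop");
        System.out.println(estimate(code)); // O(n^2)
    }
}
```

The important design choice is the explicit "don't know" answer: the tool stays correct on the inputs it does answer, which is the under-approximation described above.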
If you wish to go beyond simple heuristics, the main theoretical concept underlying more powerful code analyzers is abstract interpretation. Applied to your use case, this would mean developing a mapping from code constructs in your language to code constructs in a different language (or simpler code constructs in the same language) for which it is easier to compute the time complexity. This mapping would have to conform to some constraints; in particular, the mapped constructs must have the same or worse time complexity as the original code. Actually, mapping a piece of code to a recurrence relation would be an example of abstract interpretation. So is replacing a line of code with something like "O(1)". So, the task is just to formalize some of the things that we do in our heads anyway when we are analyzing the time complexity of code.

Distinguishing between right shift (>>) and Java generics

I am writing a lexer for Java in flex.
The Java spec says:
"The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (§4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >."
So how can I distinguish between right shift operator and something like in <List<List>>?

The original Java generics proposal (JSR-14) required modifying the Java grammar for parameterized types so that it would accept >> and >>> in contexts where multiple close angle brackets were possible. (I couldn't find a useful authoritative link for JSR-14 but Gilad Bracha's GJ specification is still available on his website; the grammar modifications are shown in section 2.3.)
These modifications were never formally incorporated in any Java standard as far as I know; eventually, JLS8 incorporated the change to the description of lexical analysis which you quote in your question. (See JDK-8021600, which also reproduces the convoluted grammar which was originally proposed.)
The grammar modifications proposed by Bracha et al will work, but you might find that they make incorporating other grammar changes more complicated. (I haven't really looked at this in any depth, so it might not actually be a problem for the current Java Language Specification. But it still might be an issue for future editions.)
While contextual lexical analysis does allow the simpler grammar actually used in the JLS, it certainly creates difficulties for lexical analysis. One possible approach is to abandon lexical analysis altogether by using a scannerless parser; this will certainly work but you won't be able to accomplish that within the Bison/Flex model. Also, you might find that some of the modifications needed to support scannerless parsing also require non-trivial changes to the published grammar.
Another possibility is to use lexical feedback from the parser, by incorporating mid-rule actions (MRAs) which turn a "type context" flag on and off when type contexts are entered and exited. (There is a complete list of type contexts in §4.11 which can be used to find the appropriate locations in the grammar.) If you try this, please be aware that the execution of MRAs is not fully synchronised with lexical analysis because the parser generally requires a lookahead token to decide whether or not to reduce the MRA. You often need to put the MRA one symbol earlier in the grammar than you might think, so that it actually takes effect by the time it is needed.
Another possibility might be to never recognise >> and >>> as tokens. Instead, the lexer could return two different > tokens, one used when the immediate next character is a >:
>/> { return CONJUNCTIVE_GT; }
> { return INDEPENDENT_GT; }
/* These two don't need to be changed. */
>>= { return SHIFT_ASSIGN; }
>>>= { return LONG_SHIFT_ASSIGN; }
Then you can modify your grammar to recognise >> and >>> operators, while allowing either form of > as a close angle bracket:
shift_op : CONJUNCTIVE_GT INDEPENDENT_GT
long_shift_op: CONJUNCTIVE_GT CONJUNCTIVE_GT INDEPENDENT_GT
close_angle : CONJUNCTIVE_GT | INDEPENDENT_GT
gt_op : INDEPENDENT_GT /* This unit production is not really necessary */
That should work (although I haven't tried it), but it doesn't play well with the Bison/Yacc operator precedence mechanism, because you cannot declare precedence for a non-terminal. So you'd need to use an expression grammar with explicit operator precedence rules, rather than an ambiguous grammar augmented with precedence declarations.
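To make the resulting token stream concrete, here is a small hand-written Java scanner (not flex) that mimics the four lexer rules above for inputs consisting of '>' and '=' characters. It is only a sketch: the class name AngleLexer and the OTHER token are made up for illustration, and checking the longer operators first stands in for flex's longest-match rule.

```java
import java.util.ArrayList;
import java.util.List;

class AngleLexer {
    /** Returns the token names produced for s, scanning left to right. */
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith(">>>=", i)) {          // longest operator first
                out.add("LONG_SHIFT_ASSIGN"); i += 4;
            } else if (s.startsWith(">>=", i)) {
                out.add("SHIFT_ASSIGN"); i += 3;
            } else if (s.startsWith(">>", i)) {     // '>' with another '>' as lookahead
                out.add("CONJUNCTIVE_GT"); i += 1;  // consume only the first '>'
            } else if (s.charAt(i) == '>') {
                out.add("INDEPENDENT_GT"); i += 1;
            } else {
                out.add("OTHER"); i += 1;           // anything else is outside this sketch
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(">>"));   // [CONJUNCTIVE_GT, INDEPENDENT_GT]
        System.out.println(tokenize(">>>"));  // [CONJUNCTIVE_GT, CONJUNCTIVE_GT, INDEPENDENT_GT]
        System.out.println(tokenize(">>="));  // [SHIFT_ASSIGN]
    }
}
```

The parser then decides from context whether CONJUNCTIVE_GT INDEPENDENT_GT is a shift operator or two closing angle brackets, which is the point of the grammar rules shown above.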

How to store mathematical formula in MS SQL Server DB and interpret it using JAVA?

I have to give the user the option to enter in a text field a mathematical formula and then save it in the DB as a String. That is easy enough, but I also need to retrieve it and use it to do calculations.
For example, assume I allow someone to specify the formula of employee salary calculation which I must save in String format in the DB.
GROSS_PAY = BASIC_SALARY - NO_PAY + TOTAL_OT + ALLOWANCE_TOTAL
Assume that terms such as GROSS_PAY and BASIC_SALARY are known to us and we can work out what they evaluate to. The real issue is that we can't predict which combinations of such terms (e.g. GROSS_PAY etc.) and mathematical operators the user may choose to enter (not just +, -, ×, / but also the radical sign, indicating roots, and powers, etc.). So how do we interpret this formula in string format once we have retrieved it from the DB, so that we can do calculations based on the composition of the formula?

Building an expression evaluator is actually fairly easy.
See my SO answer on how to write a parser. With a BNF for exactly the range of expression operators and operands you want, you can follow this process to build a parser for exactly those expressions, directly in Java.
The answer links to a second answer that discusses how to evaluate the expression as you parse it.
So, you read the string from the database, collect the set of possible variables that can occur in the expression, and then parse/evaluate the string. If you don't know the variables in advance (seems like you must), you can parse the expression twice, the first time just to get the variable names.
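A minimal sketch of such a parse-and-evaluate pass, assuming a small grammar with +, -, *, /, ^ (right-associative power), unary minus, parentheses, numbers, and named variables supplied in a map. The class name FormulaEvaluator and its methods are hypothetical, and only the right-hand side of a formula such as the GROSS_PAY example is evaluated; this is not the code from the linked answers.

```java
import java.util.HashMap;
import java.util.Map;

class FormulaEvaluator {
    private final String src;
    private final Map<String, Double> vars;
    private int pos = 0;

    private FormulaEvaluator(String src, Map<String, Double> vars) {
        this.src = src;
        this.vars = vars;
    }

    /** Parses and evaluates the formula; throws IllegalArgumentException on bad input. */
    static double evaluate(String formula, Map<String, Double> vars) {
        FormulaEvaluator p = new FormulaEvaluator(formula, vars);
        double v = p.expr();
        p.skipSpaces();
        if (p.pos != p.src.length())
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        return v;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    // Consumes c if it is the next non-space character.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double v = term();
        while (true) {
            if (accept('+')) v += term();
            else if (accept('-')) v -= term();
            else return v;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double v = factor();
        while (true) {
            if (accept('*')) v *= factor();
            else if (accept('/')) v /= factor();
            else return v;
        }
    }

    // factor := '-' factor | primary ('^' factor)?
    private double factor() {
        if (accept('-')) return -factor();
        double base = primary();
        return accept('^') ? Math.pow(base, factor()) : base;
    }

    // primary := '(' expr ')' | number | NAME
    private double primary() {
        if (accept('(')) {
            double v = expr();
            if (!accept(')')) throw new IllegalArgumentException("Expected ')' at position " + pos);
            return v;
        }
        int start = pos;
        while (pos < src.length() && (Character.isLetterOrDigit(src.charAt(pos))
                || src.charAt(pos) == '_' || src.charAt(pos) == '.')) pos++;
        if (start == pos) throw new IllegalArgumentException("Expected operand at position " + pos);
        String token = src.substring(start, pos);
        if (Character.isDigit(token.charAt(0))) return Double.parseDouble(token);
        Double value = vars.get(token);
        if (value == null) throw new IllegalArgumentException("Unknown variable " + token);
        return value;
    }

    public static void main(String[] args) {
        Map<String, Double> vars = new HashMap<>();
        vars.put("BASIC_SALARY", 1000.0);
        vars.put("NO_PAY", 50.0);
        vars.put("TOTAL_OT", 200.0);
        vars.put("ALLOWANCE_TOTAL", 150.0);
        // 1000 - 50 + 200 + 150
        System.out.println(evaluate("BASIC_SALARY - NO_PAY + TOTAL_OT + ALLOWANCE_TOTAL", vars)); // 1300.0
    }
}
```

Because only the grammar above is accepted, anything else in the stored string fails with a position-specific error instead of being executed.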

As described in "Evaluating a math expression given in string form", there is a JavaScript engine for Java that can evaluate a string expression containing operators.
Hope this helps.

You could build a string representation of a class that effectively wraps your expression and compile it using the system JavaCompiler (this requires a file system). You can also evaluate strings directly using JavaScript or Groovy. In each case, you need to figure out a way to bind variables. One approach would be to use regex to find and replace known variable names with a call to a binding function:
getValue("BASIC_SALARY") - getValue("NO_PAY") + getValue("TOTAL_OT") + getValue("ALLOWANCE_TOTAL")
or
getBASIC_SALARY() - getNO_PAY() + getTOTAL_OT() + getALLOWANCE_TOTAL()
This approach, however, exposes you to all kinds of injection type security bugs; so, it would not be appropriate if security was required. The approach is also weak when it comes to error diagnostics. How will you tell the user why their expression is broken?
An alternative is to use something like ANTLR to generate a parser in Java. It's not too hard and there are a lot of examples. This approach will provide both security (users can't inject malicious code because it won't parse) and diagnostics.

Finite State Machine program

I am tasked with creating a small program that can read in the definition of an FSM from input, read some strings from input, and determine whether those strings are accepted by the FSM based on the definition. I need to write this in either C, C++ or Java. I've scoured the net for ideas on how to get started, but the best I could find was a Wikipedia article on automata-based programming. The C example provided seems to use an enumerated list to define the states, which is fine if the states are hard-coded in advance. Again, I need to be able to actually read the number of states and the definition of what each state is supposed to do. Any suggestions are appreciated.
UPDATE:
I can make the alphabet small (e.g. { a b }) and adopt other conventions such as the start state is always state 0. I'm allowed to impose reasonable restrictions on the number of states, e.g. no more than 10.
Question summary:
How do I implement an FSA?

First, get a list of all the states (N of them), and a list of all the symbols (M of them). Then there are 2 ways to go, interpretation or code-generation:
1. Interpretation. Make an NxM matrix, where each element of the matrix is filled in with the corresponding destination state number, or -1 if there is none. Then just have a state variable set to the initial state and start processing input. If you get to state -1, you fail. If you run out of input symbols without getting to the success state, you fail. Otherwise you succeed.
2. Code generation. Print out a program in C or your favorite compiled language. It should have an integer state variable initialized to the start state. It should have a for loop over the input characters, containing a switch on the state variable. You should have one case per state, and at each case, have a switch statement on the current character that changes the state variable.
3. If you want something even faster than 2, and that is sure to get you flunked (!), get rid of the state variable and instead use goto :-) If you flunk, you can comfort yourself in the knowledge that that's what compilers do.
P.S. You could get your F changed to an A if you recognize loops etc. in the state diagram and print out corresponding while and if statements, rather than using goto.
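The interpretation approach can be sketched in Java roughly as follows. The class name TableDfa, its constructor parameters, and the sample machine are assumptions made for illustration; the start state is taken to be state 0, as in the question's update, and a set of accepting states replaces the single "success state".

```java
class TableDfa {
    private final String alphabet;     // the symbol at index i uses column i
    private final int[][] next;        // next[state][column], -1 = no transition
    private final boolean[] accepting; // accepting[state]

    TableDfa(String alphabet, int[][] next, boolean[] accepting) {
        this.alphabet = alphabet;
        this.next = next;
        this.accepting = accepting;
    }

    boolean accepts(String input) {
        int state = 0;                          // convention: the start state is state 0
        for (char c : input.toCharArray()) {
            int column = alphabet.indexOf(c);
            if (column < 0) return false;       // symbol is not in the alphabet
            state = next[state][column];
            if (state < 0) return false;        // no transition defined
        }
        return accepting[state];                // must end in an accepting state
    }

    public static void main(String[] args) {
        // Sample machine over {a, b}: accepts strings that end in "ab".
        int[][] table = {
            {1, 0},   // state 0: a -> 1, b -> 0
            {1, 2},   // state 1: a -> 1, b -> 2
            {1, 0},   // state 2: a -> 1, b -> 0
        };
        TableDfa dfa = new TableDfa("ab", table, new boolean[] {false, false, true});
        System.out.println(dfa.accepts("aab")); // true
        System.out.println(dfa.accepts("aba")); // false
    }
}
```

Reading the machine definition from input then reduces to filling in the alphabet string, the table, and the accepting array.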

One non-hardcoded way to represent an automaton is as a transition matrix, which records, for each current state and each input character, what the next state is.

You haven't actually asked a question. You'll get more and better help if you have a specific question for a specific task (but still give the overall goal). The question should be narrow in scope (e.g. not "How can I implement an FSA?").
As for how to represent an FSA (which seems to be what you're having difficulties with), read on.
Start by considering the definition of an FSM: it's an alphabet Σ, a set of states S, a start state s0, a set of accept states A and a transition function δ from a state and a symbol to a state. You have to be able to determine these properties from the input. Any states not reachable by the transition function can be dropped to produce an equivalent FSM. The minimal set of states and alphabet are thus implicit in the transition function; you could make your FSM easier to use (and harder to implement, but not much harder) by not requiring either Σ or S in the input.
You don't need to use the same representation for states that the input uses. You could use unsigned integers for your internal representation, as long as you have a map from integers to strings and strings to integers so you can convert between the internal representation and external representation. This way, your transition function can be stored as an array, so the transition step can be performed in constant time.
A simpler approach would be to use the external representation as your internal representation. With this option, the transition function would be stored as a map from strings and symbols to strings. The transition step would probably be O(log(|S|+|Σ|)), given the performance of most map data structures. If symbols are represented as integers (e.g. chars), the transition function could be represented as a map from strings to an array of strings, giving O(log(|S|)) performance.
Yet another option, modeled after the graph view of an FSM, is to create a class for states. A state has a name (the external representation). States are responsible for transitions; send a symbol to a state and get back another state.
class State {
property name;
State& transition(Symbol s);
void setTransition(Symbol s, State& to);
}
Store the set of states as a map from names to states.
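In Java, the state-object option might look roughly like the sketch below. The State class follows the outline above; the StateMachine wrapper, its method names, and the use of null for a missing transition are assumptions added for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class State {
    final String name;                                   // external representation
    private final Map<Character, State> transitions = new HashMap<>();

    State(String name) { this.name = name; }

    void setTransition(char symbol, State to) { transitions.put(symbol, to); }

    /** Returns the next state, or null if no transition is defined for the symbol. */
    State transition(char symbol) { return transitions.get(symbol); }
}

class StateMachine {
    private final Map<String, State> states = new HashMap<>();   // name -> state
    private final Set<State> accepting = new HashSet<>();

    /** Looks up a state by name, creating it on first use. */
    State state(String name) { return states.computeIfAbsent(name, State::new); }

    void markAccepting(String name) { accepting.add(state(name)); }

    boolean accepts(String startName, String input) {
        State current = state(startName);
        for (char c : input.toCharArray()) {
            current = current.transition(c);
            if (current == null) return false;           // dead end
        }
        return accepting.contains(current);
    }

    public static void main(String[] args) {
        // Sample machine: accepts strings over {a, b} with an even number of a's.
        StateMachine m = new StateMachine();
        m.state("even").setTransition('a', m.state("odd"));
        m.state("odd").setTransition('a', m.state("even"));
        m.state("even").setTransition('b', m.state("even"));
        m.state("odd").setTransition('b', m.state("odd"));
        m.markAccepting("even");
        System.out.println(m.accepts("even", "abab")); // true
        System.out.println(m.accepts("even", "ab"));   // false
    }
}
```

Creating states on first lookup means the input file never has to list the state set separately; it is implied by the transitions, as noted earlier in this answer.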
There you go, three different places to start, each with a different way to represent states.

Stop thinking about everything at once. Do one thing at a time:
- come up with a language for the state machine
- come up with a language for the stimulus
- create a sample file of one state machine in that language
- create a sample file of stimulus
- come up with a class for state
- come up with a class for transition
- come up with a class for the state machine as a set of states and transitions
- add method to handle violation to state class
- code a little parser for language
- code another parser for language
- initial state
- some output thing like WriteLn here and there
- main method
- compile
- run
- debug
- done

The way the OpenFst toolkit does it is: an FSM has a vector of states, each of which has a vector of arcs. Each arc has an input (and output) label, a target state ID and a weight. You could take a look at the code. Maybe it will inspire you.

If you're using an object-oriented language like Java or C++, I'd recommend that you start with objects. Before you worry about file formats and the like, get a good object model for a finite state automaton and how it behaves. How will you represent states, transitions, events, etc.? Will your FSA be a Composite? Once you have that sort of thing working you can get the file formats right. Anything will do: XML, text, etc.
