Compiler written in Java: Peephole optimizer implementation

I'm writing a compiler for a subset of Pascal. The compiler produces machine instructions for a made-up machine. I want to write a peephole optimizer for this machine language, but I'm having trouble substituting some of the more complicated patterns.
Peephole optimizer specification
I've researched several different approaches to writing a peephole optimizer, and I've settled on a back-end approach:
The Encoder makes a call to an emit() function every time a machine instruction is to be generated.
emit(Instruction currentInstr) checks a table of peephole optimizations:
If the current instruction matches the tail of a pattern:
Check previously emitted instructions for matching
If all instructions matched the pattern, apply the optimization, modifying the tail end of the code store
If no optimization was found, emit the instruction as usual
Current design approach
The method is easy enough; it's the implementation I'm having trouble with. In my compiler, machine instructions are stored in an Instruction class. I wrote an InstructionMatch class that stores regular expressions meant to match each component of a machine instruction. Its equals(Instruction instr) method returns true if the patterns match some machine instruction instr.
However, I can't manage to fully apply the rules I have. First off, I feel that given my current approach, I'll end up with a mess of needless objects. Given that a complete list of peephole optimizations can number around 400 patterns, this will get out of hand fast. Furthermore, I can't actually get the more difficult substitutions working with this approach (see "My question").
Alternate approaches
One paper I've read folds previous instructions into one long string, using regular expressions to match and substitute, and then converts the string back to machine instructions. This seemed like a bad approach to me; please correct me if I'm wrong.
Example patterns, pattern syntax
x: JUMP x+1; x+1: JUMP y --> x: JUMP y
LOADL x; LOADL y; add --> LOADL x+y
LOADA d[r]; STOREI (n) --> STORE (n) d[r]
Note that each of these example patterns is just a human-readable representation of the following machine instruction template:
op_code register n d
(n usually indicates the number of words, and d an address displacement). The syntax x: <instr> indicates that the instruction is stored at address x in the code store.
So the instruction LOADL 17 is equivalent to the full machine instruction 5 0 0 17 when the LOADL opcode is 5 (n and r are unused in this instruction).
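To make that concrete, an Instruction here boils down to four integers, something like this (field names illustrative):
class Instruction {
    int opCode; // e.g. 5 for LOADL
    int r;      // register
    int n;      // usually a word count
    int d;      // address displacement / operand, e.g. 17 in LOADL 17
}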
My question
So, given that background, my question is this: how do I effectively match and replace patterns when I need to include parts of previous instructions as variables in my replacement? For example, I can simply replace all instances of LOADL 1; add with the increment machine instruction; I don't need any part of the previous instructions to do this. But I'm at a loss as to how to effectively use the 'x' and 'y' values of my second example in the substitution pattern.
edit: I should mention that each field of an Instruction class is just an integer (as is normal for machine instructions). Any use of 'x' or 'y' in the pattern table is a variable to stand in for any integer value.
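(A possible direction for the variable-binding part, sketched in Java; the class and method names here are illustrative, not from my actual code. Instead of a bare regex per field, make each pattern field either a literal integer or a named wildcard; matching records wildcard bindings in a map, and the replacement template computes its fields from those bindings.)
import java.util.HashMap;
import java.util.Map;

// A pattern field is either a literal int or a named wildcard such as "x".
class FieldPattern {
    final Integer literal; // null means "wildcard"
    final String name;     // binding name for wildcards
    FieldPattern(Integer literal, String name) { this.literal = literal; this.name = name; }

    // On a wildcard, record the matched value; the same name must bind the same value.
    boolean match(int value, Map<String, Integer> bindings) {
        if (literal != null) return literal == value;
        Integer bound = bindings.get(name);
        if (bound != null) return bound == value;
        bindings.put(name, value);
        return true;
    }
}
// For LOADL x; LOADL y; add ==> LOADL x+y, matching fills bindings with x and y,
// and the replacement instruction's operand field is then computed as
// bindings.get("x") + bindings.get("y").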

An easy way to do this is to implement your peephole optimizer as a finite state machine.
We assume you have a raw code generator that generates instructions but does not emit them, and an emit routine that sends actual code to the object stream.
The state machine captures instructions that your code generator produces, and remembers sequences of 0 or more generated instructions by transitioning between states. A state thus implicitly remembers a (short) sequence of generated but un-emitted instructions; it also has to remember the key parameters of the instructions it has captured, such as a register name, a constant value, and/or addressing modes and abstract target memory locations. A special start state remembers the empty string of instructions. At any moment, you need to be able to emit the unemitted instructions ("flush"); if you do this all the time, your peephole generator captures the next instruction and then emits it, doing no useful work.
To do useful work, we want the machine to capture as long a sequence as possible. Since there are typically many kinds of machine instructions, as a practical matter you can't remember too many in a row or the state machine will become enormous. But it is practical to remember the last two or three for the most common machine instructions (load, add, cmp, branch, store). The size of the machine will really be determined by the length of the longest peephole optimization we care to do, but if that length is P, the entire machine need not be P states deep.
Each state has transitions to a next state based on the "next" instruction I produced by your code generator. Imagine a state represents the capture of N instructions.
The transition choices are:
- flush the leftmost 0 or more (call this k) instructions that this state represents, and transition to the next state, which represents the remaining N-k instructions plus the newly captured instruction I (N-k+1 instructions in all).
- flush the leftmost k instructions this state represents, transition to the state that represents the remaining N-k instructions, and reprocess instruction I.
- flush the state completely, and emit instruction I, too. [You can actually do this on just the start state.]
When flushing the k instructions, what actually gets emitted is the peephole-optimized version of those k. You can compute anything you want when emitting such instructions. You also need to remember to "shift" the parameters for the remaining instructions appropriately.
This is all pretty easily implemented with a peephole optimizer state variable, and a case statement at each point where your code generator produces its next instruction. The case statement updates the peephole optimizer state and implements the transition operations.
Assume our machine is an augmented stack machine that has the instructions
PUSHVAR x
PUSHK i
ADD
POPVAR x
MOVE x,k
but the raw code generator generates only pure stack machine instructions; i.e., it does not emit the MOVE instruction at all. We want the peephole optimizer to do this.
The peephole cases we care about are:
PUSHK i, PUSHK j, ADD ==> PUSHK i+j
PUSHK i, POPVAR x ==> MOVE x,i
Our state variables are:
PEEPHOLESTATE (an enum symbol, initialized to EMPTY)
FIRSTCONSTANT (an int)
SECONDCONSTANT (an int)
Our case statements:
GeneratePUSHK:
    switch (PEEPHOLESTATE) {
    EMPTY:      PEEPHOLESTATE=PUSHK;
                FIRSTCONSTANT=K;
                break;
    PUSHK:      PEEPHOLESTATE=PUSHKPUSHK;
                SECONDCONSTANT=K;
                break;
    PUSHKPUSHK:
    #IF consumeEmitLoadK // flush state, transition and consume generated instruction
                emit(PUSHK,FIRSTCONSTANT);
                FIRSTCONSTANT=SECONDCONSTANT;
                SECONDCONSTANT=K;
                PEEPHOLESTATE=PUSHKPUSHK;
                break;
    #ELSE // flush state, transition, and reprocess generated instruction
                emit(PUSHK,FIRSTCONSTANT);
                FIRSTCONSTANT=SECONDCONSTANT;
                PEEPHOLESTATE=PUSHK;
                goto GeneratePUSHK; // Java can't do this, but other languages can.
    #ENDIF
    }
GenerateADD:
    switch (PEEPHOLESTATE) {
    EMPTY:      emit(ADD);
                break;
    PUSHK:      emit(PUSHK,FIRSTCONSTANT);
                emit(ADD);
                PEEPHOLESTATE=EMPTY;
                break;
    PUSHKPUSHK:
                PEEPHOLESTATE=PUSHK;
                FIRSTCONSTANT+=SECONDCONSTANT;
                break;
    }
GeneratePOPX:
    switch (PEEPHOLESTATE) {
    EMPTY:      emit(POPVAR,X);
                break;
    PUSHK:      emit(MOVE,X,FIRSTCONSTANT);
                PEEPHOLESTATE=EMPTY;
                break;
    PUSHKPUSHK:
                emit(MOVE,X,SECONDCONSTANT);
                PEEPHOLESTATE=PUSHK;
                break;
    }
GeneratePUSHVARX:
    switch (PEEPHOLESTATE) {
    EMPTY:      emit(PUSHVAR,X);
                break;
    PUSHK:      emit(PUSHK,FIRSTCONSTANT);
                PEEPHOLESTATE=EMPTY;
                goto GeneratePUSHVARX;
    PUSHKPUSHK:
                PEEPHOLESTATE=PUSHK;
                emit(PUSHK,FIRSTCONSTANT);
                FIRSTCONSTANT=SECONDCONSTANT;
                goto GeneratePUSHVARX;
    }
The #IF shows two different styles of transitions, one that consumes the generated instruction and one that does not; either works for this example. When you end up with a few hundred of these case statements, you'll find both types handy, with the "don't consume" version helping you keep your code smaller.
We need a routine to flush the peephole optimizer:
flush() {
    switch (PEEPHOLESTATE) {
    EMPTY:      break;
    PUSHK:      emit(PUSHK,FIRSTCONSTANT);
                break;
    PUSHKPUSHK:
                emit(PUSHK,FIRSTCONSTANT);
                emit(PUSHK,SECONDCONSTANT);
                break;
    }
    PEEPHOLESTATE=EMPTY;
    return;
}
It is interesting to consider what this peephole optimizer does with the following generated code:
PUSHK 1
PUSHK 2
ADD
PUSHK 5
POPVAR X
POPVAR Y
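Tracing by hand (worth checking against the case statements above): PUSHK 1 puts the FSA in state PUSHK with FIRSTCONSTANT=1; PUSHK 2 moves it to PUSHKPUSHK with SECONDCONSTANT=2; ADD folds the constants, leaving state PUSHK with FIRSTCONSTANT=3; PUSHK 5 gives PUSHKPUSHK again with SECONDCONSTANT=5; POPVAR X emits MOVE X,5 and drops back to state PUSHK; POPVAR Y emits MOVE Y,3 and leaves the state EMPTY. Six generated stack instructions come out as just two MOVEs.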
What this whole FSA scheme does is hide your pattern matching in the state transitions, and the response to matched patterns in the cases. You can code this by hand, and it is fast and relatively easy to code and debug. But when the number of cases gets large, you don't want to build such a state machine by hand. You can write a tool to generate this state machine for you; good background for this would be FLEX or LALR parser state machine generation. I'm not going to explain this here :-}

Related

Java switch smart enough to rearrange?

Given a hot piece of code that has a switch with many case options (all ending in break, i.e. they could be rearranged), will the JVM figure out the frequent entries and check them ahead of the others?
Frequency or likelihood of execution of individual cases doesn't come into it. The compiler will generate either:
a tableswitch instruction with an associated jump table that is indexed directly by the switch value, or
a lookupswitch instruction with a table of key/target pairs that can (typically) be binary-searched.
See the JVM Specification #3.10.
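As a quick illustration (class and method names are mine), compile the following and disassemble with javap -c: the dense case values produce a tableswitch, the sparse ones a lookupswitch.
class SwitchDemo {
    // Dense, consecutive case values: javac emits a tableswitch,
    // a jump table indexed directly by the switch value.
    static String dense(int i) {
        switch (i) {
            case 1:  return "one";
            case 2:  return "two";
            case 3:  return "three";
            default: return "?";
        }
    }
    // Sparse case values: javac emits a lookupswitch,
    // a sorted table of key/target pairs.
    static String sparse(int i) {
        switch (i) {
            case 1:     return "one";
            case 1000:  return "thousand";
            case 99999: return "lots";
            default:    return "?";
        }
    }
}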

Detection of Loops in Java Bytecode - Distinguishing back edge types

Background:
Before asking my question, I wish to state that I have checked the following links:
Identify loops in java byte code
goto in Java bytecode
http://blog.jamesdbloom.com/JavaCodeToByteCode_PartOne.html
I can detect the loops in the bytecode (class files) using a dominator-analysis-based approach: detecting back edges on the control-flow graph (https://en.wikipedia.org/wiki/Control_flow_graph).
My problem:
After detection of loops, you can end up having two loops (defined by two distinct back edges) sharing the same loop head. As far as I have seen, this can arise in two cases: (Case 1) the source code has a for or while loop with a continue statement; (Case 2) the source code has two loops, an outer do-while loop and an inner loop, with no instructions between them.
My question is the following: By only looking at the bytecode, how can you distinguish between these two cases?
My thoughts:
In a do-while loop (without any continue statements), you don't expect a goto that jumps back to the loop head, i.e., one that creates a back edge.
For a while or for loop (again without any continue statements), it appears that there can be a goto statement (I am not sure if there must be one). My compiler (a standard 1.7 compiler) generates this goto outside of the loop, not as a back edge, unlike what is described in the links above (this goto transfers control to the head of the loop, but not as a jump back from the end of the loop).
So my guess is: in the case of two back edges, if one of them is created by a goto statement, then there is only one loop in the source code and it includes a continue statement (Case 1). Otherwise, there are two loops in the source code (Case 2).
Thank you.
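For reference, the test the question relies on is the standard one: an edge u -> v is a back edge iff v dominates u. Assuming dominator sets are already computed, the check can be sketched in Java like this (names are illustrative):
import java.util.List;
import java.util.Map;
import java.util.Set;

class BackEdges {
    // dom.get(n)  = the set of nodes dominating n (by convention, including n itself)
    // succ.get(n) = the successors of n in the control-flow graph
    static void print(Map<Integer, Set<Integer>> dom, Map<Integer, List<Integer>> succ) {
        for (Map.Entry<Integer, List<Integer>> e : succ.entrySet()) {
            int u = e.getKey();
            for (int v : e.getValue()) {
                if (dom.get(u).contains(v)) {
                    // u -> v targets a dominator of u: a back edge, i.e. a loop with head v
                    System.out.println("back edge " + u + " -> " + v);
                }
            }
        }
    }
}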
When two loops are equivalent, all you can do is take the simplest one.
E.g. there is no way to tell the difference between while (true), do { } while (true), and for (;;).
If you have do { something(); } while (false), this loop might not appear in the byte code at all.
As Peter Lawrey already pointed out, there is no way to determine the source code form by looking at the bytecode. To name an example closer to your intention, the following single-loop code
do action(); while(condition1() || condition2());
produces exactly the same code as the nested loop
do do action(); while(condition1()); while(condition2());
Likewise the following loop
do {
action();
if(condition1()) continue;
break;
} while(condition2());
produces exactly the same code as
do action(); while(condition1() && condition2());
with current javac, whereas surprisingly
do {
action();
if(!condition1()) break;
} while(condition2());
does not, which only shows how much the exact form depends on compiler internals. The next version of javac might compile them differently.

Switch statement vs Ifs in Java

Herbert Schildt in "The Complete Reference" - Chapter 5:
A switch statement is usually more efficient than a set of nested ifs.
When the compiler compiles a switch statement, it inspects each of the case constants and creates a "jump table" that it will use for selecting the path of execution depending on the value of the expression. Therefore, if you need to select among a large group of values, a switch statement will run much faster than the equivalent logic coded using a sequence of if-elses. The compiler can do this because it knows that the case constants are all the same type and simply must be compared for equality with the switch expression. The compiler has no such knowledge of a long list of if expressions.
What does he mean by the term "jump table"?
Switch differs from if statements in that switch can test only for equality, whereas if can evaluate any type of boolean expression. If a switch(expression) does not match any of the case constants, it goes into the default case. Isn't that a case of inequality? That makes me think there is no big difference between switch and if.
From an extract in the same book, he wrote:
An if-then-else statement can test expressions based on ranges of values or conditions, whereas a switch statement tests expressions based only on a single integer, enumerated value, or String object.
Doesn't that make if more powerful than switch?
I have worked in different projects on Core Java, but never felt the need to make use of the switch statement, and also not seen other associates make use of it. Not even a single time. How come a much more powerful switch control statement fails to land itself ahead of if in terms of usability?
1) The compiler creates a table with a row for every case statement, where the case condition is the key and the block of code is the value. When the switch statement is executed in your code, the block of the matching case can be accessed directly via its key. In nested if blocks, by contrast, you need to check and execute every condition until you find the right one and enter its conditional block of code.
Think of it like a hashtable (for switch case) compared to a linear linked list (for nested if statements) and compare the time to look up a specific value. The lookup time of a table is O(1) (just get it in one operation) where the lookup time in the linked list is O(n) (look at every item until you get the right one).
3) Yes, an if statement can be used more flexibly and can express more, but when you want to switch between different code blocks for a list of different values, a switch block executes faster and is more readable.
4) You never need to use a switch, because nested ifs can express the same thing, but in some cases it is more elegant. And because the occasions where it is the better fit are rare, programmers forget about it and don't use it even in the spots where it would fit better.
2) Extract the difference from the answers to the other points.
I use switch all the time, if you aren't using it you're doing Java wrong.
A jump table can be thought of like an array of locations in the code. For example imagine the following:
case 0:
    // do stuff
    break;
case 1:
    // do stuff 2
    break;
case 2:
    // do stuff 3
    break;
Then you have a jump table just containing
codeLocations[] = { do stuff, do stuff 2, do stuff 3 }.
Then to find the correct bit of code for case 1 Java just needs to goto codeLocations[1]. No actual integer comparisons are needed.
Obviously the actual implementation is rather more complicated than this as it needs to handle much more varied cases but it gives the idea.
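A user-level Java analogy of the idea (this is not what the JVM actually does internally, just the shape of it): an array indexed by the switch value, so selecting the code is one indexed load rather than a chain of comparisons.
class JumpTableDemo {
    public static void main(String[] args) {
        Runnable[] codeLocations = {
            () -> System.out.println("do stuff"),
            () -> System.out.println("do stuff 2"),
            () -> System.out.println("do stuff 3"),
        };
        int value = 1;
        codeLocations[value].run(); // one indexed load, zero comparisons
    }
}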
Neither if nor switch is more powerful than the other. They are both tools. You're asking whether a hammer is more powerful than a screwdriver. Yes, a hammer will knock more things into the wood, but a screwdriver will do a better job if you are working with screws :)
To expand on the "doing java wrong" thing:
Switch is a tool in the Java language, one that has been added for a reason.
You could apply the exact same question to for loops. Everything you can do with a for loop you can also do with a while loop (although it will likely involve writing quite a bit more code). Try it sometime.
This doesn't make while "more powerful" or even "more useful" than for. They are different tools with different strengths and weaknesses.
switch and if are the same as for and while. They do similar jobs and in a lot of places you could use either. However there are clear examples where one is preferable to the other.
If you are programming in a language while ignoring one of the fundamental tools in that language's toolkit, then you are using that language wrong. This is not some obscure feature that hardly ever gets used; this is one of the primary flow control statements in the language.
If a carpenter only ever uses a hammer and never uses a screwdriver... they are doing it wrong. Equally if a programmer never uses a significant language feature...they are doing it wrong.
A proverb applies here "If the only tool you have is a hammer, everything looks like a nail".
1: A jump table is a table of jump targets. A jump target is an address in the code that corresponds to the code belonging to a case in the switch. If your switch cases are, for a simple example, 0, 1, ... N, then the N + 1 targets would be at index 0, 1, ... N of the table, respectively. You just calculate the value you're switching on and then look up the target by indexing into the table, not by comparing N + 1 times. That's the efficiency.
2: default is a case of inequality, yes, but very limited, as you note. It's the case for "not equal to any of the other cases", but it can't be "less than a given value" or "greater than...".
3: Yes, if is more powerful. However, switch is more efficient, sometimes. Efficiency != power.
4: Power != efficiency. But, to answer the implied question: the more efficient switch statement is not used because your colleagues have not used it. You have to ask them why. Maybe it's a matter of taste, or of ignorance of the advantages of switch. The greater efficiency of a switch statement is almost always going to be unnoticeable, unless you are in a very performance-sensitive context, so people are generally unaware of it. If performance is not paramount, then readability is. In some cases a developer might consider a switch statement and decide that a series of if statements is easier to read and understand.
1) In a switch-case, a jump table maps a position in the code segment to a particular case. For instance, consider this snippet:
int a;
// a is manipulated here
switch (a) {
    case 1: // address 0
        // ...
        break;
    case 2: // address 1
        // ...
        break;
    case 3: // address 2
        // ...
        break;
    default: // other cases
        // ...
        break;
}
There would be a jump table mapping a=1 to the effective code address marked by address 0, a=2 to address 1, and so on. This can become more apparent when observing the resulting bytecode. With this approach, the number of conditional jumps in the bytecode becomes much smaller than in the case of if-else statements.
2) It's true that if-else statements can test for inequality, but achieving the same behavior as a switch-case statement involves chaining if-else statements. Here is a behaviorally equivalent version of the above.
if (a==1) {
    // ...
} else if (a==2) {
    // ...
} else if (a==3) {
    // ...
} else {
    // for any other cases
}
It could be a matter of personal opinion at some point, but a switch-case statement is more readable in this particular case.
But naturally, if-else statements are still indispensable, and you are expected to use the right tool for the job. If-else is more versatile than switch-case, but the latter may produce nicer code in some places. Hopefully, this answers the 3rd and 4th part of your question.
In layman's terms, the difference is negligible.
Both have their pros and cons. In most cases you would not use switch; instead you would use a hashmap of actions or some other construct that has better readability.
I would recommend you read Effective Java or https://code.google.com/p/guava-libraries/wiki/GuavaExplained
In C# you could write it like this:
class MyClass
{
    static Dictionary<string, Func<MyClass, string, bool>> commands
        = new Dictionary<string, Func<MyClass, string, bool>>
    {
        { "Foo", (@this, x) => @this.Foo(x) },
        { "Bar", (@this, y) => @this.Bar(y) }
    };
    public bool Execute(string command, string value)
    {
        return commands[command](this, value);
    }
    public bool Foo(string x)
    {
        return x.Length > 3;
    }
    public bool Bar(string x)
    {
        return x == "";
    }
}
After that you could have
var item = new MyClass();
item.Execute("Foo","ooF");
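Since the question is about Java, the same dispatch-table idea works there too with a map of lambdas (a sketch; requires Java 9+ for Map.of, and the names are mine):
import java.util.Map;
import java.util.function.BiPredicate;

class MyClassJava {
    private static final Map<String, BiPredicate<MyClassJava, String>> COMMANDS = Map.of(
            "Foo", (self, x) -> self.foo(x),
            "Bar", (self, y) -> self.bar(y));

    public boolean execute(String command, String value) {
        return COMMANDS.get(command).test(this, value);
    }

    public boolean foo(String x) { return x.length() > 3; }
    public boolean bar(String x) { return x.isEmpty(); }
}
After that, new MyClassJava().execute("Foo", "ooF") works just like the C# version.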

Is there a programming language with better approach for switch's break statements?

It's the same syntax in way too many languages:
switch (someValue) {
    case OPTION_ONE:
    case OPTION_LIKE_ONE:
    case OPTION_ONE_SIMILAR:
        doSomeStuff1();
        break; // EXIT the switch
    case OPTION_TWO_WITH_PRE_ACTION:
        doPreActionStuff2();
        // the default is to CONTINUE to next case
    case OPTION_TWO:
        doSomeStuff2();
        break; // EXIT the switch
    case OPTION_THREE:
        doSomeStuff3();
        break; // EXIT the switch
}
Now, as you all know, break statements are required, because the switch will continue into the next case when a break statement is missing. We have an example of that with OPTION_LIKE_ONE, OPTION_ONE_SIMILAR and OPTION_TWO_WITH_PRE_ACTION. The problem is that we need this "skip to next case" behavior very, very rarely, and very often we put break at the end of a case.
It's very easy for a beginner to forget. One of my C teachers even explained it to us as if it were a bug in the C language (I don't want to get into that :)
I would like to ask if there are any other languages that I don't know of (or forgot about) that handle switch/case like this:
switch (someValue) {
    case OPTION_ONE: continue; // CONTINUE to next case
    case OPTION_LIKE_ONE: continue; // CONTINUE to next case
    case OPTION_ONE_SIMILAR:
        doSomeStuff1();
        // the default is to EXIT the switch
    case OPTION_TWO_WITH_PRE_ACTION:
        doPreActionStuff2();
        continue; // CONTINUE to next case
    case OPTION_TWO:
        doSomeStuff2();
        // the default is to EXIT the switch
    case OPTION_THREE:
        doSomeStuff3();
        // the default is to EXIT the switch
}
The second question: is there any historical reason why we have the current break approach in C? Maybe "continue to the next case" was used far more often than we use it these days?
From this article, I can enumerate some languages that don't require a break-like statement:
Ada (no fallthrough)
Eiffel (no fallthrough)
Pascal (no fallthrough)
Go - fallthrough
Perl - continue
Ruby (no fallthrough)
VB, VBA, VBS, VB.NET (no fallthrough)
To be continued by someone else...
Your second question is pretty interesting. Assuming only C, I believe this decision keeps the language cohesive: since break is a jump, it must be written explicitly.
Scala pattern matching is, I think, a huge improvement in these cases. :)
object MatchTest2 extends Application {
  def matchTest(x: Any): Any = x match {
    case 1 => "one"
    case "two" => 2
    case y: Int => "scala.Int"
  }
  println(matchTest("two"))
}
Sample from scala-lang.org
And VB .NET handles it a little more like how you expect it should work.
Select Case i
    Case 1 To 3
        DoStuff(i)
    Case 4, 5, 6
        DoStuffDifferently(i)
    Case Is >= 7
        DoStuffDifferentlyRedux(i)
    Case Else
        DoStuffNegativeNumberOrZero(i)
End Select
There is no fall-through at all, short of using a Goto.
Here is the answer:
http://en.wikipedia.org/wiki/Switch_statement
It's called fall-through (continue in your example), and it exists in the following languages:
Go, Perl, C#
In C# it won't compile without a break or goto case statement (except when a case is empty, i.e. there is no pre-action).
Pascal doesn't have fall-through.
I think the answer to your question of why it is this way centers around two behaviors, both having to do with the generated assembly code from the C source.
The first is that in assembly, the current instruction is executed, and unless there is a jump or some other flow-control instruction, the instruction at the next address will be executed. Performing a naive compile of the switch statement to assembly would generate code that just starts executing at the first instruction, checking each case in turn for a matching condition...
The second related reason is the notion of a branch table or jump list. Basically the compiler can take what it knows about your value, and create some extremely efficient machine code for the same thing. Take for example a simple function like atoi that converts a string representation of a number and returns it in integer form. Simplifying things way down to support just a single digit, you could write some code similar to this:
int atoi(char c) {
    switch (c) {
        case '0': return 0;
        case '1': return 1;
        // ....
    }
}
The naive compiler would perhaps just convert that to a series of if/then blocks, meaning a substantial number of CPU cycles would be taken for the number 9, while 0 would return almost immediately. Using a branch table, the compiler could emit some [pseudo] assembly that would immediately "jump" to the correct return clause:
0x1000 # stick value of c in a register
0x1004 # jump to address c + calculated offset
       # example: '0' would be 0x30, and the offset for this sample
       # would always be 0x0FD8... thus 0x30 + 0x0FD8 = 0x1008
0x1008 # return 0
Apology: my assembly and C skills are quite rusty. I hope this helps clarify things.
Hey, don't forget COBOL's EVALUATE:
EVALUATE MENU-INPUT
    WHEN "0"          PERFORM INIT-PROC
    WHEN "1" THRU "9" PERFORM PROCESS-PROC
    WHEN "R"          PERFORM READ-PARMS
    WHEN "X"          PERFORM CLEANUP-PROC
    WHEN OTHER        PERFORM ERROR-PROC
END-EVALUATE.
Ada doesn't have fall-through, and requires that all values be explicitly handled, or an "others" clause added to handle the rest.
The SQL CASE statement also does not fall through.
XSLT has <xsl:choose>, which does not fall through.
It seems to be C and its derived languages that have fall-through. It's quite insidious, and the only real use I've seen is in implementing Duff's device.
http://www.adaic.org/whyada/intro4.html
Python doesn't have a switch statement at all.
It took some getting used to, but I have some horrendous memories of hunting through massive switch blocks back in my C# days. I'm much happier without it.
Although not exactly what you asked for, Groovy has a very powerful switch statement.
The OP talks about "fall through", but I have very seldom been bitten by that.
Many, many times, however, I have been bitten by designs that are non-extensible; to wit, "switch (kbHit)" statements with a few hundred keys in them that are a maintenance nightmare and a frequent home for "god methods" and giant piles of spaghetti code.
Using switch is often a sign of poor object-oriented programming. As another person answered, "2 uses of switch in 48 source files" in one of his applications shows a programmer who does not rely heavily on this construct. From his metric, I surmise that he is probably at least a good structured programmer and probably understands OOP/OOD as well.
OOP (not necessarily only C++) programmers, and even pure C users who do not have an object description technique forced upon them, could implement an "inversion of control" container that publishes a "key was hit" event and allows subscribers to plug in their handlers for "on keyboard code x". This can make your code much easier to read.
Pure speculation, but:
I occasionally write C or Java in which I say something like:
switch (tranCode)
{
    case 'A':
    case 'D':
    case 'R':
        processCredit();
        break;
    case 'B':
    case 'G':
        processDebit();
        break;
    default:
        processSpecial();
}
That is, I deliberately use fall-through to let several values fire the same operation.
I wonder if this is what the inventors of C were thinking of when they created the switch statement, that this would be the normal usage.
Tcl doesn't automatically fall through.
In object-oriented languages you use the Chain of Responsibility pattern. How it is implemented can vary. What you are describing in your example is mixing a state machine with behavior by abusing a switch statement. For your particular example, the appropriate implementation would be a Chain of Responsibility where the parameter the chain evaluates is a State pattern object that mutates as it goes down the chain.
There are too many languages for me to say for sure that no such language exists, provided it is a "derivative" of C in syntax; languages using different syntax, where a case does not naturally "continue", certainly exist, e.g. Fortran. I don't know of any language that uses an explicit "continue" to continue into the following case.
I believe the reason is historical, due to the way such a case could be programmed at a "low level". Moreover, the syntactical aspect of a case is that of a label, and break works like it does in loops, so you can imagine an equivalent like this:
if ( case == 1 ) goto lab1;
if ( case == 2 ) goto lab2;
if ( case == 3 ) goto lab3;
//...
default:
    // default code
    goto switch_end;
lab1:
    // do things
    goto switch_end; // if break is present
lab2:
    // do things, and fall through to lab3
lab3:
    // lab3 stuff
    goto switch_end;
//...
switch_end: // past all labels
More languages without a fallthough:
XSLT
JSTL
Algol
PL/1
Lima does it like this:
if someValue == ?1 ::
OPTION_ONE: fall
OPTION_LIKE_ONE: fall
OPTION_ONE_SIMILAR:
doSomeStuff1[]
;; the default is to EXIT the switch
OPTION_TWO_WITH_PRE_ACTION:
doPreActionStuff2[]
fall ;; fall through to the next case
OPTION_TWO:
doSomeStuff2[]
OPTION_THREE:
doSomeStuff3[]

Finite State Machine program

I am tasked with creating a small program that can read in the definition of an FSM from input, read some strings from input, and determine whether those strings are accepted by the FSM based on the definition. I need to write this in C, C++ or Java. I've scoured the net for ideas on how to get started, but the best I could find was a Wikipedia article on automata-based programming. The C example provided seems to be using an enumerated list to define the states; that's fine if the states are hard-coded in advance. Again, I need to be able to actually read in the number of states and the definition of what each state is supposed to do. Any suggestions are appreciated.
UPDATE:
I can make the alphabet small (e.g. { a b }) and adopt other conventions, such as: the start state is always state 0. I'm allowed to impose reasonable restrictions on the number of states, e.g. no more than 10.
Question summary:
How do I implement an FSA?
First, get a list of all the states (N of them) and a list of all the symbols (M of them). Then there are two ways to go, interpretation or code generation:
Interpretation. Make an NxM matrix, where each element of the matrix is filled in with the corresponding destination state number, or -1 if there is none. Then just keep a state variable, initialized to the start state, and start processing input. If you get to state -1, you fail. If you run out of input symbols without getting to the success state, you fail. Otherwise you succeed.
Code generation. Print out a program in C or your favorite compiler language. It should have an integer state variable initialized to the start state. It should have a for loop over the input characters, containing a switch on the state variable. You should have one case per state, and at each case, have a switch statement on the current character that changes the state variable.
If you want something even faster than 2, and something that is sure to get you flunked (!), get rid of the state variable and instead use goto :-) If you flunk, you can comfort yourself in the knowledge that that's what compilers do.
P.S. You could get your F changed to an A if you recognize loops etc. in the state diagram and print out corresponding while and if statements, rather than using goto.
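A minimal Java sketch of option 1 (interpretation), under the conventions from the update: states numbered from 0, state 0 is the start, alphabet { a b }. All names here are illustrative.
class FsmInterpreter {
    // next[state][symbol] = destination state, or -1 if there is none
    static boolean accepts(int[][] next, boolean[] accepting, String input) {
        int state = 0; // convention: the start state is state 0
        for (char c : input.toCharArray()) {
            int sym = c - 'a'; // map 'a' -> 0, 'b' -> 1
            if (sym < 0 || sym >= next[state].length) return false;
            state = next[state][sym];
            if (state == -1) return false; // no transition: fail
        }
        return accepting[state]; // succeed only if we end in an accept state
    }

    public static void main(String[] args) {
        // Example: accepts strings over {a, b} that end in 'b'
        int[][] next = { {0, 1}, {0, 1} };
        boolean[] accepting = { false, true };
        System.out.println(accepts(next, accepting, "aab")); // true
        System.out.println(accepts(next, accepting, "aba")); // false
    }
}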
One non-hardcoded way to represent an automaton is as a transition matrix, which records, for each current state and each input character, what the next state is.
You haven't actually asked a question. You'll get more and better help if you have a specific question for a specific task (but still give the overall goal). The question should be narrow in scope (e.g. not "How can I implement an FSA?").
As for how to represent an FSA (which seems to be what you're having difficulties with), read on.
Start by considering the definition of an FSM: it's an alphabet Σ, a set of states S, a start state s0, a set of accept states A, and a transition function δ from a state and a symbol to a state. You have to be able to determine these properties from the input. Any states not reachable by the transition function can be dropped to produce an equivalent FSM. The minimal set of states and alphabet are thus implicit in the transition function; you could make your FSM easier to use (and harder to implement, but not much harder) by not requiring either Σ or S in the input.
You don't need to use the same representation for states that the input uses. You could use unsigned integers for your internal representation, as long as you have a map from integers to strings and strings to integers so you can convert between the internal representation and external representation. This way, your transition function can be stored as an array, so the transition step can be performed in constant time.
A simpler approach would be to use the external representation as your internal representation. With this option, the transition function would be stored as a map from strings and symbols to strings. The transition step would probably be O(log(|S|+|Σ|)), given the performance of most map data structures. If symbols are represented as integers (e.g. chars), the transition function could be represented as a map from strings to an array of strings, giving O(log(|S|)) performance.
Yet another option, modeled after the graph view of an FSM, is to create a class for states. A state has a name (the external representation). States are responsible for transitions; send a symbol to a state and get back another state.
class State {
    String name; // the external representation
    State transition(Symbol s) { ... }
    void setTransition(Symbol s, State to) { ... }
}
Store the set of states as a map from names to states.
There you go, three different places to start, each with a different way to represent states.
Stop thinking about everything at once. Do one thing at a time:
- come up with a language for the state machine
- come up with a language for the stimulus
- create a sample file of one state machine in that language
- create a sample file of stimulus
- come up with a class for state
- come up with a class for transition
- come up with a class for the state machine, as a set of states and transitions
- add a method to the state class to handle violations
- code a little parser for the state-machine language
- code another parser for the stimulus language
- initial state
- some output, like a WriteLn here and there
- main method
- compile
- run
- debug
- done
The way the OpenFst toolkit does it is: an FSM has a vector of states, each of which has a vector of arcs. Each arc has an input (and output) label, a target state ID, and a weight. You could take a look at the code; maybe it will inspire you.
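A rough Java transliteration of that layout (simplified; field names are mine, not OpenFst's actual API):
import java.util.ArrayList;
import java.util.List;

class Arc {
    int inputLabel;  // symbol ID consumed
    int outputLabel; // symbol ID produced
    int nextState;   // target state ID
    double weight;
}

class Fst {
    int startState;
    List<List<Arc>> arcs = new ArrayList<>(); // arcs.get(s) = the arcs leaving state s
}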
If you're using an object-oriented language like Java or C++, I'd recommend that you start with objects. Before you worry about file formats and the like, get a good object model for a finite state automaton and how it behaves. How will you represent states, transitions, events, etc.? Will your FSA be a Composite? Once you have that sort of thing working, you can get the file formats right. Anything will do: XML, text, etc.
