Syntax analysis question

In school we were assigned to design a language and then implement it (I'm having so much fun implementing it =)). My teacher told us to use yacc/lex, but I decided to go with Java and its regex API. Here is how the language I designed looks:
Program "my program"
var yourName = read()
if { equals("guy1" to yourName) }
print("hello my friend")
else
print("hello extranger")
end
Program End
Well, as you can see, it's a pretty basic language =).
I thought I could implement it in a very OOP fashion: make an abstract class Sentence, have subclasses like VariableAssignment, IfSentence, etc., and have a class Program which is just a bunch of Sentences. Then I would call an abstract method eval on every Sentence. So my initial approach to compiling the language consists of only two phases:
Identify the syntax of each line
Create the corresponding class for each line
Of course, if something goes wrong in either phase I could raise an error.
My question is, am I doing it wrong? Should I go over all phases like the theory says (lexical, syntactical, semantical)? Should I continue with my naive two-phase compiler?

I won't ask the obvious question of why you're not following the advice of your instructor and using yacc/lex because I know the answer. You wanted to go off and do something that you thought was cool and would help you learn. Unfortunately, yacc/lex was recommended by your professor because, as another poster stated, a lot of very smart people before you have explored multiple approaches and spent vast quantities of time trying to find a good solution.
You can make a two-phase compiler work, but you will need to accept that it will never be as good as going through the full process because it's harder to detect errors. A lot harder, in fact. In some cases, you won't even be able to tell that there's an error until it's too late, i.e. the program is already compiled and you are attempting to run it.
If you want to learn a lot more about it, go with the two phase approach and you will run into the same problems that the people before you ran into. Just be sure to understand that it will take you a lot longer to get to a final solution, you might be docked points on your project, and it might not work right.
That said, you're going to learn more about it than anyone else in the class. If you have the time to spare, I'd do it the way you are now. The knowledge might come in handy down the road. I would also talk to your professor and tell him that you're going to do it another way against his recommendations because you want to have a more thorough understanding. Perhaps he won't knock points off from your project for being ambitious, even if it turns out wrong.
After all, the point of doing projects in college is to learn.

A lot of smart people have thought about this and, from what I take from your post, they came to the conclusion that all the phases are needed.
So if you want your compiler to work, go the way the theory dictates.
If you want to understand why it dictates the phases, try the shortcut. It will probably take a lot longer.
Disclaimer: I have no idea about compiler theory
Another note: You have a problem; You decide to solve it using regexps; Now you have two problems

If you use regexes to parse each line your language would have a very limited syntax.
You would not be able to parse each line using just a regular expression API if your syntax becomes more complex. Even the if { equals("guy1" to yourName) } would become impossible to parse with regexes if you start adding AND and OR operators, and what would happen if you start supporting escape characters like \n in your string literals?
The Java Regex API would be able to help you with the lexical analyzer, but you would have to write the parser from there. You could take one of several approaches:
If you're using Java, you could look at ANTLR (which removes the need to write a lexical analyzer with Java's regex library), or
You could write a recursive descent parser by hand
among others
(also, "Statement" is a synonym for "Sentence" that is more common in compiler texts)
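As a rough illustration of the split described above (the regex API for the lexer, a hand-written recursive descent parser on top), here is a minimal sketch. The AND/OR keywords, the grammar in the comments, and the class name ConditionParser are assumptions made for this example and are not part of the original language design; escape sequences are tokenized correctly but not decoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConditionParser {
    // Lexer: one token is a string literal (escapes allowed), a word, or a parenthesis.
    private static final Pattern TOKEN =
        Pattern.compile("\\s*(\"(?:\\\\.|[^\"\\\\])*\"|[A-Za-z_]\\w*|[()])");

    private final List<String> tokens = new ArrayList<>();
    private final Map<String, String> vars;
    private int pos = 0;

    private ConditionParser(String source, Map<String, String> vars) {
        this.vars = vars;
        String src = source.trim();
        Matcher m = TOKEN.matcher(src);
        int i = 0;
        while (i < src.length()) {
            m.region(i, src.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at offset " + i);
            }
            tokens.add(m.group(1));
            i = m.end();
        }
    }

    static boolean evaluate(String source, Map<String, String> vars) {
        ConditionParser p = new ConditionParser(source, vars);
        boolean result = p.parseCondition();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return result;
    }

    // condition := andExpr ("OR" andExpr)*
    private boolean parseCondition() {
        boolean value = parseAnd();
        while (peekIs("OR")) {
            pos++;
            boolean rhs = parseAnd(); // always parse the right side so its tokens are consumed
            value = value || rhs;
        }
        return value;
    }

    // andExpr := primary ("AND" primary)*
    private boolean parseAnd() {
        boolean value = parsePrimary();
        while (peekIs("AND")) {
            pos++;
            boolean rhs = parsePrimary();
            value = value && rhs;
        }
        return value;
    }

    // primary := "(" condition ")" | "equals" "(" operand "to" operand ")"
    private boolean parsePrimary() {
        if (peekIs("(")) {
            pos++;
            boolean value = parseCondition(); // recursion handles arbitrary nesting
            expect(")");
            return value;
        }
        expect("equals");
        expect("(");
        String left = parseOperand();
        expect("to");
        String right = parseOperand();
        expect(")");
        return left.equals(right);
    }

    // operand := string literal | variable name
    private String parseOperand() {
        String t = next();
        if (t.startsWith("\"")) {
            return t.substring(1, t.length() - 1); // escapes are left undecoded in this sketch
        }
        String value = vars.get(t);
        if (value == null) {
            throw new IllegalArgumentException("Undefined variable: " + t);
        }
        return value;
    }

    private boolean peekIs(String s) {
        return pos < tokens.size() && tokens.get(pos).equals(s);
    }

    private String next() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        return tokens.get(pos++);
    }

    private void expect(String s) {
        String t = next();
        if (!t.equals(s)) {
            throw new IllegalArgumentException("Expected " + s + " but found " + t);
        }
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("yourName", "guy1");
        String src = "equals(\"guy1\" to yourName) AND (equals(\"a\" to \"b\") OR equals(\"x\" to \"x\"))";
        System.out.println(evaluate(src, vars));
    }
}
```

The regex only recognizes individual tokens; the nesting of parentheses is handled by parsePrimary calling parseCondition recursively, which is the part a single regular expression per line cannot do.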

If you want to use only regular expressions to parse your language, your language can only be regular. This is a big restriction. For example, arbitrarily deep nesting would be impossible, as you would have to teach your parser each nesting combination separately. I am not sure if building a Turing-complete regular language is even possible.

If you really want to get your hands dirty, code a recursive descent parser. If you want to understand compiler theory, use ANTLR and concentrate on the principles, leaving the implementation to the parser generator.
BTW, why would you want to complicate your life with regex?!

Related

grammar compiler compiler for Java

My company is trying to write some software for Android. We would like to work with Java, and there is a component of the company's software that is c++ and so needs to be ported (or at least porting needs to be tried before trying NDK stuff). This code was created using Accent, and it defines a grammar grammar. As near as I can tell, the original writer (now gone) wrote a grammar to specify how to specify a grammar, then compiled a compiler-compiler with that grammar and Accent. The compiler-compiler takes a grammar of the specified format and produces a binary code to parse strings conforming to that grammar. Here's an example snippet of the grammar:
//include rules from this file (such as <alpha>)
include "alphabet.bnf"
<<topSymbol>> = <alpha> <alpha> <alpha>? .//two letters with an optional third
//square brackets enclose an XML statement clarifying semantics of the rule
[
<topSymbol>
<letter>
<command val="doSomethingToLetter"/>
</letter>
<!--etc.-->
</topSymbol>
]
My question is how to do this with Java, using ANTLR or some other tool. A compiler-compiler-compiler seems rather complicated to me. Alternatively, I would like to know how to easily compile/parse this type of grammar, which contains both grammatical and semantic XML information.

If the original designer knew what he was doing, and it is warranted, then you want to preserve that concept. Going with another parser generator (or at least a parsing scheme of some kind) is the right approach. Either JavaCC or ANTLR would be fine as parser generators; you'll have to hand-translate the grammar. You might hand code a recursive descent parser if the grammar is simple enough.
If the original designer was simply over the top, then you can probably replace the grammar-driven aspect, but you won't be able to do that without understanding what he was achieving. The fact that this "seems rather complicated to me" suggests you don't really understand parsing/parser generator technology, and you are driven by a desire to do something you understand rather than preserve something you don't. But it's a bad idea to tear apart something that is well designed and implemented just because you don't understand it. I strongly suggest you learn more about these kinds of technologies, and ask why it was implemented this way. Ultimately you may be right and should replace his approach with something else, but make that choice based on knowledge, not fear.

My question is how to do this with Java, using Antlr or some other tool. A compiler-compiler-compiler seems rather complicated to me.
It sounds complicated to me too!
Alternatively, I would like to know how to easily compile/parse this type of grammar, which contains both grammatical and semantic XML information.
No ... there is no easy answer to this. It sounds like your ex-colleague has gone over the top on the complexity front. You are going to have to:
either get your head around what his code does, and how it does it, learn how Antlr works, and hand translate,
or ditch his code AND design and find a simpler way to do what it is doing.
Good luck!
(Actually, there is a good chance that the code is not as complicated as it seems ... once you get your head around it, and compiler-compiler technology.)

Your best bet is to translate the grammar you have into ANTLR or Java CC or some other tool.
Another possibility is to call your C++ code using JNI, but that's fraught with peril.
I'm not aware of anything that can help. You'll just have to get a shovel and start digging.

PHP to Java (using PtoJ)

I would like to transition our codebase from poorly written PHP code to poorly written Java, since I believe Java code is easier to tidy up. What are the pros and cons, and for those who have done it yourselves, would you recommend PtoJ for a project of about 300k ugly lines of code? Tips and tricks are most welcome; thanks!

Poorly written PHP is likely to be very hard to convert because a lot of the bad stuff in PHP just doesn't exist in Java (the same is true vice versa though, so don't take that as me saying Java is better - I'm going to keep well clear of that flame-war).
If you're talking about a legacy PHP app, then it's highly likely that your code contains a lot of procedural code and inline HTML, neither of which is going to convert well to Java.
If you're really unlucky, you'll have things like eval() statements, dynamic variable names (using $$ syntax), looped include() statements, reliance on the 'register_globals' flag, and worse. That kind of stuff will completely thwart any conversion attempt.
Your other major problem is that debugging the result after the conversion is going to be hell, even if you have beautiful code to start with. If you want to avoid regressions, you will basically need to go through the entire code base on both sides with a fine comb.
The only time you're going to get a satisfactory result from an automated conversion of this type is if you start with a reasonably tidy code base, written at least mainly in up-to-date OOP code.
In my opinion, you'd be better off doing the refactoring exercise before the conversion. But of course, given your question, that would rather defeat the point. Therefore my recommendation is to stick with PHP. PHP code can be very good, and even bad PHP can be polished up with a bit of refactoring.
[EDIT]
In answer to @Jonas's question in the comments, 'what is the best way to refactor horrible PHP code?'
It really depends on the nature of the code. A large monolithic block of code (which describes a lot of the bad PHP I've seen) can be very hard (if not impossible) to write unit tests for. You may find that functional tests are the only kind of tests you can write on the old code base. These would use Selenium or similar tools to run the code through the browser as if it were a user. If you can get a set of reliable functional tests written, it helps you remain confident that you aren't introducing regressions.
The good news is that it can be very easy - and satisfying - to rip apart bad code and rebuild it.
The way I've approached it in the past is to take a two-stage approach.
Stage one rewrites the monolithic code into decent quality procedural code. This is relatively easy, and the new code can be dropped into place as you go. This is where the bulk of the work happens, but you'll still end up with procedural code. Just better procedural code.
Stage two: once you've got a critical mass of reasonable quality procedural code, you can then refactor it again into an OOP model. This has to wait until later, because it is typically quite hard to convert old bad quality PHP straight into a set of objects. It also has to be done in fairly large chunks because you'll be moving large amounts of code into objects all at once. But if you did a good job in stage one, then stage two should be fairly straightforward.
When you've got it into objects, then you can start seriously thinking about unit tests.

I would say that automatic conversion from PHP to Java has the following pros and cons.
pros:
quick and dirty, possibly making happy some project manager concerned with short-time delivery (assuming that you're lucky and the automatically generated code works without too much debugging, which I doubt)
cons:
ugly code: I doubt that automatic conversion from ugly PHP will generate anything but ugly Java
unmaintainable code: the automatically generated code is likely to be unmaintainable, or, at least, very difficult to maintain
bad approach: I assume you have a PHP Web application; in this case, I think that the automatic translation is unlikely to use Java best practices for Web application, or available frameworks
In summary
I would avoid automatic translation from PHP to Java, and I would at least consider rewriting the application from the ground up using Java. Especially if you have a Web application, choose a good Java framework for webapps, do a careful design, and proceed with an incremental implementation (one feature of your original PHP webapp at a time). With this approach, you'll end up with cleaner code that is easier to maintain and evolve ... and you may find that the time required is not that much greater than what you'd need to clean up and debug automatically generated code :)

P2J appears to be offline now, but I've written a proof-of-concept that converts a subset of PHP into Java. It uses the transpiler library for SWI-Prolog:
:- use_module(library(transpiler)).
:- set_prolog_flag(double_quotes,chars).
:- initialization(main).
main :-
Input = "function add($a,$b){ print $a.$b; return $a.$b;} function squared($a){ return $a*$a; } function add_exclamation_point($parameter){return $parameter.\"!\";}",
translate(Input,'php','java',X),
atom_chars(Y,X),
writeln(Y).
This is the program's output:
public static String add(String a,String b){
System.out.println(a+b);
return a+b;
}
public static int squared(int a){
return a*a;
}
public static String add_exclamation_point(String parameter){
return parameter+"!";
}

In contrast to other answers here, I would agree with your strategy to convert "PHP code to poorly written Java, since I believe Java code is easier to tidy up", but you need to make sure the tool that you are using doesn't introduce more bugs than you can handle.
An optimum strategy would be:
1) Do automated conversion
2) Get an MVP running with some basic tests
3) Start using the amazing Eclipse/IntelliJ refactoring tools to make the code more readable.
A modern Java IDE can refactor code with zero bugs when done properly. It can also tell you which functions are never called and a lot of other inspections.
I don't know how "PtoJ" was, since their website has vanished, but you ideally want something that doesn't just translate the syntax, but the logic. I used php2java.com recently and it worked very well. I've also used various "syntax" converters (not just for PHP to Java, but also ObjC -> Swift, Java -> Swift), and even they work just fine if you put in the time to make things work after.
Also, found this interesting blog entry about what might have happened to numiton PtoJ (http://www.runtimeconverter.com/single-post/2017/11/14/What-happened-to-numition).

http://www.numiton.com/products/ntile-ptoj/translation-samples/web-and-db-access/mysql.html
Wouldn't you rather use Hibernate?

"Cosmetic" clean-up of old, unknown code. Which steps, which order? How invasive?

When I receive code I have not seen before to refactor it into some sane state, I normally fix "cosmetic" things (like converting StringTokenizers to String#split(), replacing pre-1.2 collections by newer collections, making fields final, converting C-style arrays to Java-style arrays, ...) while reading the source code I have to get familiar with.
Are there many people using this strategy (maybe it is some kind of "best practice" I don't know?) or is this considered too dangerous, and not touching old code if it is not absolutely necessary is generally preferred? Or is it more common to combine the "cosmetic cleanup" step with the more invasive "general refactoring" step?
What are the common "low-hanging fruits" when doing "cosmetic clean-up" (vs. refactoring with more invasive changes)?

In my opinion, "cosmetic cleanup" is "general refactoring." You're just changing the code to make it more understandable without changing its behavior.
I always refactor by attacking the minor changes first. The more readable you can make the code quickly, the easier it will be to do the structural changes later - especially since it helps you look for repeated code, etc.
I typically start by looking at code that is used frequently and will need to be changed often, first. (This has the biggest impact in the least time...) Variable naming is probably the easiest and safest "low hanging fruit" to attack first, followed by framework updates (collection changes, updated methods, etc). Once those are done, breaking up large methods is usually my next step, followed by other typical refactorings.

There is no right or wrong answer here, as this depends largely on circumstances.
If the code is live, working, undocumented, and contains no testing infrastructure, then I wouldn't touch it. If someone comes back in the future and wants new features, I will try to work them into the existing code while changing as little as possible.
If the code is buggy, problematic, missing features, and was written by a programmer that no longer works with the company, then I would probably redesign and rewrite the whole thing. I could always still reference that programmer's code for a specific solution to a specific problem, but it would help me reorganize everything in my mind and in source. In this situation, the whole thing is probably poorly designed and it could use a complete re-think.
For everything in between, I would take the approach you outlined. I would start by cleaning up everything cosmetically so that I can see what's going on. Then I'd start working on whatever code stood out as needing the most work. I would add documentation as I understand how it works so that I will help remember what's going on.
Ultimately, remember that if you're going to be maintaining the code now, it should be up to your standards. Where it's not, you should take the time to bring it up to your standards - whatever that takes. This will save you a lot of time, effort, and frustration down the road.

The lowest-hanging cosmetic fruit is (in Eclipse, anyway) shift-control-F. Automatic formatting is your friend.

The first thing I do is try to hide as much as possible from the outside world. If the code is crappy, most of the time the guy who implemented it did not know much about data hiding and the like.
So my advice, first thing to do:
Turn as many members and methods private as you can without breaking the compilation.
As a second step I try to identify the interfaces. I replace the concrete classes with the interfaces in all methods of related classes. This way you decouple the classes a bit.
Further refactoring can then be done more safely and locally.
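The two steps above can be sketched as follows. All names here (CustomerStore, InMemoryCustomerStore, Registration) are hypothetical and only serve to show the shape of the change: state and helpers become private, and the collaborating class depends on an interface instead of the concrete class.

```java
import java.util.ArrayList;
import java.util.List;

// Step two: the interface that callers are allowed to see.
interface CustomerStore {
    void add(String name);
    int count();
}

class InMemoryCustomerStore implements CustomerStore {
    // Step one: this was imagined as a public field; now it is private and final.
    private final List<String> names = new ArrayList<>();

    @Override
    public void add(String name) {
        names.add(normalize(name));
    }

    @Override
    public int count() {
        return names.size();
    }

    // Helper narrowed to private because nothing outside the class needs it.
    private String normalize(String name) {
        return name.trim();
    }
}

class Registration {
    // Depends on the interface, not on InMemoryCustomerStore.
    private final CustomerStore store;

    Registration(CustomerStore store) {
        this.store = store;
    }

    void register(String name) {
        store.add(name);
    }

    public static void main(String[] args) {
        CustomerStore store = new InMemoryCustomerStore();
        new Registration(store).register("  Alice ");
        System.out.println("customers: " + store.count());
    }
}
```

If making something private breaks compilation, the compiler has just pointed out a real external dependency, which is useful information in its own right.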

You can buy a copy of Refactoring: Improving the Design of Existing Code from Martin Fowler, you'll find a lot of things you can do during your refactoring operation.
Plus, you can use the tools provided by your IDE and other code analyzers such as FindBugs or PMD to detect problems in your code.
Resources :
www.refactoring.com
wikipedia - List of tools for static code analysis in java
On the same topic :
How do you refactor a large messy codebase?
Code analyzers: PMD & FindBugs

By starting with "cosmetic cleanup" you get a good overview of how messy the code is and this combined with better readability is a good beginning.
I always (yeah, right... sometimes there's something called a deadline that messes with me) start with this approach, and it has served me very well so far.

You're on the right track. By doing the small fixes you'll be more familiar with the code and the bigger fixes will be easier to do with all the detritus out of the way.
Run a tool like JDepend, CheckStyle or PMD on the source. They can automatically flag loads of changes that are cosmetic but based on general refactoring rules.

I do not change old code except to reformat it using the IDE. There is too much risk of introducing a bug - or removing a bug that other code now depends upon! Or introducing a dependency that didn't exist such as using the heap instead of the stack.
Beyond the IDE reformat, I don't change code that the boss hasn't asked me to change. If something is egregious, I ask the boss if I can make changes and state a case of why this is good for the company.
If the boss asks me to fix a bug in the code, I make as few changes as possible. Say the bug is in a simple for loop. I'd refactor the loop into a new method. Then I'd write a test case for that method to demonstrate I have located the bug. Then I'd fix the new method. Then I'd make sure the test cases pass.
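A minimal sketch of that workflow, assuming a made-up off-by-one bug: the loop is pulled out of a larger method into its own method so it can be checked in isolation. The class InvoiceTotals and the bug itself are hypothetical, and a plain check in main stands in for a real JUnit test.

```java
class InvoiceTotals {
    // Extracted from a larger method. The imagined original loop condition was
    // "i < prices.length - 1", which silently skipped the last element.
    static int sum(int[] prices) {
        int total = 0;
        for (int i = 0; i < prices.length; i++) {
            total += prices[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // The test case that would have failed before the fix and passes after it.
        int result = sum(new int[] {10, 20, 30});
        if (result != 60) {
            throw new AssertionError("expected 60 but got " + result);
        }
        System.out.println("sum = " + result);
    }
}
```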
Yeah, I'm a contractor. Contracting gives you a different point of view. I recommend it.

There is one thing you should be aware of. The code you are starting with has been TESTED and approved, and your changes automatically mean that retesting must happen, as you may have inadvertently broken some behaviour elsewhere.
Besides, everybody makes errors. Every non-trivial change you make (changing StringTokenizer to split is not an automatic feature in e.g. Eclipse, so you write it yourself) is an opportunity for errors to creep in. Do you get the exact behaviour right of a conditional, or did you by mere mistake forget a !?
Hence, your changes imply retesting. That work may be quite substantial and can far outweigh the small changes you have made.
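To make the StringTokenizer example concrete, here is a small sketch of one way the "cosmetic" conversion can change behaviour: StringTokenizer never returns empty tokens, while String#split keeps empty strings between adjacent delimiters (and drops only trailing ones). The class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerVsSplit {
    // Old style: consecutive delimiters are treated as one, so no empty tokens appear.
    static List<String> withTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, ",");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // "Cosmetic" replacement: interior empty strings are kept, trailing ones are removed.
    static List<String> withSplit(String s) {
        return Arrays.asList(s.split(","));
    }

    public static void main(String[] args) {
        String input = "a,,b,";
        System.out.println(withTokenizer(input).size()); // two tokens: a, b
        System.out.println(withSplit(input).size());     // three elements: a, empty, b
    }
}
```

Code that indexes into the result by position will behave differently after the conversion whenever the input contains empty fields, which is exactly the kind of change that only retesting will catch.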

I don't normally bother going through old code looking for problems. However, if I'm reading it, as you appear to be doing, and it makes my brain glitch, I fix it.
Common low-hanging fruits for me tend to be more about renaming classes, methods, fields etc., and writing examples of behaviour (a.k.a. unit tests) when I can't be sure of what a class is doing by inspection - generally making the code more readable as I read it. None of these are what I'd call "invasive" but they're more than just cosmetic.

From experience it depends on two things: time and risk.
If you have plenty of time then you can do a lot more, if not then the scope of whatever changes you make is reduced accordingly. As much as I hate doing it I have had to create some horrible shameful hacks because I simply didn't have enough time to do it right...
If the code you are working on has lots of dependencies or is critical to the application then make as few changes as possible - you never know what your fix might break... :)
It sounds like you have a solid idea of what things should look like so I am not going to say what specific changes to make in what order 'cause that will vary from person to person. Just make small localized changes first, test, expand the scope of your changes, test. Expand. Test. Expand. Test. Until you either run out of time or there is no more room for improvement!
BTW When testing you are likely to see where things break most often - create test cases for them (JUnit or whatever).
EXCEPTION:
Two things that I always find myself doing are reformatting (Ctrl+Shift+F in Eclipse) and commenting code that is not obvious. After that I just hammer the most obvious nail first...

From Static Typing to Dynamic Typing

I have always worked on statically typed languages (C/C++, Java). I have been playing with Clojure and I really like it.
One thing I am worried about is this: say that I have a window that takes 3 modules as arguments, and along the way the requirements change and I need to pass another module to the function. In a statically typed language I just change the function and the compiler complains everywhere I used it. But in Clojure it won't complain until the function is called. I can do a regex search and replace, but there is a chance I miss a call, and it will go unnoticed until that function is actually called. How do you guys deal with this?

This is one of the reasons automated testing/test driven development is even more important in dynamically typed languages. I haven't used Clojure (I mostly use Ruby), so unfortunately I can't recommend a specific testing framework.

The first thing I'd like to mention is that Bruce Eckel has written a very interesting article called Strong Typing vs Strong Testing (the link is down at the moment, unfortunately, but hopefully it will be up soon).
His idea is that when dealing with compiled languages, the compiler is just acting as the first, automatic step of automatic testing. When making the move to a dynamic language, you lose this first level of automatic testing. But in both cases, this first, automatic level is just one part of testing, and not even a very important part.
His point is that if you're developing programs properly, i.e. doing some form of tests and regression tests, the lack of a compiler will only force you to add some more, somewhat basic tests anyways, which is why it's no big loss.
So I guess the first answer I'd give you is, focus on your testing, something you should be doing anyway, and such changes shouldn't affect you too badly.
The second thing I'd like to mention is many dynamic languages that I've seen (for example, Python) have much better abilities to change what methods/classes do without breaking existing code.
For example, with Python, if your method used to accept two parameters but now requires a third one, you can always add a default parameter without breaking any existing code, but that you can now utilize. This is a very basic technique, but in Python's case (and I assume most other dynamic languages as well), these techniques can get much more interesting; since they're dynamic, you can pretty much change the implementation of functions for specific modules, change what variables mean, etc.
I'd suggest looking at which techniques Clojure has that allow similar things, and deciding if they apply in your situation.

You do the same thing you did if the method was part of a public interface that you weren't the only user of.
You add a new method with the extra module and change the old one to call the new one with a suitable default.
Oh and if your program is that big, make sure you have good tests (test-is should make it simpler than Java)
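In Java terms, the pattern described above looks roughly like the sketch below; Clojure's multi-arity functions allow the same arrangement. WindowFactory, the string "modules" and the default value are all hypothetical stand-ins.

```java
class WindowFactory {
    // Hypothetical default for the newly required fourth module.
    static final String DEFAULT_STATUS_BAR = "none";

    // Old signature: kept so existing callers continue to work unchanged.
    static String createWindow(String menu, String toolbar, String canvas) {
        return createWindow(menu, toolbar, canvas, DEFAULT_STATUS_BAR);
    }

    // New signature: carries the extra module.
    static String createWindow(String menu, String toolbar, String canvas, String statusBar) {
        return String.join("+", menu, toolbar, canvas, statusBar);
    }

    public static void main(String[] args) {
        System.out.println(createWindow("menu", "toolbar", "canvas"));
        System.out.println(createWindow("menu", "toolbar", "canvas", "status"));
    }
}
```

Callers can then be migrated to the new signature one at a time, and the old one removed once nothing uses it.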

Test coverage is definitely important. But a dynamically typed language will allow you to work in a different way. In a statically typed language (like Java), a change in the interface means modifying all the callers. In Ruby, you could do this-- but probably won't. Instead, you'll probably add flexibility to the method in one of a few ways. Namely:
you tend to have very few methods that take as many as three parameters in Ruby (as opposed to Java). Because you don't have the statically typed interfaces of Java, you break the problem down into smaller pieces and steps. It's much more common to write methods that take just 1 parameter, and then refactor when it becomes more complex.
it's possible-- and common-- to leave the old behavior in place while adding more arguments. For example, if you have to add a third argument to a two argument method, you will set its default value to preserve the old behavior (and save you a refactor). If you are familiar with Javascript libraries like jQuery, they take advantage of this everywhere with "optional" arguments.
similar to optional arguments, methods can grow to take a flexible parameter list. With solid test coverage, you can quite easily add a new behavior to an existing method and safely know you haven't broken the existing code. In Rails, methods like "render" take a wide range of options.

You're not completely without compiler support in Clojure. In the specific example you give, it's the arity of the function that changed, which would be picked up by compiling the Clojure code. I'm still making the strong -> dynamic typing transition and find this comforting!

You lose some level of refactoring and type safety when you move to dynamic languages. The more information the compiler has, the more it can do at compile time for you.

Tim Bray discusses it here, a critique of which by Cedric is here, and a post on Artima discusses it at length.

If you really need static typing, you can use https://github.com/clojure/core.typed and its Leiningen module to statically check argument passing.

replacement project for existing school assignment

I have a school assignment which consists of programming a scanner/lexical analyzer for a specified simple language. The scanner has to be programmed in C++.
This type of assignment has been used since the 90's and, although still a valid exercise, I consider it to be a little antiquated and a little boring.
I have gotten permission to come up with a new programming assignment.
It has to be of equal difficulty and it can be in C++, Objective C or Java.
What direction should I go in that has the same level of difficulty but is a little bit more modern and applicable to modern CS/life?
Thanks

This type of assignment... is considered to be a little antiquated and a little boring.
I'm curious: who considers this antiquated? Your professor? Somebody notable in the parsing community? Or you?
Scanners and parsers are still relevant to professional software development and, more importantly, relevant to the science of computation. If you wish to understand computers, then you should understand scanners and parsers.
Still, if you are convinced that you should do some other assignment, why not write a tool to generate a scanner in C++? You could supply, as input, a set of regular expressions that define the tokens of the grammar, and it would produce a C++ program that would recognize the input tokens. Then, you will never need to write a scanner ever again!

Why do you think that Lexers / Parsers are not relevant anymore? I find that I write something along those lines at least once a year.

As a software engineer, I would say the code you write during your CS courses may well be the best code you write in your life. Once you come into the industry, you will probably write only modules and not the entire thing.
Believe me. Once you come into the industry and have spent some time here, you will just want to write those compilers, assemblers, and lexical analyzers. So please don't miss the chance.
I believe the challenges in writing this "boring" stuff are well worth it, and you will find them truly interesting once you start designing it.

Writing a scanner/lexical analyzer was one of my favorite assignments. I would argue that it was also one of the most relevant. It is a real world problem.
My guess is that it does not feel modern because of the simple programming language you are scanning. I personally would change out the simple programming language for something like Markdown or Textile. Both of these are used in modern programming, and will teach you similar concepts.
