How can I parse code to build a compiler in Java?


I need to write a compiler. It's homework at the univ. The teacher told us that we can use any API we want to do the parsing of the code, as long as it is a good one. That way we can focus more on the JVM we will generate.
So yes, I'll write a compiler in Java to generate Java.
Do you know any good API for this? Should I use regex? I normally write my own parsers by hand, though it is not advisable in this scenario.
Any help would be appreciated.

Regex is good to use in a compiler, but only for recognizing tokens (i.e. no recursive structures).
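To illustrate the point about regex being suited to tokens only, here is a minimal sketch of a regex-based tokenizer using java.util.regex. The class name RegexLexer, the three token kinds, and the "KIND:text" output format are assumptions made for this example, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLexer {
    // One alternative per token kind; the named group tells us which one matched.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<NUM>\\d+)|(?<ID>[A-Za-z_]\\w*)|(?<OP>[-+*/=()]))");

    /** Splits the input into "KIND:text" strings; throws on an unknown character. */
    static List<String> tokenize(String input) {
        String src = input.trim();          // trailing whitespace is not a token
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        int pos = 0;
        while (pos < src.length()) {
            m.region(pos, src.length());    // continue where the last token ended
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character near index " + pos);
            }
            if (m.group("NUM") != null) {
                tokens.add("NUM:" + m.group("NUM"));
            } else if (m.group("ID") != null) {
                tokens.add("ID:" + m.group("ID"));
            } else {
                tokens.add("OP:" + m.group("OP"));
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y1"));
    }
}
```

Nested structure such as balanced parentheses is deliberately left to the parser; the regex only decides where one token ends and the next begins.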
The classic way of writing a compiler is having a lexical analyzer for recognizing tokens, a syntax analyzer for recognizing structure, a semantic analyzer for recognizing meaning, an intermediate code generator, an optimizer, and finally a target code generator. Any of those steps can be merged, or skipped entirely, if that makes the compiler easier to write.
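As a rough sketch of how stages can be merged, the hand-written recursive-descent parser below scans characters, recognizes structure, and emits code for a made-up stack machine in a single pass. The class name TinyCompiler and the PUSH/ADD/SUB/MUL/DIV mnemonics are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Grammar handled by this sketch:
 *   expr   := term (('+' | '-') term)*
 *   term   := factor (('*' | '/') factor)*
 *   factor := NUMBER | '(' expr ')'
 */
final class TinyCompiler {
    private final String src;
    private int pos = 0;
    private final List<String> code = new ArrayList<>();

    private TinyCompiler(String src) { this.src = src; }

    /** Returns the instruction list for an arithmetic expression. */
    static List<String> compile(String source) {
        TinyCompiler c = new TinyCompiler(source);
        c.expr();
        c.skipSpaces();
        if (c.pos != c.src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + c.pos);
        }
        return c.code;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    /** Lexing step: skip whitespace and look at the next character without consuming it. */
    private char peek() {
        skipSpaces();
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private void expr() {
        term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            term();
            code.add(op == '+' ? "ADD" : "SUB");   // code generation merged into parsing
        }
    }

    private void term() {
        factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            factor();
            code.add(op == '*' ? "MUL" : "DIV");
        }
    }

    private void factor() {
        char c = peek();
        if (c == '(') {
            pos++;
            expr();
            if (peek() != ')') throw new IllegalArgumentException("Expected ')' at " + pos);
            pos++;
        } else if (Character.isDigit(c)) {
            int start = pos;
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
            code.add("PUSH " + src.substring(start, pos));
        } else {
            throw new IllegalArgumentException("Expected number or '(' at " + pos);
        }
    }
}
```

For example, TinyCompiler.compile("1 + 2 * (3 - 4)") yields PUSH 1, PUSH 2, PUSH 3, PUSH 4, SUB, MUL, ADD. A real compiler would normally keep the stages separate; this only shows that merging them is possible for a small language.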
There have been many tools developed to help with this process. For Java, you can look at
ANTLR - http://www.antlr.org/
Coco/R - http://ssw.jku.at/Coco/
JavaCC - https://javacc.dev.java.net/
SableCC - http://sablecc.org/

I would recommend ANTLR, primarily because of its output generation capabilities via StringTemplate.
Better still, Terence Parr's book on ANTLR is one of the better books oriented towards writing compilers with a parser generator.
Then you have ANTLRWorks, which lets you study and debug your grammar on the fly.
To top it all, the ANTLR wiki and documentation (although not comprehensive enough for my liking) are a good place to start for any beginner. They helped me refresh my knowledge of compiler writing in a week.

Have a look at JavaCC, a parser generator for Java. It's very easy to use and get the hang of.

Go classic - Lex + Yacc. In Java it spells JAX and javacc. Javacc even has some Java grammars ready for inspection.

I'd recommend using either a metacompiler like ANTLR, or a simple parser combinator library. Functional Java has a parser combinator API. There's also JParsec. Both of these are based on the Parsec library for Haskell.
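To give a feel for what a parser combinator library does, here is a small hand-rolled sketch. It is not the JParsec or Functional Java API; the names Combinators, Parser, Result, satisfy, then, or and many are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Predicate;

final class Combinators {
    /** A successful parse: the value plus the index where parsing stopped. */
    static final class Result<T> {
        final T value;
        final int next;
        Result(T value, int next) { this.value = value; this.next = next; }
    }

    /** A parser maps (input, position) to a result, or to empty on failure. */
    interface Parser<T> {
        Optional<Result<T>> parse(String in, int pos);

        /** Sequence: run this parser, then next, and combine both values. */
        default <U, R> Parser<R> then(Parser<U> next, BiFunction<T, U, R> combine) {
            return (in, pos) -> parse(in, pos).flatMap(a ->
                next.parse(in, a.next).map(b ->
                    new Result<R>(combine.apply(a.value, b.value), b.next)));
        }

        /** Choice: try this parser, fall back to other if it fails. */
        default Parser<T> or(Parser<T> other) {
            return (in, pos) -> {
                Optional<Result<T>> r = parse(in, pos);
                return r.isPresent() ? r : other.parse(in, pos);
            };
        }

        /** Repetition: apply this parser zero or more times. */
        default Parser<List<T>> many() {
            return (in, pos) -> {
                List<T> values = new ArrayList<>();
                int p = pos;
                Optional<Result<T>> r = parse(in, p);
                while (r.isPresent() && r.get().next > p) {
                    values.add(r.get().value);
                    p = r.get().next;
                    r = parse(in, p);
                }
                return Optional.of(new Result<List<T>>(values, p));
            };
        }
    }

    /** Primitive: accept one character that satisfies the predicate. */
    static Parser<Character> satisfy(Predicate<Character> ok) {
        return (in, pos) -> {
            if (pos < in.length() && ok.test(in.charAt(pos))) {
                return Optional.of(new Result<Character>(in.charAt(pos), pos + 1));
            }
            return Optional.empty();
        };
    }

    /** number := digit digit* */
    static Parser<Integer> number() {
        Parser<Character> digit = satisfy(c -> Character.isDigit(c));
        return digit.then(digit.many(), (first, rest) -> {
            StringBuilder sb = new StringBuilder().append(first);
            for (char c : rest) sb.append(c);
            return Integer.parseInt(sb.toString());
        });
    }

    /** sum := number ('+' number)* , evaluated while parsing */
    static Parser<Integer> sum() {
        Parser<Integer> plusNumber = satisfy(c -> c == '+').then(number(), (plus, n) -> n);
        return number().then(plusNumber.many(), (first, rest) -> {
            int total = first;
            for (int n : rest) total += n;
            return total;
        });
    }
}
```

Calling Combinators.sum().parse("12+30+5", 0) gives a result whose value is 47. The grammar is written directly as Java expressions, which is the main appeal of the approach; a real library adds error reporting, whitespace handling and many more combinators.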

JFlex is a scanner generator which, according to the manual, is designed to work with the parser generator CUP.
One of the main design goals of JFlex was to make interfacing with the free Java parser generator CUP as easy as possibly [sic].
It also has support for BYACC/J, which, as its name suggests, is a port of Berkeley YACC to generate Java code.
I have used JFlex itself and liked it. However, the project I was doing was simple enough that I wrote the parser by hand, so I don't know how good either CUP or BYACC/J is.

I've used SableCC in my compiler course, though not by choice.
I remember finding it very bulky and heavyweight, with more emphasis on cleanliness than convenience (no operator precedence or anything; you have to state that in the grammar).
I'd probably want to use something else if I had the choice. My experiences with yacc (for C) and happy (for Haskell) have both been pleasant.

Parser combinators are a good choice. A popular Java implementation is JParsec.

If you're going to go hardcore, throw in a bit of http://llvm.org in the mix :)

I suggest you look at the source for BeanShell. It has a compiler for Java and is fairly simple to read.

http://java-source.net/open-source/parser-generators and http://catalog.compilertools.net/java.html contain catalogs of tools for this. Compare also the Stack Overflow question "Alternatives to Regular Expressions".

Use a parser combinator, like JParsec. There's a good video tutorial on how to use it.

Related

Generate two parsers for a single DSL

I need to implement two tools for a single DSL: a UI editor in Java and an interpreter in C/C++. My first idea was to use ANTLR, since it can generate parsers for both Java and C/C++. But all the ANTLR examples that I've seen contain some language-specific code or settings.
Is there any way to generate two parsers for a single DSL?
Does it even make sense to generate two parsers from a single grammar?
Are there any commonly used approaches for this problem?

bison can produce C++ and Java parsers, at least according to the documentation (I've never used the Java interface, and only used the C++ interface once, but I'm told that they work). The grammar will not be a problem, but the actions will be, particularly since you're presumably doing different things in the two parsers, and not just using different languages. But you should be able to make every action a simple $$ = method($1, $2, ...); statement.
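On the Java side, the idea of reducing every action to a single call like $$ = method($1, $2, ...) could look roughly like the sketch below. The interface ExprActions and both implementations are hypothetical; reduceSample only stands in for the calls a generated parser would make while reducing "1 + 2 * 3".

```java
final class ActionsDemo {
    /** One method per grammar rule; the grammar itself only ever calls these. */
    interface ExprActions<T> {
        T number(String text);
        T add(T left, T right);
        T mul(T left, T right);
    }

    /** One tool might evaluate directly. */
    static final class Evaluator implements ExprActions<Integer> {
        public Integer number(String text) { return Integer.parseInt(text); }
        public Integer add(Integer left, Integer right) { return left + right; }
        public Integer mul(Integer left, Integer right) { return left * right; }
    }

    /** Another tool might rebuild a textual form, e.g. for an editor. */
    static final class Printer implements ExprActions<String> {
        public String number(String text) { return text; }
        public String add(String left, String right) { return "(" + left + " + " + right + ")"; }
        public String mul(String left, String right) { return "(" + left + " * " + right + ")"; }
    }

    /** Stand-in for the sequence of reductions a parser performs on "1 + 2 * 3". */
    static <T> T reduceSample(ExprActions<T> actions) {
        return actions.add(actions.number("1"),
                           actions.mul(actions.number("2"), actions.number("3")));
    }

    public static void main(String[] args) {
        System.out.println(reduceSample(new Evaluator())); // prints 7
        System.out.println(reduceSample(new Printer()));   // prints (1 + (2 * 3))
    }
}
```

Keeping the actions this thin means the grammar file stays almost identical between targets, and the differences between the two tools live in ordinary code.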
bison doesn't use the C(++) preprocessor (and it couldn't really, because it's common to put preprocessor directives into the bison input files), but you could use some other macro system -- I hesitate to recommend m4 but it would work if you know how to use it -- or a shell script to assemble the different input files.
The other possibility would be to just create an AST in the parser. You could use any parser generator, including Antlr or bison, to build the AST parser in C or C++, and then wrap the result for use with JNI for Java. If you use Antlr, you can produce an AST generator with very little language-specific code, so with a simple macro processor, you could build native AST parsers in both C++ and Java, I think. But that depends on your language being fairly simple.
I don't know if there is a "commonly used approach" for this problem, but it's certainly a problem that comes up pretty regularly; lots of grammars get shared between different projects, but from what I've seen, the most common approach is to cut-and-paste the grammar, and rewrite the actions. I've done the macro approach a couple of times, and it can be made to work but it's never as elegant as you'd like it to be.

You can try yacc and jacc.
http://web.cecs.pdx.edu/~mpj/jacc/
http://dinosaur.compilertools.net/#yacc
They have very similar syntax, so maybe with the help of some handmade preprocessing tool you can use one source file.
P.S. But why not write the parser once in C++ and use it via JNI?

You certainly can use ANTLR. The language specific parts are actions or predicates. If you don't need them then you won't have any language specific stuff in the grammar. Btw. regardless of the parser generator you use (including yacc, bison etc.) you would always have language specific stuff in the grammar if you need that.

Interpreting Java and converting it to another language

I work with a language similar to JavaScript that is used for point-of-sale device programming. This language really s*cks and I'm trying to build some kind of framework in Java that "converts" Java code into this language.
I did this using some regex and parsed the Java files directly. Now I have found that this may not be the right or best way, and I'm searching for alternatives. Are there any tools that could help me do this?
I thought I should use some advanced reflection utilities like ASM (http://asm.ow2.org/index.html). Performance is not crucial, so that may be the way.
What do you think?

ANTLR is a terrific parser-generator. I'd look into it. It has a Java grammar already available; I'm not sure if it's Java 5, 6, or 7 (I'm guessing it's 5).
Once you have the AST, your problem will be walking the tree and generating the target code. Good luck.
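A minimal sketch of that tree-walking step, assuming a tiny hand-made AST and a JavaScript-like target syntax. The node classes Num, Var, Binary and Assign are invented for this example and are not what ANTLR generates.

```java
import java.util.List;

final class AstToTarget {
    /** Every node knows how to print itself in the target language. */
    interface Node { String emit(); }

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public String emit() { return Integer.toString(value); }
    }

    static final class Var implements Node {
        final String name;
        Var(String name) { this.name = name; }
        public String emit() { return name; }
    }

    static final class Binary implements Node {
        final String op;
        final Node left;
        final Node right;
        Binary(String op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
        // Children are emitted recursively; parentheses keep the tree's grouping explicit.
        public String emit() { return "(" + left.emit() + " " + op + " " + right.emit() + ")"; }
    }

    static final class Assign implements Node {
        final String name;
        final Node value;
        Assign(String name, Node value) { this.name = name; this.value = value; }
        public String emit() { return "var " + name + " = " + value.emit() + ";"; }
    }

    /** Walks the statements in order and concatenates the generated lines. */
    static String emitProgram(List<Node> statements) {
        StringBuilder out = new StringBuilder();
        for (Node statement : statements) out.append(statement.emit()).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        Node stmt = new Assign("total", new Binary("+", new Var("price"), new Num(2)));
        System.out.print(emitProgram(List.of(stmt))); // prints: var total = (price + 2);
    }
}
```

With a generated parser you would more likely write a visitor over its tree classes instead of putting emit() on the nodes, but the recursion has the same shape.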

I suggest parsing the Java syntax with JavaCC or a similar tool; a Java grammar description for it was written long ago. It can be used to write a compiler, so it can probably also be used to write a converter. Regular expressions are not very good at parsing programming languages.

I've never done anything with it myself, but you could take a look at the frameworks listed at altjs.org, specifically under the Java Ports section, and modify one of them to fit your specific needs.

There are at least three ways:
a) Interpret the bytecode. There are some existing interpreters in JS, e.g. DoppioVM. They can be very slow.
b) Compile bytecode to JS. I've seen at least one such attempt and the resulting JS was ugly and not very fast. But this approach can perform well (although it may result in using HashMap instead of a JS object and so on). The biggest issue is IMHO while/if reconstruction.
EDIT: OK, it is possibly not so slow, but it is ugly and contains garbage like j2js.invokeStatic("j2js.client.Engine", "getEngine()j2js.client.Engine", null);. The one compiler was https://github.com/decatur/j2js-compiler .
c) Compile Java to JS. You can try Google Web Toolkit or http://j2s.sourceforge.net/ .

Scala parser combinators vs ANTLR/Java generated parser?

I am writing an expression parser for an app written mostly in Scala. I have built AST objects in Scala, and now need to write the parser. I have heard of Scala's built-in parser combinators, and also of ANTLR3, and am wondering: which would provide better performance and ease of writing code? So far:
ANTLR pros:
- Well-known
- Fast
- External DSL
- ANTLRWorks (great IDE for parser grammar debugging/testing)
ANTLR cons:
- Java-based (Scala interop may be challenging, any experience?)
- Requires a large dependency at runtime
Parser combinator pros:
- Part of Scala
- One less build step
- No need for a runtime dependency; i.e. already included in Scala's runtime library
Parser combinator cons:
- Internal DSL (may mean slower execution?)
- No ANTLRWorks (provides nice parser testing and visualization features)
Any thoughts?
EDIT: This expression parser parses algebraic/calculus expressions. It will be used in the app Magnificalc for Android when it is finalized.

Scala's parser combinators aren't very efficient. They weren't designed to be. They're good for doing small tasks with relatively small inputs.
So it really depends on your requirements. There shouldn't be any interop problems with ANTLR. Calling Scala from Java can get hairy, but calling Java from Scala almost always just works.

I wouldn't worry about the performance limitations of parser combinators unless you were planning on parsing algebraic expressions that are a few pages long. The Programming Scala book does mention that a more efficient implementation of parser combinators is feasible. Maybe somebody will find the time and energy to write one.
I think with ANTLR you are talking about two extra build steps: ANTLR compiles to Java, and you need to compile both Scala and Java to bytecode, instead of just Scala.

I have created external DSLs both with ANTLRv4 and Scala's parser combinators and I clearly prefer the parser combinators, because you get excellent editor support when designing the language and it's very easy to transform your parsing results to any AST case class data structure. Developing ANTLR grammars takes much more time, because, even with the ANTLRWorks editor support, developing grammars is very error-prone. The whole ANTLR workflow feels quite bloated to me compared to the parser combinators' one.

I would be inclined to try to produce an external DSL using parser combinators. It shouldn't need to be an internal DSL. But I don't know that it would be better.
The best approach to figuring this out would be to take a simplified version of the grammar, try it both ways and evaluate the differences.

Just been writing a parser for a home brew 8 bit CPU assembler.
I got so far with Antlr4 before feeling that there had to be a better way.
I decided to have a go at Scala parser combinators and have to say that it is way more productive IMHO. However, I do know scala.

If you are still interested in an integer expression parser, please take a look at my example interpreter here: https://github.com/scala-szeged/hrank-while-language . It is about 200 lines of Scala code using the official parser combinators. It has expression parsing. It also handles nested if, nested while, variables, and boolean expressions. I also implemented array handling in this GitHub repository. If you need String handling I can help you, too.
Another, somewhat simpler expression parser is also present in my other public repository: https://github.com/scala-szeged/top-calc-dsl

Parsing Java Source Code

I have been asked to develop software that creates a flow chart / control flow of the input Java source code. So I started researching it and arrived at the following solutions:
To create the flow chart / control flow I have to recognize the control statements and function calls made in the given source code. Now I have two ways of recognizing:
Parse the source code by writing my own grammars (a complex solution, I think). I am thinking of using Antlr for this.
Read the input source code files as text and search for specific patterns (may become inefficient).
Am I right here? Or am I missing something very fundamental and simple? Which approach would take less time and do the work efficiently? Any other suggestions in this regard are welcome too. Any other efficient approach would help, because the input source code may span multiple files and can be fairly complex.
I am good with .NET languages but this is my first big project in Java. I have basic knowledge of compiler design, so writing grammars should not be impossible for me.
Sorry If I am being unclear. Please ask for any clarifications.

I'd go with Antlr and use an existing Java grammar: https://github.com/antlr/grammars-v4

All tools handling Java code usually decide first whether they want to process the Java language or Java bytecode files. That is a strategic decision and depends on your use case. I could imagine both for flow chart generation. Once you have decided that question, there are already several frameworks or libraries that could help you. For bytecode engineering there are ASM, Javassist, Soot, and BCEL, which seems to be dead. For Java language parsing and analysis there are Polyglot, the Eclipse compiler, and javac. All of these include a complete compiler front end for Java and are open source.
I would try to avoid writing my own parser for Java. I did that once. Java has a rather complex grammar, though one that can be found elsewhere. The real work begins with name and type resolution, and you would need both if you want to generate graphs that cover more than one method body.
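As one hedged illustration of reusing javac as the front end, the sketch below uses the JDK's Compiler Tree API (the com.sun.source packages in the jdk.compiler module) to parse a source string and list the if statements, while loops and method calls it contains. The class names JavacAstScan and StringSource and the output format are assumptions for this example; it needs a JDK, since a plain runtime has no system compiler.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.IfTree;
import com.sun.source.tree.MethodInvocationTree;
import com.sun.source.tree.WhileLoopTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class JavacAstScan {
    /** In-memory source file, so nothing has to be written to disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;
        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) { return code; }
    }

    /** Parses the source (no type resolution) and records control points in tree order. */
    static List<String> controlPoints(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) throw new IllegalStateException("No system Java compiler; run on a JDK");
        JavacTask task = (JavacTask) compiler.getTask(
            null, null, null, null, null, List.of(new StringSource(className, code)));
        List<String> found = new ArrayList<>();
        TreeScanner<Void, Void> scanner = new TreeScanner<Void, Void>() {
            @Override public Void visitIf(IfTree node, Void p) {
                found.add("if " + node.getCondition());
                return super.visitIf(node, p);          // keep descending into the branches
            }
            @Override public Void visitWhileLoop(WhileLoopTree node, Void p) {
                found.add("while " + node.getCondition());
                return super.visitWhileLoop(node, p);
            }
            @Override public Void visitMethodInvocation(MethodInvocationTree node, Void p) {
                found.add("call " + node.getMethodSelect());
                return super.visitMethodInvocation(node, p);
            }
        };
        try {
            for (CompilationUnitTree unit : task.parse()) scanner.scan(unit, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

This only parses; the name and type resolution mentioned above would need the later compiler phases, so treat it as a starting point for spotting control statements rather than a full flow analysis.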

Eclipse has a library for parsing source code and creating an abstract syntax tree (AST) from it, which would let you extract what you want.
See here for a tutorial:
http://www.vogella.de/articles/EclipseJDT/article.html
See here for the API:
http://help.eclipse.org/indigo/topic/org.eclipse.jdt.doc.isv/reference/api/org/eclipse/jdt/core/dom/package-summary.html#package_description

Now I have two ways of recognizing:
You have many more ways than that. JavaCC ships with a Java 1.5 grammar already built. I'm sure other parser generators do the same. There is no reason for you to either have to write your own grammar or construct your own parser.
And specifically 'read[ing] input source code files as text and search for the specific patterns' isn't a viable choice at all, as it isn't parsing, and therefore cannot possibly recognize Java programs correctly.
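A small demonstration of why text search misfires, assuming a naive pattern that looks for the word "if". The class name RegexPitfall and the sample snippet are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexPitfall {
    /** Counts occurrences of the whole word "if", wherever it appears in the text. */
    static int countIfKeyword(String source) {
        Matcher m = Pattern.compile("\\bif\\b").matcher(source);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        String source =
            "// if the flag is set\n" +          // comment, not code
            "String s = \"if (x)\";\n" +          // string literal, not code
            "if (flag) { run(); }\n";            // the only real if statement
        System.out.println(countIfKeyword(source)); // prints 3
    }
}
```

The search reports three hits although the snippet contains one if statement; a lexer would already have classified the first two as a comment and a string literal.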

Your input files are written in Java, and the software should be written in Java, but this is your first project in Java? First of all, I'd suggest learning the language with smaller projects. Also you need to learn how to use graphics in Java (there are various libraries). Then, you should focus on what you want to show on your graphs. Or is text sufficient?

The way I would do it is to analyse the compiled code. This would allow you to read jars without source and avoid parsing the code yourself. I would use ObjectWeb's ASM to read the class files.

A smarter solution is to use Eclipse's Java parser. Read more here: http://www.ibm.com/developerworks/opensource/library/os-ast/

Or even easier: use reflection. You should be able to compile the sources, load the classes with the Java classloader and analyse them from there. I think this is far easier than any parsing.
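A minimal sketch of the reflection route, assuming the class is already compiled and loaded. ReflectionPeek and the nested Sample class are invented for this example. Note that java.lang.reflect describes declarations (names, parameters, return types) and does not expose the statements inside a method body, so calls and branches would still have to come from the bytecode or the source.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

final class ReflectionPeek {
    /** Hypothetical class to inspect; in practice this would be loaded from the compiled sources. */
    static final class Sample {
        void start() { }
        int stop() { return 0; }
    }

    /** Returns the names of the methods declared directly in the given class, sorted. */
    static Set<String> declaredMethodNames(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Method method : type.getDeclaredMethods()) {
            if (!method.isSynthetic()) {     // skip compiler-generated helpers
                names.add(method.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(declaredMethodNames(Sample.class)); // prints [start, stop]
    }
}
```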

Our DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery, with built in capability for parsing, building ASTs, constructing symbol tables, extracting control and data flow, transforming the ASTs, prettyprinting ASTs back to text, etc.
DMS is parameterized by an explicit language definition, and has a large set of preexisting definitions.
DMS's Java Front End already computes control and data flow graphs, so your problem would be reduced to exporting them.
EDIT 7/19/2014: Now handles Java 8.

Interrogating Java source code

I have Java source code that I need to interrogate and apply security policies to [e.g. applying CWE].
I have a couple of ideas: for starters, using an AST and then traversing the tree. Others include using regular expressions.
Are there any options other than an AST or regex that I could use for such a process?

An AST is a good choice, much better than regular expressions.
There are numerous Java parsers available. ANTLR's java grammar is one example.
You can also adapt the source code of the javac compiler from OpenJDK.
Some static analysis tools like PMD support user-defined rules that would allow you to perform many checks without a lot of work.

There are a number of pre-existing tools that do some or all of what you are asking for. Some on the source code level, and some by parsing the byte code.
Have a look at
- CheckStyle
- FindBugs
- PMD
All of these are extensible in one way or another, so you can probably get them to check what you want to check in addition to the many standard checks they have.

Many static source code analysis (SCA) tools use a collection of regular expressions to detect code that may be vulnerable. There are many SCA tools for Java and I don't know the best open source one offhand. I can tell you that Coverity makes the best Java SCA tool that I have used; it's much more advanced than just regular expressions, as it can also detect race conditions.
What I can tell you is that this approach is going to produce a lot of false positives and false negatives. The CWE system indexes HUNDREDS of different vulnerabilities and covering all of them is completely and totally impossible.

You either want to get an existing static analysis tool that focuses on the vulnerabilities of interest to you, or you want to get a tool with strong foundations for building custom analyses.
Just parsing to ASTs doesn't get you a lot of support for doing analysis. You need to know what symbols mean where they are encountered (e.g., scopes, symbol tables, type resolution), and you often need to know how information flows (inheritance graphs, call graphs, control flows, data flows) across the software elements that make up the system. Tools like ANTLR don't provide this; they are parser generators.
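To make the "what does this symbol mean here" point concrete, here is a minimal sketch of a scoped symbol table, the kind of structure an analysis has to build on top of the AST. The class SymbolTable and its method names are assumptions for this example; types are kept as plain strings for brevity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class SymbolTable {
    // Innermost scope sits at the head of the deque.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    SymbolTable() { enterScope(); }                      // start with a global scope

    void enterScope() { scopes.push(new HashMap<>()); }  // e.g. on entering a block or method

    void exitScope() { scopes.pop(); }                   // e.g. on leaving it

    /** Records a declaration in the current (innermost) scope. */
    void declare(String name, String type) { scopes.peek().put(name, type); }

    /** Looks a name up from the innermost scope outwards, so inner declarations shadow outer ones. */
    Optional<String> resolve(String name) {
        for (Map<String, String> scope : scopes) {       // ArrayDeque iterates head first
            String type = scope.get(name);
            if (type != null) return Optional.of(type);
        }
        return Optional.empty();
    }
}
```

The same identifier can resolve to different declarations depending on where it appears, which is exactly the information a bare AST or a regex match does not carry.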
A tool foundation having this information available for Java is our DMS Software Reengineering Toolkit and its Java Front End.
