Interrogating Java source code

I have some Java source code that I need to interrogate and check against security policies (e.g., CWE).
I have a couple of ideas: for starters, building an AST and then traversing the tree. Another is using regular expressions.
Are there any options other than an AST or regex that I could use for this process?

An AST is a good choice, much better than regular expressions.
There are numerous Java parsers available. ANTLR's Java grammar is one example.
You can also adapt the source code of the javac compiler from OpenJDK.
Some static analysis tools like PMD support user-defined rules that would allow you to perform many checks without a lot of work.
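
To make the AST-versus-regex point concrete, here is a minimal sketch that uses the Compiler Tree API shipped with the JDK (the com.sun.source packages, reached through javax.tools) rather than a modified javac. The class name ExecCallFinder and the choice of "calls to a method named exec" as the thing to flag are illustrative assumptions, not a real CWE rule. It only matches on the method name; knowing that the receiver really is java.lang.Runtime would need type resolution on top of this.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MemberSelectTree;
import com.sun.source.tree.MethodInvocationTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class: walks the javac AST and reports calls named "exec".
public class ExecCallFinder {

    /** Wraps a source string so the compiler can read it like a file. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String name, String code) {
            super(URI.create("string:///" + name), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns the source text of every method call whose selected name is "exec". */
    static List<String> findExecCalls(String fileName, String code) throws Exception {
        // Requires a JDK; on a bare JRE getSystemJavaCompiler() returns null.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                Collections.singletonList(new StringSource(fileName, code)));

        List<String> hits = new ArrayList<>();
        for (CompilationUnitTree unit : task.parse()) {   // parse only, no type attribution
            new TreeScanner<Void, Void>() {
                @Override
                public Void visitMethodInvocation(MethodInvocationTree call, Void unused) {
                    if (call.getMethodSelect() instanceof MemberSelectTree) {
                        MemberSelectTree select = (MemberSelectTree) call.getMethodSelect();
                        if (select.getIdentifier().contentEquals("exec")) {
                            hits.add(call.toString());
                        }
                    }
                    return super.visitMethodInvocation(call, unused); // keep descending
                }
            }.scan(unit, null);
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // The string literal contains "exec(" too; a naive regex would flag it, the AST walk does not.
        String code = "class Demo { void run(String cmd) throws Exception {"
                + " String s = \"exec(\"; Runtime.getRuntime().exec(cmd); } }";
        System.out.println(findExecCalls("Demo.java", code));
    }
}
```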

There are a number of pre-existing tools that do some or all of what you are asking for. Some work at the source code level, and some by parsing the bytecode.
Have a look at
- CheckStyle
- FindBugs
- PMD
All of these are extensible in one way or another, so you can probably get them to check what you want to check in addition to the many standard checks they have.

Many static source code analysis (SCA) tools use a collection of regular expressions to detect code that may be vulnerable. There are many SCA tools for Java, and I don't know the best open source one offhand. I can tell you that Coverity makes the best Java SCA tool that I have used; it's much more advanced than just regular expressions, as it can also detect race conditions.
What I can tell you is that the regular expression approach is going to produce a lot of false positives and false negatives. The CWE system indexes HUNDREDS of different vulnerabilities, and covering all of them is completely and totally impossible.

You either want to get an existing static analysis tool that focuses on the vulnerabilities of interest to you, or you want to get a tool with strong foundations for building custom analyses.
Just parsing to ASTs doesn't get you a lot of support for doing analysis. You need to know what symbols mean where they are encountered (e.g., scopes, symbol tables, type resolution), and you often need to know how information flows (inheritance graphs, call graphs, control flows, data flows) across the software elements that make up the system. Tools like ANTLR don't provide this; they are parser generators.
A tool foundation having this information available for Java is our DMS Software Reengineering Toolkit and its Java Front End.

Related

Complex decision system modified by user - Java

I have a problem and would like to ask for alternatives to my existing technologies. The feature to be programmed is complex and will be handed to users, so it should be as simple as possible on the front end. I need a Java-based technology.
What I need to do:
I have a basic structure with a lot of data. The data is mostly well typed (Integers, Dates, Booleans, etc.), so the values can be compared easily.
I need to model decisions with batches of requirements that can be defined and altered by many sources, such as internal business processes and government laws.
So I am thinking of giving the users a scripting ability (most of them have university degrees, so some complexity is OK).
Let's see a simplified example.
Let A be a structure with the following.
A.budget - Integer
A.bankRelatedDebt - Integer
A.privateRelatedDebt - Integer
A.deadLine - Date
A.hasPermissionFromGovernment - Boolean
A.hasProblematicContracts - Boolean
I need to define rules, each of which decides whether it stands or falls, so I need a boolean back.
Rule1: The budget is over 1 million EUR
Rule2: Has no problematic document or has a permission from government
Rule3: The deadline won't be in a month range.
Rule4: The overall debt (local + private) doesn't exceed 100.000 EUR
In other cases these rules could be hardcoded, but here they have to be super-dynamic and based on the given data.
We have the options of Drools and ANTLR; I would appreciate alternatives if you can mention any. If you can mention technologies to avoid, that is helpful and welcome as well.

For what it's worth, I would love to build such an expert system too, so bear with my ramblings. First some negative points, as you asked what to avoid.
There are many pitfalls.
The "programming" is done by the users. There is probably no version control system for restoring earlier versions, and there may be no staging system, so people work directly in the production system. Think of what extending a common rule library means for testing. No unit tests?
Then there is user acceptance. In particular there is a competitor, Excel programming, which you have to supersede: generating reports with human-selectable text blocks and diagrams.
Your numbered rules still lack some life. The system could assist by providing categories to select from: Rule1 - restriction on monetary resource. It would be nice to propose "would you also like to restrict on limited resources? (a) Rule1, (b) ...".
Also, what is the product? What are the advantages? What are the goals?
Reports, calculation scenarios (what-ifs, tolerances calculated through).
I would certainly first write a technical document along the above lines, and then search for tools, as you seem to be doing. Drools is too basic. ANTLR for a DSL I find risky.
Tools
Data mining seems to be the keyword you are searching for.
The JVM programming language Scala (not easy to learn) is productive for DSLs and parsing.
Many functional languages are a bit easier and offer scripting too (Java scripting API).
What about a web project, maybe using Jetty as an embedded web server? Then you can use HTML and JavaScript. HTML5?
A rich client platform (Eclipse or NetBeans) requires experience for rapid development. For nice graphics, maybe JavaFX (too early).

Develop a DSL for your needs using either Groovy or Scala.
We use CodeMirror to provide syntax highlighting in a web page.
Works great for us with Groovy.

I would vote against Drools because I have had terrible experiences with it, but some people like it.
I would propose a language already integrated in Java: JavaScript. Why?
It is simple enough and has nice access to Java beans: instead of
budget.getDeadLine() you can use budget.deadLine
You have tons of places to look for information.
You can add simple functions to make it easier to use.
But if you choose JavaScript, Python, Drools, or ANTLR, remember:
Users do not have version control systems like SVN/Git, so it is up to you to make that happen.
Give them a tool (a web page or whatever) that automatically saves every version of every script they write.
Give them a way to test what they wrote without damaging anything.
Give them tools to roll back whatever changes they made.
Run as many static tests as possible when they commit the code, before executing it.
Syntax highlighting will make them happier.
And remember: they will use the tool in ways you don't expect, and you will end up writing (or rewriting) most of the scripts. No university degree guarantees that they understand what they are doing. (Not even one in CS!)
So if you can make the system less dynamic, it will be to your benefit.

It's like the strategy pattern: the different rules are different algorithms applied to the context (A), and the algorithm can be selected at runtime.
Add a filter chain design pattern to that, so that you can apply several algorithms (rules) at the same time.
Roolie is a very simple Java rule engine that might be helpful for you. As Roolie says:
Roolie is an extremely simple Java Rule Engine (Non-JSR 94) that uses rules you create in Java. Simply create your basic rules, implement the single "passes" method for each, then chain them together in an XML file to create more complex rules.
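
The strategy-plus-chain idea above can be sketched in plain Java with java.util.function.Predicate. This is not Roolie's API; the class RuleChainSketch, the helper allOf, and the exact reading of each rule (for example, Rule3 as "deadline is more than one month after today", and Rule4 as bank debt plus private debt) are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: each rule is a strategy (a Predicate), chained at runtime by name.
public class RuleChainSketch {

    /** The structure A from the question, as a plain immutable class. */
    static final class A {
        final int budget;
        final int bankRelatedDebt;
        final int privateRelatedDebt;
        final LocalDate deadLine;
        final boolean hasPermissionFromGovernment;
        final boolean hasProblematicContracts;

        A(int budget, int bankRelatedDebt, int privateRelatedDebt, LocalDate deadLine,
          boolean hasPermissionFromGovernment, boolean hasProblematicContracts) {
            this.budget = budget;
            this.bankRelatedDebt = bankRelatedDebt;
            this.privateRelatedDebt = privateRelatedDebt;
            this.deadLine = deadLine;
            this.hasPermissionFromGovernment = hasPermissionFromGovernment;
            this.hasProblematicContracts = hasProblematicContracts;
        }
    }

    /** Builds the four example rules; 'today' is passed in so the rules stay testable. */
    static Map<String, Predicate<A>> rules(LocalDate today) {
        Map<String, Predicate<A>> rules = new LinkedHashMap<>();
        rules.put("Rule1", a -> a.budget > 1_000_000);
        rules.put("Rule2", a -> !a.hasProblematicContracts || a.hasPermissionFromGovernment);
        rules.put("Rule3", a -> a.deadLine.isAfter(today.plusMonths(1)));
        // Widen to long so the sum of two ints cannot overflow.
        rules.put("Rule4", a -> (long) a.bankRelatedDebt + a.privateRelatedDebt <= 100_000);
        return rules;
    }

    /** Chains the named rules with AND; an unknown name fails fast with a NullPointerException. */
    static Predicate<A> allOf(Map<String, Predicate<A>> all, String... names) {
        Predicate<A> chain = a -> true;
        for (String name : names) {
            chain = chain.and(all.get(name));
        }
        return chain;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2030, 1, 1);
        A a = new A(2_000_000, 40_000, 50_000, LocalDate.of(2030, 6, 1), false, false);
        Predicate<A> decision = allOf(rules(today), "Rule1", "Rule2", "Rule3", "Rule4");
        System.out.println(decision.test(a));
    }
}
```

Which rules take part in a decision is just a list of names here, so that list could come from a configuration file or a database row instead of code.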

If you had the records in a database, you could select the matching ones with SQL syntax.
For example:
SELECT * FROM data
WHERE budget > 100000
AND privatCredits < 50000

Java - abstract syntax tree

I am currently looking for a Java 6/7 parser that generates some (possibly standardized) form of abstract syntax tree.
I have already found that ANTLR has a Java 6 grammar, but it seems that it only generates a parse tree, not a syntax tree. I have also read about the Java Compiler API, but all the sources I found said that it is overdesigned and poorly documented (and I haven't found out whether it really generates an AST).
Do you know of any good parser library whose output is as standardized as possible?
Thanks

Basically JavaCC and ANTLR are the best tools out there at the moment.
You can find a usable Java 6 grammar in the project's grammar repository. JavaCC is a bit old-school and rarely updated, but it is easy to start with, Java-oriented, and generates the AST (search for JJTree). It's a bit, well... strange at first sight, but you can get used to it.
Both tools have nice IDE support (e.g., Eclipse plug-ins), but I think (based on your description) that what you need is JavaCC. Give it a try.

Our DMS Software Reengineering Toolkit with its Java front end can provide an AST (example at SO).
The distinction you draw between "needed for semantics" (AST) and "is an accident of the grammar" ("Concrete" or "Parse" tree) is interesting. It takes additional effort, somewhere, to drop the CST information to obtain an AST.
You can do that by hand-coding the AST construction as semantic actions on rules. That takes effort, and likely gives you a pretty good answer. But this process can be automated pretty much completely by observing that literal tokens don't need to be kept in the tree, that unary production chains are unnecessary (except where a unary production introduces semantics), and that lists can be formed automatically. (You can read more about this here: https://stackoverflow.com/a/5732290/120163)
This is the approach taken by DMS. You write the grammar. DMS parses and builds the AST using these ideas. No additional work or semantic actions on your part.
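
The first two observations (drop literal tokens, collapse unary chains unless the node carries semantics) can be sketched in a few lines. This is not DMS code or its API; the Node class, the Kind tags, and the toAst method are hypothetical, and automatic list formation is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of deriving an AST from a concrete syntax tree (CST).
public class CstToAstSketch {

    /** LITERAL: punctuation/keywords; RULE: plain grammar rule; SEMANTIC: must be kept. */
    enum Kind { LITERAL, RULE, SEMANTIC }

    static final class Node {
        final Kind kind;
        final String label;
        final List<Node> children;

        Node(Kind kind, String label, List<Node> children) {
            this.kind = kind;
            this.label = label;
            this.children = children;
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return label;
            }
            return label + children.stream()
                    .map(Node::toString)
                    .collect(Collectors.joining(", ", "(", ")"));
        }
    }

    static Node node(Kind kind, String label, Node... kids) {
        return new Node(kind, label, Arrays.asList(kids));
    }

    /** Drops literal tokens and collapses RULE nodes that end up with a single child. */
    static Node toAst(Node cst) {
        List<Node> kept = new ArrayList<>();
        for (Node child : cst.children) {
            if (child.kind != Kind.LITERAL) {
                kept.add(toAst(child));
            }
        }
        if (cst.kind == Kind.RULE && kept.size() == 1) {
            return kept.get(0);            // unary chain: nothing of its own to say
        }
        return new Node(cst.kind, cst.label, kept);
    }

    public static void main(String[] args) {
        // CST for "(a) + b" under a typical expr/term/factor grammar.
        Node cst = node(Kind.SEMANTIC, "add",
                node(Kind.RULE, "expr", node(Kind.RULE, "term", node(Kind.RULE, "factor",
                        node(Kind.LITERAL, "("),
                        node(Kind.RULE, "expr", node(Kind.RULE, "term",
                                node(Kind.RULE, "factor", node(Kind.SEMANTIC, "a")))),
                        node(Kind.LITERAL, ")")))),
                node(Kind.LITERAL, "+"),
                node(Kind.RULE, "term", node(Kind.RULE, "factor", node(Kind.SEMANTIC, "b"))));
        System.out.println(toAst(cst));
    }
}
```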
For a stone-stable grammar that already has this done for you, there's not a clear advantage, and if all you want is an AST then using JavaCC or ANTLR will work. If the grammar can change, then it is easier with DMS's approach.
But nobody wants just an AST. It's the first step in a long series of steps that leads to whatever tool you are imagining. As a practical matter with real tools, you will almost surely need "symbol tables" and the ability to determine which symbol table entry an identifier node selects. You may need control and data flow analysis. You may need to modify the AST if your tool is a "change" tool and not just an analysis tool, and for that you might want something that can match/patch arbitrary chunks of the AST using the surface syntax of your language (e.g., Java). Finally, you may want to regenerate source code from your AST as legal, compilable text.
These are not easy mechanisms to build. We think we are competent engineers; it took us several months, on and off over the last 5 years, to get the Java grammars (1.3 to 6 and 7) right. It took us about a year to build the symbol table machinery for Java; how symbols are resolved is a lot more complicated than you think; go read the language standard.
DMS provides all of these capabilities for many languages, including Java, out of the box. For those languages with lesser support, it has parsing, prettyprinting, tree transformations, and attribute evaluation out of the box.
I've been hearing, for the last 20 years, "If I just had a parser...". My experience (and the reason I built DMS) is that an AST is just not enough, by a long shot.
And I think what DMS provides (far) above and beyond "mere parsing" sets it far apart from "JavaCC and ANTLR". I do not believe they are "the best tools out there at the moment", unless you are optimizing on "free" and not "getting the job done". (If you want a free tool closer to the mark, consider using Eclipse's Java parsing machinery. At least it has, AFAIK, symbol table lookup.)

I know of two open source projects for creating and manipulating the Java AST:
javaparser
Eclipse JDT

Parsing Java Source Code

I have been asked to develop software that can create a flow chart / control flow diagram of input Java source code. So I started researching it and arrived at the following options:
To create the flow chart / control flow, I have to recognize the control statements and function calls made in the given source code. I see two ways of recognizing them:
Parse the source code by writing my own grammars (a complex solution, I think). I am thinking of using ANTLR for this.
Read the input source code files as text and search for specific patterns (may become inefficient).
Am I right here? Or am I missing something very fundamental and simple? Which approach would take less time and do the work efficiently? Any other suggestions in this regard are welcome too. Any other efficient approach would help, because the input source code may span multiple files and can be fairly complex.
I am good with .NET languages, but this is my first big project in Java. I have basic knowledge of compiler design, so writing grammars should not be impossible for me.
Sorry if I am being unclear. Please ask for any clarifications.

I'd go with Antlr and use an existing Java grammar: https://github.com/antlr/grammars-v4

Tools that handle Java code usually decide first whether they want to process Java source or Java bytecode files. That is a strategic decision and depends on your use case; I could imagine either for flow chart generation. Once you have decided that question, there are already several frameworks and libraries that could help you. For bytecode engineering there are ASM, Javassist, Soot, and BCEL (which seems to be dead). For parsing and analyzing the Java language there are Polyglot, the Eclipse compiler, and javac. All of these include a complete compiler front end for Java and are open source.
I would try to avoid writing my own parser for Java. I did that once. Java has a rather complex grammar, though one that can be found elsewhere. The real work begins with name and type resolution, and you will need both if you want to generate graphs that cover more than one method body.

Eclipse has a library for parsing source code and creating an abstract syntax tree from it, which would let you extract what you want.
See here for a tutorial
http://www.vogella.de/articles/EclipseJDT/article.html
See here for api
http://help.eclipse.org/indigo/topic/org.eclipse.jdt.doc.isv/reference/api/org/eclipse/jdt/core/dom/package-summary.html#package_description

Now I have two ways of recognizing:
You have many more ways than that. JavaCC ships with a Java 1.5 grammar already built, and I'm sure other parser generators do too. There is no reason for you to have to either write your own grammar or construct your own parser.
And specifically 'read[ing] input source code files as text and search for the specific patterns' isn't a viable choice at all, as it isn't parsing, and therefore cannot possibly recognize Java programs correctly.

Your input files are written in Java, and the software should be written in Java, but this is your first project in Java? First of all, I'd suggest learning the language with smaller projects. Also you need to learn how to use graphics in Java (there are various libraries). Then, you should focus on what you want to show on your graphs. Or is text sufficient?

The way I would do it is to analyse the compiled code. This would allow you to read jars without source and avoid parsing the code yourself. I would use ObjectWeb's ASM to read the class files.

A smarter solution is to use Eclipse's Java parser. Read more here: http://www.ibm.com/developerworks/opensource/library/os-ast/

Or even easier: use reflection. You should be able to compile the sources, load the classes with a Java class loader, and analyse them from there. I think this is far easier than any parsing.

Our DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery, with built in capability for parsing, building ASTs, constructing symbol tables, extracting control and data flow, transforming the ASTs, prettyprinting ASTs back to text, etc.
DMS is parameterized by an explicit language definition, and has a large set of preexisting definitions.
DMS's Java Front End already computes control and data flow graphs, so your problem would be reduced to exporting them.
EDIT 7/19/2014: Now handles Java 8.

Java (or bytecode) AST generators available so that I can run a couple of Visitors on top of its result?

I am looking for a tool that takes either a .java source file, a .class file, or a .jar and parses it, generating an AST (abstract syntax tree) that I can play with. I intend to create a couple of Visitors to run on top of it.
Do such tools exist in Java? There exists something similar in .NET, called Mono.Cecil (although it seems that as of today, it's not supporting the Visitor pattern by itself).
Thanks

You might be interested in the ASTParser used by the Eclipse IDE. Here is a nice article on getting started with it.

Our DMS Software Reengineering Toolkit is general purpose compiler machinery with support for parsing, building ASTs, building symbol tables, walking/inspecting/modifying the ASTs, and prettyprinting a modified AST back to source code. It also provides for pattern matching, with the patterns written in the surface syntax of the target language as defined by the parser it uses. DMS also provides generic facilities for computing control and data flow, as well as call graphs. DMS provides a complete ecosystem to support the construction of arbitrary analyzers, code transformers, or generators, depending on your needs.
DMS has an optional Java Front End which enables DMS to provide all this capability for processing Java and .class files.

How can I parse code to build a compiler in Java?

I need to write a compiler. It's homework at the university. The teacher told us that we can use any API we want to do the parsing of the code, as long as it is a good one. That way we can focus more on the JVM we will generate.
So yes, I'll write a compiler in Java to generate Java.
Do you know any good API for this? Should I use regex? I normally write my own parsers by hand, though it is not advisable in this scenario.
Any help would be appreciated.

Regex is good to use in a compiler, but only for recognizing tokens (i.e. no recursive structures).
The classic way of writing a compiler is to have a lexical analyzer for recognizing tokens, a syntax analyzer for recognizing structure, a semantic analyzer for recognizing meaning, an intermediate code generator, an optimizer, and finally a target code generator. Any of those steps can be merged, or skipped entirely, if that makes the compiler easier to write.
There have been many tools developed to help with this process. For Java, you can look at
ANTLR - http://www.antlr.org/
Coco/R - http://ssw.jku.at/Coco/
JavaCC - https://javacc.dev.java.net/
SableCC - http://sablecc.org/
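
As an illustration of the first point above (regex for tokens only), here is a minimal hand-written lexer built on java.util.regex. The class name RegexLexer, the three token kinds, and the "KIND:text" output format are assumptions chosen for the sketch. Note that it happily tokenizes unbalanced parentheses; checking nesting is a recursive structure and belongs to the syntax analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a regex-driven lexical analyzer for a tiny expression language.
public class RegexLexer {

    // Optional leading whitespace, then exactly one token; named groups identify its kind.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(?<NUM>\\d+)|(?<ID>[A-Za-z_]\\w*)|(?<OP>[-+*/()=]))");

    /** Splits the input into tokens rendered as "KIND:text". */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                if (input.substring(pos).trim().isEmpty()) {
                    break;                       // only trailing whitespace is left
                }
                throw new IllegalArgumentException("Unexpected input near offset " + pos);
            }
            if (m.group("NUM") != null) {
                tokens.add("NUM:" + m.group("NUM"));
            } else if (m.group("ID") != null) {
                tokens.add("ID:" + m.group("ID"));
            } else {
                tokens.add("OP:" + m.group("OP"));
            }
            pos = m.end();                       // continue right after the matched token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = (12 + y)"));
    }
}
```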

I would recommend ANTLR, primarily because of its output generation capabilities via StringTemplate.
What is better is that Terence Parr's book on the same is by far one of the better books oriented towards writing compilers with a parser generator.
Then you have ANTLRWorks which enables you to study and debug your grammar on the fly.
To top it all, the ANTLR wiki and documentation (although not comprehensive enough for my liking) are a good place to start for any beginner. They helped me refresh my knowledge of compiler writing in a week.

Have a look at JavaCC, a language parser for Java. It's very easy to use and get the hang of.

Go classic - Lex + Yacc. In Java it spells JAX and javacc. Javacc even has some Java grammars ready for inspection.

I'd recommend using either a metacompiler like ANTLR, or a simple parser combinator library. Functional Java has a parser combinator API. There's also JParsec. Both of these are based on the Parsec library for Haskell.
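
For readers who have not seen the parser combinator style, here is a tiny hand-rolled sketch of the idea. It is not the API of JParsec or Functional Java; every name in it (TinyCombinators, Parser, Result, then, many, ch, number, sum, evaluate) is made up for illustration. Small parsers are ordinary values, and bigger parsers are built by combining them; the example grammar is sum := number ('+' number)*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical sketch of parser combinators; not JParsec or Functional Java.
public class TinyCombinators {

    /** A successful parse: the value produced and the offset where parsing stopped. */
    static final class Result<T> {
        final T value;
        final int next;

        Result(T value, int next) {
            this.value = value;
            this.next = next;
        }
    }

    /** A parser maps (input, offset) to a result, or to empty on failure. */
    interface Parser<T> {
        Optional<Result<T>> parse(String in, int pos);

        /** Sequence: run this parser, then 'next', and combine the two values. */
        default <U, R> Parser<R> then(Parser<U> next, BiFunction<T, U, R> combine) {
            return (in, pos) -> {
                Optional<Result<T>> first = parse(in, pos);
                if (!first.isPresent()) {
                    return Optional.empty();
                }
                Optional<Result<U>> second = next.parse(in, first.get().next);
                if (!second.isPresent()) {
                    return Optional.empty();
                }
                R value = combine.apply(first.get().value, second.get().value);
                return Optional.of(new Result<R>(value, second.get().next));
            };
        }
    }

    /** Matches exactly one given character. */
    static Parser<Character> ch(char expected) {
        return (in, pos) -> {
            if (pos < in.length() && in.charAt(pos) == expected) {
                return Optional.of(new Result<Character>(expected, pos + 1));
            }
            return Optional.empty();
        };
    }

    /** Matches one or more digits and yields their integer value. */
    static Parser<Integer> number() {
        return (in, pos) -> {
            int end = pos;
            while (end < in.length() && Character.isDigit(in.charAt(end))) {
                end++;
            }
            if (end == pos) {
                return Optional.empty();
            }
            return Optional.of(new Result<Integer>(Integer.parseInt(in.substring(pos, end)), end));
        };
    }

    /** Repetition: applies 'p' zero or more times and collects the values. */
    static <T> Parser<List<T>> many(Parser<T> p) {
        return (in, pos) -> {
            List<T> values = new ArrayList<>();
            int at = pos;
            Optional<Result<T>> r;
            while ((r = p.parse(in, at)).isPresent()) {
                values.add(r.get().value);
                if (r.get().next == at) {
                    break;                 // guard against parsers that consume nothing
                }
                at = r.get().next;
            }
            return Optional.of(new Result<List<T>>(values, at));
        };
    }

    /** sum := number ('+' number)* */
    static Parser<Integer> sum() {
        Parser<Integer> plusNumber = ch('+').then(number(), (plus, n) -> n);
        Parser<List<Integer>> rest = many(plusNumber);
        return number().then(rest, (first, more) -> {
            int total = first;
            for (int m : more) {
                total += m;
            }
            return total;
        });
    }

    /** Parses the whole text as a sum, rejecting trailing garbage. */
    static int evaluate(String text) {
        Optional<Result<Integer>> r = sum().parse(text, 0);
        if (!r.isPresent() || r.get().next != text.length()) {
            throw new IllegalArgumentException("Not a sum: " + text);
        }
        return r.get().value;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("1+20+300"));
    }
}
```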

JFlex is a scanner generator which, according to the manual, is designed to work with the parser generator CUP.
One of the main design goals of JFlex was to make interfacing with the free Java parser generator CUP as easy as possibly [sic].
It also has support for BYACC/J, which, as its name suggests, is a port of Berkeley YACC to generate Java code.
I have used JFlex itself and liked it. However, the project I was doing was simple enough that I wrote the parser by hand, so I don't know how good either CUP or BYACC/J is.

I've used SableCC in my compiler course, though not by choice.
I remember finding it very bulky and heavyweight, with more emphasis on cleanliness than convenience (no operator precedence or anything; you have to state that in the grammar).
I'd probably want to use something else if I had the choice. My experiences with yacc (for C) and happy (for Haskell) have both been pleasant.

Parser combinators are a good choice. A popular Java implementation is JParsec.

If you're going to go hardcore, throw in a bit of http://llvm.org in the mix :)

I suggest you look at the source for BeanShell. It has a compiler for Java and is fairly simple to read.

http://java-source.net/open-source/parser-generators and http://catalog.compilertools.net/java.html contain catalogs of tools for this. Compare also the Stackoverflow question Alternatives to Regular Expressions.

Use a parser combinator, like JParsec. There's a good video tutorial on how to use it.
