Analyse C++ files from a Java program - java

After several days of research I turn to you.
I want to analyse a C++ file in order to:
Count the number of parameters in each method/function
Count the number of lines in each method/function
etc.
I first tried regular expressions, but without success (there are too many cases to handle, and the regexes become illegible).
Now I am trying ANTLR4. Unfortunately, I cannot find a grammar for C++ (I found a grammar for C here: https://github.com/antlr/grammars-v4).
(I also tried ANTLR3 with the grammar below, but with this grammar the output is C++ code!)
http://www.antlr3.org/grammar/1295920686207/antlr3.2_cpp_parser4.1.0.zip
So do you know where I can find a C++ grammar for ANTLR4?
Or do you know another way to do what I want?
Thank you in advance for your help.
PS: Sorry for my English, I'm a French student.

There are some good answers here. If I were you I would use a pre-built parser. After having tried to use ANTLR, I would say it takes a long time to make anything good. Personally I would try Clang.

Clang has a library that builds an AST, from which you can get the info you want.
Some existing tools already compute statistics of this kind, such as:
cccc
ccccc
...
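Since the question asks for a Java program, one possible way to use Clang from Java is to run it as an external process and read its AST dump. The sketch below is only that, a sketch: it assumes clang++ is installed and on the PATH, and the class and method names (ClangAstDump, command, dump) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: the class and method names are illustrative, not part of any library.
final class ClangAstDump {

    // Builds the clang command line that prints the AST of one file as JSON.
    static List<String> command(Path cppFile) {
        return List.of(
                "clang++",
                "-fsyntax-only",             // parse and type-check only, no code generation
                "-Xclang", "-ast-dump=json", // ask the front end to dump the AST as JSON
                cppFile.toString());
    }

    // Runs clang and returns the JSON text. Assumes clang++ is on the PATH.
    static String dump(Path cppFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(cppFile))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore diagnostics in this sketch
                .start();
        String json = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("clang++ exited with code " + exit);
        }
        return json;
    }
}
```

In the JSON that clang prints, function and method declarations should appear as nodes whose "kind" is FunctionDecl or CXXMethodDecl, with their parameters as ParmVarDecl children. Counting those children, and reading the line information in each node's range, would give the two metrics from the question. That step needs a JSON library of your choice and is left out here.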

Related

Obtaining the Subject of a String in Java

Suppose I tell my Java program to find the subject of the sentence:
I enjoy spending time with my family.
My program should output:
Tell me more about your family.
How would it go about doing this?
EDIT
I could do this by having an array of String and have that filled with every noun in the English dictionary, but is there a simpler way?
This is way too open-ended a question. But a good place to start would be to learn about Natural Language Processing concepts and then look at using a framework like CoreNLP. It breaks down sentences into a parse tree and you can use this to identify parts of speech and things like the subject of a sentence. This is probably your best bet if you want a reasonably-reliable method.

Java LR or LL Parsing

A teacher of mine said that Java cannot be LL parsed.
I don't understand this and wonder whether it is true.
I searched for a grammar of Java 8 and found this: https://github.com/antlr/grammars-v4/blob/master/java8/Java8.g4
But even after trying to analyze the grammar, I don't see the problem for LL parsing.
Does anyone know whether this is true, know of a formal proof, or can explain why it should not be possible to find a grammar for Java that can be LL parsed?
Thanks a lot, guys and girls.
The Java Language Specification for Java 7 says it is not LL(1):
The grammar presented in this chapter is the basis for the
reference implementation. Note that it is not an LL(1) grammar, though
in many cases it minimizes the necessary look ahead.
Your grammar won't be LL(1) if you find either:
left recursion, or
an alternative (A|B) whose branches share symbols in their FIRST sets, that is, FIRST(A) has one or more symbols that are also in FIRST(B).
I think it's due to left recursion. LL parsers cannot handle left recursion, and the Java grammar is specified using it in some cases, at least in Java 7.
Of course, it is well known that one can construct an equivalent grammar without left recursion, but as currently specified the Java grammar cannot be LL parsed.
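To make the left recursion point concrete, here is a minimal sketch that checks a toy grammar for direct left recursion, i.e. a rule with an alternative that starts with the rule's own nonterminal. The grammar representation (a map from nonterminal to a list of alternatives) and the names are assumptions made for this example; it does not read ANTLR or JLS grammars, and it does not detect indirect left recursion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy grammar representation (an assumption for this sketch): each nonterminal maps
// to a list of alternatives, and each alternative is a list of symbols.
final class LeftRecursionCheck {

    // Returns the nonterminals that have an alternative starting with themselves.
    static List<String> directlyLeftRecursive(Map<String, List<List<String>>> grammar) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> rule : grammar.entrySet()) {
            for (List<String> alternative : rule.getValue()) {
                if (!alternative.isEmpty() && alternative.get(0).equals(rule.getKey())) {
                    result.add(rule.getKey());
                    break; // one left-recursive alternative is enough
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Expr -> Expr '+' Term | Term     (left recursive)
        // Term -> NUMBER
        Map<String, List<List<String>>> grammar = Map.of(
                "Expr", List.of(List.of("Expr", "+", "Term"), List.of("Term")),
                "Term", List.of(List.of("NUMBER")));
        System.out.println(directlyLeftRecursive(grammar)); // prints [Expr]
    }
}
```

The usual rewrite for such a rule is Expr -> Term Expr' with Expr' -> '+' Term Expr' | (empty), which accepts the same strings without left recursion.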

Text processing in Java

Now this is a tricky problem for which I'm not able to figure out a good solution. Suppose we have a String in Java: "He ate 3 apples today." The digit 3 can be easily identified in Java using an isNumeric function or regular expressions. But what if I have a String like "He ate three apples today."? How can I identify that three is actually a number? I used OpenNLP and its POS tagger, but it takes far too much time! Can anyone suggest a better solution? Also, among the ".bin" files of OpenNLP there is one file, "num.bin", but I don't know how to use it, and the OpenNLP documentation says nothing about it. Can anyone tell me whether this is what I've been looking for and, if so, how to use it?
/*********************************************************************************************************************************/
I'm actually short of time here, so I've settled on a temporary solution: make a file/dictionary and load all the entries into a hash table. Then I'll tokenize my sentence and check it word by word for numbers, similar to what you guys suggested. I'll keep updating the file as and when required. Thanks for your valuable suggestions, guys, and if you have something better than this I'd be really glad to hear it. OpenNLP implements this in a very good way; the only problem with it is time complexity, and I want to do this in the minimum time possible.
Create a dictionary of numbers. Search for elements from that dictionary in the text.
Check the asymptotic complexity; it may be cheaper to sort the text first.
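A minimal sketch of the dictionary idea, using a hash map so that each word costs one lookup. The dictionary here is deliberately tiny and hard-coded; the class and method names are made up for the example, and a real version would load the entries from a file as the question suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NumberWords {

    // Deliberately tiny dictionary; a real one would be loaded from a file.
    private static final Map<String, Integer> WORDS = Map.of(
            "zero", 0, "one", 1, "two", 2, "three", 3, "four", 4,
            "five", 5, "six", 6, "seven", 7, "eight", 8, "nine", 9);

    // Returns the value of every token that is either a short digit string or a known number word.
    static List<Integer> findNumbers(String sentence) {
        List<Integer> found = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.matches("[0-9]{1,9}")) {       // at most 9 digits, so it always fits in an int
                found.add(Integer.parseInt(token));
            } else if (WORDS.containsKey(token)) {   // one hash lookup per word
                found.add(WORDS.get(token));
            }
        }
        return found;
    }
}
```

With this sketch, findNumbers("He ate three apples today.") and findNumbers("He ate 3 apples today.") both return a list containing 3. Multi-word numbers such as "three million" are not handled.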
You have to keep all those words in arrays and then use them. Here is an example of how to convert a number to a string; it may help you. I think you have to split your text into words and check whether each word is a number (e.g. three). If it is, check the next word, because it could be, say, "million", then check the next word, and so on. It's not easy and amounts to a small library, so I think you'll spend a lot of time writing it. Alternatively, search Google for a library like this: maybe someone has already had this problem, written a library, and shared it for free. Good luck.

Java API for plural forms of English words

Are there any Java API(s) which will provide plural form of English words (e.g. cacti for cactus)?
Check Evo Inflector which implements English pluralization algorithm based on Damian Conway paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports 100% success rate for 1000 most used English words and 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy, you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Take into account that, due to the open nature of Wiktionary, some data there might be incorrect.
Example Usage:
English.plural("Facility", 1); // == "Facility"
English.plural("Facility", 2); // == "Facilities"
jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"
I know there is a simple pluralize() function in Ruby on Rails; maybe you could get at it through JRuby. The problem really isn't easy: I have seen pages of rules on how to pluralize, and even they weren't complete. Some rules are not algorithmic; they depend on the stem's origin etc., which isn't easily obtained. So you have to decide how perfect you want to be.
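To illustrate why a table of exceptions is needed on top of the rules, here is a toy sketch with three suffix rules and a small irregular-word map. It is not how Rails or any of the libraries mentioned here work internally; the class name, the rules chosen and the word list are all assumptions for the example, and it gets many English words wrong.

```java
import java.util.Locale;
import java.util.Map;

final class ToyPluralizer {

    // Irregular forms cannot be derived by suffix rules, so they are looked up first.
    private static final Map<String, String> IRREGULAR = Map.of(
            "cactus", "cacti", "goose", "geese", "deer", "deer", "child", "children");

    static String plural(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        String irregular = IRREGULAR.get(w);
        if (irregular != null) {
            return irregular;
        }
        if (w.matches(".*(s|x|z|ch|sh)")) {
            return w + "es";                                  // box -> boxes
        }
        if (w.matches(".*[^aeiou]y")) {
            return w.substring(0, w.length() - 1) + "ies";    // facility -> facilities
        }
        return w + "s";                                       // apple -> apples
    }
}
```

Every word that the rules get wrong has to be added to the irregular map by hand, which is where the "how perfect do you want to be" decision comes in.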
Considering Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or google for "inflector" and "randall hauch".
It's hard to find this kind of API. Rather, you need to find a web service that can serve your purpose. Check this. I am not sure whether it can help you.
(I entered the word cacti and got cactus somewhere in the response.)
If you can harness JavaScript, I created a lightweight (7.19 KB) JavaScript library for this. Or you could port my script over to Java. It is very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: It looks like cacti to cactus is a super special conversion (most people are going to say '1 cactus' anyway). It is easy to add if you want to. The source code is easy to read and update.
Wolfram|Alpha returns a list of inflected forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/

implementing unification algorithm

I have spent the last 5 days understanding how the unification algorithm works in Prolog.
Now I want to implement this algorithm in Java.
I thought the best way might be to manipulate the string and decompose it into parts using some data structure such as a stack.
To make it clear:
Suppose the user input is:
a(X,c(d,X)) = a(2,c(d,Y)).
I already read it as one string and split it into two strings (Expression1 and Expression2).
Now, how can I tell whether the next char(s) form a variable, a constant, etc.?
I can do it with nested ifs, but that does not seem like a good solution to me.
I tried to use inheritance, but the problem remains (how can I know the type of the chars being read?).
First you need to parse the inputs and build expression trees. Then apply Milner's unification algorithm (or some other unification algorithm) to figure out the mapping of variables to constants and expressions.
A really good description of Milner's algorithm may be found in the Dragon Book: "Compilers: Principles, Techniques and Tools" by Aho, Sethi and Ullman. (Milner's algorithm can also cope with unification of cyclic graphs, and the Dragon Book presents it as a way to do type inference.) By the sounds of it, you could benefit from learning a bit about parsing, which is also covered by the Dragon Book.
EDIT: Other answers have suggested using a parser generator; e.g. ANTLR. That's good advice, but (judging from your example) your grammar is so simple that you could also get by with using StringTokenizer and a hand-written recursive descent parser. In fact, if you've got the time (and inclination) it is worth implementing the parser both ways as a learning exercise.
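For what it's worth, a hand-written recursive descent parser over StringTokenizer could look roughly like the sketch below. It assumes a very small grammar, term := NAME [ '(' term { ',' term } ')' ], which covers the example from the question but not real Prolog syntax (no operators, lists, quoted atoms or the trailing full stop). The class names are made up. The isVariable method uses the Prolog convention that names starting with an uppercase letter or an underscore are variables, which answers the "variable or constant" part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hand-written recursive descent parser for terms such as a(X,c(d,X)).
// Grammar assumed here: term := NAME [ '(' term { ',' term } ')' ]
final class TermParser {

    // Parse tree node: a name plus zero or more argument subtrees.
    static final class Node {
        final String name;
        final List<Node> args = new ArrayList<>();

        Node(String name) { this.name = name; }

        // Prolog convention: names starting with an uppercase letter or '_' are variables.
        boolean isVariable() {
            char c = name.charAt(0);
            return Character.isUpperCase(c) || c == '_';
        }

        @Override public String toString() {
            if (args.isEmpty()) return name;
            StringBuilder sb = new StringBuilder(name).append('(');
            for (int i = 0; i < args.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(args.get(i));
            }
            return sb.append(')').toString();
        }
    }

    private final StringTokenizer tokens;
    private String lookahead; // current token, or null at end of input

    private TermParser(String input) {
        // Whitespace is stripped first; the delimiters "(", ")" and "," are returned as tokens.
        tokens = new StringTokenizer(input.replaceAll("\\s+", ""), "(),", true);
        advance();
    }

    static Node parse(String input) {
        TermParser parser = new TermParser(input);
        Node node = parser.term();
        if (parser.lookahead != null) {
            throw new IllegalArgumentException("Unexpected token: " + parser.lookahead);
        }
        return node;
    }

    private void advance() {
        lookahead = tokens.hasMoreTokens() ? tokens.nextToken() : null;
    }

    private void expect(String token) {
        if (!token.equals(lookahead)) {
            throw new IllegalArgumentException("Expected " + token + " but got " + lookahead);
        }
        advance();
    }

    // term := NAME [ '(' term { ',' term } ')' ]
    private Node term() {
        if (lookahead == null || "(),".contains(lookahead)) {
            throw new IllegalArgumentException("Expected a name but got " + lookahead);
        }
        Node node = new Node(lookahead);
        advance();
        if ("(".equals(lookahead)) {
            advance();
            node.args.add(term());          // recursion handles nested terms
            while (",".equals(lookahead)) {
                advance();
                node.args.add(term());
            }
            expect(")");
        }
        return node;
    }
}
```

Calling TermParser.parse("a(X,c(d,X))") gives a node named a with two arguments: the variable X and the compound c(d,X). Each side of the = in the question would be parsed separately this way.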
It sounds like this problem is more to do with parsing than unification specifically. Using something like ANTLR might help in terms of turning the original string into some kind of tree structure.
(It's not quite clear what you mean by "do it by nested", but if you mean that you're doing something like trying to read an expression, and recursing when meeting each "(", then that's actually one of the right ways to do it -- this is at heart what the code that ANTLR generates for you will do.)
If you are more interested in the mechanics of unifying things than you are in parsing, then one perfectly good way to do this is to construct the internal representation in code directly, and put off the parsing aspect for now. This can get a bit annoying during development, as your Prolog-style statements are now a rather verbose set of Java statements, but it lets you focus on one problem at a time, which is usually helpful.
(If you structure things this way, this should make it straightforward to insert a proper parser later, that will produce the same sort of tree as you have until then been constructing by hand. This will let you attack the two problems separately in a reasonably neat fashion.)
Before you get to do the semantics of the language, you have to convert the text into a form that's easy to operate on. This process is called parsing and the semantic representation is called an abstract syntax tree (AST).
A simple recursive descent parser for Prolog might be hand written, but it's more common to use a parser toolkit such as Rats! or ANTLR.
In an AST for Prolog, you might have a class Term, with CompoundTerm, Variable, and Atom all being Terms. Polymorphism allows the arguments of a compound term to be any Term.
Your unification algorithm then becomes unifying the names of the two compound terms, and recursively unifying each pair of corresponding arguments.
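A minimal sketch of that class layout and of a recursive unify, built by hand without a parser. The class names follow the ones mentioned above; the field names, the bindings map and the resolve helper are assumptions made for this example. Like most Prolog systems by default, it performs no occurs check.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Unifier {

    // Minimal term hierarchy; field names are assumptions for this sketch.
    interface Term {}

    static final class Atom implements Term {
        final String name;
        Atom(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class Variable implements Term {
        final String name;
        Variable(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class CompoundTerm implements Term {
        final String functor;
        final List<Term> args;
        CompoundTerm(String functor, Term... args) {
            this.functor = functor;
            this.args = List.of(args);
        }
    }

    // Follows variable bindings until reaching an unbound variable or a non-variable term.
    static Term resolve(Term t, Map<String, Term> bindings) {
        while (t instanceof Variable && bindings.containsKey(((Variable) t).name)) {
            t = bindings.get(((Variable) t).name);
        }
        return t;
    }

    // Returns true and extends bindings if a and b unify. No occurs check.
    // On failure, bindings may be left with partial results.
    static boolean unify(Term a, Term b, Map<String, Term> bindings) {
        a = resolve(a, bindings);
        b = resolve(b, bindings);
        if (a instanceof Variable) {
            boolean sameVariable = b instanceof Variable
                    && ((Variable) b).name.equals(((Variable) a).name);
            if (!sameVariable) bindings.put(((Variable) a).name, b);
            return true;
        }
        if (b instanceof Variable) {
            bindings.put(((Variable) b).name, a);
            return true;
        }
        if (a instanceof Atom && b instanceof Atom) {
            return ((Atom) a).name.equals(((Atom) b).name);
        }
        if (a instanceof CompoundTerm && b instanceof CompoundTerm) {
            CompoundTerm x = (CompoundTerm) a;
            CompoundTerm y = (CompoundTerm) b;
            if (!x.functor.equals(y.functor) || x.args.size() != y.args.size()) return false;
            for (int i = 0; i < x.args.size(); i++) {
                if (!unify(x.args.get(i), y.args.get(i), bindings)) return false;
            }
            return true;
        }
        return false; // an atom never unifies with a compound term
    }

    public static void main(String[] args) {
        // a(X,c(d,X)) = a(2,c(d,Y)), the example from the question.
        Term left = new CompoundTerm("a", new Variable("X"),
                new CompoundTerm("c", new Atom("d"), new Variable("X")));
        Term right = new CompoundTerm("a", new Atom("2"),
                new CompoundTerm("c", new Atom("d"), new Variable("Y")));
        Map<String, Term> bindings = new HashMap<>();
        System.out.println(unify(left, right, bindings));
        System.out.println(bindings);
    }
}
```

For the example from the question, unify succeeds and binds both X and Y to 2: X is bound when the first arguments are unified, and Y is bound when the already-resolved X meets Y inside c(...).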
