Is there a quick way to find all the commented-out code across Java files in Eclipse?
Any option in Search, perhaps, or any add-on that can do this?
It should be able to find only code which is commented out, but not ordinary comments.
In Eclipse, I just do a file search with the regular expression checkbox turned on:
(/\*.*;.*\*/)|(//.*;)
It will find semicolons in
// These;
and /* these; */
Works for me.
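If you want the same check outside Eclipse (say, in a build script), the pattern drops straight into java.util.regex. A minimal sketch; the class name and the directory walking are my own illustration, not part of the original answer:

import java.io.IOException;
import java.nio.file.*;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class CommentedCodeFinder {

    // Same idea as the Eclipse file search: a semicolon inside a comment.
    private static final Pattern COMMENTED_CODE =
            Pattern.compile("(/\\*.*;.*\\*/)|(//.*;)");

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(p -> p.toString().endsWith(".java"))
                 .forEach(CommentedCodeFinder::scan);
        }
    }

    private static void scan(Path file) {
        try {
            int lineNo = 0;
            for (String line : Files.readAllLines(file)) {
                lineNo++;
                if (COMMENTED_CODE.matcher(line).find()) {
                    System.out.printf("%s:%d: %s%n", file, lineNo, line.trim());
                }
            }
        } catch (IOException e) {
            System.err.println("Could not read " + file + ": " + e.getMessage());
        }
    }
}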
Sonar can do it: http://www.sonarsource.org/commented-out-code-eradication-with-sonar/
You can mark your own commented code with a task tag. You can create your own task tags in Eclipse.
From the menu, go to Window -> Preferences. In the Preferences dialog, go to General -> Editors -> Structured Text Editors -> Task Tags.
Add an appropriate task tag, like COMMENTED. Set the priority to Low.
Then you can mark any code you comment out with the COMMENTED task tag. A list of these task tags, along with their locations, appears in the Tasks view.
@Jorn said:
I think [the OP] wants to find code that is commented out, not code that has a comment.
If the intention is to find commented out code, then I don't think it is possible in general. The problem is that it is impossible to distinguish between comments that were written as code or pseudo-code, and code that is commented out. Making that distinction requires human intelligence.
Now, IDEs typically have a "toggle comment" function that comments out code in a particular way. It would be feasible to write a tool / plugin that matches the style produced by a particular IDE. But that's probably not good enough, especially since reformatting the code typically gets rid of the characteristics that made the commented-out code recognizable.
If the problem is to find commented-out code, what is needed is a way to find comments, and a way to decide if a comment might contain code.
A simple way to do this is to search for comments that contain code-like things. I'd be tempted to hunt for comments containing a ";" character (or some other rare indicator such as "="); it would be pretty hard to write any interesting commented-out code that doesn't contain one, and in my experience few prose comments do. A regexp search for this should be pretty straightforward, even if it picks up a few additional false positives (e.g. "//" in a string literal).
A more sophisticated way to accomplish this is to use a Java lexer or parser. If you have a lexer that returns comments as tokens (not all of them do; Java compilers aren't interested in comments), then you can simply scan the lexemes for a comment and do the semicolon check I described above. With this approach you won't get any false-positive hits for comment-like things in string literals.
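As a concrete sketch of the lexer idea, here is a small hand-rolled scanner (my own illustration, not a particular library) that tracks string literals, so "//" inside a string no longer produces a false positive, and then applies the semicolon heuristic to each comment it finds:

import java.util.ArrayList;
import java.util.List;

public class CommentScanner {

    // Extracts comment bodies from Java source, skipping string/char literals.
    static List<String> comments(String source) {
        List<String> out = new ArrayList<>();
        int i = 0, n = source.length();
        while (i < n) {
            char c = source.charAt(i);
            if (c == '"' || c == '\'') {                    // skip a literal
                char quote = c;
                i++;
                while (i < n && source.charAt(i) != quote) {
                    if (source.charAt(i) == '\\') i++;      // skip escape
                    i++;
                }
                i++;                                        // past closing quote
            } else if (c == '/' && i + 1 < n && source.charAt(i + 1) == '/') {
                int end = source.indexOf('\n', i);          // line comment
                if (end < 0) end = n;
                out.add(source.substring(i + 2, end));
                i = end;
            } else if (c == '/' && i + 1 < n && source.charAt(i + 1) == '*') {
                int end = source.indexOf("*/", i + 2);      // block comment
                if (end < 0) end = n;
                out.add(source.substring(i + 2, end));
                i = (end < n) ? end + 2 : n;
            } else {
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "String s = \"// not a comment;\"; // leftover(); code;\n";
        comments(src).stream()
                     .filter(c -> c.contains(";"))          // the semicolon heuristic
                     .forEach(c -> System.out.println("possible code:" + c));
    }
}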
If you have a re-engineering parser that captures comments as part of the AST (such as our SD Java Front End), you can mechanically scan the parse tree for comments, feed the comment content back to the parser to see if it is code-like, and report any comment that passes that test, modulo some size-dependent error rate (10 errors in 15 characters implies "really is a comment"). Now, the "code-like" test requires that the re-engineering parser be willing to recognize any substring of the (Java) language. Our DMS Software Reengineering Toolkit underlying the Java Front End can actually do that, using access to the grammar buried in the front end, as it is willing to start a parse from any language (non)terminal; here the question is "can you find a sequence of (non)terminals that consumes the string?".
The lexer and parser approaches are small and big sledgehammers respectively. If the OP is going to do this just once, he can stick to the manual regex search. If the problem is to vet the code base repeatedly (as needed in big organizations), he'd want a tool that can be run on a regular basis.
You can do a search in Eclipse.
All you need to search for is /* and //
However, that will only find the files which contain those expressions, not the actual commented-out content, which I believe you are after.
If you are using Linux, though, you can easily get all the comments with a one-liner.
Related
A colleague's Eclipse formatting rules cause extreme indentation and wrap comments to look like this:
						// this
						// is
						// indented
						// several
						// tabs
They're wrapped at every word boundary because they're indented all the way past the line wrap width. I can reformat them to look only slightly more sane:
// this
// is
// indented
// several
// tabs
but the wrapping remains. Is there any way I can automagically undo this wrapping, so that I don't have to spend 30 minutes manually reformatting these comments to make them readable again every time I commit? I don't care if other line breaks are not preserved; that would be a reasonable trade. Target result:
// this is indented several tabs
You are not the first to be frustrated by this (me included); I have never found a good way to undo this kind of formatting.
To prevent it happening again, there is of course the option of fixing the formatting rules and then applying them to the project instead of to the workspace. That ensures that if your colleague does format, the comments won't be ruined like this. I recommend setting (in the Comments tab of the formatter):
turn on Never indent line comments on first column --> this prevents commented out code from being indented and lost
turn off Enable line comment formatting --> this fixes the wrapping problem
Combine those settings with using block /* */ comments for actual block comments (instead of what I often see: line // comments used for block comments).
Some other SO users have posted similar questions with no full resolution, but with some suggestions that may help, such as using a third-party formatter (perhaps only once, to recover your code's state, and then continuing as above):
How to reformat multi-line comments in Eclipse PDT?
Join Lines in Eclipse
I don't think this will be possible with standard IDEs; whilst they're usually good at reflowing lines over the margin, most are reluctant to reflow legitimate lines (for good reasons, generally).
That said, I know you can do this semi-automatically in IntelliJ IDEA (sorry, haven't used Eclipse for a while, but I think it's now possible there too):
Select the lines with comments and type Ctrl-Shift-J (or equivalent mapping to join lines). This immediately pulls them into one line. Ctrl-Alt-L to reformat that selection if you need to.
It's that or regex, I guess... but be warned...
Try going to Window -> Preferences -> Java -> Code Style -> Formatter and then click the Edit... button towards the top right, under the Active Profile header.
Mess around with the settings in there, specifically on the Line Wrapping and Comments tabs.
I have a small working text editor written in Java using JTextPane, and I would like to color specific words. It would work the same way keyword coloring works in Eclipse's Java editor: once the user finishes typing a keyword, it is highlighted. I am new to text editors implemented in Java; any ideas?
In computer science, lexical analysis is the process of converting a
sequence of characters into a sequence of tokens. A program or
function that performs lexical analysis is called a lexical analyzer,
lexer, tokenizer,[1] or scanner. A lexer often exists as a single
function which is called by a parser or another function, or can be
combined with the parser in scannerless parsing.
Having said that, it is no trivial task, and a high-level library will ease it considerably. So what is the way out?
Use ANTLR. Here is what its site says:
ANTLR is a powerful parser generator that you can use to read,
process, execute, or translate structured text or binary files. It’s
widely used in academia and industry to build all sorts of languages,
tools, and frameworks....
NetBeans IDE parses C++ with ANTLR.
There, problem solved. The author of ANTLR also has a book on how to use it, which you may want to buy if you want to learn ANTLR properly.
Having given you enough brain melt, there is an out-of-the-box solution available for you: JSyntaxPane. Just like any JComponent, you initialize it and pop it into a JFrame. It works like a charm, and it supports a whole lot of languages apart from Java.
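Basic JSyntaxPane usage looks roughly like this (adapted from the project's documented example; package and content-type names may vary between versions):

import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import jsyntaxpane.DefaultSyntaxKit;

public class EditorDemo {
    public static void main(String[] args) {
        DefaultSyntaxKit.initKit();                 // register the syntax kits first

        JEditorPane editor = new JEditorPane();
        JScrollPane scroller = new JScrollPane(editor);
        editor.setContentType("text/java");         // switch on Java highlighting
        editor.setText("public static void main(String[] args) {}");

        JFrame frame = new JFrame("JSyntaxPane demo");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.add(scroller);
        frame.setSize(600, 400);
        frame.setVisible(true);
    }
}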
Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages, and I do it periodically. This works fine until the HTML of a webpage changes; each change forces a change in the existing Java code, which is a tedious task, because these webpages change very frequently. It also requires a programmer to fix the Java code. Here is an example of the HTML code of interest on a webpage:
<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>
Now here is what I want to do: I want to save this webpage (an HTML file) locally and create a template out of it, like:
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>
Along with the actual URLs of the webpages, these HTML templates will be the input to the Java program, which will find the locations of the predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.
This way I wouldn't have to modify the Java program every time a webpage changes; I would just save the webpage's HTML, replace the data with these keywords, and the rest would be taken care of by the program. For example, in the future the actual HTML code may look like this:
<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>
and the corresponding template will look like this:
<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>
Also, creating these kinds of templates can be done by a non-programmer: anyone who can edit a file.
Now the question is: how can I achieve this in Java, and is there any existing, better approach to this problem?
Note: While googling I found some research papers, but most of them require some prior training data, and accuracy is also a matter of concern.
The approach you gave is pretty much similar to Gilbert's, except for the regex part. I don't want to step into the ugly regex world; I am planning to use the template approach for many other areas apart from movie info, e.g. prices, product-spec extraction, etc.
1. The template you describe is not actually a "template" in the normal sense of the word: a set of static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template: it is a parsing pattern that is slurped up and discarded, leaving the desired parameters to be found.
2. Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its essential features, making the minimum of assumptions. That is, you want to commit to literally matching key text such as "Rating:" while treating interleaving markup such as "<b/>" in a much more flexible manner: ignoring it and allowing it to change without breaking.
3. When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions; the template approach IS the parsing approach using a regular expression. They are one and the same. The question is: what form should the regular expression take?
3A. If you use Java hand-coding to do the parsing, then the obvious answer is that the regular-expression format should just be the java.util.regex format. Anything else is a development burden, is "non-standard", and will be hard to maintain.
3B. If you want to use an HTML-aware parser, then jsoup is a good solution. The problem is that you need more text/regular-expression handling and flexibility than jsoup seems to provide. It seems too locked into specific HTML tags and structures, and so breaks when pages change.
3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR, where a form of Backus-Naur-inspired grammar controls the parsing and action code is inserted to process the parsed data. Here, the parsing grammar expressions can be very powerful indeed, with complex rules for how text is ordered on the page and how text fields and values relate to each other. The power is beyond your requirements, because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip, such as markup tags. Also, wrestling with ANTLR for the first time involves an educational investment before you get a productivity payback.
3D. Is there a Java tool that just uses a simple template-type approach to give a simple answer? Well, a google search doesn't give too much hope: https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing, because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view; it just reflects the problem space.
My vote is for (3A) as the simplest, most powerful, and most flexible solution for your needs.
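For illustration, here is a minimal sketch of (3A) with java.util.regex; the two patterns are mine, and they are exactly the part a maintainer would edit (ideally in a config file) when a page changes:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractor {
    public static void main(String[] args) {
        String html = "<div><p><strong>Score:</strong>2.5/5</p>"
                    + "<p><strong>Director:</strong> Bryan Singer</p></div>";

        // One pattern per field; group(1) captures the value. The lazy .*?
        // skips whatever markup sits between the label and the value, so
        // minor page changes (different tags, extra spaces) don't break it.
        Map<String, Pattern> fields = new LinkedHashMap<>();
        fields.put("MOVIE_RATING",
                Pattern.compile("Score:.*?([0-9.]+/[0-9]+)"));
        fields.put("MOVIE_DIRECTOR",
                Pattern.compile("Director:\\s*(?:</?\\w+>)*\\s*([^<]+)"));

        for (Map.Entry<String, Pattern> f : fields.entrySet()) {
            Matcher m = f.getValue().matcher(html);
            System.out.println(f.getKey() + " = "
                    + (m.find() ? m.group(1).trim() : "<not found>"));
        }
    }
}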
Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your selector queries to a configuration file.
Your non-programmer doesn't even have to see HTML; they just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out which selector to actually use.
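A minimal sketch of that setup, assuming a java.util.Properties file holds the selectors (the file name, property keys, and URL are made up; Document.selectFirst needs a reasonably recent jsoup):

import java.io.FileReader;
import java.util.Properties;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ConfigDrivenScraper {
    public static void main(String[] args) throws Exception {
        // selectors.properties, edited by the non-programmer, e.g.:
        //   movie.rating   = p:contains(Score:)
        //   movie.director = p:contains(Director:)
        Properties selectors = new Properties();
        try (FileReader in = new FileReader("selectors.properties")) {
            selectors.load(in);
        }

        Document doc = Jsoup.connect("http://example.com/movie").get();

        for (String field : selectors.stringPropertyNames()) {
            Element el = doc.selectFirst(selectors.getProperty(field));
            // ownText() skips the <strong> label, leaving just the value
            System.out.println(field + " = "
                    + (el == null ? "<not found>" : el.ownText().trim()));
        }
    }
}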
How can I achieve this in Java and is there any existing and better approach to this problem?
The template approach is a good approach. You gave all of the reasons why in your question.
Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.
On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.
Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until he or she can code them.
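One way to start breaking it down (my own sketch, not a full matcher): parse the template with Jsoup, find each element whose own text contains a placeholder, record a unique selector for it via Element.cssSelector(), and apply those selectors to the processed page. This only works while template and page share the same structure, which is the premise of the approach:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TemplateScraper {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([A-Z_]+)\\}");

    // Maps each placeholder in the template to a CSS selector for its element.
    static Map<String, String> compile(String templateHtml) {
        Map<String, String> selectors = new LinkedHashMap<>();
        Document template = Jsoup.parse(templateHtml);
        for (Element el : template.getAllElements()) {
            Matcher m = PLACEHOLDER.matcher(el.ownText());
            if (m.find()) {
                selectors.put(m.group(1), el.cssSelector()); // unique path to element
            }
        }
        return selectors;
    }

    public static void main(String[] args) {
        String template = "<div><p><strong>Score:</strong>{MOVIE_RATING}</p>"
                        + "<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p></div>";
        String page = "<div><p><strong>Score:</strong>2.5/5</p>"
                    + "<p><strong>Director:</strong> Bryan Singer</p></div>";

        Document live = Jsoup.parse(page);
        compile(template).forEach((field, selector) -> {
            Element el = live.selectFirst(selector);
            // ownText() skips the <strong> label, leaving just the value
            System.out.println(field + " = "
                    + (el == null ? "<not found>" : el.ownText().trim()));
        });
    }
}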
If the web page changes frequently, then you'll probably want to confine your search for the fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work and either "template" can consist of a simple list of search expressions, regex or css, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.
For example, the regex could look like this:
"Score:"(.)*[0-9]\.[0-9]\/[0-9]
(I haven't tested this.)
Or you can try a different approach, using what I would call 'rules' instead of templates: for each piece of information that you need from the page, you define jQuery expression(s) that extract the text. Often, when a page change is small, the same well-written jQuery expressions will still give the same results.
Then you can use Jerry (jQuery in Java) with almost the same expressions to fetch the text you are looking for. So it's not only about selectors; you also have other jQuery methods for walking/filtering the DOM tree.
For example, the rule for some Director text would be (in a sort of pseudo-Java-Jerry code):
$.find("div#movie").find("div:nth-child(2)")....text();
There could be more (and more complex) expressions in a rule, spread across several lines, that, for example, iterate over some nodes, etc.
If you are an OO person, each rule may be defined in its own implementation class. If you are a Groovy person, you can even rewrite rules when needed without recompiling your project, while staying in Java. Etc.
As you see, the core idea here is to define rules for how to find your text, rather than matching full patterns, which may be fragile to minor changes (imagine if just a space were added between two divs :)). In this example of mine, I've used jQuery-like syntax (actually, Jerry-like syntax, since we are in Java) to define the rules. This is only because jQuery is popular and simple, and known by your web developer too; in the end you can define your own syntax (depending on the parsing tool you are using): for example, you may parse the HTML into a DOM tree and then write rules using your own helper methods for how to traverse it to the place of interest. Jerry also gives you access to the underlying DOM tree.
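For what it's worth, such a rule in actual Jerry code might look like this; the HTML is adapted from the question's second example, with a hypothetical id added so the selector has an anchor, and I'm assuming the Jodd Jerry API roughly as documented ($ and text()):

import jodd.jerry.Jerry;

public class DirectorRule {
    public static void main(String[] args) {
        String html = "<div id=\"movie\">"
                    + "<div><b>Rating:</b>**1/2</div>"
                    + "<div><i>Director:</i>Singer, Bryan</div>"
                    + "</div>";

        Jerry doc = Jerry.jerry(html);

        // Rule: the director lives in the second child div of #movie;
        // strip the label rather than depending on the <i> tag around it.
        String director = doc.$("div#movie div:nth-child(2)").text()
                             .replace("Director:", "").trim();

        System.out.println("Director: " + director);   // Singer, Bryan
    }
}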
Hope this helps.
I used the following approach to do something similar in a personal project of mine that generates an RSS feed out of here, the leading real estate website in Spain.
Using this tool I found the rented place I'm currently living in ;-)
Get the HTML code from the page.
Transform the HTML into XHTML. I used this library; I guess there might be better options available today.
Use XPath to navigate the XHTML to the information you're interested in.
Of course, every time they change the original page you will have to change the XPath expression. The other approach I can think of (semantic analysis of the original HTML source) is far, far beyond my humble skills ;-)
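A sketch of steps 2 and 3: the tidying library linked in the answer is unknown, so as one option I substitute jsoup's W3CDom helper to get a well-formed W3C DOM, and then query it with javax.xml.xpath. The HTML and the XPath expression are illustrative:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;

public class XPathScrape {
    public static void main(String[] args) throws Exception {
        // Steps 1-2: parse (possibly messy) HTML and convert to a W3C DOM.
        org.jsoup.nodes.Document jsoupDoc =
                Jsoup.parse("<div><p><b>Price:</b>950</p></div>");
        Document dom = new W3CDom().fromJsoup(jsoupDoc);

        // Step 3: navigate to the interesting bit with XPath.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String price = xpath.evaluate("//p[b='Price:']/text()", dom);

        System.out.println("Price: " + price);   // prints: Price: 950
    }
}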
I am trying to parse out sentences from a huge amount of text using Java. I started off with NLP tools like OpenNLP and Stanford's Parser.
But here is where I get stuck: though both these parsers are pretty great, they fail when it comes to non-uniform text.
For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both parsers fail miserably.
I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!
Any ideas??
Edit: To make it simpler, I am looking to parse text where the delimiter is either a new line ("\n") or a period (".").
First you have to clearly define the task. What, precisely, is your definition of 'a sentence?' Until you have such a definition, you will just wander in circles.
Second, cleaning dirty text is usually a rather different task from sentence splitting. The various NLP sentence chunkers assume relatively clean input text; getting from HTML, or extracted PowerPoint, or other noise, to text is another problem.
Third, Stanford and other large-caliber tools are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.
Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass, and then write a rule-based post-processor to correct mistakes.
I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.
EDIT: If you are taking in HTML input, then you should preprocess it first, for example handling bulleted lists and the like. Then apply your splitter.
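For the simplified definition in the question's edit (delimiter is "\n" or "."), even the first pass can be a single lookbehind split, with the rule-based cleanup layered on afterwards. A minimal sketch:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SimpleSplitter {

    // Splits after every '.' or newline, keeping the delimiter with its sentence.
    static List<String> split(String text) {
        return Arrays.stream(text.split("(?<=\\.)|(?<=\\n)"))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence.\n"
                    + "- a bullet point without a period\n"
                    + "- another bullet\n"
                    + "Back to normal. The end.";
        split(text).forEach(s -> System.out.println("[" + s + "]"));
    }
}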
There's one more excellent toolkit for natural language processing: GATE. It has a number of sentence splitters, including the standard ANNIE sentence splitter (which doesn't fit your needs completely) and a RegEx sentence splitter. Use the latter for any tricky splitting.
The exact pipeline for your purpose is:
Document Reset PR.
ANNIE English Tokenizer.
ANNIE RegEx Sentence Splitter.
Also, you can use GATE's JAPE rules for even more flexible pattern searching (see the Tao, GATE's user guide, for full documentation).
If you would like to stick with Stanford NLP or OpenNLP, then you'd better retrain the model. Almost all of the tools in these packages are machine-learning based. Only with customized training data can they give you an ideal model and performance.
Here is my suggestion: manually split the sentences based on your criteria. I guess a couple of thousand sentences is enough. Then call the API or the command line to retrain the sentence splitter. Then you're done!
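For OpenNLP, the retraining step looks roughly like this; the API shifts between versions, so treat this as a sketch in the 1.8-era style, with your manually split sentences one per line in sentences.train:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.sentdetect.*;
import opennlp.tools.util.*;

public class RetrainSentenceDetector {
    public static void main(String[] args) throws Exception {
        // Training data: the sentences you split manually, one per line.
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("sentences.train"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        // Custom end-of-sentence characters per the question: '.' and newline.
        SentenceModel model = SentenceDetectorME.train(
                "en", samples,
                new SentenceDetectorFactory("en", true, null, new char[]{'.', '\n'}),
                TrainingParameters.defaultParams());

        try (OutputStream out = new FileOutputStream("custom-sent.bin")) {
            model.serialize(out);
        }
    }
}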
But first of all, one thing you need to figure out is, as said in a previous answer: "First you have to clearly define the task. What, precisely, is your definition of 'a sentence'?"
I'm using Stanford NLP and OpenNLP in my project Dishes Map, a delicious-dishes discovery engine based on NLP and machine learning. They're working very well!
For a similar case, what I did was separate the text into different sentences (separated by new lines) based on where I wanted the text to split; in your case that is text starting with bullets (or, more precisely, text with a line-break tag at the end). This also solves the similar problem which may occur if you are working with HTML.
After separating the text into different lines, you can send the individual lines for sentence detection, which will be more accurate.
Bit of a random one: I want to have a play with some NLP stuff, and I would like to:
Get all the text that will be displayed to the user in a browser from HTML.
My ideal output would not have any tags in it and would only have full stops (and any other punctuation used) and newline characters, though I can tolerate a fairly reasonable amount of failure here (random other stuff ending up in the output).
If there were a way of inserting a newline or full stop in situations where the content is likely not to continue on, that would be considered an added bonus. E.g.:
items in a ul or option tag could be separated by full stops (or, to be honest, just ignored).
I am working in Java, but would be interested in seeing any code that does this.
I can (and will, if required) come up with something to do this; I just wondered if there was anything like this out there already, as it would probably be better than what I could come up with in an afternoon ;-)
An example of the code I might write, if I do end up doing this, would use a SAX parser to find content in p tags, strip it of any span or strong etc. tags, and add a full stop if I hit a div or another p without having had a full stop.
Any pointers or suggestions very welcome.
Hmmm... almost any HTML parser could be used to create the effect you want: just run through all of the tags and emit only the text elements, emitting an LF for the closing tag of every block element. As you say, a SAX implementation would be simple and straightforward.
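Since jsoup already comes up elsewhere on this page, here is a sketch of exactly that walk using its NodeTraversor (the static traverse call needs a reasonably recent jsoup): emit text nodes on the way in, and a newline whenever a block element closes. The question's full-stop heuristic could be added in tail() the same way:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class VisibleText {

    static String extract(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("script, style").remove();   // drop text that is never displayed
        StringBuilder sb = new StringBuilder();
        NodeTraversor.traverse(new NodeVisitor() {
            @Override public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).text());   // visible text only
                }
            }
            @Override public void tail(Node node, int depth) {
                // Newline after each block element: p, div, li, ...
                if (node instanceof Element && ((Element) node).isBlock()) {
                    sb.append('\n');
                }
            }
        }, doc.body());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(extract(
                "<p>Hello <strong>world</strong>.</p><ul><li>one</li><li>two</li></ul>"));
    }
}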
I would just strip out everything in <> tags, and if you want a full stop at the end of every sentence, check for closing tags and place a full stop there.
If you have
<strong> test </strong>
(and other tags that change the look of the text), you could add conditions to not place a full stop there.
HTML parsers seem to be a reasonable starting point for this.
There are a number of them; for example, HTMLCleaner and NekoHTML seem to work fine.
They are good because they fix up the tags, allowing you to process them more consistently, even if you are just removing them.
But as it turns out, you probably want to get rid of script tags, metadata, etc., and in that case you are better off working with the well-formed XML these libraries produce from "wild" HTML.
There are many SO questions relating to this (like this one); you should search for "HTML parsing", though ;-)