Programmatically remove comments from Java File [duplicate] - java

I have a Java project and have used comments in many places across its source files. Now I need to remove all types of comments: single-line and multi-line.
Please suggest a way to automate this, using a tool, Eclipse, etc.
Currently I am removing all the comments manually.

You can remove all single- or multi-line block comments (but not line comments with //) by searching for the following regular expression in your project(s)/file(s) and replacing by $1:
^([^"\r\n]*?(?:(?<=')"[^"\r\n]*?|(?<!')"[^"\r\n]*?"[^"\r\n]*?)*?)(?<!/)/\*[^\*]*(?:\*+[^/][^\*]*)*?\*+/
It's possible that you have to execute it more than once.
This regular expression avoids the following pitfalls:
Code between two comments /* Comment 1 */ foo(); /* Comment 2 */
Line comments starting with an asterisk: //***NOTE***
Comment delimiters inside string literals: stringbuilder.append("/*"); this also works if there is a double quote inside single quotes before the comment
To remove all single-line comments, search for the following regular expression in your project(s)/file(s) and replace by $1:
^([^"\r\n]*?(?:(?<=')"[^"\r\n]*?|(?<!')"[^"\r\n]*?"[^"\r\n]*?)*?)\s*//[^\r\n]*
This regular expression also avoids comment delimiters inside double quotes, but does NOT check for multi-line comments, so /* // */ will be incorrectly removed.
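For anyone who would rather run the block-comment expression from code than from an editor's search dialog, here is a minimal sketch. It assumes the expression is compiled with Pattern.MULTILINE (so that ^ matches at every line start) and repeats the replacement until nothing changes, which is what "execute it more than once" amounts to. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class BlockCommentRegex {
    // The block-comment expression quoted above, escaped for a Java string literal.
    private static final Pattern BLOCK = Pattern.compile(
        "^([^\"\\r\\n]*?(?:(?<=')\"[^\"\\r\\n]*?|(?<!')\"[^\"\\r\\n]*?\"[^\"\\r\\n]*?)*?)"
            + "(?<!/)/\\*[^\\*]*(?:\\*+[^/][^\\*]*)*?\\*+/",
        Pattern.MULTILINE);

    // Replaces each match by group 1 (the code before the comment) and
    // repeats until a pass leaves the text unchanged.
    static String strip(String source) {
        String previous;
        String current = source;
        do {
            previous = current;
            current = BLOCK.matcher(previous).replaceAll("$1");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        System.out.println(strip("/* Comment 1 */ foo(); /* Comment 2 */"));
    }
}
```

Each pass can remove at most one comment per line, because the expression is anchored at the line start. That is why a line with two comments needs a second pass.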

I had to write something to do this a few weeks ago. This should handle all comments, nested or otherwise. It is long, but I haven't seen a regex version that handles nested comments properly. I didn't have to preserve Javadoc, but I presume you do, so I added some code that I believe should handle that. I also added code to support the \r\n and \r line separators. The new code is marked as such.
public static String removeComments(String code) {
StringBuilder newCode = new StringBuilder();
try (StringReader sr = new StringReader(code)) {
boolean inBlockComment = false;
boolean inLineComment = false;
boolean out = true;
int prev = sr.read();
int cur;
for(cur = sr.read(); cur != -1; cur = sr.read()) {
if(inBlockComment) {
if (prev == '*' && cur == '/') {
inBlockComment = false;
out = false;
}
} else if (inLineComment) {
if (cur == '\r') { // start untested block
sr.mark(1);
int next = sr.read();
if (next != '\n') {
sr.reset();
}
inLineComment = false;
out = false; // end untested block
} else if (cur == '\n') {
inLineComment = false;
out = false;
}
} else {
if (prev == '/' && cur == '*') {
sr.mark(1); // start untested block
int next = sr.read();
if (next != '*') {
inBlockComment = true; // tested line (without rest of block)
}
sr.reset(); // end untested block
} else if (prev == '/' && cur == '/') {
inLineComment = true;
} else if (out){
newCode.append((char)prev);
} else {
out = true;
}
}
prev = cur;
}
if (prev != -1 && out && !inLineComment) {
newCode.append((char)prev);
}
} catch (IOException e) {
e.printStackTrace();
}
return newCode.toString();
}

You can try it with the java-comment-preprocessor:
java -jar ./jcp-6.0.0.jar --i:/sourceFolder --o:/resultFolder -ef:none --r

I made an open source library and uploaded it to GitHub. It is called CommentRemover, and it can remove single-line and multi-line Java comments.
It can either remove TODOs or leave them in place.
It also supports JavaScript, HTML, CSS, Properties, JSP and XML comments.
Here is a short code snippet showing how to use it (there are two types of usage):
First way: InternalPath
public static void main(String[] args) throws CommentRemoverException {
// root dir is: /Users/user/Projects/MyProject
// example for startInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc.. goes like that
.removeTodos(false) // Do Not Touch Todos (leave them alone)
.removeSingleLines(true) // Remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startInternalPath("src.main.app") // Starts from {rootDir}/src/main/app , leave it empty string when you want to start from root dir
.setExcludePackages(new String[]{"src.main.java.app.pattern"}) // Refers to {rootDir}/src/main/java/app/pattern and skips this directory
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
Second way ExternalPath
public static void main(String[] args) throws CommentRemoverException {
// example for externalInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc..
.removeTodos(true) // Remove todos
.removeSingleLines(false) // Do not remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startExternalPath("/Users/user/Projects/MyOtherProject")// Give it full path for external directories
.setExcludePackages(new String[]{"src.main.java.model"}) // Refers to /Users/user/Projects/MyOtherProject/src/main/java/model and skips this directory.
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}

This is an old post but this may help someone who enjoys working on command line like myself:
The perl one-liner below will remove all comments:
perl -0pe 's|//.*?\n|\n|g; s#/\*(.|\n)*?\*/##g;' test.java
Example:
cat test.java
this is a test
/**
*This should be removed
*This should be removed
*/
this should not be removed
//this should be removed
this should not be removed
this should not be removed //this should be removed
Output:
perl -0pe 's#/\*\*(.|\n)*?\*/##g; s|//.*?\n|\n|g' test.java
this is a test
this should not be removed
this should not be removed
this should not be removed
If you want to get rid of multiple blank lines as well:
perl -0pe 's|//.*?\n|\n|g; s#/\*(.|\n)*?\*/##g; s/\n\n+/\n\n/g' test.java
this is a test
this should not be removed
this should not be removed
this should not be removed
EDIT: Corrected regex

Dealing with source code is hard unless you know more about how the comments were written.
In the general case, you could have // or /* inside string constants, so you really need to parse the file at a syntactic level, not only a lexical one. IMHO the only bulletproof solution would be to start, for example, with the Java parser from OpenJDK.
If you know that your comments are never deeply mixed with the code (in my example, comments MUST occupy full lines), a Python script could help:
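multiple = False
for line in text:
    stripped = line.strip()
    if multiple:
        # inside a block comment: drop the line, stop at the closing */
        if stripped.endswith('*/'):
            multiple = False
        continue
    elif stripped.startswith('/*'):
        # a block comment closed on the same line must not swallow what follows
        multiple = not stripped.endswith('*/')
    elif stripped.startswith('//'):
        pass
    else:
        print(line)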

If you are using Eclipse IDE, you could make regex do the work for you.
Open the search window (Ctrl+F), and check 'Regular Expression'.
Provide the expression as
/\*\*(?s:(?!\*/).)*\*/
Prasanth Bhate has explained it in Tool to remove JavaDoc comments?
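The same expression can also be applied outside Eclipse. This is only a sketch, under the assumption that the whole file has already been read into a String. The class name is hypothetical. Note that the expression only targets Javadoc-style comments that start with /**, so ordinary /* ... */ comments are left in place.

```java
final class JavadocStripper {
    // Same expression as above, escaped for a Java string literal.
    // (?s:...) lets '.' match line breaks inside that group only.
    private static final String JAVADOC = "/\\*\\*(?s:(?!\\*/).)*\\*/";

    static String strip(String source) {
        return source.replaceAll(JAVADOC, "");
    }

    public static void main(String[] args) {
        String code = "/** doc\n * more\n */\nint x; /* keep */";
        System.out.println(strip(code));
    }
}
```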

public class TestForStrings {
/**
* The main method.
*
* @param args
* the arguments
* @throws Exception
* the exception
*/
public static void main(String args[]) throws Exception {
String[] imports = new String[100];
String fileName = "Menu.java";
// This will reference one API at a time
String line = null;
try {
FileReader fileReader = new FileReader(fileName);
// Always wrap FileReader in BufferedReader.
BufferedReader bufferedReader = new BufferedReader(fileReader);
int startingOffset = 0;
// This will reference one API at a time
List<String> lines = Files.readAllLines(Paths.get(fileName),
Charset.forName("ISO-8859-1"));
// remove single line comments
for (int count = 0; count < lines.size(); count++) {
String tempString = lines.get(count);
lines.set(count, removeSingleLineComment(tempString));
}
// remove multiple lines comment
for (int count = 0; count < lines.size(); count++) {
String tempString = lines.get(count);
removeMultipleLineComment(tempString, count, lines);
}
for (int count = 0; count < lines.size(); count++) {
System.out.println(lines.get(count));
}
} catch (FileNotFoundException ex) {
System.out.println("Unable to open file '" + fileName + "'");
} catch (IOException ex) {
System.out.println("Error reading file '" + fileName + "'");
} catch (Exception e) {
}
}
/**
* Removes the multiple line comment.
*
* @param tempString
* the temp string
* @param count
* the count
* @param lines
* the lines
* @return the string
*/
private static List<String> removeMultipleLineComment(String tempString,
int count, List<String> lines) {
try {
if (tempString.contains("/**") || (tempString.contains("/*"))) {
int StartIndex = count;
while (!(lines.get(count).contains("*/") || lines.get(count)
.contains("**/"))) {
count++;
}
int endIndex = ++count;
if (StartIndex != endIndex) {
while (StartIndex != endIndex) {
lines.set(StartIndex, "");
StartIndex++;
}
}
}
} catch (Exception e) {
// Do Nothing
}
return lines;
}
/**
* Remove single line comments .
*
* @param line
* the line
* @return the string
* @throws Exception
* the exception
*/
private static String removeSingleLineComment(String line) throws Exception {
try {
if (line.contains(("//"))) {
int startIndex = line.indexOf("//");
int endIndex = line.length();
String tempoString = line.substring(startIndex, endIndex);
line = line.replace(tempoString, "");
}
if ((line.contains("/*") || line.contains("/**"))
&& (line.contains("**/") || line.contains("*/"))) {
int startIndex = line.indexOf("/*"); // "/*" also finds "/**"; searching for "/**" returned -1 for plain block comments
int endIndex = line.length();
String tempoString = line.substring(startIndex, endIndex);
line = line.replace(tempoString, "");
}
} catch (Exception e) {
// Do Nothing
}
return line;
}
}

This is what I came up with yesterday.
This is actually homework I got from school, so if anybody reads this and finds a bug before I turn it in, please leave a comment =)
P.S. 'FilterState' is an enum class.
public static String deleteComments(String javaCode) {
FilterState state = FilterState.IN_CODE;
StringBuilder strB = new StringBuilder();
char prevC=' ';
for(int i = 0; i<javaCode.length(); i++){
char c = javaCode.charAt(i);
switch(state){
case IN_CODE:
if(c=='/')
state = FilterState.CAN_BE_COMMENT_START;
else {
if (c == '"')
state = FilterState.INSIDE_STRING;
strB.append(c);
}
break;
case CAN_BE_COMMENT_START:
if(c=='*'){
state = FilterState.IN_COMMENT_BLOCK;
}
else if(c=='/'){
state = FilterState.ON_COMMENT_LINE;
}
else {
state = FilterState.IN_CODE;
strB.append(prevC).append(c); // prevC+c would add the two chars as ints and append a number
}
break;
case ON_COMMENT_LINE:
if(c=='\n' || c=='\r') {
state = FilterState.IN_CODE;
strB.append(c);
}
break;
case IN_COMMENT_BLOCK:
if(c=='*')
state=FilterState.CAN_BE_COMMENT_END;
break;
case CAN_BE_COMMENT_END:
if(c=='/')
state = FilterState.IN_CODE;
else if(c!='*')
state = FilterState.IN_COMMENT_BLOCK;
break;
case INSIDE_STRING:
if(c == '"' && prevC!='\\')
state = FilterState.IN_CODE;
strB.append(c);
break;
default:
System.out.println("unknown case");
return null;
}
prevC = c;
}
return strB.toString();
}
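The post mentions FilterState without showing it. Judging only from the constants that deleteComments refers to, it could be declared roughly as follows. The real declaration is not part of the post, so treat this as an assumption.

```java
// Hypothetical declaration: the six states used by the switch in deleteComments.
enum FilterState {
    IN_CODE,              // ordinary source text
    CAN_BE_COMMENT_START, // just saw '/', the next char decides
    ON_COMMENT_LINE,      // inside a // comment
    IN_COMMENT_BLOCK,     // inside a /* ... */ comment
    CAN_BE_COMMENT_END,   // inside a block comment, just saw '*'
    INSIDE_STRING         // inside a "..." literal
}
```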

private static int find(String s, String t, int start) {
int ret = s.indexOf(t, start);
return ret < 0 ? Integer.MAX_VALUE : ret;
}
private static int findSkipEsc(String s, String t, int start) {
while(true) {
int ret = find(s, t, start);
if( ret == Integer.MAX_VALUE) return -1;
int esc = find(s, "\\", start);
if( esc > ret) return ret;
start += 2;
}
}
private static String removeLineCommnt(String s) {
int i, start = 0;
while (0 <= (i = find(s, "//", start))) { //Speed it up
int j = find(s, "'", start);
int k = find(s, "\"", start);
int first = min(i, min(j, k));
if (first == Integer.MAX_VALUE) return s;
if (i == first) return s.substring(0, i);
//skipp quoted string
start = first+1;
if (k == first) { // " asdas\"dasd "
start = findSkipEsc(s,"\"",start);
if (start < 0) return s;
start++;
continue;
}
//if j == first ' asda\'sasd ' --- not in JSON
start = findSkipEsc(s,"'\"'",start);
if (start < 0) return s;
start++;
}
return s;
}
static String removeLineCommnts(String s) {
if (!s.contains("//")) return s; //Speed it up
return Arrays.stream(s.split("[\\n\\r]+")).
map(Common::removeLineCommnt).
collect(Collectors.joining("\n"));
}

Related

Find Comments with StringTokenizer

I used the following code to count the number of comments in a code:
StringTokenizer stringTokenizer = new StringTokenizer(str);
int x = 0;
while (stringTokenizer.hasMoreTokens()) {
if (exists == false && stringTokenizer.nextToken().contains("/*")) {
exists = true;
} else if (exists == true && stringTokenizer.nextToken().contains("*/")) {
x++;
exists = false;
}
}
System.out.println(x);
It works if comments have spaces:
e.g.: "/* fgdfgfgf */ /* fgdfgfgf */ /* fgdfgfgf */".
But it does not work for comments without spaces:
e.g.: "/*fgdfgfgf *//* fgdfgfgf*//* fgdfgfgf */".
Using the StringUtils class in Commons Lang, you can achieve this very easily:
String str = "Your String";
int x = 0;
if (StringUtils.countMatches(str, "/*") != 0) {
// count the closing delimiters so that only terminated comments are counted
x = StringUtils.countMatches(str, "*/");
}
System.out.println(x);
new StringTokenizer(str, "\n") splits str into lines instead of using the default delimiter set " \t\n\r\f" (space, tab, newline, carriage return and form feed).
StringTokenizer stringTokenizer = new StringTokenizer(str,"\n");
This specifies newline as the only delimiter to use for Tokenizing
Using your current approach:
String line;
while(stringTokenizer.hasMoreTokens()){
line=stringTokenizer.nextToken();
if(!exists && line.contains("/*")){
exists = true;
}
if(exists && line.contains("*/")){
x++;
exists = false;
}
}
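Both tokenizer variants above depend on where whitespace or line breaks happen to fall. A sketch that does not depend on them, using only String.indexOf, could look like this. The class and method names are made up, and it assumes that comment delimiters never appear inside string literals.

```java
final class CommentCounter {
    // Counts terminated /* ... */ comments by pairing each opening
    // delimiter with the next closing one, regardless of whitespace.
    static int countBlockComments(String src) {
        int count = 0;
        int pos = 0;
        while (true) {
            int open = src.indexOf("/*", pos);
            if (open < 0) {
                break;
            }
            int close = src.indexOf("*/", open + 2);
            if (close < 0) {
                break; // unterminated comment: not counted
            }
            count++;
            pos = close + 2;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBlockComments("/*fgdfgfgf *//* fgdfgfgf*//* fgdfgfgf */"));
    }
}
```

Searching for the closing delimiter from open + 2 means that the three characters /*/ are not mistaken for a complete comment.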
For multiple comments I tried to use /\\* & \\*/ as patterns in split() and got length for their occurrence in the string, but unfortunately length were not exact due to uneven splitting.
Multiple/Single Comments can be: (IMO)
COMMENT=/* .* */
A = COMMENT;
B = CODE;
C = AB/BA/ABA/BAB/AAB/BAA/A;
This reminds me of flip-flops in Ruby/Perl/Awk et al. There is no need to use a StringTokenizer. You just need to keep states to count the number of lines with comments.
1. You are inside a comment block. You start printing or collecting all the characters. As soon as you encounter a */ in its entirety you toggle the comment block switch and switch to state 2.
2. You reject everything until you encounter a /* and are back to state 1.
Something like this
public static int countLines(Reader reader) throws IOException {
int commentLines = 0;
boolean inComments = false;
StringBuilder line = new StringBuilder();
for (int ch = -1, prev = -1; ((ch = reader.read())) != -1; prev = ch) {
System.out.println((char)ch);
if (inComments) {
if (prev == '*' && ch == '/') { //flip state
inComments = false;
}
if (ch != '\n' && ch != '\r') {
line.append((char)ch);
}
if (!inComments || ch == '\n' || ch == '\r') {
String actualLine = line.toString().trim();
//ignore lines which only have '*/' in them
commentLines += actualLine.length() > 0 && !"*/".equals(actualLine) ? 1 : 0;
line = new StringBuilder();
}
} else {
if (prev == '/' && ch == '*') { //flip state
inComments = true;
}
}
}
return commentLines;
}
public static void main(String[] args) throws FileNotFoundException, IOException {
System.out.println(countLines(new FileReader(new File("/tmp/b"))));
}
The above program ignores empty comment lines and lines with only /* or */ in them. We also need to ignore nested comments, which a string tokenizer may fail to do.
Example file /tmp/b
#include <stdio.h>
int main()
{
/* /******* wrong! The variable declaration must appear first */
printf( "Declare x next" );
int x;
return 0;
}
returns 1.

How to convert arabic char into hexstring in java

The following code prints ????? as output when str contains an Arabic string:
String str="مرحبا",str2="";
for (int i = 0; i < str.length(); ++i) {
str2 += displayChar(str.charAt(reorder[i]));
System.out.print(reorder[i]);
}
System.out.println(str2); // output is : ?????
and :
String displayChar(char c) {
if (c < '\u0010') {
return "0x0" + Integer.toHexString(c);
} else if (c < '\u0020' || c >= '\u007f') {
return "0x" + Integer.toHexString(c);
} else {
return c+"";
}
}
Here, reorder is an integer array that only carries the new index (order) of each character in the given str.
Here is the complete code; I hope it helps you understand the problem:
/*
* (C) Copyright IBM Corp. 1999, All Rights Reserved
*
* version 1.0
*/
import java.io.*;
/**
* A simple command-line interface to the BidiReference class.
* <p>
* This prompts the user for an ASCII string, runs the reference
* algorithm on the string, and displays the results to the terminal.
* An empty return to the prompt exits the program.
* <p>
* ASCII characters are preassigned various bidi direction types.
* These types can be displayed by the user for reference by
* typing <code>-display</code> at the prompt. More help can be
* obtained by typing <code>-help</code> at the prompt.
*/
public class BidiReferenceTest {
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
PrintWriter writer = new PrintWriter(new BufferedOutputStream(System.out));
BidiReferenceTestCharmap charmap = BidiReferenceTestCharmap.TEST_ARABIC;
byte baseDirection = -1;
/**
* Run the interactive test.
*/
public static void main(String args[]) {
new BidiReferenceTest().run();
}
void run() {
//printHelp();
while (true) {
writer.print("> ");
writer.flush();
String input;
try {
input = reader.readLine();
}
catch (Exception e) {
writer.println(e);
continue;
}
if (input.length() == 0) {
writer.println("Bye!");
writer.flush();
return;
}
if (input.charAt(0) == '-') { // command
int limit = input.indexOf(' ');
if (limit == -1) {
limit = input.length();
}
String cmd = input.substring(0, limit);
if (cmd.equals("-display")) {
charmap.dumpInfo(writer);
} else if (cmd.equals("-english")) {
charmap = BidiReferenceTestCharmap.TEST_ENGLISH;
charmap.dumpInfo(writer);
} else if (cmd.equals("-hebrew")) {
charmap = BidiReferenceTestCharmap.TEST_HEBREW;
charmap.dumpInfo(writer);
} else if (cmd.equals("-arabic")) {
charmap = BidiReferenceTestCharmap.TEST_ARABIC;
charmap.dumpInfo(writer);
} else if (cmd.equals("-mixed")) {
charmap = BidiReferenceTestCharmap.TEST_MIXED;
charmap.dumpInfo(writer);
} else if (cmd.equals("-baseLTR")) {
baseDirection = 0;
} else if (cmd.equals("-baseRTL")) {
baseDirection = 1;
} else if (cmd.equals("-baseDefault")) {
baseDirection = -1;
} else {
}
} else {
String ss= runSample(input);
System.out.println(ss);
Character.UnicodeBlock block = Character.UnicodeBlock.of(Character.codePointAt(ss, 0));
}
}
}
String runSample(String str) {
String str2 = "";
try {
charmap = BidiReferenceTestCharmap.TEST_ARABIC;
byte[] codes = charmap.getCodes(str);
baseDirection = 1;
BidiReference bidi = new BidiReference(codes, baseDirection); // baseDirection = 1
int[] reorder = bidi.getReordering(new int[] { codes.length });
/*
writer.println("base level: " + bidi.getBaseLevel() + (baseDirection != -1 ? " (forced)" : ""));
// output original text
for (int i = 0; i < str.length(); ++i) {
displayChar(str.charAt(i));
}
writer.println();
*/
// output visually ordered text
for (int i = 0; i < str.length(); ++i) {
str2 += displayChar(str.charAt(reorder[i]));
System.out.print(reorder[i]);
}
return str2;
}
catch (Exception e) {
return "";
}
}
String displayChar(char c) {
if (c < '\u0010') {
return "0x0" + Integer.toHexString(c);
} else if (c < '\u0020' || c >= '\u007f') {
return "0x" + Integer.toHexString(c);
} else {
return c+"";
}
}
}
If I were to guess I'd say you run under Windows with the default console settings (i.e. Raster fonts) and you run the Java program from the console and not within Eclipse.
If that is the case, then just change the console settings to use a TrueType font (Lucida Console or Consolas) and you should see boxes instead of question marks. Those won't look right either, but at least it's the actual text instead of question marks.
Side note: Question marks are a common occurrence if something does support Unicode but converts it into another encoding somewhere, e.g. Latin 1.
One problem is that your terminal probably does not support Unicode characters correctly (this might not be the only problem).
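If the goal is only the hex string from the title, one way to take the console encoding out of the picture is to print nothing but ASCII. The following is a minimal sketch with made-up names. It formats every UTF-16 code unit as hex and leaves the bidi reordering out entirely.

```java
final class HexDump {
    // Renders each UTF-16 code unit of the text as "0x<hex>", separated by spaces.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append("0x").append(Integer.toHexString(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes for the Arabic word from the question, so the
        // source file itself stays ASCII-only.
        System.out.println(toHex("\u0645\u0631\u062d\u0628\u0627"));
        // prints: 0x645 0x631 0x62d 0x628 0x627
    }
}
```

Because the output contains only ASCII characters, it looks the same whatever code page or font the terminal uses.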

Java: Removing comments from string

I'd like to write a function that takes a string and, in case it has inline comments, removes them. I know it sounds pretty simple, but I want to make sure I'm doing this right. For example:
private String filterString(String code) {
// lets say code = "some code //comment inside"
// return the string "some code" (without the comment)
}
I thought about 2 ways (feel free to advise otherwise):
Iterating over the string, finding the double inline brackets and using the substring method.
The regex way (I'm not so sure about it).
Can you tell me what's the best way and show me how it should be done? (Please don't suggest overly advanced solutions.)
Edit: can this be done somehow with a Scanner object? (I'm using this object anyway.)
If you want a more efficient regex to really match all types of comments, use this one :
replaceAll("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)","");
source : http://ostermiller.org/findcomment.html
EDIT:
Another solution, if you're not sure about using regex, is to design a small automaton as follows:
public static String removeComments(String code){
final int outsideComment=0;
final int insideLineComment=1;
final int insideblockComment=2;
final int insideblockComment_noNewLineYet=3; // we want to have at least one new line in the result if the block is not inline.
int currentState=outsideComment;
String endResult="";
Scanner s= new Scanner(code);
s.useDelimiter("");
while(s.hasNext()){
String c=s.next();
switch(currentState){
case outsideComment:
if(c.equals("/") && s.hasNext()){
String c2=s.next();
if(c2.equals("/"))
currentState=insideLineComment;
else if(c2.equals("*")){
currentState=insideblockComment_noNewLineYet;
}
else
endResult+=c+c2;
}
else
endResult+=c;
break;
case insideLineComment:
if(c.equals("\n")){
currentState=outsideComment;
endResult+="\n";
}
break;
case insideblockComment_noNewLineYet:
if(c.equals("\n")){
endResult+="\n";
currentState=insideblockComment;
}
case insideblockComment:
while(c.equals("*") && s.hasNext()){
String c2=s.next();
if(c2.equals("/")){
currentState=outsideComment;
break;
}
c=c2; // keep scanning from the new character so that "* x /" does not close the comment
}
}
}
s.close();
return endResult;
}
The best way to do this is to use regular expressions.
First find the /* */ comments, then remove all // comments. For example:
private String filterString(String code) {
// (?s) lets . match line breaks; the reluctant .*? stops at the first */
String partialFiltered = code.replaceAll("(?s)/\\*.*?\\*/", "");
String fullFiltered = partialFiltered.replaceAll("//.*(?=\\n)", "");
return fullFiltered;
}
Just use the replaceAll method from the String class, combined with a simple regular expression. Here's how to do it:
import java.util.*;
import java.lang.*;
class Main
{
public static void main (String[] args) throws java.lang.Exception
{
String s = "private String filterString(String code) {\n" +
" // lets say code = \"some code //comment inside\"\n" +
" // return the string \"some code\" (without the comment)\n}";
s = s.replaceAll("//.*?\n","\n");
System.out.println("s=" + s);
}
}
The key is the line:
s = s.replaceAll("//.*?\n","\n");
The regex //.*?\n matches strings starting with // until the end of the line.
And if you want to see this code in action, go here: http://www.ideone.com/e26Ve
Hope it helps!
To find the substring before a constant substring using a regular expression replacement is a bit much.
You can do it using indexOf() to check for the position of the comment start and substring() to get the first part, something like:
String code = "some code // comment";
int offset = code.indexOf("//");
if (-1 != offset) {
code = code.substring(0, offset);
}
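The indexOf() version cuts at the first // even when it sits inside a string literal such as "http://example". A slightly longer sketch that skips over string literals while still working on a single line might look like this. The names are hypothetical, and it deliberately ignores character literals and block comments.

```java
final class LineCommentStripper {
    // Returns the line without its trailing // comment, treating any //
    // inside a double-quoted string literal as ordinary text.
    static String stripLineComment(String line) {
        boolean inString = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character, e.g. \" or \\
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
                return line.substring(0, i);
            }
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(stripLineComment("String u = \"http://x\"; // comment"));
    }
}
```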
@Christian Hujer has been correctly pointing out that many or all of the solutions posted fail if the comments occur within a string.
@Loïc Gammaitoni suggests that his automata approach could easily be extended to handle that case. Here is that extension.
enum State { outsideComment, insideLineComment, insideblockComment, insideblockComment_noNewLineYet, insideString };
public static String removeComments(String code) {
State state = State.outsideComment;
StringBuilder result = new StringBuilder();
Scanner s = new Scanner(code);
s.useDelimiter("");
while (s.hasNext()) {
String c = s.next();
switch (state) {
case outsideComment:
if (c.equals("/") && s.hasNext()) {
String c2 = s.next();
if (c2.equals("/"))
state = State.insideLineComment;
else if (c2.equals("*")) {
state = State.insideblockComment_noNewLineYet;
} else {
result.append(c).append(c2);
}
} else {
result.append(c);
if (c.equals("\"")) {
state = State.insideString;
}
}
break;
case insideString:
result.append(c);
if (c.equals("\"")) {
state = State.outsideComment;
} else if (c.equals("\\") && s.hasNext()) {
result.append(s.next());
}
break;
case insideLineComment:
if (c.equals("\n")) {
state = State.outsideComment;
result.append("\n");
}
break;
case insideblockComment_noNewLineYet:
if (c.equals("\n")) {
result.append("\n");
state = State.insideblockComment;
}
case insideblockComment:
while (c.equals("*") && s.hasNext()) {
String c2 = s.next();
if (c2.equals("/")) {
state = State.outsideComment;
break;
}
c = c2; // keep scanning from the new character so that "* x /" does not close the comment
}
}
}
s.close();
return result.toString();
}
I made an open source library (on GitHub) for this purpose. It is called CommentRemover, and it can remove single-line and multi-line Java comments.
It can either remove TODOs or leave them in place.
It also supports JavaScript, HTML, CSS, Properties, JSP and XML comments.
Here is a short code snippet showing how to use it (there are two types of usage):
First way: InternalPath
public static void main(String[] args) throws CommentRemoverException {
// root dir is: /Users/user/Projects/MyProject
// example for startInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc.. goes like that
.removeTodos(false) // Do Not Touch Todos (leave them alone)
.removeSingleLines(true) // Remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startInternalPath("src.main.app") // Starts from {rootDir}/src/main/app , leave it empty string when you want to start from root dir
.setExcludePackages(new String[]{"src.main.java.app.pattern"}) // Refers to {rootDir}/src/main/java/app/pattern and skips this directory
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
Second way ExternalPath
public static void main(String[] args) throws CommentRemoverException {
// example for externalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc..
.removeTodos(true) // Remove todos
.removeSingleLines(false) // Do not remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startExternalPath("/Users/user/Projects/MyOtherProject")// Give it full path for external directories
.setExcludePackages(new String[]{"src.main.java.model"}) // Refers to /Users/user/Projects/MyOtherProject/src/main/java/model and skips this directory.
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
For Scanner, use a delimiter.
Delimiter example:
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;
public class MainClass {
public static void main(String args[]) throws IOException {
FileWriter fout = new FileWriter("test.txt");
fout.write("2, 3.4, 5,6, 7.4, 9.1, 10.5, done");
fout.close();
FileReader fin = new FileReader("test.txt"); // same name as the file written above; "Test.txt" fails on case-sensitive file systems
Scanner src = new Scanner(fin);
// Set delimiters to space and comma.
// ", *" tells Scanner to match a comma and zero or more spaces as
// delimiters.
src.useDelimiter(", *");
// Read and sum numbers.
while (src.hasNext()) {
if (src.hasNextDouble()) {
System.out.println(src.nextDouble());
} else {
break;
}
}
fin.close();
}
}
Use a tokenizer for a normal string
tokenizer:
// start with a String of space-separated words
String tags = "pizza pepperoni food cheese";
// convert each tag to a token
StringTokenizer st = new StringTokenizer(tags," ");
while ( st.hasMoreTokens() )
{
String token = (String)st.nextToken();
System.out.println(token);
}
http://www.devdaily.com/blog/post/java/java-faq-stringtokenizer-example
It would be better if the code handled single-line and multi-line comments separately. Any suggestions?
public class RemovingCommentsFromFile {
public static void main(String[] args) throws IOException {
BufferedReader fin = new BufferedReader(new FileReader("/home/pathtofilewithcomments/File"));
BufferedWriter fout = new BufferedWriter(new FileWriter("/home/result/File1"));
boolean multilinecomment = false;
boolean singlelinecomment = false;
int len,j;
String s = null;
while ((s = fin.readLine()) != null) {
StringBuilder obj = new StringBuilder(s);
len = obj.length();
for (int i = 0; i < len; i++) {
for (j = i; j < len; j++) {
if (obj.charAt(j) == '/' && obj.charAt(j + 1) == '*') {
j += 2;
multilinecomment = true;
continue;
} else if (obj.charAt(j) == '/' && obj.charAt(j + 1) == '/') {
singlelinecomment = true;
j = len;
break;
} else if (obj.charAt(j) == '*' && obj.charAt(j + 1) == '/') {
j += 2;
multilinecomment = false;
break;
} else if (multilinecomment == true)
continue;
else
break;
}
if (j == len)
{
singlelinecomment=false;
break;
}
else
i = j;
System.out.print((char)obj.charAt(i));
fout.write((char)obj.charAt(i));
}
System.out.println();
fout.write((char)10);
}
fin.close();
fout.close();
}
}
Easy solution that doesn't remove extra parts of code (like those above)
// works for any reader, you can also iterate over list of strings instead
String str="";
String s;
while ((s = reader.readLine()) != null)
{
s=s.replaceAll("//.*","\n");
str+=s;
}
str=str.replaceAll("/\\*.*\\*/"," ");

How to truncate a HTML fragment to a given length(for preview) in Java? [duplicate]

Is there any utility (or sample source code) that truncates HTML (for preview) in Java? I want to do the truncation on the server and not on the client.
I'm using HTMLUnit to parse HTML.
UPDATE:
I want to be able to preview the HTML, so the truncator would maintain the HTML structure while stripping out the elements after the desired output length.
I've written another java version of truncateHTML. This function truncates a string up to a number of characters while preserving whole words and HTML tags.
public static String truncateHTML(String text, int length, String suffix) {
// if the plain text is shorter than the maximum length, return the whole text
if (text.replaceAll("<.*?>", "").length() <= length) {
return text;
}
String result = "";
boolean trimmed = false;
if (suffix == null) {
suffix = "...";
}
/*
* This pattern creates tokens, where each line starts with the tag.
* For example, "One, <b>Two</b>, Three" produces the following:
* One,
* <b>Two
* </b>, Three
*/
Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");
/*
* Checks for an empty tag, for example img, br, etc.
*/
Pattern emptyTagPattern = Pattern.compile("^<\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param).*>$");
/*
* Modified the pattern to also include H1-H6 tags
* Checks for closing tags, allowing leading and ending space inside the brackets
*/
Pattern closingTagPattern = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");
/*
* Modified the pattern to also include H1-H6 tags
* Checks for opening tags, allowing leading and ending space inside the brackets
*/
Pattern openingTagPattern = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");
/*
* Find &gt; ...
*/
Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};)");
// splits all html-tags to scanable lines
Matcher tagMatcher = tagPattern.matcher(text);
int numTags = tagMatcher.groupCount();
int totalLength = suffix.length();
List<String> openTags = new ArrayList<String>();
boolean proposingChop = false;
while (tagMatcher.find()) {
String tagText = tagMatcher.group(1);
String plainText = tagMatcher.group(2);
if (proposingChop &&
tagText != null && tagText.length() != 0 &&
plainText != null && plainText.length() != 0) {
trimmed = true;
break;
}
// if there is any html-tag in this line, handle it and add it (uncounted) to the output
if (tagText != null && tagText.length() > 0) {
boolean foundMatch = false;
// if it's an "empty element" with or without xhtml-conform closing slash
Matcher matcher = emptyTagPattern.matcher(tagText);
if (matcher.find()) {
foundMatch = true;
// do nothing
}
// closing tag?
if (!foundMatch) {
matcher = closingTagPattern.matcher(tagText);
if (matcher.find()) {
foundMatch = true;
// delete tag from openTags list
String tagName = matcher.group(1);
openTags.remove(tagName.toLowerCase());
}
}
// opening tag?
if (!foundMatch) {
matcher = openingTagPattern.matcher(tagText);
if (matcher.find()) {
// add tag to the beginning of openTags list
String tagName = matcher.group(1);
openTags.add(0, tagName.toLowerCase());
}
}
// add html-tag to result
result += tagText;
}
// calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
int contentLength = plainText.replaceAll("&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};", " ").length();
if (totalLength + contentLength > length) {
// the number of characters which are left
int numCharsRemaining = length - totalLength;
int entitiesLength = 0;
Matcher entityMatcher = entityPattern.matcher(plainText);
while (entityMatcher.find()) {
String entity = entityMatcher.group(1);
if (numCharsRemaining > 0) {
numCharsRemaining--;
entitiesLength += entity.length();
} else {
// no more characters left
break;
}
}
// keep us from chopping words in half
int proposedChopPosition = numCharsRemaining + entitiesLength;
int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition-1);
if (endOfWordPosition == -1) {
endOfWordPosition = plainText.length();
}
int endOfWordOffset = endOfWordPosition - proposedChopPosition;
if (endOfWordOffset > 6) { // chop the word if it's extra long
endOfWordOffset = 0;
}
proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
if (plainText.length() >= proposedChopPosition) {
result += plainText.substring(0, proposedChopPosition);
proposingChop = true;
if (proposedChopPosition < plainText.length()) {
trimmed = true;
break; // maximum length is reached, so get off the loop
}
} else {
result += plainText;
}
} else {
result += plainText;
totalLength += contentLength;
}
// if the maximum length is reached, get off the loop
if(totalLength >= length) {
trimmed = true;
break;
}
}
for (String openTag : openTags) {
result += "</" + openTag + ">";
}
if (trimmed) {
result += suffix;
}
return result;
}
I think you're going to need to write your own XML parser to accomplish this. Pull out the body node, add nodes until binary length < some fixed size, and then rebuild the document. If HTMLUnit doesn't create semantic XHTML, I'd recommend tagsoup.
If you need an XML parser/handler, I'd recommend XOM.
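The parser-based idea can be sketched with the DOM classes that ship with the JDK. This is only a minimal sketch under assumptions: the fragment must already be well-formed XML with a single root element (real-world HTML would first need something like tagsoup), it counts text characters rather than binary length, and the names DomTruncator and truncate are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps the first maxChars text characters and drops the rest.
final class DomTruncator {
    private int remaining; // text characters that may still be kept

    private DomTruncator(int maxChars) {
        this.remaining = maxChars;
    }

    static String truncate(String xhtml, int maxChars) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            new DomTruncator(maxChars).walk(doc.getDocumentElement());
            // Serialize the pruned tree; the serializer closes all open tags for us.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or serialize the fragment", e);
        }
    }

    private void walk(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (remaining <= 0) {
                node.removeChild(child); // budget used up: drop everything that follows
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue();
                if (text.length() > remaining) {
                    child.setNodeValue(text.substring(0, remaining));
                }
                remaining -= Math.min(text.length(), remaining);
            } else {
                walk(child);
            }
            child = next;
        }
    }
}
```

Because the output is produced from the tree, the result stays well-formed without any bookkeeping of open tags. Word boundaries and a trailing suffix are not handled here.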
There is a PHP function that does it here: http://snippets.dzone.com/posts/show/7125
I've made a quick and dirty Java port of the initial version, but there are subsequent improved versions in the comments that could be worth considering (especially one that deals with whole words):
public static String truncateHtml(String s, int l) {
Pattern p = Pattern.compile("<[^>]+>([^<]*)");
int i = 0;
List<String> tags = new ArrayList<String>();
Matcher m = p.matcher(s);
while(m.find()) {
if (m.start(0) - i >= l) {
break;
}
String t = StringUtils.split(m.group(0), " \t\n\r\0\u000B>")[0].substring(1);
if (t.charAt(0) != '/') {
tags.add(t);
} else if (!tags.isEmpty() && tags.get(tags.size()-1).equals(t.substring(1))) {
tags.remove(tags.size()-1);
}
i += m.start(1) - m.start(0);
}
Collections.reverse(tags);
return s.substring(0, Math.min(s.length(), l+i))
+ ((tags.size() > 0) ? "</"+StringUtils.join(tags, "></")+">" : "")
+ ((s.length() > l) ? "\u2026" : "");
}
Note: You'll need Apache Commons Lang for StringUtils.split() and StringUtils.join().
I can offer you a Python script I wrote to do this: http://www.ellipsix.net/ext-tmp/summarize.txt. Unfortunately I don't have a Java version, but feel free to translate it yourself and modify it to suit your needs if you want. It's not very complicated, just something I hacked together for my website, but I've been using it for a little more than a year and it generally seems to work pretty well.
If you want something robust, an XML (or SGML) parser is almost certainly a better idea than what I did.
I found this blog: dencat: Truncating HTML in Java
It contains a Java port of the Python Django template function truncate_html_words:
public class SimpleHtmlTruncator {
public static String truncateHtmlWords(String text, int max_length) {
String input = text.trim();
if (max_length > input.length()) {
return input;
}
if (max_length < 0) {
return new String();
}
StringBuilder output = new StringBuilder();
/**
* Pattern pattern_opentag = Pattern.compile("(<[^/].*?[^/]>).*");
* Pattern pattern_closetag = Pattern.compile("(</.*?[^/]>).*"); Pattern
* pattern_selfclosetag = Pattern.compile("(<.*?/>).*");*
*/
String HTML_TAG_PATTERN = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>";
Pattern pattern_overall = Pattern.compile(HTML_TAG_PATTERN + "|" + "\\s*\\w*\\s*");
Pattern pattern_html = Pattern.compile("(" + HTML_TAG_PATTERN + ")" + ".*");
Pattern pattern_words = Pattern.compile("(\\s*\\w*\\s*).*");
int characters = 0;
Matcher all = pattern_overall.matcher(input);
while (all.find()) {
String matched = all.group();
Matcher html_matcher = pattern_html.matcher(matched);
Matcher word_matcher = pattern_words.matcher(matched);
if (html_matcher.matches()) {
output.append(html_matcher.group());
} else if (word_matcher.matches()) {
if (characters < max_length) {
String word = word_matcher.group();
if (characters + word.length() < max_length) {
output.append(word);
} else {
output.append(word.substring(0,
(max_length - characters) > word.length()
? word.length() : (max_length - characters)));
}
characters += word.length();
}
}
}
return output.toString();
}
public static void main(String[] args) {
String text = SimpleHtmlTruncator.truncateHtmlWords("<html><body><br/><p>abc</p><p>defghij</p><p>ghi</p></body></html>", 4);
System.out.println(text);
}
}

arrayListOutOfBoundsException

This is my class Debugger. Can anyone try to run it and see what's wrong? I've spent hours on it already. :(
public class Debugger {
private String codeToDebug = "";
public Debugger(String code) {
codeToDebug = code;
}
/**
* This method iterates over a CSS file and adds all the properties to an ArrayList
*/
public void searchDuplicates() {
boolean isInside = false;
ArrayList<String> methodStorage = new ArrayList();
int stored = 0;
String[] codeArray = codeToDebug.split("");
try {
int i = 0;
while(i<codeArray.length) {
if(codeArray[i].equals("}")) {
isInside = false;
}
if(isInside && !codeArray[i].equals(" ")) {
boolean methodFound = false;
String method = "";
int c = i;
while(!methodFound) {
method += codeArray[c];
if(codeArray[c+1].equals(":")) {
methodFound = true;
} else {
c++;
}
}
methodStorage.add(stored, method);
System.out.println(methodStorage.get(stored));
stored++;
boolean stillInside = true;
int skip = i;
while(stillInside) {
if(codeArray[skip].equals(";")) {
stillInside = false;
} else {
skip++;
}
}
i = skip;
}
if(codeArray[i].equals("{")) {
isInside = true;
}
i++;
}
} catch(ArrayIndexOutOfBoundsException ar) {
System.out.println("------- array out of bounds exception -------");
}
}
/**
* Takes in String and outputs the number of characters it contains
* @param input
* @return Number of characters
*/
public static int countString(String input) {
String[] words = input.split("");
int counter = -1;
for(int i = 0; i<words.length; i++){
counter++;
}
return counter;
}
public static void main(String[] args) {
Debugger h = new Debugger("body {margin:;\n}");
h.searchDuplicates();
}
}
Any place where an array element is read after its index has been changed, without a bounds check in between, is a candidate for an ArrayIndexOutOfBoundsException.
In the above code, there are at least two places where the index is changed without a subsequent bounds check:
The while loop checking the !methodFound condition
The while loop checking the stillInside condition
In both cases the index is incremented, but nothing checks it against the array length before the next element is read from the String[], so there is no guarantee that the index is still inside the bounds of the array.
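One way to make such scans safe is to move them into a small helper that checks the index before every access and reports "not found" instead of running off the end. The following is only a sketch; BoundedScan and indexOfToken are hypothetical names, not part of the original code.

```java
// Hypothetical helper illustrating a bounds-checked forward scan.
final class BoundedScan {
    /**
     * Returns the index of the first element equal to target at or after
     * position from, or -1 if the end of the array is reached first.
     */
    static int indexOfToken(String[] tokens, String target, int from) {
        // The loop condition is the bounds check: tokens[i] is only read while i is valid.
        for (int i = Math.max(from, 0); i < tokens.length; i++) {
            if (tokens[i].equals(target)) {
                return i;
            }
        }
        return -1;
    }
}
```

Both loops in the question could then be replaced by a call such as indexOfToken(codeArray, ":", i), followed by an explicit check for -1 to handle malformed input.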
I think this block of code may be causing your problem:
int c = i;
while(!methodFound) {
method += codeArray[c];
if(codeArray[c+1].equals(":")) {
methodFound = true;
} else {
c++;
}
}
int skip = i;
while(stillInside) {
if(codeArray[skip].equals(";")) {
stillInside = false;
} else {
skip++;
}
}
i = skip;
The reason is that if c reaches codeArray.length - 1 before a ":" is found, codeArray[c + 1] lies outside the array and throws an ArrayIndexOutOfBoundsException. The second loop has the same problem with skip when no ";" follows.
Try checking that your index is still inside the array before using it,
by adding:
while (!methodFound && c < codeArray.length) {
while (stillInside && skip < codeArray.length) {
if (i < codeArray.length && codeArray[i].equals("{")) {
so that your code looks like this:
import java.util.ArrayList;
import java.util.List;
public class Debugger {
private String codeToDebug = "";
public Debugger(String code) {
codeToDebug = code;
}
/**
* This method iterates over a CSS file and adds all the properties to an
* ArrayList
*/
public void searchDuplicates() {
boolean isInside = false;
List<String> methodStorage = new ArrayList<String>();
int stored = 0;
String[] codeArray = codeToDebug.split("");
try {
int i = 0;
while (i < codeArray.length) {
if (codeArray[i].equals("}")) {
isInside = false;
}
if (isInside && !codeArray[i].equals(" ")) {
boolean methodFound = false;
String method = "";
int c = i;
while (!methodFound && c < codeArray.length) {
// stop at the colon without appending it to the property name
if (codeArray[c].equals(":")) {
methodFound = true;
} else {
method += codeArray[c];
c++;
}
}
methodStorage.add(stored, method);
System.out.println(methodStorage.get(stored));
stored++;
boolean stillInside = true;
int skip = i;
while (stillInside && skip < codeArray.length) {
if (codeArray[skip].equals(";")) {
stillInside = false;
} else {
skip++;
}
}
i = skip;
}
if (i < codeArray.length && codeArray[i].equals("{")) {
isInside = true;
}
i++;
}
} catch (ArrayIndexOutOfBoundsException ar) {
System.out.println("------- array out of bounds exception -------");
ar.printStackTrace();
}
}
/**
* Takes in String and outputs the number of characters it contains
*
* @param input
* @return Number of characters
*/
public static int countString(String input) {
String[] words = input.split("");
int counter = -1;
for (int i = 0; i < words.length; i++) {
counter++;
}
return counter;
}
public static void main(String[] args) {
Debugger h = new Debugger("body {margin:prueba;\n}");
h.searchDuplicates();
}
}
Also, declaring variables with their implementation type is bad practice; that is why, in the code above, I changed ArrayList<String> methodStorage = new ArrayList() to List<String> methodStorage = new ArrayList<String>().
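To illustrate why the interface type is preferable, here is a small sketch (InterfaceTypeDemo and joinAll are made-up names): code written against List works with any implementation, so switching from one to another only changes the expression after new.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class, not part of the Debugger code.
final class InterfaceTypeDemo {
    // Depends only on the List interface, so any implementation can be passed in.
    static String joinAll(List<String> items) {
        StringBuilder sb = new StringBuilder();
        for (String item : items) {
            sb.append(item).append(';');
        }
        return sb.toString();
    }

    static String demo() {
        // Swapping the implementation only touches the right-hand side.
        List<String> asArrayList = new ArrayList<String>();
        List<String> asLinkedList = new LinkedList<String>();
        asArrayList.add("margin");
        asLinkedList.add("color");
        return joinAll(asArrayList) + joinAll(asLinkedList);
    }
}
```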
I couldn't resist implementing this CSS parser in a completely different way. I have split the task of parsing into many small ones.
The smallest is called skipWhitespace, since you will need it everywhere when parsing text files.
The next one is parseProperty, which reads one property of the form name:value;.
Based on that, parseSelector reads a complete CSS selector, starting with the selector name, an opening brace, possibly many properties, and finishing with the closing brace.
Still based on that, parseFile reads a complete file, consisting of possibly many selectors.
Note how carefully I checked whether the index is small enough. I did that before every access to the chars array.
I used LinkedHashMaps to save the properties and the selectors, because these kinds of maps remember in which order the things have been inserted. Normal HashMaps don't do that.
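The ordering guarantee can be seen in a tiny sketch (InsertionOrderDemo is a hypothetical name used only for this illustration). Note that putting a key a second time replaces the value but keeps the key's original position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class showing LinkedHashMap iteration order.
final class InsertionOrderDemo {
    static List<String> keysInOrder() {
        Map<String, String> properties = new LinkedHashMap<String, String>();
        properties.put("margin", "0");
        properties.put("color", "red");
        properties.put("border", "none");
        // Re-inserting an existing key updates the value, not the position.
        properties.put("margin", "1px");
        // Iteration follows insertion order: margin, color, border.
        return new ArrayList<String>(properties.keySet());
    }
}
```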
The task of parsing a text file is generally quite complex, and this program only attempts to handle the basics of CSS. If you need a full CSS parser, you should definitely look for a ready-made one. This one cannot handle @media or similar things where you have nested blocks. But it shouldn't be too difficult to add that to the existing code.
This parser will not handle CSS comments very well. It only expects them at a few places. If comments appear in other places, the parser will not treat them as comments.
import java.util.LinkedHashMap;
import java.util.Map;
public class CssParser {
private final char[] chars;
private int index;
public CssParser(String code) {
this.chars = code.toCharArray();
this.index = 0;
}
private void skipWhitespace() {
/*
* Here you should also skip comments in the CSS file, which look like
* this comment. (Standard CSS has no // line comments.)
*/
while (index < chars.length && Character.isWhitespace(chars[index]))
index++;
}
private void parseProperty(String selector, Map<String, String> properties) {
skipWhitespace();
// get the CSS property name
StringBuilder sb = new StringBuilder();
while (index < chars.length && chars[index] != ':')
sb.append(chars[index++]);
String propertyName = sb.toString().trim();
if (index == chars.length)
throw new IllegalArgumentException("Expected a colon at index " + index + ".");
// skip the colon
index++;
// get the CSS property value
sb.setLength(0);
while (index < chars.length && chars[index] != ';' && chars[index] != '}')
sb.append(chars[index++]);
String propertyValue = sb.toString().trim();
/*
* Here is the check for duplicate property definitions. The method
* Map.put(Object, Object) always returns the value that had been stored
* under the given name before.
*/
String previousValue = properties.put(propertyName, propertyValue);
if (previousValue != null)
throw new IllegalArgumentException("Duplicate property \"" + propertyName + "\" in selector \"" + selector + "\".");
if (index < chars.length && chars[index] == ';')
index++;
skipWhitespace();
}
private void parseSelector(Map<String, Map<String, String>> selectors) {
skipWhitespace();
// get the CSS selector
StringBuilder sb = new StringBuilder();
while (index < chars.length && chars[index] != '{')
sb.append(chars[index++]);
String selector = sb.toString().trim();
if (index == chars.length)
throw new IllegalArgumentException("CSS Selector name \"" + selector + "\" without content.");
// skip the opening brace
index++;
skipWhitespace();
Map<String, String> properties = new LinkedHashMap<String, String>();
selectors.put(selector, properties);
while (index < chars.length && chars[index] != '}') {
parseProperty(selector, properties);
skipWhitespace();
}
// skip the closing brace
index++;
}
private Map<String, Map<String, String>> parseFile() {
Map<String, Map<String, String>> selectors = new LinkedHashMap<String, Map<String, String>>();
while (index < chars.length) {
parseSelector(selectors);
skipWhitespace();
}
return selectors;
}
public static void main(String[] args) {
CssParser parser = new CssParser("body {margin:prueba;A:B;a:Arial, Courier New, \"monospace\";\n}");
Map<String, Map<String, String>> selectors = parser.parseFile();
System.out.println("There are " + selectors.size() + " selectors.");
for (Map.Entry<String, Map<String, String>> entry : selectors.entrySet()) {
String selector = entry.getKey();
Map<String, String> properties = entry.getValue();
System.out.println("Selector " + selector + ":");
for (Map.Entry<String, String> property : properties.entrySet()) {
String name = property.getKey();
String value = property.getValue();
System.out.println(" Property name \"" + name + "\" value \"" + value + "\"");
}
}
}
}
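The comment handling that skipWhitespace leaves open could look roughly like the following. This is a sketch under assumptions: it only knows /* ... */ comments, treats an unterminated comment as running to the end of the input, and is written as a standalone static helper (CssSkipper and skipWhitespaceAndComments are hypothetical names) that takes the index and returns the new one instead of using the parser's fields.

```java
// Hypothetical standalone helper; in CssParser the same loop could replace skipWhitespace().
final class CssSkipper {
    /** Returns the first index at or after index that is neither whitespace nor inside a comment. */
    static int skipWhitespaceAndComments(char[] chars, int index) {
        while (index < chars.length) {
            if (Character.isWhitespace(chars[index])) {
                index++;
            } else if (chars[index] == '/' && index + 1 < chars.length && chars[index + 1] == '*') {
                index += 2; // skip the opening "/*"
                // advance until the closing "*/" or the end of the input
                while (index < chars.length
                        && !(chars[index] == '*' && index + 1 < chars.length && chars[index + 1] == '/')) {
                    index++;
                }
                // step over "*/" if present; never move past the end of the array
                index = Math.min(index + 2, chars.length);
            } else {
                break; // real content starts here
            }
        }
        return index;
    }
}
```

Every array access is guarded by an index check first, in the same spirit as the rest of the parser.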
