Java: Count empty lines in a text file/string

I am using the following code to count empty lines in Java, but it returns more empty lines than the text actually contains.
int countEmptyLines(String s) {
    int result = 0;
    Pattern regex = Pattern.compile("(?m)^\\s*$");
    Matcher testMatcher = regex.matcher(s);
    while (testMatcher.find()) {
        result++;
    }
    return result;
}
What am I doing wrong, or is there a better way to do it?

Try this:
final BufferedReader br = new BufferedReader(new StringReader("hello\n\nworld\n"));
String line;
int empty = 0;
while ((line = br.readLine()) != null) {
    if (line.trim().isEmpty()) {
        empty++;
    }
}
System.out.println(empty);

I found a way to fix my own regex while I was at lunch:
Pattern regex = Pattern.compile("(?m)^\\s*?$");
The '?' makes \s* reluctant: it matches as little whitespace as possible, so it no longer consumes the line terminator that '$' needs to match in front of.
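As a rough sketch of the two approaches above, the class below counts empty lines once with the reluctant pattern and once by reading the text line by line. The class name `EmptyLineCounter` and both method names are made up for illustration, and lines that contain only whitespace are treated as empty in both variants.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the posts above.
final class EmptyLineCounter {

    // The reluctant \s*? stops at the first position where $ can match,
    // so every empty or whitespace-only line yields its own match.
    private static final Pattern BLANK_LINE = Pattern.compile("(?m)^\\s*?$");

    static int countWithRegex(String text) {
        Matcher matcher = BLANK_LINE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    static int countWithReader(String text) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    count++;
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail, but readLine() declares IOException.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "hello\n\nworld\n";
        System.out.println(countWithRegex(sample));
        System.out.println(countWithReader(sample));
    }
}
```

The reader-based variant is usually easier to reason about, because `readLine()` already handles `\n`, `\r` and `\r\n` line endings.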

\s matches any whitespace character: a space, a tab, a carriage return, a linefeed, a form feed or a vertical tab.
The easiest way to do this is to count chains of successive EOL characters. I write EOL because you need to determine which character sequence denotes the end of a line in your file. Under Windows, an end of line is a carriage return followed by a linefeed; under Unix, it is a single linefeed, so for a file written under Unix your program will have to be adjusted.
Then, for every run of successive end-of-line sequences, add the length of the run minus 1 to a count. At the end, the count is the number of empty lines.
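The run-counting idea can be sketched as follows. This is only an illustration under a few assumptions: the class name `EolRunCounter` is invented, all line endings are first normalized to `\n`, whitespace-only lines are not counted, and empty lines at the very start of the text are undercounted by one because the first run has no preceding line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; counts empty lines via runs of end-of-line characters.
final class EolRunCounter {

    // One or more consecutive linefeeds (the string holds a real LF followed by '+').
    private static final Pattern EOL_RUN = Pattern.compile("\n+");

    static int countEmptyLines(String text) {
        // Normalize Windows (\r\n) and lone \r line endings to \n first.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        Matcher runs = EOL_RUN.matcher(normalized);
        int count = 0;
        while (runs.find()) {
            // A run of n line endings encloses n - 1 empty lines.
            count += runs.group().length() - 1;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countEmptyLines("hello\n\nworld\n"));
    }
}
```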

Related

Issue with replacing part of string in Java

I have a method that reads a sql query as a string and needs to replace all '\' with '\\'. This is done to prevent \n and \r being processed as line breaks in the output.
private static String cleanTokenForCsv(String inputToken) {
    if (inputToken == null) {
        return "";
    }
    if (inputToken.length() == 0 || inputToken.contains(",") || inputToken.contains("\"")
            || inputToken.contains("\n") || inputToken.contains("\r")) {
        String replacedToken = inputToken.replace(",", ";");
        return String.format("\"%s\"", replacedToken.replace("\"", "\"\""));
    } else {
        return inputToken;
    }
}
Sample Input
(\nSELECT\n a.population_id\n ,a.empi_id\n\r ,a.encounter_id\n ,SPLIT_PART(MIN(a.service_date||'|'||a.norm_numeric_value),'|',2)::FLOAT as earliest_temperature\nFROM\n ph_f_result a)
The expected output for the query would be along the lines of
"(\nSELECT\n a.population_id\n ;a.empi_id\n\r ;a.encounter_id\n ;SPLIT_PART(MIN(a.service_date||'|'||a.norm_numeric_value);'|';2)::FLOAT as earliest_temperature\nFROM\n ph_f_result a)"
That is, the entire query on one line, with the line-break sequences left intact as text.
However, the output instead is
"(
SELECT
a.population_id
;a.empi_id
;a.encounter_id
;SPLIT_PART(MIN(a.service_date||'|'||a.norm_numeric_value);'|';2)::FLOAT as earliest_temperature
FROM
ph_f_result a)"
I also tried the following:
replacedToken = replacedToken.replace("\\", "\\\\");
With Regex
replacedToken = replacedToken.replaceAll("\\\\", "\\\\\\\\");
Edit: So I can get it to work if I add individual replace calls for \n and \r like below
replacedToken = replacedToken.replace("\n","\\n");
replacedToken = replacedToken.replace("\r", "\\r");
But I am looking for something more generic for all '\n' instances
You have carriage return and newline characters, not escaped r or n.
replacedToken = replacedToken.replace("\n", "\\n").replace("\r", "\\r");
This replaces all carriage return and newline characters with their escaped equivalents.
I assume that your goal is simply to convert the characters \r, \t and \n in an input String into the two-character sequences backslash-r, backslash-t and backslash-n (written "\\r", "\\t" and "\\n" in Java source), so that printing the string does not result in newlines or tabs.
Note that the character \n does not really contain the character \ at all. We simply agree to write \n to represent it. This should work:
public static String escapeWhitespaceEscapes(String input) {
    return input
            .replace("\n", "\\n")
            .replace("\r", "\\r")
            .replace("\t", "\\t");
}
But note that you will have to perform the reverse operation to get back the original string.
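A minimal sketch of such a round trip is shown here. It goes one step beyond the method above: it also escapes the backslash itself, since otherwise an input that already contains a backslash followed by `n` could not be told apart from an escaped newline when reversing. The class and method names are hypothetical.

```java
// Hypothetical helper showing an escape/unescape pair for \n, \r and \t.
final class WhitespaceEscapes {

    static String escape(String input) {
        return input
                .replace("\\", "\\\\") // escape backslashes first so the mapping stays reversible
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t");
    }

    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 1 < escaped.length()) {
                char next = escaped.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case 't': out.append('\t'); break;
                    default: out.append(next); // an escaped backslash becomes a single backslash
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "SELECT\n\ta.id\r";
        String escaped = escape(original);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(original));
    }
}
```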

Java : Skip Unicode characters while reading a file

I am reading a text file using the code below:
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
    for (String line; (line = br.readLine()) != null;) {
        // I want to skip a line with a Unicode character and continue with the next line
        if (line.toLowerCase().startsWith("\\u")) {
            // This is not working because I get the character itself and not the escape text
            continue;
        }
    }
}
The text file:
How can I skip all the Unicode characters while reading a file?
You can skip all lines that contain non-ASCII characters:
if (!Charset.forName("US-ASCII").newEncoder().canEncode(line)) {
    continue;
}
All characters in a String are Unicode. A String is a counted sequence of UTF-16 code units. By "Unicode", you presumably mean characters that fall outside some other, unspecified character set. For the sake of argument, let's say ASCII.
A regular expression can sometimes be the simplest expression of a pattern requirement:
if (!line.matches("\\p{ASCII}*")) continue;
That is, if the string does not consist solely of ASCII characters (any number of them, including zero, which is what * means), then continue.
(String.matches looks for a match on the whole string, so the pattern behaves as if it were ^\p{ASCII}*$.)
Something like this might get you going:
for (char c : line.toCharArray()) {
    if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN) {
        // do something with this character
    }
}
You could use that as a starting point to either discard each non-basic character, or discard the entire line if it contains a single non-basic character.
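Putting the ASCII checks from the answers above together, a small sketch could look like this. It assumes that "skip Unicode" means "drop every line that contains a character outside US-ASCII"; the class `AsciiLineFilter` and its methods are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that keeps only lines made up entirely of ASCII characters.
final class AsciiLineFilter {

    // Encoder-based check: true if every character can be encoded as US-ASCII.
    static boolean isAscii(String line) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(line);
    }

    // Regex-based check; matches() always tests the whole string.
    static boolean isAsciiByRegex(String line) {
        return line.matches("\\p{ASCII}*");
    }

    static List<String> keepAsciiLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isAscii(line)) {
                continue; // skip lines containing non-ASCII characters
            }
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepAsciiLines(Arrays.asList("plain text", "caf\u00e9", "")));
    }
}
```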

How to fetch substring from java string starting with the range of \n+ to \n and \n- to \n in below given string?

Input string is as below:
## -106,12 +106,12 ## end loop\n loop map dummm56\n \tdummy data/path/u/u_op/kl-sc45\n end loop\n-\n-loop map {$df=56,$kl=20564300,$testId=\"jk: message1 message2:48667697\",$kl3=true,$kl=true, $kl=[2],$kl=$kl1, $kl1=true, $kl1=[1],$kl14=$kl15,$kl16=$kl14}\n+##### message2:48667697\n+loop map {val56}\n \tl1 l2/l3/i3/l7_l90/l90-SC21_l90/l90-l90_l90_l90\n end loop\n-\n-loop map {val56}\n+#####kl message45:48667697\n+loop map {val34}\n \ttestcases data/testcases/path1/[ath5+path6/UC20015-SC21/UC20015-SC21\n end loop\n
In the above string, I want to fetch all the substrings starting from \n+ and ending with \n, and also those starting from \n- and ending with \n, using a Java regular expression.
Expected Output is as below:
blank // here its blank as first \n- to next \n nothing is there.
loop map {$df=56,$kl=20564300,$testId=\"jk: message1 message2:48667697\",$kl3=true,$kl=true, $kl=[2],$kl=$kl1, $kl1=true, $kl1=[1],$kl14=$kl15,$kl16=$kl14} // as second \n- to next \n
##### message2:48667697 //third \n+ to next \n
loop map {val56}\n \tl1 l2/l3/i3/l7_l90/l90-SC21_l90/l90-l90_l90_l90 //fourth \n+ to next \n
blank // \n- to next \n
loop map {val56} // \n- to next \n
#####kl message45:48667697 // as \n+ to \n
loop map {val34} // as \n+ to \n
Actually, I want to build two different sets, one for \n+ to \n and another for \n- to \n, because I want to use them later.
Below is the Java code I have tried:
String str = "Pasted above string making sure no additional string literals should come.";
Pattern p0 = Pattern.compile("^(\\\\n[+|-])(.*?)(\\\\n.*)?$");
Matcher m = p0.matcher(str);
while (m.find()) {
    System.out.printf(m.group(0)); // here I am expecting my output to get printed
}
Can anyone help me with this? Any help would be highly appreciated. Thanks.
This is the regex that might help you:
^(\\\\n[+|-])(.*?)(\\\\n.*)?$
^ - start of line
\\\\n - a literal backslash followed by n; the backslash is escaped once for the regex and once more for the Java string literal
^(\\\\n[+|-]) - start of line followed by \n and then either + or - (inside a character class | is a literal character, so [+-] would be enough)
(.*?) - anything can follow after that (? makes the search non-greedy in case \n follows)
(\\\\n.*)? - \n might follow (4th line in the text)
$ - end of line, assuming that what follows the end of the line is a newline
From the output that you have provided, it seems that there are other conditions that you forgot to mention in your question, such as why ###problem statement1 is not in the output even though it starts with \n+ and ends with a newline. Below is a code snippet for the same:
public static void main(String[] args) throws Exception {
    Pattern pattern = Pattern.compile("^(\\\\n[+|-])(.*?)(\\\\n.*)?$");
    Matcher matcher = null;
    BufferedReader br = new BufferedReader(new FileReader("filePath"));
    String line = null;
    while ((line = br.readLine()) != null) {
        matcher = pattern.matcher(line);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    }
}
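If the goal is two separate collections, one for added and one for removed lines, the following sketch may be a simpler starting point. It assumes that the \n sequences in the input are real line breaks (if they are a literal backslash followed by n, `str.replace("\\n", "\n")` would convert them first) and that the leading + or - should be dropped. `DiffLineExtractor` and `linesWithMarker` are hypothetical names. It collects only lines that start with the marker, so context lines such as the one shown in the fourth expected result are not included, and the `---`/`+++` file header lines of a full diff would also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that collects the lines of a diff hunk starting with a given marker.
final class DiffLineExtractor {

    static List<String> linesWithMarker(String diff, char marker) {
        // (?m) lets ^ and $ match at every line; the marker is quoted because + is a regex operator.
        Pattern pattern = Pattern.compile("(?m)^" + Pattern.quote(String.valueOf(marker)) + "(.*)$");
        Matcher matcher = pattern.matcher(diff);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(1)); // text after the marker, possibly empty
        }
        return result;
    }

    public static void main(String[] args) {
        String diff = " end loop\n-\n-loop map {val56}\n+#####kl message45:48667697\n+loop map {val34}\n end loop\n";
        System.out.println("added:   " + linesWithMarker(diff, '+'));
        System.out.println("removed: " + linesWithMarker(diff, '-'));
    }
}
```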

How to clean a file, replacing unwanted separators, operators, string literals

I'm working on a concordance problem where I must: "Clean the file. For this, remove all string literals (anything enclosed in double quotes, the second of which is not preceded by an odd number of backslashes), remove all // comments, remove all separator characters (look these up), and operators (look these up). Do not worry about ".class literals" (we will assume they will not appear in the input file)."
I think I know how the replaceAll() method works, but I don't know what's going to be in the file. For starters, how would I go about removing all string literals? Is there a way to replace everything within two double quotes? For example: String someString = "I want to remove this from a file plz help me, thx";
I've currently put each line of text within an ArrayList of Strings.
Here's what I've got: http://pastebin.com/N84QdLqz
I think I've come up with a solution for your string literal regex. Something like:
inputLine.replaceAll("\"([^\\\\\"]*(\\\\\")*)*([\\\\]{2})*(\\\\\")*[^\"]*\"", "\"\"");
should do the trick. The regex is actually significantly more readable if you print it to the console, after Java has processed the escape sequences in the string literal. So if you call System.out.println() with that String, you'll get:
"([^\\"]*(\\")*)*([\\]{2})*(\\")*[^"]*"
I'll break down the original regex to explain. First there's:
"\"([^\\\\\"]*(\\\\\")*)*
This says to match a quote character (") followed by 0 or more patterns of characters that are neither backslashes (\) nor quote characters (") which are followed by 0 or more escaped quotes (\"). As you can see, since \ is typically used as an escape character in Java, any regexes using them become pretty verbose.
([\\\\]{2})*
This says to next match 0 or more sets of 2 (i.e. even-numbered amounts) of backslashes.
(\\\\\")*
This says to match a single backslash followed by a quote character, and to find 0 or more of those together.
[^\"]*\"
This says to match anything that is not a quote character, 0 or more times, followed by a quote character.
I tested my regex with an example similar to what you were asking for:
string literals (anything enclosed in double quotes, the second of which is not preceded by an odd number of backslashes)
Emphasis mine. So by this statement, if the first quote in a literal has a backslash in front of it, it doesn't matter.
String s = "This is "a test\" + "So is this"
Applying the regex with replaceAll and a replacement of \"\", you'll get:
String s = ""a test\""So is this"
which should be correct. You can completely remove the matching literal's quotes, if you want, by calling replaceAll with a replacement of "":
String s = a test\So is this"
Alternately, using this regex on something much less contrived to cause headaches:
String s = "This is \"a test\\" + "So is this"
will return:
String s = +
You can do something like this:
private static final String REGEX = "(\"[\\w|\\s]*\")";
private static Pattern P;
private static Matcher M;
public static void main(String args[]) {
    P = Pattern.compile(REGEX);
    //.... your code here ....
}
public static ArrayList<String> readStringsFromFile(String fileName) throws FileNotFoundException {
    ArrayList<String> list = new ArrayList<>();
    // try-with-resources closes the scanner once reading is done
    try (Scanner scanner = new Scanner(new File(fileName))) {
        while (scanner.hasNextLine()) {
            String str = scanner.nextLine();
            str = cleanLine(str); // clean the line after reading it
            list.add(str);
        }
    }
    return list;
}
public static String cleanLine(String line) {
    int index;
    // remove // comments
    index = line.indexOf("//");
    if (index != -1) {
        line = line.substring(0, index);
    }
    // remove everything within two double quotes
    M = P.matcher(line);
    String tmp = "";
    while (M.find()) {
        tmp = line.substring(0, M.start());
        tmp += line.substring(M.end());
        line = tmp;
        M = P.matcher(line);
    }
    return line;
}
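One detail worth illustrating is the order of the two steps: if // comments are cut before string literals are removed, a literal such as "http://example.com" is truncated in the middle. The sketch below removes literals first and uses a commonly seen pattern for quoted strings with backslash escapes, which is a different pattern from the ones in the answers above. The class name `SourceLineCleaner` is hypothetical, and char literals such as '"' are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips string literals first, then // comments.
final class SourceLineCleaner {

    // A quote, then any run of (a char that is neither quote nor backslash,
    // or a backslash followed by any char), then the closing quote.
    private static final Pattern STRING_LITERAL = Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"");

    static String clean(String line) {
        // Remove literals first so that "//" inside a literal is not mistaken for a comment.
        String withoutLiterals = STRING_LITERAL.matcher(line).replaceAll("");
        int comment = withoutLiterals.indexOf("//");
        return comment == -1 ? withoutLiterals : withoutLiterals.substring(0, comment);
    }

    public static void main(String[] args) {
        System.out.println(clean("String s = \"a \\\"quoted\\\" // not a comment\"; // real comment"));
        System.out.println(clean("url = \"http://example.com\";"));
    }
}
```

Because an escaped quote is consumed as a backslash-plus-character pair, the closing quote found by the pattern is always preceded by an even number of backslashes, which matches the definition quoted in the assignment.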

reading text file java

I'm trying to read a text file (.txt) in Java. I eventually need to put the text I extract, word by word, into the nodes of a binary tree. If, for example, I have the text "Hi, I'm doing a test!", I would like to split it into "Hi" "I" "m" "doing" "a" "test", basically skipping all punctuation and empty spaces and considering a word to be a sequence of contiguous alphabet letters. So far I am able to extract the words and put them in an array for testing. However, if I have a completely empty line in my .txt file, the code treats it as a word and returns an empty space. Also, punctuation at the end of a line works, but if there is, for example, a comma followed by more text, I get an empty space as well! Here is what I have tried so far:
public static void main(String[] args) throws Exception {
    FileReader file = new FileReader("File.txt");
    BufferedReader reader = new BufferedReader(file);
    String text = "";
    String line = reader.readLine();
    while (line != null) {
        text += line;
        line = reader.readLine();
    }
    System.out.println(text);
    String textnospaces = text.replaceAll("\\s+", " ");
    System.out.println(textnospaces);
    String[] tokens = textnospaces.split("[\\W+]");
    for (int i = 0; i <= tokens.length - 1; i++) {
        tokens[i] = tokens[i].toLowerCase();
        System.out.println(tokens[i]);
    }
}
Using the following text:
I can't, come see you. Today my friend is hard
s
I get the following output:
i
can
t
(extra space between "t" and "come")
come
see
you
(extra space again)
today
my
friend
is
hards
Any help would be appreciated! Thanks.
Use the trim() method of String. From the documentation at http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim%28%29:
"Returns a copy of the string, with leading and trailing whitespace omitted.
If this String object represents an empty character sequence, or the first and last characters of character sequence represented by this String object both have codes greater than '\u0020' (the space character), then a reference to this String object is returned.
Otherwise, if there is no character with a code greater than '\u0020' in the string, then a new String object representing an empty string is created and returned.
Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'. A new String object is created, representing the substring of this string that begins with the character at index k and ends with the character at index m-that is, the result of this.substring(k, m+1).
This method may be used to trim whitespace (as defined above) from the beginning and end of a string.
Returns:
A copy of this string with leading and trailing white space removed, or this string if it has no leading or trailing white space."
If you really are just looking for each contiguous sequence of letters, you can accomplish this quite simply with regex matching.
String patternString1 = "([a-zA-Z]+)";
String text = "I can't, come see you. Today my friend is hard";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("found: " + matcher.group(1));
}
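Two details in the question's code explain the unexpected output. `text += line` drops the line breaks, so "hard" and "s" fuse into "hards". And `[\\W+]` is a character class, so every single non-word character is a delimiter and adjacent delimiters such as ", " produce empty tokens. A sketch that avoids both by matching words instead of splitting is shown here; the class name `WordExtractor` is made up, and a word is assumed to be a run of the letters a-z or A-Z.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts lower-cased words from a block of text.
final class WordExtractor {

    // One or more ASCII letters; everything else (punctuation, digits, whitespace) separates words.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        // The line breaks are kept in the text, so "hard" and "s" stay separate words.
        String text = "I can't, come see you. Today my friend is hard\n\ns";
        System.out.println(words(text));
    }
}
```

Since matching never yields an empty string, empty lines and consecutive punctuation need no special handling.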
