I am having some trouble reading in a file and removing all of the punctuation from the file.
Below is what I currently have, and I cannot figure out why "----" and "*****" would still occur.
Can anyone point me in a direction to figure out how I need to adjust my replaceAll() in order to make sure repeated occurrences of punctuation can be removed?
public void analyzeFile(File filepath) {
try {
FileInputStream fStream = new FileInputStream(filepath);
DataInputStream in = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String textFile = "";
String regex = "[a-zA-Z0-9\\s]";
String putString = "";
wordCount = 0;
while ((textFile = br.readLine()) != null) {
if (!textFile.equals("") && textFile.length() > 0) {
String[] words = textFile.split(" ");
wordCount += words.length;
for (int i = 0; i < words.length; i++) {
putString = cleanString(regex, words[i]);
if(putString.length() > 0){
mapInterface.put(putString, 1);
}
}
putString = "";
}
}
in.close();
} catch (Exception e) {
System.out.println("Error while attempting to read file: "
+ filepath + " " + e.getMessage());
}
}
private String cleanString(String regex, String str){
String newString = "";
Pattern regexChecker = Pattern.compile(regex);
Matcher regexMatcher = regexChecker.matcher(str);
while(regexMatcher.find()){
if(regexMatcher.group().length() != 0){
newString += regexMatcher.group().toString();
}
}
return newString;
}
Surely you can use the \w character class? It matches letters, digits, and the underscore, but not punctuation.
putString = words[i].replaceAll("[^\\w]+", "");
This replaces every run of non-word characters with an empty string. Note that the backslash must be doubled inside a Java string literal.
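A minimal sketch of that replacement applied to a whole line, assuming a word is whatever is left of a whitespace-separated token once non-word characters are stripped. The class and method names (`PunctuationStripper`, `clean`, `cleanWords`) are illustrative and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationStripper {
    // Removes every run of characters that is not a letter, digit, or underscore.
    static String clean(String word) {
        return word.replaceAll("[^\\w]+", "");
    }

    // Splits a line on whitespace and keeps only tokens that are non-empty after cleaning.
    static List<String> cleanWords(String line) {
        List<String> result = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            String cleaned = clean(token);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, world, its, done]
        System.out.println(cleanWords("Hello, world! ---- ***** it's done."));
    }
}
```

Tokens such as "----" and "*****" become empty strings and are skipped, so they never reach the map. If underscores should also be removed, a character class such as `[^a-zA-Z0-9]+` could be used instead.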
This is what I have for now. I want to know how many times a given word occurs in a .txt document. I am trying to use BufferedReader but haven't managed to get it working. I guess there is an easier way to solve this, but I don't know it.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
public class TekstiAnalüsaator {
public static void main(String[] args) throws Exception {
InputStream baidid = new FileInputStream("test.txt");
InputStreamReader tekst = new InputStreamReader(baidid, "UTF-8");
BufferedReader puhverdab = new BufferedReader(tekst);
String rida = puhverdab.readLine();
while (rida != null){
System.out.println("Reading: " + rida);
rida = puhverdab.readLine();
}
puhverdab.close();
}
}
I want to search for words using this structure: specify the file, then the word to find, and get back how many times that word occurs in the file.
TekstiAnalüsaator analüsaator = new TekstiAnalüsaator("kiri.txt");
int esinemisteArv = analüsaator.sõneEsinemisteArv("kala");
Please see the code example below. Note that it counts the lines that contain the word, not every individual occurrence.
import java.io.*;
public class CountWords {
public static void main(String[] args) throws IOException {
System.out.println(count("Test.java", "static"));
}
// Returns the number of lines in the file that contain wordToSearch.
public static int count(String filename, String wordToSearch) throws IOException {
int tokencount = 0;
// try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
String s;
while ((s = br.readLine()) != null) {
if (s.contains(wordToSearch))
tokencount++;
}
}
return tokencount;
}
}
This is a slightly tricky question, because counting words in a string is not such a simple task. Your approach to reading the file line by line is fine, so the remaining problem is how to count the word matches.
For example, you can do a simple check for matches like this:
public static int getCountOFWordsInLine(String line, String test){
int count=0;
int index=0;
while(line.indexOf(test,index ) != -1) {
count++;
index=line.indexOf(test,index)+1;
}
return count;
}
The problem with that approach is that if your word is "test" and your string is "Next word matches asdfatestsdf", it will count that as a match. A regex with word boundaries avoids this:
public static int getCountOFWordsInLine(String line, String word) {
int count = 0;
Pattern pattern = Pattern.compile("\\b"+word+"\\b");
Matcher matcher = pattern.matcher(line);
while (matcher.find())
count++;
return count;
}
It checks for the word surrounded by \b, which is a word boundary.
It still won't find the word if it starts with an uppercase letter, though. To make the search case-insensitive, convert both the line and the word to lowercase before searching, or compile the pattern with Pattern.CASE_INSENSITIVE. In the end it depends on your definition of a word.
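As a rough sketch of the case-insensitive variant, assuming the search word should be treated as literal text rather than as a regex: `Pattern.quote` escapes any regex metacharacters in the word, and `Pattern.CASE_INSENSITIVE` handles the uppercase case. The class name `WholeWordCounter` is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCounter {
    // Counts whole-word occurrences of 'word' in 'line', ignoring case.
    static int count(String line, String word) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // prints 2: "Test" and "test" match, "asdfatestsdf" does not
        System.out.println(count("Test this test, not asdfatestsdf", "test"));
    }
}
```

By default `CASE_INSENSITIVE` only folds ASCII letters; for text with non-ASCII letters it would likely need to be combined with `Pattern.UNICODE_CASE`.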
The whole program will become:
import java.io.*;
import java.util.regex.*;
public class MainClass {
public static void main(String[] args) {
try {
InputStream baidid = new FileInputStream("c:\\test.txt");
InputStreamReader tekst = new InputStreamReader(baidid, "UTF-8");
BufferedReader puhverdab = new BufferedReader(tekst);
String rida = puhverdab.readLine();
String word="test";
int count=0;
while (rida != null){
System.out.println("Reading: " + rida);
count+=getCountOFWordsInLine(rida,word );
rida = puhverdab.readLine();
}
System.out.println("count:"+count);
puhverdab.close();
}catch(Exception e) {
e.printStackTrace();
}
}
public static int getCountOFWordsInLine(String line, String test) {
int count = 0;
Pattern pattern = Pattern.compile("\\b"+test+"\\b");
Matcher matcher = pattern.matcher(line);
while (matcher.find())
count++;
return count;
}
}
import java.io.*;
import java.util.regex.*;
public class TA
{
public static void main(String[] args) throws Exception
{
InputStream baidid = new FileInputStream("test.txt");
InputStreamReader tekst = new InputStreamReader(baidid, "UTF-8");
BufferedReader puhverdab = new BufferedReader(tekst);
String rida;
String word = args[0]; // search word passed via command line
int count1=0, count2=0, count3=0, count4=0;
Pattern P1 = Pattern.compile("\\b" + word + "\\b");
Pattern P2 = Pattern.compile("\\b" + word + "\\b", Pattern.CASE_INSENSITIVE);
while ((rida = puhverdab.readLine()) != null)
{
System.out.println("Reading: " + rida);
// Version 1 : counts lines containing [word]
if (rida.contains(word)) count1++;
// Version 2: counts every instance of [word]
int pos = 0;
while ((pos = rida.indexOf(word, pos)) != -1) { count2++; pos++; }
// Version 3: counts whole-word matches (\b is a word boundary, not just whitespace)
Matcher m1 = P1.matcher(rida);
while (m1.find()) count3++;
// Version 4: counts whole-word matches (case insensitive)
Matcher m2 = P2.matcher(rida);
while (m2.find()) count4++;
}
System.out.println("Found exactly " + count1 + " line(s) containing word: \"" + word + "\"");
System.out.println("Found word \"" + word + "\" exactly " + count2 + " time(s)");
System.out.println("Found word \"" + word + "\" as a whole word " + count3 + " time(s).");
System.out.println("Found, case insensitive search, word \"" + word + "\" as a whole word " + count4 + " time(s).");
puhverdab.close();
}
}
This reads line-by-line as you've already done, splits a line by whitespace to obtain individual words, and checks each word for a match.
int countWords(String filename, String word) throws Exception {
InputStream inputStream = new FileInputStream(filename);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "UTF-8");
BufferedReader reader = new BufferedReader(inputStreamReader);
int count = 0;
String line = reader.readLine();
while (line != null) {
String[] words = line.split("\\s+");
for (String w : words)
if (w.equals(word))
count++;
line = reader.readLine();
}
reader.close();
return count;
}
I'm working on a project and I'm trying to count
1) The number of words.
2) The number of lines in a text file.
My problem is that I can't figure out how to detect when the file goes to the next line so I can increment lines correctly. Basically if next is not a space increment words and if next is a new line, increment lines. How would I do this? Thanks!
public static void readFile(Scanner f) {
int words = 0;
int lines = 0;
while (f.hasNext()) {
if (f.next().equals("\n")) {
lines++;
} else if (!(f.next().equals(" "))) {
words++;
}
}
System.out.println("Total number of words: " + words);
System.out.println("Total number of lines: " + lines);
}
Try this:
public static void readFile(Scanner f) {
int words = 0;
int lines = 0;
while (f.hasNextLine()) {
String line = f.nextLine();
lines++;
for (String token : line.split("\\s+")) {
if (!token.isEmpty()) {
words++;
}
}
}
System.out.println("Total number of words: " + words);
System.out.println("Total number of lines: " + lines);
}
Do you have to use InputStream? (Yes) It is better to use a BufferedReader with an InputStreamReader passed in so you can read the file line by line and increment while doing so.
int numLines = 0;
try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream))) {
String line;
while ((line = br.readLine()) != null)
{
numLines++;
// process the line.
}
}
Then, to count the words, split each line using a regular expression that matches whitespace. String[] myStringArray = line.split("\\s+"); returns an array of all the words in the line. Then all you do is numWords += myStringArray.length; (on arrays, length is a field, not a method).
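The line and word counting described above can be sketched as follows. This is only an illustration under some assumptions: the class name `LineWordCounter` is hypothetical, the method takes any `Reader` (a `StringReader` here, an `InputStreamReader` in practice), and blank lines are treated as containing zero words.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class LineWordCounter {
    // Returns {lineCount, wordCount} for the given character source.
    static int[] count(Reader source) throws IOException {
        int numLines = 0;
        int numWords = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                numLines++;
                // Trim first: a leading space would otherwise yield an empty first token.
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    numWords += trimmed.split("\\s+").length;
                }
            }
        }
        return new int[] { numLines, numWords };
    }

    public static void main(String[] args) throws IOException {
        int[] totals = count(new StringReader("one two\n\n  three   four five\n"));
        // prints Lines: 3, words: 5
        System.out.println("Lines: " + totals[0] + ", words: " + totals[1]);
    }
}
```

To read from an `InputStream`, the call would be `count(new InputStreamReader(inputStream, "UTF-8"))`.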
You can wrap an InputStreamReader in a BufferedReader, which can read a file line by line:
int amountOfLines = 0;
try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream))) {
String line;
while ((line = br.readLine()) != null) {
amountOfLines++;
// process the line.
}
} catch (IOException e) {
e.printStackTrace();
}
You can then use the split(String) method to separate each line into words.
Try the following:
public static void readFile(Scanner f) {
int words = 0;
int lines = 0;
while (f.hasNextLine()) {
String line = f.nextLine();
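// trim first and split on runs of whitespace so blank lines count as zero words
String trimmed = line.trim();
if (!trimmed.isEmpty()) {
words += trimmed.split("\\s+").length;
}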
lines++;
}
System.out.println("Total number of words: " + words);
System.out.println("Total number of lines: " + lines);
}
I'm writing a program that'll scan a text file in and count the number of words in it. The definition of a word for the assignment is: 'A word is a non-empty string consisting only of letters (a, ..., z, A, ..., Z), surrounded by blanks, punctuation, hyphenation, line start, or line end.'
I'm a novice at Java programming, and so far I've managed to write this instance method, which presumably should work. But it doesn't.
public int wordCount() {
int countWord = 0;
String line = "";
try {
File file = new File("testtext01.txt");
Scanner input = new Scanner(file);
while (input.hasNext()) {
line = line + input.next()+" ";
input.next();
}
input.close();
String[] tokens = line.split("[^a-zA-Z]+");
for (int i=0; i<tokens.length; i++){
countWord++;
}
return countWord;
} catch (Exception ex) {
ex.printStackTrace();
}
return -1;
}
Quoting from Counting words in text file?
int wordCount = 0;
while (input.hasNextLine()){
String nextLine = input.nextLine();
Scanner word = new Scanner(nextLine);
while(word.hasNext()){
wordCount++;
word.next();
}
word.close();
}
input.close();
The only usable word separators in your file are spaces and hyphens. You can use regex and the split() method.
int num_words = line.split("[\\s\\-]").length; //stores number of words
System.out.print("Number of words in file is "+num_words);
REGEX (Regular Expression):
\\s splits the String at white spaces/line breaks and \\- at hyphens. So wherever there is a space, line break or hyphen, the sentence will be split. The words extracted are copied into and returned as an array whose length is the number of words in your file.
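A small sketch of that split-based count, with one deliberate change: the single-character class `[\\s\\-]` produces empty tokens wherever two separators are adjacent (for example a double space), so this version splits on runs of separators with `+` and skips any empty token that remains. The class name `HyphenSplitCounter` is an assumption made for the example.

```java
class HyphenSplitCounter {
    // Counts tokens separated by whitespace, line breaks, or hyphens.
    static int countWords(String text) {
        int count = 0;
        // One or more whitespace characters or hyphens act as a single separator.
        for (String token : text.split("[\\s\\-]+")) {
            // A leading separator still yields one empty first token; skip it.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // prints 5: well, known, fact, next, line
        System.out.println(countWords("well-known  fact\nnext line"));
    }
}
```

Note that this still counts tokens such as "42" or "end." as words; whether that is acceptable depends on the definition of a word being used.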
You can use Java regular expressions.
See http://docs.oracle.com/javase/tutorial/essential/regex/groups.html to learn about groups.
public int wordCount(){
String patternToMatch = "([a-zA-Z]+)";
int countWord = 0;
try {
Pattern pattern = Pattern.compile(patternToMatch);
File file = new File("abc.txt");
Scanner sc = new Scanner(file);
while(sc.hasNextLine()){
Matcher matcher = pattern.matcher(sc.nextLine());
while(matcher.find()){
countWord++;
}
}
sc.close();
}catch(Exception e){
e.printStackTrace();
}
return countWord > 0 ? countWord : -1;
}
void run(String path)
throws Exception
{
try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8")))
{
int result = 0;
while (true)
{
String line = reader.readLine();
if (line == null)
{
break;
}
result += countWords(line);
}
System.out.println("Words in text: " + result);
}
}
final Pattern pattern = Pattern.compile("[A-Za-z]+");
int countWords(String text)
{
Matcher matcher = pattern.matcher(text);
int result = 0;
while (matcher.find())
{
++result;
System.out.println("Matcher found [" + matcher.group() + "]");
}
System.out.println("Words in line: " + result);
return result;
}
I'm trying to write a small program that detects comments in a code file and tags each one with an index tag, meaning a tag with an increasing value.
For example this input:
method int foo (int y) {
int temp; // FIRST COMMENT
temp = 63; // SECOND COMMENT
// THIRD COMMENT
}
should be changed to:
method int foo (int y) {
int temp; <TAG_0>// FIRST COMMENT</TAG>
temp = 63; <TAG_1>// SECOND COMMENT</TAG>
<TAG_2>// THIRD COMMENT</TAG>
}
I tried the following code:
String prefix, suffix;
String pattern = "(//.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(fileText);
int i = 0;
suffix = "</TAG>";
while (m.find()) {
prefix = "<TAG_" + i + ">";
System.out.println(m.replaceAll(prefix + m.group() + suffix));
i++;
}
The output for the above code is:
method int foo (int y) {
int temp; <TAG_0>// FIRST COMMENT</TAG>
temp = 63; <TAG_0>// SECOND COMMENT</TAG>
<TAG_0>// THIRD COMMENT</TAG>
}
To replace occurrences of detected patterns, you should use the Matcher#appendReplacement method which fills a StringBuffer:
StringBuffer sb = new StringBuffer();
while (m.find()) {
prefix = "<TAG_" + i + ">";
m.appendReplacement(sb, prefix + m.group() + suffix);
i++;
}
m.appendTail(sb); // append the rest of the contents
The reason replaceAll does the wrong replacement is that it resets the Matcher and scans the whole string, replacing every match with <TAG_0>...</TAG> in a single call. After that call the matcher has already consumed the input, so in effect the loop body executes only once.
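Put together as a self-contained method, the appendReplacement approach might look like the sketch below. The class name `CommentTagger` is hypothetical. One addition beyond the snippet above is `Matcher.quoteReplacement`: the replacement string passed to `appendReplacement` treats `$` and `\` specially, and a comment may well contain either.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentTagger {
    // Wraps each "//" comment in <TAG_n>...</TAG> with n increasing from 0.
    static String tagComments(String fileText) {
        Matcher m = Pattern.compile("//.*").matcher(fileText);
        StringBuffer sb = new StringBuffer();
        int i = 0;
        while (m.find()) {
            String replacement = "<TAG_" + i + ">" + m.group() + "</TAG>";
            // quoteReplacement keeps '$' and '\' in the comment text literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
            i++;
        }
        m.appendTail(sb); // copy whatever follows the last comment
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "int temp; // FIRST COMMENT\n"
                + "temp = 63; // SECOND COMMENT\n"
                + "// THIRD COMMENT\n"
                + "}";
        System.out.println(tagComments(input));
    }
}
```

This simple pattern also tags `//` sequences that appear inside string literals (for example in a URL), so it is only suitable for input where that does not occur.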
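Have you tried reading the file line by line, like this:
String prefix, suffix;
suffix = "</TAG>";
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
int i = 0;
for (String line; (line = br.readLine()) != null;) {
// locate the start of the comment; splitting on the regex "//*" would also split on single slashes
int pos = line.indexOf("//");
if (pos >= 0) {
prefix = "<TAG_" + i + ">";
System.out.println(line.substring(0, pos) + prefix + line.substring(pos) + suffix);
i++;
} else {
// print lines without a comment unchanged
System.out.println(line);
}
}
} catch (IOException e) {
e.printStackTrace();
}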
fichiertexte.txt :
method int foo (int y) {
int temp; // FIRST COMMENT
temp = 63; // SECOND COMMENT
// THIRD COMMENT
}
App.java :
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class App {
public static void main(String[] args) {
String fileText = "";
String fichier = "fichiertexte.txt";
// read the text file
try {
InputStream ips = new FileInputStream(fichier);
InputStreamReader ipsr = new InputStreamReader(ips);
BufferedReader br = new BufferedReader(ipsr);
String ligne;
while ((ligne = br.readLine()) != null) {
//System.out.println(ligne);
fileText += ligne + "\n";
}
br.close();
} catch (Exception e) {
System.err.println(e.toString());
}
String prefix, suffix;
String pattern = "(//.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(fileText);
int i = 0;
suffix = "</TAG>";
StringBuffer sb = new StringBuffer();
while (m.find()) {
prefix = "<TAG_" + i + ">";
m.appendReplacement(sb, prefix + m.group() + suffix);
i++;
}
m.appendTail(sb); // append the text after the last match, otherwise the closing lines are lost
System.out.println(sb.toString());
}
}
System.out :
method int foo (int y) {
int temp; <TAG_0>// FIRST COMMENT</TAG>
temp = 63; <TAG_1>// SECOND COMMENT</TAG>
<TAG_2>// THIRD COMMENT</TAG>
}
So, I'm working on a procedure that takes a txt file called orders, which specifies the number of words to bold and which words must be bolded. I've managed to do it for one word, but when I try with two words the output gets doubled. For example:
Input:
2
Ophelia
him
Output:
ACT I
ACT I
SCENE I. Elsinore. A platform before the castle.
SCENE I. Elsinore. A platform before the castle.
FRANCISCO at his post. Enter to him BERNARDO
FRANCISCO at his post. Enter to *him* BERNARDO
Here's my code, can anyone help me? PS: Ignore the boolean I guess.
static void bold(char bold, BufferedReader orders, BufferedReader in, BufferedWriter out) throws IOException
{
String linha = in.readLine();
boolean encontrou = false;
String[] palavras = new String[Integer.parseInt(orders.readLine())];
for (int i = 0; i < palavras.length; i++)
{
palavras[i] = orders.readLine();
}
while (linha != null)
{
StringBuilder str = new StringBuilder(linha);
for (int i = 0; i < palavras.length && !encontrou; i++)
{
if (linha.toLowerCase().indexOf(palavras[i]) != -1)
{
str.insert((linha.toLowerCase().indexOf(palavras[i])), bold);
str.insert((linha.toLowerCase().indexOf(palavras[i])) + palavras[i].length() + 1, bold);
out.write(str.toString());
out.newLine();
}
else
{
out.write(linha);
out.newLine();
}
}
linha = in.readLine();
}
}
This merits a regular expression replace of WORD-BOUNDARY + ALTERNATIVES + WORD-BOUNDARY.
String linha; // Assigned below, after the word list has been read from orders.
String[] palavras = new String[Integer.parseInt(orders.readLine())];
for(int i = 0; i < palavras.length; i++){
palavras[i]=orders.readLine();
}
// We make a regular expression Pattern.
// Like "\\b(him|her|it)\\b" where \\b is a word-boundary.
// This prevents mangling "shimmer".
StringBuilder regex = new StringBuilder("\\b(");
for (int i = 0; i < palavras.length; i++) {
if (i != 0) {
regex.append('|');
}
regex.append(Pattern.quote(palavras[i]));
}
regex.append(")\\b");
Pattern pattern = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
boolean encontrou = false;
linha = in.readLine(); // Read first line.
while(linha != null){
Matcher m = pattern.matcher(linha);
String linha2 = m.replaceAll("*$1*");
if (!linha2.equals(linha)) {
encontrou = true; // Found a replacement.
}
out.write(linha2);
out.newLine();
linha = in.readLine(); // Read next line.
}
A replaceAll (instead of replaceFirst) then replaces all occurrences.
It's writing out twice because you output your StringBuilder (out.write(str.toString())) for the line (linha) every time you iterate through it, which will be at least the number of words in the lookup list.
Move the out.write() statements outside the loop and you should be fine.
Note this will only find one match in each line for each word. If you need to find more than one, the code is a little more complicated. You need to introduce a while loop instead of your if test for matching, or you could consider using replaceAll() with a regular expression based on your word palavras[i]. Respecting the capitalisation of the original text is not simple there, but it is possible.
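One possible way to handle several matches per line while keeping the original capitalisation is a single case-insensitive pattern with a capture group, sketched below. The class and method names (`WordBolder`, `boldAll`) are invented for the example, and it assumes the words are plain text rather than regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBolder {
    // Wraps every whole-word, case-insensitive occurrence of any listed word in the marker.
    static String boldAll(String line, String[] words, char bold) {
        if (words.length == 0) {
            return line; // nothing to search for
        }
        StringBuilder alternatives = new StringBuilder();
        for (String w : words) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(w)); // treat the word as literal text
        }
        Pattern pattern = Pattern.compile("\\b(" + alternatives + ")\\b",
                Pattern.CASE_INSENSITIVE);
        // Quote the marker in case it is '$' or '\', which are special in replacements.
        String marker = Matcher.quoteReplacement(String.valueOf(bold));
        // $1 re-inserts the text exactly as matched, so "HIM" stays "HIM".
        return pattern.matcher(line).replaceAll(marker + "$1" + marker);
    }

    public static void main(String[] args) {
        String[] words = { "Ophelia", "him" };
        // prints Enter to *him*, *HIM* and *Ophelia*; shimmer
        System.out.println(boldAll("Enter to him, HIM and Ophelia; shimmer", words, '*'));
    }
}
```

Because the whole line is rewritten in one pass and written once, the doubled output cannot occur with this approach.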
Fixed version
static void bold(char bold, BufferedReader orders, BufferedReader in, BufferedWriter out)
throws IOException
{
String linha = in.readLine();
boolean encontrou = false;
String[] palavras = new String[Integer.parseInt(orders.readLine())];
for (int i = 0; i < palavras.length; i++)
{
palavras[i] = orders.readLine();
}
while (linha != null)
{
StringBuilder str = new StringBuilder(linha);
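for (int i = 0; i < palavras.length && !encontrou; i++)
{
// search the current buffer (not the original line) so earlier insertions do not shift the indices,
// and lowercase the word too so that "Ophelia" can match
String palavra = palavras[i].toLowerCase();
int pos = str.toString().toLowerCase().indexOf(palavra);
if (pos != -1)
{
str.insert(pos, bold);
str.insert(pos + palavra.length() + 1, bold);
}
}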
out.write(str.toString());
out.newLine();
linha = in.readLine();
}
}
With replaceAll
static void bold(char bold, BufferedReader orders, BufferedReader in, BufferedWriter out)
throws IOException
{
String linha = in.readLine();
boolean encontrou = false;
String[] palavras = new String[Integer.parseInt(orders.readLine())];
for (int i = 0; i < palavras.length; i++)
{
palavras[i] = orders.readLine();
}
while (linha != null)
{
for (int i = 0; i < palavras.length && !encontrou; i++)
{
String regEx = "\\b("+palavras[i]+")\\b";
linha = linha.replaceAll(regEx, bold + "$1"+bold);
}
out.write(linha);
out.newLine();
linha = in.readLine();
}
}
P.S. I've left the found boolean (encontrou) in, although it is not doing anything at the moment.