find common words in multiple text files

I have 100 text files. 50 of them are called text_H and the others are called text_T. I would like to do the following: open text_T_1 and text_H_1, find the number of common words and write it to a text file; then open text_H_2 and text_T_2 and find the number of common words; and so on, up to text_H_50 and text_T_50.
I have written the following code, which opens two text files, finds the common words and returns the number of common words between the two files. The results are written to a text file.
For whatever reason, instead of giving me the number of common words for just the currently open text files, it gives me the number of common words for all files so far. For example, if the number of common words between fileA_1 and fileB_1 is 10 and the number of common words between fileA_2 and fileB_2 is 5, then the result I get for the second pair of files is 10+5=15.
I'm hoping someone here can catch whatever it is that I'm missing, because I've been through this code many times now without success. Thanks ahead of time for any help!
The code:
package xml_test;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
public class app {
private static ArrayList<String> load(String f1) throws FileNotFoundException
{
Scanner reader = new Scanner(new File(f1));
ArrayList<String> out = new ArrayList<String>();
while (reader.hasNext())
{
String temp = reader.nextLine();
String[] sts = temp.split(" ");
for (int i = 0;i<sts.length;i++)
{
if(sts[i] != "" && sts[i] != " " && sts[i] != "\n")
out.add(sts[i]);
}
}
return out;
}
private static void write(ArrayList<String> out, String fname) throws IOException
{
FileWriter writer = new FileWriter(new File(fname));
//int count=0;
int temp1=0;
for (int ss= 1;ss<=3;ss++)
{
int count=0;
for (int i = 0;i<out.size();i++)
{
//writer.write(out.get(i) + "\n");
//writer.write(new Integer(count).toString());
count++;
}
writer.write("count ="+new Integer(temp1).toString()+"\n");
}
writer.close();
}
public static void main(String[] args) throws IOException
{
ArrayList<String> file1;
ArrayList<String> file2;
ArrayList<String> out = new ArrayList<String>();
//add for loop to loop through all T's and H's
for(int kk = 1;kk<=3;kk++)
{
int count=0;
file1 = load("Training_H_"+kk+".txt");
file2 = load("Training_T_"+kk+".txt");
//int count=1;
for(int i = 0;i<file1.size();i++)
{
String word1 = file1.get(i);
count=0;
//System.out.println(word1);
for (int z = 0; z <file2.size(); z++)
{
//if (file1.get(i).equalsIgnoreCase(file2.get(i)))
if (word1.equalsIgnoreCase(file2.get(z)))
{
boolean already = false;
for (int q = 0;q<out.size();q++)
{
if (out.get(q).equalsIgnoreCase(file1.get(i)))
{
count++;
//System.out.println("count is "+count);
already = true;
}
}
if (already==false)
{
out.add(file1.get(i));
}
}
}
//write(out,"output_"+kk+".txt");
}
//count=new Integer(count).toString();
//write(out,"output_"+kk+".txt");
//write(new Integer(count).toString(),"output_2.txt");
//System.out.println("count is "+count);
}//
}
}

Let me show you what your code is doing and see if you can spot the problem.
List wordsInFile1 = getWordsFromFile();
List wordsInFile2 = getWordsFromFile();
List foundWords = empty;
//Does below for each compared file
for each word in file 1
    set count to 0
    compare to each word in file 2
        if the word matches see if it's also in foundWords
            if it is in foundWords, add 1 to count
            otherwise, add the word to foundWords
//Write the number of words
prints out the number of words in foundWords
Hint: The issue is with foundWords and where you are adding to count. arunmoezhi's comment is on the right track, as well as board_reader's point #3 in his answer.
As it stands now, your code does nothing meaningful with any of the count variables.
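One way to act on the hint, as a minimal sketch rather than the asker's final program: build the list of found words fresh for every pair of files, and report its size as the count. The class and method names (CommonWordCount, countCommon, containsIgnoreCase) are made up for illustration, and the word lists are hard-coded here instead of being loaded from the Training_H/Training_T files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CommonWordCount {
    // Counts the distinct words (ignoring case) that appear in both lists.
    static int countCommon(List<String> words1, List<String> words2) {
        // Created inside the method, so every pair starts with an empty list.
        List<String> foundWords = new ArrayList<>();
        for (String word1 : words1) {
            for (String word2 : words2) {
                if (word1.equalsIgnoreCase(word2) && !containsIgnoreCase(foundWords, word1)) {
                    foundWords.add(word1);
                }
            }
        }
        return foundWords.size();
    }

    // Hypothetical helper: case-insensitive membership test.
    static boolean containsIgnoreCase(List<String> words, String target) {
        for (String w : words) {
            if (w.equalsIgnoreCase(target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-ins for the contents of two file pairs.
        List<String> h1 = Arrays.asList("the", "cat", "sat");
        List<String> t1 = Arrays.asList("The", "cat", "ran");
        List<String> h2 = Arrays.asList("a", "dog");
        List<String> t2 = Arrays.asList("dog", "barked");
        System.out.println("pair 1: " + countCommon(h1, t1)); // pair 1: 2
        System.out.println("pair 2: " + countCommon(h2, t2)); // pair 2: 1
    }
}
```

Because nothing is carried over between calls, the second pair reports its own count instead of the running total.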

1. Use more meaningful variable names in loops; it makes the code more readable.
2. Use HashMaps instead of ArrayLists. That will make the code smaller, faster and a lot easier to follow, and it will use less memory when words are repeated several times in the files.
3. Shouldn't you increase count in the already==false case?
4. I could not figure out the point of calculating count 3 times in the write method. Isn't count equal to out.size()?
5. There are probably more issues too...
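The second suggestion above can be sketched with hash-based collections. This version swaps in HashSet (a set backed by a hash table, closely related to HashMap) because only membership matters here, not a value per word. The names CommonWordsWithSets, toLowerCaseSet and countCommon are illustrative assumptions, and the inputs are in-memory lists rather than the asker's files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CommonWordsWithSets {
    // Lower-cases every non-empty word and drops duplicates.
    static Set<String> toLowerCaseSet(List<String> words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            if (!w.isEmpty()) {
                set.add(w.toLowerCase(Locale.ROOT));
            }
        }
        return set;
    }

    // Number of distinct words present in both lists, ignoring case.
    static int countCommon(List<String> words1, List<String> words2) {
        Set<String> common = toLowerCaseSet(words1);
        common.retainAll(toLowerCaseSet(words2)); // keep only the intersection
        return common.size();
    }

    public static void main(String[] args) {
        List<String> h = Arrays.asList("the", "cat", "sat", "the");
        List<String> t = Arrays.asList("The", "cat", "ran");
        System.out.println(countCommon(h, t)); // 2
    }
}
```

The three nested loops of the original collapse into one retainAll call, and a fresh set per call avoids the running-total problem as a side effect.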

Related

split method not separating special characters

I'm working on a program that counts the occurrences of words in a text file. The program compiles and runs fine, but I'm trying to use the split method to separate special characters such as .,;:!?(){} from the words.
Here is an output example:
6 eyes,
3 eyes.
2 eyes;
1 eyes?
1 eyrie
As you can see, the split function is not working. I have tried debugging, but no luck so far. Can anyone point me in the right direction or tell me what I'm doing wrong? Thank you.
import java.util.*;
import java.io.*;
public class testingForLetters {
public static void main(String[] args) throws FileNotFoundException {
// open the file
Scanner console = new Scanner(System.in);
System.out.print("What is the name of the text file? ");
String fileName = console.nextLine();
Scanner input = new Scanner(new File(fileName));
// count occurrences
Map<String, Integer> wordCounts = new TreeMap<String, Integer>();
while (input.hasNext()) {
input.next().split("[ \n\t\r.,;:!?(){}]" );
String next = input.next().toLowerCase();
if (next.startsWith("a") || next.startsWith("b") || next.startsWith("c") || next.startsWith("d") || next.startsWith("e") ) {
if (!wordCounts.containsKey(next)) {
wordCounts.put(next, 1);
} else {
wordCounts.put(next, wordCounts.get(next) + 1);
}
}
}
// get cutoff and report frequencies
System.out.println("Total words = " + wordCounts.size());
for (String word : wordCounts.keySet()) {
int count = wordCounts.get(word);
System.out.println(count + "\t" + word);
}
}
}

The .split() method returns an array of strings, and right now you aren't setting input.next().split() equal to anything. You have to create an array and set it equal to input.next().split(), and then get the word(s) from the array. You basically need to handle it exactly like you handled the .toLowerCase() part where you set String next = input.next().toLowerCase(). Hope this helps.
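A minimal sketch of that advice, under a few assumptions: the class and method names (SplitWords, countWords) are invented, the Scanner reads from a string instead of a file, and the a-to-e filter from the question is left out to keep the example short. Scanner.next() already splits on whitespace, so the regular expression only needs the punctuation characters.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class SplitWords {
    static Map<String, Integer> countWords(Scanner input) {
        Map<String, Integer> wordCounts = new TreeMap<>();
        while (input.hasNext()) {
            // Keep the array that split() returns instead of discarding it.
            String[] parts = input.next().toLowerCase().split("[.,;:!?(){}]+");
            for (String part : parts) {
                if (part.isEmpty()) {
                    continue; // happens when a token starts with punctuation
                }
                wordCounts.merge(part, 1, Integer::sum);
            }
        }
        return wordCounts;
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("Eyes, eyes. eyes; (eyrie) eyes?");
        System.out.println(countWords(input)); // {eyes=4, eyrie=1}
    }
}
```

Note that each token is read with next() exactly once per iteration; the code in the question calls next() twice, which silently skips every other token.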

if statement not adding value to my counter in word count program

I have a Java program that reads a txt file and counts the words in that file. I set up my program so the String read from the txt file is saved in an ArrayList, and my variable word contains that text. The issue with my code is that my if statement does not seem to add a value to my count variable each time it detects a space in the word string; it seems to only run the if statement once. How can I make it so the if statement finds a space, adds 1 to my counter value, removes the space, and looks for the next space in the word variable's string? Here is the code:
import java.io.*;
import java.util.*;
public class FrequencyCounting
{
public static void main(String[] args) throws FileNotFoundException
{
// Read-in text from a file and store each word and its
// frequency (count) in a collection.
Scanner inputFile = new Scanner(new File("phrases.txt"));
String word= " ";
Integer count = 0;
List<String> ma = new ArrayList<String>();
while(
inputFile.hasNextLine()) {
word = word + inputFile.nextLine() + " ";
}
ma.add(word);
System.out.println(ma);
if(word.contains(" ")) {
ma.remove(" ");
count++;
System.out.println("does contain");
}
else {
System.out.println("does not contain");
}
System.out.println(count);
//System.out.println(ma);
inputFile.close();
// Output each word, followed by a tab character, followed by the
// number of times the word appeared in the file. The words should
// be in alphabetical order.
; // TODO: Your code goes here.
}
}
When I execute the program, I get a value of 1 for the variable count and I get a returned string representation of the txt file from my phrases.txt
phrases.txt is :
my watch fell in the water
time to go to sleep
my time to go visit
watch out for low flying objects
great view from the room
the world is a stage
the force is with you
you are not a jedi yet
an offer you cannot refuse
are you talking to me

Your if statement is not inside any loop, so it will only execute once.
A better approach, which would also save a lot of runtime, is to read each line as you already do, split it on spaces with String.split(), and add every element of the returned String[] to your list with ArrayList.addAll() (wrap the array with Arrays.asList() first, since addAll() takes a Collection; alternatively add the elements one by one, optionally calling ensureCapacity() beforehand).
Then count by using the ArrayList.size() method to get the number of elements.
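The approach described above might look roughly like this. It is a sketch under assumptions: TotalWordCount and readWords are made-up names, the Scanner reads two lines of the sample text from a string rather than phrases.txt, and lines are split on runs of whitespace so that double spaces do not produce empty words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class TotalWordCount {
    // Reads every line, splits it into words and collects them in one list.
    static List<String> readWords(Scanner inputFile) {
        List<String> words = new ArrayList<>();
        while (inputFile.hasNextLine()) {
            String line = inputFile.nextLine().trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // addAll() takes a Collection, so wrap the array first.
            words.addAll(Arrays.asList(line.split("\\s+")));
        }
        return words;
    }

    public static void main(String[] args) {
        Scanner inputFile = new Scanner("my watch fell in the water\ntime to go to sleep\n");
        List<String> words = readWords(inputFile);
        System.out.println(words.size()); // 11
        inputFile.close();
    }
}
```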

Based on the comments in your code :
// Read-in text from a file and store each word and its
// frequency (count) in a collection.
// Output each word, followed by a tab character, followed by the
// number of times the word appeared in the file. The words should
// be in alphabetical order.
My understanding is that you need to store a count for every word rather than a total count of words. To store a count per word, with the words kept in alphabetical order, it is better to go with a TreeMap.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Scanner;
import java.util.TreeMap;
public class FrequencyCounting {
    public static void main(String[] args) {
        // A TreeMap keeps its keys (the words) in alphabetical order
        Map<String, Integer> wordMap = new TreeMap<String, Integer>();
        try {
            Scanner inputFile = new Scanner(new File("phrases.txt"));
            while(inputFile.hasNextLine()){
                String line = inputFile.nextLine();
                String[] words = line.split(" ");
                for(int i=0; i<words.length; i++){
                    String word = words[i].trim();
                    // consecutive spaces produce empty strings; skip them
                    if(word.length()==0){
                        continue;
                    }
                    // start from the stored count, or 0 for a new word
                    int count = 0;
                    if(wordMap.containsKey(word)){
                        count = wordMap.get(word);
                    }
                    count++;
                    wordMap.put(word, count);
                }
            }
            inputFile.close();
            // print each word, a tab, then its count
            for(Entry<String,Integer> entry : wordMap.entrySet()){
                System.out.println(entry.getKey()+"\t"+entry.getValue());
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}

What is your goal here? Do you just want to read the file and count the number of words?

You need to use a while loop instead of an if statement that'll just run once. Here's a better way to do what you want to do:
Scanner inputFile = new Scanner(new File("phrases.txt"));
StringBuilder sb = new StringBuilder();
String line;
int totalCount = 0;
while(inputFile.hasNextLine()) {
line = inputFile.nextLine();
sb.append(line).append("\n"); // This is more efficient than concatenating strings
int spacesOnLine = countSpacesOnLine(line);
totalCount += spacesOnLine;
// print line and spacesOnLine if you wish to here
}
// print text file
System.out.println(sb.toString());
// print total spaces in file
System.out.println("Total spaces: " + totalCount);
inputFile.close();
Then add a method that counts the spaces on a line:
private static int countSpacesOnLine(String line) { // static so it can be called from main
int totalSpaces = 0;
for(int i = 0; i < line.length(); i++) {
if (line.charAt(i) == ' ')
totalSpaces += 1;
}
return totalSpaces;
}

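You can achieve your objective with the following one-liner too (it needs java.nio.file.Files, java.nio.file.Paths and java.nio.charset.Charset, and Java 8 or later for streams):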
int words = Files.readAllLines(Paths.get("phrases.txt"), Charset.forName("UTF-8")).stream().mapToInt(string -> string.split(" ").length).sum();

I am probably late, but here is a simple C# version:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
namespace StackOverflowAnswers
{
class Program
{
static void Main(string[] args)
{
string contents = File.ReadAllText(@"C:\temp\test.txt");
var arrayString = contents.Split(' ');
Console.WriteLine("Number of Words {0}", arrayString.Length);
Console.ReadLine();
}
}
}

Frequency by letter order

I am looking to manipulate a text file through frequency by letter order. In my program there is a method I'm not sure how to start. I'd like to get an output something like:
Letter / Count
1 A 6 ***
2 B 8 ****
3 C 6 ***
(etc.)
Here 6 names begin with A, 8 with B, and 6 with C, and one '*' is printed for every 2 names counted.
My practice problem is actually using a text file with 90000 names and a different '*' count, but an example code and explanation of why it works would be greatly appreciated for my study.
Here's the beginning of my program, but like I said I'm not sure how to start this method whatsoever.
import javax.swing.JOptionPane;
import java.io.*;
public class P03Census {
String rec;
int ctr = 0;
public static void main(String[] args)throws IOException {
Object result = JOptionPane.showInputDialog(null, "Enter a file name\n(1990 to 2000)\nadd extension",
"Taylor Daggett", JOptionPane.PLAIN_MESSAGE);
String textDoc = (String) result;
File file = new File(textDoc);
System.out.println("-----------------------------------------------------------------------------------------");
System.out.println("File name: " +
file);
if (!textDoc.endsWith(".txt")) {
System.out.println("Usage: This is not a text file!");
System.exit(0);
} else if (!file.exists()) {
System.out.println("File not found!");
System.exit(0);
}
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String rec;
int lines = 0;
int i;
while((rec = br.readLine()) != null){
lines++;
}
System.out.println("Record count:"+lines);
System.out.println("------------------------------------------------------------------------------------------");
}
}

Here's an algorithm that would do what you want. It exploits the fact that you can use char variables as ints:
First, create an array int[] letterCount = new int[26], which you will use to count the letters.
Then, inside the body of your main while loop, convert the string rec into a String[] in which every element is a name. If the names in your input file are always separated by the same character (whitespace, for example), you could use String[] names = rec.split(" ").
Next, run through names in a for loop and check the first letter of each name: char firstLetter = names[i].charAt(0). Use it to increase the count of that letter by one in the array letterCount: letterCount[firstLetter - 'a']++;
At the end of the loop, letterCount should hold the right counts. Note that if your file contains capital letters, you have to lower-case the line at the start of the loop body with rec = rec.toLowerCase() (strings are immutable, so the result must be assigned); otherwise letterCount[firstLetter - 'a'] will throw an out-of-bounds error. If all names start with an upper-case letter, you can instead use letterCount[firstLetter - 'A'].
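The steps above can be put together as follows. This is a sketch with assumptions: the names LetterFrequency, countFirstLetters and formatRow are invented for illustration, the names come from an in-memory list rather than the census file, and the row layout imitates the example output in the question (one '*' per 2 names).

```java
import java.util.Arrays;
import java.util.List;

class LetterFrequency {
    // Counts how many names start with each letter a..z, ignoring case.
    static int[] countFirstLetters(List<String> names) {
        int[] letterCount = new int[26];
        for (String name : names) {
            if (name.isEmpty()) {
                continue;
            }
            char first = Character.toLowerCase(name.charAt(0));
            if (first >= 'a' && first <= 'z') {
                letterCount[first - 'a']++; // char arithmetic gives index 0..25
            }
        }
        return letterCount;
    }

    // Builds one output row such as "1 A 6 ***".
    static String formatRow(int index, int count) {
        StringBuilder stars = new StringBuilder();
        for (int i = 0; i < count / 2; i++) {
            stars.append('*');
        }
        return (index + 1) + " " + (char) ('A' + index) + " " + count + " " + stars;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Alice", "adam", "Bob", "anna", "Amy", "Abe", "Al");
        int[] counts = countFirstLetters(names);
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                System.out.println(formatRow(i, counts[i]));
            }
        }
    }
}
```

Checking that the first character is in the range a to z guards against the out-of-bounds error mentioned above when a line starts with a digit or punctuation.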

Reading two lines from an input file using Scanner

Hi, I'm in a programming class over the summer and am required to create a program that reads input from a file. The input file contains DNA sequences (ATCGAGG etc.); the first line states how many pairs of sequences need to be compared, and the rest are the pairs of sequences. In class we use Scanner to read lines from a file (I read about BufferedReader, but we have not covered it in class, so I'm not too familiar with it), but I am lost on how to write the code that compares two lines read with Scanner at the same time.
My attempt:
public static void main (String [] args) throws IOException
{
File inFile = new File ("dna.txt");
Scanner sc = new Scanner (inFile);
while (sc.hasNextLine())
{
int pairs = sc.nextLine();
String DNA1 = sc.nextLine();
String DNA2 = sc.nextLine();
comparison(DNA1,DNA2);
}
sc.close();
}
The comparison method would take a pair of sequences and output whether they have any common characters. Also, how would I proceed to read the next pair? Any insight would be helpful. I'm just stumped, and Google confused me even further. Thanks!
EDIT:
Here's the sample input
7
atgcatgcatgc
AtgcgAtgc
GGcaAtt
ggcaatt
GcT
gatt
aaaaaGTCAcccctccccc
GTCAaaaaccccgccccc
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
gctagtacACCT
gctattacGcct

First, why are you doing this:
while (sc.hasNextLine())
{
int pairs = sc.nextLine();
The number of pairs appears only once, on the first line of the file, but this loop reads it again on every iteration, before each pair of sequences. Move the read of pairs out of the while loop and parse it to an int. You can then use it to decide when to stop reading, since you know how many lines there are.
Second:
throws IOException
This might be beside the point, but you could handle the exception with try/catch instead of declaring it, or simply skip the bad input if you do not care about exceptions.
Comparison: String has an equals method with which you can compare two strings.
Google will not help you with a problem stated this broadly. Search for the basic pieces instead, for example "string comparison java", rather than typing "Reading two lines from an input file using Scanner" and expecting a complete solution. Go step by step and cut the problem into smaller pieces; that is how software developers work.
OK, I have a program that more or less worked for me. It finds the pairs of lines that have something in common and prints them, even when only part of a sequence matches. It is brute force, which is fine for a task like this:
import java.io.File;
import java.io.IOException;
import java.util.Scanner;
public class program
{
public static void main (String [] args) throws IOException
{
File inFile = new File ("c:\\dna.txt");
Scanner sc = new Scanner (inFile);
int pairs = Integer.parseInt(sc.nextLine());
for (int i = 0; i< pairs-1; i++)
{
//ok we have 7 pairs so we do not compare everything that is one under another
String DNA1 = sc.nextLine();
String DNA2 = sc.nextLine();
Boolean compareResult = comparison(DNA1,DNA2);
if (compareResult){
System.out.println("found the match in:" + DNA1 + " and " + DNA2) ;
}
}
sc.close();
}
public static Boolean comparison(String dna1, String dna2){
Boolean contains = false;
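// start at 1: an empty subsequence is contained in every string
for (int i = 1; i <= dna1.length(); i++)
{
// does dna2 contain the first i characters of dna1?
if (dna2.contains(dna1.subSequence(0, i)))
{
contains = true;
break;
}
// does dna2 contain the last i characters of dna1?
if (dna2.contains(dna1.subSequence(dna1.length()-i, dna1.length())))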
{
contains = true;
break;
}
}
return contains;
}
}
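For comparison, here is a compact sketch of the same read-the-count-then-the-pairs pattern. It assumes that "common characters" means the two sequences share at least one character regardless of case, which is one reading of the question and not something the asker confirmed. DnaPairs, shareCharacter and countMatchingPairs are hypothetical names, and the Scanner reads from a string so the example runs without dna.txt.

```java
import java.util.Scanner;

class DnaPairs {
    // True if the two sequences share at least one character, ignoring case.
    static boolean shareCharacter(String dna1, String dna2) {
        String other = dna2.toLowerCase();
        for (char c : dna1.toLowerCase().toCharArray()) {
            if (other.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Reads the pair count from the first line, then compares each pair.
    static int countMatchingPairs(Scanner sc) {
        int pairs = Integer.parseInt(sc.nextLine().trim());
        int matches = 0;
        for (int i = 0; i < pairs && sc.hasNextLine(); i++) {
            String dna1 = sc.nextLine();
            if (!sc.hasNextLine()) {
                break; // the file ended in the middle of a pair
            }
            String dna2 = sc.nextLine();
            if (shareCharacter(dna1, dna2)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner("2\nGcT\ngatt\naaaa\ncccc\n");
        System.out.println(countMatchingPairs(sc)); // 1
        sc.close();
    }
}
```

The hasNextLine() checks keep the loop from throwing if the file contains fewer sequence lines than the count on the first line promises.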

Finding the Length of a String within an ArrayList

I am writing a CSVParser program that separates words at commas. I am currently trying to find and record the longest word that appears in the file. Here is my class.
import csv.CSVParser;
import java.io.*;
import java.util.*;
public class RecordFormatter {
public static void main (String[] args) {
CSVParser parser = new CSVParser(new File (args[0]));
while (parser.hasNextLine()) {
ArrayList<String> ls = parser.getNextLine();
for (int i = 0; i<ls.size(); i++) {
System.out.print("|" + ls.get(i) + " ");
}
System.out.print("|");
System.out.println();
}
CSVParser parser1 = new CSVParser(new File (args[0]));
ArrayList<Integer> maxCol = new ArrayList<Integer>();
while (parser1.hasNextLine()) {
ArrayList<String> ls1 = parser1.getNextLine();
for (int i = 0; i<ls1.size(); i++) {
maxCol.add(ls1.get(i)); //Here is where my bug occurs.
}
}
}
}
I have created two CSVParsers and am trying to use the second of the two to record the length. I tried (as you can see above) storing the int length value of each word into another Arraylist, but I can't seem to get it to work. Any help would be much appreciated.

Without giving the solution, since this is homework...
Notice that you are keeping every length value rather than comparing the current length against a previous value to determine if it is longer and only then keeping it.
Seems like you need just a single maxLength Integer (or int) rather than a list since you just want the longest single word.
If you wanted the longest word per line, a List might then be appropriate.
Another option would be to use a sorted collection (Java's standard library has no SortedList, but a TreeSet<Integer> would do) and take the largest (last) value in it.
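For readers who want to check their own attempt against something, here is a sketch of the single-maximum idea. It deliberately avoids the CSVParser class from the question, whose API is not shown in full, and works on rows that are already parsed into lists; LongestWord and maxLength are illustrative names.

```java
import java.util.Arrays;
import java.util.List;

class LongestWord {
    // Returns the length of the longest word across all rows (0 if there are none).
    static int maxLength(List<List<String>> rows) {
        int maxLength = 0;
        for (List<String> row : rows) {
            for (String word : row) {
                // Keep a length only when it beats the best seen so far.
                if (word.length() > maxLength) {
                    maxLength = word.length();
                }
            }
        }
        return maxLength;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "name", "city"),
                Arrays.asList("1", "Alexandra", "Oslo"));
        System.out.println(maxLength(rows)); // 9
    }
}
```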
