Find unique words in a file - Java

Using an MS-DOS window, I am piping in an amazon.txt file.
I am trying to use the collections framework. Keep in mind I want to keep this
as simple as possible.
What I want to do is count all the unique words in the file, with no duplicates.
This is what I have so far. Please be kind, this is my first Java project.
import java.util.Scanner;
import java.util.ArrayList;
import java.util.Iterator;
public class project1 {
// ArrayList<String> a = new ArrayList<String>();
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
String word;
String grab;
int count = 0;
ArrayList<String> a = new ArrayList<String>();
// Iterator<String> it = a.iterator();
System.out.println("Java project\n");
while (sc.hasNext()) {
word = sc.next();
a.add(word);
if (word.equals("---")) {
break;
}
}
Iterator<String> it = a.iterator();
while (it.hasNext()) {
grab = it.next();
if (grab.contains("a")) {
System.out.println(grab); // Just a check to see
count++;
}
}
System.out.println("I counted abc = ");
System.out.println(count);
System.out.println("\nbye...");
}
}

In your version, the word list a will contain all words, including duplicates. You can either
(a) check for every new word whether it is already in the list (List#contains is the method to call), or, the recommended solution,
(b) replace ArrayList<String> with TreeSet<String>. This eliminates duplicates automatically and stores the words in alphabetical order.
Edit
If you want to count the unique words, do the same as above; the desired result is the collection's size. So if you entered the sequence "a a b c ---", the result would be 3, as there are three unique words (a, b and c), provided the "---" end marker itself is not added to the collection.
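A minimal sketch of option (b), assuming the input ends with the "---" marker used in the question. The class and method names are made up for illustration, and a fixed string stands in for the piped file.

```java
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;

class UniqueWordCount {
    // Collects words up to (but not including) the "---" end marker.
    static Set<String> uniqueWords(Scanner sc) {
        Set<String> words = new TreeSet<String>();
        while (sc.hasNext()) {
            String word = sc.next();
            if (word.equals("---")) {
                break; // stop before the marker is stored
            }
            words.add(word); // adding a duplicate leaves the set unchanged
        }
        return words;
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for System.in here.
        Set<String> words = uniqueWords(new Scanner("a a b c ---"));
        System.out.println(words);        // [a, b, c]
        System.out.println(words.size()); // 3
    }
}
```

To read the piped file instead, pass new Scanner(System.in) to uniqueWords.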

Instead of ArrayList<String>, use HashSet<String> (not sorted) or TreeSet<String> (sorted) if you don't need a count of how often each word occurs, Hashtable<String,Integer> (not sorted) or TreeMap<String,Integer> (sorted) if you do.
If there are words you don't want, place those in a HashSet<String> and check that this doesn't contain the word your Scanner found before placing into your collection. If you only want dictionary words, put your dictionary in a HashSet<String> and check that it contains the word your Scanner found before placing into your collection.
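Both ideas can be combined in a rough sketch: a HashSet of unwanted words plus a TreeMap that counts the rest. The stop-word list, the class name and the lower-casing are assumptions for illustration, not part of the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeMap;

class WordFrequency {
    // Hypothetical stop-word list; replace it with the words you want to skip.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "the", "this"));

    // Returns each remaining word with the number of times it occurred, sorted by word.
    static Map<String, Integer> countWords(Scanner sc) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        while (sc.hasNext()) {
            String word = sc.next().toLowerCase();
            if (STOP_WORDS.contains(word)) {
                continue; // unwanted word, do not store it
            }
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords(new Scanner("The cat saw this cat and a dog"));
        System.out.println(counts);        // {and=1, cat=2, dog=1, saw=1}
        System.out.println(counts.size()); // 4 unique words
    }
}
```

If only the number of unique words matters, counts.size() (or a plain Set) is enough; the map is only needed for per-word totals.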

Related

Ignore numbers in a text file when scanning it in Java

I am doing an assignment in Java that requires us to read two different files. One has the top 1000 boy names, and the other contains the top 1000 girl names. We have to write a program that returns all of the names that are in both files. We have to read each boy and girl name as a String, ignoring the number of namings, and add it to a HashSet. When adding to a HashSet, the add method will return false if the name to be added already exists in the HashSet. So to find the common names, you just have to keep track of which names returned false when adding. My problem is that I can't figure out how to ignore the number of namings in each file. My HashSet contains both, and I just want the names.
Here is what I have so far.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
public class Names {
public static void main(String[] args) {
Set<String> boynames = new HashSet<String>();
Set<String> girlnames = new HashSet<String>();
boynames = loadBoynames();
System.out.println(girlnames);
}
private static Set<String> loadBoynames() {
HashSet<String> d = new HashSet<String>();
File names = new File("boynames.txt");
Scanner s = null;
try {
s = new Scanner(names);
} catch (FileNotFoundException e) {
System.out.println("Can't find boy names file.");
System.exit(1);
}
while(s.hasNext()){
String currentName = s.next();
d.add(currentName.toUpperCase());
}
return d;
}
}
My plan is to take the HashSet that I currently have and add the girl names to it, but before I do I need to not have the numbers in my HashSet.
I tried to skip numbers with this code, but it just spat out errors
while(s.hasNextLine()){
if (s.hasNextInt()){
number = s.nextInt();
}else{
String currentName = s.next();
d.add(currentName.toUpperCase());
}
}
Any help would be appreciated.

You could also use regex to replace all numbers (or more special chars if needed)
testStr = testStr.replaceAll("\\d","");

Try using the StreamTokenizer class (java.io) to read the file. It splits the file into tokens and also reports the type of each token (a word as a String, a number as a double, end of file, end of line), so you can easily identify the String tokens.
You can find the details here:
http://docs.oracle.com/javase/6/docs/api/java/io/StreamTokenizer.html
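A small sketch of the StreamTokenizer approach, assuming each line of the file holds a name followed by a count. The sample data and the class name are invented, and StringReader stands in for the real files.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;

class NameLoader {
    // Keeps only word tokens; number tokens (the naming counts) are skipped.
    static Set<String> loadNames(Reader in) throws IOException {
        Set<String> names = new TreeSet<String>();
        StreamTokenizer st = new StreamTokenizer(in);
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                names.add(st.sval.toUpperCase());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data in an assumed "name count" layout.
        Set<String> boys = loadNames(new StringReader("Jordan 500\nMichael 26991\n"));
        Set<String> girls = loadNames(new StringReader("Emily 25021\nJordan 300\n"));
        boys.retainAll(girls); // keep only the names present in both sets
        System.out.println(boys); // [JORDAN]
    }
}
```

For the assignment, new FileReader("boynames.txt") could be passed in place of the StringReader. Note that with the default settings StreamTokenizer treats characters such as ' and / specially, so names containing them would need extra configuration.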

How to remove a word from a linked list that has a vowel as the first character in Java

This is the first time I have had to use a LinkedList. I understand how to set one up and how to iterate through it properly. The problem is that I am unsure how to combine this with checking whether the first letter of a word is a vowel and, if so, removing that word from the list. Here is my code so far:
import java.util.*;
public class LinkedListExample
{
public static void main(String args[])
{
//Linked List Declaration
LinkedList<String> linkedlist = new LinkedList<String>();
Scanner sc=new Scanner(System.in);
for(int i = 0; i<4; i++)//filling the list
{
System.out.println("What is your word?");
String yourValue = sc.next();
linkedlist.add(yourValue);
sc.nextLine();
}
Iterator<String> i = linkedlist.iterator();
while (i.hasNext())
{
String vowels = "aeiouy";
//Need to remove the words with the vowels as the first letter here
}
while(i.hasNext())//printing out new list
{
System.out.println(i.next());
}
}
}
I know I must use a for loop to make this work. My first thought was using a for loop to check against my string vowels, but I was unsure how to make that work with a linked list. I am also unsure how to remove something here while using an iterator to iterate through the linked list.

List<String> filteredList = list.stream().filter(n->n.startsWith("a")||n.startsWith("e")||n.startsWith("i")||n.startsWith("o")||n.startsWith("u")).collect(Collectors.toList());
List<String> unique = new ArrayList<String>(list);
unique.removeAll(filteredList);
unique.forEach(System.out::println);
Here I built a list of the words that start with a, e, i, o or u, then copied all elements into a second list and removed the filtered words from it. The resulting unique list is the one you need. I hope this helps. Cheers.

while (i.hasNext())
{
String vowels = "aeiouy";
boolean found = false;
String str = i.next();
// compare the first letter of the word against every vowel
for(int counter = 0; counter < vowels.length(); counter++)
{
if(vowels.charAt(counter) == str.charAt(0)) {
found = true;
break;
}
}
if(found) { i.remove(); } // remove the current word through the iterator
}
EDIT:
After that, and before printing the new values, you have to reinitialize the iterator by doing: i = linkedlist.iterator();. Otherwise the second loop uses an exhausted iterator and prints nothing. Pay attention to that. :)
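For reference, a self-contained sketch of removing elements during iteration with Iterator.remove(). The class name and sample words are made up, and the check is case-insensitive, which the question did not ask for.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class RemoveVowelWords {
    // Removes every word whose first letter is a vowel ('y' included, as in the question).
    static void removeVowelWords(List<String> words) {
        String vowels = "aeiouy";
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            String word = it.next();
            if (!word.isEmpty()
                    && vowels.indexOf(Character.toLowerCase(word.charAt(0))) >= 0) {
                it.remove(); // removes the element last returned by next()
            }
        }
    }

    public static void main(String[] args) {
        List<String> list = new LinkedList<String>(
                Arrays.asList("apple", "banana", "Orange", "kiwi"));
        removeVowelWords(list);
        System.out.println(list); // [banana, kiwi]
    }
}
```

Calling list.remove(...) directly inside the loop would typically throw a ConcurrentModificationException, which is why the removal goes through the iterator. On Java 8 and later, Collection.removeIf with a predicate is a shorter alternative.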

How to Count Unique Values in an ArrayList?

I have to count the number of unique words in a text document using Java. First I had to get rid of the punctuation in all of the words. I used the Scanner class to scan each word in the document and put it in a String ArrayList.
So, the next step is where I'm having the problem! How do I create a method that can count the number of unique Strings in the array?
For example, if the array contains apple, bob, apple, jim, bob; the number of unique values in this array is 3.
public void countWords() {
try {
Scanner scan = new Scanner(in);
while (scan.hasNext()) {
String words = scan.next();
// Strings are immutable: replace() returns a new String, so the result must be stored
if (words.contains(".")) {
words = words.replace(".", "");
}
if (words.contains("!")) {
words = words.replace("!", "");
}
if (words.contains(":")) {
words = words.replace(":", "");
}
if (words.contains(",")) {
words = words.replace(",", "");
}
if (words.contains("?")) {
words = words.replace("?", "");
}
if (words.contains("-")) {
words = words.replace("-", "");
}
if (words.contains("‘")) {
words = words.replace("‘", "");
}
wordStore.add(words.toLowerCase());
}
} catch (FileNotFoundException e) {
System.out.println("File Not Found");
}
System.out.println("The total number of words is: " + wordStore.size());
}

Are you allowed to use Set? If so, HashSet may solve your problem. HashSet doesn't accept duplicates.
HashSet<String> noDupSet = new HashSet<String>();
noDupSet.add(yourString);
noDupSet.size();
The size() method returns the number of unique words.
If you really have to use ArrayList only, one way to achieve this is:
1) Create a temp ArrayList
2) Iterate original list and retrieve element
3) If tempArrayList doesn't contain element, add element to tempArrayList

Starting from Java 8 you can use Stream:
After you add the elements in your ArrayList:
long n = wordStore.stream().distinct().count();
It converts your ArrayList to a stream and then it counts only the distinct elements.

I would advise using HashSet. It automatically filters out duplicates when you call the add method.

Although I believe a set is the easiest solution, you can still use your original solution and just add an if statement to check whether the value already exists in the list before you add it.
if (!wordStore.contains(words.toLowerCase()))
wordStore.add(words.toLowerCase());
Then the number of words in your list is the total number of unique words (i.e. wordStore.size()).

This general-purpose solution takes advantage of the fact that the Set abstract data type does not allow duplicates. The Set.add() method is particularly useful here because it returns a boolean flag indicating whether the element was actually added. A HashMap is used to track the number of occurrences of each original element. The algorithm can be adapted for variations of this type of problem, and it runs in O(n) expected time.
public static void main(String args[])
{
String[] strArray = {"abc", "def", "mno", "xyz", "pqr", "xyz", "def"};
System.out.printf("RAW: %s ; PROCESSED: %s \n",Arrays.toString(strArray), duplicates(strArray).toString());
}
public static HashMap<String, Integer> duplicates(String arr[])
{
HashSet<String> distinctKeySet = new HashSet<String>();
HashMap<String, Integer> keyCountMap = new HashMap<String, Integer>();
for(int i = 0; i < arr.length; i++)
{
if(distinctKeySet.add(arr[i]))
keyCountMap.put(arr[i], 1); // unique value or first occurrence
else
keyCountMap.put(arr[i], (Integer)(keyCountMap.get(arr[i])) + 1);
}
return keyCountMap;
}
RESULTS:
RAW: [abc, def, mno, xyz, pqr, xyz, def] ; PROCESSED: {pqr=1, abc=1, def=2, xyz=2, mno=1}

You can create a Hashtable or HashMap as well. The keys would be your input strings and each value would be the number of times that string occurs in your input array. O(N) time and space.
Solution 2:
Sort the input list.
Similar strings would be next to each other.
Compare list(i) to list(i+1) and count the number of duplicates.
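Solution 2 can be sketched as follows. This version counts the unique values rather than the duplicates, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedUniqueCount {
    // Sorts a copy of the list, then counts positions where the value changes.
    static int countUnique(List<String> words) {
        List<String> sorted = new ArrayList<String>(words); // leave the input untouched
        Collections.sort(sorted);
        int unique = 0;
        for (int i = 0; i < sorted.size(); i++) {
            // after sorting, equal strings are adjacent, so only the first of each run counts
            if (i == 0 || !sorted.get(i).equals(sorted.get(i - 1))) {
                unique++;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "bob", "apple", "jim", "bob");
        System.out.println(countUnique(words)); // 3
    }
}
```

Sorting costs O(n log n), so this is slower than the hash-based approach for large inputs, but it needs no extra data structure beyond the sorted copy.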

In shorthand way you can do it as follows...
ArrayList<String> duplicateList = new ArrayList<String>();
duplicateList.add("one");
duplicateList.add("two");
duplicateList.add("one");
duplicateList.add("three");
System.out.println(duplicateList); // prints [one, two, one, three]
HashSet<String> uniqueSet = new HashSet<String>();
uniqueSet.addAll(duplicateList);
System.out.println(uniqueSet); // prints e.g. [two, one, three] (HashSet order is not guaranteed)
duplicateList.clear();
System.out.println(duplicateList);// prints []
duplicateList.addAll(uniqueSet);
System.out.println(duplicateList);// prints the unique words, in the set's order

import java.util.*;
public class UniqueinArrayList {
public static void main(String[] args) {
StringBuilder sb=new StringBuilder();
List<String> al=new ArrayList<String>();
al.add("Stack");
al.add("Stack");
al.add("over");
al.add("over");
al.add("flow");
al.add("flow");
System.out.println(al);
// LinkedHashSet drops duplicates and keeps insertion order
Set<String> s=new LinkedHashSet<String>(al);
System.out.println(s);
Iterator<String> itr=s.iterator();
while(itr.hasNext()){
sb.append(itr.next()+" ");
}
System.out.println(sb.toString().trim());
}
}

3 distinct possible solutions:
1) Use HashSet as suggested above.
2) Create a temporary ArrayList and store only unique elements, like below:
public static int getUniqueElement(List<String> data) {
List<String> newList = new ArrayList<>();
for (String eachWord : data)
if (!newList.contains(eachWord))
newList.add(eachWord);
return newList.size();
}
3) Java 8 solution:
long count = data.stream().distinct().count();

string compare in java

I have an ArrayList with elements something like:
[string,has,was,hctam,gnirts,saw,match,sah]
I would like to delete the entries that repeat in reversed form: for a pair such as string and gnirts, keep one and delete the other (gnirts). How do I go about achieving this?
Edit: I would like to rephrase the question:
Given an ArrayList of strings, how does one go about deleting elements that are the reverse of another element?
Given the following input:
[string,has,was,hctam,gnirts,saw,match,sah]
How does one reach the following output:
[string,has,was,match]

Set<String> result = new HashSet<String>();
for(String word: words) {
if(result.contains(word) || result.contains(new StringBuffer(word).reverse().toString())) {
continue;
}
result.add(word);
}
// result

You can use a comparator that sorts the characters of each string before checking them for equality. This means that compare("string", "gnirts") will return 0. Note that this matches any anagram, not only exact reversals. Then use this comparator as you traverse the list and copy the matching elements to a new list.
Another option (if you have a really large list) is to create an Anagram class that wraps a String (String is final, so it cannot be extended). Override hashCode and equals so that anagrams are equal and produce the same hash code, then use a HashSet or HashMap of Anagram objects to check your ArrayList for anagrams.
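The hashing idea can be approximated without a dedicated class by using each word's sorted characters as a canonical key. This is only a sketch with made-up names; like the comparator, it treats every anagram as a duplicate, which is broader than exact reversal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnagramFilter {
    // Words made of the same letters share the same key.
    static String key(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Keeps the first word seen for each key and drops later anagrams of it.
    static List<String> dropAnagrams(List<String> words) {
        Set<String> seen = new HashSet<String>();
        List<String> kept = new ArrayList<String>();
        for (String word : words) {
            if (seen.add(key(word))) { // add() returns false if the key was already present
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "string", "has", "was", "hctam", "gnirts", "saw", "match", "sah");
        System.out.println(dropAnagrams(words)); // [string, has, was, hctam]
    }
}
```

Because the first occurrence wins, hctam is kept here instead of match; choosing which spelling to keep would need an additional rule that the question does not define.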

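HashSet<String> set = new HashSet<String>();
ArrayList<String> newlst = new ArrayList<String>();
for (String str : arraylst)
{
String reversed = new StringBuilder(str).reverse().toString();
// keep a word only if neither it nor its reverse has been seen before
if(!set.contains(str) && !set.contains(reversed))
{
set.add(str);
newlst.add(str);
}
}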

To remove the duplicate items you can use a HashMap in which the key is the sum of the character codes of a word and the value is the word itself. A word and its reverse always have the same sum, so when a new word has the same sum as an existing key, the stored word is replaced by the new one. The result is a collection of words without reversed repetitions. Be aware that this is only a heuristic: unrelated words can share the same sum (for example "ad" and "bc"), in which case one of them is dropped as well.
As for the observation that "string" looks better than "gnirts" in the result: there may be cases where we cannot determine which word is better, so I assumed that the final form of the word is not important, only that there are no duplicates.
ArrayList<String> mainList = new ArrayList<String>();
mainList.add("string,has,was,hctam,gnirts,saw,match,sah");
String[] listChar = mainList.get(0).split(",");
HashMap <Integer, String> hm = new HashMap<Integer, String>();
for (String temp : listChar) {
int sumStr=0;
for (int i=0; i<temp.length(); i++)
sumStr += temp.charAt(i);
hm.put(sumStr, temp);
}
mainList=new ArrayList<String>();
Set<Map.Entry<Integer, String>> set = hm.entrySet();
for (Map.Entry<Integer, String> temp : set) {
mainList.add(temp.getValue());
}
System.out.println(mainList);
UPD:
1) The text file must be saved in ANSI encoding.
To begin with, I replaced Scanner with FileReader and BufferedReader:
String fileRStr = "";
String stringTemp;
FileReader fileR = new FileReader("text.txt");
BufferedReader streamIn = new BufferedReader(fileR);
while ((stringTemp = streamIn.readLine()) != null)
fileRStr += stringTemp;
streamIn.close();
mainList.add(fileRStr);
In addition, all the words in the file must be separated by commas, because the source line is split into words by split(",").
If your words are separated by another character, replace the comma with that character in the following line:
String[] listChar = mainList.get(0).split(",");

add file to collections framework from file

I am not trying to duplicate threads here.
My problem is that I am piping in a file called amazon.txt using MS-DOS.
The file has 637 words in it.
I want a count of unique words, and not a count of "a", "the", "this",
which I haven't accounted for yet in the code.
When I add the words to a TreeSet, it only has 8 words.
There should be at least 300 unique words.
count of total file = 637
count2 of treeset = 8
I thought TreeSet handles duplicates? What am I doing wrong?
The file does contain some ints and $ signs.
import java.util.Scanner;
import java.util.ArrayList;
import java.util.TreeSet;
import java.util.Iterator;
import java.util.HashSet;
public class practice1
{
public static void main(String[] args)
{
Scanner sc = new Scanner(System.in);
String word;
//String grab;
int count = 0;
int count2 =0;
int count3 =0;
int count4 =0;
int number;
//ArrayList<String> a = new ArrayList<String>();
TreeSet<String> a = new TreeSet<String>();
while (sc.hasNext())
{
word = sc.next();
count++; // 637 words
a.add(word);
if (word.equals("---"))
{
break;
}
}
Iterator<String> it = a.iterator();
while(it.hasNext())
{
String grab = it.next();
count2++; // 8 words
if (grab.equals("---"))
{
break;
}
}
System.out.println("count2");
System.out.println(count2);
System.out.println("count");
System.out.println(count);
System.out.println("\nbye...");
}
}

Your method for counting the number of entries in the TreeSet is to iterate over the Set and stop counting when you first see the string "---".
This isn't correct. You are probably assuming that TreeSet.iterator() returns the entries in the same order in which they were inserted. That isn't the case:
The elements are ordered using their natural ordering, or by a Comparator provided at set creation time, depending on which constructor is used.
"Natural ordering" here means the results of String.compareTo(String) (since String implements Comparable<String>), which tests for lexicographical order. In other words, the iterator of a TreeSet<String> returns the items in lexicographical (roughly alphabetical) order, not in insertion order.
If you want to know the size of your Set, just use size().
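A tiny illustration of the ordering, with invented sample words: the "---" marker is inserted fourth but comes out of the iterator first, because '-' sorts before letters.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class TreeSetOrder {
    // Builds a TreeSet from the given words; duplicates collapse automatically.
    static Set<String> build(String... words) {
        return new TreeSet<String>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        Set<String> set = build("the", "quick", "fox", "---", "the");
        System.out.println(set);        // [---, fox, quick, the]
        System.out.println(set.size()); // 4
    }
}
```

In the question's file, only tokens that sort before "---" (for example ones starting with $) would be visited before the loop breaks, which would be consistent with the count of 8.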

I don't see anywhere where you are adding the word into the TreeSet 'a'.
If I'm just missing that (and I might be) I'd bet the problem is that a TreeSet is not guaranteed to iterate in the order of insertion. That is, you add "---" last but there's no reason it won't come out of the iterator 8th and terminate your program.
So I'd say get rid of the check where you see if the iterator returns "---" and see where that gets you.
Had time to verify, change:
if (grab.equals("---"))
{
break;
}
to:
if (grab.equals("---"))
{
//break;
}
and it works as expected.
Good luck!

There is no need to iterate a second time; just replace the second loop with
System.out.println("Treeset.size():" + a.size() );
and do not add "---" to the TreeSet in the first loop (assuming it is some kind of end-of-file marker):
if (word.equals("---"))
{
break;
}
a.add(word);
