I have to count the number of unique words in a text document using Java. First I had to strip the punctuation from every word. I used the Scanner class to read each word in the document and put it into an ArrayList<String>.
So, the next step is where I'm having the problem! How do I write a method that counts the number of unique Strings in the list?
For example, if the list contains apple, bob, apple, jim, bob, the number of unique values is 3.
public void countWords() {
    try {
        Scanner scan = new Scanner(in);
        while (scan.hasNext()) {
            String words = scan.next();
            // Strings are immutable: replace() returns a new String, so assign the result back.
            words = words.replace(".", "");
            words = words.replace("!", "");
            words = words.replace(":", "");
            words = words.replace(",", "");
            words = words.replace("?", "");
            words = words.replace("-", "");
            words = words.replace("‘", "");
            wordStore.add(words.toLowerCase());
        }
    } catch (FileNotFoundException e) {
        System.out.println("File Not Found");
    }
    System.out.println("The total number of words is: " + wordStore.size());
}
Are you allowed to use a Set? If so, a HashSet may solve your problem, because a HashSet doesn't accept duplicates.
HashSet<String> noDupSet = new HashSet<>();
noDupSet.add(yourString); // call this for every word
int uniqueCount = noDupSet.size();
The size() method returns the number of unique words.
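As a minimal sketch of the HashSet idea, assuming the words are already in a list like the question's wordStore (the class and method names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueWordCount {
    // Hypothetical helper: copying the list into a HashSet silently drops duplicates.
    static int countUnique(List<String> words) {
        Set<String> unique = new HashSet<>(words);
        return unique.size();
    }

    public static void main(String[] args) {
        List<String> wordStore = Arrays.asList("apple", "bob", "apple", "jim", "bob");
        System.out.println(countUnique(wordStore)); // prints 3
    }
}
```

The original list is left untouched, so you can still print the total word count from it afterwards.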
If you really have to use only an ArrayList, one way to do it is:
1) Create a temporary ArrayList.
2) Iterate over the original list and retrieve each element.
3) If the temporary list doesn't contain the element, add it.
4) When the loop finishes, the size of the temporary list is the number of unique words.
Starting with Java 8 you can use a Stream.
After you have added the elements to your ArrayList:
long n = wordStore.stream().distinct().count();
It converts your ArrayList to a stream and then it counts only the distinct elements.
I would advise using a HashSet. It automatically filters out duplicates when you call the add method.
Although I believe a set is the easiest solution, you can still keep your original approach and just add an if statement that checks whether the value already exists in the list before you add it.
if (!wordStore.contains(words.toLowerCase()))
    wordStore.add(words.toLowerCase());
Then the number of words in your list is the total number of unique words (i.e. wordStore.size()). Note that contains() scans the whole list, so this is slower than a set for large inputs.
This general-purpose solution takes advantage of the fact that the Set abstract data type does not allow duplicates. The Set.add() method is especially useful because it returns a boolean indicating whether the element was actually added. A HashMap tracks the number of occurrences of each element. The algorithm can be adapted to variations of this type of problem and runs in O(n) time on average.
public static void main(String args[])
{
String[] strArray = {"abc", "def", "mno", "xyz", "pqr", "xyz", "def"};
System.out.printf("RAW: %s ; PROCESSED: %s \n",Arrays.toString(strArray), duplicates(strArray).toString());
}
public static HashMap<String, Integer> duplicates(String arr[])
{
HashSet<String> distinctKeySet = new HashSet<String>();
HashMap<String, Integer> keyCountMap = new HashMap<String, Integer>();
for(int i = 0; i < arr.length; i++)
{
if(distinctKeySet.add(arr[i]))
keyCountMap.put(arr[i], 1); // unique value or first occurrence
else
keyCountMap.put(arr[i], (Integer)(keyCountMap.get(arr[i])) + 1);
}
return keyCountMap;
}
RESULTS:
RAW: [abc, def, mno, xyz, pqr, xyz, def] ; PROCESSED: {pqr=1, abc=1, def=2, xyz=2, mno=1}
You can also use a Hashtable or a HashMap. The keys would be your input strings and the values the number of times each string occurs in the input array. This takes O(N) time and space.
Solution 2:
Sort the input list.
Equal strings will then be next to each other.
Compare list(i) to list(i+1) and count the number of duplicates; the number of unique strings is the list size minus that count.
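A rough sketch of the sort-and-compare idea, assuming the input is a List<String> (the class and method names are hypothetical). Instead of counting duplicates, it counts how often a new value starts in the sorted order, which gives the same answer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedUniqueCount {
    // Hypothetical helper: sorts a copy, then counts positions where the value changes.
    static int countUnique(List<String> words) {
        if (words.isEmpty()) {
            return 0;
        }
        List<String> sorted = new ArrayList<>(words); // copy, so the caller's list keeps its order
        Collections.sort(sorted);
        int unique = 1; // the first element always starts a new run
        for (int i = 1; i < sorted.size(); i++) {
            if (!sorted.get(i).equals(sorted.get(i - 1))) {
                unique++;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        System.out.println(countUnique(Arrays.asList("apple", "bob", "apple", "jim", "bob"))); // prints 3
    }
}
```

This costs O(N log N) for the sort but needs no hash structure.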
A shorthand way to do it is as follows:
ArrayList<String> duplicateList = new ArrayList<String>();
duplicateList.add("one");
duplicateList.add("two");
duplicateList.add("one");
duplicateList.add("three");
System.out.println(duplicateList); // prints [one, two, one, three]
HashSet<String> uniqueSet = new HashSet<String>();
uniqueSet.addAll(duplicateList);
System.out.println(uniqueSet); // prints e.g. [two, one, three]; HashSet iteration order is not guaranteed
duplicateList.clear();
System.out.println(duplicateList);// prints []
duplicateList.addAll(uniqueSet);
System.out.println(duplicateList);// prints the same elements, in the set's order, e.g. [two, one, three]
public class UniqueinArrayList {
public static void main(String[] args) {
StringBuffer sb=new StringBuffer();
List al=new ArrayList();
al.add("Stack");
al.add("Stack");
al.add("over");
al.add("over");
al.add("flow");
al.add("flow");
System.out.println(al);
Set s=new LinkedHashSet(al);
System.out.println(s);
Iterator itr=s.iterator();
while(itr.hasNext()){
sb.append(itr.next()+" ");
}
System.out.println(sb.toString().trim());
}
}
3 distinct possible solutions:
Use HashSet as suggested above.
Create a temporary ArrayList and store only unique element like below:
public static int getUniqueElement(List<String> data) {
List<String> newList = new ArrayList<>();
for (String eachWord : data)
if (!newList.contains(eachWord))
newList.add(eachWord);
return newList.size();
}
Java 8 solution
long count = data.stream().distinct().count();
public class JavaApplication13 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
// TODO code application logic here
BufferedReader br;
String strLine;
ArrayList<String> arr =new ArrayList<>();
HashMap<Integer,ArrayList<String>> hm = new HashMap<>();
try {
br = new BufferedReader( new FileReader("words.txt"));
while( (strLine = br.readLine()) != null){
arr.add(strLine);
}
} catch (FileNotFoundException e) {
System.err.println("Unable to find the file: fileName");
} catch (IOException e) {
System.err.println("Unable to read the file: fileName");
}
ArrayList<Integer> lengths = new ArrayList<>(); //list of the word lengths seen so far
System.out.println("Total Words: "+arr.size()); //total words read from the file
int i=0;
while(i<arr.size()) //this loop iterates over all the words of the text file, which are now stored in arr
{
boolean already=false;
String s = arr.get(i);
//following for loop will check if that length is already in lengths list.
for(int x=0;x<lengths.size();x++)
{
if(s.length()==lengths.get(x))
already=true;
}
//already == true means the map already has an ArrayList for the current string's length
if(already==true)
{
hm.get(s.length()).add(s); //add the string to the list for its length in hm (the HashMap)
}
else
{
hm.put(s.length(),new ArrayList<>()); //create a new entry in hm for this length, then add the string to it
hm.get(s.length()).add(s);
lengths.add(s.length());
}
i++;
}
//Now print the whole map; the keys are word lengths, not 0..size-1, so iterate over the recorded lengths
for(int q=0;q<lengths.size();q++)
{
System.out.println(hm.get(lengths.get(q)));
}
}
}
Is this approach right?
Explanation:
Load all the words into an ArrayList.
Then iterate over each index, check the length of the word, and add the word to an ArrayList of strings of that length. These ArrayLists are stored in a HashMap keyed by the length of the words they contain.
Firstly, your code only works for files that contain one word per line, because you process whole lines as words. To make it more general, split each line into words:
String[] words = strLine.split("\\s+");
Secondly, you don't need any temporary data structures. You can add the words to the map right after you read each line from the file. The arr and lengths lists are redundant here, as they only provide temporary storage. You use the lengths list just to record which lengths have already been added to the hm map. The same check can be done by calling hm.containsKey(s.length()).
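To make those two points concrete, here is a sketch that fills the map while reading, with no arr or lengths lists. It assumes the input arrives through a BufferedReader; the class name, the method name and the StringReader used in place of words.txt are only for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupWhileReading {
    // Hypothetical helper: groups words by length as each line is read.
    static Map<Integer, List<String>> groupByLength(BufferedReader reader) {
        Map<Integer, List<String>> hm = new HashMap<>();
        try {
            String strLine;
            while ((strLine = reader.readLine()) != null) {
                for (String word : strLine.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue; // a blank line yields one empty token
                    }
                    if (!hm.containsKey(word.length())) {
                        hm.put(word.length(), new ArrayList<>());
                    }
                    hm.get(word.length()).add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hm;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the file so the sketch runs on its own.
        BufferedReader reader = new BufferedReader(new StringReader("a ab ad\nabc b"));
        System.out.println(groupByLength(reader));
    }
}
```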
And an additional comment on your code:
for(int x=0;x<lengths.size();x++) {
if(s.length()==lengths.get(x))
already=true;
}
When a loop like this only needs to find out whether some condition is true for any element, there is no need to keep looping once the condition has been met. Use the break keyword inside your if statement to terminate the loop, e.g.
for(int x=0;x<lengths.size();x++) {
    if(s.length()==lengths.get(x)) {
        already=true;
        break; // this will terminate the loop after setting the flag to true
    }
}
But as I already mentioned, you don't need this loop at all. This is just for educational purposes.
Your approach is long, confusing, hard to debug and from what I see it's not good performance-wise (check out the contains method). Check this:
String[] words = {"a", "ab", "ad", "abc", "af", "b", "dsadsa", "c", "ghh", "po"};
Map<Integer, List<String>> groupByLength =
Arrays.stream(words).collect(Collectors.groupingBy(String::length));
System.out.println(groupByLength);
This is just an example, but you get the point. I have an array of words, and I use streams and Java 8 to group them in a map by length (exactly what you're trying to do). You take the stream and collect it into a map, grouping by word length, so every one-letter word goes into a list under key 1, and so on.
You can use the same approach, but since your words are in a list, remember not to use Arrays.stream() but to call .stream() directly on your list.
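For the list case, a small sketch (the class and method names are hypothetical) would look like this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupListByLength {
    // Same grouping as the array example, but starting from a List.
    static Map<Integer, List<String>> groupByLength(List<String> words) {
        return words.stream().collect(Collectors.groupingBy(String::length));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "ab", "ad", "abc", "b");
        System.out.println(groupByLength(words));
    }
}
```

Within each list the words keep the order they had in the input list.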
We have to find all simple words from a bunch of simple and compound words. For example:
Input: chat, ever, snapchat, snap, salesperson, per, person, sales, son, whatsoever, what, so.
Output should be: chat, ever, snap, per, sales, son, what, so
My sample code:
private static String[] find(String[] words) {
// TODO Auto-generated method stub
//System.out.println();
ArrayList<String> alist = new ArrayList<String>();
Set<String> r1 = new HashSet<String>();
for(String s: words){
alist.add(s);
}
Collections.sort(alist,new Comparator<String>() {
public int compare(String o1, String o2) {
return o1.length()-o2.length();
}
});
//System.out.println(alist.toString());
int count= 0;
for(int i=0;i<alist.size();i++){
String check = alist.get(i);
r1.add(check);
for(int j=i+1;j<alist.size();j++){
String temp = alist.get(j);
//System.out.println(check+" "+temp);
if(temp.contains(check) ){
alist.remove(temp);
}
}
}
System.out.println(r1.toString());
String res[] = new String[r1.size()];
for(String i:words){
if(r1.contains(i)){
res[count++] = i;
}
}
return res;
}
I am unable to get the right result with the above code. Any suggestions or ideas?
A compound word is a concatenation of two or more words; all other words are considered simple words.
We have to remove all the compound words.
Algorithm
1) Read the input into a set of Strings, i.e. Set<String> input.
2) Create an empty set for simple words, i.e. Set<String> simpleWords.
3) Create an empty set for compound words, i.e. Set<String> compoundWords.
4) Iterate over input. For each element:
   a) Let the length of element be elemLength.
   b) Create a set Set<String> inputs of all Strings from input (excluding element) for which both of the following are true:
      - its length is less than elemLength
      - it is not present in compoundWords
   c) Create the set of all concatenations of Strings from inputs with a maximum length of elemLength, i.e. Set<String> currentPermutations.
   d) Check whether any member of currentPermutations equals element.
   e) If yes, add element to compoundWords.
   f) If no, continue with the iteration.
5) After the iteration is done, place all Strings from input that are not present in compoundWords into simpleWords.
That is your answer.
Before you start writing code, decide on the logic you are going to use. Use descriptive variable names and you are basically done.
The reason your logic is not working is the way you check temp.contains(check). This checks for a substring, not for a compound word as per your definition.
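Here is one possible sketch of a check that follows the definition (a compound word is a concatenation of two or more other input words). Note that it does not generate every concatenation as in the steps above; it swaps in a word-break style dynamic-programming check, which answers the same question per word. The class and method names are my own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleWords {
    // True if word can be split completely into two or more words from the dictionary.
    static boolean isCompound(String word, Set<String> dictionary) {
        int n = word.length();
        // reachable[i] is true when the first i characters can be built from dictionary words.
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (start == 0 && end == n) {
                    continue; // the whole word on its own does not count as a split
                }
                if (reachable[start] && dictionary.contains(word.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return n > 0 && reachable[n];
    }

    // Keeps the input order and drops every compound word.
    static List<String> findSimpleWords(List<String> words) {
        Set<String> dictionary = new HashSet<>(words);
        List<String> simple = new ArrayList<>();
        for (String word : words) {
            if (!isCompound(word, dictionary)) {
                simple.add(word);
            }
        }
        return simple;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("chat", "ever", "snapchat", "snap", "salesperson",
                "per", "person", "sales", "son", "whatsoever", "what", "so");
        System.out.println(findSimpleWords(input));
        // prints [chat, ever, snap, per, sales, son, what, so]
    }
}
```

Checking against all input words is enough: if a word can be built from compound words, it can also be built from the simple words those compounds consist of.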
I have two lists: word, which contains words (word is a copy of the list words), and existingGuesses, which contains characters. I want to compare them, i.e. check whether each character is present in the list word, by iterating with a for loop. Can anybody suggest how to do the comparison?
public List<String> getWordOptions(List<String> existingGuesses, String newGuess)
{
List<String> word = new ArrayList<String>(words);
/* String c = existingGuesses.get(0);
ListIterator<String> iterator = word.listIterator();
while(iterator.hasNext()){
if(word.contains(c))
{
word.remove(c);
}
}*/
for(String temp: word){
for(String cha: existingGuesses){
}
}
return null;
}
You can check whether the guesses are in the word list by using List#contains(Object):
for(String myGuess: existingGuesses){
if(word.contains(myGuess)) {
// Do what you want
}
}
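How about the following code? It makes a single pass over existingGuesses. Note that each contains/remove call scans the word list, so the total cost is O(N*M) rather than O(N).
public List<String> getWordOptions(List<String> existingGuesses, String newGuess) {
    List<String> word = new ArrayList<String>(words);
    for (String cha : existingGuesses) {
        if (word.contains(cha)) {
            word.remove(cha);
        }
    }
    return word; // return the remaining words instead of null
}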
If you want to compare them and remove the matches, you can use List#removeAll(anotherList). From the documentation:
Removes from this list all of its elements that are contained in the specified collection (optional operation).
(I got the clue from word.remove(c); in your commented-out code.)
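A short sketch of removeAll on a copy of the list, so the original words stay intact (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RemoveGuessed {
    // Returns a copy of words without any element that also appears in existingGuesses.
    static List<String> withoutGuesses(List<String> words, List<String> existingGuesses) {
        List<String> remaining = new ArrayList<>(words);
        remaining.removeAll(existingGuesses);
        return remaining;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c", "b");
        List<String> guesses = Arrays.asList("b");
        System.out.println(withoutGuesses(words, guesses)); // prints [a, c]
    }
}
```

Unlike remove(Object), which deletes only the first match, removeAll deletes every occurrence.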
You can use Collection.retainAll:
List<String> word=new ArrayList<String>();//fill list
List<String> existingGuesses=new ArrayList<String>();//fill list
List<String> existingWords=new ArrayList<String>(word);
existingWords.retainAll(existingGuesses);
//existingWords will only contain the words present in both the lists
System.out.println(existingWords);
I have an ArrayList with elements something like:
[string,has,was,hctam,gnirts,saw,match,sah]
I would like to delete the elements that repeat another element in reverse, such as string and gnirts: keep one (string) and delete the other (gnirts). How do I go about achieving this?
Edit: I would like to rephrase the question:
Given an ArrayList of strings, how does one go about deleting elements that are the reverse of another element?
Given the following input:
[string,has,was,hctam,gnirts,saw,match,sah]
How does one reach the following output:
[string,has,was,match]
Set<String> result = new HashSet<String>();
for(String word: words) {
if(result.contains(word) || result.contains(new StringBuffer(word).reverse().toString())) {
continue;
}
result.add(word);
}
// result
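The same idea as a runnable sketch, with two changes that are my own assumptions: a LinkedHashSet so the result keeps the input order, and StringBuilder instead of StringBuffer. The class and method names are made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DropReversed {
    // Keeps the first spelling seen and skips any later word that is its reverse (or an exact repeat).
    static List<String> dropReversed(List<String> words) {
        Set<String> kept = new LinkedHashSet<>();
        for (String word : words) {
            String reversed = new StringBuilder(word).reverse().toString();
            if (kept.contains(word) || kept.contains(reversed)) {
                continue;
            }
            kept.add(word);
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("string", "has", "was", "hctam", "gnirts", "saw", "match", "sah");
        System.out.println(dropReversed(words)); // prints [string, has, was, hctam]
    }
}
```

Because the first occurrence wins, this keeps hctam rather than match. Deciding which of the two spellings is the "real" word would need extra information, such as a dictionary.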
You can use a comparator that sorts the characters before checking them for equality. This means that compare("string", "gnirts") will return 0. Then use this comparator as you traverse through the list and copy the matching elements to a new list.
Another option (if you have a really large list) is to create an Anagram class that wraps a String (String is final, so it cannot be extended). Override hashCode and equals so that anagrams compare equal and produce the same hash code, then use a HashSet or HashMap of Anagram objects to check your ArrayList for anagrams.
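A lighter variation on the anagram idea, sketched under my own assumptions: instead of a dedicated class, use each word's sorted characters as a canonical key and remember the keys in a HashSet. The names below are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnagramFilter {
    // Canonical key: the word's characters in sorted order, so all anagrams share one key.
    static String key(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Keeps the first word for each key and drops later anagrams of it.
    static List<String> dropAnagrams(List<String> words) {
        Set<String> seenKeys = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (seenKeys.add(key(word))) { // add() returns false when the key was already present
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("string", "has", "was", "hctam", "gnirts", "saw", "match", "sah");
        System.out.println(dropAnagrams(words)); // prints [string, has, was, hctam]
    }
}
```

Keep in mind that this treats every anagram as a duplicate, not only exact reversals, which is broader than what the question asks for.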
HashSet<String> set = new HashSet<String>();
ArrayList<String> newlst = new ArrayList<String>();
for (String str : arraylst)
{
    // add() returns false if the string is already in the set, so each string is copied only once
    if (set.add(str))
        newlst.add(str);
}
To remove the duplicate items you can use a HashMap in which the key is the sum of the character codes of the word's letters and the value is the word itself. A word and its reverse contain the same letters, so they produce the same key. When a new word is added and its key already exists in the HashMap, the existing word under that key is replaced by the new one. The result is a HashMap of words without such repetitions. Be aware that the sum is not unique to a word: different words can produce the same sum (for example "ad" and "bc"), and one of them would then be dropped.
As for "string" looking better in the result than "gnirts": there may be cases where we cannot determine which word is better, so I have assumed that the final form of the word is not important; what matters is that there are no duplicates.
ArrayList<String> mainList = new ArrayList<String>();
mainList.add("string,has,was,hctam,gnirts,saw,match,sah");
String[] listChar = mainList.get(0).split(",");
HashMap <Integer, String> hm = new HashMap<Integer, String>();
for (String temp : listChar) {
int sumStr=0;
for (int i=0; i<temp.length(); i++)
sumStr += temp.charAt(i);
hm.put(sumStr, temp);
}
mainList=new ArrayList<String>();
Set<Map.Entry<Integer, String>> set = hm.entrySet();
for (Map.Entry<Integer, String> temp : set) {
mainList.add(temp.getValue());
}
System.out.println(mainList);
UPD:
1) The txt file needs to be saved in ANSI encoding.
First, I replaced Scanner with FileReader and BufferedReader:
String fileRStr = new String();
String stringTemp;
FileReader fileR = new FileReader("text.txt");
BufferedReader streamIn = new BufferedReader(fileR);
while ((stringTemp = streamIn.readLine()) != null)
fileRStr += stringTemp;
fileR.close();
mainList.add(fileRStr);
In addition, all the words in the file must be separated by commas, because the line is split into words by split(",").
If your words are separated by another character, replace the comma with that character in the following line:
String[] listChar = mainList.get(0).split(",");
Using an MS-DOS window I am piping in an amazon.txt file.
I am trying to use the collections framework. Keep in mind I want to keep this as simple as possible.
What I want to do is count all the unique words in the file, with no duplicates.
This is what I have so far. Please be kind, this is my first Java project.
import java.util.Scanner;
import java.util.ArrayList;
import java.util.Iterator;
public class project1 {
// ArrayList<String> a = new ArrayList<String>();
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
String word;
String grab;
int count = 0;
ArrayList<String> a = new ArrayList<String>();
// Iterator<String> it = a.iterator();
System.out.println("Java project\n");
while (sc.hasNext()) {
word = sc.next();
a.add(word);
if (word.equals("---")) {
break;
}
}
Iterator<String> it = a.iterator();
while (it.hasNext()) {
grab = it.next();
if (grab.contains("a")) {
System.out.println(grab); // Just a check to see; calling it.next() again here would skip a word
count++;
}
}
System.out.println("I counted abc = ");
System.out.println(count);
System.out.println("\nbye...");
}
}
In your version, the word list a will contain all words, including duplicates. You can either
(a) check for every new word whether it is already in the list (List#contains is the method to call), or, the recommended solution,
(b) replace ArrayList<String> with TreeSet<String>. This eliminates duplicates automatically and stores the words in alphabetical order.
Edit
If you want to count the unique words, do the same as above; the desired result is then the collection's size. So if you entered the sequence "a a b c ---", the result would be 3, as there are three unique words (a, b and c).
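A minimal sketch of option (b), assuming the same "---" stop marker as in the question. The class and method names are made up, and the marker is checked before adding so it is not counted as a word:

```java
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;

public class UniqueFromInput {
    // Reads words until the "---" marker (or end of input) and keeps each word once, sorted.
    static Set<String> readUnique(Scanner sc) {
        Set<String> unique = new TreeSet<>();
        while (sc.hasNext()) {
            String word = sc.next();
            if (word.equals("---")) {
                break; // stop marker; deliberately not stored
            }
            unique.add(word);
        }
        return unique;
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for System.in so the sketch runs on its own.
        Set<String> unique = readUnique(new Scanner("a a b c ---"));
        System.out.println(unique.size()); // prints 3
    }
}
```

To read the piped file instead, pass new Scanner(System.in) as in the original program.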
Instead of ArrayList<String>, use HashSet<String> (not sorted) or TreeSet<String> (sorted) if you don't need a count of how often each word occurs, Hashtable<String,Integer> (not sorted) or TreeMap<String,Integer> (sorted) if you do.
If there are words you don't want, place those in a HashSet<String> and check that this doesn't contain the word your Scanner found before placing into your collection. If you only want dictionary words, put your dictionary in a HashSet<String> and check that it contains the word your Scanner found before placing into your collection.
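Putting both suggestions together, a sketch might look like this: a HashSet<String> of unwanted words acts as the filter and a TreeMap<String,Integer> keeps sorted counts. The names and the sample words are only for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class FilteredWordCount {
    // Counts how often each word occurs, skipping anything listed in unwanted.
    static Map<String, Integer> count(List<String> words, Set<String> unwanted) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            if (unwanted.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum); // start at 1, otherwise add 1 to the stored count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "cat", "the", "dog", "cat");
        Set<String> unwanted = new HashSet<>(Arrays.asList("the"));
        System.out.println(count(words, unwanted)); // prints {cat=2, dog=1}
    }
}
```

The number of unique words is then counts.size(). For the dictionary case, invert the test and skip every word the dictionary set does not contain.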