remove repeated words from String Array

Good morning,
I wrote a function that calculates the frequency of a term:
public static int tfCalculator(String[] totalterms, String termToCheck) {
int count = 0; //to count the overall occurrence of the term termToCheck
for (String s : totalterms) {
if (s.equalsIgnoreCase(termToCheck)) {
count++;
}
}
return count;
}
After that, I use it in the code below to compute the frequency of every word in a String[] words:
for(String word:words){
int freq = tfCalculator(words, word);
System.out.println(word + "|" + freq);
mm+=word + "|" + freq+"\n";
}
The problem is that repeated words are printed more than once. For example, the result is:
cytoskeletal|2
network|1
enable|1
equal|1
spindle|1
cytoskeletal|2
...
...
Can someone help me remove the repeated words so that the result looks like this:
cytoskeletal|2
network|1
enable|1
equal|1
spindle|1
...
...
Thank you very much!

Java 8 solution
words = Arrays.stream(words).distinct().toArray(String[]::new);
The distinct method removes duplicates; words is replaced with a new array that contains each word once, in order of first occurrence.

I think you want to print the frequency of each string in the array totalterms. Using a Map is an easier solution: a single traversal of the array stores the frequency of every string. Check the following implementation.
public static void printFrequency(String[] totalterms)
{
Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
for (String string : totalterms) {
if(frequencyMap.containsKey(string))
{
Integer count = frequencyMap.get(string);
frequencyMap.put(string, count+1);
}
else
{
frequencyMap.put(string, 1);
}
}
Set <Entry<String, Integer>> elements= frequencyMap.entrySet();
for (Entry<String, Integer> entry : elements) {
System.out.println(entry.getKey()+"|"+entry.getValue());
}
}
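The same single-pass idea can be written more compactly with Map.merge (Java 8). This is only a minimal sketch, not the answerer's code: the class name TermFrequency and the method countTerms are made up for illustration, the LinkedHashMap is an assumption chosen to keep words in first-seen order, and the lowercasing is an assumption that mirrors the equalsIgnoreCase comparison in the question.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, named for illustration only.
class TermFrequency {
    // Counts every term in one pass over the array.
    static Map<String, Integer> countTerms(String[] terms) {
        // LinkedHashMap iterates in insertion order, so words come out in first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : terms) {
            // Assumption: case-insensitive counting, like equalsIgnoreCase in the question.
            String key = term.toLowerCase(Locale.ROOT);
            // merge() stores 1 for a new key, otherwise adds 1 to the existing count.
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"cytoskeletal", "network", "enable", "equal", "spindle", "cytoskeletal"};
        countTerms(words).forEach((word, freq) -> System.out.println(word + "|" + freq));
    }
}
```

Because each distinct word is a single key in the map, every word is printed exactly once with its total.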

You can just use a HashSet and that should take care of the duplicates issue:
words = new HashSet<String>(Arrays.asList(words)).toArray(new String[0]);
This converts your array to a List, passes it to the HashSet<String> constructor, and converts the result back to an array. Note that a HashSet does not guarantee that the words keep their original order.
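If the original order of the words matters, a LinkedHashSet can be swapped in for the HashSet. The sketch below is a variation on the one-liner above, not part of the original answer; the class name OrderedDedup and the method distinctInOrder are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;

// Hypothetical helper class, named for illustration only.
class OrderedDedup {
    // Removes duplicates while keeping the first occurrence of each word in place.
    static String[] distinctInOrder(String[] words) {
        // LinkedHashSet ignores repeated elements and remembers insertion order.
        return new LinkedHashSet<>(Arrays.asList(words)).toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = {"cytoskeletal", "network", "enable", "cytoskeletal"};
        System.out.println(Arrays.toString(distinctInOrder(words)));
    }
}
```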

Sort the array, then you can just count equal adjacent elements:
Arrays.sort(totalterms);
int i = 0;
while (i < totalterms.length) {
int start = i;
while (i < totalterms.length && totalterms[i].equals(totalterms[start])) {
++i;
}
System.out.println(totalterms[start] + "|" + (i - start));
}

In two lines:
String s = "cytoskeletal|2 - network|1 - enable|1 - equal|1 - spindle|1 - cytoskeletal|2";
System.out.println(new LinkedHashSet<String>(Arrays.asList(s.split(" - "))).toString().replaceAll("(^\\[|\\]$)", "").replace(", ", " - "));

Your code is fine; you just need to keep track of which words have been encountered already. For that you can keep a running set:
Set<String> prevWords = new HashSet<>();
for(String word:words){
// proceed if word is new to the set, otherwise skip
if (prevWords.add(word)) {
int freq = tfCalculator(words, word);
System.out.println(word + "|" + freq);
mm+=word + "|" + freq+"\n";
}
}

Related

Scanning string for keywords of various lengths

I want to scan my document, split into an array of words, for certain keywords such as 'Fuel', 'Vehicle', 'Vehicle Leasing', 'Asset Type Maintenance', etc. The problem is that the keywords have different lengths: one is a single word, another is a four-word keyword. At the moment I'm scanning word by word, but that approach cannot match multi-word keywords such as 'Vehicle Leasing'.
What can I do to improve my code so that it works with multi-word keywords?
This is how it looks now:
public void findKeywords(POITextExtractor te, ArrayList<HashMap<String,Integer>> listOfHashMaps, ArrayList<Integer> KeywordsFound, ArrayList<Integer> existingTags) {
String document = te.getText().toString();
String[] words = document.split("\\s+");
int wordsNo = 0;
int keywordsMatched = 0;
try {
for(String word : words) {
wordsNo++;
for(HashMap<String, Integer> hashmap : listOfHashMaps) {
if(hashmap.containsKey(word) && !KeywordsFound.contains(hashmap.get(word)) && !existingTags.contains(hashmap.get(word))) {
KeywordsFound.add(hashmap.get(word));
keywordsMatched++;
System.out.println(word);
}
}
}
System.out.println("New keywords found: " + KeywordsFound);
System.out.println("Number of words in document = " + wordsNo);
System.out.println("Number of keywords matched: " + keywordsMatched);
} catch (IllegalArgumentException e) {
e.printStackTrace();
}
}
I have included my method. If there's anything else required to understand my code, leave a comment please.
#UPDATE
public void findKeywords(POITextExtractor te, ArrayList<HashMap<String,Integer>> listOfHashMaps, ArrayList<Integer> KeywordsFound, ArrayList<Integer> existingTags) {
String document = te.getText().toString();
String[] words = document.split("\\s+");
int wordsNo = 0;
int keywordsMatched = 0;
for(HashMap<String, Integer> hashmap : listOfHashMaps) {
Iterator it = hashmap.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
//System.out.println(pair.getKey() + " = " + pair.getValue());
it.remove(); // avoids a ConcurrentModificationException
if(document.contains((CharSequence) pair.getKey()) && !KeywordsFound.contains(pair.getValue()) && !existingTags.contains(pair.getValue())) {
System.out.println(pair.getKey());
KeywordsFound.add((Integer) pair.getValue());
keywordsMatched++;
}
}
}
System.out.println("New keywords found: " + KeywordsFound);
System.out.println("Number of keywords matched: " + keywordsMatched);
}

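Another way of doing it would be to split the document by each search string, e.g.:
List<String> searchStrings = new ArrayList<>();
searchStrings.add("Fuel");
searchStrings.add("Asset Type Maintenance");
searchStrings.add("Vehicle Leasing");
String document = ""; // Assuming that your complete string is initialized here.
for (String str : searchStrings) {
// Pattern.quote (java.util.regex.Pattern) makes split treat the keyword literally rather than as a regex;
// the limit -1 keeps trailing empty strings, so a match at the very end is still counted.
String[] tempDoc = document.split(Pattern.quote(str), -1);
System.out.println(str + " is repeated " + (tempDoc.length - 1) + " times");
}
Note that this might put pressure on the garbage collector, because every split allocates a new array of substrings.
You can compare the performance on your own.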

I assume this is a kind of homework. Therefore:
Have a look at string search algorithms that search for a substring (pattern) in a larger string.
Then assume that you use one of these algorithms, but instead of searching for a sequence of chars (the pattern) in a larger sequence of chars, you search for a sequence of strings (the pattern) in a larger sequence of strings. In other words, you just have a different, much larger alphabet.
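The idea of treating whole words as the alphabet can be sketched with the simplest possible (naive) search. This is an illustration under assumptions, not a complete solution: the names PhraseSearch, findKeywords, and containsSequence are hypothetical, matching is case-sensitive, and both the document and the keywords are assumed to be split on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
class PhraseSearch {
    // Returns the keywords whose words occur consecutively in the document.
    static List<String> findKeywords(String document, List<String> keywords) {
        String[] words = document.split("\\s+");
        List<String> found = new ArrayList<>();
        for (String keyword : keywords) {
            // A multi-word keyword becomes a pattern of several "symbols".
            String[] pattern = keyword.split("\\s+");
            if (containsSequence(words, pattern)) {
                found.add(keyword);
            }
        }
        return found;
    }

    // Naive search: try every start position and compare word by word.
    static boolean containsSequence(String[] words, String[] pattern) {
        for (int start = 0; start + pattern.length <= words.length; start++) {
            int k = 0;
            while (k < pattern.length && words[start + k].equals(pattern[k])) {
                k++;
            }
            if (k == pattern.length) {
                return true;
            }
        }
        return false;
    }
}
```

A faster substring algorithm could replace containsSequence without changing the rest, since only the element type differs from the character case.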

Count Number of String Matches gives Wrong Result

I want to count the Number of String Matches in a List:
My ArrayList contains:
recognise
product
product
process
process
process
principle
partner
particular
So that the output should be:
recognise 1
product 2
process 3
principle 1
partner 1
particular 1
My Code is:
List<String> mylist=new LinkedList<String>();
HashMap<String, Integer> result= new LinkedHashMap<String, Integer>();
for (int i = 0; i < wordlist.size(); i++) {
mylist.add(wordlist.get(i)); //wordlist contains the above mentioned items
}
Collections.sort(mylist);
Collections.reverse(mylist);
String small="";
int c=0;
for(int i=0;i<mylist.size();i++)
{
c+=1;
small=mylist.get(i);
for(int j=i;j<mylist.size();j++)
{
if(small.contains(mylist.get(j)))
{
small=mylist.get(j);
}
}
if (!result.containsKey(small) || result.get(small) < c){
result.put(small, c);
c=0;
}
}
for (String key : result.keySet()){
System.out.println(key + ": " + result.get(key));
}

If all you want to do is count the occurrences of each string in your list, this should suffice:
for(String s : mylist) {
if(result.containsKey(s)) {
result.put(s, result.get(s) + 1);
} else {
result.put(s, 1);
}
}
No need to sort / reverse / etc mylist.
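To get the output sorted by count, note that a SortedMap (such as TreeMap) orders its keys, not its values, so you would need to copy the entries into a list and sort that with a Comparator (I have no prior experience with this, so you had better look up the API yourself).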
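Sorting by count can be sketched as follows. This is an assumption about what "sorted" should mean here (highest count first, ties broken alphabetically), and the names CountSorter and sortedByCount are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named for illustration only.
class CountSorter {
    // Counts the words, then returns the entries ordered by descending count.
    static List<Map.Entry<String, Integer>> sortedByCount(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Copy the entries into a list, because a map cannot be sorted by value directly.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        List<String> words = List.of("product", "process", "product", "process", "process", "partner");
        for (Map.Entry<String, Integer> e : sortedByCount(words)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```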

The time complexity of your algorithm seems high, on the order of O(N^2 * length_max).
However, you can do this in O(N), where N is the sum of all string lengths, by building a trie
in which each node holds a char ch and an integer i that counts how many strings end at that node.
struct node
{
int i;
char ch;
}
Algorithm:
For each string, walk down the trie one character at a time, inserting any node that is missing; when the string ends, do i = i + 1 to mark that one more string ends at that node.
See: https://stackoverflow.com/questions/296618/what-is-the-most-common-use-of-the-trie-data-structure
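A counting trie along these lines could look like the sketch below. It is an illustration rather than the answerer's exact structure: the class name WordTrie is hypothetical, and each node keeps its children in a HashMap keyed by character (an assumption) instead of storing ch in the node itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical counting trie, named for illustration only.
class WordTrie {
    private static final class Node {
        // Assumption: children are looked up by character in a map.
        final Map<Character, Node> children = new HashMap<>();
        int count; // number of inserted words that end at this node
    }

    private final Node root = new Node();

    // Walks down the trie, creating missing nodes, and bumps the end-of-word counter.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Returns how many times the exact word was inserted (0 if never).
    int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

Inserting a word touches one node per character, which is where the bound proportional to the total input length comes from.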

Counting occurrences in a string array and deleting the repeats using java

I'm having trouble with my code. I have read words from a text file into a String array and removed the periods and commas. Now I need to count the number of occurrences of each word, and I managed to do that as well. However, my output lists every word in the file, repeats included, together with its count.
Like this:
the 2
birds 2
are 1
going 2
north 2
north 2
Here is my code:
public static String counter(String[] wordList)
{
//String[] noRepeatString = null ;
//int[] countArr = null ;
for (int i = 0; i < wordList.length; i++)
{
int count = 1;
for(int j = 0; j < wordList.length; j++)
{
if(i != j) //to avoid comparing itself
{
if (wordList[i].compareTo(wordList[j]) == 0)
{
count++;
//noRepeatString[i] = wordList[i];
//countArr[i] = count;
}
}
}
System.out.println (wordList[i] + " " + count);
}
return null;
}
I need to figure out how to 1) get the count values into an array, and 2) delete the repetitions.
As the commented-out lines show, I tried to use a countArr[] and a noRepeatString[] for that, but I got a NullPointerException.
Any thoughts on this matter will be much appreciated :)

I would first convert the array into a list, because lists are easier to operate on than arrays.
List<String> list = Arrays.asList(wordList);
Then you should create a copy of that list (you'll see in a second why):
ArrayList<String> listTwo = new ArrayList<String>(list);
Now you remove all the duplicates in the second list:
HashSet<String> hs = new HashSet<String>();
hs.addAll(listTwo);
listTwo.clear();
listTwo.addAll(hs);
Then you loop through the second list and get the frequency of that word in the first list. But first you should create another arrayList to store the results:
ArrayList<String> results = new ArrayList<String>;
for(String word : listTwo){
int count = Collections.frequency(list, word);
String result = word +": " count;
results.add(result);
}
Finally you can output the results list:
for(String freq : results){
System.out.println(freq);}
I have not tested this code (can't do that right now). Please ask if there is a problem or it doesn't work. See these questions for reference:
How do I remove repeated elements from ArrayList?
One-liner to count number of occurrences of String in a String[] in Java?
How do I clone a generic List in Java?

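The answer above has two syntax issues (missing parentheses after new ArrayList<String> and a missing + in the string concatenation) but otherwise works fine. Corrected loop: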
ArrayList<String> results = new ArrayList<String>();
for(String word : listTwo){
int count = Collections.frequency(list, word);
String result = word +": "+ count;
results.add(result);
}

More efficient way of getting frequency of words

I want to count the words in an ArrayList by their first letter, e.g. [cat, cog, mouse] means there are 2 words beginning with c and one word beginning with m. The code I have works fine, but there are 26 letters in the alphabet, which would require a lot more if statements. Is there any other way of doing this?
public static void countAlphabeticalWords(ArrayList<String> arrayList) throws IOException
{
int counta =0, countb=0, countc=0, countd=0,counte=0;
String word = "";
for(int i = 0; i<arrayList.size();i++)
{
word = arrayList.get(i);
if (word.charAt(0) == 'a' || word.charAt(0) == 'A'){ counta++;}
if (word.charAt(0) == 'b' || word.charAt(0) == 'B'){ countb++;}
}
System.out.println("The number of words begining with A are: " + counta);
System.out.println("The number of words begining with B are: " + countb);
}

Use a Map
public static void countAlphabeticalWords(List<String> list) throws IOException {
Map<Character,Integer> counts = new HashMap<Character,Integer>();
for(String word : list) {
// key on the upper-cased first letter so 'a' and 'A' share one counter
Character c = Character.toUpperCase(word.charAt(0));
if (counts.containsKey(c)) {
counts.put(c, counts.get(c) + 1);
}
else {
counts.put(c, 1);
}
}
for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
System.out.println("The number of words beginning with " + entry.getKey() + " is: " + entry.getValue());
}
}
Or use a Map and AtomicInteger (as per Jarrod Roberson):
public static void countAlphabeticalWords(List<String> list) throws IOException {
Map<Character,AtomicInteger> counts = new HashMap<Character,AtomicInteger>();
for(String word : list) {
Character c = Character.toUpperCase(word.charAt(0));
if (counts.containsKey(c)) {
// the counter is mutable, so no put() is needed to update it
counts.get(c).incrementAndGet();
}
else {
counts.put(c, new AtomicInteger(1));
}
}
for (Map.Entry<Character, AtomicInteger> entry : counts.entrySet()) {
System.out.println("The number of words beginning with " + entry.getKey() + " is: " + entry.getValue());
}
}
Best Practices
Prefer for(element : list) over list.get(i) when you only need to visit each element. And avoid ArrayList in a method signature; use the List interface instead so that you can change the implementation.

How about this? Considering that the words start only with [a-zA-Z]:
public static int[] getCount(List<String> arrayList) {
int[] data = new int[26];
final int a = (int) 'a';
for(String s : arrayList) {
data[((int) Character.toLowerCase(s.charAt(0))) - a]++;
}
return data;
}
edit:
Just out of curiosity, I made a very simple test comparing my method and Steph's method with map.
List with 236 items, 10000000 iterations (without printing the result): my code took ~10000ms and Steph's took ~65000ms.
Test: http://pastebin.com/HNBgKFRk
Data: http://pastebin.com/UhCtapZZ

Now, every character can be cast to an integer, representing an ASCII decimal. For example, (int)'a' is 97. 'z''s ASCII decimal is 122. http://www.asciitable.com/
You can create a lookup table for the characters:
int[] characters = new int[128];
Then in your algorithm's loop use the ASCII decimal as index and increment the value:
word = arrayList.get(i);
characters[word.charAt(0)]++;
In the end, you can print the occurrences of the characters:
for (int i = 97; i<=122; i++){
System.out.println(String.format("The number of words beginning with %s are: %d", (char)i, characters[i]));
}

Find number of repetitions of characters in a given Word

So I was developing an algorithm to count the number of repetitions of each character in a given word. I am using a HashMap and I add each unique character to the HashMap as the key and the value is the number of repetitions. I would like to know what the run time of my solution is and if there is a more efficient way to solve the problem.
Here is the code :
public static void getCount(String name){
HashMap<String, Integer> names = new HashMap<String, Integer>();
for(int i =0; i<name.length(); i++){
if(names.containsKey(name.substring(i, i+1))){
names.put(name.substring(i, i+1), names.get(name.substring(i, i+1)) +1);
}
else{
names.put(name.substring(i, i+1), 1);
}
}
Set<String> a = names.keySet();
Iterator i = a.iterator();
while(i.hasNext()){
String t = (String) i.next();
System.out.println(t + " Ocurred " + names.get(t) + " times");
}
}

The algorithm has a time complexity of O(n), but I'd change some parts of your implementation, namely:
Using a single get() instead of containsKey() + get();
Using charAt() instead of substring() which will create a new String object;
Using a Map<Character, Integer> instead of Map<String, Integer> since you only care about a single character, not the entire String:
In other words:
public static void getCount(String name) {
Map<Character, Integer> names = new HashMap<Character, Integer>();
for(int i = 0; i < name.length(); i++) {
char c = name.charAt(i);
Integer count = names.get(c);
if (count == null) {
count = 0;
}
names.put(c, count + 1);
}
Set<Character> a = names.keySet();
for (Character t : a) {
System.out.println(t + " Ocurred " + names.get(t) + " times");
}
}

Your solution is O(n) from an algorithmic perspective, which is already optimal (at a minimum you have to inspect each character in the entire string at least once which is O(n)).
However, there are a couple of ways that you could speed it up by reducing the constant overhead, e.g.:
Use a HashMap<Character,Integer>. Characters will be much more efficient than Strings of length 1.
Use charAt(i) instead of substring(i,i+1). This avoids creating a new String, which will help you a lot. It is probably the biggest single improvement you can make.
If the string is going to be long (e.g. thousands of characters or more), consider using an int[] array to count the individual characters rather than a HashMap, with the character's ASCII value used as an index into the array. This isn't a good idea if your Strings are short though.
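The array-based variant from the last suggestion might look like this. It is a sketch under the assumption that only ASCII characters (codes 0 to 127) need counting; anything else is skipped. The names CharCounter and countAscii are hypothetical.

```java
// Hypothetical helper class, named for illustration only.
class CharCounter {
    // Returns an array where index c holds the number of times character c occurs.
    static int[] countAscii(String text) {
        int[] counts = new int[128];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: non-ASCII characters are ignored rather than counted.
            if (c < 128) {
                counts[c]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countAscii("good");
        for (char c = 'a'; c <= 'z'; c++) {
            if (counts[c] > 0) {
                System.out.println(c + " occurred " + counts[c] + " times");
            }
        }
    }
}
```

The array trades a fixed 128-slot allocation for lookups without hashing or boxing, which is why it tends to pay off only on longer inputs.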

Store the initial time to a variable, like so:
long start = System.currentTimeMillis();
then at the end, when you finish, print out the current time minus the start time:
System.out.println((System.currentTimeMillis() - start) + "ms taken");
to see the time taken. As far as I can tell, yours is the most efficient approach, though there may be other good ones. Also, use char rather than String for each individual character (char/Character is the right type for a single character, String for a series of chars): call name.charAt(i) rather than name.substring(i, i+1), and change your map to HashMap<Character, Integer>.

String s="good";
//collect different unique characters
ArrayList<String> temp=new ArrayList<>();
for (int i = 0; i < s.length(); i++) {
char c=s.charAt(i);
if(!temp.contains(""+c))
{
temp.add(""+s.charAt(i));
}
}
System.out.println(temp);
//get count of each occurrence in the string
for (int i = 0; i < temp.size(); i++) {
int count=0;
for (int j = 0; j < s.length(); j++) {
if(temp.get(i).equals(s.charAt(j)+"")){
count++;
}
}
System.out.println("Occurrence of "+ temp.get(i) + " is "+ count+ " times" );
}
