How to create vocabulary from Arrays of Strings - java

I have to build a vocabulary of the unique words in some texts. The texts have been converted to arrays of Strings. Now I want a list containing only the unique words. So the first step is to convert the first array of Strings to a List<String> (I guess?) with all duplicate words filtered out. How do I do this, and should I use a List<String> or another String[]?
Second, the next String[] I read in should update the vocabulary List<String>, but ONLY add words from the text that are new.
It must look something like:
public List<String> makeVocabulary(String[] tokens){
List<String> vocabulary = new ArrayList<>();
//add unique words from 'tokens' to vocabulary
return vocabulary;
}
TL;DR: how do I convert a whole bunch of String[] to one List<String> with only the unique words from the String[]'s?

Upon review of your code, it appears that you would be clearing vocabulary each time you run this command, so it can only be done once. If you'd like to make it more modular, do something like this:
public class yourClass
{
private List<String> vocabulary = new ArrayList<String>();
public List<String> makeVocabulary(String[] tokens)
{
for( int i = 0; i < tokens.length; i++ )
if( !vocabulary.contains( tokens[i] ) )
vocabulary.add(tokens[i]);
return vocabulary;
}
}

For determining unique tokens, use a Set implementation...
public List<String> makeVocabulary(String[] tokens){
Set<String> uniqueTokens = new HashSet<String>();
for(String token : tokens) {
uniqueTokens.add(token);
}
List<String> vocabulary = new ArrayList<String>(uniqueTokens);
return vocabulary;
}

One way to achieve your goal is to use a Set instead of a List of strings, for example like the code below.
public List<String> makeVocabulary(String[] tokens){
Set<String> temp = new HashSet<>();
//add unique words from 'tokens' to temp
List<String> vocabulary = new ArrayList<>();
vocabulary.addAll(temp);
return vocabulary;
}
If you can live with Set as the return type of makeVocabulary, you can just return temp.
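To cover the second part of the question (several String[] inputs feeding one vocabulary), here is a minimal sketch that combines the ideas above. It assumes first-seen order is worth keeping, which is why it uses LinkedHashSet; the class name Vocabulary and the methods addTokens and asList are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; the names are not from the question.
class Vocabulary {
    // LinkedHashSet ignores duplicates and remembers first-seen order.
    private final Set<String> words = new LinkedHashSet<>();

    // Adds only the tokens that are not yet in the vocabulary.
    public void addTokens(String[] tokens) {
        for (String token : tokens) {
            words.add(token);
        }
    }

    // Returns a copy so callers cannot modify the internal set.
    public List<String> asList() {
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        Vocabulary vocabulary = new Vocabulary();
        vocabulary.addTokens(new String[] {"the", "cat", "the", "mat"});
        vocabulary.addTokens(new String[] {"the", "dog"});
        System.out.println(vocabulary.asList()); // prints [the, cat, mat, dog]
    }
}
```

Call addTokens once per text; the set does the duplicate check, so there is no need for a contains() scan over a list.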

Related

Java: Removing item from array because of character

Let's say you have an array like this: String[] theWords = {"hello", "good bye", "tomorrow"}. I want to remove/ignore all the strings in the array that have the letter 'e'. How would I go about doing that? My thinking is to go:
for (int arrPos = 0; arrPos < theWords.length; arrPos++) { //Go through the array
for (int charPos = 0; charPos < theWords[arrPos].length(); charPos++) { //Go through the strings in the array
if (!(theWords[arrPos].charAt(charPos) == 'e')) { //Finds 'e' in the strings
//Put the words that don't have any 'e' into a new array;
//This is where I'm stuck
}
}
}
I'm not sure if my logic works and if I'm even on the right track. Any responses would be helpful. Many thanks.
One easy way to filter an array is to populate an ArrayList with if in a for-each loop:
List<String> noEs = new ArrayList<>();
for (String word : theWords) {
if (!word.contains("e")) {
noEs.add(word);
}
}
Another way in Java 8 is to use Collection#removeIf:
List<String> noEs = new ArrayList<>(Arrays.asList(theWords));
noEs.removeIf(word -> word.contains("e"));
Or use Stream#filter:
String[] noEs = Arrays.stream(theWords)
.filter(word -> !word.contains("e"))
.toArray(String[]::new);
You can use the contains() method of the String class directly to check whether "e" is present in your string. That saves you the extra for loop.
It is simpler if you use an ArrayList.
import java.util.ArrayList;
ArrayList<String> theWords = new ArrayList<String>();
ArrayList<String> yourNewArray = new ArrayList<String>();//Initializing your new array
theWords.add("hello");
theWords.add("good bye");
theWords.add("tomorrow");
for (int arrPos = 0; arrPos < theWords.size(); arrPos++) { //Go through the array
if(!theWords.get(arrPos).contains("e")){
yourNewArray.add(theWords.get(arrPos));// Adding non-e containing string into your new array
}
}
The problem you have is that you need to declare and instantiate the String array before you even know how many elements are going to be in it (since you wouldn't know how many strings would not contain 'e' before going through the loop).
Instead, if you use an ArrayList you do not need to know the required size beforehand. Here is my code from start to end.
String[] theWords = { "hello", "good bye", "tomorrow" };
//creating a new ArrayList object
ArrayList<String> myList = new ArrayList<String>();
//adding the corresponding array contents to the list.
//myList and theWords point to different locations in the memory.
for(String str : theWords) {
myList.add(str);
}
//create a new list containing the items you want to remove
ArrayList<String> removeFromList = new ArrayList<>();
for(String str : myList) {
if(str.contains("e")) {
removeFromList.add(str);
}
}
//now remove those items from the list
myList.removeAll(removeFromList);
//create a new Array based on the size of the list when the strings containing e is removed
//theWords now refers to this new Array.
theWords = new String[myList.size()];
//convert the list to the array
myList.toArray(theWords);
//now theWords array contains only the string(s) not containing 'e'
System.out.println(Arrays.toString(theWords));

Simple words finder

We have to find all simple words from a bunch of simple and compound words. For example:
Input: chat, ever, snapchat, snap, salesperson, per, person, sales, son, whatsoever, what, so.
Output should be: chat, ever, snap, per, sales, son, what, so
My sample code:
private static String[] find(String[] words) {
// TODO Auto-generated method stub
//System.out.println();
ArrayList<String> alist = new ArrayList<String>();
Set<String> r1 = new HashSet<String>();
for(String s: words){
alist.add(s);
}
Collections.sort(alist,new Comparator<String>() {
public int compare(String o1, String o2) {
return o1.length()-o2.length();
}
});
//System.out.println(alist.toString());
int count= 0;
for(int i=0;i<alist.size();i++){
String check = alist.get(i);
r1.add(check);
for(int j=i+1;j<alist.size();j++){
String temp = alist.get(j);
//System.out.println(check+" "+temp);
if(temp.contains(check) ){
alist.remove(temp);
}
}
}
System.out.println(r1.toString());
String res[] = new String[r1.size()];
for(String i:words){
if(r1.contains(i)){
res[count++] = i;
}
}
return res;
}
I am unable to get a solution with the above code. Any suggestions or ideas?
A compound word is a concatenation of two or more words; all other words are considered simple words.
We have to remove all the compound words.
Algorithm:
1. Read the input into a set of strings, i.e. Set<String> input.
2. Create an empty set for simple words, i.e. Set<String> simpleWords.
3. Create an empty set for compound words, i.e. Set<String> compoundWords.
4. Iterate over input. For each element:
   a. Let elemLength be the length of element.
   b. Create a set Set<String> inputs of all strings from input (excluding element) that are shorter than element and not present in compoundWords.
   c. Create the set of all permutations of inputs (by concatenating) with maximum length elemLength, i.e. Set<String> currentPermutations.
   d. Check whether any of currentPermutations equals element. If yes, add element to compoundWords; if no, continue with the iteration.
5. After the iteration is done, place all strings from input that are not present in compoundWords into simpleWords.
That is your answer.
Before you start writing code decide the logic that you are going to use. Use descriptive variable names and you are basically done.
The reason your logic is not working has to do with the way you are checking temp.contains(check). This is checking for substring not a compound word as per your definition.
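For illustration, here is a sketch of one way to test "is this word a concatenation of two or more other input words" without generating permutations. Note that it swaps in a different technique from the algorithm above: a dynamic-programming split check, often called word break. The class and method names (SimpleWordFinder, isCompound, findSimpleWords) are made up, and it assumes the input contains no empty strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class; names are for illustration only.
class SimpleWordFinder {

    // True if word can be split into two or more pieces that are all in dictionary.
    static boolean isCompound(String word, Set<String> dictionary) {
        int n = word.length();
        // reachable[i] is true when the first i characters can be
        // written as a sequence of dictionary words.
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                // Skip the piece that is the whole word, so a word never "builds" itself.
                boolean wholeWord = (start == 0 && end == n);
                if (reachable[start] && !wholeWord
                        && dictionary.contains(word.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return n > 0 && reachable[n];
    }

    // Keeps the input order and drops every compound word.
    static List<String> findSimpleWords(String[] words) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(words));
        List<String> simple = new ArrayList<>();
        for (String word : words) {
            if (!isCompound(word, dictionary)) {
                simple.add(word);
            }
        }
        return simple;
    }

    public static void main(String[] args) {
        String[] input = {"chat", "ever", "snapchat", "snap", "salesperson", "per",
                "person", "sales", "son", "whatsoever", "what", "so"};
        // prints [chat, ever, snap, per, sales, son, what, so]
        System.out.println(findSimpleWords(input));
    }
}
```

Because a compound word must consist of whole pieces that cover it end to end, "son" stays simple even though it contains "so", which is exactly the case where a plain contains() check goes wrong.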

separating unique values in an algorithm

I am decomposing a series of 90,000+ strings into a discrete list of the individual, non-duplicated pairs of words that are included in the strings with the rxcui id values associated with each string. I have developed a method which tries to accomplish this, but it is producing a lot of redundancy. Analysis of the data shows there are about 12,000 unique words in the 90,000+ source strings, after I clean and format the contents of the strings.
How can I change the code below so that it avoids creating the redundant rows in the destination 2D ArrayList (shown below the code)?
public static ArrayList<ArrayList<String>> getAllWords(String[] tempsArray){//int count = tempsArray.length;
int fieldslenlessthan2 = 0;//ArrayList<String> outputarr = new ArrayList<String>();
ArrayList<ArrayList<String>> twoDimArrayList= new ArrayList<ArrayList<String>>();
int idx = 0;
for (String s : tempsArray) {
String[] fields = s.split("\t");//System.out.println(" --- fields.length is: "+fields.length);
if(fields.length>1){
ArrayList<String> row = new ArrayList<String>();
System.out.println("fields[0] is: "+fields[0]);
String cleanedTerms = cleanTerms(fields[1]);
String[] words = cleanedTerms.split(" ");
for(int j=0;j<words.length;j++){
String word=words[j].trim();
word = word.toLowerCase();
if(isValidWord(word)){//outputarr.add(word);
System.out.println("words["+j+"] is: "+word);
row.add(word_id);//WORD_ID NEEDS TO BE CREATED BY SOME METHOD.
row.add(fields[0]);
row.add(word);
twoDimArrayList.add(row);
idx += 1;
}
}
}else{fieldslenlessthan2 += 1;}
}
System.out.println("........... fieldslenlessthan2 is: "+fieldslenlessthan2);
return twoDimArrayList;
}
The output of the above method currently looks like the following, with many rxcui values for some name values, and with many name values for some rxcui:
How do I change the code above so that the output is a list of unique pairs of name/rxcui values, summarizing all relevant data from the current output while removing only the redundancies?
If you just need a Collection of all words, use a HashSet. Sets are primarily used for contains logic. If you need to associate a value with your string, use a HashMap.
public HashSet<String> getUniqueWords(String[] stringArray) {
HashSet<String> uniqueWords = new HashSet<String>();
for (String str : stringArray) {
uniqueWords.add(str);
}
return uniqueWords;
}
This will give you a collection of all the unique Strings in your array. If you need an ID use a HashMap
String[] strList; // your String array
int idCounter = 0;
HashMap<String, Integer> stringIDMap = new HashMap<String, Integer>();
for (String str : strList) {
if (!stringIDMap.containsKey(str)) {
stringIDMap.put(str, idCounter);
idCounter++;
}
}
This will provide you a HashMap with unique String keys and unique Integer values. To get an id for a String you do this:
stringIDMap.get("myString"); // returns the Integer ID associated with the String "myString"
UPDATE
Based on the question update from the OP, I recommend creating an object that holds the String value and the rxcui. You can then place these objects in a Set or HashMap using an implementation similar to the one provided above.
public MyObject(String str, int rxcui); // The constructor for your new object
MyObject mo1 = new MyObject("hello", 5);
Either
mySet.add(mo1);
will work or
myMap.put(mo1.getStr(), mo1.getRxcui());
What is the purpose of the unique word ID? Is the word itself not unique enough since you are not keeping duplicates?
A very basic way would be to keep a counter going as you are checking new words. For each word that doesn't already exist you could increase the counter and use the new value as the unique id.
Lastly, might I suggest you use a HashMap instead. It would allow you to both insert and retrieve words in O(1) time. I am not entirely sure what you are going for, but I think the HashMap might give you more range.
Edit2:
It would be something a little more along these lines. This should help you out.
public static Set<DataPair> getAllWords(String[] tempsArray) {
Set<DataPair> set = new HashSet<>();
for (String row : tempsArray) {
// PARSE YOUR STRING DATA
// the way you were doing it seemed fine but something like this
String[] rowArray = row.split(" ");
String word = rowArray[1];
int id = Integer.parseInt(rowArray[0]);
DataPair pair = new DataPair(word, id);
set.add(pair);
}
return set;
}
class DataPair {
private String word;
private int id;
public DataPair(String word, int id) {
this.word = word;
this.id = id;
}
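@Override
public boolean equals(Object o) {
if (o instanceof DataPair) {
return ((DataPair) o).word.equals(word) && ((DataPair) o).id == id;
}
return false;
}
// HashSet needs hashCode to be consistent with equals, otherwise equal pairs are not treated as duplicates
@Override
public int hashCode() {
return 31 * word.hashCode() + id;
}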
}

Extracting token from a string

I have some strings like "paddington road" and I need to extract the word "road" from them. How can I do that?
The problem is that I need to process a list of streets and extract words like "road", "park", "street", "boulevard" and many others.
What would be the best way to do that? The complexity is O(n*m), and since I process more than 5000 streets, performance is very important.
I'm extracting the values from a Postgres db and putting them into a List, but I'm not sure that's the best way. Maybe a hash table is faster to query?
I tried something like this:
// Parse selectedList
Iterator<String> it = streets.iterator();
Iterator<String> it_exception = exception.iterator();
int counter = streets.size();
while(it.hasNext()) {
while ( it_exception.hasNext() ) {
// remove substring it_exception.next() from it.next()
}
}
What do you think?
You can try Set:
Set<String> exceptions = new HashSet<String>(...);
for (String street : streets) {
String[] words = street.split(" ");
StringBuilder res = new StringBuilder();
for (String word : words) {
if (!exceptions.contains(word)) {
res.append(word).append(" ");
}
}
System.out.println(res);
}
I think complexity will be O(n), where n is a number of all words in streets.
You need to get a new iterator for your list of keywords at each iteration of the outer loop. The easiest way is to use the foreach syntax:
for (String streetName : streets) {
for (String keyword : keywords) {
// find if the string contains the keyword, and perhaps break if found to avoid searching for the other keywords
}
}
Don't optimize prematurely. 5000 is nothing for a computer, and street names are short strings. And if you place the most frequent keywords (street, rather than boulevard) at the beginning of the keyword list, you'll have fewer iterations.
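If the keywords are always whole words, the inner loop over keywords can also be replaced by one hash lookup per word of the street name. A minimal sketch, assuming whitespace-separated words and case-insensitive matching; the class StreetKeywords and the method findKeyword are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; names are for illustration only.
class StreetKeywords {

    // Returns the first word of streetName that is a known keyword, or null if there is none.
    // The keywords set is assumed to hold lower-case words.
    static String findKeyword(String streetName, Set<String> keywords) {
        for (String word : streetName.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (keywords.contains(word)) {
                return word;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("road", "park", "street", "boulevard"));
        System.out.println(findKeyword("paddington road", keywords)); // prints road
    }
}
```

With a HashSet the cost per street depends on the number of words in its name, not on the number of keywords.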
List<String> streets = new ArrayList<String>();
streets.add("paddington road");
streets.add("paddington park");
for (String street : streets) {
String[] abc = street.split(" ");
String secondwrd = abc[1]; // assumes every street name has at least two words
System.out.println("secondwrd: " + secondwrd);
}
You can keep secondwrd in a list, a StringBuilder, etc.

string compare in java

I have a ArrayList, with elements something like:
[string,has,was,hctam,gnirts,saw,match,sah]
I would like to delete the elements that repeat another element in reverse, such as string and gnirts, keeping one (string) and deleting the other (gnirts). How do I go about achieving this?
Edit: I would like to rephrase the question:
Given an arrayList of strings, how does one go about deleting elements containing reversed strings?
Given the following input:
[string,has,was,hctam,gnirts,saw,match,sah]
How does one reach the following output:
[string,has,was,match]
Set<String> result = new HashSet<String>();
for(String word: words) {
if(result.contains(word) || result.contains(new StringBuffer(word).reverse().toString())) {
continue;
}
result.add(word);
}
// result
You can use a comparator that sorts the characters of each string before comparing them, so that compare("string", "gnirts") returns 0. Then use this comparator as you traverse the list and copy the elements that match no earlier element to a new list. Note that this treats all anagrams as equal, not only reversed strings.
Another option (if you have a really large list) is to create an Anagram class that wraps a String (String is final, so it cannot be extended). Override hashCode and equals so that anagrams are equal and produce the same hash code, then use a hash map of Anagram objects to check your array list for anagrams.
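A minimal sketch of the sorted-characters idea, using the sorted letters as a lookup key in a HashSet instead of a custom comparator or wrapper class. The names AnagramFilter, sortedKey and removeAnagrams are made up. It treats every anagram as a duplicate, which is broader than "reversed string", and it keeps whichever form appears first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; names are for illustration only.
class AnagramFilter {

    // Words with the same letters (in any order) get the same key.
    static String sortedKey(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Keeps the first word seen for each key and drops later anagrams of it.
    static List<String> removeAnagrams(List<String> words) {
        Set<String> seenKeys = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String word : words) {
            // add() returns false when the key is already present
            if (seenKeys.add(sortedKey(word))) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "string", "has", "was", "hctam", "gnirts", "saw", "match", "sah");
        System.out.println(removeAnagrams(words)); // prints [string, has, was, hctam]
    }
}
```

For the sample input this keeps hctam rather than match, because hctam comes first in the list. Choosing the "real" word would need extra information, such as a dictionary.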
HashSet<String> set = new HashSet<String>();
ArrayList<String> newlst = new ArrayList<String>();
for (String str : arraylst)
{
// add() returns false if str is already in the set, so exact duplicates are skipped
if (set.add(str))
newlst.add(str);
}
To remove such duplicates you can use a HashMap whose key is the sum of the character codes of a word and whose value is the word itself. A word and its reverse contain the same letters, so they always get the same key. When a new word has the same key as an existing entry in the HashMap, the existing word is replaced by the new one, so the map ends up holding words without repetition. Note that this relies on two different words never having the same sum, which does not hold in general (for example, "ad" and "bc" have the same sum), so unrelated words can collide and overwrite each other.
As for the expected output preferring "string" over "gnirts": there may be cases where we cannot determine which form of a word is better, so I assumed that the final form of the word is not important, as long as there are no duplicates.
ArrayList<String> mainList = new ArrayList<String>();
mainList.add("string,has,was,hctam,gnirts,saw,match,sah");
String[] listChar = mainList.get(0).split(",");
HashMap <Integer, String> hm = new HashMap<Integer, String>();
for (String temp : listChar) {
int sumStr=0;
for (int i=0; i<temp.length(); i++)
sumStr += temp.charAt(i);
hm.put(sumStr, temp);
}
mainList=new ArrayList<String>();
Set<Map.Entry<Integer, String>> set = hm.entrySet();
for (Map.Entry<Integer, String> temp : set) {
mainList.add(temp.getValue());
}
System.out.println(mainList);
UPD:
1) The text file must be saved in ANSI encoding.
To begin with, I replaced Scanner with FileReader and BufferedReader:
String fileRStr = new String();
String stringTemp;
FileReader fileR = new FileReader("text.txt");
BufferedReader streamIn = new BufferedReader(fileR);
while ((stringTemp = streamIn.readLine()) != null)
fileRStr += stringTemp;
fileR.close();
mainList.add(fileRStr);
In addition, all the words in the file must be separated by commas, because the source line is split into words by split(",").
If your words are separated by another character, replace the comma with that character in the following line:
String[] listChar = mainList.get(0).split(",");
