Extracting token from a string


I have some strings like "paddington road" and I need to extract the word "road" from each of them. How can I do that?
The problem is that I need to process a list of streets and extract words such as "road", "park", "street", "boulevard" and many others.
What would be the best way to do that? The naive approach is O(n*m), and since I process more than 5000 streets, performance is important.
I'm extracting the values from a Postgres database and putting them into a List, but I'm not sure that's the best way; maybe a hash table would be faster to query?
I tried something like this:
// Parse selectedList
Iterator<String> it = streets.iterator();
Iterator<String> it_exception = exception.iterator();
int counter = streets.size();
while(it.hasNext()) {
while ( it_exception.hasNext() ) {
// remove substring it_exception.next() from it.next()
}
}
What do you think?

You can try Set:
Set<String> exceptions = new HashSet<String>(...);
for (String street : streets) {
String[] words = street.split(" ");
StringBuilder res = new StringBuilder();
for (String word : words) {
if (!exceptions.contains(word)) {
res.append(word).append(" ");
}
}
System.out.println(res);
}
I think the complexity will be O(n), where n is the total number of words across all streets.
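The snippet above removes the exception words. If the goal is instead to pull the keywords out, as the question describes, the same Set lookup can be turned around. This is only a sketch: the class name StreetKeywords, the method extractKeywords and the sample keyword list are made up for illustration, and it assumes words are separated by whitespace and compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StreetKeywords {
    // Returns the words of the street name that appear in the keyword set, in their original order.
    // Assumption: words are separated by whitespace and matching ignores case.
    static List<String> extractKeywords(String street, Set<String> keywords) {
        List<String> found = new ArrayList<>();
        for (String word : street.trim().toLowerCase().split("\\s+")) {
            if (keywords.contains(word)) { // expected O(1) lookup per word
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical keyword list; in the question it would come from the database.
        Set<String> keywords = new HashSet<>(Arrays.asList("road", "park", "street", "boulevard"));
        for (String street : Arrays.asList("paddington road", "Hyde Park", "elm lane")) {
            System.out.println(street + " -> " + extractKeywords(street, keywords));
        }
    }
}
```

Each street costs one hash lookup per word, so the total work grows with the number of words rather than with streets times keywords.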

You need to get a new iterator for your list of keywords at each iteration of the outer loop. The easiest way is to use the foreach syntax:
for (String streetName : streets) {
for (String keyword : keywords) {
// find if the string contains the keyword, and perhaps break if found to avoid searching for the other keywords
}
}
Don't optimize prematurely. 5000 is nothing for a computer, and street names are short strings. And if you place the most frequent keywords (street, rather than boulevard) at the beginning of the keyword list, you'll have fewer iterations.

List<String> streets = new ArrayList<String>();
streets.add("paddington road");
streets.add("paddington park");
for (String street : streets) {
// Split on spaces; this assumes every street name has at least two words
String[] parts = street.split(" ");
String secondWord = parts[1];
System.out.println("secondWord: " + secondWord);
}
You can keep secondWord in a list, a StringBuilder, etc.

Related

Simple words finder

We have to find all simple words from a bunch of simple and compound words. For example:
Input: chat, ever, snapchat, snap, salesperson, per, person, sales, son, whatsoever, what, so.
Output should be: chat, ever, snap, per, sales, son, what, so
My sample code:
private static String[] find(String[] words) {
// TODO Auto-generated method stub
//System.out.println();
ArrayList<String> alist = new ArrayList<String>();
Set<String> r1 = new HashSet<String>();
for(String s: words){
alist.add(s);
}
Collections.sort(alist,new Comparator<String>() {
public int compare(String o1, String o2) {
return o1.length()-o2.length();
}
});
//System.out.println(alist.toString());
int count= 0;
for(int i=0;i<alist.size();i++){
String check = alist.get(i);
r1.add(check);
for(int j=i+1;j<alist.size();j++){
String temp = alist.get(j);
//System.out.println(check+" "+temp);
if(temp.contains(check) ){
alist.remove(temp);
}
}
}
System.out.println(r1.toString());
String res[] = new String[r1.size()];
for(String i:words){
if(r1.contains(i)){
res[count++] = i;
}
}
return res;
}
I am unable to get the correct result with the above code. Any suggestions or ideas?
A compound word is a concatenation of two or more words; all other words are considered simple words.
We have to remove all the compound words.

Algorithm
1) Read the input into a set of Strings, i.e. Set<String> input
2) Create an empty set for simple words, i.e. Set<String> simpleWords
3) Create an empty set for compound words, i.e. Set<String> compoundWords
4) Iterate over input. For each element:
   a) Let the length of element be elemLength
   b) Create a set Set<String> inputs of all Strings from input (excluding element) that are:
      - shorter than element
      - not present in compoundWords
   c) Create the set of all permutations of inputs (by concatenation) with max length = elemLength, i.e. Set<String> currentPermutations
   d) Check whether any of currentPermutations equals element
      - If yes, add element to compoundWords
      - If no, continue with the iteration
5) After the iteration is done, place all Strings from input that are not present in compoundWords into simpleWords
That is your answer.
Before you start writing code decide the logic that you are going to use. Use descriptive variable names and you are basically done.
The reason your logic is not working has to do with the way you are checking temp.contains(check). This checks for a substring, not for a compound word as per your definition.
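As an alternative to generating permutations, here is a sketch that uses a different technique: a dynamic-programming "word break" check, which tests whether a word can be cut into two or more pieces that all occur in the input. It is not the algorithm described above, only one possible way to apply the same definition of a compound word. The names SimpleWords, isCompound and simpleWords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SimpleWords {
    // True if word is a concatenation of two or more words from the dictionary.
    static boolean isCompound(String word, Set<String> dictionary) {
        int n = word.length();
        if (n == 0) {
            return false; // treat the empty string as not compound
        }
        // reachable[i] is true when word.substring(0, i) is a concatenation of dictionary words
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                // Skip the piece that is the whole word, so at least two pieces are required.
                if (start == 0 && end == n) {
                    continue;
                }
                if (reachable[start] && dictionary.contains(word.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return reachable[n];
    }

    // Returns the words that are not compound, in input order.
    static List<String> simpleWords(String[] words) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(words));
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!isCompound(word, dictionary)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

With the input from the question, "person" is treated as compound because it is "per" followed by "son", which agrees with the expected output.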

How to create vocabulary from Arrays of Strings

I have to build a vocabulary of the unique words in some texts. The texts are already converted to arrays of Strings. Now I want a list containing only the unique words. So the first step is to convert the first array of Strings to a List<String> (I guess?) with all duplicate words filtered out. How do I do this, and should I use a List<String> or another String[]?
Second, each subsequent String[] I read in should update the vocabulary List<String>, but ONLY add words that are new.
It must look something like:
public List<String> makeVocabulary(String[] tokens){
List<String> vocabulary = new ArrayList<>();
//add unique words from 'tokens' to vocabulary
return vocabulary;
}
TL;DR: how do I convert a whole bunch of String[] to one List<String> with only the unique words from the String[]'s?

Upon review of your code, it appears that you would be clearing vocabulary each time you run this command, so it can only be done once. If you'd like to make it more modular, do something like this:
public class yourClass
{
private List<String> vocabulary = new ArrayList<String>();
public List<String> makeVocabulary(String[] tokens)
{
for( int i = 0; i < tokens.length; i++ )
if( !vocabulary.contains( tokens[i] ) )
vocabulary.add(tokens[i]);
return vocabulary;
}
}

For determining unique tokens, use a Set implementation...
public List<String> makeVocabulary(String[] tokens){
Set<String> uniqueTokens = new HashSet<String>();
for(String token : tokens) {
uniqueTokens.add(token);
}
List<String> vocabulary = new ArrayList<String>(uniqueTokens);
return vocabulary;
}

One way to achieve your goal is to use a Set instead of a List of strings, for example as in the code below.
public List<String> makeVocabulary(String[] tokens){
Set<String> temp = new HashSet<>();
//add unique words from 'tokens' to temp
List<String> vocabulary = new ArrayList<>();
vocabulary.addAll(temp);
return vocabulary;
}
If you can live with Set as the return type of makeVocabulary, you can just return temp.
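To cover the second part of the question (updating the vocabulary as further String[] arrays are read in), one possible sketch keeps a single set alive across calls. The class name Vocabulary and its methods are hypothetical; a LinkedHashSet is assumed here only so that words stay in first-seen order, and a plain HashSet would work if order does not matter.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class Vocabulary {
    // LinkedHashSet ignores duplicates and remembers insertion order.
    private final Set<String> words = new LinkedHashSet<>();

    // Adds only the tokens that are not in the vocabulary yet.
    void addTokens(String[] tokens) {
        for (String token : tokens) {
            words.add(token);
        }
    }

    // Returns a copy of the vocabulary as a list.
    List<String> asList() {
        return new ArrayList<>(words);
    }
}
```

Calling addTokens once per text and asList at the end gives one List<String> with each word exactly once.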

How to Count Unique Values in an ArrayList?

I have to count the number of unique words in a text document using Java. First I had to get rid of the punctuation in all of the words. I used the Scanner class to scan each word in the document and put it in an ArrayList of Strings.
So, the next step is where I'm having the problem! How do I create a method that can count the number of unique Strings in the array?
For example, if the array contains apple, bob, apple, jim, bob; the number of unique values in this array is 3.
public countWords() {
try {
Scanner scan = new Scanner(in);
while (scan.hasNext()) {
String words = scan.next();
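// Strings are immutable: replace() returns a new String, so the result must be assigned
if (words.contains(".")) {
words = words.replace(".", "");
}
if (words.contains("!")) {
words = words.replace("!", "");
}
if (words.contains(":")) {
words = words.replace(":", "");
}
if (words.contains(",")) {
words = words.replace(",", "");
}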
if (words.contains("?")) {
words = words.replace("?", "");
}
if (words.contains("-")) {
words = words.replace("-", "");
}
if (words.contains("‘")) {
words = words.replace("‘", "");
}
wordStore.add(words.toLowerCase());
}
} catch (FileNotFoundException e) {
System.out.println("File Not Found");
}
System.out.println("The total number of words is: " + wordStore.size());
}

Are you allowed to use a Set? If so, a HashSet may solve your problem, since a HashSet doesn't accept duplicates.
HashSet<String> noDupSet = new HashSet<String>();
noDupSet.add(yourString);
noDupSet.size();
The size() method returns the number of unique words.
If you really have to use only an ArrayList, then one way to achieve this is:
1) Create a temp ArrayList
2) Iterate original list and retrieve element
3) If tempArrayList doesn't contain element, add element to tempArrayList

Starting from Java 8 you can use Stream:
After you add the elements in your ArrayList:
long n = wordStore.stream().distinct().count();
It converts your ArrayList to a stream and then it counts only the distinct elements.

I would advise using a HashSet. It automatically filters out duplicates when you call the add method.

Although I believe a set is the easiest solution, you can still use your original solution and just add an if statement to check if value already exists in the list before you do your add.
if( !wordStore.contains( words.toLowerCase() ) )
wordStore.add(words.toLowerCase());
Then the number of words in your list is the total number of unique words (ie: wordStore.size() )

This general-purpose solution takes advantage of the fact that the Set abstract data type does not allow duplicates. The Set.add() method is particularly useful because it returns a boolean flag indicating whether the element was actually added. A HashMap is used to track the number of occurrences of each original element. This algorithm can be adapted to variations of this type of problem. The solution runs in O(n) time.
public static void main(String args[])
{
String[] strArray = {"abc", "def", "mno", "xyz", "pqr", "xyz", "def"};
System.out.printf("RAW: %s ; PROCESSED: %s \n",Arrays.toString(strArray), duplicates(strArray).toString());
}
public static HashMap<String, Integer> duplicates(String arr[])
{
HashSet<String> distinctKeySet = new HashSet<String>();
HashMap<String, Integer> keyCountMap = new HashMap<String, Integer>();
for(int i = 0; i < arr.length; i++)
{
if(distinctKeySet.add(arr[i]))
keyCountMap.put(arr[i], 1); // unique value or first occurrence
else
keyCountMap.put(arr[i], (Integer)(keyCountMap.get(arr[i])) + 1);
}
return keyCountMap;
}
RESULTS:
RAW: [abc, def, mno, xyz, pqr, xyz, def] ; PROCESSED: {pqr=1, abc=1, def=2, xyz=2, mno=1}

You can use a Hashtable or HashMap as well. The keys would be your input strings and the values the number of times each string occurs in your input array. This takes O(N) time and space.
Solution 2:
Sort the input list.
Equal strings will then be next to each other.
Compare list(i) to list(i+1) and count the number of duplicates.
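Solution 2 could look roughly like the sketch below. The class and method names are made up, and it assumes the list contains no null elements. Sorting dominates the cost, so this runs in O(N log N) time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedUniqueCount {
    // Counts distinct strings by sorting a copy and comparing each element with its predecessor.
    static int countUnique(List<String> words) {
        List<String> sorted = new ArrayList<>(words); // copy, so the caller's list is not reordered
        Collections.sort(sorted);
        int unique = 0;
        for (int i = 0; i < sorted.size(); i++) {
            // A new value starts whenever the current element differs from the previous one.
            if (i == 0 || !sorted.get(i).equals(sorted.get(i - 1))) {
                unique++;
            }
        }
        return unique;
    }
}
```

For the example from the question (apple, bob, apple, jim, bob) this returns 3.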

In short, you can do it as follows:
ArrayList<String> duplicateList = new ArrayList<String>();
duplicateList.add("one");
duplicateList.add("two");
duplicateList.add("one");
duplicateList.add("three");
System.out.println(duplicateList); // prints [one, two, one, three]
HashSet<String> uniqueSet = new HashSet<String>();
uniqueSet.addAll(duplicateList);
System.out.println(uniqueSet); // prints the unique elements, e.g. [two, one, three] (HashSet order is not guaranteed)
duplicateList.clear();
System.out.println(duplicateList);// prints []
duplicateList.addAll(uniqueSet);
System.out.println(duplicateList);// prints the unique elements in the set's iteration order, e.g. [two, one, three]

public class UniqueinArrayList {
public static void main(String[] args) {
StringBuffer sb=new StringBuffer();
List al=new ArrayList();
al.add("Stack");
al.add("Stack");
al.add("over");
al.add("over");
al.add("flow");
al.add("flow");
System.out.println(al);
Set s=new LinkedHashSet(al);
System.out.println(s);
Iterator itr=s.iterator();
while(itr.hasNext()){
sb.append(itr.next()+" ");
}
System.out.println(sb.toString().trim());
}
}

3 distinct possible solutions:
1) Use HashSet as suggested above.
2) Create a temporary ArrayList and store only unique elements, like below:
public static int getUniqueElement(List<String> data) {
List<String> newList = new ArrayList<>();
for (String eachWord : data)
if (!newList.contains(eachWord))
newList.add(eachWord);
return newList.size();
}
3) Java 8 solution:
long count = data.stream().distinct().count();

Regarding arrayList

I have used Scanner instead of StringTokenizer; below is the piece of code:
Scanner scanner = new Scanner("Home,1;Cell,2;Work,3");
scanner.useDelimiter(";");
while (scanner.hasNext()) {
// System.out.println(scanner.next());
String phoneDtls = scanner.next();
// System.out.println(phoneDtls);
ArrayList<String> phoneTypeList = new ArrayList<String>();
if(phoneDtls.indexOf(',')!=-1) {
String value = phoneDtls.substring(0, phoneDtls.indexOf(','));
phoneTypeList.add(value);
}
Iterator itr=phoneTypeList.iterator();
while(itr.hasNext())
System.out.println(itr.next());
}
The output I get upon executing this:
Home
Cell
Work
As the code above shows, the values are ultimately stored in the array list phoneTypeList, but the logic for extracting each value based on ',' is not very elegant, namely:
if(phoneDtls.indexOf(',')!=-1) {
String value = phoneDtls.substring(0, phoneDtls.indexOf(','));
phoneTypeList.add(value);
}
Could you please suggest an alternative way to achieve the same thing? Thanks a lot in advance!

Well, since you asked if there is another way to do it then here is an alternative: You can split the string directly and do it with less code with the foreach statement:
String input = "Home,1;Cell,2;Work,3";
String[] splitInput = input.split(";");
for (String s : splitInput ) {
System.out.println(s.split(",")[0]);
}
No need to use the ArrayList<T> since you can iterate over an array as well.

You could try splitting on ',': STRING_VALUE.split(",") returns an array of the strings that were separated by ','. Maybe this helps.

If I understand correctly, the problem is that you want to maintain a list of phone types, like this: ["Home", "Cell", "Work"].
I suggest you keep this in a property file, config file or database, whichever makes sense, and load it into memory when your app starts.
If the input cannot be changed, then as for the algorithm, I couldn't think of a better one. It looks good.
You could use String's split function if that makes sense:
First split on ";"
Then split on ","
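The two splits could be sketched as follows. The class name PhoneTypes and the method parsePhoneTypes are made up for illustration; entries without a comma are skipped, which mirrors the indexOf check in the question.

```java
import java.util.ArrayList;
import java.util.List;

class PhoneTypes {
    // Parses input such as "Home,1;Cell,2;Work,3" and returns the part before each comma.
    static List<String> parsePhoneTypes(String input) {
        List<String> types = new ArrayList<>();
        for (String entry : input.split(";")) {
            // Limit 2 keeps a trailing empty part, so "Home," still yields two elements.
            String[] parts = entry.split(",", 2);
            if (parts.length == 2) { // the entry contained a comma
                types.add(parts[0]);
            }
        }
        return types;
    }
}
```

Called with the string from the question, this returns a list holding Home, Cell and Work.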

Declare the ArrayList outside the while loop.
Try this; I have made some changes for better performance too. I hope you can compare and understand the changes.
ArrayList<String> phoneTypeList = new ArrayList<String>();
Scanner scanner = new Scanner("Home,1;Cell,2;Work,3");
scanner.useDelimiter(";");
String phoneDtls = null;
String value = null;
while (scanner.hasNext()) {
phoneDtls = scanner.next();
if (phoneDtls.indexOf(',') != -1) {
value = phoneDtls.split(",")[0];
phoneTypeList.add(value);
}
}
Iterator itr = phoneTypeList.iterator();
while (itr.hasNext())
System.out.println(itr.next());
I have executed it and got the expected result.

string compare in java

I have an ArrayList with elements something like:
[string,has,was,hctam,gnirts,saw,match,sah]
I would like to find the elements that repeat another element in reversed form, such as string and gnirts, and delete the second one (gnirts). How do I go about achieving that?
Edit: I would like to rephrase the question:
Given an ArrayList of strings, how does one go about deleting the elements that are the reverse of another element?
Given the following input:
[string,has,was,hctam,gnirts,saw,match,sah]
How does one reach the following output:
[string,has,was,match]

Set<String> result = new HashSet<String>();
for(String word: words) {
if(result.contains(word) || result.contains(new StringBuffer(word).reverse().toString())) {
continue;
}
result.add(word);
}
// result
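A HashSet does not keep the words in their original order. A possible variant of the same idea, sketched here with made-up names (ReversedDuplicates, removeReversed), uses a LinkedHashSet so the surviving words stay in input order. It keeps whichever form appears first, so for the question's input it keeps "hctam" rather than "match"; choosing the "real" word would need extra information such as a dictionary.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ReversedDuplicates {
    // Keeps the first occurrence of each word and drops later words that are
    // equal to, or the reverse of, a word already kept.
    static List<String> removeReversed(List<String> words) {
        Set<String> kept = new LinkedHashSet<>(); // preserves insertion order
        for (String word : words) {
            String reversed = new StringBuilder(word).reverse().toString();
            if (!kept.contains(word) && !kept.contains(reversed)) {
                kept.add(word);
            }
        }
        return new ArrayList<>(kept);
    }
}
```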

You can use a comparator that sorts the characters before checking them for equality. This means that compare("string", "gnirts") will return 0. Then use this comparator as you traverse through the list and copy the matching elements to a new list.
Another option (if you have a really large list) is to create an Anagram class that wraps a String (String is final, so it cannot be extended). Override hashCode and equals so that anagrams compare equal and produce the same hash code, then use a HashMap or HashSet of Anagram objects to check your array list for anagrams.
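A lighter-weight sketch of the same idea avoids a wrapper class and uses the sorted characters of each word as a lookup key. The names AnagramFilter, key and removeAnagrams are made up. Note that this treats every anagram as a duplicate, not only exact reversals, so "has" and "ash" would also count as the same word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnagramFilter {
    // Canonical key: the characters of the word in sorted order, e.g. "gnirts" -> "ginrst".
    static String key(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Keeps the first word seen for each key and drops later anagrams of it.
    static List<String> removeAnagrams(List<String> words) {
        Set<String> seenKeys = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (seenKeys.add(key(word))) { // add() returns false if the key was already present
                result.add(word);
            }
        }
        return result;
    }
}
```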

HashSet<String> set = new HashSet<String>();
for (String str : arraylst)
{
set.add(str);
}
ArrayList<String> newlst = new ArrayList<String>();
for (String str : arraylst)
{
if(!set.contains(str))
newlst.add(str);
}

To remove duplicate items, you can use a HashMap in which the key is the sum of the character codes of a word's letters and the value is the word itself. A word and its reverse always have the same sum. When a new word is added whose sum equals an existing key, it replaces the word stored under that key, so the HashMap ends up holding the words without repetition. Note that two unrelated words can also have the same sum (for example "ad" and "bc"), so this only works when such collisions do not occur in the input.
As for "string" looking better than "gnirts" in the result: there may be cases where we cannot determine which word is better, so I assumed that the final form of the word is not important; what matters is that there are no duplicates.
ArrayList<String> mainList = new ArrayList<String>();
mainList.add("string,has,was,hctam,gnirts,saw,match,sah");
String[] listChar = mainList.get(0).split(",");
HashMap <Integer, String> hm = new HashMap<Integer, String>();
for (String temp : listChar) {
int sumStr=0;
for (int i=0; i<temp.length(); i++)
sumStr += temp.charAt(i);
hm.put(sumStr, temp);
}
mainList=new ArrayList<String>();
Set<Map.Entry<Integer, String>> set = hm.entrySet();
for (Map.Entry<Integer, String> temp : set) {
mainList.add(temp.getValue());
}
System.out.println(mainList);
UPD:
1) The txt file needs to be saved in ANSI encoding.
To begin with, I replaced Scanner with FileReader and BufferedReader:
String fileRStr = new String();
String stringTemp;
FileReader fileR = new FileReader("text.txt");
BufferedReader streamIn = new BufferedReader(fileR);
while ((stringTemp = streamIn.readLine()) != null)
fileRStr += stringTemp;
fileR.close();
mainList.add(fileRStr);
In addition, all the words in the file must be separated by commas, since the source line is split into words with split(",").
If your words are separated by another character, replace the comma with that character in the following line:
String[] listChar = mainList.get(0).split(",");
