String Parsing in Java - how can I do this?

String Parsing in Java - how can I do this? - java

I have the following data:
Maths abc10 4 def08 6 ;
English hrd45 3 ngh05 10 ; .
The word at the beginning is a keyword which is also in an enum, the following data are username and login pairs (there can be an unlimited number of these. The data for each keyword is terminated by ';' and the data is terminated by '.'
I'm using the scanner class but I can't get it to loop so that it can produce the following:
Maths abc10 4
Maths def08 6
English hrd45 3
English ngh05 10
Any Ideas? Thanks!

Store it all in a Map<String, Map<String, Integer> >. Iterate over the lines, grabbing the subject and then the name. (Loop starts here). Check if the name is ; and exit if it is. Grab the score, then add it to a Map<String, Integer>. Then grab the next name. (Loop ends here). Add the current Map<String, Integer> to the big Map<>, then do it all over again.

You could have a look at the regular expressions java.util.regex, if it's homework you'll learn about those as well which, trust me, it doesn't hurt. For things this simple, it's not worth going the parser way IMHO.

Related

String Name; Method 1: Atleast 2 words, max 4 words, no special signs and only letters

I've been searching around and havn't quite found my answer.
At this moment me and along with my group have created a few classes resembling a Bank with Customer and Account and so on.
I've been struggling lately with trying to improve and secure our code by making our variable called "name" only respond to certain inputs.
In this case, I want to make it only possible for the person to enter name as such:
Atleast 2 words = (For the word part I've seen codes where you count towards the white space between but don't know yet what you do about the last word since there wont be a white space)
Max 4 words = ( Same thing here)
No special signs such as ,!%¤"#()=%/'¨. = ( for this, I've read something about "Matcher and pattern" )
Now I'm quite new to Java and I'm not asking for a code from someone, I'm asking for someone to point me in the right directions regarding codes, because alot of what i've seen like the Matcher and pattern are things that you import with downloading utils and stuff but I reckon that it's not needed and there should be a simpler more basic way as I'm not trying to get ahead of myself with copying codes just to get it done.
So yeah, the String "name" is used alot in our main class "Banklogic" where almost every method that adds something has the variable "name" in it, so it's quite important that I get this done.
I hope I was clear enough and any help would be appreciated! I'm gonna put the alarm for 3 hours before school to see what you guys have come up with so I can try and complete the code before our meeting! Thanks alot in advance :)

Since you asked for hints, you can use Regex to add such rules.
For Numbers only:
if(string.matches("[0-9\\W]")
//allow insertion of data else not
As for rules related Word Count:
string.split("\\W") will create an array separated by space character. You can count the number of elements in this array and allow/disallow input based on that.
As for no signs and only letters:
if(string.matches("[a-zA-Z\\W]")
// Allow Input else not
You can use Document Filter to implement these methods. Document filter will only allow text to be entered if you allow it to.
I hope this helped as a hint.
Also, note that \\W is for whitespaces. If you dont want to allow whitespaces, remove that char.
This is the most effective and simple way of doing the task.
EDIT:
This is a Class I wrote a little while ago to achieve such tasks. Just in case if you are interested....

Would Java indexOf (brute force method) be more practical for me or some other substring algorithm?

I'm looking at finding very short substrings (pattern, needle) in many short lines of text (haystack). However, I'm not quite sure which method to use outside the naive, brute force method.
Background: I'm doing a side project for fun where I receive text messaging chat logs of multiple users (anywhere from 2000-15000 lines of text and 2-50 users), and I want to find all the various pattern matches in the chat logs based on predetermined words that I've come up with. So far I have about 1600 patterns that I'm looking for, but I may look for more.
So for example, I want to find the number of food related words that are used in an average text message log such as "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", "McDonalds". While I gave out English examples, I will actually be using Korean for my program. Each of these designated words will have their own respective score, which I put in a hashmap as key and value separately. I then show the top scorers for food related words as well as the most frequent words used by those users for food words.
My current method is to eliminate each line of text by whitespaces, and process each individual word from the haystack by using contains method (which uses the indexOf method and the naive substring search algorithm) of the haystack contains the pattern.
wordFromInput.contains(wordFromPattern);
To give an example, with 17 users in chat, 13000 lines of text, and the 1600 patterns, I've found that this whole program took 12-13 seconds with this method. And on the Android app that I'm developing, it took 2 minutes and 30 seconds to process, which is far too slow.
Originally, I tried to use a hash map and to merely get the pattern instead of searching for it in the ArrayList, but I then realized that is...
not possible with hash table
for what I am trying to do with a substring.
I've looked around through Stackoverflow and found a lot of helpful and related questions, such as these two:
1 and 2. I'm somewhat more familiar with the various string algorithms (Boyer Moore, KMP, etc.)
I initially thought then that the naive method would of course be the worst type of algorithm for my case, but having found this question, I've realized that my case (short pattern, short text), might actually be more effective with the naive method. But I wanted to know if there was something that I was neglecting completely.
Here is a snippet of my code though if anyone wants to see my issue more concretely.
While I removed large parts of the code to simplify it, the primary method that I use to actually match substrings is there in the method matchWords().
I know that's really ugly and bad code (5 for loops...), so if there are any suggestions for that, I'm happy to hear it as well.
So to clean it up:
lines of text from chat logs (2000-10,000+), haystack
1600+ patterns, needle(s)
mostly using Korean characters, although some English is included
Brute force naive method is simply too slow, but debating whether there are other alternatives and even if there are, whether they are practical given the nature of short patterns and text.
I just want some input on my thought process, and possibly some general advice. But additionally, I would like some specific suggestion for a particular algorithm or method if that is possible.

You can replace the hashtable with a Trie.
Split the line of text into words using white space to separate words. Then check if the word is in the Trie. If it is in the Trie, update a counter associated with the word. Ideally, the counter would be integrated into the Trie.
This appraoch is O(C) where C is the number of characters in the text. It's highly unlikely that you can avoid checking each character at least once. Thus this approach should be as good as you can get at least in terms of big O.
However, it sounds like you may not want to list all of the possible words you are searching for. Therefore, you might want to simply use you could build a counting Trie from all of the words. If nothing else that'll probably make it easier for any pattern matching algorithm you use. Although, it might require some modifications to the Trie.

What you're describing sounds like an excellent use case for the Aho-Corasick string-matching algorithm. This algorithm finds all matches of a set of pattern strings inside of a source string and does so in linear time (plus the time to report the matches). If you have a fixed set of strings to search for, you can do linear preprocessing work up front on the patterns to search for all matches very quickly.
There's a Java implementation of Aho-Corasick available here. I haven't tried it out, but it might be a good match.
Hope this helps!

I'm pretty sure string.contains is already highly optimized, so replacing it with something else is not going to do you a lot of good.
So the way to go, I suspect, is not to look for each and every bank-word in your chat words, but rather do multiple comparisons at once.
The first way to do it would be to create one huge regular expression that will match all your bank-words. Compile it and hope the regular expression package is efficient enough (chances are - it is). You will have a rather lengthy setup stage (the regex compilation), but matches should be a lot faster.

You can build an index of the words you need to match and count them as you process them. If you can use a HashMap to lookup the patterns for each word, the cost will be O(n * m)
You can use a HashMap for all the possible words, you can then dissect the words later.
e.g. say you need to match red and apple, you can combine the sum of
redapple = 1
applered = 0
red = 10
apple = 15
This means that red is actually 11 (10 + 1), and apple is 16 (15 + 1)

I don't know Korean so I imagine the same strategies used to tinker with Strings in Korean isn't necessarily possible in the way it is with English, but perhaps this strategy in pseudocode can be applied with your knowledge of Korean to make it work. (Java is of course still the same, but for example, in Korean is it still highly likely for the letters "ough" to be in succession? Are there even letters for "ough"? But with that being said, hopefully the principle can be applied
I would use String.toCharArray to create a two-dimensional array (or ArrayList if variable size needed). The
if (first letter of word matches keyword's first letter)//we have a candidate
skip to last letter of the current word //see comment below
if(last letter of word matches keyword's last letter)//strong candidate
iterate backwards to start+1 checking remainder of letters
The reason I suggest to skip to the last letter is because statistically a "consonant, vowel" for the first two letters of a word is significantly high, especially nouns, which will consist of alot of your keywords since any food is a noun (almost all the keyword examples you gave were matched that structure of consonant, vowel). And since there are only 5 vowels(plus y), the likelihood of the second letter "i" showing up in the keyword "pizza" is inherently highly likely, yet after that point there is still a good chance that the word may turn out to not be a match.
However if you know that the first letter and the last letter match, then you probably have a much stronger candidate and can then iterate in reverse. I think over larger sets of data, this would eliminate candidates much faster than checking letters in order. Basically you'd be letting too many fake candidates past the second iteration, thus increasing your overall conditional operations. It might sound like something small, but in a project like this there's lots of reiterating, so micro-optimizations will accumulate very quickly.
If this approach can be applied in a language that's probably structurally very different from English(I'm speaking from ignorance here though), then I think it might provide some efficiency for you whether you make it happen through iterating a char array or with a scanner, or any other construct.

The trick is to realise that if you can describe the string you are searching for as a regular expression you can also, by definition, describe it with a state machine.
At every character in your message start a state machine for every one of your 1600 patterns and pass the character through it. This sounds scary but believe me most of them will terminate immediately anyway so you aren't really doing a huge amount of work. Bear in mind that a state machine can usually be encoded with a simple switch/case or a ch == s.charAt at each step so they are close to the ultimate in light-weight.
Obviously you know what to do whenever one of your search machines terminates at the end of their search. Any that terminate before full-match can be discarded immediately.
private static class Matcher {
private final int where;
private final String s;
private int i = 0;
public Matcher ( String s, int where ) {
this.s = s;
this.where = where;
}
public boolean match(char ch) {
return s.charAt(i++) == ch;
}
public int matched() {
return i == s.length() ? where: -1;
}
}
// Words I am looking for.
String[] watchFor = new String[] {"flies", "like", "arrow", "banana", "a"};
// Test string to search.
String test = "Time flies like an arrow, fruit flies like a banana";
public void test() {
// Use a LinkedList because it is O(1) to remove anywhere.
List<Matcher> matchers = new LinkedList<> ();
int pos = 0;
for ( char c : test.toCharArray()) {
// Fire off all of the matchers at this point.
for ( String s : watchFor ) {
matchers.add(new Matcher(s, pos));
}
// Discard all matchers that fail here.
for ( Iterator<Matcher> i = matchers.iterator(); i.hasNext(); ) {
Matcher m = i.next();
// Should it be removed?
boolean remove = !m.match(c);
if ( !remove ) {
// Still matches! Is it complete?
int matched = m.matched();
if ( matched >= 0 ) {
// Todo - Should use getters.
System.out.println(" "+m.s +" found at "+m.where+" active matchers "+matchers.size());
// Complete!
remove = true;
}
}
// Remove it where necessary.
if ( remove ) {
i.remove();
}
}
// Step pos to keep track.
pos += 1;
}
}
prints
flies found at 5 active matchers 6
like found at 11 active matchers 6
a found at 16 active matchers 2
a found at 19 active matchers 2
arrow found at 19 active matchers 6
flies found at 32 active matchers 6
like found at 38 active matchers 6
a found at 43 active matchers 2
a found at 46 active matchers 3
a found at 48 active matchers 3
banana found at 45 active matchers 6
a found at 50 active matchers 2
There are several simple optimisations. With some simple pre-processing the most obvious is to use the current character to determine which matchers may be applicable.

This is a pretty broad question, so I won't go into too much detail, but roughly:
Pre-process the haystacks using something like broad lemmatizer to create "topic word only" versions of the messages by noting which topics all words in it cover. For example, any occurrences of "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", or "McDonalds" would cause the "topic" word "food" to be collected for that message. Some words may have multiple topics, eg "McDonalds" may be in the topics "food" and "business". Most words won't have any topic.
After this process, you'll have haystacks consisting of only "topic" words. Then create a Map<String, Set<Integer>> and populate it with the topic word and the Set of chat message ids that contain it. This is reverse index of topic word to the chat messages that contain it.
The runtime code to find all documents that contain all n words is then trivial and super fast - near O(#terms):
private Map<String, Set<Integer>> index; // pre-populated
Set<Integer> search(String... topics) {
Set<Integer> results = null;
for (String topic : topics) {
Set<Integer> hits = index.get(topic);
if (hits == null)
return Collections.emptySet();
if (results == null)
results = new HashSet<Integer>(hits);
else
results.retainAll(hits);
if (results.isEmpty())
return Collections.emptySet(); // exit early
}
return results;
}
This will perform near O(1), and tell you which messages share all search terms. If you just want the number, use the trivial size() of the returned Set.

Best way to find character at specific index in a compressed String

I know that we can do this by following ways
StringBuilder
Use substring
But i am looking a way where i have a compressed String say a5b4c2 etc which means a is 5 times b is 4 times etc so String is actually aaaaabbbbcc something like that.
So char at index 2 should return a and char at index 6 should return b.
What can be the best approach for this?
My question is more about what is the best approach to decompress String ?

My question is more about handling this compressed string rather than finding the character at specific index.
Decompress the string until you get the index you want to know. Or you could decompress the whole string and cache it.
What can be the best approach for this?
Without any more specific requirements, I believe the best approach is the simplest approach you can think of.
I would, parse each pair of the letter and the number in turn, reduce the index by that number and if the remaining index is < 0 you have the letter you want.

Check what index you are searching for, and start adding up the numbers of characters. Every time you add, check if the index falls within the previous interval and the current one. If it does, you've found what your character is, otherwise add again.
For example, the workflow given your string a5b4c2, if you want the character at index 7, could be like this:
current position: 0
index we are looking for: 7
add first character's count: 0+5 = 5
does 7 fall within 0 and 5? no, add again
current position: 5
add second character's count: 5+4 = 9
does 7 fall within 5 and 9? yes, so our character must be 'b'.
I'm not sure if this is more efficient or faster than decompressing the string and just using charAt() or something, it's just a different way of approaching it.
EDIT: Since the question is more about how to decompress the string, you could use a StringBuilder and use a for loop to append the correct number of the character to your string... sounds like the simplest way to me.

Spell checker solution in java

I need to implement a spell checker in java , let me give you an example for a string lets say "sch aproblm iseasili solved" my output is "such a problem is easily solved".The maximum length of the string to correct is 64.As you can see my string can have spaces inserted in the wrong places or not at all and even misspelled words.I need a little help in finding a efficient algorithm of coming up with the corrected string. I am currently trying to delete all spaces in my string and inserting spaces in every possible position , so lets say for the word (it apply to a sentence as well) "hot" i generate the next possible strings to afterwords be corrected word by word using levenshtein distance : h o t ; h ot; ho t; hot. As you can see i have generated 2^(string.length() -1) possible strings. So for a string with a length of 64 it will generate 2^63 possible strings, which is damn high, and afterwords i need to process them one by one and select the best one by a different set of parameters such as : - total editing distance (must take the smallest one)
-if i have more strings with same editing distance i have to choose the one with the fewer number of words
-if i have more strings with the same number of words i need to choose the one with the total maximum frequency the words have( i have a dictionary of the most frequent 8000 words along with their frequency )
-and finally if there are more strings with the same total frequency i have to take the smallest lexicographic one.
So basically i generate all possible strings (inserting spaces in all possible positions into the original string) and then one by one i calculate their total editing distance, nr of words ,etc. and then choose the best one, and output the corrected string. I want to know if there is a easier(in terms of efficiency) way of doing this , like not having to generate all possible combinations of strings etc.
EDIT:So i thought that i should take another approach on this one.Here is what i have in mind: I take the first letter from my string , and extract from the dictionary all the words that begin with that letter.After that i process all of them and extract from my string all possible first words. I will remain at my previous example , for the word "hot" by generating all possible combinations i got 4 results , but with my new algorithm i obtain only 2 "hot" , and "ho" , so it's already an improvement.Though i need a little bit of help in creating a recursive or PD algorithm for doing this . I need a way to store all possible strings for the first word , then for all of those all possible strings for the second word and so on and finally to concatenate all possibilities and add them into an array or something. There will still be a lot of combinations for large strings but not as many as having to do ALL of them. Can someone help me with a pseudocode or something , as this is not my strong suit.
EDIT2: here is the code where i generate all the possible first word from my string http://pastebin.com/d5AtZcth .I need to somehow implement this to do the same for the rest and combine for each first word with each second word and so on , and store all these concatenated into an array or something.

A few tips for you:
try correcting just small parts of the string, not everything at once.
90% of erros (IIRC) have 1 edit distance from the source.
you can use a phonetic index to match words against words that sound alike.
you can assume most typos are QWERTY errors (j=>k, h=>g), and try to check them first.
A few more ideas can be found in this nice article:
http://norvig.com/spell-correct.html

Help me understand question related to HashMap in Java

Im given a task which i am a little confused to understand. Here is the question statement:
The following program should read a file and store all its tokens in a member variable.
Your task is to write a single method that returns the number of items in tokenMap, the average length (as double value) of the elements in tokenMap, and the number of tokens starting with character "a".
Here the tokenMap is an object of type HashMap<String, Integer>;
I do have some idea about HashMap but what i want to know the "key value" for HashMap required is a single character or the whole word?? that i should store in tokenMap.
Also how can i compute the average length?

Looks like you have to use the entire word as the key.
The average length of tokens can be computed by summing the lengths of each token and dividing by the number of tokens.
In Java, you can find the number of tokens in the HashMap by tokenMap.size().
You can write loops that visit each member of the map like this:
for(String t: tokenMap.values()){
//t is a token
}
and if you look up String in the Java API docs you will see that it is easy to find the length of a String.

To compute the average length of the items in a hash map, you'll have to iterate over them all and count the length and calculate the average.
As for your other question about what to use for a key, how are we supposed to know? A hashmap can use practically any* value for a key.
*The value must be hashable, which is defined differently for different languages.

Reading the question closely, it seems that you have to read a file, extract each word and use it as the key value, and store the length of each key as the integer:
an example line
leads to a HashMap like this
an : 2
example : 7
line : 4
After you've built your map (made of keys mapping to entries, or seemingly elements in the question), you'll need to run some statistics over it to find
the number of keys (look at HashMap)
the average length of all keys (again, simple enough)
the number beginning with "a" (just look at the String)
Then make a value object containing these values and return it from the method that does the statistics.
I know I've given more information that you require, but someone else may benefit from a little extra help.

Guys there is some confusion. Im not asking for a solution. Im just confused for one thing.
For the time being, im gonna use String type as the key type.
The only confusion i have is once i read the file line by line, should i split it based upon words or based upon each character. So that the key value should be a single character type string or a String of whole word.
If you can go through the question statement, what do you suggest. That's all im asking.

should i split it based upon words or
based upon each character
The requirement is to make tokens, so you should split them based on words. Each word becomes a unique String key. It would make sense for the value to be the count of each token.
If the file you are reading has these three lines:
int alpha;
int beta;
float delta;
Then you should have something like
<"int", 2>
<";", 3>
<"alpha", 1>
<"beta", 1>
<"float", 1>
<"delta", 1>
(The semicolon may or may not be considered a token.)
Your average length would be ( 3x2 + 3x1 + 5 + 4 + 5 + 5) / 6.
Your length of tokens starting with "a" would be 5.0.
Look elsewhere on this forum for keySet and you should be good to go.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.