Comma separator confusion in Array and ArrayList - java

Why am I getting an error while initializing an ArrayList with a trailing comma (,), whereas I understand from various websites that an array with trailing comma(s) will create holes, and that iterating over it will skip the holes?
Java:
import java.util.ArrayList;
import java.util.Arrays;

class ProblemA {
    public static void main(String[] args) {
        //HashMap<Integer,Integer> map=new HashMap<>();
        // Trailing comma in a method argument list: compile error.
        ArrayList<Integer> map = new ArrayList<>(Arrays.asList(1, 2, 2, 3, 2,));
        // Trailing comma in an array initializer: compiles fine.
        int[] arr = {1, 2, 2, 3, 2,};
    }
}

Whereas I understand from various websites that an array with trailing comma(s) will create holes, and that iterating over it will skip the holes
Not in Java it doesn't. That's referring to various other languages where you can e.g. type [, , ,] and this makes a list of 4x "nothing". But Java doesn't have that - the closest thing to nothing that Java has is null, but you'd have to type it out. Arrays.asList(null, null, null, null) would make 'a list of 4x nothing' in Java, as near as can be. (List.of won't do here, as it rejects null elements.)
So what about Java's trailing commas?
As per the Java Language Specification, in method invocations you can't just toss a pointless and completely optional comma at the end of an argument list. It's simply not valid Java. However, when writing an array initializer, you CAN do that. Why? Just because the spec says so. At some point your question boils down to "what was the language designer thinking?" - which is beyond the scope of Stack Overflow. You'd have to ask them.
At any rate, you are allowed to do it, and it does nothing at all. There is zero difference between int[] x = {1,2,}; and int[] x = {1,2};. Both produce a 2-sized array with elements 1 and 2. The trailing comma is completely optional and has absolutely no effect. Whatsoever. Identical class files. Go ahead, try it, compile it, run a byte-differ on the class files that come out. Identical.
But... why?
The reason trailing commas are convenient at times is version control systems. Stop thinking about writing 5-line scripts and your first programming lessons for a moment - think about working in a bigger team, producing a product that hundreds of thousands of users depend on, where billions of euros go up in smoke if e.g. a security issue plagues the code base.
You want to be able to work together in teams, which means not all the code you look at was necessarily written by you. That means you may want to ask the system, for any given line: hey, who last changed this line, can you show me the other things they did around that time, and even better, a comment or a link to a ticket in a ticketing system that explains why they did all this, because I need that context to know what's happening.
The solution to that is version control systems, where you 'check in', as a unit of work, a bunch of modifications to existing source files. "Today, I added these 5 lines here, removed these 2 lines here, changed these 2 lines, and added these 2 whole new files. That whole thing comprises the fix for the bug mentioned in ticket 12894". That sort of thing.
Once you have that, you get a convenient route for code review: That your edits aren't just pushed straight into what powers, say, google.com, but that someone else first tests this out and looks at your changes; a second pair of eyes. They can look at exactly what you changed.
And only now can you understand the point of optional trailing commas. They do absolutely nothing, but imagine you have a list of, say, known US states. You might have:
String[] STATES = {
    "Alabama",
    "Arkansas",
    "New york"
};
and then someone wants to add a new state to the list, so they edit that code:
String[] STATES = {
    "Alabama",
    "Arkansas",
    "New york",
    "Hawaii"
};
The problem is, technically they changed 2 lines there. They changed the "New york" line by adding a comma. That 'noises it all up'.
Let's say I notice that it's a bit odd they didn't spell it New York - that Y is customarily capitalized. (Possibly some other system requires that it be all lowercase except the first letter.) I want to see who edited/made that "New york" line, what ticket is associated with it, and when it happened. So I ask the system and I get... your edit. The one that added Hawaii - the one that is utterly irrelevant. Only because of that silly trailing comma. A code reviewer will also have to review that change, which also seems pointless.
Imagine instead that we have a policy of always using that pointless trailing comma that does nothing at all, and we started with:
String[] STATES = {
    "Alabama",
    "Arkansas",
    "New york",
};
instead. Now I can add that Hawaii line (with its own pointless trailing comma), and I haven't noised anything up. My 'commit' (the list of changes I made - 'commit' is a term of art in the version control world) is just the one line adding Hawaii, nothing more.
That's the only reason that Java, or any other language, allows this concept of 'trailing commas in comma-separated lists that do not actually do anything'. They do not create holes or an extra entry. They exist only to keep commits as clean as they can be.
You may correctly conclude that I dislike the fact that you can't put trailing commas in method invocations. I'll make a point of mentioning this when the opportunity arises on amber-dev or whatnot - I'm pretty sure the structure of the language spec can handle it.

Related

String Name; Method 1: At least 2 words, max 4 words, no special signs and only letters

I've been searching around and haven't quite found my answer.
At this moment, my group and I have created a few classes resembling a Bank, with Customer and Account and so on.
I've been struggling lately with trying to improve and secure our code by making our variable called "name" only accept certain inputs.
In this case, I want to make it only possible for the person to enter a name as such:
At least 2 words = (for the word part, I've seen code where you count the white spaces between words, but I don't know yet what you do about the last word, since there won't be a white space after it)
Max 4 words = (same thing here)
No special signs such as ,!%¤"#()=%/'¨. = (for this, I've read something about "Matcher and Pattern")
Now, I'm quite new to Java and I'm not asking for code from someone; I'm asking for someone to point me in the right direction, because a lot of what I've seen, like Matcher and Pattern, looked like things you import by downloading utils and such, but I reckon that's not needed and there should be a simpler, more basic way, as I'm not trying to get ahead of myself by copying code just to get it done.
So yeah, the String "name" is used a lot in our main class "Banklogic", where almost every method that adds something has the variable "name" in it, so it's quite important that I get this done.
I hope I was clear enough, and any help would be appreciated! I'm going to set the alarm for 3 hours before school to see what you guys have come up with, so I can try and complete the code before our meeting! Thanks a lot in advance :)
Since you asked for hints: you can use regex to enforce such rules.
For numbers only:
if (string.matches("[0-9\\s]+"))
    // allow insertion of data, else not
As for rules related to word count:
string.split("\\s+") will create an array of the words, split on whitespace. You can count the number of elements in this array and allow/disallow input based on that.
As for no signs and only letters:
if (string.matches("[a-zA-Z\\s]+"))
    // allow input, else not
You can use a DocumentFilter to implement these checks. A DocumentFilter only allows text to be entered if you permit it to.
I hope this helped as a hint.
Also, note that \\s stands for whitespace. If you don't want to allow whitespace, remove it from the character class.
This is a simple and effective way of doing the task.
EDIT:
This is a class I wrote a little while ago to achieve such tasks, just in case you are interested....
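
Putting the hints together, a minimal sketch (the regex and class name are my own, not from the answer): a single matches() call can enforce all three rules - 2 to 4 words, letters only, single spaces between words.

public class NameValidator {
    // One word of letters, then 1 to 3 more words, each preceded by a single space:
    // i.e. 2 to 4 words total, letters only, no special signs.
    private static final String NAME_PATTERN = "[a-zA-Z]+( [a-zA-Z]+){1,3}";

    static boolean isValidName(String name) {
        return name.matches(NAME_PATTERN); // matches() tests the whole string
    }

    public static void main(String[] args) {
        System.out.println(isValidName("John Smith"));   // true
        System.out.println(isValidName("John"));         // false: only 1 word
        System.out.println(isValidName("A B C D E"));    // false: 5 words
        System.out.println(isValidName("John Sm!th"));   // false: special sign
    }
}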

How to generate all possible sentence from given tokens in Java

I am trying to generate all possible sentences from given tokens. It is a transliteration program: I have various possibilities for each token to be transliterated, and I want to generate all possible sentences. E.g. if the sentence is token1 token2 token3, and supposing token1 can be represented in 3 ways after transliteration, token2 in 2 ways, and token3 in 4 ways, then the total number of possible sentences is 24. I developed a general tree and then performed a depth-first traversal to generate all possible sentences. The problem is that when the sentence becomes long, the number of possibilities increases and I get a "java.lang.OutOfMemoryError: Java heap space" error.
Is there any other way to generate all possible sentences? In some instances I need to generate millions of sentences. Please help!
You can't generate them all at once like that.
Depending on what you need them for, you should either process each one as it is generated or write them to a file.
Another thought, which still might not work, would be to not store every possible value but rather a set of references/relationships. You can make this much more complex with n-grams and Markov chains, or simply have a set of references, or even just a list of array indexes.
So, rather than using storage space as a memory buffer: instead of foo calling gen for the full set, have gen call foo after each sentence is generated.
[EDIT: looking back on this (I was interested to see any other answers), I want to clarify that the function foo is whatever you're using the sentences for, and the function gen generates them (just in case it isn't clear, and especially for anyone whose first language isn't English)]
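
A minimal sketch of that inversion (the names gen and foo follow the answer; everything else is assumed): the generator walks the choices depth-first and hands each finished sentence to a callback, so only the current prefix and one finished sentence are ever in memory.

import java.util.List;
import java.util.function.Consumer;

public class SentenceGenerator {
    // options.get(i) holds the transliteration choices for token i.
    static void gen(List<List<String>> options, Consumer<String> foo) {
        gen(options, 0, new StringBuilder(), foo);
    }

    private static void gen(List<List<String>> options, int depth,
                            StringBuilder prefix, Consumer<String> foo) {
        if (depth == options.size()) {
            foo.accept(prefix.toString().trim());
            return;
        }
        int mark = prefix.length();
        for (String choice : options.get(depth)) {
            prefix.append(choice).append(' ');
            gen(options, depth + 1, prefix, foo);
            prefix.setLength(mark); // backtrack
        }
    }

    public static void main(String[] args) {
        List<List<String>> options = List.of(
                List.of("t1a", "t1b", "t1c"),
                List.of("t2a", "t2b"),
                List.of("t3a", "t3b", "t3c", "t3d"));
        gen(options, System.out::println); // prints all 3 * 2 * 4 = 24 sentences
    }
}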

Would Java indexOf (brute force method) be more practical for me or some other substring algorithm?

I'm looking at finding very short substrings (pattern, needle) in many short lines of text (haystack). However, I'm not quite sure which method to use outside the naive, brute force method.
Background: I'm doing a side project for fun where I receive text messaging chat logs of multiple users (anywhere from 2000-15000 lines of text and 2-50 users), and I want to find all the various pattern matches in the chat logs based on predetermined words that I've come up with. So far I have about 1600 patterns that I'm looking for, but I may look for more.
So for example, I want to find the number of food-related words used in an average text-message log, such as "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", "McDonalds". While I gave English examples, I will actually be using Korean for my program. Each of these designated words has its own respective score, which I store in a HashMap with the word as key and the score as value. I then show the top scorers for food-related words, as well as the most frequent food words used by those users.
My current method is to split each line of text on whitespace, and to process each individual word from the haystack by using the contains method (which uses the indexOf method and the naive substring-search algorithm) to check whether the word contains the pattern:
wordFromInput.contains(wordFromPattern);
To give an example, with 17 users in chat, 13000 lines of text, and the 1600 patterns, I've found that this whole program took 12-13 seconds with this method. And on the Android app that I'm developing, it took 2 minutes and 30 seconds to process, which is far too slow.
Originally, I tried to use a hash map and to merely get the pattern instead of searching for it in the ArrayList, but I then realized that this is not possible with a hash table for what I am trying to do with a substring.
I've looked around through Stackoverflow and found a lot of helpful and related questions, such as these two:
1 and 2. I'm somewhat more familiar with the various string algorithms (Boyer Moore, KMP, etc.)
I initially thought then that the naive method would of course be the worst type of algorithm for my case, but having found this question, I've realized that my case (short pattern, short text), might actually be more effective with the naive method. But I wanted to know if there was something that I was neglecting completely.
Here is a snippet of my code though if anyone wants to see my issue more concretely.
While I removed large parts of the code to simplify it, the primary method that I use to actually match substrings is there in the method matchWords().
I know that's really ugly and bad code (5 for loops...), so if there are any suggestions for that, I'm happy to hear it as well.
So to clean it up:
lines of text from chat logs (2000-10,000+), haystack
1600+ patterns, needle(s)
mostly using Korean characters, although some English is included
Brute force naive method is simply too slow, but debating whether there are other alternatives and even if there are, whether they are practical given the nature of short patterns and text.
I just want some input on my thought process, and possibly some general advice. But additionally, I would like some specific suggestion for a particular algorithm or method if that is possible.
You can replace the hashtable with a Trie.
Split the line of text into words using white space to separate words. Then check if the word is in the Trie. If it is in the Trie, update a counter associated with the word. Ideally, the counter would be integrated into the Trie.
This approach is O(C), where C is the number of characters in the text. It's highly unlikely that you can avoid checking each character at least once, so this approach should be as good as you can get, at least in terms of big O.
However, it sounds like you may not want to list all of the possible words you are searching for. In that case, you could build a counting Trie from all of the words in the text instead. If nothing else, that'll probably make things easier for any pattern-matching algorithm you use, although it might require some modifications to the Trie.
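
A minimal counting-trie sketch (the class and method names here are assumptions, not from the answer): insert each of the ~1600 patterns once, then look up every whitespace-separated word in time proportional to the word's length.

import java.util.HashMap;
import java.util.Map;

class CountingTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count = -1; // -1 means "not the end of a pattern"
    }

    private final Node root = new Node();

    void addPattern(String pattern) {
        Node n = root;
        for (char c : pattern.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.count = 0;
    }

    // Bumps the counter and returns true if word is a known pattern.
    boolean countIfPresent(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return false;
        }
        if (n.count < 0) return false;
        n.count++;
        return true;
    }
}

Splitting each chat line on whitespace and feeding every word to countIfPresent gives the O(C) behaviour described above.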
What you're describing sounds like an excellent use case for the Aho-Corasick string-matching algorithm. This algorithm finds all matches of a set of pattern strings inside of a source string and does so in linear time (plus the time to report the matches). If you have a fixed set of strings to search for, you can do linear preprocessing work up front on the patterns to search for all matches very quickly.
There's a Java implementation of Aho-Corasick available here. I haven't tried it out, but it might be a good match.
Hope this helps!
I'm pretty sure string.contains is already highly optimized, so replacing it with something else is not going to do you a lot of good.
So the way to go, I suspect, is not to look for each and every bank-word in your chat words, but rather do multiple comparisons at once.
The first way to do it would be to create one huge regular expression that will match all your bank-words. Compile it and hope the regular expression package is efficient enough (chances are - it is). You will have a rather lengthy setup stage (the regex compilation), but matches should be a lot faster.
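
A sketch of that setup (the word list and class name are mine, purely illustrative): quote each pattern word, join them with |, compile once, then scan each line in a single pass.

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BigRegexDemo {
    public static void main(String[] args) {
        List<String> bankWords = List.of("hamburger", "pizza", "coke", "lunch");
        // Pattern.quote escapes any regex metacharacters in the words.
        Pattern all = Pattern.compile(bankWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|")));
        Matcher m = all.matcher("had pizza for lunch, then a coke");
        while (m.find()) {
            System.out.println(m.group()); // pizza, lunch, coke
        }
    }
}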
You can build an index of the words you need to match and count them as you process the text. If you can use a HashMap to look up the patterns for each word, the cost will be O(n * m).
You can use a HashMap for all the possible words, and then dissect compound words later.
E.g. say you need to match red and apple; you can combine the sums of
redapple = 1
applered = 0
red = 10
apple = 15
This means that red is actually 11 (10 + 1), and apple is 16 (15 + 1)
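
As a minimal sketch of the counting step (all names here are invented): seed a HashMap with the pattern words, then bump the counter on every exact word hit; compound forms like redapple are dissected and folded in afterwards, as described above.

import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    public static void main(String[] args) {
        // Seed the map with the pattern words.
        Map<String, Integer> counts = new HashMap<>();
        for (String p : new String[] {"red", "apple", "redapple"}) {
            counts.put(p, 0);
        }
        String text = "red apple redapple red";
        for (String word : text.split("\\s+")) {
            counts.computeIfPresent(word, (k, v) -> v + 1); // O(1) per word
        }
        System.out.println(counts); // {red=2, apple=1, redapple=1} (order may vary)
    }
}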
I don't know Korean, so I imagine the strategies used to tinker with strings in English aren't necessarily possible in Korean, but perhaps this strategy, in pseudocode, can be applied with your knowledge of Korean to make it work. (Java is of course still the same, but, for example, in Korean is it still highly likely for the letters "ough" to appear in succession? Are there even letters for "ough"? That said, hopefully the principle can be applied.)
I would use String.toCharArray to create a two-dimensional array (or an ArrayList if a variable size is needed). The idea, in pseudocode:
if (first letter of word matches keyword's first letter)      // we have a candidate
    skip to last letter of the current word                   // see comment below
    if (last letter of word matches keyword's last letter)    // strong candidate
        iterate backwards to start+1, checking the remaining letters
The reason I suggest skipping to the last letter is that, statistically, a "consonant, vowel" start to a word is significantly common, especially among nouns, which will make up a lot of your keywords, since any food is a noun (almost all the keyword examples you gave match that consonant-vowel structure). And since there are only 5 vowels (plus y), the second letter "i" in the keyword "pizza" is highly likely to match, yet after that point there is still a good chance that the word will turn out not to be a match.
However, if you know that the first letter and the last letter match, then you have a much stronger candidate and can then iterate in reverse. I think that over larger sets of data this would eliminate candidates much faster than checking letters in order, because checking in order lets too many false candidates past the second comparison, increasing your overall number of conditional operations. It might sound like something small, but in a project like this there's a lot of reiterating, so micro-optimizations accumulate very quickly.
If this approach can be applied to a language that's probably structurally very different from English (I'm speaking from ignorance here, though), then I think it might provide some efficiency, whether you make it happen by iterating over a char array or with a scanner, or any other construct.
The trick is to realise that if you can describe the string you are searching for as a regular expression you can also, by definition, describe it with a state machine.
At every character in your message start a state machine for every one of your 1600 patterns and pass the character through it. This sounds scary but believe me most of them will terminate immediately anyway so you aren't really doing a huge amount of work. Bear in mind that a state machine can usually be encoded with a simple switch/case or a ch == s.charAt at each step so they are close to the ultimate in light-weight.
Obviously you know what to do whenever one of your search machines terminates at the end of their search. Any that terminate before full-match can be discarded immediately.
private static class Matcher {
    private final int where;
    private final String s;
    private int i = 0;

    public Matcher(String s, int where) {
        this.s = s;
        this.where = where;
    }

    public boolean match(char ch) {
        return s.charAt(i++) == ch;
    }

    public int matched() {
        return i == s.length() ? where : -1;
    }
}
// Words I am looking for.
String[] watchFor = new String[] {"flies", "like", "arrow", "banana", "a"};

// Test string to search.
String test = "Time flies like an arrow, fruit flies like a banana";

public void test() {
    // Use a LinkedList because it is O(1) to remove anywhere.
    List<Matcher> matchers = new LinkedList<>();
    int pos = 0;
    for (char c : test.toCharArray()) {
        // Fire off all of the matchers at this point.
        for (String s : watchFor) {
            matchers.add(new Matcher(s, pos));
        }
        // Discard all matchers that fail here.
        for (Iterator<Matcher> i = matchers.iterator(); i.hasNext(); ) {
            Matcher m = i.next();
            // Should it be removed?
            boolean remove = !m.match(c);
            if (!remove) {
                // Still matches! Is it complete?
                int matched = m.matched();
                if (matched >= 0) {
                    // Todo - Should use getters.
                    System.out.println(" " + m.s + " found at " + m.where + " active matchers " + matchers.size());
                    // Complete!
                    remove = true;
                }
            }
            // Remove it where necessary.
            if (remove) {
                i.remove();
            }
        }
        // Step pos to keep track.
        pos += 1;
    }
}
prints
flies found at 5 active matchers 6
like found at 11 active matchers 6
a found at 16 active matchers 2
a found at 19 active matchers 2
arrow found at 19 active matchers 6
flies found at 32 active matchers 6
like found at 38 active matchers 6
a found at 43 active matchers 2
a found at 46 active matchers 3
a found at 48 active matchers 3
banana found at 45 active matchers 6
a found at 50 active matchers 2
There are several simple optimisations. With some simple pre-processing the most obvious is to use the current character to determine which matchers may be applicable.
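
For example, a sketch of that pre-processing step (assuming the same fields as the code above, plus the usual java.util imports): index the watch words by their first character, then start only the plausible matchers at each position.

// Build once, before the scan.
Map<Character, List<String>> byFirstChar = new HashMap<>();
for (String s : watchFor) {
    byFirstChar.computeIfAbsent(s.charAt(0), k -> new ArrayList<>()).add(s);
}

// Then, inside the main loop, replace "for (String s : watchFor)" with:
for (String s : byFirstChar.getOrDefault(c, Collections.emptyList())) {
    matchers.add(new Matcher(s, pos));
}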
This is a pretty broad question, so I won't go into too much detail, but roughly:
Pre-process the haystacks using something like a broad lemmatizer to create "topic word only" versions of the messages, by noting which topics all the words in them cover. For example, any occurrence of "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", or "McDonalds" would cause the topic word "food" to be collected for that message. Some words may have multiple topics, e.g. "McDonalds" may be in the topics "food" and "business". Most words won't have any topic.
After this process, you'll have haystacks consisting only of topic words. Then create a Map<String, Set<Integer>> and populate it with each topic word and the Set of chat-message ids that contain it. This is a reverse index from topic word to the chat messages that contain it.
The runtime code to find all documents that contain all n words is then trivial and super fast - near O(#terms):
private Map<String, Set<Integer>> index; // pre-populated

Set<Integer> search(String... topics) {
    Set<Integer> results = null;
    for (String topic : topics) {
        Set<Integer> hits = index.get(topic);
        if (hits == null)
            return Collections.emptySet();
        if (results == null)
            results = new HashSet<Integer>(hits);
        else
            results.retainAll(hits);
        if (results.isEmpty())
            return Collections.emptySet(); // exit early
    }
    return results;
}
This will perform near O(1), and tell you which messages share all search terms. If you just want the number, use the trivial size() of the returned Set.

Is it possible to automate generation of wrong choices from a correct word?

The following list contains 1 correct word, "disastrous", and other incorrect words which sound like the correct word:
A. disastrus
B. disasstrous
C. desastrous
D. desastrus
E. disastrous
F. disasstrous
Is it possible to automate generation of wrong choices given a correct word, through some kind of java dictionary API?
No, there is nothing like that in the Java API. But you can write a simple algorithm which will do the job.
Just make up some rules about letter permutations and doubling, and add the generated words to a Set until you have enough words.
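
A minimal sketch of such rules (everything here is my own invention, purely illustrative): swap vowels for other vowels and double consonants, collecting candidates in a Set until enough distinct wrong choices exist.

import java.util.LinkedHashSet;
import java.util.Set;

public class WrongChoiceGenerator {
    public static void main(String[] args) {
        System.out.println(generate("disastrous", 5));
    }

    static Set<String> generate(String word, int howMany) {
        Set<String> out = new LinkedHashSet<>();
        String vowels = "aeiou";
        for (int i = 0; i < word.length() && out.size() < howMany; i++) {
            char c = word.charAt(i);
            if (vowels.indexOf(c) >= 0) {
                // Rule 1: swap the vowel for another vowel ("disastrous" -> "desastrous").
                for (char v : vowels.toCharArray()) {
                    if (v != c && out.size() < howMany) {
                        out.add(word.substring(0, i) + v + word.substring(i + 1));
                    }
                }
            } else {
                // Rule 2: double the consonant ("disastrous" -> "dissastrous").
                out.add(word.substring(0, i + 1) + c + word.substring(i + 1));
            }
        }
        out.remove(word); // never offer the correct spelling as a wrong choice
        return out;
    }
}

Real distractors would want phonetically-aware rules (e.g. ous -> us), but the add-to-a-Set-until-enough structure stays the same.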
There are a number of algorithms for matching words by sound - 'soundex' is the one that springs to mind, but I remember uncovering a few when I did some research on this a couple of years ago. I expect the problem you would find is that they take a word and return a value that represents how the word sounds so you can see if two spellings sound similar (so the words in the question should generate similar values); but I expect doing the reverse, i.e. taking the value and generating similar sounding spellings, would be quite hard.

Are there some better ways to implement find as you type in Java with a fairly small data set?

I've got about 2500 short phrases in a file. I want to be able to find phrases as I type possible substrings of them. My app has a text box and a list of phrases. The text box is initially empty and the list contains all 2500 phrases, since the empty string is a substring of all of them. As I type in the text box, the list updates so that it always only contains phrases which contain the text box's value as a substring.
At the moment I have one of Google's Multimaps, specifically:
LinkedHashMultimap<String, String>
with every single possible substring mapped to its possible matches. This takes a while to load (about a second) and I think it must take up quite a bit of space (which may be a concern in the future). It's very fast with the lookups, though.
Is there a way I could do this with some other data structure or strategy that would be quicker to load and take less space (possibly at the expense of the speed of the lookups)?
If your list only contains 2500 elements, a simple loop and checking contains() on all of them should be fast enough.
If it grows bigger and/or is too slow, you can apply some easy optimizations:
Don't search immediately as the user types each character, but introduce a short delay. So if they type "foobar" really fast, you only search for "foobar", not first "f", then "fo", then "foo", ...
Reuse your previous results: if the user first types "foo" and then extends it to "foobar", don't search the whole original list again, but search inside the results for "foo" (because everything that contains "foobar" must contain "foo").
In my experience, these basic optimizations already get you quite far.
Now, if the list grows so big that even that is too slow, some "smarter" optimizations as proposed in other answers here (tries, suffix trees,...) would be needed.
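
A minimal sketch of the second optimization above (the class and field names are mine): remember the previous query and its results, and filter that smaller list whenever the new query merely extends the old one.

import java.util.ArrayList;
import java.util.List;

class IncrementalFilter {
    private final List<String> all;
    private String lastQuery = "";
    private List<String> lastResults;

    IncrementalFilter(List<String> phrases) {
        this.all = phrases;
        this.lastResults = phrases;
    }

    List<String> filter(String query) {
        // Only the previous results can contain an extended query.
        List<String> source = query.startsWith(lastQuery) ? lastResults : all;
        List<String> results = new ArrayList<>();
        for (String phrase : source) {
            if (phrase.contains(query)) results.add(phrase);
        }
        lastQuery = query;
        lastResults = results;
        return results;
    }
}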
You'll want to look into using the Trie data structure.
Try simply looping over the entire list and calling contains() - doing that 2500 times is probably completely unnoticeable.
You definitely need a Suffix Tree (wiki).
(I think this implementation could be OK: link)
EDIT:
I've read your comment; you shouldn't blindly check whether the string is a substring anywhere in your phrase - you usually start with a word, not with a space. So maybe it's better to tokenize the words inside your phrase?
Are you allowed to do that? Otherwise, the best way is to build an automaton for every phrase, or to use similar algorithms (for example the Karp-Rabin string-search algorithm).
Wouter Coekaerts has a good approach, but I would go a bit further.
Don't bring up anything when the textbox contains a single character. The results won't be useful. You may find that this is true for two characters as well.
Precompute the results for two characters (see the sketch at the end of this answer). When there are two characters, bring up the precomputed list.
When a third character is added do the 'contains' search on the list you have currently displayed (anything that doesn't contain c1c2 can't contain c1c2c3). By now the list should be small enough that 'contains' has perfectly adequate performance.
Similarly for four characters etc.
As said above, put in a little delay before starting the search. Or, better still, arrange for a search to be killed if another character is typed before it finishes.
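
A sketch of that precomputation (names assumed, with phrases being the 2500-entry list): map every two-character substring to the set of phrases containing it, built once at startup; longer queries are then refined with contains() over the shrinking candidate list.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TwoCharIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    TwoCharIndex(List<String> phrases) {
        for (String phrase : phrases) {
            for (int i = 0; i + 2 <= phrase.length(); i++) {
                index.computeIfAbsent(phrase.substring(i, i + 2),
                        k -> new LinkedHashSet<>()).add(phrase);
            }
        }
    }

    // Candidates for a two-character query; refine longer queries with contains().
    List<String> lookup(String twoChars) {
        return new ArrayList<>(index.getOrDefault(twoChars, Collections.emptySet()));
    }
}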
