I am a .NET and JavaScript developer. Now I am working in Java, too.
In .NET LINQ and JavaScript arrow functions we have =>.
I know Java lambdas are not the same, but they are very similar. Are there any reasons (technical or non-technical) that made Java choose -> instead of =>?
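For concreteness, here is a minimal sketch (my own illustration, not from the question) of what the chosen syntax looks like in Java, with the C#/JavaScript equivalent noted in a comment:

import java.util.function.Predicate;

public class ArrowDemo {
    public static void main(String[] args) {
        // Java uses the thin arrow ->
        // (C# and JavaScript would write the same lambda as: s => s.isEmpty())
        Predicate<String> isEmpty = s -> s.isEmpty();
        System.out.println(isEmpty.test(""));  // true
    }
}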
On September 8, 2011, Brian Goetz of Oracle announced to the OpenJDK mailing list that the syntax for lambdas in Java had been mostly decided, but some of the "fine points" like which type of arrow to use were still up in the air:
This just in: the EG has (mostly) made a decision on syntax.
After considering a number of alternatives, we decided to essentially
adopt the C# syntax. We may still deliberate further on the fine points
(e.g., thin arrow vs fat arrow, special nilary form, etc), and have not
yet come to a decision on method reference syntax.
On September 27, 2011, Brian posted another update, announcing that the -> arrow would be used, in preference to C#'s (and the Java prototype's) usage of =>:
Update on syntax: the EG has chosen to stick with the -> form of the
arrow that the prototype currently uses, rather than adopt the =>.
He goes on to provide some description of the rationale considered by the committee:
You could think of this in two ways (I'm sure I'll hear both):
This is much better, as it avoids some really bad interactions with existing operators, such as:
x => x.age <= 0; // duelling arrows
or
Predicate p = x => x.size == 0; // duelling equals
What a bunch of idiots we are, in that we claimed the goal of doing what other languages did, and then made gratuitous changes "just for the sake of doing something different".
Obviously we don't think we're idiots, but everyone can have an opinion :)
In the end, this was viewed as a small tweak to avoid some undesirable
interactions, while preserving the overall goal of "mostly looks like
what lambdas look like in other similar languages."
Howard Lovatt replied in approval of the decision to prefer ->, writing that he "ha[s] had trouble reading Scala code". Paul Benedict of Apache concurred:
I am glad too. Being consistent with other languages is a laudable goal, but
since programming languages aren't identical, the needs for Java can lead to
a different conclusion. The fat arrow syntax does look odd; I admit it. So
in terms of vanity, I am glad to see that punted. The equals character is
just too strongly associated with assignment and equality.
Paigan Jadoth chimed in, too:
I find the "->" much better than "=>". If arrowlings at all instead of the
more regular "#(){...}" pattern, then something definitely distinct from the
gte/lte tokens is clearly better. And "because the others do that" has never
been a good argument, anyway :D.
In summary, then, after considering arguments on both sides, the committee felt that consistency with other languages (=> is used in Scala and C#) was less compelling than clear differentiation from the relational and equality operators (<=, ==), which is how -> won out.
But Lieven Lemiengre was skeptical:
Other languages (such as Scala or Groovy) don't have this problem because
they support some placeholder syntax.
In reality you don't write "x => x.age <= 0;"
But this is very common "someList.partition(x => x.age <= 18)" and I agree
this looks bad. Other languages make this clearer using placeholder syntax
"someList.partition(_.age <= 18)" or "someList.partition(it.age <= 18)"
I hope you are considering something like this, these little closures will
be used a lot!
(And I don't think replacing '=>' with '->' will help a lot)
Other than Lieven, I didn't see anyone on that mailing list who criticized the choice of -> and defended =>. Of course, as Brian predicted, there were almost certainly opinions on both sides, but ultimately a choice simply had to be made in matters like this, and the committee made the one it did for the stated reasons.
How many codes can be returned by KeyEvent.getKeyCode()?
What is the range of the codes returned by KeyEvent.getKeyCode()?
Are the KeyCodes listed on the javadoc for KeyEvent (VK_?) the only KeyCodes that can be returned, or are there more?
It doesn't look like there are any such guarantees, judging by this paragraph from the documentation:
WARNING: Aside from those keys that are defined by the Java language (VK_ENTER, VK_BACK_SPACE, and VK_TAB), do not rely on the values of the VK_ constants. Sun reserves the right to change these values as needed to accommodate a wider range of keyboards in the future.
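To illustrate the practical upshot of that warning, here is a minimal sketch (my own example, not from the answer): compare getKeyCode() against the named VK_ constants rather than their raw numeric values.

import java.awt.event.KeyAdapter;
import java.awt.event.KeyEvent;
import javax.swing.JFrame;

public class KeyCodeDemo {
    public static void main(String[] args) {
        JFrame frame = new JFrame("Key demo");
        frame.addKeyListener(new KeyAdapter() {
            @Override
            public void keyPressed(KeyEvent e) {
                // Portable: compare against the named constant.
                if (e.getKeyCode() == KeyEvent.VK_ENTER) {
                    System.out.println("Enter pressed");
                }
                // Fragile: `if (e.getKeyCode() == 10)` relies on the constant's
                // current numeric value, which the docs say may change.
            }
        });
        frame.setSize(200, 200);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}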
Those are all the key codes that can be returned by the Key* methods.
What would be the best approach to comparing two hexadecimal file signatures against each other for similarity?
More specifically, what I would like to do is take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. My plan is to break the file's hex representation into individual groups of N characters (e.g. 10 hex characters) and do the same with the virus signature. I am aiming to perform some sort of heuristic and therefore statistically check whether this exe file is X% similar to a known virus signature.
The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore exe1[0,9] against virus1[0,9]. Each subset would be graded statistically.
As you can imagine, that would involve a massive number of comparisons and hence be very, very slow. So I thought I would ask whether you can think of a better approach to such a comparison, for example by combining different data structures.
This is for a project I am doing for my BSc, where I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other is based on genetic algorithms to evolve the static virus signature. Any advice, comments, or general information such as resources are very welcome.
Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version while having an apparently different structure (variants). It achieves that through code obfuscation, thus altering its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting/removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. Much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.
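To make the chunking idea above concrete, here is a minimal sketch of one way it could look (my own illustration with hypothetical names, scoring Jaccard overlap of fixed-size n-grams rather than the pairwise grading described):

import java.util.HashSet;
import java.util.Set;

public class NgramSimilarity {
    /** Splits a hex string into overlapping n-character grams. */
    static Set<String> ngrams(String hex, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= hex.length(); i++) {
            grams.add(hex.substring(i, i + n));
        }
        return grams;
    }

    /** Jaccard similarity: |A intersect B| / |A union B|, in [0, 1]. */
    static double similarity(String hexA, String hexB, int n) {
        Set<String> a = ngrams(hexA, n);
        Set<String> b = ngrams(hexB, n);
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Two hex strings differing in one byte score close to, but below, 1.0.
        System.out.println(similarity("4D5A90000300", "4D5A90010300", 4));
    }
}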
Update: After the advice you gave me on what to read, I did the reading, but it somewhat confused me more. I found several distance algorithms that can apply to my problem, such as:
Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer–Moore algorithm
Aho–Corasick algorithm
But now I don't know which to use; they all seem to do the same thing in different ways. I will continue researching so that I can understand each one better, but in the meantime, could you give me your opinion on which might be more suitable, so that I can give it priority in my research and study it more deeply?
Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
There is a copy of the finished paper on GitHub
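For reference, here is a minimal sketch of the Levenshtein distance mentioned above (the textbook dynamic-programming formulation with two rolling rows; my own illustration, not the implementation from the paper):

public class Levenshtein {
    /** Edit distance in O(a.length() * b.length()) time, O(b.length()) space. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // distance from empty prefix
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                // min of insertion, deletion, substitution
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}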
For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).
Also, specifically for polymorphic malware, this field should offer you a lot, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximate searching/matching algorithms to point you to.)
One example from this direction would be to adapt something like the Aho–Corasick algorithm in order to search for several malware signatures at the same time.
Similarly, algorithms like the Boyer–Moore algorithm give you fantastic search runtimes, especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
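As a concrete illustration of the Boyer–Moore family, here is a minimal sketch of the simplified Horspool variant over raw bytes (my own example, not from the answer):

public class HorspoolSearch {
    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int indexOf(byte[] text, byte[] pattern) {
        if (pattern.length == 0) return 0;
        // Bad-character shift table: how far to slide on a mismatch.
        int[] shift = new int[256];
        java.util.Arrays.fill(shift, pattern.length);
        for (int i = 0; i < pattern.length - 1; i++) {
            shift[pattern[i] & 0xFF] = pattern.length - 1 - i;
        }
        int pos = 0;
        while (pos <= text.length - pattern.length) {
            int j = pattern.length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--; // compare right-to-left
            if (j < 0) return pos;                             // full match
            pos += shift[text[pos + pattern.length - 1] & 0xFF];
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = "hello signature world".getBytes();
        byte[] pattern = "signature".getBytes();
        System.out.println(indexOf(text, pattern)); // 6
    }
}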
A number of papers have been published on finding near duplicate documents in a large corpus of documents in the context of websearch. I think you will find them useful. For example, see
this presentation.
There has been a serious amount of research recently into automating the detection of duplicate bug reports in bug repositories. This is essentially the same problem you are facing; the difference is that you are using binary data. These are similar problems because in both you are looking for strings that share the same basic pattern, even though the patterns may have slight differences. A straight-up distance algorithm probably won't serve you well here.
This paper gives a good summary of the problem as well as some approaches in its citations that have been tried.
ftp://ftp.computer.org/press/outgoing/proceedings/Patrick/apsec10/data/4266a366.pdf
As somebody has pointed out, similarity with known strings and bioinformatics problems might help. Longest common substring is very brittle, meaning that one difference can halve the length of such a string. You need a form of string alignment, but more efficient than Smith–Waterman. I would try looking at programs such as BLAST, BLAT or MUMMER3 to see if they fit your needs.

Remember that the default parameters for these programs are based on a biology application (how much to penalize an insertion or a substitution, for instance), so you should probably look at re-estimating the parameters based on your application domain, possibly using a training set. This is a known problem, because even within biology different applications require different parameters (based, for instance, on the evolutionary distance of the two genomes being compared). It is also possible, though, that even at default settings one of these algorithms might produce usable results.

Best of all would be to have a generative model of how viruses change; that could guide you to an optimal choice of distance and comparison algorithm.
Is there any particular reason why binary literals are not included in Java, whereas hex and octal formats are allowed?
Java 7 includes it. Check the new features.
Example:
int binary = 0b1001_1001;
Binary literals were introduced in Java 7. See "Improved Integer Literals":
int i = 0b1001001;
The reason for not including them from day one is most likely the following: Java is a high-level language and has been quite restrictive when it comes to language constructs that are less important and low level. Java developers have had a general policy of "if in doubt, keep it out".
If you're on Java 6 or older, your best option is to do
int yourInteger = Integer.parseInt("100100101", 2);
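One caveat worth adding (my note, not from the answer): Integer.parseInt treats the input as a signed value, so a full 32-bit pattern with the high bit set overflows; parsing as a long and casting works around it:

// Throws NumberFormatException (value exceeds Integer.MAX_VALUE):
// int x = Integer.parseInt("11111111111111111111111111111111", 2);

// Workaround: parse as long, then narrow to int
int allOnes = (int) Long.parseLong("11111111111111111111111111111111", 2); // -1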
Actually, it is, in Java 7.
http://code.joejag.com/2009/new-language-features-in-java-7/
The associated bug has been open since April 2004, has low priority, and is considered a request for enhancement by Sun/Oracle.
I guess they thought binary literals would make the language more complex without providing obvious benefits...
There seems to be an impression here that implementing binary literals is complex. It isn't. It would take about five minutes. Plus the test cases of course.
Java 7 does allow binary literals!
Check this:
int binVal = 0b11010;
at this link:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Perhaps it doesn't matter to the compiler once it optimizes, but in C/C++, I see most people write a for loop in the form of:
for (i = 0; i < arr.length; i++)
where the incrementing is done with the postfix ++. I get the difference between the two forms: i++ returns the current value of i, but then adds 1 to i on the quiet; ++i first adds 1 to i, and returns the new value (1 more than i was).
I would think that i++ takes a little more work, since a previous value needs to be stored in addition to the next value: push *(&i) to the stack (or load it to a register); increment *(&i). Versus ++i: increment *(&i); then use *(&i) as needed.
(I get that the "Increment *(&i)" operation may involve a register load, depending on CPU design. In which case, i++ would need either another register or a stack push.)
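A tiny illustration of that semantic difference (my own example; Java, C, and C++ behave identically here for plain ints):

public class IncrementDemo {
    public static void main(String[] args) {
        int i = 5;
        int post = i++;  // post == 5, i == 6: old value yielded, then increment
        int pre  = ++i;  // pre  == 7, i == 7: increment first, then yield value
        System.out.println(post + " " + pre + " " + i);  // prints: 5 7 7
    }
}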
Anyway, at what point, and why, did i++ become more fashionable?
I'm inclined to believe azheglov: it's a pedagogic thing, and since most of us do C/C++ on a Windows or *nix system where the compilers are of high quality, nobody gets hurt.
If you're using a low-quality compiler or an interpreted environment, you may need to be sensitive to this. Certainly, if you're doing advanced C++ or device-driver or embedded work, hopefully you're well-seasoned enough for this to be no big deal at all. (Do dogs have Buddha-nature? Who really needs to know?)
It doesn't matter which you use. On some extremely obsolete machines, and in certain instances with C++, ++i is more efficient, but modern compilers don't compute the copy if its value isn't used. As to when it became popular to post-increment in for loops, my copy of K&R 2nd edition uses i++ on page 65 (the first for loop I found while flipping through).
For some reason, i++ became more idiomatic in C, even though it creates a needless copy. (I thought that was through K&R, but I see this debated in other answers.) But I don't think there's a performance difference in C, where it's only used on built-ins, for which the compiler can optimize away the copy operation.
It does make a difference in C++, however, where i might be a user-defined type for which operator++() is overloaded. The compiler might not be able to assert that the copy operation has no visible side-effects and might thus not be able to eliminate it.
As for the reason why, here is what K&R had to say on the subject:
Brian Kernighan
you'll have to ask dennis (and it might be in the HOPL paper). i have a
dim memory that it was related to the post-increment operation in the
pdp-11, though beyond that i don't know, so don't quote me.
in c++ the preferred style for iterators is actually ++i for some subtle
implementation reason.
Dennis Ritchie
No particular reason, it just became fashionable. The code produced
is identical on the PDP-11, just an inc instruction, no autoincrement.
HOPL Paper
Thompson went a step further by inventing the ++ and -- operators, which increment or decrement; their prefix or postfix position determines whether the alteration occurs before or after noting the value of the operand. They were not in the earliest versions of B, but appeared along the way. People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed. The PDP-7, however, did have a few ‘auto-increment’ memory cells, with the property that an indirect memory reference through them incremented the cell. This feature probably suggested such operators to Thompson; the generalization to make them both prefix and postfix was his own. Indeed, the auto-increment cells were not used directly in implementation of the operators, and a stronger motivation for the innovation was probably his observation that the translation of ++x was smaller than that of x=x+1.
For integer types the two forms should be equivalent when you don't use the value of the expression. This is no longer true in the C++ world with more complicated types, but the postfix form is preserved in the name of the language itself (C++, not ++C).
I suspect that "i++" became more popular in the early days because that's the style used in the original K&R "The C Programming Language" book. You'd have to ask them why they chose that variant.
Because as soon as you start using ++i, people will be confused and curious. They will halt their everyday work and start googling for explanations. Twelve minutes later they will land on Stack Overflow and create a question like this. And voilà, your employer just spent yet another $10.
Going a little further back than K&R, I looked at its predecessor: Kernighan's C tutorial (~1975). Here the first few while examples use ++n. But each and every for loop uses i++. So to answer your question: Almost right from the beginning i++ became more fashionable.
My theory (why i++ is more fashionable) is that when people learn C (or C++) they eventually learn to code iterations like this:
while( *p++ ) {
...
}
Note that the postfix form is important here (using the prefix form would create an off-by-one bug).
When the time comes to write a for loop where ++i or i++ doesn't really matter, it may feel more natural to use the postfix form.
ADDED: What I wrote above applies to primitive types, really. When coding something with primitive types, you tend to do things quickly and do what comes naturally. That's the important caveat that I need to attach to my theory.
If ++ is an overloaded operator on a C++ class (the possibility Rich K. suggested in the comments) then of course you need to code loops involving such classes with extreme care as opposed to doing simple things that come naturally.
At some level it's idiomatic C code. It's just the way things are usually done. If that's your big performance bottleneck you're likely working on a unique problem.
However, looking at my K&R The C Programming Language, 1st edition, the first instance I find of i in a loop (p. 38) does use ++i rather than i++.
In my opinion it became more fashionable with the creation of C++, as C++ enables you to call ++ on non-trivial objects.
OK, let me elaborate: if you call i++ and i is a non-trivial object, then storing a copy containing the value of i before the increment will be more expensive than for, say, a pointer or an integer.
I think my predecessors are right regarding the side effects of choosing post-increment over pre-increment.
As for its fashionability, it may be as simple as that you start all three expressions within the for statement in the same repetitive way, something the human brain seems to lean towards.
I would add to what other people have told you that the main rule is: be consistent. Pick one, and do not use the other unless it is a specific case.
If the loop body is long, the value has to be reloaded into a register to be incremented before the jump back to the beginning; with ++i you don't need that extra move.
In C, all operators that give a variable a new value, apart from prefix increment/decrement, modify the left-hand variable (i = 2, i += 5, etc.). So in situations where ++i and i++ can be interchanged, many people are more comfortable with i++ because the operator is on the right-hand side, modifying the variable to its left.
Please tell me if that first sentence is incorrect; I'm not an expert in C.