Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I want to be able to compare a number of files (at most 30) against each other in order to find some degree of similarity. It wouldn't need to be perfect; I just want some sort of red flag if two files are unusually similar. What would be a good way to go about this?
You could use regular expressions (commonly known as regex; see the Python regex docs). Using grouping, you could find variable and function names, unique lines of code (lines that aren't whitespace or comments), etc.
However, creating a system smart enough to detect similarities on its own can be very difficult. If you had some way of computing a similarity score between 0 and 1 for a pair of files, you could test it against a high threshold. Anything over the threshold (say, 0.97) could be considered suspicious.
Besides looking at the code itself, you could also look at code density in the files. Imagine you printed out a page of code and turned it 90 degrees. You essentially get a graph of the length of each line in the file. From that, you can see peaks and valleys where the code is more or less dense. Two similar files may have the same or very close density profiles. With this method you also don't have to worry about matching variable or function names, since you aren't looking at the code itself so much as at how it's organized.
Fleshing out @mgilson's comment, here's a very simple start:
import difflib

def file_similarity(path1, path2):
    "Return float in [0., 1.] giving some measure of file similarity."
    # Read as text: in Python 3, iterating over bytes yields ints, which
    # would make the `ch in " \t"` junk test below raise TypeError.
    with open(path1, "r", errors="replace") as f1, \
         open(path2, "r", errors="replace") as f2:
        s = difflib.SequenceMatcher(
                lambda ch: ch in " \t",  # don't sync on blanks or tabs
                f1.read(),
                f2.read())
    return s.ratio()
Read the SequenceMatcher docs for more. In particular, if you have many files to compare, it's much more efficient to reuse a SequenceMatcher object (see the set_seq1() and set_seq2() methods). And if you're using a threshold, as the accepted answer suggested, see the real_quick_ratio() and quick_ratio() methods to slash the time more.
To get better results, I'd feed the files through a normalization transformation first, primarily to replace tab characters with spaces (tabs and spaces are as different to character comparison as, say, 'a' and '/', but the distinction is invisible to the human eye). Removing all-whitespace lines would probably help too.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I'm trying to learn Java. My current assignment is to build a simple four-function calculator. This would be easy with if/else and/or switch statements, but I'm supposed to build it using methods.
The original input has to be entered as a single string, so, in my mind, I'm going to have to take that string and create substrings, then somehow convert these substrings into double values, while deleting whatever whitespace might be between characters. My current idea is to somehow identify the "+", "-", "*", or "/" within the string and divide it into the substrings before and after the operator, then use the method defined for that operator to do the calculation.
The problem is that I can't see a good way to divide these up into substrings or how to convert the numbers involved into double values. Anyone got any advice for me? Keep in mind, what we have gone through is pretty limited and I feel like I'm missing something REALLY simple out there.
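You can split a string on a particular character using str.split("\\+"), for example (the argument is a regular expression, which is why the + has to be escaped). You can then convert each piece to a double with Double.parseDouble(piece), after calling trim() on it if there may be surrounding whitespace.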
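To make the idea concrete, here is a minimal sketch of one way to combine it with one method per operation. It assumes the input is always of the form "number operator number"; the class and method names (FourFunctionCalculator, evaluate, and so on) are made up for illustration, and it looks for the operator with charAt rather than split so that it knows which operator was found.

```java
class FourFunctionCalculator {
    // One method per operation, as the assignment asks for.
    static double add(double a, double b) { return a + b; }
    static double subtract(double a, double b) { return a - b; }
    static double multiply(double a, double b) { return a * b; }
    static double divide(double a, double b) { return a / b; }

    // Assumes input like "3 + 4" or "10/4": exactly one binary operator.
    static double evaluate(String input) {
        // Drop all whitespace so "3 + 4" becomes "3+4".
        String expr = input.replaceAll("\\s+", "");
        // Start at index 1 so a leading minus sign is not taken as the operator.
        for (int i = 1; i < expr.length(); i++) {
            char op = expr.charAt(i);
            if (op == '+' || op == '-' || op == '*' || op == '/') {
                double left = Double.parseDouble(expr.substring(0, i));
                double right = Double.parseDouble(expr.substring(i + 1));
                switch (op) {
                    case '+': return add(left, right);
                    case '-': return subtract(left, right);
                    case '*': return multiply(left, right);
                    default:  return divide(left, right);
                }
            }
        }
        throw new IllegalArgumentException("No operator found in: " + input);
    }
}
```

Calling FourFunctionCalculator.evaluate("3 + 4") goes through add. Anything more complicated (several operators, precedence, exponent notation) is outside what this sketch handles.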
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Consider the following strings:
he llo
goodbye
hello
= (goodbye)
(he)(llo)
good bye
helium
I'm trying to sort these so that similar words end up together. I know that:
alphanumeric sorting is not an option
removing special characters (", -, _, etc.) and then comparing certainly helps, but the results won't be as good as I hope for.
NOTE:
There may be several acceptable outputs for this, one of which is:
DESIRED OUTPUT:
hello
he llo
(he)(llo)
helium
goodbye
good bye
= (goodbye)
So my question is whether there is a Java package that compares strings and ultimately sorts them based on their similarity.
I've heard of terms such as n-gram and skip-gram but didn't quite understand them. I'm not even sure whether they would be useful for me at all.
UPDATE:
Finding similarities is certainly part of my question, but the main problem is the sorting part.
Here's one possible approach.
Calculate the edit distance (Levenshtein distance) between each pair of strings, then view the strings as a complete graph whose edge weights are the edit distances. Choose a threshold for those weights and remove all the edges whose weights are too high. Then find the cliques in this graph. If your threshold is fairly low, perhaps even finding connected components would be an option.
Note:
Perhaps it would be better to replace the edit distance with one of the similarity measures in the link that @dognose posted.
Also, note that finding cliques will be very slow if you have a large number of strings.
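As a rough sketch of the simpler connected-components variant mentioned above (not clique finding), something like the following could work. The class name, method names, and the threshold value are assumptions for illustration; the distance is the plain textbook Levenshtein distance computed with two rows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SimilarityGrouping {
    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Two strings are linked when their distance is <= threshold;
    // each connected component of that graph becomes one group.
    static List<List<String>> group(List<String> words, int threshold) {
        int n = words.size();
        boolean[] seen = new boolean[n];
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < n; start++) {
            if (seen[start]) continue;
            List<String> component = new ArrayList<>();
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            seen[start] = true;
            while (!stack.isEmpty()) {
                int u = stack.pop();
                component.add(words.get(u));
                for (int v = 0; v < n; v++) {
                    if (!seen[v]
                            && levenshtein(words.get(u), words.get(v)) <= threshold) {
                        seen[v] = true;
                        stack.push(v);
                    }
                }
            }
            groups.add(component);
        }
        return groups;
    }
}
```

Printing the groups one after another gives an ordering in which similar strings sit next to each other, which is the "sorting" the question asks about. The threshold needs tuning for your data: too high and everything merges into one group, too low and every string ends up alone.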
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am not new to Java, but when I write a program I always use int for my variables. I want to know when I should use int, when byte, when long, and so on. Can you explain this with examples, please?
If you are asking when to use float, double, long, etc., this document can help you understand.
For example, int is 32 bits but long is 64 bits. If you need to store a value that doesn't fit in 32 bits, you should use long.
Good luck.
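A small sketch of the int versus long point, with made-up class and method names: adding 1 to the largest int silently wraps around, while doing the same arithmetic in long gives the value you would expect.

```java
class IntegerWidths {
    // int arithmetic is 32-bit: results outside the int range wrap around.
    static int sumAsInt(int a, int b) {
        return a + b;
    }

    // Widening one operand to long makes the addition 64-bit.
    static long sumAsLong(int a, int b) {
        return (long) a + b;
    }
}
```

Here sumAsInt(Integer.MAX_VALUE, 1) yields Integer.MIN_VALUE, whereas sumAsLong(Integer.MAX_VALUE, 1) yields 2147483648L.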
It basically depends on how you want to use them and what considerations you have (size, data type, etc.). I would recommend going through Oracle's docs.
The types you choose should come from careful thought about a number of things, including (but not limited to) int over long (32-bit vs 64-bit), char over byte (user friendliness vs performance), complex data structures over simple ones (performance), backwards compatibility (JRE version), the platforms the program is going to run on (Windows? Unix? Mac OS?), readability of the code (sometimes writing "byte x = 0x61; char ch = (char) x;" is worse than "char ch = 'a';"), and the list goes on. Of course, some of the things I mentioned fit into more than one category.
This usually comes with experience. The more you code, the more platforms you want to support, the faster you want your program to respond etc...
You should always have a plan regarding your program:
What platforms will I support?
Is the task more important than performance?
...
...
I'm not saying you should agonize over every type you choose; I'm saying you should always make the effort to tick all the boxes on your plan and be satisfied that you accomplished everything you wanted.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I am curious as to why arrays are used when you could use an ArrayList instead. Wouldn't it always be better to use an ArrayList?
Check out this comparison.
As you can see, there are important differences between the two constructs. You'll find APIs using one or the other (or both), and you have to understand the pros/cons and the functional differences between the two.
One particular difference is that a native array can store primitives without the inefficiencies of boxing/unboxing. That's significant when you have sizeable arrays representing data streams / data sets.
Note also that an ArrayList is not covariant. That is, an Integer[] is a Number[], but an ArrayList<Integer> is not an ArrayList<Number>. See here for more details.
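A short sketch of what the covariance difference looks like in practice (the class and method names are invented for illustration). Array covariance compiles but can fail at runtime, while the generic equivalent is rejected at compile time unless you use a bounded wildcard.

```java
import java.util.List;

class CovarianceDemo {
    // Arrays are covariant: an Integer[] can be referenced as a Number[],
    // but storing a non-Integer through that reference fails at runtime.
    static boolean arrayStoreFails() {
        Number[] numbers = new Integer[1];
        try {
            numbers[0] = 3.14; // a Double going into an Integer[]
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    // Generics are invariant, so this would not compile:
    //     List<Number> bad = new ArrayList<Integer>();
    // A bounded wildcard accepts a List<Integer> for read-only use.
    static double sum(List<? extends Number> values) {
        double total = 0;
        for (Number n : values) {
            total += n.doubleValue();
        }
        return total;
    }
}
```

So the array version pushes the type error to runtime, and the list version catches it when compiling.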
Arrays and ArrayLists serve different, though sometimes overlapping, purposes.
But aside from that, the simple fact is that you don't code in a vacuum, and so you're going to use APIs that involve arrays, so learning about them isn't optional. Arrays are an absolutely fundamental structure in computer science.
There are many things you can do with arrays that you can't do with ArrayLists. For example,
Read multiple bytes or code-units from a file using a byte[] or a char[]
Call a native numeric or graphics library that deals with vectors, or matrices represented as float[]s or double[]s
You can use other non-primitive-array abstractions instead of ArrayList, but sometimes the best way to represent a contiguous region of memory divided into numeric units is using an array.
It is very important to have a proper foundation in programming fundamentals: things like variable types, functions, arguments, arrays, and so on. Without it, everything else you learn rests on shaky ground. In other words, let's say you want to learn a language: you could start by memorizing whole phrases, but it is better to get some words rolling off the tongue and work out how sentences are built from words, instead of brute-forcing it and memorizing every sentence you need.
It is important for you to learn arrays if you wish to have a better understanding of what happens "behind the scenes", or if you want to learn a lower level language like C or assembly languages.
May I ask where you are learning Java? Is this for school/college? If so, please try to get something out of working with Arrays, because they will be used in courses like Data Structures and especially if you start using a language that doesn't have an ArrayList equivalent.
Learn all you can about it. Understanding what's happening is rewarding.
String[] array = { "I", "DO", "NOT", "RESIZE" };
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
What would be the best way to write a Java program that would simulate machine code? For example, I need to create a series of instructions such as add, subtract, increment, decrement, etc.
Let's say I'm writing the add instruction which accepts 3 parameters/registers (adding the values in the first 2 registers and storing the result in the 3rd). Is it as simple as writing a function such as:
int add(int x, int y) {
    int result;
    result = x + y;
    return result;
}
I'm also open to the possibility that I'm way off base here. Any input would be much appreciated.
If you just want to write Java code that will be more or less 1:1 with machine instructions I'd suggest you create variables for all of the registers and define methods for most of the instructions (similar to what you suggested). But this will not "restrict" what you can do the way real machine instructions do, since you can multiply the BX reg by the AX reg when the machine may not allow that.
Better would be to define a class that represents the machine state (i.e., registers and RAM), with methods on the class for all of the instructions. Then you couldn't multiply BX by AX unless there were a MUL_BX_AX method. Many methods would not have parameters (because the registers are inside the "opaque" object), but some would have parameters where the "real" instructions accept an offset or similar. (E.g., ADD_AX_IMMED(5).)
Added: There is the issue of branching, though, that would require some additional thought. Java doesn't have a GOTO equivalent that would fill the role very well, so initially (until you think of something better) you might have to use standard if/else logic, et al, testing "condition codes" in the machine state class.
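A minimal sketch of that machine-state idea, assuming an invented two-register machine with a single zero flag. None of these method names come from a real instruction set; they just follow the ADD_AX_IMMED naming style used above.

```java
class MachineState {
    // Registers and condition code live inside the object.
    private int ax;
    private int bx;
    private boolean zeroFlag;

    void MOV_AX_IMMED(int value) { ax = value; }
    void MOV_BX_IMMED(int value) { bx = value; }

    // Arithmetic instructions update the zero flag, as many real CPUs do.
    void ADD_AX_IMMED(int value) { ax += value; zeroFlag = (ax == 0); }
    void ADD_AX_BX() { ax += bx; zeroFlag = (ax == 0); }
    void DEC_BX() { bx--; zeroFlag = (bx == 0); }

    boolean isZero() { return zeroFlag; }
    int getAX() { return ax; }

    // Example "program": AX = n + (n-1) + ... + 1, for n >= 1.
    static int sumDownFrom(int n) {
        MachineState m = new MachineState();
        m.MOV_AX_IMMED(0);
        m.MOV_BX_IMMED(n);
        do {
            m.ADD_AX_BX();
            m.DEC_BX();
        } while (!m.isZero()); // stands in for a "jump if not zero" branch
        return m.getAX();
    }
}
```

The do/while at the end is the ordinary Java control flow standing in for a conditional jump: it tests the condition code set by the last instruction, DEC_BX.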
The best way to simulate assembly would be to handle the raw bits and bytes and do the operations accordingly.
Sure you could do that, but the big thing is how to switch on the op-codes and do all the address-field calculations.
Typically, address fields can contain literal constants, global addresses, registers, offsets relative to registers, etc.
It depends if you're simulating a simple machine or a real one.
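For the simple-machine case, switching on op-codes could look roughly like this. The instruction encoding (one int per op-code and per operand), the op-code numbers, and the four-register file are all assumptions made up for the sketch, not any real architecture.

```java
class TinyMachine {
    // Hypothetical op-codes; operands follow the op-code in the program array.
    static final int HALT = 0; // HALT
    static final int LOAD = 1; // LOAD reg, immediate
    static final int ADD  = 2; // ADD  srcA, srcB, dest
    static final int SUB  = 3; // SUB  srcA, srcB, dest
    static final int INC  = 4; // INC  reg
    static final int DEC  = 5; // DEC  reg
    static final int JNZ  = 6; // JNZ  reg, target (jump if reg != 0)

    final int[] registers = new int[4];
    int pc; // program counter: index of the next word to fetch

    void run(int[] program) {
        pc = 0;
        while (true) {
            int opcode = program[pc++];
            switch (opcode) {
                case HALT:
                    return;
                case LOAD: {
                    int reg = program[pc++];
                    registers[reg] = program[pc++];
                    break;
                }
                case ADD: {
                    int a = program[pc++], b = program[pc++], dest = program[pc++];
                    registers[dest] = registers[a] + registers[b];
                    break;
                }
                case SUB: {
                    int a = program[pc++], b = program[pc++], dest = program[pc++];
                    registers[dest] = registers[a] - registers[b];
                    break;
                }
                case INC:
                    registers[program[pc++]]++;
                    break;
                case DEC:
                    registers[program[pc++]]--;
                    break;
                case JNZ: {
                    int reg = program[pc++];
                    int target = program[pc++];
                    if (registers[reg] != 0) pc = target; // branch by rewriting pc
                    break;
                }
                default:
                    throw new IllegalStateException("Unknown opcode " + opcode);
            }
        }
    }
}
```

Because the program is data and branching is just an assignment to pc, this style needs no GOTO in Java. A real machine would add memory, richer addressing modes, and flags on top of this.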