I am attempting to implement the following Basic Sliding Window algorithm in Java. I get the basic idea of it, but I am a bit confused by some of the wording, specifically the sentence in bold:
A sliding window of fixed width w is moved across the file,
and at every position k in the file, the fingerprint of
its content is computed. Let k be a chunk boundary
(i.e., Fk mod n = 0). Instead of taking the hash of the
entire chunk, we choose the numerically smallest fingerprint
of a sliding window within this chunk. Then we compute a hash
of this randomly chosen window within the chunk. Intuitively,
this approach would permit small edits within the chunks to
have less impact on the similarity computation. This method
produces a variable length document signature, where the
number of fingerprints in the signature is proportional to
the document length.
Please see my code/results below. Am I understanding the basic idea of the algorithm? As per the text in bold, what does it mean to "choose the numerically smallest fingerprint of a sliding window within its chunk"? I am currently just hashing the entire chunk.
code:
import java.util.ArrayList;
import java.util.List;

public class BSW {

    /**
     * @param args
     */
    public static void main(String[] args) {
        int w = 15; // fixed width of sliding window

        char[] chars = ("Once upon a time there lived in a certain village a little "
                + "country girl, the prettiest creature who was ever seen. Her mother was "
                + "excessively fond of her; and her grandmother doted on her still more. This "
                + "good woman had a little red riding hood made for her. It suited the girl so "
                + "extremely well that everybody called her Little Red Riding Hood.")
                .toCharArray();

        List<String> fingerprints = new ArrayList<String>();

        for (int i = 0; i < chars.length; i = i + w) {
            StringBuffer sb = new StringBuffer();
            if (i + w < chars.length) {
                sb.append(chars, i, w);
            } else {
                sb.append(chars, i, chars.length - i);
            }
            System.out.println(i + ". " + sb.toString());
            fingerprints.add(hash(sb));
        }
    }

    private static String hash(StringBuffer sb) {
        // TODO: implement hash (MD5); returns the raw text for now
        return sb.toString();
    }
}
results:
0. Once upon a tim
15. e there lived i
30. n a certain vil
45. lage a little c
60. ountry girl, th
75. e prettiest cre
90. ature who was e
105. ver seen. Her m
120. other was exces
135. sively fond of
150. her; and her gr
165. andmother doted
180. on her still m
195. ore. This good
210. woman had a lit
225. tle red riding
240. hood made for h
255. er. It suited t
270. he girl so extr
285. emely well that
300. everybody call
315. ed her Little R
330. ed Riding Hood.
That is not a sliding window. All you have done is break up the input into disjoint chunks. An example of a sliding window would be
Once upon a time
upon a time there
a time there lived
etc.
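For a character-based window like the one in your code, an actual sliding loop might look like the following minimal sketch (the class name is mine; it only prints the windows instead of fingerprinting them):
public class SlidingWindowDemo {
    public static void main(String[] args) {
        String text = "Once upon a time there lived in a certain village";
        int w = 15; // fixed window width

        // The window advances one character per step, so consecutive
        // windows overlap in all but one character.
        for (int i = 0; i + w <= text.length(); i++) {
            System.out.println(i + ". " + text.substring(i, i + w));
        }
    }
}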
The simple answer is NO, as far as I understand (I studied the sliding window algorithm years ago, so I remember the principles but not all of the details; correct me if you have a more insightful understanding).
As the name 'Sliding Window' suggests, your window should slide, not jump, as your quote says:
at every position k in the file, the fingerprint of its content is computed
That is to say, the window slides by one character each time.
To my knowledge, the concepts of chunk and window should be distinguished, as should fingerprint and hash, although they can coincide. Since computing a full hash as the fingerprint is too expensive, a Rabin fingerprint is a more appropriate choice. A chunk is a large block of text in the document, and a window highlights a small portion within a chunk.
IIRC, the sliding window algorithm works like this:
The text file is chunked first;
For each chunk, you slide the window (a 15-char block in your running case) one character at a time and compute a fingerprint for each window position (a sketch follows this list);
You now have the fingerprints of the chunk, whose number is proportional to the length of the chunk.
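To tie this back to the sentence in bold: below is a rough sketch of steps 2 and 3 that also keeps the numerically smallest window fingerprint of each chunk, which is how I read the quoted text. It is only an illustration under my assumptions: it uses fixed-size chunks and String.hashCode() as a stand-in fingerprint, whereas the paper derives chunk boundaries from a Rabin fingerprint (Fk mod n = 0) and would use a rolling fingerprint for the windows. The class and method names are mine.
import java.util.ArrayList;
import java.util.List;

public class MinWindowFingerprint {

    // Stand-in fingerprint; a real implementation would use a rolling
    // Rabin fingerprint so each window costs O(1) instead of O(w).
    private static long fingerprint(String window) {
        return window.hashCode() & 0xffffffffL;
    }

    // For one chunk: slide a window of width w one character at a time
    // and return the numerically smallest window fingerprint.
    private static long minWindowFingerprint(String chunk, int w) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i + w <= chunk.length(); i++) {
            min = Math.min(min, fingerprint(chunk.substring(i, i + w)));
        }
        return min;
    }

    public static void main(String[] args) {
        String text = "Once upon a time there lived in a certain village a little country girl";
        int chunkSize = 30; // fixed-size chunking, just for illustration
        int w = 15;         // sliding window width

        List<Long> signature = new ArrayList<Long>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            String chunk = text.substring(start, Math.min(start + chunkSize, text.length()));
            if (chunk.length() >= w) {
                signature.add(minWindowFingerprint(chunk, w));
            }
        }
        System.out.println(signature);
    }
}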
The next question is how you use the fingerprints to compute the similarity between documents, which is beyond my knowledge. Could you please give us a pointer to the article you referred to in the OP? In exchange, I recommend this paper, which introduces a variant of the sliding window algorithm for computing document similarity:
Winnowing: local algorithms for document fingerprinting
Another application you can refer to is rsync, which is a data synchronisation tool with block-level (corresponding to your chunk) deduplication. See this short article for how it works.
package com.perturbation;
import java.util.ArrayList;
import java.util.List;
public class BSW {

    /**
     * @param args
     */
    public static void main(String[] args) {
        int w = 2; // fixed width of sliding window

        char[] chars = "umang shukla".toCharArray();

        List<String> fingerprints = new ArrayList<String>();

        // Slide by one character per iteration (i++), not by the window width.
        // The loop stops at chars.length so the last (shorter) windows do not
        // run past the end of the array.
        for (int i = 0; i < chars.length; i++) {
            StringBuffer sb = new StringBuffer();
            if (i + w < chars.length) {
                sb.append(chars, i, w);
            } else {
                sb.append(chars, i, chars.length - i);
            }
            System.out.println(i + ". " + sb.toString());
            fingerprints.add(hash(sb));
        }
    }

    private static String hash(StringBuffer sb) {
        // TODO: implement hash (MD5); returns the raw text for now
        return sb.toString();
    }
}
This program may help you. Please try to make it more efficient.
I have some very simple code, taken from this example, where I am using the Lin, Path and Wu-Palmer similarity measures to compute the similarity between two words. My code is as follows:
import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;
import edu.cmu.lti.ws4j.impl.Path;
import edu.cmu.lti.ws4j.impl.WuPalmer;

public class Test {

    private static ILexicalDatabase db = new NictWordNet();
    private static RelatednessCalculator lin = new Lin(db);
    private static RelatednessCalculator wup = new WuPalmer(db);
    private static RelatednessCalculator path = new Path(db);

    public static void main(String[] args) {
        String w1 = "walk";
        String w2 = "trot";
        System.out.println(lin.calcRelatednessOfWords(w1, w2));
        System.out.println(wup.calcRelatednessOfWords(w1, w2));
        System.out.println(path.calcRelatednessOfWords(w1, w2));
    }
}
And the scores are as expected EXCEPT when both words are identical. If both words are the same (e.g. w1 = "walk"; w2 = "walk";), the three measures I have should each return 1.0. But instead, they are returning 1.7976931348623157E308.
I have used ws4j before (the same version, in fact), but I have never seen this behavior. Searching online has not yielded any clues. What could possibly be going wrong here?
P.S. The fact that the Lin, Wu-Palmer and Path measures should return 1 can also be verified with the online demo provided by ws4j.
I had a similar problem, and here's what's going on. I hope that other people who run into this problem will find my response helpful.
If you have noticed, the online demo allows you to choose the word sense by specifying the word in the following format: word#pos_tag#word_sense. For example, the noun "gender" with the first word sense would be gender#n#1.
Your code snippet uses the first word sense by default. When I calculate the WuPalmer similarity between "gender" and "sex", it returns 0.26. If I use the online demo, it returns 1.0. But if we use "gender#n#1" and "sex#n#1", the online demo also returns 0.26, so there is no discrepancy: the online demo takes the max over all POS tag / word sense pairs. Here's a corresponding snippet of code that should do the trick:
ILexicalDatabase db = new NictWordNet();
WS4JConfiguration.getInstance().setMFS(true);
RelatednessCalculator rc = new Lin(db);

String word1 = "gender";
String word2 = "sex";

List<POS[]> posPairs = rc.getPOSPairs();
double maxScore = -1D;

for (POS[] posPair : posPairs) {
    List<Concept> synsets1 = (List<Concept>) db.getAllConcepts(word1, posPair[0].toString());
    List<Concept> synsets2 = (List<Concept>) db.getAllConcepts(word2, posPair[1].toString());

    for (Concept synset1 : synsets1) {
        for (Concept synset2 : synsets2) {
            Relatedness relatedness = rc.calcRelatednessOfSynset(synset1, synset2);
            double score = relatedness.getScore();
            if (score > maxScore) {
                maxScore = score;
            }
        }
    }
}

if (maxScore == -1D) {
    maxScore = 0.0;
}

System.out.println("sim('" + word1 + "', '" + word2 + "') = " + maxScore);
Also, this will give you 0.0 similarity for non-stemmed word forms, e.g. 'genders' and 'sex'. You can use the Porter stemmer included in ws4j to make sure you stem words beforehand if needed.
Hope this helps!
I had raised this issue at the googlecode site for ws4j, and it turns out that indeed it was a bug. The reply I received is as follows:
This looks like it is due to attempting to override a protected static field (this can't be done in Java). The attached patch fixes the issue by moving the definitions of the min and max fields to non-static final members in RelatednessCalculator and adding getters. Implementations then provide their min/max values through super constructor calls.
Patch can be applied with patch -p1 < 0001-Cannot-override-static-members-replacing-fields-with.patch
And here is the (now resolved) issue on their site.
Here is why -
In jcn we have...
sim(c1, c2) = 1 / distance(c1, c2)
distance(c1, c2) = ic(c1) + ic(c2) - (2 * ic(lcs(c1, c2)))
where c1, c2 are the two concepts,
ic is the information content of the concept.
lcs(c1, c2) is the least common subsumer of c1 and c2.
Now, we don't want distance to be 0 (=> similarity will become
undefined).
distance can be 0 in 2 cases...
(1) ic(c1) = ic(c2) = ic(lcs(c1, c2)) = 0
ic(lcs(c1, c2)) can be 0 if the lcs turns out to be the root
node (information content of the root node is zero). But since
c1 and c2 can never be the root node, ic(c1) and ic(c2) would be 0
only if the 2 concepts have a 0 frequency count, in which case, for
lack of data, we return a relatedness of 0 (similar to the lin case).
Note that the root node ACTUALLY has an information content of
zero. Technically, none of the other concepts can have an information
content value of zero. We assign concepts zero values, when
in reality their information content is undefined (due to zero
frequency counts). To see why look at the formula for information
content: ic(c) = -log(freq(c)/freq(ROOT)) {log(0)? log(1)?}
(2) The second case that distance turns out to be zero is when...
ic(c1) + ic(c2) = 2 * ic(lcs(c1, c2))
(which could have a more likely special case ic(c1) = ic(c2) =
ic(lcs(c1, c2)) if all three turn out to be the same concept.)
How should one handle this?
Intuitively this is the case of maximum relatedness (zero
distance). For jcn this relatedness would be infinity... But we
can't return infinity. And simply returning a 0 wouldn't work...
since here we have found a pair of concepts with maximum
relatedness, and returning a 0 would be like saying that they
aren't related at all.
1.7976931348623157E308 is the value of Double.MAX_VALUE, but the values of some similarity/relatedness algorithms (Lin, WuPalmer and Path) lie between 0 and 1, so for identical synsets the maximum value that can be returned is 1. In my version of the repo (https://github.com/DonatoMeoli/WS4J) I fixed this and other bugs.
Now, for two identical words, the values returned are:
HirstStOnge 16.0
LeacockChodorow 1.7976931348623157E308
Lesk 1.7976931348623157E308
WuPalmer 1.0
Resnik 1.7976931348623157E308
JiangConrath 1.7976931348623157E308
Lin 1.0
Path 1.0
Done in 67 msec.
Process finished with exit code 0
I have the following problem:
I have 2 Strings of DNA Sequences (consisting of ACGT), which differ in one or two
spots.
Finding the differences is trivial, so let's just ignore that
for each difference, I want to get the consensus symbol (e.g. M for A or C) that represents both possibilities
I know I could just make a huge if-cascade but I guess that's not only ugly and hard to maintain, but also slow.
What is a fast, easy to maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used Biojava, but the current version I am already using does not offer that functionality (or I haven't found it yet...).
Update: there seems to be a bit of confusion here. The consensus symbol is a single char that stands for a single char in both sequences.
String1 and String2 are, for example, "ACGT" and "ACCT"; they mismatch at position 2. So, I want the consensus string to be ACST, because S stands for "either C or G".
I want to make a method like this:
char getConsensus(char a, char b)
Update 2: some of the proposed methods work if I only have 2 sequences. I might need to do several iterations of these "consensifications", so the input alphabet could increase from "ACGT" to "ACGTRYKMSWBDHVN" which would make some of the proposed approaches quite unwieldy to write and maintain.
You can just use a HashMap<String, String> which maps the conflicts/differences to the consensus symbols. You can either "hard code" (fill in the code of your app) or fill it during the startup of your app from some outside source (a file, database etc.). Then you just use it whenever you have a difference.
String consensusSymbol = consensusMap.get(differenceString);
EDIT: To accommodate your API request ;]
Map<String, Character> consensusMap; // let's assume this is filled somewhere
...
char getConsensus(char a, char b) {
    return consensusMap.get("" + a + b);
}
I realize this looks crude but I think you get the point. This might be slightly slower than a lookup table, but it's also a lot easier to maintain.
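For illustration, here is a minimal, self-contained sketch of filling and using such a map. Only a handful of IUPAC pairs are shown (A/C = M, C/G = S); the class name and the entries chosen are just examples:
import java.util.HashMap;
import java.util.Map;

public class ConsensusMapDemo {

    private static final Map<String, Character> CONSENSUS = new HashMap<String, Character>();
    static {
        // A handful of IUPAC pairs as examples; fill in the rest the same way.
        CONSENSUS.put("AC", 'M');
        CONSENSUS.put("CA", 'M');
        CONSENSUS.put("CG", 'S');
        CONSENSUS.put("GC", 'S');
        CONSENSUS.put("AA", 'A');
        CONSENSUS.put("CC", 'C');
    }

    static char getConsensus(char a, char b) {
        return CONSENSUS.get("" + a + b);
    }

    public static void main(String[] args) {
        System.out.println(getConsensus('C', 'G')); // prints S
    }
}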
YET ANOTHER EDIT:
If you really want something super fast and you actually use the char type, you can just create a 2D table and index it with characters (since they're interpreted as numbers).
char lookup[][] = new char[256][256]; // all "english" letters will be below 256
//... fill it... e. g. lookup['A']['C'] = 'M';
char consensus = lookup['A']['C'];
A simple, fast solution is to use bitwise-OR.
At startup, initialize two tables:
A sparse 128-element table that maps a nucleotide code to its bitwise representation. 'Sparse' means you only have to set the members that you'll use: the IUPAC codes in upper and lower case.
A 16-element table to map a bitwise consensus to an IUPAC nucleotide code.
To get the consensus for a single position:
Use the nucleotides as indices in the first table, to get the bitwise representations.
Bitwise-OR the bitwise representations.
Use the bitwise-OR as an index into the 16-element table.
Here's a simple bitwise representation to get you started:
private static final int A = 1 << 3;
private static final int C = 1 << 2;
private static final int G = 1 << 1;
private static final int T = 1 << 0;
Set the members of the first table like this:
characterToBitwiseTable[ 'd' ] = A | G | T;
characterToBitwiseTable[ 'D' ] = A | G | T;
Set the members of the second table like this:
bitwiseToCharacterTable[ A | G | T ] = 'd';
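Putting the pieces together, here is a minimal sketch of the whole approach. Only a few of the IUPAC codes are filled in; the class name is mine and the remaining entries follow the same pattern:
public class BitwiseConsensus {

    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0;

    // 128-entry sparse table: nucleotide character -> bitwise representation
    private static final int[] characterToBitwiseTable = new int[128];
    // 16-entry table: bitwise representation -> IUPAC code
    private static final char[] bitwiseToCharacterTable = new char[16];

    static {
        // Only a few codes shown; add the remaining IUPAC codes the same way.
        characterToBitwiseTable['A'] = A;  characterToBitwiseTable['a'] = A;
        characterToBitwiseTable['C'] = C;  characterToBitwiseTable['c'] = C;
        characterToBitwiseTable['G'] = G;  characterToBitwiseTable['g'] = G;
        characterToBitwiseTable['T'] = T;  characterToBitwiseTable['t'] = T;
        characterToBitwiseTable['S'] = C | G;
        characterToBitwiseTable['D'] = A | G | T;

        bitwiseToCharacterTable[A] = 'A';
        bitwiseToCharacterTable[C] = 'C';
        bitwiseToCharacterTable[G] = 'G';
        bitwiseToCharacterTable[T] = 'T';
        bitwiseToCharacterTable[A | C] = 'M';
        bitwiseToCharacterTable[C | G] = 'S';
        bitwiseToCharacterTable[A | G | T] = 'D';
    }

    static char getConsensus(char a, char b) {
        return bitwiseToCharacterTable[characterToBitwiseTable[a] | characterToBitwiseTable[b]];
    }

    public static void main(String[] args) {
        System.out.println(getConsensus('C', 'G')); // S
        System.out.println(getConsensus('A', 'a')); // A
    }
}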
Given that they are all unique symbols, I'd go for an enum:
public enum ConsensusSymbol
{
    A("A"), // simple case
    // ....
    GTUC("B"),
    // etc
    // last entry:
    AGCTU("N");
    // Not sure what X means?

    private final String symbol;

    ConsensusSymbol(final String symbol)
    {
        this.symbol = symbol;
    }

    public String getSymbol()
    {
        return symbol;
    }
}
Then, when you encounter a difference, use .valueOf():
final ConsensusSymbol symbol;

try {
    symbol = ConsensusSymbol.valueOf("THESEQUENCE");
} catch (IllegalArgumentException e) { // Unknown sequence
    // TODO
}
For instance, if you encounter GTUC as a String, Enum.valueOf("GTUC") will return the GTUC enum value, and calling getSymbol() on that value will return "B".
The possible combinations number around 20, so there is no real performance issue.
If you do not wish to write a big if-else block, the fastest solution would be to build a tree data structure: http://en.wikipedia.org/wiki/Tree_data_structure.
In a tree, you put in all the possible combinations; you input the string, and it traverses the tree to find the longest matching sequence for a symbol.
Do you want an illustrated example?
PS: Artificial intelligence software commonly uses the tree approach, which is fast and well adapted to this kind of matching.
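If you want to go this route, a two-level tree of nested maps is probably the simplest form for the pairwise case. This is only a sketch with a few example pairs; the class name and entries are mine:
import java.util.HashMap;
import java.util.Map;

public class ConsensusTree {

    // Two-level tree: first character -> (second character -> consensus symbol)
    private static final Map<Character, Map<Character, Character>> TREE =
            new HashMap<Character, Map<Character, Character>>();

    private static void put(char a, char b, char consensus) {
        Map<Character, Character> branch = TREE.get(a);
        if (branch == null) {
            branch = new HashMap<Character, Character>();
            TREE.put(a, branch);
        }
        branch.put(b, consensus);
    }

    static {
        // A few example pairs; fill in the remaining combinations the same way.
        put('A', 'C', 'M');
        put('C', 'A', 'M');
        put('C', 'G', 'S');
        put('G', 'C', 'S');
    }

    static char getConsensus(char a, char b) {
        // Traverse the tree: first level by 'a', second level by 'b'.
        return TREE.get(a).get(b);
    }

    public static void main(String[] args) {
        System.out.println(getConsensus('C', 'G')); // S
    }
}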
Considering reading multiple sequences at once, I would:
put all characters from the same position in the sequences into a set;
sort and concatenate the values in the set, and use Enum.valueOf() as in fge's example;
use the acquired value as a key into an EnumMap that has the consensus symbols as its values (a rough sketch follows this list).
There are probably ways to optimize the first and second steps.
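A rough sketch of that idea, using a plain Map from the sorted column characters to the consensus symbol instead of Enum.valueOf(), just to keep the example short. The class name and the few table entries are mine; unknown columns fall back to 'N' (any base):
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class ColumnConsensus {

    // Maps the sorted, de-duplicated characters of a column to a consensus symbol.
    // Only a few entries are shown as examples.
    private static final Map<String, Character> CONSENSUS = new HashMap<String, Character>();
    static {
        CONSENSUS.put("A", 'A');
        CONSENSUS.put("C", 'C');
        CONSENSUS.put("G", 'G');
        CONSENSUS.put("T", 'T');
        CONSENSUS.put("CG", 'S');
        CONSENSUS.put("ACG", 'V');
    }

    static String consensus(String... sequences) {
        int length = sequences[0].length();
        StringBuilder result = new StringBuilder(length);
        for (int pos = 0; pos < length; pos++) {
            // Collect the characters at this position; TreeSet sorts and de-duplicates.
            TreeSet<Character> column = new TreeSet<Character>();
            for (String seq : sequences) {
                column.add(seq.charAt(pos));
            }
            StringBuilder key = new StringBuilder();
            for (char c : column) {
                key.append(c);
            }
            Character symbol = CONSENSUS.get(key.toString());
            result.append(symbol == null ? 'N' : symbol.charValue());
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(consensus("ACGT", "ACCT", "AGCT")); // prints ASST
    }
}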
A possible solution using enums, inspired by pablochan, with a little input from biostar.stackexchange.com:
enum lut {
AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), RD('D'), RH('N'), RV('V'), RN('N'),
YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), YD('N'), YH('H'), YV('N'), YN('N'),
KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), KD('D'), KH('N'), KV('N'), KN('N'),
MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), MD('N'), MH('H'), MV('V'), MN('N'),
SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), SD('N'), SH('N'), SV('V'), SN('N'),
WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), WD('D'), WH('H'), WV('N'), WN('N'),
BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), BD('N'), BH('N'), BV('N'), BN('N'),
DA('D'), DC('N'), DG('D'), DT('D'), DR('D'), DY('N'), DK('D'), DM('N'), DS('N'), DW('D'), DB('N'), DD('D'), DH('N'), DV('N'), DN('N'),
HA('H'), HC('H'), HG('N'), HT('H'), HR('N'), HY('H'), HK('N'), HM('H'), HS('N'), HW('H'), HB('N'), HD('N'), HH('H'), HV('N'), HN('N'),
VA('V'), VC('V'), VG('V'), VT('N'), VR('V'), VY('N'), VK('N'), VM('V'), VS('V'), VW('N'), VB('N'), VD('N'), VH('N'), VV('V'), VN('N'),
NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');
char consensusChar = 'X';
lut(char c) {
consensusChar = c;
}
char getConsensusChar() {
return consensusChar;
}
}
char getConsensus(char a, char b) {
return lut.valueOf("" + a + b).getConsensusChar();
}
Let's say I have two of the same string inside an ArrayList... is there a way to check for that? Also, is there a way to check how many times a string with the same exact content is in the ArrayList?
So let's say I have the following ArrayList of strings, each produced like this:
os.println(itemIDdropped + "|" + spawnX + "|" + spawnY + "|" + currentMap + "|drop|" + me.getUsername());
1. 1|3|5|1|drop|Dan
2. 2|5|7|2|drop|Luke
3. 1|3|5|2|drop|Dan
4. 3|3|5|1|drop|Sally
Here is what the numbers/letters mean for the 1-4 strings...
item ID, X pos, Y pos, Map it's on, command drop, user who dropped it
Then let's say I split it up doing this:
String[] itemGrnd = serverItems.get(i).split("\\|");
Now, let's say I have a for-loop like this one:
for (int i = 0; i < serverItems.size(); i++) {
    System.out.println(serverItems.get(i));
}
I want to find where X, Y, and Map (in this case itemGrnd[1], itemGrnd[2], and itemGrnd[3]) are the same as in ANY other String in the ArrayList, accessed via serverItems.get(i).
And if WE DO find any (in the example I provide above, X, Y, and Map are the same for strings 1 and 4), then I want to create an IF statement whose body runs only ONCE, because I don't want to do this:
(BTW, I do keep track of my variables client-side, so don't worry about that. Yes, I know they're named spawnX and spawnY; that's just the current X,Y.)
if (spawnX.equals(Integer.parseInt(itemGrnd[1])) &&
spawnY.equals(Integer.parseInt(itemGrnd[2])) &&
currentMap.equals(Integer.parseInt(itemGrnd[3]))) {
}
Now, if I were to do THIS, both 1 and 4 would be processed through here. I only want to do it ONCE, at ALL times, whenever we find multiple matching strings (like 1 and 4).
Thanks
Well, it's pretty simple to find out how many of a given String an ArrayList has:
import java.util.ArrayList;

public class ArrayListExample {

    private static final String TO_FIND = "cdt";

    public static void main(String[] args) {
        ArrayList<String> al = new ArrayList<String>();
        al.add("abc");
        al.add("dde");
        // 4 times
        al.add(TO_FIND);
        al.add(TO_FIND);
        al.add(TO_FIND);
        al.add(TO_FIND);

        ArrayList<String> al1 = (ArrayList<String>) al.clone();
        int count = 0;
        while (al1.contains(TO_FIND)) {
            al1.remove(TO_FIND);
            count++;
        }
        System.out.println(count);
    }
}
A faster way would be to sort the ArrayList and then count how many equal elements end up next to each other. Denis_k's solution is O(n^2), while with sorting it is O(n log(n)).
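For illustration, here is a minimal sketch of the sorting approach; after sorting, equal strings are adjacent, so one pass counts each distinct value (the class name and sample data are mine):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedCount {
    public static void main(String[] args) {
        List<String> al = new ArrayList<String>();
        Collections.addAll(al, "abc", "cdt", "dde", "cdt", "cdt");

        List<String> sorted = new ArrayList<String>(al);
        Collections.sort(sorted); // O(n log n)

        // Equal strings are now adjacent, so one pass counts each distinct value.
        int i = 0;
        while (i < sorted.size()) {
            int j = i;
            while (j < sorted.size() && sorted.get(j).equals(sorted.get(i))) {
                j++;
            }
            System.out.println(sorted.get(i) + " occurs " + (j - i) + " time(s)");
            i = j;
        }
    }
}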
If you want that kind of behavior, you might want to check the Apache Commons / Collections project, which extends the default Java collections with some useful interfaces, including the Bag interface (JavaDoc), which seems to be what you are looking for: a collection that knows how many elements of one kind it contains. The downside: there is no support for generics yet, so you are working with raw collections.
Or Google Guava, which is a more modern library and has a similar concept called Multiset, which is probably more user-friendly. Guava, however, does not yet have a release version, although it is already widely used in production.
Oh, and there's also a standard Java SE function: Collections.frequency(Collection, Object), although it supposedly performs badly.
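For completeness, that call looks like this; it scans the list once per call, which is why it can be slow if you query many different elements (the class name and sample data are mine):
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FrequencyDemo {
    public static void main(String[] args) {
        List<String> al = Arrays.asList("abc", "cdt", "dde", "cdt", "cdt");
        // Counts how many elements equal "cdt"; linear scan of the list.
        System.out.println(Collections.frequency(al, "cdt")); // prints 3
    }
}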