I fear this won't be an easy question. I've been thinking about a proper solution for this problem for a long time and hope that a fresh bunch of brains have a better view on the problem - let's get to it:
Data:
What we're working with here is a csv file containing multiple columns, the relevant ones for this problem are:
User ID (Integer, ranging from 3 to 8 digits, multiple entries with the same UserID exist) LIST IS SORTED BY THIS
Query (String)
Epoc (Long, epoc time value)
clickurl (String)
Every entry in the data we're working with here has non-null values for these attributes.
Example Data:
SID,UID,query,rawdate,timestamp,timegap,epoc,lengthwords,lengthchars,rank,clickurl
5,142,westchester.gov,2006-03-20 03:55:57,Mon Mar 20 03:55:57 CET 2006,0,1142823357504,1,15,1,http://www.westchestergov.com
10,142,207 ad2d 530,2006-04-08 01:31:14,Sat Apr 08 01:31:14 CEST 2006,10000,1144452674507,3,12,1,http://www.courts.state.ny.us
11,142,vera.org,2006-04-08 08:38:42,Sat Apr 08 08:38:42 CEST 2006,11000,1144478322507,1,8,1,http://www.vera.org
Note: there are multiple entries that have the same value for 'Epoc', this is due to the tools used to gather the data
Note2: the list has a size of ~700000, just fyi
Goal: Match pairs of entries that have the same query
Scope: entries that share the same UserID
Due to the mentioned anomaly in the data gathering process, the following has to be considered:
If two entries share the same value for 'Query' and for 'Epoc' , the following elements in the list have to be checked for these criteria until the next entry has a different value for one of these attributes. The group of entries that shared the same Query and Epoc values are to be considered as -one- entry, so in order to match a pair, another entry has to be found that matches the 'Query' value. For lack of a better name, let's call a group that shares the same Query and Epoc value a 'chain'
Now that this is out, it gets a bit easier, there are 3 types of pair compositions we can get out of this:
Entry & Entry
Entry & Chain
Chain & Chain
Type 1 here just means two entries in the list that share the same value for 'Query', but not for 'Epoc'.
So this sums up the Equal Query Pairs
There's also the case of Different Query Pairs which can be described as the following:
After we have matched the equal query pairs, there's the possibility that there are entries which have not been paired with other entries because their query didn't match - every entry that has not been matched to another entry because of this is part of the set called 'different queries'
The members of this set have to be paired without following any criteria, but chains are still treated as -one- entry of the pair.
As for matching the pairs in general, there may be no redundant pairs - a single entry can be part of n many pairs, but two individual entries can only form one pair.
EXAMPLE:
The following entries are to be paired
UID,Query,Epoc,Clickurl
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,raspberry pi,1141394164710,http://www.raspberrypi.org/
772,stackoverflow,1141394274810,http://en.wikipedia.org/wiki/Buffer_overflow
772,stackoverflow,1141394274850,http://www.stackoverflow.com
772,tall women,1141394275921,http://www.tallwomen.org/
772,raspberry pi,1141394277991,http://www.raspberrypi.org/
772,Donuts,114139427999,http://de.wikipedia.org/wiki/Donut
772,stackoverflow,1141394279999,http://www.stackoverflow.com
772,something,1141399299991,http:/something.else/something/
In this example, Donuts is a chain, therefore the pairs are (line numbers without the header):
Equal Query Pairs:(1-3,9) (4,8) (5,6) (5,10) (6,10)
Different Query Pairs: (7,11)
My -failed- approach to the problem:
The algorithm I developed to solve this works as follows:
Iterate the list of entries until the value for UserID changes.
Then, applied to a separate list that only contains the just iterated elements that share the same UserID:
for (int i = 0; i < list.size(); i++) {
Entry tempI = list.get(i);
Boolean iMatched = false;
//boolean to save whether or not c1 is set
Boolean c1done = false;
Boolean c2done = false;
//Hashsets holding the clickurl values of the entries that form a pair
HashSet<String> c1 = null;
HashSet<String> c2 = null;
for (int j = i + 1; j < list.size(); j++) {
Entry tempJ = list.get(j);
// Queries match
if (tempI.getQuery().equals(tempJ.getQuery())) {
// whether or not the Entry at position i has been matched
if (!iMatched) {
iMatched = true;
}
HashSet<String> e1 = new HashSet<String>();
HashSet<String> e2 = new HashSet<String>();
int k = 0;
// Times match
HashSet<String> chainset = new HashSet<String>();
if (tempI.getEpoc() == tempJ.getEpoc()) {
chainset.add(tempI.getClickurl());
chainset.add(tempJ.getClickurl());
} else {
e1.add(tempI.getClickurl());
if (c1 == null) {
c1 = e1;
c1done = true;
} else {
if (c2 == null) {
c2 = e1;
c2done = true;
}
}
}
//check how far the chain goes and get their entries
if ((j + 1) < list.size()) {
Entry tempjj = list.get(j + 1);
if (tempjj.getEpoc() == tempJ.getEpoc()) {
k = j + 1;
//search for the end of the chain
while ((k < list.size())
&& (tempJ.getQuery().equals(list.get(k)
.getQuery()))
&& (tempJ.getEpoc() == list.get(k).getEpoc())) {
chainset.add(tempJ.getClickurl());
chainset.add(list.get(k).getClickurl());
k++;
}
j = k + 1; //continue the iteration at the end of the chain
if (c1 == null) {
c1 = chainset;
c1done = true;
} else {
if (c2 == null) {
c2 = chainset;
c2done = true;
}
}
// Times don't match
}
} else {
e2.add(tempJ.getClickurl());
if (c1 == null) {
c1 = e2;
c1done = true;
} else {
if (c2 == null) {
c2 = e2;
c2done = true;
}
}
}
/** Block that compares the clicks in the Hashsets and computes the resulting data
* left out for now to not make this any more complicated than it already is
**/
// Queries don't match
} else {
if (!dq.contains(tempJ)) { //note: dq is an ArrayList holding the entries of the different query set
dq.add(tempJ);
}
}
if (j == list.size() - 1) {
if (!iMatched) {
dq.add(tempI);
}
}
}
if (dq.size() >= 2) {
for (int z = 0; z < dq.size() - 1; z++) {
if (dq.get(z + 1) != null) {
/** Filler, iterate dq just like the normal list with two loops
*
**/
}
}
}
}
So, using an excessive number of loops, I try to match the pairs, resulting in a horribly long runtime whose end I have not seen up to this point.
Okay I hope I didn't forget anything crucial, I'll add further needed information later
If you've made it this far, thanks for reading - hopefully you have an idea that might help me
Use SQL to import the data into a db and then perform the queries. Your txt file is too large; it's no wonder that it takes so long to go through it. :)
First, remove all but one entry from each chain. To do this, sort by (userid, query, epoch) and remove duplicates.
Then scan the sorted list and take all entries for each (userid, query) pair. If there is only one, save it in a list for later processing; otherwise emit all pairs.
For all the entries of a given user that you have saved for later processing (these form the 'different query' pairs), emit pairs.
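The steps above can be sketched roughly as follows for the rows of a single user. This is only a sketch under assumptions: the class and method names (`QueryPairing`, `Group`, `pairForUser`) are made up for illustration, each row is assumed to be a `{query, epoc, clickurl}` string triple, and grouping through maps is used in place of the sort that the answer describes. Rows with the same query and epoc are collapsed into one group even if they are not adjacent, which is an assumption about how chains should be treated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class QueryPairing {
    /** One logical entry: a single row, or a chain (same query and epoc) collapsed into one. */
    static final class Group {
        final String query;
        final long epoc;
        final Set<String> clickUrls = new LinkedHashSet<>();
        Group(String query, long epoc) { this.query = query; this.epoc = epoc; }
    }

    /** Holds the two kinds of pairs described in the question. */
    static final class Pairs {
        final List<Group[]> equalQuery = new ArrayList<>();
        final List<Group[]> differentQuery = new ArrayList<>();
    }

    /** rows: {query, epoc, clickurl} triples belonging to ONE user (assumed input shape). */
    static Pairs pairForUser(List<String[]> rows) {
        // query -> (epoc -> collapsed group); chains end up in a single Group
        Map<String, Map<Long, Group>> byQuery = new LinkedHashMap<>();
        for (String[] r : rows) {
            long epoc = Long.parseLong(r[1]);
            byQuery.computeIfAbsent(r[0], q -> new TreeMap<>())
                   .computeIfAbsent(epoc, e -> new Group(r[0], e))
                   .clickUrls.add(r[2]);
        }
        Pairs result = new Pairs();
        List<Group> unmatched = new ArrayList<>();
        for (Map<Long, Group> sameQuery : byQuery.values()) {
            List<Group> groups = new ArrayList<>(sameQuery.values());
            if (groups.size() == 1) {
                unmatched.add(groups.get(0));            // no partner with the same query
            } else {
                addAllPairs(groups, result.equalQuery);  // every pair within the query
            }
        }
        addAllPairs(unmatched, result.differentQuery);   // pair the leftovers with each other
        return result;
    }

    private static void addAllPairs(List<Group> groups, List<Group[]> out) {
        for (int i = 0; i < groups.size(); i++) {
            for (int j = i + 1; j < groups.size(); j++) {
                out.add(new Group[] { groups.get(i), groups.get(j) });
            }
        }
    }
}
```

For the eleven example rows of user 772 from the question, this produces five equal-query pairs and one different-query pair. The grouping pass is linear in the number of rows; only the pair enumeration itself is quadratic, and only within each group.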
Given the following datatype Testcase (XQuery, Testpath, FirstInputFile, SecondInputFile, Expected), how can I properly delete duplicates?
Definition of duplicates:
An entry is a duplicate if its FirstInputFile already appears in the list as the SecondInputFile of another entry, and vice versa.
Here is the Testdata
tcs.add(new HeaderAndBodyTestcase("XQ 1", "/1", "FAIL", "FAIL2", "FAILED"));
tcs.add(new HeaderAndBodyTestcase("XQ 1", "/1", "FAIL2", "FAIL", "FAILED"));
tcs.add(new HeaderAndBodyTestcase("XQ 2", "/2", "FAIL4", "FAIL3", "FAILED2"));
tcs.add(new HeaderAndBodyTestcase("XQ 2", "/2", "FAIL3", "FAIL4", "FAILED2"));
and here is the function
protected void deleteExistingDuplicatesInArrayList(final ArrayList<HeaderAndBodyTestcase> list) {
for (int idx = 0; idx < list.size() - 1; idx++) {
if (list.get(idx).firstInputFile.equals(list.get(idx).secondInputFile)
|| (list.get(idx + 1).firstInputFile.equals(list.get(idx).firstInputFile)
&& list.get(idx).secondInputFile.equals(list.get(idx + 1).secondInputFile)
|| (list.get(idx).firstInputFile.equals(list.get(idx + 1).secondInputFile)
&& list.get(idx).secondInputFile.equals(list.get(idx + 1).firstInputFile)))) {
list.remove(idx);
}
}
}
This solution is already working, but seems very crappy, so is there a better solution to this?
Put everything in a Set, using a comparator if necessary, and create a list from this set if you really need a List (and not a Collection):
Set<HeaderAndBodyTestcase> set = new HashSet<>(list);
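A plain HashSet only removes duplicates according to the element's own equals and hashCode, so for the "swapped files" definition one option is to keep a set of normalized keys instead. The following is a minimal sketch under assumptions: `Testcase` is a hypothetical stand-in for HeaderAndBodyTestcase with only the two relevant fields, and two entries are treated as duplicates when they name the same two files in either order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SymmetricDedup {
    /** Hypothetical stand-in for HeaderAndBodyTestcase; only the two file fields matter here. */
    static final class Testcase {
        final String firstInputFile;
        final String secondInputFile;
        Testcase(String firstInputFile, String secondInputFile) {
            this.firstInputFile = firstInputFile;
            this.secondInputFile = secondInputFile;
        }
    }

    /** Returns a new list that keeps the first occurrence of each unordered file pair. */
    static List<Testcase> dedupe(List<Testcase> input) {
        Set<List<String>> seen = new HashSet<>();
        List<Testcase> result = new ArrayList<>();
        for (Testcase tc : input) {
            String a = tc.firstInputFile;
            String b = tc.secondInputFile;
            // Order the two names so that (a, b) and (b, a) map to the same key.
            List<String> key = a.compareTo(b) <= 0 ? Arrays.asList(a, b) : Arrays.asList(b, a);
            if (seen.add(key)) {   // add() returns false if the key was already present
                result.add(tc);
            }
        }
        return result;
    }
}
```

With the four test cases from the question this keeps the first and the third entry. Because the input list is never modified, there is no removal during iteration to worry about.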
Given your rather peculiar "equality" constraints, I think the best way would be to maintain two sets of already seen first- and second input files and a loop:
Set<String> first = new HashSet<>();
Set<String> second = new HashSet<>();
for (HeaderAndBodyTestcase tc : tcs) {
if (! first.contains(tc.getSecondInputFile()) &&
! second.contains(tc.getFirstInputFile())) {
first.add(tc.getFirstInputFile());
second.add(tc.getSecondInputFile());
System.out.println(tc); // or add to result list
}
}
This will also work if "equal" elements do not appear right after each other in the original list.
Also note that removing elements from a list while iterating the same list, while working sometimes, will often yield unexpected results. Better create a new, filtered list, or if you have to remove, create an Iterator from that list and use its remove method.
On closer inspection (yes, it took me that long to understand your code), the conditions in your current working code are in fact much different than what I understood from your question, namely:
remove element if first and second is the same (actually never checked for the last element in the list)
remove element if first is the same as first on last, and second the same as second on last
remove if first is same as last second and vice versa
only consider consecutive elements (from comments)
Given those constraints, the sets are not needed and also would not work properly considering that both the elements have to match (either 'straight' or 'crossed'). Instead you can use pretty much your code as-is, but I would still use an Iterator and keep track of the last element, and also split the different checks to make the whole code much easier to understand.
HeaderAndBodyTestcase last = null;
for (Iterator<HeaderAndBodyTestcase> iter = list.iterator(); iter.hasNext();) {
    HeaderAndBodyTestcase curr = iter.next();
    // first and second input file of the same element are identical
    boolean selfEqual = curr.firstInputFile.equals(curr.secondInputFile);
    boolean bothEqual = false;
    boolean crossedEqual = false;
    if (last != null) {
        bothEqual = curr.firstInputFile.equals(last.firstInputFile)
                && curr.secondInputFile.equals(last.secondInputFile);
        crossedEqual = curr.secondInputFile.equals(last.firstInputFile)
                && curr.firstInputFile.equals(last.secondInputFile);
    }
    // remove at most once per element; a second remove() without next() throws IllegalStateException
    if (selfEqual || bothEqual || crossedEqual) {
        iter.remove();
    }
    last = curr;
}
Right now I have an array of "Dragon"s. Each item has two values. An ID and a Count. So my array would look something like this:
Dragon[] dragons = { new Dragon(2, 4),
new Dragon(83, 199),
new Dragon(492, 239),
new Dragon(2, 93),
new Dragon(24, 5)
};
As you can see, I have two Dragons with the ID of 2 in the array. What I would like to accomplish is, when a duplicate is found, just add the count of the duplicate to the count of the first one, and then remove the duplicate Dragon.
I've done this sort of successfully, but I would end up with a null in the middle of the array, and I don't know how to remove the null and then shuffle them.
This is what I have so far but it really doesn't work properly:
public static void dupeCheck(Dragon[] dragons) {
int end = dragons.length;
for (int i = 0; i < end; i++) {
for (int j = i + 1; j < end; j++) {
if (dragons[i] != null && dragons[j] != null) {
if (dragons[i].getId() == dragons[j].getId()) {
dragons[i] = new Dragon(dragons[i].getId(), dragons[i].getCount() + dragons[j].getCount());
dragons[j] = null;
end--;
j--;
}
}
}
}
}
You should most probably not maintain the dragon count for each dragon in the dragon class itself.
That aside, even if you are forced to use an array, you should create an intermediate map to store your dragons.
Map<Integer, Dragon> idToDragon = new HashMap<>();
for (Dragon d : yourArray) {
// fetch existing dragon with that id or create one if none present
Dragon t = idToDragon.computeIfAbsent(d.getId(), i -> new Dragon(i, 0));
// add counts
t.setCount(t.getCount() + d.getCount());
// store in map
idToDragon.put(d.getId(), t);
}
Now the map contains a mapping between the dragons' ids and the dragons, with the correct counts.
To create an array out of this map, you can just
Dragon[] newArray = idToDragon.values().toArray(new Dragon[idToDragon.size()]);
You may be forced to store the result in an array, but that doesn't mean that you're forced to always use an array.
One solution could be to use the Stream API: group the items by ID, sum their counts, and save the result into an array again. Converting a List<T> into a T[] is quite straightforward.
The size of an array cannot be changed after it's created.
So you need to return either a new array or list containing the merged dragons.
import java.util.Arrays;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summingInt;

public static Dragon[] merge(Dragon[] dragonArr) {
    return Arrays.stream(dragonArr)
            // 1. obtain a map of dragon IDs and their combined counts
            .collect(groupingBy(Dragon::getId, summingInt(Dragon::getCount)))
            // 2. transform the map entries to dragons
            .entrySet().stream().map(entry -> new Dragon(entry.getKey(), entry.getValue()))
            // 3. collect the result as an array
            .toArray(Dragon[]::new);
}
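If the merged array should keep the order in which IDs first appear (a grouping into a plain HashMap gives no such guarantee), a loop with a LinkedHashMap and Map.merge is one way to get it. This is a sketch, not the original poster's class: the nested `Dragon` below is a hypothetical minimal version with just an id, a count and their getters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DragonMerge {
    /** Hypothetical minimal Dragon; the real class from the question is not shown there. */
    static final class Dragon {
        private final int id;
        private final int count;
        Dragon(int id, int count) { this.id = id; this.count = count; }
        int getId() { return id; }
        int getCount() { return count; }
    }

    /** Returns a new array with one Dragon per ID, counts summed, in first-seen order. */
    static Dragon[] merge(Dragon[] dragons) {
        Map<Integer, Integer> countById = new LinkedHashMap<>();
        for (Dragon d : dragons) {
            // insert the count, or add it to the count already stored for this ID
            countById.merge(d.getId(), d.getCount(), Integer::sum);
        }
        Dragon[] merged = new Dragon[countById.size()];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : countById.entrySet()) {
            merged[i++] = new Dragon(e.getKey(), e.getValue());
        }
        return merged;
    }
}
```

For the five dragons in the question this returns four dragons, with the two ID 2 entries merged into one dragon with count 97.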
I have a map in which values have references to lists of objects.
//key1.getElements() - produces the following
[Element N330955311 ({}), Element N330955300 ({}), Element N3638066598 ({})]
I would like to search the list of every key and find the occurrence of a given element (>= 2).
Currently my approach is very slow. I have a lot of data, and I know execution time is relative, but it takes about 40 seconds.
My approach..
public String occurance>=2 (String id)
//Search for id
//Outer loop through Map
//get first map value and return elements
//inner loop iterating through key.getElements()
//if match with id..then iterate count
//return Strings with count == 2 else return null
The reason this is so slow is that I am searching for a lot of ids (~8000) and I have ~3000 keys in my map. So it's > 8000*3000*8000 (given that every id/element exists in the key/valueSet map at least once)
Please help me with a more efficient way to make this search. I'm not too deep into practicing Java, so perhaps there's something obvious I'm missing.
Edited in real code after request:
public void findAdjacents() {
for (int i = 0; i < nodeList.size(); i++) {
count = 0;
inter = null;
container = findIntersections(nodeList.get(i));
if (container != null) {
intersections.add(container);
}
}
}
public String findIntersections(String id) {
Set<Map.Entry<String, Element>> entrySet = wayList.entrySet();
for (Map.Entry entry : entrySet) {
w1 = (Way) wayList.get(entry.getKey());
for (Node n : w1.getNodes()) {
container2 = String.valueOf(n);
if (container2.contains(id)) {
count++;
}
if (count == 2) {
inter = id;
count = 0;
}
}
}
if (inter != (null))
return inter;
else
return null;
}
Based on the pseudocode provided by you, there is no need to iterate all the keys in the Map. You can directly do a get(id) on the map. If the Map has it, you will get the list of elements on which you can iterate and get the element if its count is > 2. If the id is not there then null will be returned. So in that case you can optimize your code a bit.
Thanks
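If the map is keyed by way rather than by node id, a direct lookup is not available, but the repeated scans can still be avoided by counting every node once up front. The sketch below assumes a simplified shape that is not in the original code: ways are a `Map<String, List<String>>` from way id to node ids, and ids are compared exactly rather than with `contains`. The class name `IntersectionFinder` is made up for illustration.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class IntersectionFinder {
    /**
     * ways: way id -> node ids along that way (assumed, simplified shape).
     * Returns the requested node ids that occur at least twice across all ways.
     */
    static Set<String> findIntersections(Map<String, List<String>> ways,
                                         Collection<String> nodeIds) {
        // One pass over all ways: count how often each node id occurs.
        Map<String, Integer> occurrences = new HashMap<>();
        for (List<String> nodes : ways.values()) {
            for (String nodeId : nodes) {
                occurrences.merge(nodeId, 1, Integer::sum);
            }
        }
        // One pass over the requested ids: a constant-time lookup per id.
        Set<String> intersections = new LinkedHashSet<>();
        for (String id : nodeIds) {
            if (occurrences.getOrDefault(id, 0) >= 2) {
                intersections.add(id);
            }
        }
        return intersections;
    }
}
```

The work is then proportional to the total number of nodes in all ways plus the number of ids searched for, instead of rescanning every way for every id.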
I have the following two maps:
Map<String,List<String>> sourceTags = sourceList.get(block);
Map<String,List<String>> targetTags = targetList.get(block);
I want to compare the list of values in sourceTags with list of values in targetTags corresponding to the key.
Now, the values in a map entry will be in the following manner :
SourceTag = [20C=[:ABC//0000000519983150], 22F=[:CAMV//MAND, :CAMV//MANDA], 98A=[:XDTE//20160718, :MEET//20160602, :RDTE//20160719]]
TargetTag = [20C=[:ABC//0000000519983150], 22F=[:CAMV//MAND], 98A=[:MEET//20160602, :RDTE//20160719]]
I want the output as below :
For key 22F, compare the lists of values for the sub-key CAMV: if the sub-key exists on both sides, report any difference in values; if the sub-key does not exist on one side, report that as well.
Likewise for key 98A with sub-keys XDTE, MEET, RDTE: if a sub-key exists on both sides and its values differ between source and target, report the difference; if a sub-key is missing, report it as not found in source or target. The same applies to the values.
if(sub-key found){
//compare their values
}else{
//report as sub-key not found
}
I have written the following program :
EDITED the Program
Set<String> tags = sourceTags.keySet();
for(String targetTag : tags){
if(targetTags.containsKey(targetTag)){
List<String> sourceValue = sourceTags.get(targetTag);
List<String> targetValue = targetTags.get(targetTag);
for(String sValue : sourceValue){
for(String tValue : targetValue){
if(sValue.length() > 4 && tValue.length() > 4){
//get keys for both source and target
String sKey = sValue.substring(1, 5);
String tKey = tValue.substring(1,5);
//get values for both source and target
String sTagValue= sValue.substring(sValue.lastIndexOf(sKey), sValue.length());
String tTagValue = tValue.substring(tValue.lastIndexOf(tKey),tValue.length());
if(sKey.equals(tKey)){
if(!sTagValue.equals(tTagValue)){
values = createMessageRow(corpValue, block ,targetTag, sTagValue,tTagValue);
result.add(values);
}
}
}
}
}
}else{
System.out.println(sourceTags.get(targetTag).get(0));
values = createMessageRow(corpValue,block,targetTag,sourceTags.get(targetTag).get(0),"","Tag: "+targetTag+" not availlable in target");
result.add(values);
}
}
After executing, the comparison report shows wrong values.
Please help!!
Actually, your code has a major logical flaw. When you compare the List contained in the two Maps accessed with the same key, you do this:
for(int index = 0; index < Math.max(sourceValue.size(), targetValue.size()); index ++ ){
if(index<sourceValue.size() && index<targetValue.size()){
//Do your comparisons...
}
}
That means that you proceed along the two lists with the same index and then you compare the two items. You never compare an item of the first list with an item of the second list that doesn't have the same index.
I'll give you an example: having two lists
LIST_A = (A, B, C)
LIST_B = (C, B, A)
these are the comparisons you're making:
A == C
B == B
C == A
It's obvious then that even if the two lists contain the same elements, the only correspondence you'll find is B == B.
You need to compare every item of the first list with ALL the items of the second one, to get all the matching pairs. Something like (without optimizations and elegance for clarity's sake):
for(String sValue : sourceValue){
for(String tValue : targetValue){
if(sValue.length() > 4 && tValue.length() > 4){
String sKey = sValue.substring(1,5);
String tKey = tValue.substring(1,5);
if(sKey.equals(tKey)){
//Do your logic...
}
}
}
}
This way, you don't even need to proceed in the other list when the index reaches the end of the first one like you do now...
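An alternative to the nested loops is to group each list by sub-key first and then compare the groups, which also covers sub-keys that are missing on one side. This is a sketch under assumptions: values are taken to have the form `:SUBKEY//VALUE` as in the question's sample data, the names `TagDiff`, `bySubKey` and `diff` are made up, and the report is returned as plain strings rather than through the question's createMessageRow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TagDiff {
    /** Groups values such as ":CAMV//MAND" by their sub-key ("CAMV"). */
    static Map<String, Set<String>> bySubKey(List<String> values) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String v : values) {
            int sep = v.indexOf("//");
            if (!v.startsWith(":") || sep < 0) {
                continue; // skip values that do not follow the assumed format
            }
            grouped.computeIfAbsent(v.substring(1, sep), k -> new TreeSet<>())
                   .add(v.substring(sep + 2));
        }
        return grouped;
    }

    /** Compares the values of one tag and returns one report line per differing sub-key. */
    static List<String> diff(String tag, List<String> source, List<String> target) {
        Map<String, Set<String>> s = bySubKey(source);
        Map<String, Set<String>> t = bySubKey(target);
        Set<String> subKeys = new TreeSet<>(s.keySet());
        subKeys.addAll(t.keySet());
        List<String> report = new ArrayList<>();
        for (String k : subKeys) {
            if (!t.containsKey(k)) {
                report.add(tag + "/" + k + ": not found in target");
            } else if (!s.containsKey(k)) {
                report.add(tag + "/" + k + ": not found in source");
            } else if (!s.get(k).equals(t.get(k))) {
                report.add(tag + "/" + k + ": source=" + s.get(k) + " target=" + t.get(k));
            }
        }
        return report;
    }
}
```

For the sample data, tag 22F yields one line reporting that CAMV differs, tag 98A yields one line reporting that XDTE is not found in the target, and tag 20C yields nothing.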
I have been given an assignment to upgrade an existing program.
Figure out how to recode the qualifying exam problem using a Map for each terminal line, on the
assumption that the size of the problem is dominated by the number of input lines, not the 500
terminal lines
The program takes in a text file that has number, name. The number is the PC number and the name is the user who logged on. The program returns the user for each pc that logged on the most. Here is the existing code
public class LineUsageData {
SinglyLinkedList<Usage> singly = new SinglyLinkedList<Usage>();
//function to add a user to the linked list or to increment count by 1
public void addObservation(Usage usage){
for(int i = 0; i < singly.size(); ++i){
if(usage.getName().equals(singly.get(i).getName())){
singly.get(i).incrementCount(1);
return;
}
}
singly.add(usage);
}
//returns the user with the most connections to the PC
public String getMaxUsage(){
int tempHigh = 0;
int high = 0;
String userAndCount = "";
for(int i = 0; i < singly.size(); ++i){//goes through list and keeps highest
tempHigh = singly.get(i).getCount();
if(tempHigh > high){
high = tempHigh;
userAndCount = singly.get(i).getName() + " " + singly.get(i).getCount();
}
}
return userAndCount;
}
}
I am having trouble on the theoretical side. We can use a hashmap or a treemap. I am trying to think through how I would form a map that would hold the list of users for each pc? I can reuse the Usage object which will hold the name and the count of the user. I am not supposed to alter that object though
When checking if Usage is present in the list you perform a linear search each time (O(N)). If you replace your list with the Map<String,Usage>, you'll be able to search for name in sublinear time. TreeMap has O(log N) time for search and update, HashMap has amortized O(1)(constant) time.
So, the most effective data structure in this case is HashMap.
import java.util.*;
public class LineUsageData {
Map<String, Usage> map = new HashMap<String, Usage>();
//function to add a user to the map or to increment count by 1
public void addObservation(Usage usage) {
Usage existentUsage = map.get(usage.getName());
if (existentUsage == null) {
map.put(usage.getName(), usage);
} else {
existentUsage.incrementCount(1);
}
}
//returns the user with the most connections to the PC
public String getMaxUsage() {
Usage maxUsage = null;
for (Usage usage : map.values()) {
if (maxUsage == null || usage.getCount() > maxUsage.getCount()) {
maxUsage = usage;
}
}
return maxUsage == null ? null : maxUsage.getName() + " " + maxUsage.getCount();
}
// alternative version that uses Collections.max
public String getMaxUsageAlt() {
Usage maxUsage = map.isEmpty() ? null :
Collections.max(map.values(), new Comparator<Usage>() {
@Override
public int compare(Usage o1, Usage o2) {
return o1.getCount() - o2.getCount();
}
});
return maxUsage == null ? null : maxUsage.getName() + " " + maxUsage.getCount();
}
}
A Map can also be iterated in time proportional to its size, so you can use the same procedure to find the maximum element in it. I gave you two options: either the manual approach, or the Collections.max utility method.
In simple words: You use a LinkedList (singly or doubly) when you have a list of items, and you usually plan to traverse them,
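If reusing the Usage object is not a hard requirement, the same idea can be written with the count stored directly as the map value. This is only a sketch of that variant: the class name `LineUsageCounts` is made up, and it takes the user name as a plain String instead of a Usage.

```java
import java.util.HashMap;
import java.util.Map;

class LineUsageCounts {
    // user name -> number of logins seen on this terminal line
    private final Map<String, Integer> countByUser = new HashMap<>();

    /** Records one login of the given user. */
    void addObservation(String userName) {
        // insert 1 for a new user, otherwise add 1 to the stored count
        countByUser.merge(userName, 1, Integer::sum);
    }

    /** Returns "name count" for the most frequent user, or null if nothing was recorded. */
    String getMaxUsage() {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : countByUser.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best == null ? null : best.getKey() + " " + best.getValue();
    }
}
```

One such object per terminal line keeps each update at amortized constant time, which fits the assignment's premise that the number of input lines dominates.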
and a Map implementation when you have "Dictionary-like" entries, where a key corresponds to a value and you plan to access the value using the key.
In order to convert your SinglyLinkedList to a HashMap or TreeMap, you need to find out which property of your item will be used as your key (it must be an element with unique values).
Assuming you are using the name property from your Usage class, you can do this
(a simple example):
//You could also use TreeMap, depending on your needs.
Map<String, Usage> usageMap = new HashMap<String, Usage>();
//Iterate through your SinglyLinkedList.
for(Usage usage : singly) {
//Add all items to the Map
usageMap.put(usage.getName(), usage);
}
//Access a value using its name as the key of the Map.
Usage accessedUsage = usageMap.get("AUsageName");
Also note that:
Map<String, Usage> usageMap = new HashMap<>();
Is valid, due to diamond inference.
I solved this offline and didn't get a chance to see the answers, which both look very helpful. Sorry about that, Nick and Aivean, and thanks for the responses. Here is the code I ended up writing to get this to work.
public class LineUsageData {
Map<Integer, Usage> map = new HashMap<Integer, Usage>();
int hash = 0;
public void addObservation(Usage usage){
hash = usage.getName().hashCode();
System.out.println(hash);
while((map.get(hash)) != null){
if(map.get(hash).getName().equals(usage.name)){
map.get(hash).count++;
return;
}else{
hash++;
}
}
map.put(hash, usage);
}
public String getMaxUsage(){
String str = "";
int tempHigh = 0;
int high = 0;
//for loop
for(Integer key : map.keySet()){
tempHigh = map.get(key).getCount();
if(tempHigh > high){
high = tempHigh;
str = map.get(key).getName() + " " + map.get(key).getCount();
}
}
return str;
}
}