Application for generating frequent item pairs (2-itemsets) - Java

I am writing an application that computes all frequent itemsets of size 2 from a set of transactions. The input is a data file (a space-delimited text file with the items encoded as integers) and a percentage given as an integer (e.g. an input of 2 represents 2%). The application writes to a separate file each pair of numbers that appears together in the same transaction (a transaction is one line of the file) in more than 2% of all transactions (where 2% is the percentage given in the input). Each line of the output file contains one pair of items together with its support (the number of transactions in which the pair appears). The application also outputs (on the screen or in a file) the duration (the time needed to execute the task).
The data file looks like this:
55 22 33 123 231 414
21 43 432 435 231 4324 534
22 21 33 123 231 534 666 222
...
Each line is called a transaction, and the input file contains thousands of transactions.
I am thinking of applying the usual data mining rule: first find all single numbers that appear in more than 2% of the transactions, then form pairs from those numbers within each transaction, and finally count each pair and generate the output file.
If anyone has ideas or code for this (preferably in Java), that would be very helpful. Thanks a lot.

Here's one way to count the integers.
import java.util.HashMap;
import java.util.Map;

public class IntCount {
    public static void main(String[] args) {
        count("123 234 456 678 789 234 234 123");
    }

    public static void count(String transactionLine) {
        // Split on runs of whitespace so repeated spaces do not create empty tokens
        String[] parts = transactionLine.trim().split("\\s+");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        // Count how often each token occurs
        for (String s : parts) {
            Integer current = counts.get(s);
            counts.put(s, current == null ? 1 : current + 1);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println("s: " + e.getKey() + " count: " + e.getValue());
        }
    }
}
Now you can start working through determining the 2% part.

Process the transactions one at a time. For each transaction, form every pair of numbers in it. Put each pair in a HashMap with the pair as the key (for example a String such as "22 33", smaller number first) and a value of 1. If there is already an entry for that pair, increment its value.
Once you have processed all transactions, go through the HashMap and look for values greater than 2% of the total number of transactions. These are your winners.
They can be output directly to a file, or stored in another data structure for sorting first.

What you want to do is basically find all frequent 2-itemsets. An itemset that has 'k' elements is called a k-itemset.
The easiest way to do this would be to modify any open source Apriori implementation in Java so that it stops enumerating itemsets after finding all frequent 2-itemsets. It wouldn't be that difficult, because Apriori starts by counting all the 1-itemsets, then takes all the frequent 1-itemsets, generates candidate 2-itemsets from them, scans the database again, counts the support for those candidate 2-itemsets, chooses the frequent ones, generates candidate 3-itemsets, and so on.
For example, suppose that the frequent 1-itemsets are the following:
a, c, d
Then the algorithm generates all possible 2-itemsets as follows:
ac, ad, cd
It then counts their support by scanning the database again and filters out the infrequent ones.
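The two passes described above could be sketched roughly as follows. This is a minimal illustration rather than a full Apriori implementation: the class name FrequentPairs, the method frequentPairs and the "a b" string key for a pair are all assumptions made for the example, and the transactions are assumed to be already parsed into int arrays.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not taken from any existing Apriori library.
class FrequentPairs {
    /** Maps "a b" (a < b) to its support, for pairs found in more than minPercent % of transactions. */
    static Map<String, Integer> frequentPairs(List<int[]> transactions, int minPercent) {
        long n = transactions.size();

        // Pass 1: support of each single item, counted at most once per transaction.
        Map<Integer, Integer> itemSupport = new HashMap<>();
        for (int[] t : transactions) {
            for (int item : Arrays.stream(t).distinct().toArray()) {
                itemSupport.merge(item, 1, Integer::sum);
            }
        }

        // Pass 2: count candidate pairs built only from frequent single items.
        Map<String, Integer> pairSupport = new HashMap<>();
        for (int[] t : transactions) {
            int[] items = Arrays.stream(t).distinct().sorted()
                    .filter(i -> itemSupport.get(i) * 100L > minPercent * n)
                    .toArray();
            for (int a = 0; a < items.length; a++) {
                for (int b = a + 1; b < items.length; b++) {
                    pairSupport.merge(items[a] + " " + items[b], 1, Integer::sum);
                }
            }
        }

        // Keep only the pairs that are themselves frequent.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairSupport.entrySet()) {
            if (e.getValue() * 100L > minPercent * n) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

Filtering on the single-item support first is what keeps the number of candidate pairs small: a pair can only be frequent if both of its items are.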

You could just create a 2-dimensional array of size n x n where n is the number of items.
The matrix would store the support of each pair of items.
Then you scan the transactions and increase the count in the matrix.
After finishing reading the database, you have all itemsets of size 2 and their frequency in the matrix.
Note that, for efficiency, a triangular matrix is generally used.
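A triangular matrix can be stored in a flat array, as in the sketch below. It assumes the items have already been remapped to dense indices 0..n-1 and that each transaction lists an item at most once; the class name TriangularPairCounter and its methods are made up for this illustration.

```java
// Hypothetical class: pair supports in a flat upper-triangular array.
class TriangularPairCounter {
    private final int n;
    private final int[] counts; // one slot per unordered pair {i, j} with i < j

    TriangularPairCounter(int n) {
        this.n = n;
        this.counts = new int[n * (n - 1) / 2];
    }

    // Position of the pair (i, j), i != j, in the flat array.
    private int index(int i, int j) {
        if (i > j) {
            int tmp = i;
            i = j;
            j = tmp;
        }
        return i * n - i * (i + 1) / 2 + (j - i - 1);
    }

    // Increments the count of every pair in one transaction (distinct item indices assumed).
    void addTransaction(int[] items) {
        for (int a = 0; a < items.length; a++) {
            for (int b = a + 1; b < items.length; b++) {
                counts[index(items[a], items[b])]++;
            }
        }
    }

    // Number of transactions seen so far that contain both i and j.
    int support(int i, int j) {
        return counts[index(i, j)];
    }
}
```

Compared with a full n x n matrix this roughly halves the memory, at the cost of the small index calculation.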

Related

Array and file creation basic Java Array Problem

I'm having a hard time with a problem. I have a test coming up soon and I'm not sure how to get this problem done. (Nothing too advanced.) My main issue is reading the input file and creating the final file. The problem is right below. Any help is appreciated, thanks!
I've been told to add a summary because it is hard to understand what I'm asking. Essentially, you have a file, text.txt, and the program takes that file and adds all the numbers in it. Spaces separate the numbers, and a zero at the front of a number needs to be removed. In the output, the numbers (however many spaces were between them) are separated by " + ", followed by " = " and the result of the addition. I'm looking for help with how to:
a) Make the new file
b) Make the +s and =s
c) Transferring from the original file
d) How to do the addition (dealing with large space gaps and doing the summing)
IM STILL VERY NEW TO JAVA SO ANY LONGER EXPLANATIONS WOULD BE GREATLY APPRECIATED
The approach you are to implement is to store each integer in an array of digits, with one digit per array element. We will be using arrays of length 25, so we will be able to store integers up to 25 digits long. We have to be careful in how we store these digits. Consider, for example, storing the numbers 38423 and 27. If we store these at the “front” of the array with the leading digit of each number in index 0 of the array, then when we go to add these numbers together, we’re likely to add them like this:
38423
27
Thus, we would be adding 3 and 2 in the first column and 8 and 7 in the second column. Obviously this won’t give the right answer. We know from elementary school arithmetic that we have to shift the second number to the right to make sure that we have the appropriate columns lined up:
38423
   27
To simulate this right-shifting of values, we will store each value as a sequence of exactly 25 digits, but we’ll allow the number to have leading 0’s. For example, the problem above is converted into:
0000000000000000000038423
0000000000000000000000027
Now the columns line up properly and we have plenty of space at the front in case we have even longer numbers to add to these.
The data for your program will be stored in a file called sum.txt. Each line of the input file will have a different addition problem for you to solve. Each line will have one or more integers to be added together. Take a look at the input file at the end of this write-up and the output you are supposed to produce. Notice that you produce a line of output for each input line showing the addition problem you are solving and its answer. Your output should also indicate at the end how many lines of input were processed. You must exactly reproduce this output.
You should use the techniques described in chapter 6 to open a file, to read it line by line, and to process the contents of each line. In reading these numbers, you won’t be able to read them as ints or longs because many of them are too large to be stored in an int or long. So you’ll have to read them as String values using calls on the method next(). Your first task, then, will be to convert a String of digits into an array of 25 digits. As described above, you’ll want to shift the number to the right and include leading 0’s in front. Handout #15 provides an example of how to process an array of digits. Notice in particular the use of the String method charAt and the method Character.getNumericValue. These will be helpful for solving this part of the problem.
You are to add up each line of numbers, which means that you’ll have to write some code that allows you to add together two of these numbers or to add one of them to another. This is something you learned in Elementary School to add starting from the right, keeping track of whether there is a digit to carry from one column to the next. Your challenge here is to take a process that you are familiar with and to write code that performs the corresponding task.
Your program also must write out these numbers. In doing so, it should not print any leading 0’s. Even though it is convenient to store the number internally with leading 0’s, a person reading your output would rather see these numbers without any leading 0’s.
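The three core steps described so far (right-aligning a number in a 25-digit array, adding with a carry, and printing without leading zeros) might be sketched like this. It is only a partial illustration under the assignment's stated assumptions (no negative numbers, results fit in 25 digits); the class and method names are invented here, and the file reading and the " + " / " = " formatting are left out.

```java
// Hypothetical helper class covering only the digit-array arithmetic.
class DigitSum {
    static final int DIGITS = 25; // class constant instead of a magic number

    // Stores the digits of s right-aligned, leaving leading zeros in front.
    static int[] toDigits(String s) {
        int[] digits = new int[DIGITS];
        int offset = DIGITS - s.length();
        for (int i = 0; i < s.length(); i++) {
            digits[offset + i] = Character.getNumericValue(s.charAt(i));
        }
        return digits;
    }

    // Column-by-column addition from the right, carrying into the next column.
    static int[] add(int[] a, int[] b) {
        int[] sum = new int[DIGITS];
        int carry = 0;
        for (int i = DIGITS - 1; i >= 0; i--) {
            int total = a[i] + b[i] + carry;
            sum[i] = total % 10;
            carry = total / 10;
        }
        return sum;
    }

    // Skips leading zeros but always keeps the last digit, so 0 prints as "0".
    static String format(int[] digits) {
        int start = 0;
        while (start < DIGITS - 1 && digits[start] == 0) {
            start++;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < DIGITS; i++) {
            sb.append(digits[i]);
        }
        return sb.toString();
    }
}
```

For example, DigitSum.format(DigitSum.add(DigitSum.toDigits("999"), DigitSum.toDigits("483"))) yields "1482", matching the sample output.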
You can assume that the input file has numbers that have 25 or fewer digits and that the answer is always 25 digits or fewer. Notice, however, that you have to deal with the possibility that an individual number might be 0 or the answer might be 0. There will be no negative integers in the input file.
You should solve this problem using arrays that are exactly 25 digits long. Certain bugs can be solved by stretching the array to something like 26 digits, but it shouldn’t be necessary to do that and you would lose style points if your arrays require more than 25 digits.
The choice of 25 for the number of digits is arbitrary (a magic number), so you should introduce a class constant that you use throughout that would make it easy to modify your code to operate with a different number of digits.
Consider the input file as an example of the kind of problems your program must solve. We might use a more complex input file for actual grading.
The Java class libraries include classes called BigInteger and BigDecimal that use a strategy similar to what we are asking you to implement in this program. You are not allowed to solve this problem using BigInteger or BigDecimal. You must solve it using arrays of digits.
The sample program of handout #18 should be particularly helpful to study to prepare you for this programming problem. It has some significant differences from the task you are asked to solve, but it also has some similarities that you will find helpful to study. The textbook has a discussion of the program starting in section 7.6.
You may assume that the input file has no errors. In particular, you may assume that each line of input begins with at least one number and that each number and each answer will be 25 digits or fewer. There will be whitespace separating the various numbers, although there is no guarantee about how much whitespace there will be between numbers.
You will again be expected to use good style throughout your program and to comment each method and the class itself. A major portion of the style points will be awarded based on how you break this program down into static methods. As with the sample program in handout #18, try to think in terms of logical subtasks of the overall task and create different methods for different subtasks. You should have at least four static methods other than main and you are welcome to introduce more than four if you find it helpful.
Your program should be stored in a file called Sum.java. You will need to include the files Scanner.java and sum.txt from the class web page (under the “assignments” link) in the same folder as your program. For those using DrJava, you will either have to use a full path name for the file sum.txt (see section 6.2.2 of the book) or you will have to put the file in the same directory as the DrJava program.
Input file sum.txt
82384
204 435
22 31 12
999 483
28350 28345 39823 95689 234856 3482 55328 934803
7849323789 22398496 8940 32489 859320
729348690234239 542890432323 534322343298
3948692348692348693486235 5834938349234856234863423
999999999999999999999999 432432 58903 34
82934 49802390432 8554389 4789432789 0 48372934287
0
0 0 0
7482343 0 4879023 0 8943242
3333333333 4723 3333333333 6642 3333333333
Output that should be produced
82384 = 82384
204 + 435 = 639
22 + 31 + 12 = 65
999 + 483 = 1482
28350 + 28345 + 39823 + 95689 + 234856 + 3482 + 55328 + 934803 = 1420676
7849323789 + 22398496 + 8940 + 32489 + 859320 = 7872623034
729348690234239 + 542890432323 + 534322343298 = 730425903009860
3948692348692348693486235 + 5834938349234856234863423 = 9783630697927204928349658
999999999999999999999999 + 432432 + 58903 + 34 = 1000000000000000000491368
82934 + 49802390432 + 8554389 + 4789432789 + 0 + 48372934287 = 102973394831
0 = 0
0 + 0 + 0 = 0
7482343 + 0 + 4879023 + 0 + 8943242 = 21304608
3333333333 + 4723 + 3333333333 + 6642 + 3333333333 = 10000011364
Total lines = 14

Efficient way to store and search number combinations

I have an application that generates combinations of numbers, where each combination is unique. I want to store the combinations in a way that lets me retrieve a few random combinations efficiently later.
Here is an example of one combination
35 36 37 38 39 40 50
There are ~4 million combinations in all.
How should I store the data so that I can retrieve combinations later?
Since your combinations are unique and you actually don't have a query criteria on your numbers, it does not matter how you store them in the database. Just insert them in some table. To retrieve X random combinations simply do:
SELECT * FROM table ORDER BY RANDOM() LIMIT X
See:
Select random row(s) in SQLite
On storing array of integers in SQLite:
Insert a table of integers - int[] - into SQLite database,
I think there might be a different solution; in the sense of: do you really have to store all those combinations?!
Assuming that those combinations are just "random" - you could be using some (smart) maths, to some function getCombinationFor(), like
public List<Integer> getCombinationFor(long whatever)
that uses a fixed algorithm to create a unique result for each incoming input.
Like:
getCombinationFor(0): gives 0 1 2 3 10 20 30
getCombinationFor(1): gives 1 2 3 4 10 20 30 40
The above is of course pretty simple, and depending on your requirements for those sequences you might need something much more complicated. But you can certainly define such a function to return a permutation of a fixed set of numbers within a certain range!
The important thing is: this function returns a unique List for each and any input; and also important: given a certain sequence, you can immediately determine the number that was used to create that sequence.
So instead of generating a huge set of data containing unique sequences, you simply define an algorithm that knows how to create unique sequences in a deterministic way. If that would work for you, it completely frees you from storing your sequences at all!
Edit: just remembered that I was looking into something kinda "close" to this question/answer ... see here!
Sending data:
Use a UNIQUE index. On a unique-violation exception, just re-randomize and send the next record.
Retrieve:
The simplest approach to implement is: use a HashSet and feed it random numbers (the maximum is the number of combinations in your database minus 1) while hashSet.size() is less than the desired number of records to retrieve. Use the numbers from the HashSet as IDs or row numbers of the records (combinations) to select the data: WHERE ID in (ids here).
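The retrieval step above could look roughly like this in Java. The class and method names are placeholders for this example, and it assumes the rows have IDs 0 to total - 1 with no gaps; the database query itself is not shown.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper: picks distinct random row IDs in the range [0, total).
class RandomIds {
    static Set<Integer> pickDistinctIds(int total, int wanted, Random rng) {
        if (wanted > total) {
            throw new IllegalArgumentException("cannot pick more IDs than exist");
        }
        Set<Integer> ids = new HashSet<>();
        // Duplicates are silently ignored by the set, so loop until enough distinct IDs exist.
        while (ids.size() < wanted) {
            ids.add(rng.nextInt(total));
        }
        return ids;
    }
}
```

This works well when the number of wanted records is small compared with the total, as it is for a few picks out of ~4 million; when almost all rows are wanted, the loop would spend most of its time on duplicates.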

How to reduce the time complexity of bucket filling program?

I was solving a problem which states following:
There are n buckets in a row. A gardener waters the buckets. Each day he waters the buckets between positions i and j (inclusive). He does this for t days for different i and j.
Output the volume of the waters in the buckets assuming initially zero volume and each watering increases the volume by 1.
Input: the first line contains t and n separated by spaces.
The next t lines contain i and j separated by spaces.
Output: a single line showing the volume in the n buckets separated by spaces.
Example:
Input:
2 2
1 1
1 2
Output:
2 1
Constraints:
0 <= t <= 10^4; 1 <= n <= 10^5
I tried this problem, but I used an O(n*t) algorithm: at each step I increment every bucket from i to j. This exceeds the time limit. Is there a more efficient algorithm to solve this problem? A small hint would suffice.
P.S.: I have used C++ and Java as tags because the program can be written in either language.
Instead of remembering the amount of water in each bucket, remember the difference between each bucket and the previous one.
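One way to read this hint is the difference-array technique sketched below: each watering touches only two array entries, and a single running sum at the end recovers the volumes, giving O(n + t) overall. The class name, method name and the int[][] input format are assumptions for the example; positions are 1-based as in the problem statement.

```java
// Hypothetical sketch of the difference-array idea.
class BucketWatering {
    static int[] volumes(int n, int[][] ranges) {
        // diff[p] holds (volume of bucket p) - (volume of bucket p - 1), 1-based.
        int[] diff = new int[n + 2];
        for (int[] r : ranges) {
            diff[r[0]]++;     // volume rises by one starting at bucket i
            diff[r[1] + 1]--; // and drops back just after bucket j
        }
        int[] volume = new int[n];
        int running = 0;
        for (int pos = 1; pos <= n; pos++) {
            running += diff[pos]; // the prefix sum of differences is the actual volume
            volume[pos - 1] = running;
        }
        return volume;
    }
}
```

For the sample input (n = 2 with ranges 1 1 and 1 2) this returns the volumes 2 and 1.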
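Keep two lists of the intervals, one sorted by lower bound and one sorted by upper bound.
Then iterate over the bucket positions 1 to n, starting with a volume v of 0.
At each position:
check whether the next interval in the lower-bound list starts at this position
if so, increase v by one and check the next interval
print v
do the same for the upper bounds, but decrease v for each interval that ends at this position (after printing, because the ranges are inclusive)
repeat with the next position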
I think the key observation here is that you need to figure out a way to represent your (possibly) 10^5 buckets without actually allocating space for each and every one of them, and tracking them separately. You need to come up with a sparse representation to track your buckets, and the water inside.
The fact that your input comes in ranges gives you a good hint: you should probably make use of ranges in your sparse representation. You can do this by just tracking the buckets on the ends of each range.
I suggest you do this with a linked list. Each list node will contain 2 pieces of information:
a bucket number
the amount of water in that bucket
You assume that all buckets between the current bucket and the next bucket have the same volume of water.
Here's an example:
Input:
5 30
1 5
4 20
7 13
25 30
19 27
Here's what would happen on each step of the algorithm, with step 1 being the initial state, and each successive step being what you do after parsing a line.
1:0→NULL (all buckets are 0)
1:1→6:0→NULL (1-5 have 1, rest are 0)
1:1→4:2→6:1→21:0→NULL (1-3 have 1, 4-5 have 2, 6-20 have 1, rest have 0)
1:1→4:2→6:1→7:2→14:1→21:0→NULL
1:1→4:2→6:1→7:2→14:1→21:0→25:1→NULL
1:1→4:2→6:1→7:2→14:1→19:2→21:1→25:2→28:1→NULL
You should be able to infer from the above example that the complexity with this method is actually O(t^2) instead of O(n×t), so this should be much faster. As I said in my comment above, the bottleneck this way should actually be the parsing and output rather than the actual computation.
Here's an algorithm with space and time complexity O(n)
I am using java since I am used to it
1) Create a hashset of n elements
2) Each time a watering is made increase the respective elements count
3) After file parsing is complete then iterate over hashset to calculate result.

How to read the data stored in a text file into an array, then sort that data and store it

I have a text file containing 10 rows. Each row has 10 elements separated by commas which are already sorted row wise like:
3463,34957,44443,50481,71036,73503,74289,76671,82462,92527
1456,2731,18159,20440,32962,38562,49321,64220,67615,72541
1073,6217,9695,27372,30624,38021,47851,68479,76834,88021
7930,11882,17681,27267,32131,45096,59008,69156,72843,94146
2381,4359,30194,40730,73714,74721,75127,78830,86753,89475
1466,21335,21369,23342,36973,50888,67891,78069,90346,99970
15015,16628,21012,25483,42387,42519,45472,49552,57193,71449
1751,8833,35433,39972,44475,47604,51601,59108,87957,94764
10728,17248,31885,41453,41479,54785,81400,83554,86014,87105
228,9479,25187,50956,70720,71878,78744,84341,86637,88225
Now I want to sort these 100 elements without disturbing the row order (i.e. the smallest number (228) should be at the first position and the largest number (99970) should be at the last position), and I need to store the fully sorted numbers in another file.
I am having trouble adding these numbers to an array, and I also want to know how to sort them. The constraint is that no more than 10 elements may be in RAM at a time.
I have started writing some code to get the data from the file:
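import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// The class name is only a wrapper so the snippet compiles
public class ReadSortedLines {
    public static void main(String[] args) {
        File file = new File("SortedLines.txt");
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fis = new FileInputStream(file)) {
            int content;
            while ((content = fis.read()) != -1) {
                // convert to char and display it
                System.out.print((char) content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}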
For each row:
split the String read from the file by ','
create a new Integer[splittedString.length], iterate over the string array, create an Integer from that String with Integer.parseInt(..) and put that in the appropriate position of the created Integer[]
call Arrays.sort(..) with the created array
If the numbers to sort are in the form you mention above, you could simply use Arrays.sort():
Create one array containing all numbers (I understand these are integers), say myUnsortedArray
Call Arrays.sort(myUnsortedArray).
That should do the job of sorting the array. You can then transform it the way you want.
Hope this helps.
The comments to answers above seem to slightly shift the problem here. You stated the problem as follows:
in real i have a file containing 1 million entries in 1000 * 1000 matrix form and i can't take more than 10000 elements in RAM at a time
And then:
"You are given a Million of numbers and you have to find the 100th smallest number." PROBLEM CONSTRAINT: Not more than 10,000 elements can be in RAM at a time
I would suggest you to edit your original question to reflect this problem -- if this is really the ultimate problem you are trying to solve.
What I understand of your problem is: you have to find the 100th smallest number in a 1000*1000 matrix (please note: this is quite different from saying you've got 1 million numbers), with the constraint that no more than 10,000 numbers can be kept in memory. If I am correct, a potential solution may be:
Load one matrix row into memory as an array.
Sort it with Arrays.sort() as suggested before, and keep its 100 smallest values in an array, let's call it minValues.
Keep the lowest and highest values of minValues in temporary variables, let's call them a and b.
For each value of the subsequent rows, let's call it x, check whether x < b. If that's the case, insert the value into minValues at its sorted position (a value below a goes to the front and becomes the new a). This will naturally push the last element out of the array, so you'll have to update the value of b.
At the end of all iterations, minValues will contain the smallest 100 elements in the matrix; just pick the last one (i.e. b) and that will be your 100th smallest element.
You can parameterise this method with any value (e.g. if you need the smallest 157th element), and the memory footprint is
100 elements (for MinValues) + 1000 elements (the row under
inspection) + 2 (a and b) = 1102 elements
That is still way below the limit of 10.000 elements. Performance in terms of speed may not be great, but this requirement was not in the picture, and anyway, when dealing with large amounts of data under tight memory limits you do have to trade away some performance.
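As a variation on the steps above, the sorted minValues array can be swapped for a bounded max-heap, which keeps the same "smallest k seen so far" invariant without shifting array elements. This is a different data structure from the one the answer describes, and the class and method names are invented for the sketch; the values are assumed to arrive through an Iterator so that the caller decides how much of the file is in memory at once.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

// Hypothetical sketch: k-th smallest value of a stream using at most k stored elements.
class KthSmallest {
    static int kthSmallest(Iterator<Integer> values, int k) {
        // Max-heap: the root is the largest of the k smallest values seen so far (the "b" above).
        PriorityQueue<Integer> heap = new PriorityQueue<>(Comparator.reverseOrder());
        while (values.hasNext()) {
            int x = values.next();
            if (heap.size() < k) {
                heap.add(x);
            } else if (x < heap.peek()) {
                heap.poll(); // drop the current largest
                heap.add(x); // and keep the smaller newcomer instead
            }
        }
        if (heap.size() < k) {
            throw new IllegalArgumentException("fewer than k values supplied");
        }
        return heap.peek();
    }
}
```

With k = 100 the heap never holds more than 100 numbers, so the memory use stays comparable to the estimate given above.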
I'd love to hear of a better way of achieving the goal.
EDIT: I'd suggest you also check out the Frederickson and Johnson algorithm. It solves the issue in O(K) time, where K is the rank of the element sought (100 in your case). Not sure about the memory footprint though.
Hope this helps.

Searching an array for sum of values

I have a system that generates values in a text file which contains values as below
Line 1 : Total value possible
Line 2 : No of elements in the array
Line 3(extra lines if required) : The numbers themselves
I am now thinking of an approach where I subtract the first integer in the array from the total value, search the array for the remainder, and repeat this with the next integer until the pair is found.
The other approach is to add up every combination of two integers in the array until the pair is found.
As per my analysis, the first solution is better since it cuts down on the number of iterations. Is my analysis correct, and is there a better approach?
Edit :
I'll give a sample here to make it more clear
Line 1 : 200
Line 2=10
Line 3 : 10 20 80 78 19 25 198 120 12 65
Now the valid pair here is 80,120, since it sums to 200 (given in line 1 of the input file as the total value possible), and their positions in the array are 3,8. To find this pair, I take the first element, subtract it from the total value possible, and search for the remainder among the other elements using a basic search algorithm.
Using the example here, I first take 10 and subtract it from 200, which gives 190. Then I search for 190: if it is found, the pair is found; otherwise I continue the same process with the next element.
Your problem is vague, but if you are looking for a pair in the array that is summed to a certain number, it can be done in O(n) on average using hash tables.
Iterate the array, and for each element:
(1) Check if it is in the table. If it is - stop and return there is such a pair.
(2) Else: insert num-element to the hash table.
If your iteration terminated without finding a match - there is no such pair.
Pseudo code:
checkIfPairExists(arr,num):
    set <- new empty hash set
    for each element in arr:
        if set.contains(element):
            return true
        else:
            set.add(num-element)
    return false
The general problem of "is there a subset that sums to a certain number" is NP-Hard, and is known as the subset-sum problem, so there is no known polynomial solution to it.
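A Java version of the hash-table approach from this answer might look like the sketch below. It is a slight variant: it stores the elements seen so far and looks up the complement (rather than storing the complements), which is equivalent but also makes it easy to report the 1-based positions the question asks about. The class and method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the O(n) average-time pair search.
class PairSum {
    /** Returns the 1-based positions of two elements that sum to target, or null if none exist. */
    static int[] findPair(int[] arr, int target) {
        Map<Integer, Integer> seen = new HashMap<>(); // value -> 0-based index where it was seen
        for (int i = 0; i < arr.length; i++) {
            Integer j = seen.get(target - arr[i]);
            if (j != null) {
                return new int[]{j + 1, i + 1};
            }
            seen.put(arr[i], i);
        }
        return null;
    }
}
```

For the sample input from the question (total 200), this returns the positions 3 and 8, i.e. the values 80 and 120.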
If you're trying to find a pair of (2) numbers which sum to a third number, in general you'll have something like:
for(i=0;i<N;i++)
    for(j=i+1;j<N;j++)
        if(numbers[i]+numbers[j]==result)
            The answer is <i,j>
            end
which is O(n^2). However, it is possible to do better.
If the list of numbers is sorted (which takes O(n log n) time) then you can try:
for(i=0;i<N;i++)
    binary_search 'numbers[i+1:N]' for result-numbers[i]
    if search succeeds:
        The answer is <i, search_result_index>
        end
That is you can step through each number and then do a binary search on the remaining list for its companion number. This takes O(n log n) time. You may need to implement the search function above yourself as built-in functions may just walk down the list in O(n) time leading to an O(n^2) result.
For both methods, you'll want to check for the special case that the current number is equal to your result.
Both algorithms use no more space than is taken by the array itself.
Apologies for the coding style, I'm not terribly familiar with Java and it's the ideas here which are important.
