Confusion in building an SVM training set - Java

I am currently testing the training phase of my Binary SVM Java implementation.
I have tested it on the small data set shown below, but I need to apply my SVM to a known dataset such as spam/not-spam, images, etc.
My SVM only reads numeric values, so I need to test it on some real numeric data as well.
Later I want to move on to images.
To find a real data set I searched through different repositories, but all I could find mixed numerical values with characters, text, etc.
Then I found a spam archive.
But how do I proceed with that?
I think I need to convert the text into numerical data using tf-idf and then apply my SVM.
But how do I mark the examples as the 1/-1 classes?
Normally the input would be in this format, right?
0 0 1
3 4 1
5 9 1
12 1 1
8 7 1
9 8 -1
6 12 -1
10 8 -1
8 5 -1
14 8 -1
How do I bring the spam archive data into the above format?

It's all about feature selection. The input is, of course, pairs of documents and labels, but feature extraction is part of the training process. The most straightforward representation is binary: for each word in an established dictionary built from all the training documents, check whether that word occurs in the particular document. A refinement is term frequency: the i-th component of the feature vector is the number of times word wi occurs in the document. You may also weight by inverse document frequency: the logarithm of the total number of documents divided by the number of documents in which wi occurs, which together with term frequency gives the tf-idf scheme you mentioned.
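As a hedged sketch (the class and method names are mine, and the tokenization is deliberately naive), building a dictionary and emitting term-frequency rows with a trailing 1/-1 label could look roughly like this in Java:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: build a dictionary from the training documents, turn each document
// into a term-frequency vector, and append a 1 (ham) / -1 (spam) label as the last
// column, matching the row format in the question. The whitespace tokenization is
// deliberately naive; a real pipeline would strip punctuation, remove stop words,
// and usually replace raw counts with tf-idf weights.
public class SpamFeatureExtractor {

    // Assigns every distinct word across the training documents a fixed column index.
    public static Map<String, Integer> buildDictionary(List<String> documents) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String word : doc.toLowerCase().split("\\s+")) {
                index.putIfAbsent(word, index.size());
            }
        }
        return index;
    }

    // Term frequency: component i counts how often dictionary word i occurs in the document.
    public static double[] termFrequencyVector(String document, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        for (String word : document.toLowerCase().split("\\s+")) {
            Integer i = dictionary.get(word);
            if (i != null) {
                vector[i]++;
            }
        }
        return vector;
    }

    // Builds one row per document: the feature vector followed by its 1/-1 label.
    public static List<double[]> labelledRows(List<String> documents, List<Integer> labels,
                                              Map<String, Integer> dictionary) {
        List<double[]> rows = new ArrayList<>();
        for (int d = 0; d < documents.size(); d++) {
            double[] features = termFrequencyVector(documents.get(d), dictionary);
            double[] row = new double[features.length + 1];
            System.arraycopy(features, 0, row, 0, features.length);
            row[features.length] = labels.get(d); // last column is the 1/-1 class
            rows.add(row);
        }
        return rows;
    }
}

From there you can swap the raw counts for tf-idf weights before feeding the rows to your SVM.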
FYI, one research paper about SVM on spam:
http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/00788645-SVMspam.pdf

Related

How to replace numbers with the same 'stem' that occur in two files, using the same logic?

So basically I have two .txt files that contain the same numbers (16 digits), with the first 8 digits always identical (e.g. 12345678) and the next 8 digits random (e.g. 38462943). What I have been trying to do is replace the numbers in both files with unique random 16-digit numbers, using the same logic in both files.
TL;DR - The problem I'm having is: how can I locate the same numbers in each file and then replace them using the same logic?
**note - the files do not just contain the numbers I want randomized; there is other information on the same line (e.g. line 1: 1234, 1234567800000234, 5678)
Example (notice how the numbers are the same but not in the same order):
File 1
1234567800000234
1234567800011523
1234567800284828
File 2
1234567800284828
1234567800011523
1234567800000234
Expected output (I just want the numbers randomized; it doesn't matter if the stem is changed or not):
File 1
9348384028472894
9350148852541329
9761213142823690
File 2
9761213142823690
9350148852541329
9348384028472894
**edit - clarification
You have two real options here.
Use a pseudorandom conversion on the numbers that meet the criteria -- run them through SHA-1 or something similar. This will produce the same output every time, so you don't have to keep track of anything. It won't be truly random, though.
Keep a record of every substitution you've made, so that when you go through the second file you can use that record to look up what the correct substitution should be.
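A minimal Java sketch of the second option (the class and method names are mine, purely for illustration):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of option 2: remember each substitution so that both files
// map the same original number to the same replacement.
public class NumberSubstituter {
    private final Map<String, String> substitutions = new HashMap<>();
    private final Random random = new Random();

    // Returns the replacement for a 16-digit number, creating one the first time it is seen.
    public String replace(String original) {
        return substitutions.computeIfAbsent(original, key -> randomDigits(16));
    }

    private String randomDigits(int n) {
        StringBuilder sb = new StringBuilder(n);
        sb.append(1 + random.nextInt(9)); // avoid a leading zero
        for (int i = 1; i < n; i++) {
            sb.append(random.nextInt(10));
        }
        return sb.toString();
    }
}

Run file 1 through it line by line, replacing every matching 16-digit token via replace(), then run file 2 through the same NumberSubstituter instance so identical originals get identical replacements.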

Figuring out the size of an object to handle the max item size in DynamoDB [duplicate]

I'm trying to compute the size of an item in DynamoDB and I'm not able to understand the definition.
The definition I found: an item size is the sum of the lengths of its attribute names and values (binary and UTF-8 lengths), so it helps if you keep attribute names short.
Does it mean that if I put a number in the database, say 1, it'll take the size of an int? A long? A double? Will it take the same amount of space as 100 or 1000000, or will it only take the size of the corresponding binary representation?
And how is the size computed for a String?
Does anyone know how to compute it?
That's a non-trivial topic indeed - you already quoted the somewhat sloppy definition from the Amazon DynamoDB Data Model:
An item size is the sum of lengths of its attribute names and values
(binary and UTF-8 lengths).
This is detailed further down the page within Amazon DynamoDB Data Types a bit:
String - Strings are Unicode with UTF8 binary encoding.
Number - Numbers are positive or negative exact-value decimals and integers. A number can have up to 38 digits of precision after the decimal point, and can be between 10^-128 to 10^+126. The representation in Amazon DynamoDB is of variable length. Leading and trailing zeroes are trimmed.
A similar question to yours has been asked in the Amazon DynamoDB forum as well (see Curious nature of the "Number" type), and the answer from Stefano#AWS sheds more light on the issue:
The "Number" type has 38 digits of precision These are actual decimal
digits. So it can represent pretty large numbers, and there is no
precision loss.
How much space does a Number value take up? Not too
much. Our internal representation is variable length, so the size is
correlated to the actual (vs. maximum) number of digits in the value.
Leading and trailing zeroes are trimmed btw. [emphasis mine]
Christopher Smith's follow-up post presents more insights into the resulting ramifications regarding storage consumption and its calculation; he concludes:
The existing API provides very little insight in to storage
consumption, even though that is part (admittedly not that
significant) of the billing. The only information is the aggregate
table size, and even that data is potentially hours out of sync.
While Amazon does not expose its billing data via an API yet, they'll hopefully add an option to retrieve some information regarding item size to the DynamoDB API at some point, as suggested by Christopher.
I found this answer in the Amazon developer forum, answered by Clarence#AWS. For example:
"Item":{
"time":{"N":"300"},
"feeling":{"S":"not surprised"},
"user":{"S":"Riley"}
}
In order to calculate the size of the above object: the item size is the sum of the lengths of the attribute names and values, interpreted as UTF-8 characters. In the example, the number of bytes of the item is therefore the sum of:
time: 4 + 3
feeling: 7 + 13
user: 4 + 5
which is 36.
For the formal definition, refer to:
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/WorkingWithDDItems.html
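As a rough illustration of that rule, here is a Java sketch that only covers attribute names and plain string values (numbers, sets and the other data types have their own sizing rules, as discussed further down):

import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hedged sketch of the rule quoted above: item size is roughly the sum of the UTF-8
// lengths of attribute names and string values. It ignores the type-specific sizing
// of numbers, sets, and so on, so treat it as an approximation only.
public class ItemSizeEstimator {

    public static int estimateSize(Map<String, String> attributes) {
        int bytes = 0;
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            bytes += attr.getKey().getBytes(StandardCharsets.UTF_8).length;   // attribute name
            bytes += attr.getValue().getBytes(StandardCharsets.UTF_8).length; // attribute value
        }
        return bytes;
    }

    public static void main(String[] args) {
        // The example item from the answer: 7 + 20 + 9 = 36 bytes.
        System.out.println(estimateSize(Map.of(
                "time", "300",
                "feeling", "not surprised",
                "user", "Riley")));
    }
}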
An item’s size is the sum of all its attributes’ sizes, including the hash and range key attributes.
Attributes themselves have a name and a value. Both the name and value contribute to an attribute’s size.
Names are sized the same way as string values. All values are sized differently based on their data type.
If you're interested in the nitty-gritty details, have a read of this blog post.
Otherwise, I've also created a DynamoDB Item Size and Consumed Capacity Calculator that accurately determines item sizes.
Numbers are easily DynamoDB's most complicated type. AWS does not publicly document how to determine how many bytes are in a number. They say this is so they can change the internal implementation without anyone being tied to it. What they do say, however, sounds simple but is more complicated in practice.
Very roughly, though, the formula is something like 1 byte for every 2 significant digits, plus 1 extra byte for positive numbers or 2 for negative numbers. Therefore, 27 is 2 bytes and -27 is 3 bytes. DynamoDB rounds up if there is an odd number of digits, so 461 uses 3 bytes (including the extra byte). Leading and trailing zeros are trimmed before calculating the size.
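A hedged Java sketch of that unofficial estimate (an assumption based on the rule of thumb above, not AWS's actual algorithm):

// Hedged sketch of the rough rule described above, not an official formula:
// 1 byte per 2 significant digits (rounded up), plus 1 byte of overhead for
// positive numbers or 2 bytes for negative numbers.
public class NumberSizeEstimate {

    public static int estimateNumberBytes(String decimal) {
        boolean negative = decimal.startsWith("-");
        // Keep only the digits and trim leading/trailing zeros, as DynamoDB is said to do.
        String digits = decimal.replaceAll("[^0-9]", "")
                               .replaceAll("^0+", "")
                               .replaceAll("0+$", "");
        int significant = digits.isEmpty() ? 1 : digits.length();
        return (significant + 1) / 2 + (negative ? 2 : 1);
    }

    public static void main(String[] args) {
        System.out.println(estimateNumberBytes("27"));  // 2
        System.out.println(estimateNumberBytes("-27")); // 3
        System.out.println(estimateNumberBytes("461")); // 3
    }
}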
You can find the algorithm for computing DynamoDB item sizes in the DynamoDBDelegate class of the DynamoDB Storage Backend for Titan.
All of the above answers skip the overhead of storing the length of each attribute value, as well as the length of each attribute name and the type of each attribute.
The DynamoDB Naming Guide says names can be 1 to 255 characters long, which implies a 1-byte name-length overhead.
We can work back from the 400 KB maximum item limit to an upper bound on the length field required for binary or string items: they never need to store more than a 19-bit number for the length.
Using a bit of adaptive coding, I would expect:
Numbers have a 1-byte leading type-and-length value but could also be coded into a single byte (e.g. a special code for a zero-value number, with no value bytes following)
String and binary have 1-3 bytes of leading type and length
Null is just a type byte without a value
Bool is a pair of type bytes without any other value
Collection types have 1-3 bytes of leading type and length
Oh, and DynamoDB is not schemaless. It is schema-per-item because it's storing the types, names and lengths of all these variable length items.
An approximation of how much space an item occupies in your DynamoDB table can be obtained by doing a get request with the boto3 library.
This is not an exact way to determine the size of an item, but it will give you an idea. When performing a batch_get_item(**kwargs) you get a response that includes the ConsumedCapacity in the following form:
...
'ConsumedCapacity': [
    {
        'TableName': 'string',
        'CapacityUnits': 123.0,
        'ReadCapacityUnits': 123.0,
        'WriteCapacityUnits': 123.0,
        'Table': {
            'ReadCapacityUnits': 123.0,
            'WriteCapacityUnits': 123.0,
            'CapacityUnits': 123.0
        },
        'LocalSecondaryIndexes': {
            'string': {
                'ReadCapacityUnits': 123.0,
                'WriteCapacityUnits': 123.0,
                'CapacityUnits': 123.0
            }
        },
        'GlobalSecondaryIndexes': {
            'string': {
                'ReadCapacityUnits': 123.0,
                'WriteCapacityUnits': 123.0,
                'CapacityUnits': 123.0
            }
        }
    }
]
...
From there you can see how many capacity units the call consumed and derive an approximate size for the item. Obviously this depends on how your system is configured, due to the fact that:
One read request unit represents one strongly consistent read request, or two eventually consistent read requests, for an item up to 4 KB in size. Transactional read requests require 2 read request units to perform one read for items up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB needs additional read request units. The total number of read request units required depends on the item size, and whether you want an eventually consistent or strongly consistent read.
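As a hedged back-of-the-envelope conversion (assuming the 4 KB read-unit granularity quoted above), the reported capacity units bound the item size; a tiny Java sketch of the arithmetic:

// Hedged sketch: a strongly consistent read consumes one unit per 4 KB chunk of the
// item (rounded up), and an eventually consistent read consumes half that, so the
// reported units give an upper bound on the item size.
public class SizeFromCapacity {

    public static double approxMaxItemSizeKb(double capacityUnits, boolean eventuallyConsistent) {
        double chunks = eventuallyConsistent ? capacityUnits * 2 : capacityUnits;
        return chunks * 4.0; // upper bound in KB, since DynamoDB rounds up to 4 KB chunks
    }

    public static void main(String[] args) {
        // e.g. a strongly consistent read reporting 2.0 units means the item is at most 8 KB.
        System.out.println(approxMaxItemSizeKb(2.0, false));
    }
}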
The simplest approach is to create an item in the table and export it to a CSV file, which is an option available in DynamoDB. The size of the CSV file gives you the item size approximately.

N-way merge sort a 2G file of strings

This is another question from Cracking the Coding Interview; I still have some doubts after reading the solution.
9.4 If you have a 2 GB file with one string per line, which sorting algorithm
would you use to sort the file and why?
SOLUTION
When an interviewer gives a size limit of 2GB, it should tell you something - in this case, it suggests that they don’t want you to bring all the data into memory.
So what do we do? We only bring part of the data into memory at a time.
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the lines back to the file.
Now bring the next chunk into memory and sort.
Once we’re done, merge them one by one.
The above algorithm is also known as an external sort. Step 3 is known as an N-way merge.
The rationale behind using external sort is the size of data. Since the data is too huge and we can’t bring it all into memory, we need to go for a disk based sorting algorithm.
Doubt:
In step 3, when doing the merge and comparing two arrays, don't we need 2*X space each time we compare, even though the limit was X MB? Should we instead make the chunks X/2 MB each, so that there are 2K chunks and (X/2)*2K = 2 GB? Or am I just misunderstanding the merge sort?
Thanks!
http://en.wikipedia.org/wiki/External_sorting
A quick look at Wikipedia tells me that during the merging process you never hold a whole chunk in memory. So basically, if you have K chunks, you will have K open file pointers, but you will only hold one line from each file in memory at any given time. You compare the lines you have in memory and output the smallest one (say, from chunk 5) to your sorted file (also an open file pointer, not in memory), then read the next line from that file (in our example, file 5) into memory in its place, and repeat until you reach the end of all the chunks.
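A minimal Java sketch of that merge phase (a priority queue keyed on the current line of each chunk; the file handling is simplified and the names are mine):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.PriorityQueue;

// Hedged sketch of the K-way merge phase: each sorted chunk file contributes one line
// at a time via a priority queue, so memory use is O(K lines), not O(file size).
public class KWayMerge {

    private record Entry(String line, BufferedReader source) {}

    public static void merge(List<String> chunkFiles, String outputFile) throws IOException {
        PriorityQueue<Entry> heap = new PriorityQueue<>((a, b) -> a.line().compareTo(b.line()));
        // Prime the heap with the first line of every chunk.
        for (String chunk : chunkFiles) {
            BufferedReader reader = new BufferedReader(new FileReader(chunk));
            String first = reader.readLine();
            if (first != null) heap.add(new Entry(first, reader));
        }
        try (PrintWriter out = new PrintWriter(outputFile)) {
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();
                out.println(smallest.line());                  // emit the globally smallest line
                String next = smallest.source().readLine();    // refill from the same chunk
                if (next != null) heap.add(new Entry(next, smallest.source()));
                else smallest.source().close();
            }
        }
    }
}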
First off, step 3 itself is not a merge sort; the whole thing is a merge sort. Step 3 is just a merge, with no sorting involved at all.
And as to the storage required, there are two possibilities.
The first is to merge the sorted data in groups of two. Say you have three groups:
A: 1 3 5 7 9
B: 0 2 4 6 8
C: 2 3 5 7
With that method, you would merge A and B in to a single group Y then merge Y and C into the final result Z:
Y: 0 1 2 3 4 5 6 7 8 9 (from merging A and B).
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging Y and C).
This has the advantage of a very small constant memory requirement in that you only ever need to store the "next" element from each of two lists but, of course, you need to do multiple merge operations.
The second way is a "proper" N-way merge where you select the next element from any of the groups. With that you would check the lowest value in every list to see which one comes next:
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging A, B and C).
This involves only one merge operation but it requires more storage, basically one element per list.
Which of these you choose depends on the available memory and the element size.
For example, if you have 100M memory available to you and the element size is 100K, you can use the latter. That's because, for a 2G file, you need 20 groups (of 100M each) for the sort phase which means a proper N-way merge will need 100K by 20, or about 2M, well under your memory availability.
Alternatively, let's say you only have 1M available. That will be about 2000 (2G / 1M) groups and multiplying that by 100K gives 200M, well beyond your capacity.
So you would have to do that merge in multiple passes. Keep in mind though that it doesn't have to be multiple passes merging two lists.
You could find a middle ground where for example each pass merges ten lists. Ten groups of 100K is only a meg so will fit into your memory constraint and that will result in fewer merge passes.
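A small Java sketch of that sizing arithmetic (the numbers simply reproduce the 100M and 1M examples above; the names are illustrative):

// Hedged sketch of the sizing arithmetic above: how many chunks the sort phase produces,
// the largest merge fan-in the memory budget allows, and how many merge passes that implies.
public class MergePlanner {

    public static void plan(long fileBytes, long memoryBytes, long elementBytes) {
        long chunks = (fileBytes + memoryBytes - 1) / memoryBytes; // sort-phase chunks
        long maxFanIn = Math.max(2, memoryBytes / elementBytes);   // elements held at once
        int passes = (int) Math.ceil(Math.log(chunks) / Math.log(maxFanIn));
        System.out.printf("chunks=%d, fan-in=%d, merge passes=%d%n", chunks, maxFanIn, passes);
    }

    public static void main(String[] args) {
        long GB = 1L << 30, MB = 1L << 20, KB = 1L << 10;
        plan(2 * GB, 100 * MB, 100 * KB); // ~21 chunks, fan-in 1024, 1 pass
        plan(2 * GB, 1 * MB, 100 * KB);   // ~2048 chunks, fan-in 10, 4 passes
    }
}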
The merging process is much simpler than that. You'll be outputting the result to a new file, but you basically only need constant memory: you only read one element from each input file at a time.

Reducing the granularity of a data set

I have an in-memory cache which stores a set of information by a certain level of aggregation - in the Students example below let's say I store it by Year, Subject, Teacher:
# Students Year Subject Teacher
1 30 7 Math Mrs Smith
2 28 7 Math Mr Cork
3 20 8 Math Mrs Smith
4 20 8 English Mr White
5 18 8 English Mr Book
6 10 12 Math Mrs Jones
Now unfortunately my cache doesn't have GROUP BY or similar functions - so when I want to look at things at a higher level of aggregation, I will have to 'roll up' the data myself. For example, if I aggregate Students by Year, Subject the aforementioned data would look like so:
# Students Year Subject
1 58 7 Math
2 20 8 Math
3 38 8 English
4 10 12 Math
My question is thus - how would I best do this in Java? Theoretically I could be pulling back tens of thousands of objects from this cache, so being able to 'roll up' these collections quickly may become very important.
My initial (perhaps naive) thought would be to do something along the following lines: until I exhaust the list of records, each 'unique' record that I come across is added as a key to a hashmap; if I encounter a record that has the same data for this new level of aggregation, I add its quantity to the existing entry.
Now for all I know this is a fairly common problem and there are much better ways of doing this. So I'd welcome any feedback as to whether I'm pointing myself in the right direction.
"Get a new cache" not an option I'm afraid :)
-Dave.
Your "initial thought" isn't a bad approach. The only way to improve on it would be to have an index for the fields on which you are aggregating (year and subject). (That's basically what a dbms does when you define an index.) Then your algorithm could be recast as iterating through all index values; you wouldn't have to check the results hash for each record.
Of course, you would have to build the index when populating the cache and maintain it as data is modified.
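For what it's worth, a minimal Java sketch of that hashmap roll-up (the record type and field names are invented for illustration; your cached objects will differ):

import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the roll-up described above: group the cached records by the coarser
// key (year, subject) and sum the student counts into a single total per key.
public class RollUp {

    record CachedRecord(int students, int year, String subject, String teacher) {}

    public static Map<String, Integer> byYearAndSubject(Iterable<CachedRecord> records) {
        Map<String, Integer> totals = new HashMap<>();
        for (CachedRecord r : records) {
            String key = r.year() + "|" + r.subject();     // composite aggregation key
            totals.merge(key, r.students(), Integer::sum); // add to the existing total, if any
        }
        return totals;
    }
}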

Is there a way in which I can give the following matrix as input to a k-means clustering program?

Imagine I have the following "Pageview matrix":
COLUMN HEADINGS: books placement resources br aca
Each row represents a session
So this is my matrix (a sample):
4 5 0 2 2
1 2 1 7 3
1 3 6 1 6
saved in a .txt file.
Can I give this as input to a k-means program and obtain clusters based on the highest frequency of occurrence? How do I use it?
Can I give this as input to a k-means program and obtain clusters based on the highest frequency of occurrence?
This is not what k-means does.
You can feed it to a k-means algorithm, though. Each row is just a point in a 5-dimensional space - what part are you having trouble with?
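If part of the trouble is simply getting the matrix out of the .txt file, here is a minimal Java sketch (the file name and class name are mine) that turns each session row into a 5-dimensional point ready to hand to any k-means implementation:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hedged sketch: read the whitespace-separated pageview matrix from the .txt file into
// double[] rows, i.e. one 5-dimensional point per session, which is the input shape a
// k-means implementation expects.
public class PageviewMatrixLoader {

    public static List<double[]> load(Path file) throws IOException {
        return Files.readAllLines(file).stream()
                .filter(line -> !line.isBlank())
                .map(line -> {
                    String[] tokens = line.trim().split("\\s+");
                    double[] point = new double[tokens.length];
                    for (int i = 0; i < tokens.length; i++) {
                        point[i] = Double.parseDouble(tokens[i]);
                    }
                    return point;
                })
                .toList();
    }

    public static void main(String[] args) throws IOException {
        // Each element of the list can then be handed to a k-means library as one point.
        List<double[]> points = load(Path.of("pageviews.txt"));
        System.out.println(points.size() + " sessions loaded");
    }
}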
