Using libsvm in Java for String classification

Looking around, I was not able to find a good way to use libsvm with Java, and I still have some open questions:
1) Is it possible to use only libsvm, or do I also have to use weka? What's the difference, if any?
2) When the data are Strings, how can I pass the training set as Strings? I used matlab for a similar protein classification problem, and there I just gave the strings to the machine without any problem. Is there a way to do this in Java?
Here is an incomplete example of what I did in matlab (it works):
[~,posTrain] = fastaread('dataset/1.25.1.3_d1ilk__.pos-train.seq');
[~,posTest] = fastaread('dataset/1.25.1.3_d1ilk__.pos-test.seq');
trainKernel = spectrumKernel(trainData,k);
testKernel = spectrumKernel(testData,k);
trainKf =[(1:length(trainData))', trainKernel];
testKf = [(1:length(testData))', testKernel];
disp('custom');
model = libsvmtrain(trainLabel,trainKf,'-t 4');
[~, accuracy, ~] = libsvmpredict(testLabel,testKf,model)
As you can see, I read the files in fasta format and feed them to libsvm, but libsvm for Java looks like it wants something called Node that is made of doubles. What I did was take the byte[] from the String and then transform the bytes into Doubles. Is that correct?
3) How do I use a custom kernel? I've found this line of code:
KernelManager.setCustomKernel(custom_kernel);
but I can't find it in my libsvm.jar. Which library do I have to use?
Sorry for the multiple questions; I hope you can give me a brief overview of what is going on here.
Thanks.

Please note that I've used LIBSVM for MATLAB, but not for Java. I can only really answer question 1, but hopefully this still helps:
It is definitely possible to use libsvm on its own; the code is located here: https://www.csie.ntu.edu.tw/~cjlin/libsvm/. Note that jlibsvm is a port of libsvm, and it seems to be easier to use and better optimized for Java. As far as I can tell, weka just has a wrapper class that runs libsvm anyway (it even requires the libsvm.jar), though I mainly based that on this: https://weka.wikispaces.com/LibSVM.
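On question 2, here is a rough sketch of what the MATLAB snippet in the question seems to be doing, written in plain Java with no libsvm dependency. It assumes spectrumKernel is the usual k-mer spectrum kernel (the dot product of k-mer counts), which the question does not actually show, and the class and method names are made up for illustration. Each output row has the same layout as [(1:n)', K]: the serial number first, then the kernel values against the training set. As far as I know, in libsvm's Java port each entry would then go into an svm_node (its index and value fields), with kernel_type set to svm_parameter.PRECOMPUTED, which corresponds to '-t 4'.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of libsvm: builds precomputed-kernel rows from Strings.
class SpectrumKernel {
    /** Counts every substring of length k (k-mer) in the sequence. */
    static Map<String, Integer> kmerCounts(String seq, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Spectrum kernel (assumed definition): dot product of the two k-mer count vectors. */
    static double kernel(String a, String b, int k) {
        Map<String, Integer> ca = kmerCounts(a, k);
        Map<String, Integer> cb = kmerCounts(b, k);
        double sum = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            sum += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        return sum;
    }

    /**
     * Column 0 holds the 1-based serial number of the row,
     * columns 1..m hold K(rows[i], train[j]) for every training sequence.
     */
    static double[][] precomputedRows(String[] rows, String[] train, int k) {
        double[][] out = new double[rows.length][train.length + 1];
        for (int i = 0; i < rows.length; i++) {
            out[i][0] = i + 1;
            for (int j = 0; j < train.length; j++) {
                out[i][j + 1] = kernel(rows[i], train[j], k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] train = {"ABAB", "BABA", "AAAA"};
        for (double[] row : precomputedRows(train, train, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For test data the rows would be the test sequences and the columns still the training sequences. Turning the raw bytes of a String into doubles gives the SVM numbers to work with, but it is not the same thing as feeding it a string kernel like this one.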

Related

Apache Solr - How to index source code files

I want to write a program which is able to search in source code files for specific patterns ... in other words: the input is a piece of code for example:
int fib (int i) {
int pred, result, temp;
pred = 1;
result = 0;
while (i > 0) {
temp = pred + result;
result = pred;
pred = temp;
i = i-1;
}
return(result);
}
The output is the set of files that contain this piece of code or similar code.
In the open source world, code is reused in other projects. Libraries in particular are often copied into projects. To make bug fixing easier, I need to be able to know in which projects specific libraries or pieces of code are used.
Therefore I want to try Apache Solr. I don't know if it's a good idea (I would be happy about anything that could help me).
My plan is to index my source code files, so I need some tools to tokenize source code files, i.e. to give me all the names of functions, variables etc. I can use that output to feed the Solr index. But I am not sure: maybe Apache Solr already has a tokenizer or a DataImportHandler that does the trick?

I am not sure if this can be done using solr, since different projects may use different naming conventions.
Have a look at the link below if it helps:
Tools for Code Search

Apache Solr is probably not the best option here. You have more of a tree/graph comparison problem than a string comparison problem. I'd recommend using specialized tools for that.
If you do want to do it by hand, you basically need a parser with tree traversal API or some other way to get the stream/tree of tokens. This would very much depend on the language you are parsing. Something like ANTLR might be one way to go if it has the grammar for your language.
Alternatively, you could extract the information from the compiled code, if it is structured enough. For Java, something like ASM may do the job.
But you would still have to figure out the representation. Answering - to yourself - the question of how do I know these two pieces of code are similar should be the right first step.
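To make the "stream of tokens" idea a bit more concrete, here is a very rough Java sketch that pulls identifier names out of C-like source text so they could be fed to an index. It is only a regex plus a short, deliberately incomplete keyword list, both made up for this example; it knows nothing about comments, string literals or scoping, so a real parser (ANTLR or similar) would still be the better tool.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a crude stand-in for a real tokenizer/parser.
class IdentifierExtractor {
    // Small, deliberately incomplete keyword list for C-like code.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "int", "while", "return", "if", "else", "for", "void", "char", "double"));

    static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Returns the distinct non-keyword identifiers in order of first appearance. */
    static Set<String> identifiers(String source) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = IDENT.matcher(source);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "int fib (int i) { int pred, result; while (i > 0) { i = i-1; } return(result); }";
        System.out.println(identifiers(code));
    }
}
```

Two files that share many identifiers are not necessarily copies of each other, so this only gives candidates; deciding what "similar" means is still the hard part.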

What language is this (think it's Java?), and how do I test (using a browser ide) the math is correct in it?

div(1, sum(1, exp(sum(div(5, product(100, .1)), -5))))
I'm using this in a Solr query, and want to verify that it is the same as :
Where x is 5.
Is this language Java?
If it is, why am I getting this output here:
http://ideone.com/LWYWtU
If it isn't, what language is this and how do I test it?
Thanks in advance for your help.
EDIT: To add more of the surrounding code, here is the full boost value I'm sending to Solr:
if(exists(query({!frange l=0 u=60 v=product(geodist(),0.621371)})),div(1, sum(1, exp(sum(div(product(5), product(100, .1)), -5)))),0)
The reason I think it might be Java is that the docs say "Most Java Math functions are now supported, including:" and then list the math functions I ended up using.

Solr is written in Java, but that's not relevant here: this is a set of functions that Solr parses and evaluates itself (they are not related to Java, except that the backing functions are implemented in Java).
As far as I can tell, you've mapped the functions correctly, as long as the 5 in product(5) is the same as x. You shouldn't need product there, as the value can be passed to div directly.
A way to validate it would be to use debugQuery in Solr, see what the value evaluates to, and then compare it to your own value. Remember that floating point evaluation can introduce small inaccuracies.
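If you want a number to compare the debugQuery output against, you can mirror the expression in plain Java. The helpers below are just stand-ins named after the Solr functions used in the question (they are not Solr code), and the closed form is simply what the nesting works out to, 1 / (1 + e^(x/10 - 5)); whether that matches the formula you had in mind is for you to check.

```java
// Hypothetical stand-ins for the Solr functions, so the expression can be copied almost verbatim.
class SolrFunctionCheck {
    static double div(double a, double b) { return a / b; }
    static double sum(double a, double b) { return a + b; }
    static double product(double a, double b) { return a * b; }
    static double exp(double a) { return Math.exp(a); }

    /** Same nesting as the Solr expression, with x in place of the literal 5. */
    static double boost(double x) {
        return div(1, sum(1, exp(sum(div(x, product(100, .1)), -5))));
    }

    /** What the nesting works out to: 1 / (1 + e^(x/10 - 5)). */
    static double closedForm(double x) {
        return 1.0 / (1.0 + Math.exp(x / 10.0 - 5.0));
    }

    public static void main(String[] args) {
        System.out.println(boost(5) + " vs " + closedForm(5));
    }
}
```

For x = 5 both come out at roughly 0.989. The two values can differ in the last digits because 100 * .1 is not exactly 10 in floating point, which is the kind of small discrepancy to expect when comparing against Solr as well.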

Explain the functionality of JSON

I think it is easier to understand why I have so many problems with JSON if I explain what my goal is:
I work with Google's App Engine. There I want to store data. The data looks like
user - username
question - question
date1 - date1
date2 - date2
An Android app has the "simple" job of sending the data the user has entered and receiving the data from the complete database.
Ok, fine.
So I searched for a good "API" for that. The questions were "how can I read the data" and "how can I send it". The "simple" answer was: use JSON. Many people said that to me.
The first step was to show the data from the database. In Python I wrote:
json.dumps({"info": [{'user': 'username1', 'question': 'question1', 'date1':'date1', 'date2':'date1'}, {'user': 'username2', 'question': 'question2', 'date1':'date2', 'date2':'date2'}]})
It works. On the client side I wrote this in Java:
JSONObject ob = new JSONObject(result);
JSONArray arNames = ob.getJSONArray("info");
for(int i = 0; i < arNames.length(); i++){
JSONObject c = arNames.getJSONObject(i);
Log.i("name", c.getString("user"));
Log.i("frage", c.getString("question"));
}
This also works.
But (and now the main question of the thread!):
Why do we use JSON as the format? Why? I can build another simple "API" for this data without the JSON libraries and classes.
Example:
Suppose that on the server side I only send:
!user:user;question:question;date1:date1;date2:date2
!user:user1;question:question1;date1:date3;date2:date3
... and so on...
On the client side the same:
[READ THE DATA WITH ClientHTTP]
String[] all = result.split("!");
// all[0] is empty because the data starts with "!", so start at 1
for (int i = 1; i < all.length; i++)
{
String[] fields = all[i].split(";");
String[] user = fields[0].split(":");
// user[1] now holds the user
String[] question = fields[1].split(":");
// question[1] now holds the question
... AND SO ON!
So, why should I use JSON? My version does the same, but with my own syntax.
Thank you for your help

JSON is a standard format and its implementations make it easy to use: no split() and other stuff necessary. Also, it's supported by all kinds of programming languages (like Python and Java in your own example), so it provides a simple way to exchange data between completely different systems.
And it's well thought out: it could, for example, also handle questions with ':' or ';' in them, a case where your suggested solution would fail.
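To see the ':' and ';' problem in practice, here is a small Java sketch of the split-based approach applied to one record. The class name and the sample record are made up for the example; the parsing logic is just the idea from the question, written out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the hand-made "key:value;key:value" format from the question.
class NaiveFormat {
    static Map<String, String> parseRecord(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : record.split(";")) {
            String[] kv = pair.split(":");
            // Anything after a second ':' is silently dropped.
            fields.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return fields;
    }

    public static void main(String[] args) {
        // The question text itself contains ':' and ';'.
        Map<String, String> parsed = parseRecord("user:anna;question:Which is faster: A; or B?");
        System.out.println(parsed.get("question")); // prints "Which is faster"
        System.out.println(parsed.size());          // prints 3, although only 2 fields were sent
    }
}
```

The question is cut off at the first ':' and the rest turns into a bogus third field. A JSON library quotes and escapes the values for you, so this cannot happen.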

I am not an expert on JSON, but there already is a thread explaining JSON (Google knows everything). Maybe you can find some help here:
What is JSON and why would I use it?
http://www.copterlabs.com/blog/json-what-it-is-how-it-works-how-to-use-it/
EDIT: I forgot to answer the question of why not to use your own function. Of course you can use it, and it works. But a lot of services hand you JSON; it is like a standard. Furthermore, there is a Java class for it, so you do not have to do the work that others have already done (see: http://goo.gl/9X4HU)
Best regards

Don't do it by hand; it's error-prone and violates DRY (don't repeat yourself). Instead:
On the server, use a REST framework that automatically produces JSON, for example RESTEasy. Search the net for examples.
On Android, use either the built-in support for JSON or, better, one of the well-known and tested libs: GSON or Jackson. See some speed comparisons. Alternatively, you can use Spring Android, which combines networking and JSON in one easy-to-use package.

I use JSON in Android because it is a lightweight data format which I can easily convert to Java objects using this Google library.
You always have 2 possibilities - to use some library, or to write the code by yourself. I'm not saying that using the library is always an option, but in many cases it can save your time and reduce errors. It's up to you to decide.

Java API for plural forms of English words

Are there any Java API(s) which will provide plural form of English words (e.g. cacti for cactus)?

Check Evo Inflector, which implements an English pluralization algorithm based on Damian Conway's paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports 100% success rate for 1000 most used English words and 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy, you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Take into account that, due to the open nature of Wiktionary, some data there might be incorrect.
Example Usage:
English.plural("Facility", 1); // == "Facility"
English.plural("Facility", 2); // == "Facilities"

jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"

I know there is a simple pluralize() function in Ruby on Rails; maybe you could get at that through JRuby. The problem really isn't easy: I have seen pages of rules on how to pluralize, and they weren't even complete. Some rules are not algorithmic; they depend on the origin of the stem etc., which isn't easily obtained. So you have to decide how perfect you want to be.
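Just to illustrate the split between algorithmic rules and the ones you can only look up, here is a toy Java sketch. The class name, the handful of suffix rules and the three-entry irregular table are all made up for this example and nowhere near complete, so treat it as a picture of the idea, not as a usable pluralizer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy example only: a few suffix rules plus a lookup table for irregular words.
class ToyPluralizer {
    static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("cactus", "cacti"); // depends on the Latin origin of the stem
        IRREGULAR.put("goose", "geese");
        IRREGULAR.put("deer", "deer");
    }

    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    /** Returns a lower-cased plural guess for the given singular word. */
    static String plural(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (IRREGULAR.containsKey(w)) {
            return IRREGULAR.get(w);
        }
        int n = w.length();
        // consonant + y -> ies (facility -> facilities, but day -> days)
        if (n > 1 && w.charAt(n - 1) == 'y' && !isVowel(w.charAt(n - 2))) {
            return w.substring(0, n - 1) + "ies";
        }
        // sibilant endings take -es (box -> boxes)
        if (w.endsWith("s") || w.endsWith("x") || w.endsWith("z")
                || w.endsWith("ch") || w.endsWith("sh")) {
            return w + "es";
        }
        return w + "s";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"cactus", "Facility", "box", "day", "apple"}) {
            System.out.println(s + " -> " + plural(s));
        }
    }
}
```

Every word the rules get wrong has to go into the table, which is why the real libraries ship long exception lists.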

For Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or google for "inflector" and "randall hauch".

It's hard to find this kind of API. Instead, you need to find some web service that can serve your purpose. Check this. I am not sure if it will help you.
(I tried entering the word cacti and got cactus somewhere in the response.)

If you can harness JavaScript, I created a lightweight (7.19 KB) JavaScript library for this. Or you could port my script over to Java. Very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: It looks like cacti to cactus is a super special conversion (most ppl are going to say '1 cactus' anyway). Easy to add that if you want to. The source code is easy to read / update.

Wolfram|Alpha returns a list of inflected forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/

How do I convert a Java Hashtable to an NSDictionary (obj-C)?

At the server end (GAE), I've got a java Hashtable.
At the client end (iPhone), I'm trying to create an NSDictionary.
myHashTable.toString() gets me something that looks darned-close-to-but-not-quite-the-same-as [myDictionary description]. If they were the same, I could write the string to a file and do:
NSDictionary *dict = [NSDictionary dictionaryWithContentsOfFile:tmpFile];
I could write a little parser in obj-C to deal with myHashtable.toString(), but I'm sort-of hoping that there's a shortcut already built into something, somewhere -- I just can't seem to find it.
(So, being a geek, I'll spend far longer searching the web for a shortcut than it would take me to write & debug the parser... ;)
Anyway -- hints?
Thanks!

I would convert the Hashtable into something JSON-like and take it on the iPhone side.
Hashtable.toString() is not ideal; it will have problems with spaces, commas and quotation marks.
For JSON-to-NSDictionary, you can find the json-framework tools under http://www.json.org/

As j-16 SDiZ mentioned, you need to serialize your hashtable. It can be to json, xml or some other format. Once serialized, you need to deserialize them into an NSDictionary. JSON is probably the easiest format to do this with plenty of libraries for both Objective-C and Java. http://json.org has a list of libraries.
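For a flat Hashtable of Strings, the serialization step on the Java side could look roughly like the sketch below. It is hand-rolled only so that it stays self-contained; the class and method names are made up, it handles nothing but String keys and values, and in a real project one of the libraries listed on json.org is the safer choice.

```java
import java.util.Hashtable;
import java.util.Map;

// Hypothetical helper: turns a flat String-to-String table into a JSON object.
class HashtableToJson {
    /** Escapes the characters that must not appear raw inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // remaining control characters become four-digit hex escapes
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Entry order follows the map's own iteration order, which Hashtable does not define. */
    static String toJson(Map<String, String> table) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : table.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Hashtable<String, String> table = new Hashtable<>();
        table.put("say", "He said \"hi\", ok");
        System.out.println(toJson(table));
    }
}
```

Unlike toString(), the quotes, commas and spaces inside the values survive, and the client can hand the result to any JSON parser to get a dictionary back.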
