Apache Solr - How to index source code files - java

I want to write a program that can search source code files for specific patterns. In other words, the input is a piece of code, for example:
int fib (int i) {
int pred, result, temp;
pred = 1;
result = 0;
while (i > 0) {
temp = pred + result;
result = pred;
pred = temp;
i = i-1;
}
return(result);
}
The output is the set of files that contain this piece of code or similar code.
In the open source world, code is reused across projects; libraries in particular are often copied into projects. To make bug fixing easier, I need to know which projects use a specific library or piece of code.
Therefore I want to try Apache Solr. I don't know if it's a good idea (I would be happy about anything that could help me).
My plan is to index my source code files, so I need some tool to tokenize them, i.e. to give me all names of functions, variables, etc. I can use that output to feed the Solr index. But I am not sure: maybe Solr already has a tokenizer or DataImportHandler that does the trick?

I am not sure if this can be done using solr, since different projects may use different naming conventions.
Have a look at the link below if it helps:
Tools for Code Search

Apache Solr is probably not the best option here. You have more of a tree/graph comparison problem than a string comparison problem. I'd recommend using specialized tools for that.
If you do want to do it by hand, you basically need a parser with a tree traversal API, or some other way to get the stream/tree of tokens. This depends very much on the language you are parsing. Something like ANTLR might be one way to go if a grammar exists for your language.
Alternatively, you could extract the information from the compiled code, if it is structured enough. For Java, something like ASM may do the job.
But you would still have to figure out the representation. Answering, for yourself, the question "how do I know these two pieces of code are similar?" is the right first step.
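As a rough illustration of one possible answer to that question, the sketch below treats two snippets as similar when their token streams match after every identifier is replaced by ID and every number by NUM. This is only an assumption about what "similar" could mean, not an established method from the answers above: the class name, the tiny keyword list, the regex tokenizer (which ignores comments and string literals) and the 3-token window are all made up for the example. A real solution would use a proper parser such as the ones mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSimilarity {
    // Identifier or keyword, integer literal, or any single non-space character.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z_]\\w*|\\d+|\\S");
    // Deliberately tiny keyword list (assumption: just enough for the fib example).
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("int", "while", "return", "if", "else", "for"));

    // Replaces identifiers by "ID" and numbers by "NUM" so renamed variables still match.
    static List<String> normalize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (Character.isDigit(first)) {
                tokens.add("NUM");
            } else if (Character.isLetter(first) || first == '_') {
                tokens.add(KEYWORDS.contains(t) ? t : "ID");
            } else {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // All windows of n consecutive tokens, joined by spaces.
    static Set<String> shingles(List<String> tokens, int n) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    // Jaccard similarity of the 3-token shingle sets, between 0.0 and 1.0.
    static double similarity(String a, String b) {
        Set<String> sa = shingles(normalize(a), 3);
        Set<String> sb = shingles(normalize(b), 3);
        if (sa.isEmpty() && sb.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }
}
```

The shingle strings could also be fed to a Solr index as terms, but whether that scales or ranks well is untested here.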

Related

How to modify a large file with small content changes at specific indexes

I need to modify a file. We've already written a reasonably complex component to build sets of indexes describing where interesting things are in this file, but now I need to edit this file using that set of indexes and that's proving difficult.
Specifically, my dream API is something like this
//if you'll let me use kotlin for a second, assume we have a simple tuple class
data class IdentifiedCharacterSubsequence(val indexOfFirstChar: Int, val existingContent: String)
//given these two structures
List<IdentifiedCharacterSubsequence> interestingSpotsInFile = scanFileAsPerExistingBusinessLogic(file, businessObjects);
Map<IdentifiedCharacterSubsequence, String> newContentByPreviousContentsLocation = generateNewValues(interestingSpotsInFile, moreBusinessObjects);
//I want something like this:
try(MutableFile mutableFile = new com.maybeGoogle.orApache.MutableFile(file)){
for(IdentifiedCharacterSubsequence seqToReplace : interestingSpotsInFile){
String newContent = newContentByPreviousContentsLocation.get(seqToReplace);
mutableFile.replace(seqToReplace.indexOfFirstChar, seqToReplace.existingContent.length(), newContent);
//very similar to StringBuilder interface
//'enqueues' data changes in memory, doesnt actually modify file until flush call...
}
mutableFile.flush();
// ...at which point a single write-pass is made.
// assumption: changes will change many small regions of text (instead of large portions of text)
// -> buffering makes sense
}
Some notes:
I can't use RandomAccessFile because my changes are not in-place (newContent may be longer or shorter than seq.existingContent).
The files are often many megabytes in size, so simply reading the whole thing into memory and modifying it as an array is not appropriate.
Does something like this exist, or am I reduced to writing my own implementation using BufferedWriters and the like? It seems like such an obvious evolution from io.Streams for a language that typically emphasizes index-based behaviour so heavily, but I can't find an existing implementation.
Lastly: I have very little domain experience with files and encoding schemes, so I have made no effort to address the 'two-index' characters described in questions like this one: Java charAt used with characters that have two code units. Any help on this front is much appreciated. Is this perhaps why I'm having trouble finding an implementation like this? Because indexes in UTF-8 encoded files are so pesky and bug-prone?
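For illustration only, here is a minimal sketch of the single write-pass idea using plain java.io. It is not an existing library: StreamingReplacer, Edit and applyToString are hypothetical names, and the sketch assumes the edits are sorted by start index, do not overlap, and use indexes counted in Java chars (UTF-16 code units) of the decoded text, i.e. the same units as String.length().

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

public class StreamingReplacer {
    // Hypothetical edit record: replace `length` chars starting at char index `start`.
    static final class Edit {
        final long start;
        final int length;
        final String replacement;

        Edit(long start, int length, String replacement) {
            this.start = start;
            this.length = length;
            this.replacement = replacement;
        }
    }

    // Streams `in` to `out` once, swapping in the replacements on the way.
    // Only one buffer of text is held in memory at a time.
    static void apply(Reader in, Writer out, List<Edit> edits) throws IOException {
        char[] buf = new char[8192];
        long pos = 0; // number of chars consumed from `in` so far
        for (Edit e : edits) {
            if (e.start < pos) throw new IllegalArgumentException("edits overlap or are unsorted");
            long toCopy = e.start - pos; // unchanged text before this edit
            while (toCopy > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, toCopy));
                if (n == -1) throw new EOFException("edit starts past end of input");
                out.write(buf, 0, n);
                toCopy -= n;
            }
            long toSkip = e.length; // old content that is being replaced
            while (toSkip > 0) {
                long n = in.skip(toSkip);
                if (n <= 0) throw new EOFException("edit extends past end of input");
                toSkip -= n;
            }
            out.write(e.replacement);
            pos = e.start + e.length;
        }
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n); // remaining tail
    }

    // Convenience wrapper for trying the logic on in-memory strings.
    static String applyToString(String text, List<Edit> edits) {
        try {
            StringWriter out = new StringWriter();
            apply(new StringReader(text), out, edits);
            return out.toString();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        List<Edit> edits = Arrays.asList(new Edit(0, 5, "goodbye"), new Edit(12, 3, "x"));
        System.out.println(applyToString("hello world foo", edits)); // prints "goodbye world x"
    }
}
```

For real files, one would presumably pass a BufferedReader and BufferedWriter opened with an explicit charset, write to a temporary file, and then move it over the original. Because a Reader works on decoded chars rather than bytes, the UTF-8 byte offsets never appear; a character outside the Basic Multilingual Plane still counts as two chars, which is consistent as long as the indexes were computed on a Java String of the same text.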

Using libsvm in Java for String classification

Looking around I was not able to find a good way to use libsvm with Java and I still have some open questions:
1) Is it possible to use only libsvm, or do I also have to use weka? If both work, what's the difference?
2) When using String data, how can I pass the training set as Strings? I used matlab for a similar problem (protein classification), and there I just gave the strings to the machine without any problem. Is there a way to do this in Java?
Here is an incomplete example of what I did in matlab (it works):
[~,posTrain] = fastaread('dataset/1.25.1.3_d1ilk__.pos-train.seq');
[~,posTest] = fastaread('dataset/1.25.1.3_d1ilk__.pos-test.seq');
trainKernel = spectrumKernel(trainData,k);
testKernel = spectrumKernel(testData,k);
trainKf =[(1:length(trainData))', trainKernel];
testKf = [(1:length(testData))', testKernel];
disp('custom');
model = libsvmtrain(trainLabel,trainKf,'-t 4');
[~, accuracy, ~] = libsvmpredict(testLabel,testKf,model)
As you can see, I read the files in fasta format and feed them to libsvm, but libsvm for Java seems to want something called Node that is made of doubles. What I did was take the byte[] from the String and convert the bytes to Double. Is that correct?
3) How do I use a custom kernel? I've found this line of code
KernelManager.setCustomKernel(custom_kernel);
but I can't find it in my libsvm.jar. Which library do I have to use?
Sorry for the multiple questions; I hope you can give me a brief overview of what is going on here.
Thanks.
Please note that I've used LIBSVM for MATLAB, but not for Java. I can only really answer question 1, but hopefully this still helps:
It definitely is possible to use libsvm only, and the code is located here: https://www.csie.ntu.edu.tw/~cjlin/libsvm/. Note that jlibsvm is a port of libsvm, and it seems to be easier to use and more optimized for Java. As far as I can tell, weka just has a wrapper class that runs libsvm anyways (it even requires the libsvm.jar), though I mainly based it off of this: https://weka.wikispaces.com/LibSVM.

Need a real simple way to read json value in java

I am currently writing an application in Java, and am struggling to extract the values from a String which is in a JSON format.
Could someone help me with the simplest way to extract data from this string? I'd prefer not to use an external library if at all possible.
{"exchange":{"status":"Enabled","message":"Broadband on Fibre Technology","exchange_code":"NIWBY","exchange_name":"WHITEABBEY"},"products":[{"name":"20CN ADSL Max","likely_down_speed":1.5,"likely_up_speed":0.15,"availability":true....
Could someone explain how I could return the "likely_down_speed" of "20CN ADSL Max", for example?
Thanks
Currently, there is no way to parse JSON with the standard Java library alone; you need an external library (or your own implementation).
The org.json library is widely used for working with JSON.
You can use this snippet along with the library to achieve what you asked:
JSONObject obj = new JSONObject(" .... ");
JSONArray arr = obj.getJSONArray("products");
for (int i = 0; i < arr.length(); i++) {
String name = arr.getJSONObject(i).getString("name");
if ( name.equals("20CN ADSL Max") ) {
// the key uses underscores and its value is a number, not a string
double speed = arr.getJSONObject(i).getDouble("likely_down_speed");
}
}
Hope this helps.
For sure it's possible to do the parsing yourself, but it'll be much faster if you rely upon an existing library such as org.json.
With that, you can easily convert the string into a JSON object and extract all the fields you need.
If an existing library is not an option, you'll need to build the tree describing the object yourself in order to extract the key-value pairs.
While this may seem like a very simple, straightforward task, it gets rather complicated rather quickly.
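To give a feel for what "building the tree yourself" involves, here is a minimal recursive-descent sketch. TinyJson is a hypothetical class written for this illustration, not part of any library. It maps objects to Map, arrays to List, numbers to Double, and deliberately leaves out \u escapes, strict number validation and useful error recovery, which is exactly where the complications start.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative JSON reader; not a complete or hardened parser.
public class TinyJson {
    private final String src;
    private int pos = 0;

    private TinyJson(String src) { this.src = src; }

    // Returns a Map (object), List (array), String, Double, Boolean or null.
    public static Object parse(String json) {
        TinyJson p = new TinyJson(json);
        Object value = p.readValue();
        p.skipWhitespace();
        if (p.pos != json.length()) throw p.error("trailing characters");
        return value;
    }

    private Object readValue() {
        skipWhitespace();
        char c = peek();
        if (c == '{') return readObject();
        if (c == '[') return readArray();
        if (c == '"') return readString();
        if (src.startsWith("true", pos)) { pos += 4; return Boolean.TRUE; }
        if (src.startsWith("false", pos)) { pos += 5; return Boolean.FALSE; }
        if (src.startsWith("null", pos)) { pos += 4; return null; }
        return readNumber();
    }

    private Map<String, Object> readObject() {
        Map<String, Object> map = new LinkedHashMap<>();
        expect('{');
        skipWhitespace();
        if (peek() == '}') { pos++; return map; }
        while (true) {
            skipWhitespace();
            String key = readString();
            skipWhitespace();
            expect(':');
            map.put(key, readValue());
            skipWhitespace();
            if (peek() == ',') { pos++; continue; }
            expect('}');
            return map;
        }
    }

    private List<Object> readArray() {
        List<Object> list = new ArrayList<>();
        expect('[');
        skipWhitespace();
        if (peek() == ']') { pos++; return list; }
        while (true) {
            list.add(readValue());
            skipWhitespace();
            if (peek() == ',') { pos++; continue; }
            expect(']');
            return list;
        }
    }

    private String readString() {
        expect('"');
        StringBuilder sb = new StringBuilder();
        while (peek() != '"') {
            char c = src.charAt(pos++);
            if (c != '\\') { sb.append(c); continue; }
            char esc = peek();
            pos++;
            if (esc == 'u') throw error("\\u escapes are not handled in this sketch");
            int idx = "ntrbf".indexOf(esc);
            sb.append(idx >= 0 ? "\n\t\r\b\f".charAt(idx) : esc); // default covers \" \\ \/
        }
        pos++; // consume the closing quote
        return sb.toString();
    }

    private Double readNumber() {
        int start = pos;
        while (pos < src.length() && "+-0123456789.eE".indexOf(src.charAt(pos)) >= 0) pos++;
        if (start == pos) throw error("unexpected character");
        return Double.valueOf(src.substring(start, pos));
    }

    private char peek() {
        if (pos >= src.length()) throw error("unexpected end of input");
        return src.charAt(pos);
    }

    private void expect(char c) {
        if (peek() != c) throw error("expected '" + c + "'");
        pos++;
    }

    private void skipWhitespace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos);
    }
}
```

With this, the products array from the question would come back as a List of Maps, and the speed would be read with a cast such as (Double) product.get("likely_down_speed"). Even this toy version runs to dozens of lines and still skips parts of the JSON grammar.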
Check out the SO thread How to parse JSON in Java. There is unfortunately not a single, clear solution to that question as shown in that thread. But I guess the org.json library seems to be the most popular solution.
If your application needs to handle arbitrary JSON, I would advise against trying to build your own parser.
Whatever your objections are to using an external library, get over them.

Explain the functionality of JSON

To make it clear why I have so many problems with JSON, let me first explain my goal:
I work with Google's App Engine. There I want to store data. The data looks like
user - username
question - question
date1 - date1
date2 - date2
An Android app has the "simple" job of sending the data the user has entered and receiving the data from the complete database.
OK, fine.
So I searched for a good "API" for that. The questions were "how can I read the data?" and "how can I send it?". The "simple" answer was: use JSON. Many people told me that.
The first step was to show the data from the database. In Python I wrote:
json.dumps({"info": [{'user': 'username1', 'question': 'question1', 'date1':'date1', 'date2':'date1'}, {'user': 'username2', 'question': 'question2', 'date1':'date2', 'date2':'date2'}]})
It works. On the client side I wrote this in Java:
JSONObject ob = new JSONObject(result);
JSONArray arNames = ob.getJSONArray("info");
for(int i = 0; i < arNames.length(); i++){
JSONObject c = arNames.getJSONObject(i);
Log.i("name", c.getString("user"));
Log.i("frage", c.getString("question"));
}
This works too.
But (and now the main question of this thread!):
Why do we use JSON as the format? I could exchange this data through another simple "API", without the JSON libraries and classes.
Example:
Suppose the server side only sends:
!user:user;question:question;date1:date1;date2:date2
!user:user1;question:question1;date1:date3;date2:date3
... and so on ...
The client side does the same in reverse:
[READ THE DATA WITH ClientHTTP]
String[] all = result.split("!");
// all[0] is the empty string before the first '!', so start at 1
for (int i = 1; i < all.length; i++)
{
String[] split2 = all[i].split(";");
String[] user = split2[0].split(":");
// user[1] now holds the user
String[] question = split2[1].split(":");
// question[1] now holds the question
... AND SO ON!
So why should I use JSON? My approach does the same thing, just with my own syntax.
Thank you for your help.
JSON is a standard format and its implementations make it easy to use: no split() and other stuff necessary. Also, it's supported by all kinds of programming languages (like Python and Java in your own example), so it provides a simple way to exchange data between completely different systems.
It's also well thought out and can, for example, handle questions containing ':' or ';', a case where your suggested solution would fail.
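The failure case is easy to reproduce. The sketch below uses hypothetical helper names and reduces the question's format to two fields. It encodes a record in the ad-hoc format and decodes it with split(), as the client code in the question does:

```java
public class DelimiterPitfall {
    // Encodes one record in the question's ad-hoc "key:value;key:value" format.
    static String encode(String user, String question) {
        return "user:" + user + ";question:" + question;
    }

    // Decodes the question field the same way the question's client code does.
    static String decodeQuestion(String record) {
        String[] fields = record.split(";");
        return fields[1].split(":")[1];
    }

    public static void main(String[] args) {
        // Harmless text survives the round trip.
        System.out.println(decodeQuestion(encode("bob", "What is JSON?")));       // What is JSON?
        // Text containing the delimiters is silently truncated.
        System.out.println(decodeQuestion(encode("bob", "Why: this; or that?"))); // Why
    }
}
```

Fixing this means inventing escaping rules for ':' and ';', at which point you are designing a new serialization format, which JSON libraries already do for you.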
I am no JSON expert, but there is already a thread explaining JSON (Google knows everything). Maybe you can find some help here:
What is JSON and why would I use it?
http://www.copterlabs.com/blog/json-what-it-is-how-it-works-how-to-use-it/
EDIT: I forgot to answer the question of why not to use your own function. Of course you can use it, and it works. But a lot of services hand you JSON; it is something like a standard. Furthermore, there is a Java class for it, so you do not have to do work that others have already done (see: http://goo.gl/9X4HU)
Best regards
Don't do it by hand: it's error-prone and violates DRY (don't repeat yourself). Instead:
On the server, use a REST framework that automatically produces JSON, for example RESTEasy. Search the net for examples.
On Android, use either the built-in support for JSON or, better, one of the well-known and tested libraries: GSON or Jackson. See some speed comparisons. Alternatively you can use Spring Android, which combines networking and JSON in one easy-to-use package.
I use JSON in Android because it is lightweight data format which I can easily convert to Java objects using this google library.
You always have 2 possibilities - to use some library, or to write the code by yourself. I'm not saying that using the library is always an option, but in many cases it can save your time and reduce errors. It's up to you to decide.

Java API for plural forms of English words

Are there any Java API(s) which will provide plural form of English words (e.g. cacti for cactus)?
Check Evo Inflector, which implements an English pluralization algorithm based on Damian Conway's paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports a 100% success rate for the 1000 most used English words and a 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy, you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Take into account that, due to the open nature of Wiktionary, some data there might be incorrect.
Example Usage:
English.plural("Facility", 1); // == "Facility"
English.plural("Facility", 2); // == "Facilities"
jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"
I know there is a simple pluralize() function in Ruby on Rails; maybe you could get at that through JRuby. The problem really isn't easy: I have seen pages of rules on how to pluralize, and even they weren't complete. Some rules are not algorithmic; they depend on the origin of the stem etc., which isn't easily obtained. So you have to decide how perfect you want to be.
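To show the general shape of such rule sets (a table of exceptions consulted first, then suffix rules), here is a deliberately small sketch. SimplePluralizer and its handful of rules are assumptions made for illustration; it is not the algorithm of Rails or of any library mentioned here, and it gets many words wrong.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class SimplePluralizer {
    // Irregular forms cannot be derived by rule, so they are looked up first.
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("cactus", "cacti");
        IRREGULAR.put("goose", "geese");
        IRREGULAR.put("deer", "deer");
        IRREGULAR.put("child", "children");
    }

    // Returns a lower-cased plural using three suffix rules plus the table above.
    static String plural(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        String irregular = IRREGULAR.get(lower);
        if (irregular != null) return irregular;
        // Sibilant endings take "es": box -> boxes, church -> churches.
        if (lower.matches(".*(s|x|z|ch|sh)$")) return lower + "es";
        // Consonant + y becomes "ies": facility -> facilities (but day -> days).
        if (lower.matches(".*[^aeiou]y$")) return lower.substring(0, lower.length() - 1) + "ies";
        return lower + "s";
    }
}
```

Words such as "leaf", "man" or "criterion" already fall outside these rules, which is why the real rule sets run to pages and still need an exception list.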
For Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or google for "inflector" and "randall hauch".
It's hard to find this kind of API. Instead, you may need to find a web service that serves your purpose. Check this. I am not sure if it can help you.
(I tried the word cacti and got cactus somewhere in the response.)
If you can harness javascript, I created a lightweight (7.19 KB) javascript for this. Or you could port my script over to Java. Very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: It looks like cacti to cactus is a super special conversion (most ppl are going to say '1 cactus' anyway). Easy to add that if you want to. The source code is easy to read / update.
Wolfram|Alpha returns a list of inflected forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/
