Crawling local filesystem - how to test that

I'm planning to build an application which would crawl a part of a local filesystem (a subtree) in a depth-first-search manner and process all files it finds, except for some configurable exceptions.
To give an example, let's say I have a directory structure like this:
> documents
    - generic-doc.txt
    > mails
        - mail-01.txt
        - mail-02.txt
        - mail-03.txt
        > unread
            - mail-04.txt
    > invoices
        > paid
            - invoice-01.pdf
            - invoice-02.pdf
        > unpaid
            - invoice-03.pdf
I also have an exclusion rule like this:
exclude = "documents/mails/unread | documents/invoices"
Given these data on input, my application would process the following documents:
generic-doc.txt
mail-01.txt
mail-02.txt
mail-03.txt
(i.e. it would process all files except those located in the documents/mails/unread and documents/invoices folders)
In the future, I might need to implement various forms of exclusion rules.
What is the best way to test the implementation of the crawling module (e.g. that when given an exclusion rule, the module would return the correct set of documents)? Can it be done without using a real filesystem?

Extract the exclusion logic into a separate module/class/object and test it in isolation. Then make sure that your crawler asks the exclusion rule before processing a file.
A sketch:
import java.io.File;
public interface FileExcluder {
    // Returns true if the crawler should skip this file.
    boolean isExcluded(File aFile);
}
Note that java.io.FileFilter already provides a similar service; maybe you can reuse that abstraction.
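To illustrate testing the rule in isolation, here is a minimal sketch. The class name PrefixExclusionRule and its parsing of the "a | b" rule string are assumptions made for this example, not something the question specifies. It works on java.nio.file.Path values rather than File, and it never touches the disk, so it can be unit-tested without a real filesystem.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical rule: excludes any path at or below one of the configured directories.
class PrefixExclusionRule {
    private final List<Path> excludedRoots = new ArrayList<>();

    // Parses a rule string such as "documents/mails/unread | documents/invoices".
    PrefixExclusionRule(String rule) {
        for (String part : rule.split("\\|")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                excludedRoots.add(Paths.get(trimmed).normalize());
            }
        }
    }

    // True if the path equals an excluded directory or lies beneath it.
    // Path.startsWith compares whole name elements, so "documents/invoices-old"
    // is not matched by the root "documents/invoices".
    boolean isExcluded(Path path) {
        Path normalized = path.normalize();
        for (Path root : excludedRoots) {
            if (normalized.startsWith(root)) {
                return true;
            }
        }
        return false;
    }
}
```

A test then only needs to build the rule from a string and ask it about a handful of relative paths. New kinds of rules can get their own classes and their own tests.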

If you are using Java 7 or later, you can create a dummy FileSystem through the java.nio.file API.
You can also create an interface for all file handling operations and mock it out, but it's likely to be much simpler to create real test files, run the crawler against them, and delete them when finished.
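The "create test files and delete them afterwards" approach could look roughly like the sketch below. CrawlerSketch, crawl, createFile and deleteRecursively are hypothetical names used only for illustration, and the sketch assumes Java 8 or later for Predicate and lambdas. It builds a throwaway tree under a temporary directory, walks it depth-first with Files.walkFileTree, and skips excluded subtrees.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

class CrawlerSketch {
    // Walks the tree below base depth-first and returns the relative paths
    // (with '/' separators) of all files that are not excluded.
    static Set<String> crawl(final Path base, final Predicate<Path> excluded) throws IOException {
        final Set<String> found = new TreeSet<>();
        Files.walkFileTree(base, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                // Skip an excluded directory together with everything below it.
                return excluded.test(base.relativize(dir))
                        ? FileVisitResult.SKIP_SUBTREE
                        : FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                Path relative = base.relativize(file);
                if (!excluded.test(relative)) {
                    found.add(relative.toString().replace('\\', '/'));
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return found;
    }

    // Test helper: creates an empty file (and its parent directories) below base.
    static Path createFile(Path base, String relative) throws IOException {
        Path file = base.resolve(relative);
        Files.createDirectories(file.getParent());
        return Files.createFile(file);
    }

    // Test helper: deletes the temporary tree once the test is done.
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

In a test, you would call Files.createTempDirectory, populate it with createFile, run crawl with a predicate such as p -> p.startsWith("documents/invoices"), compare the returned set with the expected one, and finally call deleteRecursively.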

Related

How to create and use a config file to choose a path of a decision tree?

I have an application in Java that makes use of different mathematical models to classify and cluster some objects.
Currently, I am just hard coding which model I want to run and all its parameters.
I would like to set up a configuration file that allows for a certain sequence of steps to be taken. For example, here is a decision tree of a few possible paths.
model
    - model a
        - parameter a1
        - parameter a2
    - model b
        - object a
            - object_a_parameter1
            - object_a_parameter2
        - parameter b
Apriori clusters:
    - set manually
    - automatic estimation
        - technique1
            - parameter 1
        - technique2
            - parameter 1
            - parameter 2
...
...
...
The difference between the above requirements and what I've found on Google is that, depending on which branch is taken, the number and types of inputs can vary. It's not just a matter of changing a user name and password, for example.
One thought I had was to have a top-level configuration file, and then a sub configuration for each step in the path taken. It seems too complicated though:
model : {
    type: "PAM"
    parameters_config : "some/other/file.json_or_whatever"
}
apriori_k : {
    method : "spectral gap"
    parameters_config : "some/other/file.yaml_or_whatever"
}
similarity : {
    method : "Gaussian Kernel"
    parameter_config : "you/get/the idea"
    .
    .
    .
}
And even if this method is a good one, I have no idea how to get code to execute the defined steps based on the config file, especially because some paths require extra objects and those objects have their own parameters too.
Is there an established best practice for this sort of thing? If not, what are some of my options?

How to create a tensorflow serving client for the 'wide and deep' model?

I've created a model based on the 'wide and deep' example (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/wide_n_deep_tutorial.py).
I've exported the model as follows:
m = build_estimator(model_dir)
m.fit(input_fn=lambda: input_fn(df_train, True), steps=FLAGS.train_steps)
results = m.evaluate(input_fn=lambda: input_fn(df_test, True), steps=1)
print('Model statistics:')
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
print('Done training!!!')
# Export model
export_path = sys.argv[-1]
print('Exporting trained model to %s' % export_path)
m.export(
    export_path,
    input_fn=serving_input_fn,
    use_deprecated_input_fn=False,
    input_feature_key=INPUT_FEATURE_KEY)
My question is, how do I create a client to make predictions from this exported model? Also, have I exported the model correctly?
Ultimately I need to be able to do this in Java too. I suspect I can do this by creating Java classes from proto files using gRPC.
Documentation is very sketchy, which is why I am asking here.
Many thanks!

I wrote a simple tutorial, Exporting and Serving a TensorFlow Wide & Deep Model.
TL;DR
To export an estimator there are four steps:
1. Define features for export as a list of all features used during estimator initialization.
2. Create a feature config using create_feature_spec_for_parsing.
3. Build a serving_input_fn suitable for use in serving using input_fn_utils.build_parsing_serving_input_fn.
4. Export the model using export_savedmodel().
To run a client script properly you need to do the following steps:
1. Create and place your script somewhere in the /serving/ folder, e.g. /serving/tensorflow_serving/example/
2. Create or modify the corresponding BUILD file by adding a py_binary.
3. Build and run a model server, e.g. tensorflow_model_server.
4. Create, build and run a client that sends a tf.Example to our tensorflow_model_server for inference.
For more details, look at the tutorial itself.

Just spent a solid week figuring this out. First off, m.export is going to be deprecated in a couple of weeks, so instead of that block, use: m.export_savedmodel(export_path, input_fn=serving_input_fn).
That means you then have to define serving_input_fn(), which of course is supposed to have a different signature than the input_fn() defined in the wide and deep tutorial. Namely, moving forward, I guess the recommendation is that input_fn()-type functions return an InputFnOps object, defined here.
Here's how I figured out how to make that work:
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.utils import input_fn_utils
from tensorflow.python.ops import array_ops
from tensorflow.python.framework import dtypes
def serving_input_fn():
    # Reuse the feature definitions from the tutorial's input_fn().
    features, labels = input_fn()
    features["examples"] = tf.placeholder(tf.string)
    serialized_tf_example = array_ops.placeholder(dtype=dtypes.string,
                                                  shape=[None],
                                                  name='input_example_tensor')
    inputs = {'examples': serialized_tf_example}
    labels = None  # these are not known in serving!
    return input_fn_utils.InputFnOps(features, labels, inputs)
This is probably not 100% idiomatic, but I'm pretty sure it works. For now.

MessageSource equivalent in Django?

I come from a Java/Spring background and I've just recently moved to Python/Django. I'm working on a new project from scratch with Django. I was wondering how Django handles common string messages. Is there one single common file that can be called in a resources folder? For example, in Spring, we have MessageSource, a key/value properties file that is global to most of the app. Is there something similar in Django? If so, how does it work for the normal app side and the unit tests side?

You could take a look at Django's messages framework.
Also, you can use key-value pairs in Python, with dicts:
# Upper case because it is a constant
LOGIN_ERRORS = {
    'login_error_message': 'message here',
    ...
}
You could put this in a file (you can even name it message_source.py) inside your app and import it when you need it.
For example, in your view:
# views.py
...
from myapp.message_source import LOGIN_ERRORS

Django uses the standard gettext + .po files for internationalization/translation. Check out the Translation docs for all the steps needed: https://docs.djangoproject.com/en/1.9/topics/i18n/translation/

Get list of modified / new / removed files using org.jenkinsci.plugins.gitclient.GitClient

How can I get the list of modified, new, and removed files using org.jenkinsci.plugins.gitclient.GitClient?
Right now, I'm doing something like:
String status = ((CliGitAPIImpl) gitClient).launchCommand("ls-files", "--deleted", "--modified", "--others", SOME_DIRECTORY);
for (String toCommit : status.split("\\R")) {
    gitClient.add(toCommit);
}
but I don't like this approach. First, it relies on CliGitAPIImpl (and other Jenkins installations could use other classes, for example RemoteGitImpl, which doesn't implement the launchCommand method). Second, I'm already using gitClient to create branches, add files to be committed, commit, push, etc., so I would prefer to use some API rather than just calling the launchCommand method.
--
Thanks,
Jose

JRuby: Calling Java Code From A Rack App And Keeping It In Memory

I currently know Java and Ruby, but have never used JRuby. I want to use some RAM- and computation-intensive Java code inside a Rack (sinatra) web application. In particular, this Java code loads about 200MB of data into RAM, and provides methods for doing various calculations that use this in-memory data.
I know it is possible to call Java code from Ruby in JRuby, but in my case there is an additional requirement: This Java code would need to be loaded once, kept in memory, and kept available as a shared resource for the sinatra code (which is being triggered by multiple web requests) to call out to.
Questions
Is a setup like this even possible?
What would I need to do to accomplish it? I am not even sure if this is a JRuby question per se, or something that would need to be configured in the web server. I have experience with Passenger and Unicorn/nginx, but not with Java servers, so if this does involve configuration of a Java server such as Tomcat, any info about that would help.
I am really not sure where to even start looking, or if there is a better way to be approaching this problem, so any and all recommendations or relevant links are appreciated.

Yes, such a setup is possible (see Deployment below), and to accomplish it I would suggest using a singleton.
Singletons in JRuby
With reference to the question "best/most elegant way to share objects between a stack of rack mounted apps/middlewares?", I agree with Colin Surprenant's answer, namely the singleton-as-module pattern, which I prefer over using the Singleton mixin.
Example
I post here some test code you can use as a proof of concept:
JRuby sinatra side:
#file: sample_app.rb
require 'sinatra/base'
require 'java' #https://github.com/jruby/jruby/wiki/CallingJavaFromJRuby
java_import org.rondadev.samples.StatefulCalculator #import your java class here
# singleton-as-module loaded once, kept in memory
module App
  module Global extend self
    def calc
      # memoized: the Java object is created on first use and then reused
      @calc ||= StatefulCalculator.new
    end
  end
end
# you could call a method to load data in the stateful java object
App::Global.calc.turn_on
class Sample < Sinatra::Base
  get '/' do
    "Welcome, calculator register:#{App::Global.calc.display}"
  end
  get '/add_one' do
    "added one to calculator register, new value:#{App::Global.calc.add(1)}"
  end
end
You can start it in Tomcat with Trinidad, or simply with rackup config.ru, but you need:
#file: config.ru
root = File.dirname(__FILE__) # => "."
require File.join( root, 'sample_app' ) # => true
run Sample # ..in sample_app.rb ..class Sample < Sinatra::Base
The Java side:
package org.rondadev.samples;
public class StatefulCalculator {
    private StatelessCalculator calculator;
    double register = 0;
    public double add(double a) {
        register = calculator.add(register, a);
        return register;
    }
    public double display() {
        return register;
    }
    public void clean() {
        register = 0;
    }
    public void turnOff() {
        calculator = null;
        System.out.println("[StatefulCalculator] Good bye ! ");
    }
    public void turnOn() {
        calculator = new StatelessCalculator();
        System.out.println("[StatefulCalculator] Welcome !");
    }
}
Please note that the register here is only a double, but in your real code it can be a big data structure.
Deployment
You can deploy using Mongrel, Thin (experimental), Webrick (but who would do that?), and even Java-centric application containers like Glassfish, Tomcat, or JBoss (source: jruby deployments).
You can also deploy with TorqueBox, which is built on the JBoss Application Server.
JBoss AS includes high-performance clustering, caching and messaging functionality.
Trinidad is a RubyGem that allows you to run any Rack-based application wrapped within an embedded Apache Tomcat container.
Thread synchronization
When the lock setting is enabled, Sinatra uses the Mutex#synchronize method to place a lock on every request to avoid race conditions among threads. If your Sinatra app is multithreaded and not thread safe, or any gem you use is not thread safe, you would want to do set :lock, true so that only one request is processed at a given time. Otherwise, lock is false by default, which means synchronize yields to the block directly.
source: https://github.com/zhengjia/sinatra-explained/blob/master/app/tutorial_2/tutorial_2.md

Here are some instructions for how to deploy a sinatra app to Tomcat.
The Java code can be loaded once and reused if you keep a reference to the Java instances you have loaded. You can keep a reference from a global variable in Ruby.
One thing to be aware of is that the Java library you are using may not be thread safe. If you are running your Ruby code in Tomcat, multiple requests can execute concurrently, and those requests may all access your shared Java library. If your library is not thread safe, you will have to use some sort of synchronization to prevent multiple threads from accessing it at the same time.
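One simple form of such synchronization is a thin wrapper whose methods are all synchronized, so that concurrent requests reach the shared object one at a time. The sketch below is only illustrative: UnsafeRegister is a hypothetical stand-in for a library that is not thread safe, and SynchronizedRegister and ConcurrencyDemo are likewise invented names. It assumes Java 8 or later for the lambda.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a library class that is not thread safe.
class UnsafeRegister {
    private double value = 0;

    double add(double a) {
        value = value + a; // read-modify-write: can lose updates under concurrency
        return value;
    }

    double display() {
        return value;
    }
}

// Wrapper that serializes every access through the wrapper's own monitor.
class SynchronizedRegister {
    private final UnsafeRegister delegate = new UnsafeRegister();

    synchronized double add(double a) {
        return delegate.add(a);
    }

    synchronized double display() {
        return delegate.display();
    }
}

class ConcurrencyDemo {
    // Simulates concurrent web requests that all use one shared instance.
    static double addFromManyThreads(int threads, final int addsPerThread)
            throws InterruptedException {
        final SynchronizedRegister register = new SynchronizedRegister();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    register.add(1);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return register.display();
    }
}
```

From JRuby, the wrapper would be the object kept in the global reference instead of the raw library instance. The trade-off is that a single lock limits throughput, which may matter for computation-heavy calls.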
