Encog - EarlyStoppingStrategy with validation set - java

I would like to stop training a network once I see the error calculated from the validation set starts to increase. I'm using a BasicNetwork with RPROP as the training algorithm, and I have the following training iteration:
double validationError = 999.999;
while(!stop){
train.iteration(); //weights are updated here
System.out.println("Epoch #" + epoch + " Error : " + train.getError()) ;
//I'm just comparing to see if the error on the validation set increased or not
if (network.calculateError(validationSet) < validationError)
validationError = network.calculateError(validationSet);
else
//once the error increases I stop the training.
stop = true ;
System.out.println("Epoch #" + epoch + "Validation Error" + network.calculateError(validationSet));
epoch++;
}
train.finishTraining();
Obviously this isn't working, because the weights have already been changed by the time I find out whether I need to stop training. Is there any way I can take a step back and use the old weights?
I also see the EarlyStoppingStrategy class, which is probably what I need, added through the addStrategy() method. However, I really don't understand why the EarlyStoppingStrategy constructor takes both the validation set and the test set. I thought it would only need the validation set, and that the test set shouldn't be used at all until I test the output of the network.

Encog's EarlyStoppingStrategy class implements an early stopping strategy according to this paper:
Proben1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules
(the full citation is included in the Javadoc)
If you just want to stop as soon as a validation set's error no longer improves, you can use Encog's SimpleEarlyStoppingStrategy, found here:
org.encog.ml.train.strategy.end.SimpleEarlyStoppingStrategy
Note that SimpleEarlyStoppingStrategy requires Encog 3.3.
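The question also asks how to step back to the earlier weights. One library-independent way, shown here only as a sketch, is to keep a copy of the weights that produced the lowest validation error and restore that copy once the error stops improving. The class name BestWeightsTracker, its methods and the patience parameter are hypothetical and are not part of Encog; how the weights are read from and written back to the network depends on the library and is not shown.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Encog: remembers the weights that gave the
// lowest validation error and signals when training should stop.
final class BestWeightsTracker {
    private final int patience; // epochs without improvement tolerated before stopping
    private double bestError = Double.POSITIVE_INFINITY;
    private double[] bestWeights = new double[0];
    private int epochsSinceBest = 0;

    BestWeightsTracker(int patience) {
        this.patience = patience;
    }

    // Call once per epoch, after the training iteration.
    // Returns true when training should stop.
    boolean observe(double validationError, double[] currentWeights) {
        if (validationError < bestError) {
            bestError = validationError;
            // Store a snapshot, not a reference: the trainer keeps mutating its own array.
            bestWeights = Arrays.copyOf(currentWeights, currentWeights.length);
            epochsSinceBest = 0;
            return false;
        }
        epochsSinceBest++;
        return epochsSinceBest >= patience;
    }

    double bestError() {
        return bestError;
    }

    // The weights to restore into the network after training stops.
    double[] bestWeights() {
        return Arrays.copyOf(bestWeights, bestWeights.length);
    }
}
```

With a patience of 1 this behaves like the loop in the question (stop at the first increase), except that the weights from the best epoch are still available afterwards. The validation error is computed once per epoch and passed in, rather than being recalculated for each comparison and print statement.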

Related

finding twitter4j retweet rate limit

I would like to know how to find, in Java, the rate limit for a specific function in twitter4j:
twitter.retweetStatus(lstatusId);
How can I check whether I'm exceeding the rate limit for retweeting?
You can add the following piece of code to check the rate limit status. You need to provide the specific URL: for example, if you are fetching "statuses/retweets_of_me", then your url-to-get-rate-limit-status-of would be "statuses/retweets_of_me".
You would need to find the mapping to the REST API URL for each of your streaming API calls and use that URL to get the rate limit status.
// urlToGetRateLimitStatusOf is the REST resource described above (a String you supply).
// Look the status up once: every getRateLimitStatus() call is itself a rate-limited request.
RateLimitStatus status = twitter.getRateLimitStatus().get(urlToGetRateLimitStatusOf);
if (status != null && status.getRemaining() == 0) {
    int secondsUntilReset = status.getSecondsUntilReset();
    System.out.println("Adding wait of - " + secondsUntilReset + " seconds");
    Thread.sleep(secondsUntilReset * 1000L);
}
Hope this helps.
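The wait calculation can also be kept separate from the twitter4j calls, which makes it easy to check on its own. The sketch below is a hypothetical helper, not part of twitter4j; it assumes the two arguments come from getRemaining() and getSecondsUntilReset(). It clamps negative values to zero as a precaution, because Thread.sleep throws IllegalArgumentException for a negative argument.

```java
// Hypothetical helper, not part of twitter4j.
final class RateLimitWait {
    private RateLimitWait() {
    }

    // remaining: calls left in the current rate limit window.
    // secondsUntilReset: seconds until the window resets.
    // Returns the number of milliseconds to sleep before the next call.
    static long millisToWait(int remaining, int secondsUntilReset) {
        if (remaining > 0) {
            return 0L; // still within the limit, no wait needed
        }
        // Multiply as long to avoid int overflow; never return a negative wait.
        return Math.max(0, secondsUntilReset) * 1000L;
    }
}
```

The result can be passed straight to Thread.sleep.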

Is JasperReports 10 times faster than Birt?

I am doing an evaluation of JasperReports and Birt reporting engines.
I designed a simple report in both tools: I pass 20 values to the report as parameters and fill 6 other values from an SQL selection in the report as a detail relation (which means there are many rows of them).
I programmed the creation of both reports and the PDF export in Java (I think both reporting engines use iText).
I measured the time each report needed. The reports are exactly the same and they are run from the same process.
The report was run for 10 sets of values, so I measured the time for each of the 10 reports. The result was:
Printing Jasper reports for 10 values. Measuring time needed.
110
109
141
125
110
125
110
125
109
110
Jasper Finished!!!
Printing Birt reports for 10 values. Measuring time needed.
1063
1017
1095
1079
1063
1079
1048
1064
1079
1080
Birt Finished!!!
The numbers are in msecs.
Is it possible that Jasper is 10 times faster than Birt? Am I doing something wrong in my code that slows things down for Birt? I am posting the code I used in each case:
JasperReports:
// Export Jasper report
long startTime = System.currentTimeMillis();
JasperPrint myJasperPrint;
JRExporter myJRExporter = new net.sf.jasperreports.engine.export.JRPdfExporter();
try {
myJRExporter.setParameter(JRExporterParameter.OUTPUT_FILE_NAME, "C:/Workspace/myProject/jasperReport" + reportNr + ".pdf");
myJasperPrint = JasperFillManager.fillReport("C:/Workspace/myProject/reports/testReport.jasper", jasperParametersMap, connection);
myJRExporter.setParameter(JRExporterParameter.JASPER_PRINT, myJasperPrint);
myJRExporter.exportReport();
return (System.currentTimeMillis() - startTime);
} catch (JRException ex) {
System.out.println(ex);
}
Birt:
// Export Birt report
String format = HTMLRenderOption.OUTPUT_FORMAT_PDF;
EngineConfig config = new EngineConfig();
config.setEngineHome("C:\\Tools\\Eclipse\\plugins\\org.eclipse.birt.report.viewer_4.2.2.v201302041142\\birt");
HTMLEmitterConfig hc = new HTMLEmitterConfig();
HTMLCompleteImageHandler imageHandler = new HTMLCompleteImageHandler();
hc.setImageHandler(imageHandler);
config.setEmitterConfiguration(HTMLRenderOption.OUTPUT_FORMAT_HTML, hc);
ReportEngine engine = new ReportEngine(config);
IReportRunnable report = null;
String reportFilepath = "C:/Workspace/EntireJ/Besuchblatt/reports/new_report.rptdesign";
HTMLRenderOption options = new HTMLRenderOption();
options.setOutputFormat(format);
options.setOutputFileName("C:/Workspace/myProject/birtReport" + reportNr + ".pdf");
long startTime = System.currentTimeMillis();
try {
report = engine.openReportDesign(reportFilepath);
}
catch (EngineException e) {
System.err.println("Report " + reportFilepath + " not found!\n");
engine.destroy( );
return;
}
IRunAndRenderTask task = engine.createRunAndRenderTask(report);
task.setRenderOption(options);
task.setParameterValues(parametersMap);
try {
task.run();
return (System.currentTimeMillis() - startTime);
}
catch ( EngineException e1 ) {
System.err.println( "Report " + reportFilepath + " run failed.\n");
System.err.println( e1.toString( ) );
}
engine.destroy( );
Is there a way to optimize Birt's performance in my case?
After reading similar discussions and completing my evaluation, I think that in most cases Birt is actually much slower than Jasper. There are things you can do to make it faster, but they cost time to set up, whereas Jasper already gives good performance for basic reporting needs. I don't know whether Birt could outperform Jasper if I set it up better or optimized the code or the report template, but in most similar cases I read about in internet discussions, people just accept this performance and leave it as it is. Here is an example of an issue at OpenMRS which was closed unsolved: https://tickets.openmrs.org/browse/BIRT-30
I hope the following image doesn't get me downvoted, but I was really tempted to post it. I also thought of sending it to my boss as the answer to the evaluation, but I'd rather not:
If somebody needs it...
Java application on an Intel i3 with 4 cores and 5 GB.
Oracle database server.
Similar report templates for Jasper and Birt that make 20 requests to the database and 20 sub-requests (subreports).
Goal:
Generate 6000 PDF documents in 30 threads (200 documents per thread).
Q&A:
Why Birt 2.6.2?
We are using it currently, and we compared it to 4.5: no real benefits for us.
Birt 4.+ makes a call to getParameterMetaData(), which is not implemented in Oracle ojdbc6 and only partially in ojdbc7, and it just slows down execution.
Why 2.6.2 patched?
There is a problem in Birt 2.+ and 3.+, and maybe in later versions: all dataset parameters are evaluated through JavaScript, and the parsed/compiled versions of those scripts are not cached (described here). Evaluated JS columns are perfectly cached in ReportRunnable.
Why Jasper with Continuation Subreport Runner?
Continuation Subreport Runner (described here) runs all subreports in the thread of the main report. By default, Jasper 6.2 uses ThreadPoolSubreportRunnerFactory, which (I think by mistake) holds all previously retrieved data in memory until a full GC is executed, and it starts an enormous number of threads.
I think it is because you create and destroy a BIRT report engine on each run. You should initialize a report engine only once and keep it, for example in a static variable of a class, for subsequent report generations. This will be much faster.
The engine is designed to be reused. You should create it once, then run 10 reports. The engine loads a lot of classes when the first report runs; later runs will be much faster. Also, the engine caches fonts.
Your test setup is not fair.
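A fairer comparison does the one-time setup outside the timed section and discards the first runs, which pay for class loading and cache filling. The sketch below shows only that measurement pattern; WarmBenchmark and its parameters are hypothetical names, and the task passed in stands for "run one report with an engine that already exists".

```java
// Hypothetical measurement helper: untimed warm-up runs, then timed runs.
final class WarmBenchmark {
    private WarmBenchmark() {
    }

    // Runs the task warmupRuns times without timing it, then measuredRuns
    // times with timing. Returns the elapsed milliseconds of each measured run.
    static long[] timeRuns(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // pays for class loading, font caches and similar one-off work
        }
        long[] millis = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return millis;
    }
}
```

For both engines the same rule applies: whatever is created once in production (the BIRT engine, a compiled Jasper template) is created before timeRuns is called, and only the per-report work goes into the task.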

One (log.isDebugEnabled()) condition each debug statement occurrence

I would like to check with the community, if this is an accepted practice, to have multiple if conditions for each debug statement occurring in a row:
if (log.isDebugEnabled()) log.debug("rproductType = "+ producteType);
if (log.isDebugEnabled()) log.debug("rbundleFlag = " + bundleFrlag);
if (log.isDebugEnabled()) log.debug("rmrktSegment = " + mrktSeegment);
if (log.isDebugEnabled()) log.debug("rchannelTy = " + channelrTy);
if (log.isDebugEnabled()) log.debug("rcompanyPartner = " + coempanyPartner);
if (log.isDebugEnabled()) log.debug("rpreSaleDate = " + preSaleDaete);
if (log.isDebugEnabled()) log.debug("rportNC = " + portrNC);
if (log.isDebugEnabled()) log.debug("rLDC debug end");
I am personally in favor of having a single if condition that wraps all of the log statements, since they appear in a row. What are your thoughts on this? Or do you see why the original author wanted an if condition for each debug statement?
Thanks!
At best, it is messy. At worst, it performs absolutely redundant function calls.
The only potential difference in logic between sharing ifs is if the debugging option is somehow changed mid-call (possibly by a config reload). But capturing that extra half-a-call really isn't worth the wall of gross code.
Just change it. Don't Repeat Yourself.
The reason the if is there at all is to avoid the overhead of building the debug strings if you aren't in debug mode; that part you should keep (or not, if you find this is not a performance-critical part of your application).
Edit: FYI, by "change it", I mean do this instead:
if (log.isDebugEnabled())
{
log.debug("rproductType = "+ producteType);
log.debug("rbundleFlag = " + bundleFrlag);
// etc
}
The if condition is in there for increased speed.
It is intended to avoid the computational cost of the disabled debug statements. That is, if you have your log level set to ERROR, then there is no need to create the actual message.
http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Category.html
My personal feeling is that the
if (log.isDebugEnabled())
statements are a micro-optimization that just makes the code harder to read.
I know this was once a best practice for log4j, but many of those checks are performed by the log statements themselves before anything is logged (you can see this in the source code), and the only operations you are saving yourself are the string concatenations. I would remove them.
I would do it like that:
if (log.isDebugEnabled()) {
StringBuilder builder = new StringBuilder();
builder.append("rproductType = ");
builder.append(producteType);
builder.append("rbundleFlag = ");
builder.append(bundleFrlag);
builder.append("rmrktSegment = ");
builder.append(mrktSeegment);
builder.append("rchannelTy = ");
builder.append(channelrTy);
builder.append("rcompanyPartner = ");
builder.append(coempanyPartner);
builder.append("rpreSaleDate = ");
builder.append(preSaleDaete);
builder.append("rportNC = ");
builder.append(portrNC);
builder.append("rLDC debug end");
log.debug(builder.toString());
}
You have only 2 isDebugEnabled checks in this code: one at the beginning and one inside log.debug. The first one prevents the creation of the builder and several short-lived objects (your heap will thank you).
The second one is redundant in this case. But it is a pretty cheap check, so I think the cost of the builder would be higher. Given that in most production systems the debug level is off, I think this is the best option.
To summarize, I use isDebugEnabled when my debug statement, or anything else I need to do for my debug message, becomes more complex than usual. In most cases this is when it comes to String concatenations.
Never use isDebugEnabled for single-line log statements. As has already been mentioned, the log.debug method calls it itself.
Cheers
Christian
Depends. Log4j does this very check at the beginning of every method (isDebugEnabled() before the debug statement, isWarnEnabled() before warn, etc.) by default.
This does not mean checks are not required. Checks can save processing if any of the parameters passed invoke computation. E.g., LOGGER.debug(transformVars(a, b, c, d)); would result in unnecessary execution of transformVars() if debug is not enabled!
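The point about arguments being evaluated before the logger can check its level can be demonstrated without log4j. The sketch below swaps in java.util.logging from the JDK, where Level.FINE plays the role of debug, and counts how often a stand-in transformVars method runs. The class LazyDebugDemo and its method signatures are made up for this illustration; the Supplier overload of fine() is used here as an alternative to an explicit guard.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LazyDebugDemo {
    static final AtomicInteger transformCalls = new AtomicInteger();

    private LazyDebugDemo() {
    }

    // Stand-in for an expensive message computation such as transformVars(a, b, c, d).
    static String transformVars(int a, int b) {
        transformCalls.incrementAndGet();
        return "a=" + a + ", b=" + b;
    }

    // Returns how many times transformVars ran for two log calls at the given level.
    static int callsAt(Level level) {
        Logger logger = Logger.getLogger("LazyDebugDemo");
        logger.setUseParentHandlers(false); // keep the demo silent
        logger.setLevel(level);
        transformCalls.set(0);
        // Eager: the argument is computed before fine() gets a chance to check the level.
        logger.fine(transformVars(1, 2));
        // Lazy: the supplier is only invoked if FINE is enabled.
        logger.fine(() -> transformVars(3, 4));
        return transformCalls.get();
    }
}
```

With the level at INFO, callsAt reports one call (the eager one); with the level at FINE it reports two. An explicit isDebugEnabled() guard around the eager call would have the same effect as the supplier.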

Best way to time something in Selenium

I'm writing some Selenium tests in Java, and I'm mostly trying to use verifications instead of assertions because the things I'm checking aren't very dependent so I don't want to abort if one little thing doesn't work. One of the things I'd like to keep an eye on is whether certain Drupal pages are taking forever to load. What's the best way to do that?
Little example of the pattern I'm using.
selenium.open("/m");
selenium.click("link=Android");
selenium.waitForPageToLoad("100000");
if (selenium.isTextPresent("Epocrates")) {
System.out.println(" Epocrates confirmed");
} else {
System.out.println("Epocrates failed");
}
Should I have two "waitForPageToLoad" statements (say, 10000 and 100000) and, if the desired text doesn't show up after the first one, print a statement? That seems clumsy. What I'd like to do is just a line like
if (timeToLoad>10000) System.out.println("Epocrates was slow");
And then keep going to check whether the text was present or not.
waitForPageToLoad will wait until the next page is loaded. So you can just do a start/end timer and do your if:
long start = System.currentTimeMillis();
selenium.waitForPageToLoad("100000");
long timeToLoad= (System.currentTimeMillis()-start);
if (timeToLoad>10000) System.out.println("Epocrates was slow");
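Since the question prefers verifications over assertions, the same timing idea can be wrapped in a small helper that records slow steps instead of failing. This is only a sketch: SlowStepRecorder and its methods are hypothetical and not part of Selenium, and the step passed in would be the waitForPageToLoad call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Selenium: times a step and records it as
// slow instead of failing, so the remaining checks still run.
final class SlowStepRecorder {
    private final List<String> slowSteps = new ArrayList<>();

    // Runs the step, returns the elapsed milliseconds, and notes the step if
    // it took longer than thresholdMillis.
    long timeStep(String name, Runnable step, long thresholdMillis) {
        long start = System.nanoTime();
        step.run();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        if (elapsedMillis > thresholdMillis) {
            slowSteps.add(name + " was slow: " + elapsedMillis + " ms");
        }
        return elapsedMillis;
    }

    // Copy of the messages collected so far, for printing at the end of the test.
    List<String> slowSteps() {
        return new ArrayList<>(slowSteps);
    }
}
```

A call could look like recorder.timeStep("Epocrates", () -> selenium.waitForPageToLoad("100000"), 10000); the text check then follows as before, and the collected messages are printed once all pages have been visited.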
Does your text load after the page is loaded? I mean, is the text inserted dynamically? Otherwise the text should be present as soon as the page was loaded.
selenium.isTextPresent
doesn't wait. It only checks the currently available page.
The best method to wait for something in Selenium is as follows:
Reporter.log("Waiting for element '" + locator + "' to appear.");
new Wait()
{
public boolean until()
{
return selenium.isElementPresent(locator);
}
}.wait("The element '" + locator + "' did not appear within "
+ timeout + " ms.", timeout);
The Wait class is part of Selenium; you only have to import it.
Also here is a framework that you can use. It's opensource, handles mostly everything and can be easily expanded.
https://sourceforge.net/projects/webauthfw/
Use it well and give us credit hehe. :)
Cheers,
Gergely.
In a Selenium integration test, I did it like so, using nano-time and converting to a double to get seconds:
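long startTime = System.nanoTime();
// ... the Selenium action being timed goes here ...
long endTime = System.nanoTime();
long duration = endTime - startTime; // nanoseconds
Reporter.log("Duration was: " + ((double) duration / 1000000000.0) + " seconds.", true);
// Both bounds must hold (&&, not ||), and one second is 1000000000 nanoseconds.
assertTrue(duration >= 0 && duration <= 1000000000L, "Test that duration of default implicit timeout is less than 1 second, or nearly 0.");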

Logback SMTPAppender Limiting Rate

How can I limit the rate of emails sent by a Logback SMTPAppender, so that it emails me at most once every n minutes?
I have set up my logging according to the Logback appender, but I don't quite see how it can be configured or subclassed to implement that.
Is there a hidden feature? Did someone develop a subclass to handle this?
Based on the documentation it appears that the way to do this is to write an EventEvaluator (see example 4.14 and 4.15) which looks at the time stamp for each event to only accept an event when "enough time" has passed since the last event was accepted.
You can use System.currentTimeMillis to get a number you can do math on to calculate time differences. http://java.sun.com/javase/6/docs/api/java/lang/System.html#currentTimeMillis%28%29
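The "enough time has passed" test can be sketched independently of Logback. The class below is a hypothetical helper, not a Logback class; an evaluator's evaluate method could delegate to tryAccept(), but that wiring is not shown. The clock is passed in so the logic can be checked without waiting, and in real use it would be System::currentTimeMillis.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not a Logback class: lets at most one event through
// per interval.
final class TimeGate {
    private final long intervalMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private boolean acceptedBefore = false;
    private long lastAccepted = 0L;

    TimeGate(long intervalMillis, LongSupplier clock) {
        this.intervalMillis = intervalMillis;
        this.clock = clock;
    }

    // Returns true if the event may pass, false if it falls inside the quiet
    // interval. Synchronized because log events can arrive from several threads.
    synchronized boolean tryAccept() {
        long now = clock.getAsLong();
        if (!acceptedBefore || now - lastAccepted >= intervalMillis) {
            acceptedBefore = true;
            lastAccepted = now;
            return true;
        }
        return false;
    }
}
```

Events rejected by the gate are simply dropped; whether that is acceptable, or whether they should be collected and sent with the next email, is a separate design decision.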
As Thorbjørn says, it's easy to create an EventEvaluator that limits the rate at which an appender fires a message.
However, I found that Logback supports DuplicateMessageFilter, which probably solves my problem in a better way: "The DuplicateMessageFilter merits a separate presentation. This filter detects duplicate messages, and beyond a certain number of repetitions, drops repeated messages."
Have a look at the new Whisper appender. It does smart suppression. Available via Maven and github here
Statutory disclaimer: I'm the author.
This tool would do exactly what you want but it's not threadsafe at all: http://code.google.com/p/throttled-smtp-appender/wiki/Usage
I've written a threadsafe version but haven't open sourced it yet.
The reason you would have trouble finding good tools for this is that SMTP isn't a real endpoint. Use a service like loggly, airbrake, or dozens of others, or run your own server using something like logstash.
To solve the same problem I've written a custom evaluator. It extends ch.qos.logback.classic.boolex.OnMarkerEvaluator, but you can use any other evaluator as the base. If many acceptable messages arrive during the silence interval, the evaluator discards them. For my use case that's OK, but if you need different behavior, just add extra checks to the second if.
public class LimitingOnMarkerEvaluator extends OnMarkerEvaluator {
private long lastSend = 0, interval = 0;
@Override
public boolean evaluate(ILoggingEvent event) throws EvaluationException {
if (super.evaluate(event)) {
long now = System.currentTimeMillis();
if (now - lastSend > interval) {
lastSend = now;
return true;
}
}
return false;
}
public long getInterval() {
return interval;
}
public void setInterval(long interval) {
this.interval = interval;
}
}
Config to send at most one message every 1000 seconds (about 17 minutes):
<evaluator class="package.LimitingOnMarkerEvaluator">
<marker>FATAL</marker>
<interval>1000000</interval>
</evaluator>
I suggest filing a jira item requesting this feature. It is likely to be implemented if only asked.
Btw,
Logback v0.9.26 now allows setting the size of the SMTPAppender message buffer. Until yesterday it would send the current contents of the buffer, which was up to 256 messages; IMHO that was a pain in the neck, as I wanted to show only the last one in the email. It is thus now possible to implement periodically recurring email warnings that carry only one particular error, as per my interpretation of this question.
http://logback.qos.ch/manual/appenders.html#cyclicBufferSize
Have fun.
