I'm trying to do some crawling with Nutch and I'd like to test out Cassandra as a backend. However, using the latest version of Nutch and its dependencies, Cassandra throws a variety of errors as you move through the inject, generate, fetch, etc. process.
The errors are all related to actual problems in the code, not to out-of-memory conditions or configuration. I've fixed some of them by modifying code within gora-cassandra, but it's still not functional.
My question is: does a working version of these two projects exist? By working I mean you can run through inject, generate, fetch, parse, and updatedb on at least a small set of URLs without error.
Here's an example of one of the classes giving an error during fetch:
java.lang.NullPointerException
at org.apache.gora.cassandra.query.CassandraSuperColumn.getUnionIndex
I have used HBase as the backend and that just works, although HBase itself is a monster to manage, which is why I'd like to test out Cassandra. However, I'm about to give up on this, as I don't think I should have to modify gora-cassandra code just to get a basic example to run.
Thanks
According to this link (about three months old), it's just broken: http://lucene.472066.n3.nabble.com/Re-user-Digest-3-Jun-2017-19-27-20-0000-Issue-2758-td4339060.html
It's unclear why backends that do not work are even documented.
"HBase is most widely used, followed by MongoDB... on the other end of the spectrum, Cassandra is least used and broken. It has not been maintained for quite some time... and yes this is reflected by use of Super Columns. We are currently re-writing the backend as part of a GSoC project."
I would agree with the person making the original statement: it's unclear why backends that do not work are even documented.
I'm really tired of this project and its lack of usable documentation.
Related
I would like to build out a set of triggers in my database using the TransactionEventHandler functionality.
However, I haven't found a working example of this for versions > 3.0. I did see an example by maxdemarzi, but it doesn't appear to work in recent versions of neo4j.
If anyone has any experience with this I would really appreciate the help!
Side note: I do realize APOC has some alpha functionality around triggers using Cypher. At the moment it isn't fully fledged and I have run into some issues using it, so I'm looking at implementing my own plugin to handle my particular use case.
After reaching out to maxdemarzi on GitHub, he has updated his example to support neo4j v3.1.
See the repo here: https://github.com/maxdemarzi/neo_listens
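For anyone landing here, the core of the approach is Neo4j's TransactionEventHandler interface from org.neo4j.graphdb.event. Here is a minimal sketch against the Neo4j 3.x embedded API; the handler name and the logging it does are made up for illustration:

import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

// Hypothetical trigger: logs the ids of nodes created in each committed transaction.
public class CreatedNodeLogger implements TransactionEventHandler<Void> {

    @Override
    public Void beforeCommit(TransactionData data) throws Exception {
        // Runs inside the committing transaction; throwing here vetoes the commit.
        return null;
    }

    @Override
    public void afterCommit(TransactionData data, Void state) {
        data.createdNodes().forEach(node ->
                System.out.println("created node " + node.getId()));
    }

    @Override
    public void afterRollback(TransactionData data, Void state) {
        // Nothing to undo in this sketch.
    }
}

The handler is registered with graphDb.registerTransactionEventHandler(new CreatedNodeLogger()). In a server plugin that registration is typically done from a kernel extension, which is the part the linked repo demonstrates.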
Does anybody know whether the Unitils project is still alive? On their pages the last version is 3.3, while in the Maven repository it is 3.4.2. (Actually, there is a Google-cached version of their pages where the version is said to be 3.4.2.)
Anyway, is there any replacement for this project? I miss having a vivid community around it and really don't want to be bound to a dying project.
Unitils seems to be almost abandoned nowadays. The project is available on GitHub, where you can look at its history and activity.
Anyways my two cents...
Unitils has serious drawbacks:
Integrates many third-party libs (easymock, dbunit, spring, dbmaintainer, xmlunit, slf4j, etc.) and thus forces their versions on you - a really serious drawback.
Because it depends on so many third-party libraries, it is almost impossible to keep it up to date without a company behind it.
Unitils 4.0 has been in development since June 2011 and was planned for release in January 2012, but now (January 2016), four years later, it still has not been released.
DbUnit
For database-driven apps, an interesting way to go seems to be plain DbUnit + Spring-Test, or alternatively some third-party tools:
excilys/spring-dbunit, which comes with a handy @DataSet annotation and is actively developed on GitHub; it is also constantly updated to use the newest versions of DbUnit and the Spring Framework.
springtestdbunit/spring-test-dbunit, which is also hosted on GitHub (and comes with a @DatabaseSetup annotation).
Both are very similar, but personally I find DbUnit confusing, quite cumbersome, and time-consuming. Why? Try to maintain a large number of small XML files and you will find out what I mean. Combining multiple data sets is also really hard.
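For a feel of the annotation-driven style, here is a rough sketch of a spring-test-dbunit test. The listeners and @DatabaseSetup come from the project's documentation, while the context file, dataset, and repository are hypothetical:

import static org.junit.Assert.assertNotNull;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.TestExecutionListeners;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.test.context.support.DependencyInjectionTestExecutionListener;

import com.github.springtestdbunit.DbUnitTestExecutionListener;
import com.github.springtestdbunit.annotation.DatabaseSetup;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("classpath:test-context.xml") // hypothetical Spring config
@TestExecutionListeners({ DependencyInjectionTestExecutionListener.class,
        DbUnitTestExecutionListener.class })
@DatabaseSetup("classpath:users.xml") // hypothetical DbUnit flat-XML dataset
public class UserRepositoryTest {

    @Autowired
    private UserRepository userRepository; // hypothetical bean under test

    @Test
    public void findsSeededUser() {
        assertNotNull(userRepository.findByName("alice"));
    }
}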
DbSetup
My choice. DbSetup doesn't need external XML/JSON files, is extremely convenient, and lets you freely combine multiple data sets using fluent builders.
Just look at the code below:
import static com.ninja_squad.dbsetup.Operations.sequenceOf;
import com.ninja_squad.dbsetup.DbSetup;
import com.ninja_squad.dbsetup.destination.DataSourceDestination;
import com.ninja_squad.dbsetup.operation.Operation;

// CommonOperations and prepareSpecialData() are this project's own helpers.
final Operation sql =
    sequenceOf(
        CommonOperations.DELETE_ALL,
        CommonOperations.INSERT_REFERENCE_DATA,
        prepareSpecialData());
DbSetup dbSetup = new DbSetup(new DataSourceDestination(dataSource), sql);
dbSetup.launch(); // executes the operations against the DataSource
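In a JUnit test class this would typically run before each test. A sketch using DbSetup's DbSetupTracker to skip re-population for read-only tests (the test class and field wiring are hypothetical):

import javax.sql.DataSource;
import org.junit.Before;
import org.junit.Test;
import com.ninja_squad.dbsetup.DbSetup;
import com.ninja_squad.dbsetup.DbSetupTracker;
import com.ninja_squad.dbsetup.destination.DataSourceDestination;
import com.ninja_squad.dbsetup.operation.Operation;

public class ProductDaoTest {
    private static final DbSetupTracker TRACKER = new DbSetupTracker();
    private DataSource dataSource; // initialized as in the snippet above
    private Operation sql;         // built as in the snippet above

    @Before
    public void prepare() {
        DbSetup dbSetup = new DbSetup(new DataSourceDestination(dataSource), sql);
        TRACKER.launchIfNecessary(dbSetup); // re-populates only when needed
    }

    @Test
    public void readOnlyTest() {
        TRACKER.skipNextLaunch(); // this test does not modify the data
        // ... assertions against the seeded data
    }
}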
Everything is Java, so you can freely refactor it, extract methods, etc.
Hope it helps.
We are experiencing a problem with our Jenkins CI server.
Our CI implementation relies on several Groovy scripts, which we execute in Jenkins as "System Groovy scripts". It has been this way for years; the scripts have undergone no recent modifications and implement build flows, business-logic steps such as version checking, and so on.
Yesterday we started experiencing an exception in every Jenkins job we tried to launch that, one way or another, tried to execute Groovy scripts. The exception is:
java.lang.StackOverflowError
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.additiveExpression(GroovyRecognizer.java:12478)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.shiftExpression(GroovyRecognizer.java:9695)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.relationalExpression(GroovyRecognizer.java:12383)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.equalityExpression(GroovyRecognizer.java:12307)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.regexExpression(GroovyRecognizer.java:12255)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.andExpression(GroovyRecognizer.java:12223)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.exclusiveOrExpression(GroovyRecognizer.java:12191)
(... hundreds of similar lines ...)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.compoundStatement(GroovyRecognizer.java:7510)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.compatibleBodyStatement(GroovyRecognizer.java:8834)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.statement(GroovyRecognizer.java:899)
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.compilationUnit(GroovyRecognizer.java:757)
at org.codehaus.groovy.antlr.AntlrParserPlugin.transformCSTIntoAST(AntlrParserPlugin.java:131)
at org.codehaus.groovy.antlr.AntlrParserPlugin.parseCST(AntlrParserPlugin.java:108)
at org.codehaus.groovy.control.SourceUnit.parse(SourceUnit.java:236)
at org.codehaus.groovy.control.CompilationUnit$1.call(CompilationUnit.java:161)
at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:846)
at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:550)
at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:526)
at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:503)
at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:302)
at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:281)
at groovy.lang.GroovyShell.parseClass(GroovyShell.java:731)
at groovy.lang.GroovyShell.parse(GroovyShell.java:743)
at groovy.lang.GroovyShell.parse(GroovyShell.java:770)
at groovy.lang.GroovyShell.parse(GroovyShell.java:761)
at groovy.lang.GroovyShell$parse.call(Unknown Source)
at com.cloudbees.plugins.flow.FlowDSL.executeFlowScript(FlowDSL.groovy:80)
at com.cloudbees.plugins.flow.FlowRun$FlyweightTaskRunnerImpl.run(FlowRun.java:219)
at hudson.model.Run.execute(Run.java:1759)
at com.cloudbees.plugins.flow.FlowRun.run(FlowRun.java:155)
at hudson.model.ResourceController.execute(ResourceController.java:89)
at hudson.model.Executor.run(Executor.java:240)
at hudson.model.OneOffExecutor.run(OneOffExecutor.java:43)
It looks like the Groovy parser inside Jenkins is reaching the top of the stack while trying to parse the Groovy scripts (as I said, this abruptly started happening with many scripts that worked perfectly before and had undergone no recent modification).
Currently our Jenkins installation (v1.594) runs on a WebSphere 8.5.5.2 application server on AIX v7.1 (I don't know the exact fix pack level or whether it has recently received any kind of update; I'm still trying to gather that information).
After a restart, we returned to normal behavior (all the scripts were working as usual again without any modification to them).
Does anyone know about some incompatibility of any underlying library with Jenkins Groovy parsing?
There is a problem with the Groovy code that is causing the parser to go nuts:
java.lang.StackOverflowError
at org.codehaus.groovy.antlr.parser.GroovyRecognizer.additiveExpression(GroovyRecognizer.java:12478)
Based on a similar ticket, https://issues.apache.org/jira/browse/GROOVY-1783, it is possible that your code has circular references or is creating too many functions on the fly. One approach is to analyze your code and move anything that makes allocations outside of loops, in particular complex inline functions.
Another approach is to look at the Build Flow plugin, scroll down its documentation, and see how you could write an extension point rather than use Groovy. This may not be easy and requires effort, but it lets you write a lot of tests for your code. You would still use Groovy for the glue, but use Java directly for the hot spots.
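As an illustration of the "Java for the hot spots" idea (this is the generic Jenkins build-step extension point, not the Build Flow plugin's own), a skeleton build step might look like this; the class name and logic are hypothetical:

import java.io.IOException;

import hudson.Extension;
import hudson.Launcher;
import hudson.model.AbstractBuild;
import hudson.model.AbstractProject;
import hudson.model.BuildListener;
import hudson.tasks.BuildStepDescriptor;
import hudson.tasks.Builder;

// Hypothetical build step: the version-checking logic moves from a system
// Groovy script into compiled, unit-testable Java.
public class VersionCheckBuilder extends Builder {

    @Override
    public boolean perform(AbstractBuild<?, ?> build, Launcher launcher,
                           BuildListener listener)
            throws InterruptedException, IOException {
        listener.getLogger().println("Running version check in Java");
        return true; // returning false fails the build
    }

    @Extension
    public static final class DescriptorImpl extends BuildStepDescriptor<Builder> {
        @Override
        public boolean isApplicable(Class<? extends AbstractProject> jobType) {
            return true;
        }

        @Override
        public String getDisplayName() {
            return "Version check (Java)";
        }
    }
}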
A third approach would be to file a ticket on the Groovy issue tracker and see what the experts find out.
Recently, on my machine, every time I try to load a Java project that uses Liquibase I get this error:
Caused by: liquibase.exception.UnexpectedLiquibaseException: Cannot find ChangeLogHistoryService for oracle
at liquibase.changelog.ChangeLogHistoryServiceFactory.getChangeLogService(ChangeLogHistoryServiceFactory.java:73)
at liquibase.Liquibase.checkLiquibaseTables(Liquibase.java:724)
I can't find any information on this, beyond the source file.
All I know is that the same project, with the same source code, seems to work on other computers. As far as I can tell, all configs are the same.
Any idea on what might cause this issue in the first place?
It means that one of the tables that Liquibase expects to find in the database (liquibasechangelog or liquibasechangeloglock) has been removed. If it is working on other machines it probably means that the database you are connecting to is also different. It sounds like you might be trying to connect to the database on localhost, which would explain the difference.
The error looks more like a classloader issue. Liquibase has a plug-in system that relies on finding class implementations on the classpath, including the built-in classes. Liquibase is looking for an implementation of ChangeLogHistoryService that supports Oracle, and it isn't finding the liquibase.changelog.StandardChangeLogHistoryService class that it should.
Are there any earlier errors you are seeing? If you run Liquibase with logLevel=DEBUG, it may also output better clues as to the cause of the missing class.
I updated the Liquibase version (from 3.1.1 to 3.2.0) and the problem went away.
I did have to fix some of the checksum values in the DATABASECHANGELOG table (3 records out of 16). The fact that it wasn't all of them may point to a possible origin of the problem, maybe?
For now everything works fine with the new version. If the problem comes up again in the future I will look into this again.
Thank you all for your support.
I've been working on a Django project using South to track and manage database schema changes. I'm starting a new Java project using Google Web Toolkit and wonder if there is an equivalent tool. For those who don't know, here's what South does:
Automatically recognize changes to my Python database models (add/delete columns, tables etc.)
Automatically create SQL statements to apply those changes to my database
Track the applied schema migrations and apply them in order
Allow data migrations using Python code. For example, splitting a name field into a first-name and last-name field using the Python split() function (a Java sketch of this kind of step follows the list)
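To make that last item concrete in Java terms, here is a sketch of that kind of data migration written against plain JDBC; the person table and its columns are hypothetical:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

// Hypothetical migration step: split person.name into first_name/last_name.
public class SplitNameMigration {
    static void run(Connection conn) throws SQLException {
        Map<Long, String> names = new HashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM person")) {
            while (rs.next()) {
                names.put(rs.getLong("id"), rs.getString("name"));
            }
        }
        try (PreparedStatement upd = conn.prepareStatement(
                "UPDATE person SET first_name = ?, last_name = ? WHERE id = ?")) {
            for (Map.Entry<Long, String> e : names.entrySet()) {
                String[] parts = e.getValue().split(" ", 2); // like Python's split()
                upd.setString(1, parts[0]);
                upd.setString(2, parts.length > 1 ? parts[1] : "");
                upd.setLong(3, e.getKey());
                upd.executeUpdate();
            }
        }
    }
}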
I haven't decided on my Java ORM yet, but Hibernate looks like the most popular. For me, the ability to easily make database schema changes will be an important factor.
Wow, South sounds pretty awesome! I'm not sure of anything out-of-the-box that will help you nearly as much as that does. However, if you choose Hibernate as your ORM solution, you can build your own incremental data-migration suite without a lot of trouble.
Here is the approach I used in my own project; it worked fairly well for me over a couple of years and several updates/schema changes:
Maintain a schema_version table in the database that simply defines a number that represents the version of your database schema. This table can be handled outside of the scope of Hibernate if you wish.
Maintain the "current" version number for your schema inside your code.
When the version number in the code is newer than the one in the database, you can use Hibernate's SchemaUpdate utility, which will detect any schema additions (note: just additions) such as new tables, columns, and constraints.
Finally, I maintained a "script", if you will, of migration steps that were more than just schema changes, identified by the schema version number they were required for. For instance, new columns needed default values applied, or something of that nature. (A sketch combining these steps follows.)
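Here is a rough sketch of how those pieces might fit together. The schema_version table, class, and helper names are hypothetical, and SchemaUpdate is the Hibernate 3-era utility mentioned above:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.hibernate.cfg.Configuration;
import org.hibernate.tool.hbm2ddl.SchemaUpdate;

// Hypothetical migration driver built around a schema_version table.
public class SchemaMigrator {
    private static final int CURRENT_SCHEMA_VERSION = 7; // maintained in code

    public void migrate(Configuration hibernateConfig, Connection conn) throws SQLException {
        int dbVersion = readDbVersion(conn);
        if (dbVersion < CURRENT_SCHEMA_VERSION) {
            // Let Hibernate apply schema additions (new tables, columns, constraints).
            new SchemaUpdate(hibernateConfig).execute(true, true);
            // Then run the version-specific data migration steps in order.
            for (int v = dbVersion + 1; v <= CURRENT_SCHEMA_VERSION; v++) {
                runMigrationStep(v, conn); // e.g. apply defaults for new columns
            }
            writeDbVersion(conn, CURRENT_SCHEMA_VERSION);
        }
    }

    private int readDbVersion(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT version FROM schema_version")) {
            return rs.next() ? rs.getInt(1) : 0;
        }
    }

    private void writeDbVersion(Connection conn, int version) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("UPDATE schema_version SET version = " + version);
        }
    }

    private void runMigrationStep(int version, Connection conn) throws SQLException {
        // version-specific data fixes go here
    }
}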
This may sound like a lot of work, especially coming from an environment that took care of a lot of it for you, but you can get a setup like this rolling pretty quickly with Hibernate, and it is pretty easy to extend as you go. I never ended up making any changes to my incremental update framework over that time except to add new migration tasks.
Hopefully someone will come along with a good answer for a more "hands-off" approach, but I thought I'd share an approach that worked pretty well for me.
Good luck to you!
As I'm looking for the same thing, here's what I've achieved so far.
We first used dbdeploy. It manages most of the stuff for you, but you will have to write all the transition scripts yourself! That means every change you make has to go in its own script, which you will have to write from scratch. Not very handy, but it works very reliably.
The second thing I encountered is Liquibase. It stores the configuration in one single XML file. Not very intuitive to read, but manageable. Plus there is an IntelliJ IDEA plugin for it. At the moment of writing it still has some minor issues, but as the author assured me, they will be fixed soon.
The perfect solution would be to get South working with your Java environment. That really would be a tool to marry :D
Maybe try Flyway. Seems like a good alternative.
I've been thinking about using django-jython just for db migrations in our legacy Java application. The latest Jython version is 2.5.4rc1, but I think I can mitigate the risk by just using it for South migrations.
Especially since I can use inspectdb to generate the models for me. And then replace parts of the Java with Python "seamlessly".
If you're using Hibernate, then check out Liquibase:
http://www.liquibase.org/databases.html
It's been around for 10 years, so it's pretty solid. It may support other ORMs; have a dig around on their website. Check out the Liquibase+Hibernate extension here:
https://github.com/liquibase/liquibase-hibernate
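For reference, kicking off a Liquibase update from Java is only a few lines. A sketch against the Liquibase 3.x API, with a hypothetical JDBC URL and changelog path:

import java.sql.Connection;
import java.sql.DriverManager;

import liquibase.Liquibase;
import liquibase.database.Database;
import liquibase.database.DatabaseFactory;
import liquibase.database.jvm.JdbcConnection;
import liquibase.resource.ClassLoaderResourceAccessor;

public class RunMigrations {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection; in a real app this comes from your pool/config.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:app", "sa", "")) {
            Database db = DatabaseFactory.getInstance()
                    .findCorrectDatabaseImplementation(new JdbcConnection(conn));
            Liquibase liquibase = new Liquibase(
                    "db/changelog.xml", new ClassLoaderResourceAccessor(), db);
            liquibase.update(""); // empty contexts = apply all changesets
        }
    }
}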