Apache Nutch is not crawling any more - java

I have a two-machine cluster. Nutch is configured on one machine; HBase and Hadoop are configured on the second. Hadoop is in fully distributed mode and HBase in pseudo-distributed mode. I have crawled about 280 GB of data. But now when I start crawling, it prints the following messages and does not crawl any more into the previous table:
INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
and the following error:
ERROR store.HBaseStore
- [Ljava.lang.StackTraceElement;@7ae0c96b
Documents are fetched, but they are not saved in HBase.
But if I crawl data into a new table, it works well and crawls properly without any error. I don't think this is a connection problem, since it works for a new table; I suspect it is because of some property or setting.
Can anyone guide me? I am not an expert in Apache Nutch.

Not quite my field, but it looks like thread exhaustion on the underlying machines.

I was also facing a similar problem. The actual problem was with the regionserver (the HBase daemon): it shuts down when run with the default settings and there is too much data in HBase, so try restarting it. For more information, see the regionserver's log files.
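Building on that, here is a minimal sketch for checking the regionserver's status from code, assuming an HBase 1.x client and an hbase-site.xml (with the ZooKeeper quorum) on the classpath; class and variable names are illustrative only:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RegionServerCheck {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum etc.).
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            ClusterStatus status = admin.getClusterStatus();
            System.out.println("Live regionservers: " + status.getServers());
            System.out.println("Dead regionservers: " + status.getDeadServerNames());
        }
    }
}

If the dead-server list is non-empty, restarting the regionserver and reading its logs, as suggested above, is the next step.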

Related

Cassandra Talend job running via Java throwing errors

I have a 3-node Apache Cassandra cluster where we are doing data-loading operations via Java prepared statements. While running the job we are facing the following error:
INSERT INTO "abc" () VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
is not prepared on /xx.xx.xx.xx:9042, preparing before retrying executing.
Seeing this message a few times is fine, but seeing it a lot may be source of performance problems. 
This query is used in Java code that calls the Talend JARs, and the data-loading job is taking a long time to complete.
The above error message appears for all 3 Cassandra nodes in the cluster. Below is the environment setup:
Apache Cassandra version - 3.8.0
Talend Version - 6.4
Apache Cassandra driver - cassandra-driver-core-3.0.0-shaded.jar
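The driver prints that warning when it has to re-prepare a statement on a node, for example because the node restarted and lost its prepared-statement cache, or because the statement is prepared repeatedly instead of once. A minimal sketch of preparing once and reusing the statement with the 3.x Java driver; the contact point, keyspace, table columns, and row count below are placeholders, not taken from the actual job:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class CassandraLoad {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("xx.xx.xx.xx").build();
             Session session = cluster.connect("my_keyspace")) {
            // Prepare ONCE, outside the loading loop, and reuse the PreparedStatement.
            PreparedStatement insert = session.prepare(
                "INSERT INTO abc (id, name, value) VALUES (?, ?, ?)");
            for (int i = 0; i < 1000; i++) {
                BoundStatement bound = insert.bind(i, "row-" + i, (double) i);
                session.execute(bound);
            }
        }
    }
}

If the statement is already prepared only once and the warning keeps appearing, that typically points at nodes restarting or the driver reconnecting rather than at the query itself.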

Using Apache Ignite, some expired data remains in memory even with TTL enabled and after the cleanup thread executes

Issue
Create an Ignite client (with client mode false) and put some data (10k entries/values) into it with a very small expiration time (~20s) and TTL enabled.
Each time the cleanup thread runs it removes all the entries that have expired, but after a few attempts the thread no longer removes all the expired entries; some of them stay in memory and are not removed by the thread's execution.
That means we end up with expired data in memory, which is something we want to avoid.
Can you confirm whether this is a real issue or just a misuse/misconfiguration of my setup?
Thanks for your feedback.
Test
I've tried three different setups: full local mode (embedded server) on macOS, a remote server using one node in Docker, and a remote cluster using 3 nodes in Kubernetes.
To reproduce
Git repo: https://github.com/panes/ignite-sample
Run MyIgniteLoadRunnerTest.run() to reproduce the issue described above.
(Global setup: writing 10k entries of 64 octets each with a TTL of 10s)
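For readers who don't want to open the repo, a minimal sketch of this kind of setup (the entry count, value size, and eager TTL are taken from the description above; the cache name, the 20-second expiry, and the single embedded node are assumptions):

import java.util.concurrent.TimeUnit;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class TtlSample {
    public static void main(String[] args) {
        CacheConfiguration<Integer, byte[]> ccfg = new CacheConfiguration<>("ttlCache");
        ccfg.setEagerTtl(true); // enables the background cleanup thread mentioned in the issue
        ccfg.setExpiryPolicyFactory(
            CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.SECONDS, 20)));

        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, byte[]> cache = ignite.getOrCreateCache(ccfg);
            for (int i = 0; i < 10_000; i++) {
                cache.put(i, new byte[64]); // ~64-octet values, as in the test
            }
            // After the expiry window, cache.size() should eventually drop to zero;
            // the report above is that some entries survive.
        }
    }
}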
It seems to be a known issue; here's the link to track it: https://issues.apache.org/jira/browse/IGNITE-11438. The fix is to be included in the Ignite 2.8 release, and as far as I know it has already been released as part of GridGain Community Edition.

Spring Mongo's find operation freezes on a Windows Server 2008 machine

I am currently using Spring's Mongo persistence layer for querying MongoDB. The collection I query contains about 4 GB of data. When I run the find code in my IDE it retrieves the data. However, when I run the same code on my server, it freezes for about 15 to 20 minutes and eventually throws the error below. My concern is that it runs without a hitch in my IDE on my 4 GB RAM Windows PC and fails on my 14 GB RAM server. I have looked through the Mongo log, and there's nothing there that points to the problem. I also assumed that the problem might be environmental, since it works in my local Spring IDE, but the libraries on my local PC are the same as the ones on my server. Has anyone had this kind of issue, or can anyone point me to what I'm doing wrong? Also, weirdly, the find operation works when I revert to the Mongo Java driver's find methods.
I'm using mongo-java-driver - 2.12.1
spring-data-mongodb - 1.7.0.RELEASE
See below sample find operation code and error message.
List<HTObject> empObjects = mongoOperations.find(new Query(Criteria.where("date").gte(dateS).lte(dateE)), HTObject.class);
The exception I get is:
09:42:01.436 [main] DEBUG o.s.data.mongodb.core.MongoDbUtils - Getting Mongo Database name=[Hansard]
Exception in thread "main" org.springframework.dao.DataAccessResourceFailureException: Cursor 185020098546 not found on server 172.30.128.155:27017; nested exception is com.mongodb.MongoException$CursorNotFound: Cursor 185020098546 not found on server 172.30.128.155:27017
at org.springframework.data.mongodb.core.MongoExceptionTranslator.translateExceptionIfPossible(MongoExceptionTranslator.java:73)
at org.springframework.data.mongodb.core.MongoTemplate.potentiallyConvertRuntimeException(MongoTemplate.java:2002)
at org.springframework.data.mongodb.core.MongoTemplate.executeFindMultiInternal(MongoTemplate.java:1885)
at org.springframework.data.mongodb.core.MongoTemplate.doFind(MongoTemplate.java:1696)
at org.springframework.data.mongodb.core.MongoTemplate.doFind(MongoTemplate.java:1679)
at org.springframework.data.mongodb.core.MongoTemplate.find(MongoTemplate.java:598)
at org.springframework.data.mongodb.core.MongoTemplate.find(MongoTemplate.java:589)
at com.sa.dbObject.TestDb.main(TestDb.java:74)
Caused by: com.mongodb.MongoException$CursorNotFound: Cursor 185020098546 not found on server 172.30.128.155:27017
at com.mongodb.QueryResultIterator.throwOnQueryFailure(QueryResultIterator.java:218)
at com.mongodb.QueryResultIterator.init(QueryResultIterator.java:198)
at com.mongodb.QueryResultIterator.initFromQueryResponse(QueryResultIterator.java:176)
at com.mongodb.QueryResultIterator.getMore(QueryResultIterator.java:141)
at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:127)
at com.mongodb.DBCursor._hasNext(DBCursor.java:551)
at com.mongodb.DBCursor.hasNext(DBCursor.java:571)
at org.springframework.data.mongodb.core.MongoTemplate.executeFindMultiInternal(MongoTemplate.java:1871)
... 5 more
In short
The MongoDB result cursor is no longer available on the server.
Explanation
This can happen when using sharding and a connection to a mongos fails over, or if you run into timeouts (see http://docs.mongodb.org/manual/core/cursors/#closure-of-inactive-cursors).
You're performing a query that loads all objects into one list (mongoOperations.find). Depending on the result size, this may take a long time. Using an Iterator can help, but even loading huge amounts of data through an Iterator hits limits at some point.
You should partition the results if you have to query very large amounts of data, either with paging (paging gets slower the more records you skip) or by splitting your query range into chunks (you already have a date range, so this could work); see the sketch below.
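A minimal sketch of the range-split idea, reusing the HTObject class and date criteria from the question; the one-day window, method name, and accumulation into a single list are assumptions for illustration:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public class ChunkedFind {
    private static final long ONE_DAY_MS = 24L * 60 * 60 * 1000;

    // Fetches [dateS, dateE] in one-day chunks instead of one huge query.
    public static List<HTObject> findInChunks(MongoOperations mongoOperations,
                                              Date dateS, Date dateE) {
        List<HTObject> result = new ArrayList<HTObject>();
        Date chunkStart = dateS;
        while (chunkStart.before(dateE)) {
            Date chunkEnd = new Date(Math.min(chunkStart.getTime() + ONE_DAY_MS,
                                              dateE.getTime()));
            Criteria criteria = Criteria.where("date").gte(chunkStart);
            // Keep the upper bound inclusive only for the final chunk, as in the original query.
            criteria = chunkEnd.before(dateE) ? criteria.lt(chunkEnd) : criteria.lte(dateE);
            // Process or persist each chunk here instead of accumulating,
            // if the full result does not fit in memory.
            result.addAll(mongoOperations.find(new Query(criteria), HTObject.class));
            chunkStart = chunkEnd;
        }
        return result;
    }
}

Each chunk opens and fully drains its own cursor quickly, so no single cursor stays idle long enough to be reaped by the server.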

Talend ETL Job Error in tOracleOutput Component

I am a newbie to Talend ETL and am using Talend Open Studio for Big Data version 5.4.1. I have developed a simple Talend ETL job that picks up data from a CSV file and inserts it into my local Oracle database. Below is how my package looks:
The job throws an ArrayIndexOutOfBoundsException after the last record of the CSV file, but I'm uncertain why it should do that in the first place. I checked out the solution given at this link: http://www.talendforge.org/forum/viewtopic.php?id=21644
But it doesn't seem to work at all. I have the latest driver for the Oracle component, and increasing/decreasing the commit size does not seem to affect it.
Can someone please help me out on this? Please let me know in case more information is needed.
P.S.: The complete error log is below:
Starting job Kaggle_Data_Load_Training at 09:31 25/06/2014.
[statistics] connecting to socket on port 3957
[statistics] connected
Exception in component tOracleOutput_1
java.lang.ArrayIndexOutOfBoundsException: -32203
at oracle.jdbc.driver.OraclePreparedStatement.setupBindBuffers(OraclePreparedStatement.java:2677)
at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:9270)
at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:210)
at test.kaggle_data_load_training_0_1.Kaggle_Data_Load_Training.tFileInputDelimited_1Process(Kaggle_Data_Load_Training.java:4360)
at test.kaggle_data_load_training_0_1.Kaggle_Data_Load_Training.runJobInTOS(Kaggle_Data_Load_Training.java:4717)
at test.kaggle_data_load_training_0_1.Kaggle_Data_Load_Training.main(Kaggle_Data_Load_Training.java:4582)
[statistics] disconnected
Job Kaggle_Data_Load_Training ended at 09:31 25/06/2014. [exit code=1]
Can you try to decrease the commit size on the tOracleOutput component? I remember there is some kind of bug in TOS 5.4.1 which resulted in this error. Please lower the commit size (say, to 500) and see if the problem still exists. Here's more information about the bug: http://www.talendforge.org/forum/viewtopic.php?id=5931
I had the same issue in Talend 6.2.1.
It can be resolved by updating the DB version in the connection's metadata.
The same is confirmed on the Talend blog.

Understanding the real reason behind a Hive failure

I'm using a JDBC driver to run "describe TABLE_NAME" on Hive. It gives me the following error:
NativeException: java.sql.SQLException: Query returned non-zero code: 9, cause: FAILED:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
return code 1 doesn't tell me very much. How do I figure out what the underlying reason is?
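For context, this is roughly the kind of call involved, sketched against the HiveServer2 JDBC driver (the original setup may use the older HiveServer1 driver); the URL, credentials, and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DescribeTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("DESCRIBE my_table")) {
            while (rs.next()) {
                // DESCRIBE returns the column name and type per row.
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}

The JDBC exception only surfaces the return code; the actual cause (a missing table, a locked or misconfigured metastore, and so on) normally shows up in the Hive server logs, which is what the answers below come down to.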
It's most likely because your Hive metastore is not set up properly. Hive uses an RDBMS metastore to store metadata about its tables. This includes things like table names, schemas, partitioning/bucketing/sorting columns, table-level statistics, etc.
By default, Hive uses an embedded Derby metastore which can only be accessed by one process at a time. If you are using that, it's possible that you have multiple Hive sessions open and that is causing this problem.
In any case, I would recommend setting up a standalone metastore for Hive. Embedded Derby was chosen because it works well for running tests and gives a usable out-of-the-box metastore; however, in my opinion, it's not fit for production workloads. You can find instructions on how to configure MySQL as the Hive metastore here.
Possibly you have another session open, since Derby allows only one session at a time.
You can check with:
ps -wwwfu <your id>
and kill the process that is holding the Hive connection.
It is because the table with the name you've specified does not exist in the database.
Create the table and run the command again; it will work. :)
