I am getting the error below when running Solr replication:
2013-12-27 05:03:32,391 [explicit-fetchindex-cmd] ERROR org.apache.solr.handler.ReplicationHandler- SnapPull failed :org.apache.solr.common.SolrException: Index fetch failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:485)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:319)
at org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:220)
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/apps/search/data/customers/solr/solr/adidas-archive/data/index.20131227050332242/segments_a")
at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:78)
at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:41)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:84)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:320)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:380)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:663)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:376)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:711)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:267)
at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:179)
at org.apache.solr.update.DirectUpdateHandler2.newIndexWriter(DirectUpdateHandler2.java:632)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:469)
... 2 more
My setup is master on 3.x and slave on 4.x.
This happened when I copied a very large index (100 GB+). What does this error mean? Does it mean the index has become corrupt, and if so, what can be done to fix it? Any thoughts?
I ran the CheckIndex utility, but it again gives the error:
ERROR: could not read any segments file in directory
java.io.EOFException: read past EOF: MMapIndexInput(path="/apps/search/data/customers/solr/solr/adidas-archive/data/index.20131227051833263/segments_a")
at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:78)
at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:41)
When upgrading from Solr 3.x to 4.x, you can use either of the following methods:
1. Rebuild the index from scratch (time-consuming), or
2. Optimize the index.
What I found when upgrading from 3.6 to 4.2 was that if your index is more than 100 GB (mine was over 170 GB in Solr 3.6), it is more fruitful to rebuild the index from scratch, since new compression techniques are now implemented (my index size dropped from 170 GB to 115 GB in Solr 4.2). Otherwise, after optimization your index will be accepted by Solr 4.x, but its size will remain the same.
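If you go the optimize route, one way to trigger it is through Solr's update handler over HTTP. A minimal Python sketch, assuming a core named adidas-archive on localhost (host, port and core name are placeholders for your own setup):

import requests

# ask Solr to merge the index down to a single segment (optimize) and wait for it
resp = requests.get("http://localhost:8983/solr/adidas-archive/update",
                    params={"optimize": "true", "waitSearcher": "true"})
print(resp.status_code)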
Please don't use different versions in the same replication setup.
If you have been doing so in the past, please share the details; it would be really helpful.
regards
rajat
Using the code:
all_reviews = db_handle.find().sort('reviewDate', pymongo.ASCENDING)
print all_reviews.count()
print all_reviews[0]
print all_reviews[2000000]
The count prints 2043484, and all_reviews[0] prints fine.
However, when printing all_reviews[2000000], I get the error:
pymongo.errors.OperationFailure: database error: Runner error: Overflow sort stage buffered data usage of 33554495 bytes exceeds internal limit of 33554432 bytes
How do I handle this?
You're running into the 32MB limit on an in-memory sort:
https://docs.mongodb.com/manual/reference/limits/#Sort-Operations
Add an index to the sort field. That allows MongoDB to stream documents to you in sorted order, rather than attempting to load them all into memory on the server and sort them in memory before sending them to the client.
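In pymongo, creating that index looks something like this (create_index is the current method name; older drivers used ensure_index, as in the answer further down):

import pymongo

# ascending index on the sort field, so the server can return documents in index
# order instead of buffering and sorting the whole result set in memory
db_handle.create_index([("reviewDate", pymongo.ASCENDING)])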
As kumar_harsh said in the comments section, I would like to add another point.
You can view the current limit using the command below against the admin database:
> use admin
switched to db admin
> db.runCommand( { getParameter : 1, "internalQueryExecMaxBlockingSortBytes" : 1 } )
{ "internalQueryExecMaxBlockingSortBytes" : 33554432, "ok" : 1 }
It has a default value of 32 MB (33554432 bytes). In this case you're exceeding that buffer, so you can raise the limit to an optimal value of your own choosing, for example roughly 50 MB, as below:
> db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes:50151432})
{ "was" : 33554432, "ok" : 1 }
You can also set this limit permanently with the following parameter in the MongoDB config file:
setParameter=internalQueryExecMaxBlockingSortBytes=309715200
Hope this helps!
Note: this command is only supported from version 3.0 onwards.
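Since the question uses pymongo, the same admin commands can also be issued from Python. A rough sketch, assuming a client connected with sufficient privileges (the connection string is a placeholder):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
admin = client.admin

# read the current limit (32 MB by default); the command name must be the first key
print(admin.command({"getParameter": 1, "internalQueryExecMaxBlockingSortBytes": 1}))

# raise it, e.g. to the ~50 MB value used above
admin.command({"setParameter": 1, "internalQueryExecMaxBlockingSortBytes": 50151432})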
Solved with indexing:
db_handle.ensure_index([("reviewDate", pymongo.ASCENDING)])
If you want to avoid creating an index (e.g. you just want a quick-and-dirty check to explore the data), you can use aggregation with disk usage:
all_reviews = db_handle.aggregate([{"$sort": {"reviewDate": 1}}], allowDiskUse=True)
(In the mongo shell, the equivalent is passing {allowDiskUse: true} as the options document to aggregate.)
Mongo shell syntax for the same index:
db_handle.ensureIndex({reviewDate: 1})
In my case, it was necessary to declare the needed indexes in code and recreate them:
rake db:mongoid:create_indexes RAILS_ENV=production
The in-memory sort overflow does not occur when the sorted field has the needed index.
P.S. Before this I had to disable the errors for overly long index keys:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
A reIndex may also be needed:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> use your_db
switched to db your_db
> db.getCollectionNames().forEach( function(collection){ db[collection].reIndex() } )
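A rough pymongo equivalent of that reindex loop, assuming db is a pymongo Database handle (reIndex and collection_names() are deprecated or removed in newer server and driver versions):

# rebuild the indexes of every collection in the database
for name in db.collection_names():      # list_collection_names() on newer drivers
    db.command("reIndex", name)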
I've created (in code) a default collection in MongoDB and am querying it, and have discovered that while the code will return all the data when I run it locally, it won't when I query it on a deployment server. It returns a maximum of 256 records.
Notes:
This is not a capped collection.
Locally, I'm running 3.2.5, the remote MongoDB version is 2.4.12
I am not using the limit parameter. When I use it, I can limit both the local and deployment server, but the deployment server will still never return more than 256 records.
The amount of data being fetched from the server is <500K. Nothing huge.
The code is in Clojure, using Monger, which itself just calls the Java com.mongodb stuff.
I can pull in more than 256 records from the remote server using Robomongo though I'm not sure how it does this, as I cannot connect to the remote from the command line (auth failed using the same credentials, so I'm guessing version incompatibility there).
Any help is appreciated.
UPDATE: Found the thing that triggers the problem: when I sort the output, it reduces the output to 256, but only when I pull from Mongo 2.4! I don't know if this is MongoDB itself, the MongoDB Java driver, or Monger, but here is the code that illustrates the issue, as simply as I could make it:
(ns mdbtest.core
(:require [monger.core :as mg]
[monger.query :as mq]))
(defn get-list []
(let [coll (mq/with-collection
(mg/get-db
(mg/connect {:host "old-mongo"}) "mydb") "saves"
(mq/sort (array-map :createdDate -1)))] ;;<<==remove sort
coll))
You need to specify a bigger batch size; the default is 256 records.
Here's an example from my own code:
=> (count (with-db (q/find {:keywords "lisa"})
(q/sort {:datetime 1}) ))
256
=> (count (with-db (q/find {:keywords "lisa"})
(q/sort {:datetime 1})
(q/batch-size 1000) ))
688
See more info here: http://clojuremongodb.info/articles/querying.html#setting_batch_size
I am running this code on EMR 4.6.0 + Spark 1.6.1:
val sqlContext = SQLContext.getOrCreate(sc)
val inputRDD = sqlContext.read.json(input)
try {
inputRDD.filter("`first_field` is not null OR `second_field` is not null").toJSON.coalesce(10).saveAsTextFile(output)
logger.info("DONE!")
} catch {
case e : Throwable => logger.error("ERROR" + e.getMessage)
}
In the last stage of saveAsTextFile, it fails with this error:
16/07/15 08:27:45 ERROR codegen.GenerateUnsafeProjection: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0xFFFF
/* 001 */
/* 002 */ public java.lang.Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] exprs) {
/* 003 */ return new SpecificUnsafeProjection(exprs);
/* 004 */ }
(...)
What could be the reason? Thanks
I solved this problem by dropping all the unused columns in the DataFrame, i.e. by selecting only the columns you actually need.
It turns out a Spark DataFrame cannot handle super-wide schemas. There is no specific number of columns at which Spark breaks with “Constant pool has grown past JVM limit of 0xFFFF” - it depends on the kind of query, but reducing the number of columns can help to work around this issue.
The underlying root cause is the JVM's 64 KB limit on generated Java classes - see also Andrew's answer.
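For illustration, a minimal sketch of the same idea in PySpark (the Scala API from the question is analogous). The paths are placeholders, and first_field/second_field are the columns from the question; keep whichever additional columns your output actually needs:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

input_path = "s3://my-bucket/input"      # placeholder
output_path = "s3://my-bucket/output"    # placeholder

df = sqlContext.read.json(input_path)

# keep only the columns the output needs; a narrower schema keeps the generated
# unsafe-projection class small enough to stay under the constant-pool limit
slim = df.select("first_field", "second_field")

slim.filter("first_field is not null OR second_field is not null") \
    .toJSON().coalesce(10).saveAsTextFile(output_path)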
This is due to a known Java limitation on generated classes: a class's constant pool cannot grow past 0xFFFF (65535) entries.
This limitation has been worked around in SPARK-18016, which is fixed in Spark 2.3 (due for release in Jan 2018).
For future reference, this issue was fixed in Spark 2.3 (as Andrew noted).
If you encounter this issue on Amazon EMR, upgrade to release version 5.13 or above.
I've got a 3-machine Cassandra cluster using the rack-unaware placement strategy with a replication factor of 2.
The column family is defined as follows:
create column family UserGeneralStats with comparator = UTF8Type and default_validation_class = CounterColumnType;
Unfortunately after a few days of production use I got some inconsistent values for the counters:
Query on replica 1:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=96545030198)
=> (counter=downloads, value=1013)
=> (counter=previews, value=10304)
Query on replica 2:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=9140386229)
=> (counter=downloads, value=339)
=> (counter=previews, value=1321)
As the standard read repair mechanism doesn't seem to repair the values, I tried to force an anti-entropy repair using nodetool repair. It didn't have any effect on the counter values.
Data inspection showed that the lower values for the counters are the correct ones, so I suspect that either Cassandra or Hector (which I used as the API to call Cassandra from Java) retried some increments.
Any ideas how to repair the data and possibly prevent the situation from happening again?
If neither read repair nor nodetool repair fixes it, it's probably a bug.
Please upgrade to 0.8.3 (out today) and verify it's still present in that version, then you can file a ticket at https://issues.apache.org/jira/browse/CASSANDRA.
I am trying to import a large dataset (41 million records) into a new Solr index. I have set up the core and it works; I inserted some test docs and they work. I set up the data-config.xml as below and then started the full-import. After about 12 hours(!) the import fails.
The document size can get quite large; could the error be caused by a large document (or field), or by the sheer volume of data going into the DataImportHandler?
How can I get this frustrating import task working!?!
I have included the Tomcat error log below.
Let me know if there is any info I have missed!
logs:
Jun 1, 2011 5:47:55 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity results with URL: jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor
Jun 1, 2011 5:47:56 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 1185
Jun 1, 2011 5:48:02 PM org.apache.solr.core.SolrCore execute
INFO: [results] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=0
...
Jun 2, 2011 5:16:32 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:664)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringDecoder.decode(Unknown Source)
at java.lang.StringCoding.decode(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:419)
at com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:1974)
at com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:175)
at com.microsoft.sqlserver.jdbc.Column.getValue(Column.java:113)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:1982)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:1967)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getObject(SQLServerResultSet.java:2256)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getObject(SQLServerResultSet.java:2265)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.getARow(JdbcDataSource.java:286)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$700(JdbcDataSource.java:228)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.next(JdbcDataSource.java:266)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.next(JdbcDataSource.java:260)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:78)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
... 5 more
Jun 2, 2011 5:16:32 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Jun 2, 2011 5:16:44 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor"
user="sa"
password="password"/>
<document>
<entity name="results" query="SELECT fielda, fieldb, fieldc FROM mydb.[dbo].mytable WITH (NOLOCK)">
<field column="fielda" name="fielda"/><field column="fieldb" name="fieldb"/><field column="fieldc" name="fieldc"/>
</entity>
</document>
</dataConfig>
solrconfig.xml snippet:
<indexDefaults>
<useCompoundFile>false</useCompoundFile>
<mergeFactor>25</mergeFactor>
<ramBufferSizeMB>128</ramBufferSizeMB>
<maxFieldLength>100000</maxFieldLength>
<writeLockTimeout>10000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
</indexDefaults>
<mainIndex>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>128</ramBufferSizeMB>
<mergeFactor>25</mergeFactor>
<infoStream file="INFOSTREAM.txt">true</infoStream>
</mainIndex>
Java config settings: initial memory 128 MB, max 512 MB
Environment:
Solr 3.1
Tomcat 7.0.12
Windows Server 2008
Java: 6 update 25 (build 1.6.0_25-b06)
(data coming from: SQL Server 2008 R2)
/admin/stats.jsp - DataImportHandler
Status : IDLE
Documents Processed : 2503083
Requests made to DataSource : 1
Rows Fetched : 2503083
Documents Deleted : 0
Documents Skipped : 0
Total Documents Processed : 0
Total Requests made to DataSource : 0
Total Rows Fetched : 0
Total Documents Deleted : 0
Total Documents Skipped : 0
handlerStart : 1306759913518
requests : 9
errors : 0
EDIT: I am currently running a SQL query to find out the largest single record's field length, as I think this is probably the cause of the exception. I'm also running the import again with JConsole to monitor heap usage.
EDIT: Read the Solr performance factors page; changed maxFieldLength to 1000000 and ramBufferSizeMB to 256. Now for another import run (yay...).
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringDecoder.decode(Unknown Source)
at java.lang.StringCoding.decode(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:419)
makes it pretty obvious that the MS JDBC driver is running out of RAM. Many JDBC drivers default to fetching all of their results into memory at once, so see whether this can be tuned, or consider using the open-source jTDS driver, which is generally better behaved anyway.
I don't believe maxFieldLength is going to help you - that affects how much Lucene truncates, not how much is initially transferred. Another option is to transfer only a selection at a time, say 1 million rows, using TOP and ROW_NUMBER() and the like for paging.
I was able to successfully import a large table over JDBC from SQL Server without running into out-of-memory errors by batching the queries manually; in my case, in 256 batches:
data-config.xml:
<dataConfig>
<dataSource
type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor"
user="sa"
password="password"
autoCommit="true"
batchSize="10000" />
<document>
<entity
name="batch"
query="
with Batch as (
select 0 BatchId
union all
select BatchId + 1
from Batch
where BatchId < 255
)
select BatchId FROM Batch OPTION (MAXRECURSION 500)
"
dataSource="JdbcDataSource"
rootEntity="false">
<entity name="results" query="SELECT fielda, fieldb, fieldc FROM mydb.[dbo].mytable WITH (NOLOCK) WHERE CONVERT(varbinary(1),fielda) = ${batch.BatchId}">
<field column="fielda" name="fielda"/><field column="fieldb" name="fieldb"/><field column="fieldc" name="fieldc"/>
</entity>
</entity>
</document>
</dataConfig>
The parent entity batch is a recursive SQL Server CTE that returns the byte values 0-255. It is used to filter the child entity results.
Note 1: the WHERE condition (i.e. CONVERT(varbinary(1),fielda) = ${batch.BatchId}) should be adjusted to the type and contents of the partitioning field so that the results are split into roughly equal batches, e.g. use fielda % 255 = ${batch.BatchId} if fielda is a number. In my case fielda was a uniqueidentifier, so the first byte was sufficient.
Note 2: rootEntity="false" on the batch entity is required; it indicates that the batch entity is NOT a root document.
For MySQL, the following works. As per the Solr wiki:
DataImportHandler is designed to stream rows one by one. It passes a fetch size value (default: 500) to Statement#setFetchSize, which some drivers do not honor. For MySQL, add the batchSize property to the dataSource configuration with a value of -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from running out of memory for large tables.
Should look like:
<dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:8889/mysqldatabase" batchSize="-1" user="root" password="root"/>
Your Java config settings might not be enough for this job: initial memory 128 MB, max 512 MB.
Try the cargo-cult fix for OutOfMemoryError: increase the heap size. Make it -Xmx1024M or more if you can afford it, with an initial size of 512 MB (-Xms512M).
I've found that with large record sets things can get a bit hairy. You do have a couple of options.
The quick and most likely best option is to allocate more memory to the indexing system! Memory (for the most part) is pretty cheap.
The other thing I might try is chunking the data.
Depending on the size of your "documents", 41M docs may bog you down when it comes to searching. You may want to shard the documents. When using DIH, I try to facilitate partitioning using a query-string parameter. You can do this by referencing the parameter passed into DIH with ${dataimporter.request.MYPARAMETERNAME} inside the query statement, as in the sketch below.
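As an illustration of that idea, here is a rough Python sketch that drives partitioned imports by passing a request parameter to DIH. The endpoint URL and the parameter name bucket are placeholders; the data-config query would reference it as ${dataimporter.request.bucket} in its WHERE clause:

import time
import requests

DIH_URL = "http://localhost:8080/solr/dataimport"   # placeholder DIH endpoint

def wait_until_idle():
    # poll the DIH status command until the previous import has finished
    while "busy" in requests.get(DIH_URL, params={"command": "status"}).text:
        time.sleep(30)

# one import per bucket; clean=false so each pass adds to the index instead of
# wiping what the previous buckets imported
for bucket in range(4):
    wait_until_idle()
    requests.get(DIH_URL, params={"command": "full-import",
                                  "clean": "false",
                                  "bucket": bucket})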