Lucene 4.x performance issues - java

Over the last few weeks I've been working on upgrading an application from Lucene 3.x to Lucene 4.x in hopes of improving performance. Unfortunately, after going through the full migration process and playing with all sorts of tweaks I found online and in the documentation, Lucene 4 is running significantly slower than Lucene 3 (~50%). I'm pretty much out of ideas at this point, and was wondering if anyone else had any suggestions on how to bring it up to speed. I'm not even looking for a big improvement over 3.x anymore; I'd be happy to just match it and stay on a current release of Lucene.
<Edit>
In order to confirm that none of the standard migration changes had a negative effect on performance, I ported my Lucene 4.x version back to Lucene 3.6.2 and kept the newer API rather than using the custom ParallelMultiSearcher and other deprecated methods/classes.
Performance in 3.6.2 is even faster than before:
Old application (Lucene 3.6.0) - ~5700 requests/min
Updated application with new API and some minor optimizations (Lucene 4.4.0) - ~2900 requests/min
New version of the application ported back, but retaining optimizations and newer IndexSearcher/etc API (Lucene 3.6.2) - ~6200 requests/min
Since the optimizations and use of the newer Lucene API actually improved performance on 3.6.2, it doesn't make sense for this to be a problem with anything but Lucene. I just don't know what else I can change in my program to fix it.
</Edit>
Application Information
We have one index that is broken into 20 shards - this provided the best performance in both Lucene 3.x and Lucene 4.x
The index currently contains ~150 million documents, all of which are fairly simple and heavily normalized so there are a lot of duplicate tokens. Only one field (an ID) is stored - the others are not retrievable.
We have a fixed set of relatively simple queries that are populated with user input and executed - they are composed of multiple BooleanQueries, TermQueries and TermRangeQueries. Some of them are nested, but only a single level right now (a rough sketch of the query shape follows this list).
We're not doing anything advanced with results - we just fetch the scores and the stored ID fields
We're using MMapDirectories pointing to index files in a tmpfs. We played with the useUnmap "hack" since we don't open new Directories very often and got a nice boost from that
We're using a single IndexSearcher for all queries
Our test machines have 94GB of RAM and 64 logical cores
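For reference, a minimal sketch of the query shape described above, assuming Lucene 4.x classes; the field names and values are placeholders, not our real schema:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;

public class QueryShapeExample {
    // BooleanQuery over TermQueries and a TermRangeQuery, with one nested level.
    public static Query buildQuery(String name, String minCode, String maxCode) {
        BooleanQuery outer = new BooleanQuery();
        outer.add(new TermQuery(new Term("name", name)), Occur.MUST);
        outer.add(TermRangeQuery.newStringRange("code", minCode, maxCode, true, true), Occur.MUST);

        BooleanQuery nested = new BooleanQuery();   // nested, but only a single level
        nested.add(new TermQuery(new Term("type", "a")), Occur.SHOULD);
        nested.add(new TermQuery(new Term("type", "b")), Occur.SHOULD);
        outer.add(nested, Occur.MUST);
        return outer;
    }
}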
General Processing
1) Request received by socket listener
2) Up to 4 Query objects are generated and populated with normalized user input (all of the required input for a query must be present or it won't be executed)
3) Queries are executed in parallel using the Fork/Join framework
Subqueries to each shard are executed in parallel using the IndexSearcher w/ExecutorService
4) Aggregation and other simple post-processing
Other Relevant Info
Indexes were recreated for the 4.x system, but the data is the same. We tried the normal Lucene42 codec as well as an extended one that didn't use compression (per a suggestion on the web)
In 3.x we used a modified version of the ParallelMultiSearcher; in 4.x we're using the IndexSearcher with an ExecutorService and combining all of our readers in a MultiReader (a sketch of this setup follows this list)
In 3.x we used a ThreadPoolExecutor instead of Fork/Join (Fork/Join performed better in my tests)
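A minimal sketch of that 4.x searcher setup, assuming Lucene 4.4 APIs; the shard paths, unmap handling and executor sizing are placeholders:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class ShardedSearcherFactory {
    // One reader per shard directory on tmpfs, combined into a MultiReader and
    // searched by a single IndexSearcher whose ExecutorService fans work out
    // across the underlying segments.
    public static IndexSearcher open(List<String> shardPaths, ExecutorService executor) throws Exception {
        List<IndexReader> readers = new ArrayList<IndexReader>();
        for (String path : shardPaths) {
            MMapDirectory dir = new MMapDirectory(new File(path)); // hypothetical shard path
            readers.add(DirectoryReader.open(dir));
        }
        IndexReader combined = new MultiReader(readers.toArray(new IndexReader[readers.size()]));
        return new IndexSearcher(combined, executor);
    }
}

Note that with this constructor the executor parallelizes per segment of the combined reader rather than per shard, which is exactly the behavior discussed at the end of this post.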
4.x Hot Spots
Method | Self Time (%) | Self Time (ms)| Self Time (CPU in ms)
java.util.concurrent.CountDownLatch.await() | 11.29% | 140887.219 | 0.0 <- this is just from tcp threads waiting for the real work to finish - you can ignore it
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>() | 9.74% | 121594.03 | 121594
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>() | 9.59% | 119680.956 | 119680
org.apache.lucene.codecs.lucene41.ForUtil.readBlock() | 6.91% | 86208.621 | 86208
org.apache.lucene.search.DisjunctionScorer.heapAdjust() | 6.68% | 83332.525 | 83332
java.util.concurrent.ExecutorCompletionService.take() | 5.29% | 66081.499 | 6153
org.apache.lucene.search.DisjunctionSumScorer.afterNext() | 4.93% | 61560.872 | 61560
org.apache.lucene.search.TermScorer.advance() | 4.53% | 56530.752 | 56530
java.nio.DirectByteBuffer.get() | 3.96% | 49470.349 | 49470
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>() | 2.97% | 37051.644 | 37051
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getFrame() | 2.77% | 34576.54 | 34576
org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo() | 2.47% | 30767.711 | 30767
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.newTermState() | 2.23% | 27782.522 | 27782
java.net.ServerSocket.accept() | 2.19% | 27380.696 | 0.0
org.apache.lucene.search.DisjunctionSumScorer.advance() | 1.82% | 22775.325 | 22775
org.apache.lucene.search.HitQueue.getSentinelObject() | 1.59% | 19869.871 | 19869
org.apache.lucene.store.ByteBufferIndexInput.buildSlice() | 1.43% | 17861.148 | 17861
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getArc() | 1.35% | 16813.927 | 16813
org.apache.lucene.search.DisjunctionSumScorer.countMatches() | 1.25% | 15603.283 | 15603
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() | 1.12% | 13929.646 | 13929
java.util.concurrent.locks.ReentrantLock.lock() | 1.05% | 13145.631 | 8618
org.apache.lucene.util.PriorityQueue.downHeap() | 1.00% | 12513.406 | 12513
java.util.TreeMap.get() | 0.89% | 11070.192 | 11070
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs() | 0.80% | 10026.117 | 10026
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() | 0.62% | 7746.05 | 7746
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator() | 0.60% | 7482.395 | 7482
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact() | 0.55% | 6863.069 | 6863
org.apache.lucene.store.DataInput.clone() | 0.54% | 6721.357 | 6721
java.nio.DirectByteBufferR.duplicate() | 0.48% | 5930.226 | 5930
org.apache.lucene.util.fst.ByteSequenceOutputs.read() | 0.46% | 5708.354 | 5708
org.apache.lucene.util.fst.FST.findTargetArc() | 0.45% | 5601.63 | 5601
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock() | 0.45% | 5567.914 | 5567
org.apache.lucene.store.ByteBufferIndexInput.toString() | 0.39% | 4889.302 | 4889
org.apache.lucene.codecs.lucene41.Lucene41SkipReader.<init>() | 0.33% | 4147.285 | 4147
org.apache.lucene.search.TermQuery$TermWeight.scorer() | 0.32% | 4045.912 | 4045
org.apache.lucene.codecs.MultiLevelSkipListReader.<init>() | 0.31% | 3890.399 | 3890
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() | 0.31% | 3886.194 | 3886
If there's any other information you could use that might help, please let me know.

For anyone who cares or is trying to do something similar (controlled parallelism within a query), the problem I had was that the IndexSearcher was creating a task per segment per shard rather than a task per shard - I misread the javadoc.
I resolved the problem by using forceMerge(1) on my shards to limit the number of extra threads. In my use case this isn't a big deal since I don't currently use NRT search, but it still adds unnecessary complexity to the update + slave synchronization process, so I'm looking into ways to avoid the forceMerge.
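For illustration, merging a shard down to one segment might look like this sketch, assuming Lucene 4.4; the analyzer choice is a placeholder:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class ShardMerger {
    // Collapse a shard to a single segment so the searcher's per-segment
    // fan-out yields exactly one task per shard. Expensive, but tolerable
    // here because NRT search isn't being used.
    public static void mergeToOneSegment(Directory shardDir) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_44,
                new StandardAnalyzer(Version.LUCENE_44)); // placeholder analyzer
        IndexWriter writer = new IndexWriter(shardDir, cfg);
        try {
            writer.forceMerge(1);
        } finally {
            writer.close();
        }
    }
}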
As a quick fix, I'll probably just extend the IndexSearcher and have it spawn a thread per reader instead of a thread per segment, but the idea of a "virtual segment" was brought up in the Lucene mailing list. That would be a much better long-term fix.
If you want to see more info, you can follow the lucene mailing list thread here:
http://www.mail-archive.com/java-user@lucene.apache.org/msg42961.html

Related

Kafka performance(writes/sec) drastically drops on increasing number of events/messages

Summary
I am trying to benchmark an off-the-shelf Kafka cluster (3 nodes). I am using the default configuration. Here are the setup details:
Total nodes = 3
Zookeeper = node1
Workers = node1, node2 and node3
Config of each node = 4 core, 15 GB
Kafka Version = 2.7.0
Scala Version = 2.13
Here are the node configurations
The first benchmark objective is to check the write speed. To do that, I wrote a Java program which creates multiple threads, one producer per thread, and writes to the Kafka cluster concurrently.
I am noting the start time and end time to measure the time taken.
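A minimal sketch of that kind of benchmark, assuming the standard KafkaProducer API; the topic name, thread count, message count, and broker addresses below are placeholders:

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerBenchmark {
    public static void main(String[] args) throws Exception {
        int threads = 8;                      // placeholder thread count
        long messagesPerThread = 1_000_000L;  // placeholder message count
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        long start = System.currentTimeMillis();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "node1:9092,node2:9092,node3:9092"); // placeholder addresses
                props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                // One producer per thread, as described above.
                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    for (long i = 0; i < messagesPerThread; i++) {
                        producer.send(new ProducerRecord<>("bench-topic", Long.toString(i), "payload-" + i));
                    }
                    producer.flush(); // make sure buffered records are actually sent before timing stops
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        long elapsedMs = System.currentTimeMillis() - start;

        long total = (long) threads * messagesPerThread;
        System.out.printf("wrote %d messages in %d ms (%.2f msg/sec)%n",
                total, elapsedMs, total * 1000.0 / elapsedMs);
    }
}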
Here is the code
Here are the results
Here are the Stats
The issue
As I increase the number of events, the writes/sec drops.
| Messages | Write/Sec |
|:----------- |:----------|
| 63,965 | 21,236.72 |
| 987,123 | 57,151.63 |
| 20,000,000 | 39,503.60 |
| 50,000,000 | 14,311.95 |
| 99,990,052 | 11,113.97 |
| 300,000,000 | 10,811.26 |
Questions
Why does performance degrade when inserting more data?
What kind of tuning can I do to improve the write speed?
Is my testing methodology correct? Is there a better way to benchmark a Kafka cluster?
Notes
This is my first post on Stack Overflow, so kindly let me know if I can improve anything in the question.

Why is the Java class file format missing constant pool tag 2?

The JVM specification for Java 1.0.2 lists the following constant pool entry types:
+-----------------------------+-------+
| Constant Type | Value |
+-----------------------------+-------+
| CONSTANT_Class | 7 |
| CONSTANT_Fieldref | 9 |
| CONSTANT_Methodref | 10 |
| CONSTANT_InterfaceMethodref | 11 |
| CONSTANT_String | 8 |
| CONSTANT_Integer | 3 |
| CONSTANT_Float | 4 |
| CONSTANT_Long | 5 |
| CONSTANT_Double | 6 |
| CONSTANT_NameAndType | 12 |
| CONSTANT_Utf8 | 1 |
+-----------------------------+-------+
Subsequent JVM specs have added more constant pool entry types but haven't ever filled the "2" spot. Why is there a gap there?
I did some research and found a clue: constant pool tag 2 appears to have been reserved for CONSTANT_Unicode but was never used, because CONSTANT_Utf8 was already there, UTF-8 is widely adopted, it can represent any Unicode constant, and it has a number of advantages over other encoding schemes. I guess this historical fact explains why 2 is missing, and the value could presumably be reused for other purposes if necessary.
Some statements from here:
https://bugs.openjdk.java.net/browse/JDK-8161256
Tags 13 and 14 presumably have their own specific reasons for being reserved but never used.
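As a quick way to see which tags actually appear, here is a minimal sketch that walks a class file's constant pool and prints each entry's tag; it only handles the 1.0.2-era tags listed above, so newer tags would hit the default branch:

import java.io.DataInputStream;
import java.io.FileInputStream;

public class ConstantPoolTags {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            in.readInt();             // magic 0xCAFEBABE
            in.readUnsignedShort();   // minor version
            in.readUnsignedShort();   // major version
            int count = in.readUnsignedShort();
            for (int i = 1; i < count; i++) {
                int tag = in.readUnsignedByte();
                System.out.println("#" + i + " tag=" + tag);
                switch (tag) {
                    case 1:  in.skipBytes(in.readUnsignedShort()); break;      // Utf8: u2 length + bytes
                    case 3: case 4: in.skipBytes(4); break;                    // Integer, Float
                    case 5: case 6: in.skipBytes(8); i++; break;               // Long, Double occupy two slots
                    case 7: case 8: in.skipBytes(2); break;                    // Class, String
                    case 9: case 10: case 11: case 12: in.skipBytes(4); break; // refs, NameAndType
                    default: throw new IllegalStateException("unexpected tag " + tag);
                }
            }
        }
    }
}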

cassandra high volume writes sometimes silently fail

I am recording realtime trade data with the Datastax Cassandra java driver. I have configured Cassandra with a single node, replication factor of 1, and consistency level ALL.
I frequently have writes which do not record, but do not fail. The java client does not throw any errors, and the async execute successful callback is called. Trace doesn't seem to show anything unusual:
[CassandraClient] - Adding to trades memtable on /10.0.0.118[SharedPool-Worker-1] at Mon Dec 22 22:54:04 UTC 2015
[CassandraClient] - Appending to commitlog on /10.0.0.118[SharedPool-Worker-1] at Mon Dec 22 22:54:04 UTC 2015
[CassandraClient] - Coordinator used /10.0.0.118
but, when I look at the data in the cassandra shell, notice the skipped IDs (ignoring bad dates):
cqlsh:keyspace> select * from trades where [...] order by date desc limit 10;
date | id | price | volume
--------------------------+--------+--------+------------
1970-01-17 19:00:19+0000 | 729286 | 435.96 | 3.4410000
1970-01-17 19:00:19+0000 | 729284 | 436.00 | 17.4000000
1970-01-17 19:00:19+0000 | 729283 | 436.00 | 0.1300000
1970-01-17 19:00:19+0000 | 729277 | 436.45 | 5.6972000
1970-01-17 19:00:19+0000 | 729276 | 436.44 | 1.0000000
1970-01-17 19:00:19+0000 | 729275 | 436.44 | 0.9728478
1970-01-17 19:00:19+0000 | 729274 | 436.43 | 0.0700070
1970-01-17 19:00:19+0000 | 729273 | 436.45 | 0.0369260
1970-01-17 19:00:19+0000 | 729272 | 436.43 | 1.0000000
1970-01-17 19:00:19+0000 | 729271 | 436.43 | 1.0000000
Why do some inserts silently fail? Indications point to a timestamp issue, but I don't detect a pattern.
similar question: Cassandra - Write doesn't fail, but values aren't inserted
might be related to: Cassandra update fails silently with several nodes
The fact that the writes succeed and some records are missing is a symptom that C* is overwriting the missing rows. The reason you may see such behavior is the misuse of bound statements.
Usually people prepare the statements with:
PreparedStatement ps = ...;
BoundStatement bs = ps.bind();
then they issue something like:
for (int i = 0; i < myHugeNumberOfRowsToInsert; i++) {
    session.executeAsync(bs.bind(xx));
}
This will actually produce the weird behavior, because the bound statement is the same across most of the executeAsync calls, and if the loop is fast enough to enqueue (say) 6 queries before the driver fires the first query at all, all the submitted queries will share the same bound data. A simple fix is to actually issue different BoundStatement:
for (int i = 0; i < myHugeNumberOfRowsToInsert; i++) {
    session.executeAsync(new BoundStatement(ps).bind(xx));
}
This will guarantee that each statement is unique and no overwrites are possible at all.

Suggest framework for external rule storage

There is a situation:
I've got 2 .xlsx files:
1. With business data
for example:
-----------------------------------------
| Column_A | Column_B| Column_C | Result |
-----------------------------------------
| test | 562.03 | test2 | |
------------------------------------------
2. With business rules
for example:
-------------------------------------------------------------------------
| Column_A | Column_B | Column_C | Result |
-------------------------------------------------------------------------
| EQUALS:test | GREATER:100 | EQUALS:test2 & NOTEQUALS:test | A |
--------------------------------------------------------------------------
| EQUALS:test11 | GREATER:500 | EQUALS:test11 & NOTEQUALS:test | B |
--------------------------------------------------------------------------
Each cell contains a condition.
One row contains a list of these conditions and composes one rule.
All rules will be processed iteratively, though I think it would be better to construct some kind of 'decision tree' or 'classification flow-chart'.
So, my task is to store the functionality of these conditions (operations like EQUALS, GREATER, NOTEQUALS) in an external file or some other resource, so that it can be changed without recompiling into Java bytecode: a dynamic solution rather than hard-coded Java methods.
I found Drools (http://drools.jboss.org/) as one framework that can handle such cases, but maybe there are other frameworks that work for this kind of problem?
JavaScript, dynamic SQL, and database-based solutions are not suitable.
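For illustration only, a minimal plain-Java sketch of interpreting such condition strings at runtime (all names are hypothetical, compound conditions such as EQUALS:test2 & NOTEQUALS:test are not handled, and a rules engine like Drools would externalize this far more cleanly):

import java.util.Map;
import java.util.function.Predicate;

public class ConditionParser {

    // Parses a condition string such as "EQUALS:test" or "GREATER:100"
    // into a predicate over the corresponding cell value.
    public static Predicate<String> parse(String condition) {
        String[] parts = condition.split(":", 2);
        String op = parts[0].trim();
        String arg = parts[1].trim();
        switch (op) {
            case "EQUALS":    return value -> value.equals(arg);
            case "NOTEQUALS": return value -> !value.equals(arg);
            case "GREATER":   return value -> Double.parseDouble(value) > Double.parseDouble(arg);
            default: throw new IllegalArgumentException("Unknown operation: " + op);
        }
    }

    public static void main(String[] args) {
        // One rule row from the rules sheet, keyed by column name.
        Map<String, String> rule = Map.of("Column_A", "EQUALS:test", "Column_B", "GREATER:100");
        // One data row from the business-data sheet.
        Map<String, String> row = Map.of("Column_A", "test", "Column_B", "562.03");

        boolean matches = rule.entrySet().stream()
                .allMatch(e -> parse(e.getValue()).test(row.get(e.getKey())));
        System.out.println(matches ? "Result: A" : "no match");
    }
}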

How to specify uberization of a Hive query in Hadoop2?

There is a new feature in Hadoop 2 called uberization. For example, this reference says:
Uberization is the possibility to run all tasks of a MapReduce job in
the ApplicationMaster's JVM if the job is small enough. This way, you
avoid the overhead of requesting containers from the ResourceManager
and asking the NodeManagers to start (supposedly small) tasks.
What I can't tell is whether this just happens magically behind the scenes or does one need to do something for this to happen? For example, when doing a Hive query is there a setting (or hint) to get this to happen? Can you specify the threshold for what is "small enough"?
Also, I'm having trouble finding much about this concept - does it go by another name?
I found details in the YARN Book by Arun Murthy about "uber jobs":
An Uber Job occurs when multiple mapper and reducers are combined to use a single
container. There are four core settings around the configuration of Uber Jobs found in
the mapred-site.xml options presented in Table 9.3.
Here is table 9.3:
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.enable | Whether to enable the small-jobs "ubertask" optimization, |
| | which runs "sufficiently small" jobs sequentially within a |
| | single JVM. "Small" is defined by the maxmaps, maxreduces, |
| | and maxbytes settings. Users may override this value. |
| | Default = false. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxmaps | Threshold for the number of maps beyond which the job is |
| | considered too big for the ubertasking optimization. |
| | Users may override this value, but only downward. |
| | Default = 9. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxreduces | Threshold for the number of reduces beyond which |
| | the job is considered too big for the ubertasking |
| | optimization. Currently the code cannot support more |
| | than one reduce and will ignore larger values. (Zero is |
| | a valid maximum, however.) Users may override this |
| | value, but only downward. |
| | Default = 1. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxbytes | Threshold for the number of input bytes beyond |
| | which the job is considered too big for the uber- |
| | tasking optimization. If no value is specified, |
| | `dfs.block.size` is used as a default. Be sure to |
| | specify a default value in `mapred-site.xml` if the |
| | underlying file system is not HDFS. Users may override |
| | this value, but only downward. |
| | Default = HDFS block size. |
|-----------------------------------+------------------------------------------------------------|
I don't know yet if there is a Hive-specific way to set this or if you just use the above with Hive.
An uber job occurs when multiple mappers and reducers are combined and executed inside the ApplicationMaster itself. So, assuming the job to be executed has at most 9 mappers and at most 1 reducer, the ResourceManager (RM) creates an ApplicationMaster and the job runs entirely within the ApplicationMaster's own JVM. To enable it, set:
SET mapreduce.job.ubertask.enable=TRUE;
The advantage of an uberized job is that the round-trip overhead of the ApplicationMaster requesting containers from the ResourceManager (RM), and the RM allocating containers back to the ApplicationMaster, is eliminated.
