Performance issue between solr 1.4.0 and 4.6.0 - java

I updated my Solr version from 1.4.0 to 4.6.0 and now we are facing several performance issues.
a) If I use the embedded version, it's very slow.
b) Using HTTP, I get these average times:
1.4: 151ms
4.6: 301ms
c) I saw that JavaBinCodec changed from version 1 to 2. Does anybody know if this could be the problem?
Note 1: I tested many times, discarding the first run because of server warm-up.
Note 2: The documents returned are very big (3k lines each in the XML view).
Any help would be appreciated.
The code used to test, shown here for Solr 4.6:
import java.util.GregorianCalendar;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class Main {

    private static HttpSolrServer server;

    public static void main(String[] args) throws Exception {
        String url = "http://foo.bar/myIndex";
        server = new HttpSolrServer(url);
        for (int i = 0; i < 10; i++) {
            search();
        }
    }

    public static void search() throws Exception {
        SolrQuery solrQuery = new SolrQuery();
        solrQuery.setQuery("foo:bar");
        solrQuery.setStart(0);
        solrQuery.setRows(20);
        // QUERY
        long before = new GregorianCalendar().getTimeInMillis();
        server.query(solrQuery);
        long after = new GregorianCalendar().getTimeInMillis();
        System.out.println(after - before);
    }
}

Solr 4.6 runs on Java 6 or higher. When using Java 7, Solr recommends installing at least Update 1 and discourages the use of experimental -XX JVM options. The JVM version you run can affect Solr's performance; you can get an overview of JVM-related issues in Solr at the link below.
http://wiki.apache.org/lucene-java/JavaBugs
CPU, disk and memory requirements depend on the many choices made in deploying Solr (document size, number of documents, and number of hits retrieved, to name a few).
However, if you are using ZooKeeper, there are several things you can try to improve Solr's performance.
Move ZooKeeper to another disk. If the index is huge, the number of I/O calls from Solr to ZooKeeper will degrade overall performance.
Increase the ZooKeeper timeout period.
Log GC times; I have seen pauses of up to 20s on ZooKeeper boxes.
Use the heap-tuning recommendations from http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning.
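Independent of GC and ZooKeeper tuning, it can also help to check how much of the 301ms is spent inside Solr versus in transport and response parsing, especially since the returned documents are large. A minimal sketch (assuming SolrJ 4.6 and reusing the static server from the Main class in the question; the field names passed to setFields are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public static void searchWithTimings() throws Exception {
    SolrQuery solrQuery = new SolrQuery("foo:bar");
    solrQuery.setStart(0);
    solrQuery.setRows(20);
    // Return only the fields you actually need; large stored documents
    // inflate both transport and unmarshalling time.
    solrQuery.setFields("id", "name");

    long before = System.currentTimeMillis();
    QueryResponse response = server.query(solrQuery);
    long after = System.currentTimeMillis();

    // QTime is the time Solr spent executing the query on the server;
    // the difference is network transfer plus client-side decoding.
    System.out.println("server QTime: " + response.getQTime() + "ms, "
            + "total: " + (after - before) + "ms");
}

If QTime is roughly the same on 1.4 and 4.6 but the total differs, the extra time is going into transport or client-side parsing rather than into the search itself.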

Related

Java code runs out of space memory on AWS but not MacOSX

I need another set of eyes on this.
I've written out hundreds of gigabytes to a zip file with this exact code, with no modifications, locally on Mac OS X.
With 100% unchanged code, just deployed to an AWS instance running Ubuntu, the same code runs into out-of-memory issues (heap space).
Here's the code that's being run, streaming MyBatis results to a CSV file on disk:
File directory = new File(feedDirectory);
File file;
try {
    file = File.createTempFile(("feed-" + providerCode + "-"), ".csv", directory);
} catch (IOException e) {
    throw new RuntimeException("Unable to create file to write feed to disk: " + e.getMessage(), e);
}

String filePath = file.getAbsolutePath();
log.info(String.format("File name for %s feed is %s", providerCode, filePath));

// output file
try (FileOutputStream out = new FileOutputStream(file)) {
    streamData(out, providerCode, startDate, endDate);
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage());
}
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVPrinter printer = CsvUtil.openPrinter(outputStream)) {
        StreamingHandler<FStay> handler = stayPrintingHandler(printer);
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}

private StreamingHandler<FStay> stayPrintingHandler(CSVPrinter printer) {
    StreamingHandler<FStay> handler = new StreamingHandler<>();
    handler.setHandler((stay) -> {
        try {
            EXPORTER.writeStay(printer, stay);
        } catch (IOException e) {
            log.error("Issue with writing output: " + e.getMessage(), e);
        }
    });
    return handler;
}

// The EXPORTER method
import org.apache.commons.csv.CSVPrinter;

public void writeStay(CSVPrinter printer, FStay stay) throws IOException {
    List<Object> list = asList(stay);
    printer.printRecord(list);
}

List<Object> asList(FStay stay) {
    List<Object> list = new ArrayList<>(46);
    list.add(stay.getUid());
    list.add(stay.getProviderCode());
    //....
    return list;
}
Here's a graph of the JVM heap space (using jvisualvm) when I run this locally. I've run this consistently with Java 8 (jdk1.8.0_51 and 1.8.0_112) locally and have gotten great results, even writing out a terabyte of data.
^ In the above, the max heap space is set to 4 gigs, and the most it ever increases to is 1.5 gigs, before going back down to around 500 MB, while streaming data to the CSV file as it's supposed to.
However, when I run this on Ubuntu with JDK 1.8.0_111, the exact same operation will not complete, running out of heap space (java.lang.OutOfMemoryError: Java heap space).
I've upped the -Xmx value from 8 gigs to 16 and then 25 gigs, and still run out of heap space. Meanwhile, the file is only 10 gigs in total, which really perplexes me.
Here's what the JVisualVm graph looks like on the Ubuntu box:
I've no doubt it's the exact same code running in both environments, with the same operation being performed in each (same database server providing the same data)
The only differences I can think of at this point are:
Operating system - Ubuntu vs Mac OS X
Hosted VM in AWS vs bare-metal laptop
Network speed is faster in AWS between database and Ubuntu server
JDK version is 1.8.0_111 in Ubuntu, tried 1.8.0_51 and 1.8.0_112 locally
Can anyone help shed any light on this problem?
Update
I've tried replacing all the try-with-resources statements with explicit flush/close calls, with no luck.
What's more, I tried to force a garbage collection on the Ubuntu box as soon as I started to see the data come in, and it had no effect; something is definitely stopping the heap from being collected on the Ubuntu machine, while running the exact same code on OS X let me write the full enchilada again with no problem.
Update 2
In addition to the differences in the environments above, the only other difference I can think of is if the connection between the servers in AWS is so fast that it streams the data faster than it can flush the data to disk... but that still doesn't explain the issue where I only have 10 gigs of data total, and it blows up a JVM with 20 Gigs of heap space.
Is there any likelihood of there being a bug at the Ubuntu/Java level for this?
Update 3
Tried replacing the CSVPrinter with an entirely separate library (OpenCSV's CSVWriter in lieu of Apache's CSV library) and the same result occurs.
As soon as this code starts receiving data from the database, the heap starts blowing up and the garbage collector fails to reclaim any memory... but only on Ubuntu. On OS X, everything is reclaimed immediately and the heap never grows.
I've also tried flushing the stream after every write, but had no luck with that either.
Update 4
Got a heap dump printed out, and according to it I should be looking at the database driver, specifically the InboundDataHandler in Amazon's Redshift driver.
I'm using MyBatis with a custom result handler. I tried setting the result handler to effectively do nothing when it gets a result (new ResultHandler<>() { // method overridden to do literally nothing }), and I know I'm not holding on to any references there.
Since it's the InboundDataHandler defined by AWS/Redshift, it makes me think the problem may be lower than the MyBatis level, either:
An error in the SqlSessionFactory I'm setting up
A bug in the Redshift driver that only pops up on Ubuntu / AWS
A bug in the result handler I have overridden
Here's the heap dump screenshot:
Here's where I'm setting up my SqlSessionFactoryBean:
@Bean
public javax.sql.DataSource redshiftDataSource() throws ClassNotFoundException {
    log.info("Got to datasource config");
    // Dynamically load driver at runtime.
    Class.forName(dataWarehouseDriver);
    DataSource dataSource = new DataSource();
    dataSource.setURL(dataWarehouseUrl);
    dataSource.setUserID(dataWarehouseUsername);
    dataSource.setPassword(dataWarehousePassword);
    return dataSource;
}

@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    return factoryBean;
}
Here's the MyBatis code I'm running as a test to verify that it's not me holding on to records in my ResultHandler:
warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, new ResultHandler<FStay>() {
    @Override
    public void handleResult(ResultContext<? extends FStay> resultContext) {
        // do nothing
    }
});
Is there a way I can force the SQL connection not to hang on to records or something? I'll reiterate that on my local machine there is no issue with this memory leak; it only surfaces when running the code in the hosted AWS environment, and in both cases the database driver and server are the same.
Update 6
I think it's finally fixed. Thanks to all who pointed me in the direction of the heap dump. That helped narrow it down to the offending class in a huge way.
After that, I did some research on the AWS Redshift driver, and its documentation explicitly says that clients should specify a fetch size limit for operations on large data sets. So I found out how to do that in my MyBatis configuration:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And this did the trick.
Mind you, this wasn't necessary when handling much larger data sets downloaded remotely from AWS (database in AWS, code executing on a laptop at home), and it shouldn't be necessary, since I'm overriding the MyBatis ResultHandler<>, which handles each row individually and never holds on to any objects.
Yet something funky happens with the AWS Redshift JDBC driver only when it runs in AWS (database in AWS, code executing on an AWS instance) that causes this InboundDataHandler to never release its resources unless a fetchSize is specified.
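As an aside, if a fetch size needed to apply to every statement rather than just this one query, MyBatis also exposes a global default. A hedged sketch (assuming MyBatis 3.3+ and mybatis-spring), reworking the sqlSessionFactory() bean shown above:

import org.apache.ibatis.session.SqlSessionFactory;
import org.mybatis.spring.SqlSessionFactoryBean;
import org.springframework.context.annotation.Bean;

@Bean
public SqlSessionFactory sqlSessionFactory() throws Exception {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    SqlSessionFactory factory = factoryBean.getObject();
    // defaultFetchSize applies to every mapped statement that does not set
    // its own fetchSize; 1000 mirrors the per-query value used above.
    factory.getConfiguration().setDefaultFetchSize(1000);
    return factory;
}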
Here's the heap of the server running now, getting much further than it ever has before in AWS, with the heap space never going above 500 MB; after I hit 'force gc' in jvisualvm, it shows the 'used' heap at less than 100 MB:
Thanks again in a huge way to all those who helped guide this!
Finally figured out a solution.
The heap dump was the biggest aid: it indicated that the InboundDataHandler class of Amazon's Redshift/Postgres JDBC driver was the prime culprit.
The code to set up the SqlSession appeared legit, so traveling over to Amazon's documentation landed this gem:
To avoid client-side out-of-memory errors when retrieving large data
sets using JDBC, you can enable your client to fetch data in batches
by setting the JDBC fetch size parameter.
We hadn't run into this before, as we stream results with custom ResultHandlers in MyBatis... but there seems to be something different when the AWS Redshift JDBC driver is running on AWS itself vs outside AWS connecting in.
Taking the guidance from the documentation, we added a 'fetchSize' to our MyBatis select query:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And voila! Everything worked swimmingly. This is the only change we made and the heap never went above a couple hundred MBs.
You can see in one of the above graphs where the heap goes off the charts: as soon as the data started being received on Amazon, the heap marched right up linearly and never reclaimed an ounce of space once it started.
My guess is the Redshift JDBC driver is doing something different when it's in Amazon's environment for some kind of optimization... that's all I can think of to explain the behavior.
Clearly Amazon knows what's going on since they documented it up front. I may not know the full 'why' of what's happening, but at least everything is resolved in what appears to be a satisfactory way.
Thanks to all those who helped.

JDBC mysql connection much slower on Win8 vs Ubuntu14

I need some help optimizing a MySQL connection/query. To be honest, I am fairly new to the DB topic, hence I do not know how to start the optimization process or how to explain the differences in performance between my Linux and Windows machines.
I have a Java application which connects to the DB, retrieves some data (about 1,000,000 rows), processes it and writes it out to a set of CSV files.
The problem I have is that on my Linux machine (i5-2520M and SSD) the whole process takes about 17 seconds, while on my Windows 8 computer (i7-4790k, SSD combined with a 7200 rpm HDD) it takes almost a minute to execute the same code.
So it's more than 3 times slower on Windows than on Linux. Can anyone explain why that is the case and how to make the performance comparable on both platforms?
Update 1:
The JVM is HotSpot, version 8 I guess.
The DB is on localhost.
Cores: 4 x 4.5 GHz for Windows and 2 x 2.5 GHz for Linux, both with Intel's hyper-threading fancy stuff.
No exceptions are caught on either Linux or Windows, even though I have try/catch blocks prepared for all of them.
Here you have some basic data about the performance and key components of the application. I can provide more details if necessary, just tell me what you need.
public class DBAccesor {

    private Connection mySQLconnection;
    private ResultSet answerDB;
    private Statement query;
    private final String connectionFlags = "&characterEncoding=utf8&useUnicode=true&useSSL=false";
    private String queryBody = "SELECT name, surename FROM table1 INNER JOIN table2 ON table1.person_id = table2.person_id WHERE origin = \"eu\"";
    ...
Connection established in: Win 0.167s vs Linux 0.311s
Class.forName(driverJDBC);
DriverManager.setLogWriter(new PrintWriter(System.out));
mySQLconnection = DriverManager.getConnection(
        DBServer
        + DBName
        + login
        + password
        + connectionFlags);
Query Execution: Win 0.023s vs Linux 0.01s
query = mySQLconnection.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
query.setFetchSize(Integer.MIN_VALUE);
answerDB = query.executeQuery(queryBody);
Retrieving data from Result Set: Win 53.020s vs Linux 13.282s
ArrayList<Person> results = new ArrayList<Person>();
while (answerDB.next()) {
    try {
        // Since there are a lot of local characters in my data I have to use
        // getBytes instead of getString; otherwise the obtained characters
        // are corrupted.
        String name = new String(answerDB.getBytes(1), "UTF-8");
        String surname = new String(answerDB.getBytes(2), "UTF-8");
        results.add(new Person(name, surname));
    } catch (SQLException | UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}
The rest of the code is quite straightforward. I have some parallelStream processing based on the ArrayList just created, and code writing the output to files.
Data processing: Win 1.109s vs Linux 2.976s
Writing output to files: Win 1.571s vs Linux 0.439s
Overall runtime: Win 55.880s vs Linux 17.083s
What you want to do is retrieve data from a MySQL database and write it to disk. The thing is, it's not really about which OS you are using. Looking at the configurations of your two machines, I see that you are using an SSD on the Linux one and an HDD on the Windows one. The read/write capabilities of an SSD are much better and faster than those of an HDD, so I think that is where the performance difference comes from.
I refer you to this discussion thread for further information:
https://dba.stackexchange.com/questions/59828/ssd-vs-hdd-for-databases
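Another way to narrow this down is to time the ResultSet iteration separately from the per-row object construction: if the raw read is already slow on Windows, the bottleneck is in the driver/transport rather than in the processing code. A minimal sketch reusing the statement setup from the question (run in place of the original retrieval loop):

import java.sql.ResultSet;
import java.sql.SQLException;

// Measures only the raw JDBC fetch, without building Person objects
// or decoding the byte arrays into Strings.
private void timeRawFetch(ResultSet answerDB) throws SQLException {
    long fetchStart = System.nanoTime();
    int rows = 0;
    while (answerDB.next()) {
        answerDB.getBytes(1);  // pull the bytes but deliberately discard them
        answerDB.getBytes(2);
        rows++;
    }
    long fetchNanos = System.nanoTime() - fetchStart;
    System.out.printf("Fetched %d rows in %.3f s%n", rows, fetchNanos / 1e9);
}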

Performance issue after upgrading to neo4j 2.0 RC1

We have a Java application that embeds the Neo4j server. This application loads some data from an Oracle DB, creates the graph, and then users can run domain-specific traversals and algorithms on demand.
We recently upgraded from 1.9.3 to 2.0 RC1. We are now using the schema and unique constraints as follows:
Iterator<ConstraintDefinition> constraints = schema.getConstraints(
        DynamicLabel.label(label)).iterator();
if (constraints == null || !constraints.hasNext()) {
    try {
        schema.constraintFor(DynamicLabel.label(label))
              .assertPropertyIsUnique(propertyName)
              .create();
    } catch (org.neo4j.graphdb.ConstraintViolationException ex) {
        LOG.error("CONSTRAINT ALREADY DEFINED ON: " + label);
    }
}
The issue is that our application's startup time has become 10 times slower. Sampling the CPU times pointed us at where the time was going.
We instrumented the application and found the root cause of the slowness. The following thread discusses the specific issue we found in detail:
How to improve performance while committing nodes in bulk with Neo4j 2.0 RC1?

Java application performs very slow (10-100 times slower than on Windows, Linux, AIX)

I need your help with performance problems running our corporate Java application on an HP-UX server. The application is a standalone tool which synchronizes data from several databases into one, communicates with a remote control over the XML-RPC protocol and uses a local Derby (Java DB) database instance to hold configuration data, etc. We do not have performance problems on other environments under the same load, such as Windows XP, Linux and AIX, which use the Sun JVM. After a series of tests we found that the most time-consuming part was communication with the Derby database. Most of the time is spent reading from a socket, and this time is 10-100 times greater than on the other platforms. We know for sure that Derby works fine, and we have CPU in reserve (usage is about 30%-40%), so the most probable culprit is the transport layer between the local database and the application.
Is there a way to diagnose socket I/O problems on HP-UX, or are there perhaps limits that can be configured? Maybe there is a necessary JVM option? Any ideas from your side would be highly appreciated.
We’ve tried to optimize JVM options according to http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/index.jsp?topic=/com.ibm.websphere.wsfep.multiplatform.doc/info/ae/ae/tprf_tunejvm_v61.html but didn’t get any significant improvement.
JVM info:
Java HotSpot(TM) 64-Bit Server VM (19.1-b02-jinteg:2011mar11-16:46 PA2.0W (aCC_AP), mixed mode)
Java: version 1.6.0.10, vendor "Hewlett-Packard Company"
We use following instance:
OS: HP-UX (B.11.23)
Architecture: PA_RISC2.0W 64bit
Processors: 2
Total physical memory size: 4 088 MB
Swap size: 4 090 MB
Here is an example of the slow-running code. It takes several seconds to execute on HP-UX, while on Windows it takes 10-30 ms:
/** Template to communicate with local db. */
SimpleJdbcTemplate jdbcTemplate;

@Transactional(readOnly = true)
public List<JobLogEntry> getLastLogs(Integer dbnr, JobDataType dtype) {
    try {
        String uid = jdbcTemplate.queryForObject("SELECT session_uuid FROM "
                + tableName + " WHERE id=(SELECT max(id) FROM "
                + tableName + " WHERE dbnr=? AND dtype=?)",
                String.class, dbnr, dtype.name());
        List<JobLogEntry> list = jdbcTemplate.query("SELECT id, dbnr, dtype, zeit, level, message FROM "
                + tableName
                + " WHERE dbnr=? AND dtype=? AND session_uuid=? ORDER BY ID",
                new ConRowMapper(), dbnr, dtype.name(), uid);
        return list;
    } catch (org.springframework.dao.EmptyResultDataAccessException e) {
        return new ArrayList<JobLogEntry>();
    }
}

class ConRowMapper implements RowMapper<JobLogEntry> {

    private final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy.MM.dd HH:mm:ss");

    /**
     * Maps rows.
     */
    public JobLogEntry mapRow(ResultSet rs, int rowNum) throws SQLException {
        return new JobLogEntry(rs.getInt("dbnr"),
                rs.getString("dtype"),
                dateFormat.format(rs.getTimestamp("zeit")),
                rs.getString("level"),
                rs.getString("message"));
    }
}
Thanks in advance for all your ideas
I wonder about the method getLastLogs(). Why query to get the session UUID and then turn around and use it in another query? I would guess that it's possible to do it in one query.
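For example, the two round trips could probably be collapsed into a single statement with a subquery. A sketch only, reusing the table, columns and ConRowMapper from the code above (untested against the actual schema):

List<JobLogEntry> list = jdbcTemplate.query(
        "SELECT id, dbnr, dtype, zeit, level, message FROM " + tableName
        + " WHERE dbnr=? AND dtype=? AND session_uuid="
        + " (SELECT session_uuid FROM " + tableName
        + "  WHERE id=(SELECT max(id) FROM " + tableName + " WHERE dbnr=? AND dtype=?))"
        + " ORDER BY id",
        new ConRowMapper(), dbnr, dtype.name(), dbnr, dtype.name());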
When you say Derby, it makes me think that only Java accesses that database. Is that true? Do you know that it's optimized well (e.g. proper indexes for every WHERE clause)?
Do you use connection pooling? That way you can pay the cost of creating connections up front and amortize it over all the queries you run.
I see jdbcTemplate, so you must be using Spring. I'd get the debug or trace interceptor wired in and see where the time is being spent.
I'd also recommend Visual VM 1.3.2 with all the plugins installed. It will give you a lot more data.
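On the connection-pooling point above, a minimal sketch using Apache Commons DBCP with the Derby client driver (the URL, pool sizes and method name are illustrative; adjust to your setup):

import org.apache.commons.dbcp.BasicDataSource;
import org.springframework.jdbc.core.simple.SimpleJdbcTemplate;

public SimpleJdbcTemplate pooledJdbcTemplate() {
    BasicDataSource dataSource = new BasicDataSource();
    dataSource.setDriverClassName("org.apache.derby.jdbc.ClientDriver");
    dataSource.setUrl("jdbc:derby://localhost:1527/configdb");  // illustrative URL
    dataSource.setInitialSize(5);   // pay the connection-creation cost up front
    dataSource.setMaxActive(10);
    return new SimpleJdbcTemplate(dataSource);
}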
A probable reason could be slow and blocking GC work on HP-UX. Try removing redundant System.gc() calls and use some JVM GC options to optimize :)
See this nice presentation about HP performance tuning: http://www.scribd.com/doc/47433278/Javamemorymanagemen

How can I use OpenOffice in server mode as a multithreaded service?

What is your experience of working with OpenOffice in server mode? I know OpenOffice is not multithreaded, and I now need to use its services on our server.
What can I do to overcome this problem?
I'm using Java.
With the current version of JODConverter (3.0-SNAPSHOT), it's quite easy to run multiple OOo worker instances in headless mode, as the library now supports starting up several instances and keeping them in a pool, just by providing several port numbers or named pipes when constructing an OfficeManager instance:
final OfficeManager om = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("/usr/lib/openoffice")
        .setPortNumbers(8100, 8101, 8102, 8103)
        .buildOfficeManager();
om.start();
You can then use the library, e.g. for converting documents, without having to deal with the pool of OOo instances in the background:
OfficeDocumentConverter converter = new OfficeDocumentConverter(om);
converter.convert(new File("src/test/resources/test.odt"), new File("target/test.pdf"));
Yes, I am using OpenOffice as a document conversion server.
Unfortunately, the solution to your problem is to spawn a pool of OpenOffice processes.
The commons-pool branch of JODConverter (before it moved to code.google.com) implemented this out-of-the-box for you.
Thanks Bastian. Based on his answer, I found another way. Opening several ports makes it possible to work with multiple threads. But even without many ports (a few are enough) you can improve performance by increasing the task queue timeout; here is the documentation. Also, we decided not to start and stop the officeManager on each conversion. In the end, I solved the task with this approach:
public class JODConverter {

    private static volatile OfficeManager officeManager;
    private static volatile OfficeDocumentConverter converter;

    public static void startOfficeManager() {
        try {
            officeManager = new DefaultOfficeManagerConfiguration()
                    .setOfficeHome(new File("libre office home path"))
                    .setPortNumbers(8100, 8101, 8102, 8103, 8104)
                    .setTaskExecutionTimeout(600000L)  // for big files
                    .setTaskQueueTimeout(200000L)      // wait if all ports are busy
                    .buildOfficeManager();
            officeManager.start();

            // 2) Create the JODConverter converter
            converter = new OfficeDocumentConverter(officeManager);
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }

    public static void convertPDF(File inputFile, File outputFile) throws Throwable {
        converter.convert(inputFile, outputFile);
    }

    public static void stopOfficeManager() {
        officeManager.stop();
    }
}
I call JODConverter's convertPDF whenever a conversion is needed. The office manager is stopped only when the application shuts down.
OpenOffice can be used in headless mode, but it has not been built to handle a lot of requests in a stressful production environment.
Using OpenOffice in headless mode has several issues:
The process might die/become unavailable.
There are several memory-leak issues.
Opening several OpenOffice "workers" does not scale as expected and needs some tweaking to really get separate processes (several OpenOffice copies, several services, running under different users).
As suggested, jodconverter can be used to access the OpenOffice process.
http://code.google.com/p/jodconverter/wiki/GettingStarted
You can try this:
http://www.jopendocument.org/
It's an open-source Java-based library that allows you to work with OpenOffice documents without OpenOffice, thus removing the need for the OOo server.
Vlad is correct about having to run multiple instances of OpenOffice on different ports.
I'd just like to add that OpenOffice doesn't seem to be stable. We run 10 instances of it in a production environment and set the code up to retry with another instance if the first attempt fails. This way, when one of the OpenOffice servers crashes (or doesn't crash but doesn't respond either), production is not affected. Since it's a pain to keep restarting the servers on a daily basis, we're slowly converting all our documents to JasperReports (see iReport for details). I'm not sure how you're using the OpenOffice server; we use it for mail merging (filling out forms for customers). If you need to convert things to PDF, I'd recommend iText.
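A minimal sketch of that retry-with-another-instance approach (the class is hypothetical; it assumes JODConverter 3.0's OfficeDocumentConverter as used above, one converter per OOo instance):

import java.io.File;
import java.util.List;

import org.artofsolving.jodconverter.OfficeDocumentConverter;

public class FailoverConverter {

    private final List<OfficeDocumentConverter> converters;  // one per OOo instance

    public FailoverConverter(List<OfficeDocumentConverter> converters) {
        this.converters = converters;
    }

    /** Tries each OOo instance in turn until one succeeds. */
    public void convert(File input, File output) {
        for (OfficeDocumentConverter converter : converters) {
            try {
                converter.convert(input, output);
                return;  // success
            } catch (RuntimeException e) {
                // This instance crashed or is unresponsive; try the next one.
            }
        }
        throw new IllegalStateException("All OpenOffice instances failed for " + input);
    }
}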
