We have a Java application that embeds the Neo4j server. This application loads some data from an Oracle DB, creates the graph, and then users can run domain-specific traversals and algorithms on demand.
We recently upgraded from 1.9.3 to 2.0 RC1. We are now using the schema and unique constraints as follows:
Iterator<ConstraintDefinition> constraints = schema.getConstraints(
        DynamicLabel.label(label)).iterator();
if (constraints == null || !constraints.hasNext()) {
    try {
        schema.constraintFor(DynamicLabel.label(label))
                .assertPropertyIsUnique(propertyName)
                .create();
    } catch (org.neo4j.graphdb.ConstraintViolationException ex) {
        LOG.error("CONSTRAINT ALREADY DEFINED ON: " + label);
    }
}
The issue is that our application's startup time has become 10 times slower. Sampling the CPU times showed where the time is going.
We instrumented the application and found the root cause of the slowness. The following thread discusses the specific issue we found in detail:
How to improve performance while committing nodes in bulk with Neo4j 2.0 RC1?
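For context on what that thread covers, here is a minimal sketch of batching node creation into larger transactions with the embedded 2.0 API; note that in 2.0 the constraint creation above has to run in its own transaction, separate from transactions that write data. The batch size, label, and property map below are assumptions for illustration, not our actual schema:

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

import java.util.List;
import java.util.Map;

public class BatchedLoader {

    private static final int BATCH_SIZE = 10_000; // assumed batch size

    /** Creates one node per row, committing every BATCH_SIZE rows. */
    public static void loadNodes(GraphDatabaseService db, String label,
                                 List<Map<String, Object>> rows) {
        Label nodeLabel = DynamicLabel.label(label);
        int i = 0;
        Transaction tx = db.beginTx();
        try {
            for (Map<String, Object> row : rows) {
                Node node = db.createNode(nodeLabel);
                for (Map.Entry<String, Object> entry : row.entrySet()) {
                    node.setProperty(entry.getKey(), entry.getValue());
                }
                if (++i % BATCH_SIZE == 0) {
                    tx.success();
                    tx.close();          // commit this batch
                    tx = db.beginTx();   // start the next one
                }
            }
            tx.success();                // commit the remainder
        } finally {
            tx.close();
        }
    }
}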
I am currently running some queries using the Java API provided by MarkLogic. I have installed it by adding the required dependencies to my project. The connection is set up using
DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000, secContext, ConnectionType.DIRECT);
From here some XQueries are run using the code shown below
ServerEvaluationCall evl = client.newServerEval().xquery(query);
EvalResultIterator evr = evl.eval();
while (evr.hasNext()) {
    // Do something with the results
}
However, certain queries take a long time to process, causing an internal error. So, other than reducing the time the query itself requires, I am wondering if there is a way to overcome this, such as increasing the connection time limit, for instance.
Update
Query used
xquery version "1.0-ml";
let $query-opts := /comments[fn:matches(text,".*generation.*")]
return(
$query-opts, fn:count($query-opts), xdmp:elapsed-time()
)
I know the regular expression used could easily be replaced by a word-query, but for this instance I would like to just use a regular expression for searching.
Example Data
<comments>
<date_commented>1998-01-14T04:32:30</date_commented>
<text>iCloud sync settings are not supposed to change after an iOS update. In the case of iOS 10.3 this was due to a bug.</text>
<uri>/comment/000000001415898</uri>
</comments>
On the basis of your provided data I'd use xdmp:estimate and a cts query.
xdmp:estimate(cts:search(doc(), cts:and-query((
    cts:directory-query('/comment/'),
    cts:element-word-query(xs:QName("text"), "generation")
))))
This will search all documents in your /comment/ directory for a text element containing the word generation. As you already know, this only uses indexes and does not require loading/parsing documents.
This also will not return any false positives, because there is only one text element per document/fragment (if your sample data is representative).
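If you want to drive that from the Java API you are already using, a sketch along these lines should work with the same DatabaseClient and server-eval call shown in the question (the class and method names here are just for illustration):

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.eval.EvalResult;
import com.marklogic.client.eval.EvalResultIterator;
import com.marklogic.client.eval.ServerEvaluationCall;

public class CommentCounter {

    // Runs the xdmp:estimate query shown above and returns the count.
    public static long countGenerationComments(DatabaseClient client) {
        String query =
                "xdmp:estimate(cts:search(doc(), cts:and-query(("
                + " cts:directory-query('/comment/'),"
                + " cts:element-word-query(xs:QName('text'), 'generation')"
                + "))))";

        ServerEvaluationCall call = client.newServerEval().xquery(query);
        EvalResultIterator results = call.eval();
        long count = 0;
        while (results.hasNext()) {
            EvalResult result = results.next();
            count = Long.parseLong(result.getString());
        }
        return count;
    }
}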
I need another set of eyes on this.
I've written out a zip file of hundreds of gigabytes with this exact code, with no modifications, locally on Mac OS X.
With 100% unchanged code, just deployed to an AWS instance running Ubuntu, this same code runs into Out of Memory issues (heap space).
Here's the code that's being run, streaming MyBatis to a CSV file on disk:
File directory = new File(feedDirectory);
File file;
try {
    file = File.createTempFile(("feed-" + providerCode + "-"), ".csv", directory);
} catch (IOException e) {
    throw new RuntimeException("Unable to create file to write feed to disk: " + e.getMessage(), e);
}

String filePath = file.getAbsolutePath();
log.info(String.format("File name for %s feed is %s", providerCode, filePath));

// output file
try (FileOutputStream out = new FileOutputStream(file)) {
    streamData(out, providerCode, startDate, endDate);
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage());
}
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVPrinter printer = CsvUtil.openPrinter(outputStream)) {
        StreamingHandler<FStay> handler = stayPrintingHandler(printer);
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}

private StreamingHandler<FStay> stayPrintingHandler(CSVPrinter printer) {
    StreamingHandler<FStay> handler = new StreamingHandler<>();
    handler.setHandler((stay) -> {
        try {
            EXPORTER.writeStay(printer, stay);
        } catch (IOException e) {
            log.error("Issue with writing output: " + e.getMessage(), e);
        }
    });
    return handler;
}
// The EXPORTER method
import org.apache.commons.csv.CSVPrinter;

public void writeStay(CSVPrinter printer, FStay stay) throws IOException {
    List<Object> list = asList(stay);
    printer.printRecord(list);
}

List<Object> asList(FStay stay) {
    List<Object> list = new ArrayList<>(46);
    list.add(stay.getUid());
    list.add(stay.getProviderCode());
    //....
    return list;
}
Here's a graph of the JVM heap space (using jvisualvm) when I run this locally. I've run this consistently with Java 8 (jdk1.8.0_51 and 1.8.0_112) locally and have gotten great results, even writing out a terabyte of data.
In the graph above, the max heap space is set to 4 GB, and the most it ever increases to is 1.5 GB before going back down to around 500 MB, while streaming data to the CSV file as it's supposed to.
However, when I run this on Ubuntu with jdk 1.8.0_111, the exact same operation will not complete, running out of heap space (java.lang.OutOfMemoryError: Java heap space)
I've upped the Xmx value from 8 gigs to 16 to 25 gigs, and still run out of heap space. Meanwhile... the total size of the file is only 10 Gigs in total... which really perplexes me.
Here's what the JVisualVm graph looks like on the Ubuntu box:
I've no doubt it's the exact same code running in both environments, with the same operation being performed in each (same database server providing the same data)
The only differences I can think of at this point are:
Operating system - Ubuntu vs Mac OS X
Hosted VM in AWS vs hard metal laptop
Network speed is faster in AWS between database and Ubuntu server
JDK version is 1.8.0_111 in Ubuntu, tried 1.8.0_51 and 1.8.0_112 locally
Can anyone help shed any light on this problem?
Update
I've tried replacing all the 'try-with-resources' statements with explicit flush/close statements and no luck.
What's more, I tried to force a garbage collection on the Ubuntu box as soon as I started to see the data come in, and it had no effect-- there is something definitely stopping the heap from being collected on the Ubuntu machine... while running the exact same code on OS X let me write the full enchilada again no problem.
Update 2
In addition to the differences in the environments above, the only other difference I can think of is if the connection between the servers in AWS is so fast that it streams the data faster than it can flush the data to disk... but that still doesn't explain the issue where I only have 10 gigs of data total, and it blows up a JVM with 20 Gigs of heap space.
Is there any likelihood of there being a bug at the Ubuntu/Java level for this?
Update 3
Tried replacing the output of the CSVPrinter to use an entirely separate library (OpenCSV's CSVWriter in lieu of Apache's CSV library) and the same result occurs.
As soon as this code starts receiving data from the database, the heap starts blowing up and the garbage collector fails to reclaim any memory... but only on Ubuntu. On OS X, everything is reclaimed immediately and the heap never grows.
I've also tried flushing the stream after every write, but had no luck with that as well.
Update 4
Got the heap dump to print out, and according to it I should be looking at the database driver. Specifically, the InboundDataHandler in Amazon's Redshift driver.
I'm using myBatis with a custom result handler. I tried setting the result handler to effectively do nothing when it gets a result (new ResultHandler<>() { // method overridden to do literally nothing}) and I know I'm not holding on to any references there.
Since it's the InboundDataHandler defined by AWS/Redshift... it makes me think it may be lower than the myBatis level... either:
Error in the SqlSessionFactory I'm setting up
Bug in the Redshift driver that only pops up in Ubuntu / AWS
Bug in the result handler I have overwritten
Here's the heap dump screenshot:
Here's where I'm setting up my SqlSessionFactoryBean:
@Bean
public javax.sql.DataSource redshiftDataSource() throws ClassNotFoundException {
    log.info("Got to datasource config");
    // Dynamically load driver at runtime.
    Class.forName(dataWarehouseDriver);
    DataSource dataSource = new DataSource();
    dataSource.setURL(dataWarehouseUrl);
    dataSource.setUserID(dataWarehouseUsername);
    dataSource.setPassword(dataWarehousePassword);
    return dataSource;
}

@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    return factoryBean;
}
Here's the myBatis code I'm running as a test to verify that it's not me holding on to records in my ResultHandler:
warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, new ResultHandler<FStay>() {
    @Override
    public void handleResult(ResultContext<? extends FStay> resultContext) {
        // do nothing
    }
});
Is there a way I can force the SQL connection not to hang on to records or something? I'll reiterate that on my local machine there is no issue with this memory leak... it only surfaces when running the code in the hosted AWS environment. And in both cases, the database driver and server are the same.
Update 6
I think it's finally fixed. Thanks to all who pointed me in the direction of the heap dump. That helped narrow it down to the offending class in a huge way.
After that, I did some research on the AWS redshift driver, and it explicitly says that your clients should specify a limit for any operations on large data. So I found out how to do that in my myBatis configuration:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And this did the trick.
Mind you, this wasn't necessary even when handling much larger data sets downloaded remotely from AWS (database in AWS, code executing on a laptop at home), and it shouldn't be necessary, since I'm overriding the MyBatis ResultHandler<>, which handles each row individually and never holds on to any objects.
Yet something funky happens with the AWS Redshift JDBC driver, only when it's run in AWS (database in AWS, code executing in an AWS instance), that causes this InboundDataHandler to never release its resources unless a fetchSize is specified.
Here's the heap of the server running now, getting much further than it ever has before in AWS, with the heap space never moving above 500 MB, and after I hit 'Force GC' in jvisualvm, it shows the used heap at less than 100 MB:
Thanks again in a huge way to all those who helped guide this!
Finally figured out a solution.
The heap dump was the biggest aid: it indicated that the InboundDataHandler class of Amazon's Redshift/Postgres JDBC driver was the prime culprit.
The code to set up the SqlSession appeared legit, so traveling over to Amazon's documentation landed this gem:
To avoid client-side out-of-memory errors when retrieving large data
sets using JDBC, you can enable your client to fetch data in batches
by setting the JDBC fetch size parameter.
We hadn't run into this before, as we stream results with custom ResultHandlers in MyBatis... but there seems to be something different when the AWS Redshift JDBC driver is running on AWS itself vs outside AWS connecting in.
Taking the guidance from the documentation, we added a 'fetchSize' to our MyBatis select query:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And voila! Everything worked swimmingly. This is the only change we made and the heap never went above a couple hundred MBs.
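For completeness: if you would rather not repeat fetchSize on every statement, MyBatis also has a global default that could be set where the SqlSessionFactoryBean is built. This is only a sketch and assumes MyBatis 3.3+ (Configuration.setDefaultFetchSize) and mybatis-spring 1.3+ (setConfiguration); we only verified the per-select attribute ourselves:

@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());

    // Ask the driver to stream results in chunks instead of buffering the whole result set.
    org.apache.ibatis.session.Configuration configuration =
            new org.apache.ibatis.session.Configuration();
    configuration.setDefaultFetchSize(1000);
    factoryBean.setConfiguration(configuration);

    return factoryBean;
}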
You can see in one of the graphs above where the heap goes off the charts: as soon as the data starts to be received on Amazon, the heap marches right up linearly and never reclaims an ounce of heap space once it starts.
My guess is the Redshift JDBC driver is doing something different when it's in Amazon's environment for some kind of optimization... that's all I can think of to explain the behavior.
Clearly Amazon knows what's going on since they documented it up front. I may not know the full 'why' of what's happening, but at least everything is resolved in what appears to be a satisfactory way.
Thanks to all those who helped.
I updated my Solr version from 1.4.0 to 4.6.0 and now we are facing several performance issues.
a) If I use the embedded version, it's very slow
b) Using http, I have these average times:
1.4: 151ms
4.6: 301ms
c) I saw that JavaBinCodec changed from version 1 to 2. Does anybody know if this could be the problem?
Note1: I tested many times, discarding the first run because of server warm-up.
Note2: The documents returned are very big (3k lines each in the XML view).
Any help would be appreciated.
The code used to test, showing the Solr 4.6 version:
public class Main {

    private static HttpSolrServer server;

    public static void main(String[] args) throws Exception {
        String url = "http://foo.bar/myIndex";
        server = new HttpSolrServer(url);
        for (int i = 0; i < 10; i++) {
            search();
        }
    }

    public static void search() throws Exception {
        SolrQuery solrQuery = new SolrQuery();
        solrQuery.setQuery("foo:bar");
        solrQuery.setStart(0);
        solrQuery.setRows(20);

        // QUERY
        long before = new GregorianCalendar().getTimeInMillis();
        server.query(solrQuery);
        long after = new GregorianCalendar().getTimeInMillis();
        System.out.println(after - before);
    }
}
Solr 4.6 runs on Java 6 or higher. When using Java 7, Solr recommends installing at least Update 1 and discourages the experimental use of -XX JVM options. The latest JVM versions may affect Solr's performance. You can get an overview of JVM-related issues in Solr at the link below.
http://wiki.apache.org/lucene-java/JavaBugs
CPU, disk and memory requirements depend on the many choices made in implementing Solr (document size, number of documents, and number of hits retrieved, to name a few).
However, you can try several things to improve the performance of Solr if you are using ZooKeeper.
Move ZooKeeper, if you are using it, to another disk. If the index is huge, the number of I/O calls from Solr to ZooKeeper will degrade overall performance.
Increase the ZooKeeper timeout period.
Log GC times; I have seen pauses of up to 20 s on ZooKeeper boxes (see the sketch after this list).
Use the recommendations at http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning to tune the heap.
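For the GC-logging point above, a minimal set of HotSpot options for capturing pause times would be something like the following, added to the JVM options that start Solr (the log path is a placeholder; flag names are the standard ones for Oracle JDK 6/7):

-verbose:gc
-Xloggc:/var/log/solr/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime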
As proposed in the previous discussion Using file system instead of database to store pdf files in jackrabbit,
we can use FileDataStore to store blob files in the file system instead of the database (in my case I have stored PDFs of ~100 KB each).
The problem I have faced is dealing with files that were previously stored in the blob store and that I want to remain accessible after switching to FileDataStore.
After adding FileDataStore support to the repository.xml,
when using the JcrUtils method getOrAddNode I get an ItemExistsException:
public static Node getOrAddNode(Node parent, String name)
        throws RepositoryException {
    if (parent.hasNode(name)) {
        return parent.getNode(name);
    } else {
        return parent.addNode(name);
    }
}
i.e. parent.hasNode(name) returns false (it seems the item doesn't exist),
but then we fall into parent.addNode(name), which consequently throws ItemExistsException.
Any help?
Is it necessary to migrate the blobs to the FileDataStore, or is there some kind of configuration so that Jackrabbit could search for blobs in different locations at the same time: in my case the MySQL database and the filesystem?
Some comments:
I have found at least several ways that could help with the migration job:
the BackupAndMigration wiki page (http://wiki.apache.org/jackrabbit/BackupAndMigration)
tells about using the JCR API (Session.exportSystemView(..) and then Session.importXML(..)), using the RepositoryCopier API, etc.
the jackrabbit-jcr-import-export-tool (see http://svn.apache.org/repos/asf/jackrabbit/sandbox/jackrabbit-jcr-import-export-tool/README.txt)
using the Jackrabbit standalone server (http://jackrabbit.apache.org/standalone-server.html)
It might be that there is repository corruption. That is, the node contains a child node entry for the given name (the node you want to add), but the child node itself doesn't exist. Especially in older versions of Jackrabbit you could get into this situation if multiple sessions concurrently tried to change the same nodes.
To fix such corruption problems, the bundle DB persistence managers support a consistency check & fix feature. You would need to set those options in the repository.xml and workspace.xml files and restart Jackrabbit. Once fixed, you can disable those options again.
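For illustration, with a bundle DB persistence manager those switches are ordinary parameters on the PersistenceManager element; this is only a sketch, and the class name and remaining parameters depend on your existing repository.xml:

<!-- Sketch only: keep your existing connection parameters; class name assumed for a MySQL setup -->
<PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.MySqlPersistenceManager">
  <!-- ... existing params (url, user, password, schemaObjectPrefix, ...) ... -->
  <param name="consistencyCheck" value="true"/>
  <param name="consistencyFix" value="true"/>
</PersistenceManager>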
There is also a way to fix such problems at runtime: set the system property org.apache.jackrabbit.autoFixCorruptions to true, and then traverse over all nodes in the repository.
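A rough sketch of that runtime approach, assuming you have a Session with sufficient rights (the class below is hypothetical; it just visits every node so that corrupt child-node entries are encountered and auto-fixed):

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class RepositoryWalker {

    /** Enables auto-fix and then touches every node in the workspace. */
    public static void walk(Session session) throws RepositoryException {
        System.setProperty("org.apache.jackrabbit.autoFixCorruptions", "true");
        traverse(session.getRootNode());
    }

    private static void traverse(Node node) throws RepositoryException {
        for (NodeIterator it = node.getNodes(); it.hasNext();) {
            traverse(it.nextNode());
        }
    }
}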
I need your help with performance problems running our corporate Java application on an HP-UX server. The application is a standalone tool that synchronizes data from several databases into one, communicates with a remote control over the XML-RPC protocol, and uses a local Derby (Java DB) database instance to hold configuration data, etc. We do not have performance problems on other environments under the same load, such as Windows XP, Linux and AIX, which use the Sun JVM. After a series of tests we found that the most time-consuming part was communication with the Derby database. Most of the time is spent reading from a socket, and this time is 10-100 times greater than on other platforms. We know for sure that Derby itself works fine, and we have CPU in reserve (usage is about 30%-40%), so the most probable culprit is the transport layer between the local database and the application.
Is there a way to diagnose socket I/O problems on HP-UX, or are there perhaps limits that can be configured? Maybe there is a necessary JVM option? Any ideas from your side would be highly appreciated.
We've tried to optimize the JVM options according to http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/index.jsp?topic=/com.ibm.websphere.wsfep.multiplatform.doc/info/ae/ae/tprf_tunejvm_v61.html but didn't get any significant improvement.
JVM info:
Java HotSpot(TM) 64-Bit Server VM (19.1-b02-jinteg:2011mar11-16:46 PA2.0W (aCC_AP), mixed mode)
Java: version 1.6.0.10, vendor "Hewlett-Packard Company"
We use the following instance:
OS: HP-UX (B.11.23)
Architecture: PA_RISC2.0W 64bit
Processors: 2
Total physical memory size: 4 088 MB
Swap size: 4 090 MB
Here is an example of the slow-running code. It takes several seconds to execute on HP-UX, while on Windows it takes 10-30 ms:
/** Template to communicate with local db. */
SimpleJdbcTemplate jdbcTemplate;

@Transactional(readOnly = true)
public List<JobLogEntry> getLastLogs(Integer dbnr, JobDataType dtype) {
    try {
        String uid = jdbcTemplate.queryForObject("SELECT session_uuid FROM "
                + tableName + " WHERE id=(SELECT max(id) FROM "
                + tableName + " WHERE dbnr=? AND dtype=?)",
                String.class, dbnr, dtype.name());
        List<JobLogEntry> list = jdbcTemplate.query("SELECT id, dbnr, dtype, zeit, level, message FROM "
                + tableName
                + " WHERE dbnr=? AND dtype=? AND session_uuid=? ORDER BY ID",
                new ConRowMapper(), dbnr, dtype.name(), uid);
        return list;
    } catch (org.springframework.dao.EmptyResultDataAccessException e) {
        return new ArrayList<JobLogEntry>();
    }
}

class ConRowMapper implements RowMapper<JobLogEntry> {

    private final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy.MM.dd HH:mm:ss");

    /**
     * Maps rows.
     */
    public JobLogEntry mapRow(ResultSet rs, int rowNum) throws SQLException {
        return new JobLogEntry(rs.getInt("dbnr"),
                rs.getString("dtype"),
                dateFormat.format(rs.getTimestamp("zeit")),
                rs.getString("level"),
                rs.getString("message"));
    }
}
Thanks in advance for all your ideas
I wonder about the method getLastLogs(). Why query to get the session UUID and then turn around and use it in another query? I would guess that it's possible to do it in one query; a sketch follows below.
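For illustration, a combined version might look like this, reusing the jdbcTemplate, tableName and ConRowMapper from the question (whether Derby's optimizer actually executes the nested subquery faster is something to measure):

public List<JobLogEntry> getLastLogs(Integer dbnr, JobDataType dtype) {
    String sql = "SELECT id, dbnr, dtype, zeit, level, message FROM " + tableName
            + " WHERE dbnr=? AND dtype=? AND session_uuid="
            + "   (SELECT session_uuid FROM " + tableName
            + "    WHERE id=(SELECT max(id) FROM " + tableName
            + "              WHERE dbnr=? AND dtype=?))"
            + " ORDER BY id";
    // query() returns an empty list when nothing matches, so no
    // EmptyResultDataAccessException handling is needed here.
    return jdbcTemplate.query(sql, new ConRowMapper(), dbnr, dtype.name(), dbnr, dtype.name());
}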
When you say Derby, it makes me think that only Java accesses that database. Is that true? Do you know that it's optimized well (e.g. proper indexes for every WHERE clause)?
Do you use connection pooling? That way you can pay the cost of creating connections up front and amortize it over all the queries you run.
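A sketch of what pooling could look like with Commons DBCP 1.x; the Derby client driver class, URL, and credentials here are placeholders, not taken from your setup:

import org.apache.commons.dbcp.BasicDataSource;
import org.springframework.jdbc.core.simple.SimpleJdbcTemplate;

public class PooledTemplateFactory {

    /** Builds a SimpleJdbcTemplate backed by a small connection pool. */
    public static SimpleJdbcTemplate create() {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("org.apache.derby.jdbc.ClientDriver"); // assumed network client driver
        ds.setUrl("jdbc:derby://localhost:1527/configdb");           // placeholder URL
        ds.setUsername("app");
        ds.setPassword("secret");
        ds.setInitialSize(2);   // connections created up front
        ds.setMaxActive(8);     // upper bound on pooled connections
        return new SimpleJdbcTemplate(ds);
    }
}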
I see jdbcTemplate, so you must be using Spring. I'd get the debug or trace interceptor wired in and see where the time is being spent.
I'd also recommend VisualVM 1.3.2 with all the plugins installed. It will give you a lot more data.
The probable reason could be slow and blocking GC work on HP-UX. Try removing redundant System.gc() calls and use some JVM GC options to optimize :)
See this nice presentation about HP performance tuning: http://www.scribd.com/doc/47433278/Javamemorymanagemen