I use the following code to create a graph with Neo4j Graph Database:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.lucene.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserters;
public class Neo4jMassiveInsertion implements Insertion {
private BatchInserter inserter = null;
private BatchInserterIndexProvider indexProvider = null;
private BatchInserterIndex nodes = null;
private static enum RelTypes implements RelationshipType {
SIMILAR
}
public static void main(String args[]) {
Neo4jMassiveInsertion test = new Neo4jMassiveInsertion();
test.startup("data/neo4j");
test.createGraph("data/enronEdges.txt");
test.shutdown();
}
/**
* Start neo4j database and configure for massive insertion
* @param neo4jDBDir
*/
public void startup(String neo4jDBDir) {
System.out.println("The Neo4j database is now starting . . . .");
Map<String, String> config = new HashMap<String, String>();
inserter = BatchInserters.inserter(neo4jDBDir, config);
indexProvider = new LuceneBatchInserterIndexProvider(inserter);
nodes = indexProvider.nodeIndex("nodes", MapUtil.stringMap("type", "exact"));
}
public void shutdown() {
System.out.println("The Neo4j database is now shuting down . . . .");
if(inserter != null) {
indexProvider.shutdown();
inserter.shutdown();
indexProvider = null;
inserter = null;
}
}
public void createGraph(String datasetDir) {
System.out.println("Creating the Neo4j database . . . .");
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
String line;
int lineCounter = 1;
Map<String, Object> properties;
IndexHits<Long> cache;
long srcNode, dstNode;
while((line = reader.readLine()) != null) {
if(lineCounter > 4) {
String[] parts = line.split("\t");
cache = nodes.get("nodeId", parts[0]);
if(cache.hasNext()) {
srcNode = cache.next();
}
else {
properties = MapUtil.map("nodeId", parts[0]);
srcNode = inserter.createNode(properties);
nodes.add(srcNode, properties);
nodes.flush();
}
cache = nodes.get("nodeId", parts[1]);
if(cache.hasNext()) {
dstNode = cache.next();
}
else {
properties = MapUtil.map("nodeId", parts[1]);
dstNode = inserter.createNode(properties);
nodes.add(dstNode, properties);
nodes.flush();
}
inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);
}
lineCounter++;
}
reader.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
}
Compared with other graph database technologies (Titan, OrientDB) it takes far too much time, so maybe I am doing something wrong. Is there a way to speed up the procedure?
I use Neo4j 1.9.5 and my machine has a 2.3 GHz CPU (i5), 4GB RAM and a 320GB disk, and I am running Mac OS X Mavericks (10.9). My heap size is set to 2GB.
Usually I can import about 1M nodes and 200k relationships per second on my macbook.
Flush & Search
Please don't flush & search on every insert, that totally kills performance.
Keep your node IDs in a HashMap (mapping your data's IDs to Neo4j node IDs), and only write to Lucene during the import.
(If you care about memory usage you can also go with something like gnu-trove)
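A minimal sketch of that idea, reusing the BatchInserter and BatchInserterIndex fields from the question (the class and method names here are only illustrative):
import java.util.HashMap;
import java.util.Map;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
// Caches external ids -> Neo4j node ids so the Lucene index is only written to,
// never flushed and queried, during the import.
class NodeIdCache {
    private final Map<String, Long> cache = new HashMap<String, Long>();
    private final BatchInserter inserter;
    private final BatchInserterIndex nodes;
    NodeIdCache(BatchInserter inserter, BatchInserterIndex nodes) {
        this.inserter = inserter;
        this.nodes = nodes;
    }
    long getOrCreate(String externalId) {
        Long nodeId = cache.get(externalId);
        if (nodeId == null) {
            Map<String, Object> properties = MapUtil.map("nodeId", externalId);
            nodeId = inserter.createNode(properties);
            nodes.add(nodeId, properties); // index write only, no flush()/get()
            cache.put(externalId, nodeId);
        }
        return nodeId;
    }
}
In createGraph you would then call getOrCreate(parts[0]) and getOrCreate(parts[1]) instead of querying and flushing the index for every line.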
RAM
You also use too little RAM (I usually use heaps between 4 and 60GB, depending on the data set size) and you don't have any config set.
Memory Mapping
As a sensible starting config, try something like the following; depending on your data volume I'd raise these numbers.
cache_type=none
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1000M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
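If you stay with the batch inserter from the question, these settings can be passed through the config map given to BatchInserters.inserter; the values below simply mirror the list above and should be raised for larger data sets:
Map<String, String> config = MapUtil.stringMap(
        "cache_type", "none",
        "use_memory_mapped_buffers", "true",
        "neostore.nodestore.db.mapped_memory", "200M",
        "neostore.relationshipstore.db.mapped_memory", "1000M",
        "neostore.propertystore.db.mapped_memory", "250M",
        "neostore.propertystore.db.strings.mapped_memory", "250M");
inserter = BatchInserters.inserter(neo4jDBDir, config);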
Heap
And make sure to give it enough heap. Your disk might also not be the fastest. Try to increase your heap to at least 3GB. Also make sure to have the latest JDK; 1.7.._b25 had a memory allocation issue (it allocated only a tiny bit of memory for the memory-mapped buffers).
I need to copy all of the contents of a stream of VectorSchemaRoots into a single object:
Stream<VectorSchemaRoot> data = fetchStream();
VectorSchemaRoot finalResult = VectorSchemaRoot.create(schema, allocator);
VectorLoader loader = new VectorLoader(finalResult);
data.forEach(current -> {
VectorUnloader unloader = new VectorUnloader(current);
ArrowRecordBatch batch = unloader.getRecordBatch();
loader.load(batch);
current.close();
});
However, I am getting the following error:
java.lang.IllegalStateException: Memory was leaked by query. Memory was leaked.
Also getting this further down the stack trace:
Could not load buffers for field date: Timestamp(MILLISECOND, null) not null. error message: A buffer can only be associated between two allocators that share the same root
I use the same allocator for everything, does anyone know why I am getting this issue?
The "leak" is probably just a side effect of the exception, because the code as written is not exception-safe. Use try-with-resources to manage the ArrowRecordBatch instead of manually calling close():
try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
loader.load(batch);
}
(though, depending on what load does, this may not be enough).
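For completeness, a sketch of the loop from the question with that change applied; fetchStream(), schema and allocator are assumed to exist exactly as in the original snippet:
Stream<VectorSchemaRoot> data = fetchStream();
VectorSchemaRoot finalResult = VectorSchemaRoot.create(schema, allocator);
VectorLoader loader = new VectorLoader(finalResult);
data.forEach(current -> {
    VectorUnloader unloader = new VectorUnloader(current);
    try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
        loader.load(batch); // load the batch into finalResult
    } finally {
        current.close();    // release the source root even if load() throws
    }
});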
I can't say much else about why you're getting the exception without seeing more code and the full stack trace.
Could you try with something like this:
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import java.util.Arrays;
import java.util.Collections;
import java.util.stream.Stream;
public class StackOverFlowSolved {
public static void main(String[] args) {
try(BufferAllocator allocator = new RootAllocator()){
// load data
IntVector ageColumn = new IntVector("age", allocator);
ageColumn.allocateNew();
ageColumn.set(0, 1);
ageColumn.set(1, 2);
ageColumn.set(2, 3);
ageColumn.setValueCount(3);
Stream<VectorSchemaRoot> streamOfVSR = Collections.singletonList(VectorSchemaRoot.of(ageColumn)).stream();
// transfer data
streamOfVSR.forEach(current -> {
Field ageLoad = new Field("age",
FieldType.nullable(new ArrowType.Int(32, true)), null);
Schema schema = new Schema(Arrays.asList(ageLoad));
try (VectorSchemaRoot root = VectorSchemaRoot.create(schema,
allocator.newChildAllocator("loaddata", 0, Integer.MAX_VALUE))) {
VectorUnloader unload = new VectorUnloader(current);
try (ArrowRecordBatch recordBatch = unload.getRecordBatch()) {
VectorLoader loader = new VectorLoader(root);
loader.load(recordBatch);
}
System.out.println(root.contentToTSVString());
}
current.close();
});
}
}
}
This prints:
age
1
2
3
I'm trying to create and save a generated model directly from Java. The documentation specifies how to do this in R and Python, but not in Java. A similar question was asked before, but no real answer was provided (beyond linking to H2O doc, which doesn't contain a code example).
It'd be sufficient for my present purpose to get some pointers to be able to translate the following reference code to Java. I'm mainly looking for guidance on the relevant JAR(s) to import from the Maven repository.
import h2o
h2o.init()
path = h2o.system_file("prostate.csv")
h2o_df = h2o.import_file(path)
h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
model = h2o.glm(y = "CAPSULE",
x = ["AGE", "RACE", "PSA", "GLEASON"],
training_frame = h2o_df,
family = "binomial")
h2o.download_pojo(model)
I think I've figured out an answer to my question. A self-contained sample code follows. However, I'll still appreciate an answer from the community since I don't know if this is the best/idiomatic way to do it.
package org.name.company;
import hex.glm.GLMModel;
import water.H2O;
import water.Key;
import water.api.StreamWriter;
import water.api.StreamingSchema;
import water.fvec.Frame;
import water.fvec.NFSFileVec;
import hex.glm.GLMModel.GLMParameters.Family;
import hex.glm.GLMModel.GLMParameters;
import hex.glm.GLM;
import water.util.JCodeGen;
import java.io.*;
import java.util.Map;
public class Launcher
{
public static void initCloud(){
String[] args = new String [] {"-name", "h2o_test_cloud"};
H2O.main(args);
H2O.waitForCloudSize(1, 10 * 1000);
}
public static void main( String[] args ) throws Exception {
// Initialize the cloud
initCloud();
// Create a Frame object from CSV
File f = new File("/path/to/data.csv");
NFSFileVec nfs = NFSFileVec.make(f);
Key frameKey = Key.make("frameKey");
Frame fr = water.parser.ParseDataset.parse(frameKey, nfs._key);
// Create a GLM and output coefficients
Key modelKey = Key.make("modelKey");
try {
GLMParameters params = new GLMParameters();
params._train = frameKey;
params._response_column = fr.names()[1];
params._intercept = true;
params._lambda = new double[]{0};
params._family = Family.gaussian;
GLMModel model = new GLM(params).trainModel().get();
Map<String, Double> coefs = model.coefficients();
for(Map.Entry<String, Double> entry : coefs.entrySet()) {
System.out.format("%s: %f\n", entry.getKey(), entry.getValue());
}
String filename = JCodeGen.toJavaId(model._key.toString()) + ".java";
StreamingSchema ss = new StreamingSchema(model.new JavaModelStreamWriter(false), filename);
StreamWriter sw = ss.getStreamWriter();
OutputStream os = new FileOutputStream("/base/path/" + filename);
sw.writeTo(os);
} finally {
if (fr != null) {
fr.remove();
}
}
}
}
Would something like this do the trick?
public void saveModel(URI uri, Keyed<Frame> model)
{
Persist p = H2O.getPM().getPersistForURI(uri);
OutputStream os = p.create(uri.toString(), true);
model.writeAll(new AutoBuffer(os, true)).close();
}
Make sure the URI has a proper form, otherwise H2O will break with an NPE. As for Maven, you should be able to get away with just h2o-core.
<dependency>
<groupId>ai.h2o</groupId>
<artifactId>h2o-core</artifactId>
<version>3.14.0.2</version>
</dependency>
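For example, a well-formed file URI (the path below is just a placeholder) could be passed to the helper above like this:
import java.net.URI;
URI target = URI.create("file:///tmp/h2o-models/my_model.bin"); // absolute path with an explicit scheme
saveModel(target, model); // 'model' is the Keyed handle from the snippet above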
I am using Java 8 with Apache OpenNLP. I have a service that extracts all the nouns from a paragraph. This works as expected on my localhost server. I also had this running on an OpenShift server with no problems. However, it does use a lot of memory. I need to deploy my application to AWS Elastic Beanstalk Tomcat Server.
One solution is I could probably upgrade from AWS Elastic Beanstalk Tomcat Server t1.micro to another instance type. But I am on a small budget, and want to avoid the extra fees if possible.
When I run the app, and it tries to do the word chunking, it gets the following error:
dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space] with root cause
java.lang.OutOfMemoryError: Java heap space
at opennlp.tools.ml.model.AbstractModelReader.getParameters(AbstractModelReader.java:148)
at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:75)
at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:59)
at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:328)
at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:256)
at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:179)
at opennlp.tools.parser.ParserModel.<init>(ParserModel.java:180)
at com.jobs.spring.service.lang.LanguageChunkerServiceImpl.init(LanguageChunkerServiceImpl.java:35)
at com.jobs.spring.service.lang.LanguageChunkerServiceImpl.getNouns(LanguageChunkerServiceImpl.java:46)
Question
Is there a way to either:
Reduce the amount of memory used when extracting the nouns from a paragraph.
Use a different API other than Apache OpenNLP that won't use as much memory.
Configure AWS Elastic Beanstalk Tomcat Server to cope with the demands.
Code Sample:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.util.InvalidFormatException;
@Component("languageChunkerService")
@Transactional
public class LanguageChunkerServiceImpl implements LanguageChunkerService {
private Set<String> nouns = null;
private InputStream modelInParse = null;
private ParserModel model = null;
private Parser parser = null;
public void init() throws InvalidFormatException, IOException {
ClassLoader classLoader = getClass().getClassLoader();
File file = new File(classLoader.getResource("en-parser-chunking.bin").getFile());
modelInParse = new FileInputStream(file.getAbsolutePath());
// load chunking model
model = new ParserModel(modelInParse); // line 35
// create parse tree
parser = ParserFactory.create(model);
}
@Override
public Set<String> getNouns(String sentenceToExtract) {
Set<String> extractedNouns = new HashSet<String>();
nouns = new HashSet<>();
try {
if (parser == null) {
init();
}
Parse topParses[] = ParserTool.parseLine(sentenceToExtract, parser, 1);
// call subroutine to extract noun phrases
for (Parse p : topParses) {
getNounPhrases(p);
}
// print noun phrases
for (String s : nouns) {
String word = s.replaceAll("[^a-zA-Z ]", "").toLowerCase();// .split("\\s+");
//System.out.println(word);
extractedNouns.add(word);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (modelInParse != null) {
try {
modelInParse.close();
} catch (IOException e) {
}
}
}
return extractedNouns;
}
// recursively loop through tree, extracting noun phrases
private void getNounPhrases(Parse p) {
if (p.getType().equals("NN")) { // NP=noun phrase
// System.out.println(p.getCoveredText()+" "+p.getType());
nouns.add(p.getCoveredText());
}
for (Parse child : p.getChildren())
getNounPhrases(child);
}
}
UPDATE
Tomcat8 config:
First of all you should try to optimize your code. Start by precompiling your regex with Pattern.compile() instead of calling replaceAll directly, since replaceAll recompiles the pattern and does the replacement in memory on every call. (https://eyalsch.wordpress.com/2009/05/21/regex/)
Second, you should not store the parsed sentences in an array. Third hint: you could try allocating the memory for your array using a ByteBuffer. Another hint, which may affect you the most: use a BufferedReader to read your chunked file. (out of memory error, java heap space)
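For the first tip, a rough sketch of precompiling the pattern outside the loop (the field name is only illustrative, and the regex is the one from the question's getNouns method):
import java.util.regex.Pattern;
// Compile once and reuse for every noun; String.replaceAll recompiles the regex on each call.
private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z ]");
// ... inside the loop over the extracted nouns:
String word = NON_LETTERS.matcher(s).replaceAll("").toLowerCase();
extractedNouns.add(word);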
After this you should already see lower memory usage. If those tips didn't help, please provide a memory dump/allocation graph.
One more tip: a HashSet takes about 5.5x more memory than an unordered List. (Performance and memory allocation comparison between List and Set)
I created a file helloworld.txt. Now I'm reading from the file and then I want to load the contents of the file into the cache, and whenever the cache is updated, it should write to the file as well.
This is my code so far:
Please tell me what to do to load the cache and then write from the cache to the file, as the instructions are not clear in the Apache Ignite documentation.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.IgniteException;
import org.apache.ignite.Ignition;
import org.apache.ignite.examples.ExampleNodeStartup;
import org.apache.ignite.examples.ExamplesUtils;
public class FileRead {
/** Cache name. */
private static final String CACHE_NAME = "FileCache";
/** Heap size required to run this example. */
public static final int MIN_MEMORY = 512 * 1024 * 1024;
/**
* Executes example.
*
* @param args Command line arguments, none required.
* @throws IgniteException If example execution failed.
*/
public static void main(String[] args) throws IgniteException {
ExamplesUtils.checkMinMemory(MIN_MEMORY);
try (Ignite ignite = Ignition.start("examples/config/example-ignite.xml")) {
System.out.println();
try (IgniteCache<Integer, String> cache = ignite.getOrCreateCache(CACHE_NAME)) {
long start = System.currentTimeMillis();
try (IgniteDataStreamer<Integer, String> stmr = ignite.dataStreamer(CACHE_NAME)) {
// Configure loader.
stmr.perNodeBufferSize(1024);
stmr.perNodeParallelOperations(8);
///FileReads();
try {
BufferedReader in = new BufferedReader
(new FileReader("/Users/akritibahal/Desktop/helloworld.txt"));
String str;
int i=0;
while ((str = in.readLine()) != null) {
System.out.println(str);
stmr.addData(i,str);
i++;
}
System.out.println("Loaded " + i + " keys.");
}
catch (IOException e) {
}
}
}
}
}
}
For information on how to load the cache from a persistence store please refer to this page: https://apacheignite.readme.io/docs/data-loading
You have two options:
Start a client node, create IgniteDataStreamer and use it to load the data. Simply call addData() for each line in the file.
Implement CacheStore.loadCache() method, provide the implementation in the cache configuration and call IgniteCache.loadCache().
The second approach will require the file to be on all server nodes, but there will be no communication between nodes, so it will most likely be faster.
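A rough sketch of that second option, assuming the same file path and the "line number -> line" keys used in the question; the write-through behaviour (appending each value to the file) is only an illustration of the callback:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import javax.cache.Cache;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.lang.IgniteBiInClosure;
public class FileCacheStore extends CacheStoreAdapter<Integer, String> {
    private static final String PATH = "/Users/akritibahal/Desktop/helloworld.txt";
    // Called by IgniteCache.loadCache(): push every line of the file into the cache.
    @Override public void loadCache(IgniteBiInClosure<Integer, String> clo, Object... args) {
        try (BufferedReader in = new BufferedReader(new FileReader(PATH))) {
            String line;
            int i = 0;
            while ((line = in.readLine()) != null)
                clo.apply(i++, line);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
    // Write-through: called on every cache update (this sketch simply appends the value).
    @Override public void write(Cache.Entry<? extends Integer, ? extends String> entry) {
        try (PrintWriter out = new PrintWriter(new FileWriter(PATH, true))) {
            out.println(entry.getValue());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
    @Override public String load(Integer key) {
        return null; // per-key read-through is not implemented in this sketch
    }
    @Override public void delete(Object key) {
        // not implemented in this sketch
    }
}
It would then be plugged in through the cache configuration and triggered with loadCache():
CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("FileCache");
cfg.setCacheStoreFactory(javax.cache.configuration.FactoryBuilder.factoryOf(FileCacheStore.class));
cfg.setWriteThrough(true);
ignite.getOrCreateCache(cfg).loadCache(null);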
I have a directory that contains 200 million HTML files (don't look at me, I didn't create this mess, I just have to deal with it). I need to index every HTML file in that directory into Solr. I've been reading guides on getting the job done, and I've got something going right now. After about an hour, I've got about 100k indexed, meaning this is going to take roughly 85 days.
I'm indexing the files to a standalone Solr server, running on a c4.8xlarge AWS EC2 instance. Here's the output from free -m with the Solr server running, and the indexer I wrote running as well:
             total       used       free     shared    buffers     cached
Mem:         60387      12981      47405          0         19       4732
-/+ buffers/cache:       8229      52157
Swap:            0          0          0
As you can see, I'm doing pretty good on resources. I increased the number of maxWarmingSearchers to 200 in my Solr config, because I was getting the error:
Exceeded limit of maxWarmingSearchers=2, try again later
Alright, but I don't think increasing that limit was really the right approach. I think the issue is that for each file, I am doing a commit, and I should be doing this in bulk (say 50k files / commit), but I'm not entirely sure how to adapt this code for that, and every example I see does a single file at a time. I really need to do everything I can to make this run as fast as possible, since I don't really have 85 days to wait on getting the data in Solr.
Here's my code:
Index.java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
public class Index {
public static void main(String[] args) {
String directory = "/opt/html";
String solrUrl = "URL";
final int QUEUE_SIZE = 250000;
final int MAX_THREADS = 300;
BlockingQueue<String> queue = new LinkedBlockingQueue<>(QUEUE_SIZE);
SolrProducer producer = new SolrProducer(queue, directory);
new Thread(producer).start();
for (int i = 1; i <= MAX_THREADS; i++)
new Thread(new SolrConsumer(queue, solrUrl)).start();
}
}
Producer.java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.BlockingQueue;
public class SolrProducer implements Runnable {
private BlockingQueue<String> queue;
private String directory;
public SolrProducer(BlockingQueue<String> queue, String directory) {
this.queue = queue;
this.directory = directory;
}
@Override
public void run() {
try {
Path path = Paths.get(directory);
Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
if (!attrs.isDirectory()) {
try {
queue.put(file.toString());
} catch (InterruptedException e) {
}
}
return FileVisitResult.CONTINUE;
}
});
} catch (IOException e) {
e.printStackTrace();
}
}
}
Consumer.java
import co.talentiq.common.net.SolrManager;
import org.apache.solr.client.solrj.SolrServerException;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
public class SolrConsumer implements Runnable {
private BlockingQueue<String> queue;
private static SolrManager sm;
public SolrConsumer(BlockingQueue<String> queue, String url) {
this.queue = queue;
if (sm == null)
this.sm = new SolrManager(url);
}
#Override
public void run() {
try {
while (true) {
String file = queue.take();
sm.indexFile(file);
}
} catch (InterruptedException e) {
e.printStackTrace();
} catch (SolrServerException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
SolrManager.java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import java.io.File;
import java.io.IOException;
import java.util.UUID;
public class SolrManager {
private static String urlString;
private static SolrClient solr;
public SolrManager(String url) {
urlString = url;
if (solr == null)
solr = new HttpSolrClient(url);
}
public void indexFile(String fileName) throws IOException, SolrServerException {
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
String solrId = UUID.randomUUID().toString();
up.addFile(new File(fileName), solrId);
up.setParam("literal.id", solrId);
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solr.request(up);
}
}
You can use up.setCommitWithin(10000); to make Solr just commit automagically at least every ten seconds. Increase the value to make Solr commit each minute (60000) or each ten minutes (600000). Remove the explicit commit (setAction(..)).
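A minimal sketch of the indexFile method from the question with that change applied (the ten-second commitWithin is just an example value):
public void indexFile(String fileName) throws IOException, SolrServerException {
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    String solrId = UUID.randomUUID().toString();
    up.addFile(new File(fileName), solrId);
    up.setParam("literal.id", solrId);
    up.setCommitWithin(10000); // let Solr commit on its own, at most 10s after the add
    solr.request(up);          // no explicit ACTION.COMMIT per document
}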
Another option is to configure autoCommit in your configuration file.
You might also be able to index more quickly by moving the HTML extraction process out of Solr (and just submitting the text to be indexed), or by expanding the number of servers you're posting to (more nodes in the cluster).
I am guessing you won't be searching the index in parallel while documents are being indexed. So here are the things that you could do.
You can configure the auto commit option in your solrconfig.xml. It can be done based on the number of documents or a time interval. For you, the number-of-documents option would make more sense.
Remove the call to the setAction() method on the ContentStreamUpdateRequest object. Instead, maintain a count of the number of calls made to the indexFile() method. Say it reaches 25000/10000 (you can limit the count based on your heap); for that indexing call alone, perform the commit using the SolrClient object with solr.commit(), so that the commit is made once per specified count.
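A rough sketch of that counter idea inside the question's SolrManager (the 10,000 threshold and the field names are only illustrative):
import java.util.concurrent.atomic.AtomicLong;
// Commit once per COMMIT_EVERY indexed documents instead of once per file.
private static final int COMMIT_EVERY = 10000; // tune this to your heap
private static final AtomicLong indexed = new AtomicLong();
public void indexFile(String fileName) throws IOException, SolrServerException {
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    String solrId = UUID.randomUUID().toString();
    up.addFile(new File(fileName), solrId);
    up.setParam("literal.id", solrId);
    solr.request(up); // no per-document commit
    if (indexed.incrementAndGet() % COMMIT_EVERY == 0)
        solr.commit(); // one commit per COMMIT_EVERY documents
}
An AtomicLong is used because several consumer threads call indexFile() concurrently.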
Let me know the results. Good Luck!