neo4j upload csv UnmarshalException - java

I am uploading a 100 MB CSV file of transactional data to Neo4j. I am getting a Java error that I cannot trace back to any setting I could change.
neo4j-sh (?)$ CREATE CONSTRAINT ON (a:Account) ASSERT a.id IS UNIQUE;
+-------------------+
| No data returned. |
+-------------------+
Constraints added: 1
48 ms
neo4j-sh (?)$ USING PERIODIC COMMIT
> LOAD CSV FROM
> "file:/somepath/findata.csv"
> AS line
> FIELDTERMINATOR ','
> MERGE (a1:Account { id: toString(line[3]) })
> MERGE (a2:Account { id: toString(line[4]) })
> CREATE (a1)-[:LINK { value: toFloat(line[0]), date: line[5] } ]->(a2);
java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
java.io.EOFException
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:228)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
at com.sun.proxy.$Proxy1.interpretLine(Unknown Source)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:110)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:94)
at org.neo4j.shell.impl.AbstractClient.grabPrompt(AbstractClient.java:74)
at org.neo4j.shell.StartClient.grabPromptOrJustExecuteCommand(StartClient.java:357)
at org.neo4j.shell.StartClient.startRemote(StartClient.java:303)
at org.neo4j.shell.StartClient.start(StartClient.java:175)
at org.neo4j.shell.StartClient.main(StartClient.java:120)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:214)
... 11 more
I've tried the command twice and it gives me the same error both times. So far Google has not helped me figure out how to get around this error. What is happening in Neo4j, and how can I solve it?

Maybe this isn't the problem, but the path to your CSV file may be malformed, which would explain the java.rmi.UnmarshalException. The path should be 'file://' followed by an absolute path, which would be something like '/home/cantdutchthis/findata.csv' on a Linux system. On a Linux or Mac machine this means there will be three '/'s: 'file:///home/cantdutchthis/findata.csv'.
Grace and peace,
Jim
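If you would rather generate a well-formed URL than type it by hand, the JDK's java.nio Paths API produces the three-slash form for absolute paths. A small sketch (the path is the example from the answer above, so adjust it to your own file):
import java.nio.file.Paths;

public class FileUrl {
    public static void main(String[] args) {
        // Prints a file URL such as file:///home/cantdutchthis/findata.csv,
        // which can be pasted into the LOAD CSV statement.
        System.out.println(Paths.get("/home/cantdutchthis/findata.csv").toUri());
    }
}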

Related

ElasticSearch IndexMissingException while using prepareGet

I am trying to get data from Elasticsearch using the Java GET API, but I keep getting an IndexMissingException.
Exception in thread "main" org.elasticsearch.indices.IndexMissingException: [logstash-*] missing
at org.elasticsearch.cluster.metadata.MetaData.concreteIndices(MetaData.java:768)
at org.elasticsearch.cluster.metadata.MetaData.concreteIndices(MetaData.java:691)
at org.elasticsearch.cluster.metadata.MetaData.concreteSingleIndex(MetaData.java:748)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.<init>(TransportShardSingleOperationAction.java:139)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.<init>(TransportShardSingleOperationAction.java:116)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:89)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:55)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)
at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:193)
at org.elasticsearch.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:201)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
at elasticConnection.ClientElastic.main(ClientElastic.java:18)
I do have the index in Elasticsearch:
health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   events              5   1   39         0            48.7kb     48.7kb
yellow open   logstash-2016.03.30 5   1   152        0            137.8kb    137.8kb
please help.
Your indices are still waiting for replicas, which you generally want to avoid when working on a single node (that is why their status is yellow).
Run this command against your local host:
curl -XPUT 'localhost:9200/_settings' -d '{ "index" : { "number_of_replicas" : 0 } }'
This should change the index status to green and your program should be good to go.
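If you would rather do this from the same Java client you use for prepareGet, the indices admin API exposes the same setting. A minimal sketch, assuming the 1.x client API that your stack trace suggests (the settings builder class differs in later versions):
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class DisableReplicas {
    // 'client' is the same node/transport Client used elsewhere in the application.
    public static void disableReplicas(Client client) {
        client.admin().indices()
                .prepareUpdateSettings("logstash-2016.03.30") // the index from the question
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", 0)
                        .build())
                .execute()
                .actionGet();
    }
}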

How to read a .csv file using Renjin

I have a Grails application in which I want to use Renjin to carry out some statistics using R.
The code in my Grails app looks like this:
ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("Renjin");
engine.eval("data <- read.table('/path/to/my/app/R/file.csv', sep=',', na.strings=c('',' ','-','--'))");
When running my code I get the following Exception:
ERROR errors.GrailsExceptionResolver - IndexOutOfBoundsException occurred when processing request:
.......
Index: 49, Size: 49. Stacktrace follows:
Message: Index: 49, Size: 49
Line | Method
->> 635 | rangeCheck in java.util.ArrayList
I realise that Java arrays use zero-based indexing whereas R uses one-based indexing, and I think the issue is related to this. Is there a way to get around it?
Also, the CSV has 49 columns.
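Since the failing index (49) matches the column count, one thing worth checking is whether some row of the CSV has a different number of fields than the header. A plain-Java diagnostic sketch, not a fix (the path is the one from the question, and the naive split does not handle quoted commas):
import java.io.BufferedReader;
import java.io.FileReader;

public class CsvFieldCount {
    public static void main(String[] args) throws Exception {
        int expected = 49; // column count mentioned in the question
        try (BufferedReader in = new BufferedReader(new FileReader("/path/to/my/app/R/file.csv"))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                // -1 keeps trailing empty fields so they are counted too
                int fields = line.split(",", -1).length;
                if (fields != expected) {
                    System.out.println("line " + lineNo + " has " + fields + " fields");
                }
            }
        }
    }
}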

Replication of Solr Lucene index failing with the following error message

I am getting the error below when running Solr replication:
2013-12-27 05:03:32,391 [explicit-fetchindex-cmd] ERROR org.apache.solr.handler.ReplicationHandler- SnapPull failed :org.apache.solr.common.SolrException: Index fetch failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:485)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:319)
at org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:220)
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/apps/search/data/customers/solr/solr/adidas-archive/data/index.20131227050332242/segments_a")
at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:78)
at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:41)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:84)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:320)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:380)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:663)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:376)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:711)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:267)
at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:179)
at org.apache.solr.update.DirectUpdateHandler2.newIndexWriter(DirectUpdateHandler2.java:632)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:469)
... 2 more
My setup is a master on 3.x and a slave on 4.x.
This happened when I copied a very large index (100 GB+). What does this error mean? Does it mean that the index has become corrupt, and if so, what can be done to fix it? Any thoughts?
I ran the CheckIndex utility, but it again gives the error:
ERROR: could not read any segments file in directory
java.io.EOFException: read past EOF: MMapIndexInput(path="/apps/search/data/customers/solr/solr/adidas-archive/data/index.20131227051833263/segments_a")
at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:78)
at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:41)
When upgrading from Solr 3.x to 4.x you can use either of the following methods:
1. rebuild the index from scratch (time consuming), or
2. optimize the index (a SolrJ sketch follows below).
What I found when upgrading from 3.6 to 4.2 was that if your index is larger than 100 gigs (mine was over 170 gigs in Solr 3.6), it is more fruitful to rebuild the index from scratch, because new compression techniques are now implemented (my index size dropped from 170 gigs to 115 gigs in Solr 4.2). Otherwise, after optimization your index will be accepted by Solr 4.x, but its size will remain the same.
Please don't use different versions in the same replication setup. If you have been doing so in the past, please share the details; it would be really helpful.
Regards,
rajat
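For option 2, the optimize can be triggered from SolrJ as well as from the update URL. A minimal sketch, assuming SolrJ 3.6+/4.x where HttpSolrServer is available (the core URL below is a placeholder, not the one from the question):
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class OptimizeIndex {
    public static void main(String[] args) throws Exception {
        // Point this at the master core that holds the 3.x index.
        HttpSolrServer solr = new HttpSolrServer("http://master-host:8983/solr/adidas-archive");
        solr.optimize(); // merges segments; can take a long time on a 100 GB+ index
        solr.shutdown();
    }
}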

Filter not working after Foreach

For some reason, adding a filter to the statement below causes a couple of errors. In the console output I find "Failed to read data from ...". I also found this in the log:
Backend error message
---------------------
java.lang.NullPointerException
at org.apache.pig.builtin.Utf8StorageConverter.consumeTuple(Utf8StorageConverter.java:185)
at org.apache.pig.builtin.Utf8StorageConverter.consumeBag(Utf8StorageConverter.java:94)
at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:331)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:1562)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:228)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:282)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:416)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:3
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias limited
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias limited
at org.apache.pig.PigServer.openIterator(PigServer.java:838)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.io.IOException: Couldn't retrieve job.
at org.apache.pig.PigServer.store(PigServer.java:902)
at org.apache.pig.PigServer.openIterator(PigServer.java:813)
... 12 more
The code that I'm using is as follows:
--- Read the input
records = LOAD 'data' AS (id1, id2, link, tags:bag{}, dates);
counted = FOREACH records GENERATE (chararray) id1, (int) COUNT(tags) as amountOfTags;
filtered = FILTER counted BY amountOfTags > 0;
limited = limit filtered 10;
--- Save the result
dump limited;
Everything works fine until I add the 'filtered = ...' line and try to output the result.
Can anyone tell me why?

Apache Sqoop/Pig field escaping

We are exporting some data from MySQL using Sqoop, doing some processing with it via Apache Pig, and then attempting to export that data from HDFS back into a MySQL database. However, when exporting the data, we are running into issues:
java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NumberFormatException: For input string: ".proseries.com"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:449)
at java.lang.Integer.valueOf(Integer.java:554)
at mdm_urls.__loadFromFields(mdm_urls.java:419)
The HDFS data looks like (tab separated):
id:int url:text tld:text port:int
Somehow the tld field is being imported into the port column for some rows. Out of ~250M rows, this happens for fewer than 10. My initial assumption was that the url field must contain a tab. However, we have stripped all tabs in our Pig script:
REGISTER target/mystuff.jar;
legacy_urls = LOAD 'url' USING PigStorage(',') AS (id, sha1, url_text);
legacy_urls_norm = FOREACH legacy_urls GENERATE id AS id, sha1 AS sha1, REPLACE(REPLACE(url_text, '\n', ''), '\t', '') AS url_text;
urls = FOREACH legacy_urls_norm GENERATE id, url_text, mystuff.RootDomain(url_text), mystuff.Protocol(url_text), mystuff.Host(url_text), mystuff.Path(url_text), mystuff.EffectiveTld(url_text), mystuff.Port(url_text), sha1;
STORE urls INTO 'mdm_urls';
Here is my sqoop export command:
sqoop export --connect jdbc:mysql://hostname/db_name --input-fields-terminated-by "\t" --table test --export-dir my_urls
I am having a difficult time debugging this because the Sqoop errors give no indication of which row was being processed (so that I could confirm whether a tab character is still present, etc.). My first question is: how might I better troubleshoot this issue? My second question is: how are people escaping bad input data with Pig?
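One way to make the escaping more robust than nested REPLACE calls is a small cleaning UDF registered alongside mystuff.jar. A hypothetical sketch (the class name StripControlChars is an assumption, not part of the original script); it could then replace the two REPLACE calls in the FOREACH:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: removes tab, newline and carriage-return characters from a chararray field.
public class StripControlChars extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().replaceAll("[\\t\\n\\r]", "");
    }
}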
