How to read a .csv file using Renjin - java

I have a Grails application in which I want to use Renjin to carry out some statistics using R.
The code in my Grails app looks like this:
ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("Renjin");
engine.eval("data <- read.table('/path/to/my/app/R/file.csv', sep=',', na.strings=c('',' ','-','--'))");
When running my code I get the following Exception:
ERROR errors.GrailsExceptionResolver - IndexOutOfBoundsException occurred when processing request:
.......
Index: 49, Size: 49. Stacktrace follows:
Message: Index: 49, Size: 49
Line | Method
->> 635 | rangeCheck in java.util.ArrayList
I realise that Java arrays use zero-based indexing whereas R arrays use one-based indexing, and I think the issue is related to this. Is there a way to get around it?
Also, the CSV has 49 columns.

Creating dictionary from large Pyspark dataframe showing OutOfMemoryError: Java heap space

I have seen and tried many existing StackOverflow posts on this issue, but none of them work. I guess my Java heap space is not as large as expected for my dataset, which contains 6.5M rows. My Linux instance has 64 GB of RAM and 4 cores. As per this suggestion I need to fix my code, but I think building a dictionary from a PySpark dataframe should not be that costly. Please advise if there is another way to compute it.
I just want to make a Python dictionary from my PySpark dataframe. This is its content; property_sql_df.show() shows:
+--------------+------------+--------------------+--------------------+
| id|country_code| name| hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
| BOND-9129450| US|Scotron Home w/Ga...|90cb0946cf4139e12...|
| BOND-1742850| US|Sited in the Mead...|d5c301f00e9966483...|
| BOND-3211356| US|NEW LISTING - Com...|811fa26e240d726ec...|
| BOND-7630290| US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
| BOND-7175508| US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+
What I want is to make a dictionary with hash_of_cc_pn_li as the key and the corresponding ids as a list value.
Expected Output
{
    "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"],
    "d5c301f00e9966483": ["BOND-1742850", "BOND-7630290"]
}
What I have tried so far,
Way 1: causing java.lang.OutOfMemoryError: Java heap space
%%time
duplicate_property_list = {}
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]
Way 2: not working because PySpark has no native OFFSET
%%time
i = 0
limit = 1000000
for offset in range(0, total_record, limit):
    i = i + 1
    if i != 1:
        offset = offset + 1
    duplicate_property_list = {}
    duplicate_properties = {}
    # Preparing dataframe
    url = '''select id, hash_of_cc_pn_li from properties_df LIMIT {} OFFSET {}'''.format(limit, offset)
    properties_sql_df = spark.sql(url)
    # Grouping dataset
    rows = properties_sql_df.groupBy("hash_of_cc_pn_li").agg(F.collect_set("id").alias("ids")).collect()
    duplicate_property_list = { row.hash_of_cc_pn_li: row.ids for row in rows }
    # Filter the dictionary to keep only elements with a duplicate count >= 2
    # (filterTheDict is a user-defined helper)
    duplicate_properties = filterTheDict(duplicate_property_list, lambda elem: len(elem[1]) >= 2)
    # Writing to file
    with open('duplicate_detected/duplicate_property_list_all_' + str(i) + '.json', 'w') as fp:
        json.dump(duplicate_property_list, fp)
What I get now on the console:
java.lang.OutOfMemoryError: Java heap space
and this error is shown in the Jupyter notebook output:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)
This is the follow-up question I asked here: Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space
Why not keep as much data and processing on the executors as possible, rather than collecting to the driver? If I understand this correctly, you could use PySpark transformations and aggregations and save directly to JSON, thereby leveraging the executors, then load that JSON file (likely partitioned) back into Python as a dictionary. Admittedly, you introduce IO overhead, but this should let you get around your OOM heap space errors. Step-by-step:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

data = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])

df.groupBy(
    f.col("hash_of_cc_pn_li"),
).agg(
    f.collect_set("id").alias("id")  # use f.collect_list() here if you're not interested in deduplication of BOND-XXXXX values
).write.json("./test.json")
Inspecting the output path:
ls -l ./test.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 part-00000-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 50 Jul 27 08:29 part-00039-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00043-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00159-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 _SUCCESS
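Spark writes one part file per partition, which is why several part-XXXXX files (some empty) show up here. If a single output file is preferred, the dataframe could be coalesced to one partition before writing; a minimal sketch of that variant (my addition, not part of the original answer; the path ./test_single.json is made up):
# Same aggregation as above, but collapsed to a single partition so that
# exactly one JSON part file is written (fine for small results).
df.groupBy(f.col("hash_of_cc_pn_li")) \
    .agg(f.collect_set("id").alias("id")) \
    .coalesce(1) \
    .write.json("./test_single.json")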
Loading to Python as dict:
import json
from glob import glob

data = []
for file_name in glob('./test.json/*.json'):
    with open(file_name) as f:
        try:
            # each part file in this toy example holds at most one JSON record
            data.append(json.load(f))
        except json.JSONDecodeError:  # there is definitely a better way - this is here because some partitions might be empty
            pass
Finally, build the dict:
{item['hash_of_cc_pn_li']:item['id'] for item in data}
{'d5c3': ['BOND-7630290', 'BOND-1742850'],
'811f': ['BOND-3211356'],
'90cb': ['BOND-9129450', 'BOND-7175508']}
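If, as in Way 2 of the question, only hashes that map to more than one id are wanted, the result can be filtered either after loading or on the executors before writing. A sketch of both options, assuming the df and data objects built above (the names result_dict, duplicates_only and the f.size filter are mine, not from the original answer):
# Option A: filter the reloaded dictionary in plain Python
result_dict = {item['hash_of_cc_pn_li']: item['id'] for item in data}
duplicates_only = {k: v for k, v in result_dict.items() if len(v) >= 2}

# Option B: push the filter into Spark so only duplicate hashes are ever written
df.groupBy("hash_of_cc_pn_li") \
    .agg(f.collect_set("id").alias("id")) \
    .where(f.size("id") >= 2) \
    .write.json("./test_duplicates.json")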
I hope this helps! Thank you for the good question!

How to generate report from huge mongodb document [duplicate]

Using the code:
all_reviews = db_handle.find().sort('reviewDate', pymongo.ASCENDING)
print all_reviews.count()
print all_reviews[0]
print all_reviews[2000000]
The count prints 2043484, and all_reviews[0] prints fine.
However, when printing all_reviews[2000000], I get the error:
pymongo.errors.OperationFailure: database error: Runner error: Overflow sort stage buffered data usage of 33554495 bytes exceeds internal limit of 33554432 bytes
How do I handle this?
You're running into the 32MB limit on an in-memory sort:
https://docs.mongodb.com/manual/reference/limits/#Sort-Operations
Add an index to the sort field. That allows MongoDB to stream documents to you in sorted order, rather than attempting to load them all into memory on the server and sort them in memory before sending them to the client.
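A minimal pymongo sketch of that fix, assuming db_handle is the collection object from the question:
import pymongo

# Build an index on the sort field once; MongoDB can then walk the index
# and stream documents in order instead of buffering the whole sort in memory.
db_handle.create_index([("reviewDate", pymongo.ASCENDING)])

all_reviews = db_handle.find().sort("reviewDate", pymongo.ASCENDING)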
As said by kumar_harsh in the comments section, I would like to add another point.
You can view the current buffer limit with the command below against the admin database:
> use admin
switched to db admin
> db.runCommand( { getParameter : 1, "internalQueryExecMaxBlockingSortBytes" : 1 } )
{ "internalQueryExecMaxBlockingSortBytes" : 33554432, "ok" : 1 }
It has a default value of 32 MB (33554432 bytes). In this case you're running short of buffer space, so you can increase the limit to an optimal value of your own, for example 50 MB, as below:
> db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes:50151432})
{ "was" : 33554432, "ok" : 1 }
You can also set this limit permanently with the following parameter in the MongoDB config file:
setParameter=internalQueryExecMaxBlockingSortBytes=309715200
Hope this helps!
Note: this parameter is supported only on version 3.0+.
Solved with indexing:
db_handle.ensure_index([("reviewDate", pymongo.ASCENDING)])  # use create_index() on PyMongo 3+
If you want to avoid creating an index (e.g. you just want a quick-and-dirty check to explore the data), you can use aggregation with disk usage:
all_reviews = db_handle.aggregate([{$sort: {'reviewDate': 1}}], {allowDiskUse: true})
(Not sure how to do this in pymongo, though).
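For what it's worth, pymongo does accept this option as a keyword argument; a small sketch, again assuming db_handle is the collection:
# allowDiskUse lets the server spill the sort to disk instead of failing
# once the 32 MB in-memory sort limit is hit.
all_reviews = db_handle.aggregate(
    [{"$sort": {"reviewDate": 1}}],
    allowDiskUse=True,
)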
JavaScript API syntax for the index:
db_handle.ensureIndex({executedDate: 1})
In my case, it was necessary to define the required indexes in code and recreate them:
rake db:mongoid:create_indexes RAILS_ENV=production
The memory overflow does not occur when the needed index on the field exists.
PS: Before this I had to disable the errors thrown when creating long indexes:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
A reIndex may also be needed:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> use your_db
switched to db your_db
> db.getCollectionNames().forEach( function(collection){ db[collection].reIndex() } )

neo4j upload csv UnmarshalException

I am uploading a 100 MB CSV file containing transactional data to Neo4j. I am getting a Java error that I cannot seem to trace back to a setting or anything else I could change.
neo4j-sh (?)$ CREATE CONSTRAINT ON (a:Account) ASSERT a.id IS UNIQUE;
+-------------------+
| No data returned. |
+-------------------+
Constraints added: 1
48 ms
neo4j-sh (?)$ USING PERIODIC COMMIT
> LOAD CSV FROM
> "file:/somepath/findata.csv"
> AS line
> FIELDTERMINATOR ','
> MERGE (a1:Account { id: toString(line[3]) })
> MERGE (a2:Account { id: toString(line[4]) })
> CREATE (a1)-[:LINK { value: toFloat(line[0]), date: line[5] } ]->(a2);
java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
java.io.EOFException
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:228)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
at com.sun.proxy.$Proxy1.interpretLine(Unknown Source)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:110)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:94)
at org.neo4j.shell.impl.AbstractClient.grabPrompt(AbstractClient.java:74)
at org.neo4j.shell.StartClient.grabPromptOrJustExecuteCommand(StartClient.java:357)
at org.neo4j.shell.StartClient.startRemote(StartClient.java:303)
at org.neo4j.shell.StartClient.start(StartClient.java:175)
at org.neo4j.shell.StartClient.main(StartClient.java:120)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:214)
... 11 more
I've tried the command twice and it gives me the same error. So far Google has not helped me figure out what I can do to circumvent this error. What is happening in Neo4j, and how can I solve this?
Maybe this isn't the problem, but the path to your CSV file may be malformed, which would explain the java.rmi.UnmarshalException. The path should be 'file://<path>', where <path> would be something like '/home/cantdutchthis/findata.csv' on a Linux system. On a Linux or Mac machine this means there will be 3 '/'s: 'file:///home/cantdutchthis/findata.csv'.
Grace and peace,
Jim

ELKI DBSCAN : How to set dbc.parser?

I am doing DBSCAN clustering, and apart from latitude and longitude I have one more column that I want to see with the cluster results. For example, the data looks like this:
28.6029445 77.3443552 1
28.6029511 77.3443573 2
28.6029436 77.3443458 3
28.6029011 77.3443032 4
28.6028967 77.3443042 5
28.6029087 77.3442829 6
28.6029132 77.3442797 7
Now, in MiniGUI, if I set parser.labelindices to 2 and run the task, the output looks like this:
# Cluster: Cluster 0
ID=63222 28.6031295 77.3407848 441
ID=63225 28.603134 77.3407744 444
ID=63220 28.6031566667 77.3407816667 439
ID=63226 28.6030819 77.3407605 445
ID=63221 28.6032 77.3407616667 440
ID=63228 28.603085 77.34071 447
ID=63215 28.60318 77.3408583333 434
ID=63229 28.6030751 77.3407096 448
So it is still connected to the 3rd column, which I passed as a label. I have checked the clustering result by passing just latitude and longitude, and it is exactly the same. So, by passing a column as a 'label', I can retrieve that column along with lat/long in the cluster results.
Now I want to use this in my Java code:
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(
    FileBasedDatabaseConnection.Parameterizer.INPUT_ID,
    fileLocation);
params.addParameter(
    NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
    2);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
    RStarTreeFactory.class);
But this is giving a NullPointerException. In MiniGUI, dbc.parser is NumberVectorLabelParser by default, so this should work fine. What am I missing?
I will have a look at the NPE; it should produce a more helpful error message instead.
Most likely, the problem is that this parameter is of type List<Integer>, i.e. you would need to pass a list. Alternatively, you can pass a String, which will be parsed. The following should work just fine:
params.addParameter(
    NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
    "2");
Note that the text writer might (I have not checked this) print labels as-is, so you cannot take the output as an indication that it considered your data set to be 3-dimensional.
The debugging handler -resulthandler LogResultStructureResultHandler -verbose should give you type output:
java -jar elki.jar KDDCLIApplication -dbc.in dbpedia.gz \
-algorithm NullAlgorithm \
-resulthandler LogResultStructureResultHandler -verbose
should yield an output like this:
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 1941 ms
de.lmu.ifi.dbs.elki.algorithm.NullAlgorithm.runtime: 0 ms
BasicResult: Algorithm Step (main)
StaticArrayDatabase: Database (database)
DBIDView: Database IDs (DBID)
MaterializedRelation: DoubleVector,dim=2 (relation)
MaterializedRelation: LabelList (relation)
SettingsResult: Settings (settings)
In this case, my data set are coordinates from Wikipedia, along with a name each. I have a 2 dimensional DoubleVector relation, and a LabelList relation storing the object names.

Replication of solr Lucene index failing with following error message

I am getting the error below when running Solr replication:
2013-12-27 05:03:32,391 [explicit-fetchindex-cmd] ERROR org.apache.solr.handler.ReplicationHandler- SnapPull failed :org.apache.solr.common.SolrException: Index fetch failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:485)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:319)
at org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:220)
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/apps/search/data/customers/solr/solr/adidas-archive/data/index.20131227050332242/segments_a")
at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:78)
at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:41)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:84)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:320)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:380)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:663)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:376)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:711)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:267)
at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:179)
at org.apache.solr.update.DirectUpdateHandler2.newIndexWriter(DirectUpdateHandler2.java:632)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:469)
... 2 more
My setup is master on 3.x and slave on 4.x.
This happened when I copied a very large index (100 GB+). What does this error mean? Does it mean that the index has become corrupt, and if so, what can be done to fix it? Any thoughts?
I ran the CheckIndex utility, but it again gives the error:
ERROR: could not read any segments file in directory
java.io.EOFException: read past EOF: MMapIndexInput(path="/apps/search/data/customers/solr/solr/adidas-archive/data/index.20131227051833263/segments_a") at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:78)
at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:41)
To upgrade from Solr 3.x to 4.x you can use either of the following methods:
1. Rebuild the index from scratch (time consuming), or
2. Optimize the index.
What I found when upgrading from 3.6 to 4.2 was that if your index is larger than 100 GB (mine was over 170 GB in Solr 3.6), it is more fruitful to rebuild it from scratch, since new compression techniques are now implemented (my index size dropped from 170 GB to 115 GB in Solr 4.2). Otherwise, after optimization your index will be accepted by Solr 4.x, but its size will remain the same.
Please don't use different versions in the same replication system. If you have been doing so in the past, please share the details; that would be really helpful.
Regards,
Rajat
