I am using the REST API to send batches of SQL commands to OrientDB. Each batch contains about 6000 vertices and 13200 edges; the vertices and edges are created from a CSV file, and one batch holds all the queries generated from parsing 100 rows of the CSV. I am checking for duplicates before inserting either, as follows:
Vertex
UPDATE vertex SET property1 = "property1", property2 = "property2" UPSERT WHERE property1 = "property1";
Edge (using this method, since UPSERT is not supported for edges)
LET $1 = SELECT expand(bothE("edge1")) FROM vertex1 WHERE property1 = "property1";
LET $2 = SELECT expand(bothE("edge1")) FROM vertex2 WHERE property1 = "property1";
LET $3 = SELECT INTERSECT($1, $2);
IF($3.INTERSECT.size() == 0) {
LET $4 = CREATE EDGE edge1 FROM (SELECT FROM vertex1 WHERE property1 = "property1") TO (SELECT FROM vertex2 WHERE property1 = "property1");
}
The first batch takes about 50 seconds, the second about 250 seconds, and the third over 1000 seconds. Searching previously inserted records for duplicates is presumably what slows it down, but this is such a small amount of data that either my duplicate checking or the server config must be to blame.
Is there a better way to check for duplicates? I've updated my server config to use the following:
storage.diskCache.bufferSize = 40GB
query.parallelMinimumRecords = 1
storage.useWAL = false
Any advice appreciated
Related
I have a kind with around 5 million entities in the Google Cloud Datastore. I want to get this count programmatically using Java. I tried the following code, but it only works up to a certain threshold (800K).
When I ran the query for 5M records it never returned a count, so my guess is that it goes into an infinite loop. How can I get the entity count for data this big? I would not like to use the Google App Engine API, since it requires setting up an environment.
private static Datastore datastore;
datastore = DatastoreOptions.getDefaultInstance().getService();
Query query = Query.newKeyQueryBuilder().setKind(kind).build();
int count = Iterators.size(datastore.run(query)); //count has the entities count
How accurate do you need the count to be? For a slightly out-of-date count you can use a stats entity to fetch the number of entities for a kind.
If you can't use the stale counts from the stats entity, then you'll need to keep counter entities for the real time counts that you need. You should consider using a sharded counter.
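For reference, here is a rough sketch of reading those per-kind statistics with the same com.google.cloud.datastore client used in the question. The __Stat_Kind__ kind and its kind_name/count properties are the ones described in the Datastore statistics docs, and the values are refreshed only periodically, so treat the result as approximate.
// Sketch: read the approximate entity count for a kind from the built-in
// __Stat_Kind__ statistics entities (refreshed periodically, not real time).
Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
Query<Entity> statQuery = Query.newEntityQueryBuilder()
        .setKind("__Stat_Kind__")
        .setFilter(StructuredQuery.PropertyFilter.eq("kind_name", kind))
        .build();
QueryResults<Entity> results = datastore.run(statQuery);
if (results.hasNext()) {
    long approximateCount = results.next().getLong("count");
    System.out.println("Approximate entity count: " + approximateCount);
}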
Check out Google Dataflow. A pipeline like the following should do it:
import json

import apache_beam as beam
import requests
# note: Datastore I/O import paths differ across Beam versions
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.proto.datastore.v1 import query_pb2


def send_count_to_call_back(callback_url):
    def f(record_count):
        r = requests.post(callback_url, data=json.dumps({
            'record_count': record_count,
        }))
    return f


def run_pipeline(project, callback_url):
    pipeline_options = PipelineOptions.from_dictionary({
        'project': project,
        'runner': 'DataflowRunner',
        'staging_location': 'gs://%s.appspot.com/dataflow-data/staging' % project,
        'temp_location': 'gs://%s.appspot.com/dataflow-data/temp' % project,
        # .... other options
    })

    query = query_pb2.Query()
    query.kind.add().name = 'YOUR_KIND_NAME_GOES HERE'

    p = beam.Pipeline(options=pipeline_options)
    _ = (p
         | 'fetch all rows for query' >> ReadFromDatastore(project, query)
         | 'count rows' >> beam.combiners.Count.Globally()
         | 'send count to callback' >> beam.Map(send_count_to_call_back(callback_url))
         )
    p.run()
I use Python, but there is a Java SDK too: https://beam.apache.org/documentation/programming-guide/
The only issue is that your process will have to trigger this pipeline, let it run on its own for a few minutes, and then have it hit a callback URL to let you know it's done.
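For completeness, here is a rough Java sketch of the same pipeline. It has not been tested against a specific Beam release; the DatastoreIO read and Count.globally() transform are the ones documented in the Beam Java SDK, the project id is a placeholder, and posting the result to the callback URL is left out.
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Query;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.PCollection;

public class CountKindPipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        Query.Builder query = Query.newBuilder();
        query.addKindBuilder().setName("YOUR_KIND_NAME_GOES HERE");

        Pipeline p = Pipeline.create(options);
        PCollection<Long> count = p
                .apply("fetch all rows for query",
                        DatastoreIO.v1().read()
                                .withProjectId("your-project-id") // hypothetical project id
                                .withQuery(query.build()))
                .apply("count rows", Count.<Entity>globally());
        // TODO: apply a transform here that POSTs the single count to your callback URL.
        p.run();
    }
}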
I have a big problem with batch import in OrientDB when I use Java.
My data is a collection of record IDs and tokens. For each ID there exists a set of tokens, but a token can occur in several IDs.
Example:
ID Tokens
1 2,3,4
2 3,5,7
3 1,2,4
My graph should have two types of vertices: rIDClass and tokenClass. I want to give each vertex an ID corresponding to the recordID or the token, respectively, so the total number of tokenClass vertices should equal the number of unique tokens in the data. (Each token is only created once!)
How can I solve this? I tried the "Custom Batch Insert" from the original documentation, and I tried the "Batch Implementation" method described in the Blueprints documentation.
The problem with the first method is that OrientDB creates a separate vertex, with an ID set by the system itself, for each inserted token.
The problem with the second method is that when I try to add a vertex to the BatchGraph I can't set the corresponding vertex class, and additionally I get an exception. This is my code for the second method:
BatchGraph<OrientGraph> bgraph = new BatchGraph<OrientGraph>(graph, VertexIDType.STRING, 1);
Vertex vertex1 = bgraph.addVertex(1);
vertex1.setProperty("uid", 1);
Vertex vertex2 = bgraph.addVertex(2);
vertex2.setProperty("uid", 2);
Edge edge1 = graph.addEdge(12, vertex1, vertex2, "EdgeConnectClass");
And I get the following exception:
Exception in thread "main" java.lang.ClassCastException:
com.tinkerpop.blueprints.util.wrappers.batch.BatchGraph$BatchVertex cannot be cast to com.tinkerpop.blueprints.impls.orient.OrientVertex
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.addEdge(OrientBaseGraph.java:612)
    at App.indexRecords3(App.java:83)
    at App.main(App.java:47)
Maybe someone has a solution.
I don't know if I understood correctly, but if you want a schema where rIDClass vertices are connected to tokenClass vertices through EdgeConnectClass edges, try this:
Vertex vertex1 = g.addVertex("class:rIDClass");
vertex1.setProperty("uid", 1);
Vertex token2 = g.addVertex("class:tokenClass");
token2.setProperty("uid", 2);
Edge edge1 = g.addEdge("class:rIDClass", vertex1, token2, "EdgeConnectClass");
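To make sure each token vertex is created only once (the requirement in the question), you can keep a local cache of the token vertices you have already inserted. A minimal sketch, assuming the same graph g, classes and uid property as above, plus java.util.HashMap:
// Reuse token vertices via a local cache so every token is created exactly once.
Map<Integer, Vertex> tokenCache = new HashMap<>();

for (int token : new int[]{2, 3, 4}) {            // e.g. the tokens of record 1
    Vertex tokenVertex = tokenCache.get(token);
    if (tokenVertex == null) {                    // first time this token appears
        tokenVertex = g.addVertex("class:tokenClass");
        tokenVertex.setProperty("uid", token);
        tokenCache.put(token, tokenVertex);
    }
    g.addEdge(null, vertex1, tokenVertex, "EdgeConnectClass");
}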
Hope it helps
Regards
I am using Spark 2.11 and I am doing only 3 basic operations in my application:
taking records from the database: 2.2 million
checking which records from a file (5 000) are present in the database (2.2 million) using contains
writing the matched records to a CSV file
But these 3 operations take almost 20 minutes. If I do the same operations in SQL, it takes less than 1 minute.
I started using Spark because it is supposed to yield results very fast, but it is taking too much time. How can I improve the performance?
Step 1: taking records from the database.
Properties connectionProperties = new Properties();
connectionProperties.put("user", "test");
connectionProperties.put("password", "test##");
String query="(SELECT * from items)
dataFileContent= spark.read().jdbc("jdbc:oracle:thin:#//172.20.0.11/devad", query,connectionProperties);
Step2: checking records of file A (5k) present in file B (2M) using contains
Dataset<Row> NewSet=source.join(target,target.col("ItemIDTarget").contains(source.col("ItemIDSource")),"inner");
Step3: writing matched records to a file of CSV format
NewSet.repartition(1).select("*")
.write().format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("nullValue", "")
.save(fileAbsolutePath);
To improve the performance I have tried several things:
caching
data serialization: set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
shuffle tuning: sqlContext.setConf("spark.sql.shuffle.partitions", "10")
data structure tuning: -XX:+UseCompressedOops
None of these approaches yielded better performance.
Improving performance is mostly about improving parallelism.
Parallelism depends on the number of partitions of the RDD.
Make sure the Dataset/DataFrame/RDD has neither too many partitions nor too few; a quick check is sketched right after this.
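A quick way to see how many partitions you currently have, using the Dataset read from JDBC in the question:
// Inspect the current number of partitions of the Dataset.
int currentPartitions = dataFileContent.rdd().getNumPartitions();
System.out.println("partitions = " + currentPartitions);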
Please check the suggestions below where you can improve your code. I'm more comfortable with Scala, so some of the snippets are in Scala.
Step1:
Make sure you control how many connections you open to the database by specifying numPartitions.
Number of connections = number of partitions.
Below I just assigned 10 to num_partitions; you will have to tune this value to get more performance.
int num_partitions = 10;

Properties connectionProperties = new Properties();
connectionProperties.put("user", "test");
connectionProperties.put("password", "test##");
connectionProperties.put("partitionColumn", "hash_code"); // optional: the column is also passed explicitly below

String query = "(SELECT mod(A.id, " + num_partitions + ") as hash_code, A.* from items A)";

dataFileContent = spark.read()
    .jdbc("jdbc:oracle:thin:@//172.20.0.11/devad",
          query,            // dbtable
          "hash_code",      // columnName used for partitioning
          0,                // lowerBound
          num_partitions,   // upperBound
          num_partitions,   // numPartitions
          connectionProperties);
You can check the Spark documentation for how numPartitions works.
Step2:
Dataset<Row> NewSet = source.join(target,
target.col("ItemIDTarget").contains(source.col("ItemIDSource")),
"inner");
Since one of the tables/dataframes has only 5k records (a small amount of data), you can use a broadcast join, as mentioned below.
import org.apache.spark.sql.functions.broadcast
val joined_df = largeTableDF.join(broadcast(smallTableDF), "key")
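Since the question uses the Java API, here is the same broadcast hint in Java. This is just a sketch; it assumes source is the small 5k-row Dataset and keeps the original join condition:
// Broadcast the small Dataset so the join avoids shuffling the 2.2M-row side.
// (functions.broadcast comes from org.apache.spark.sql.functions)
Dataset<Row> newSet = target.join(
        functions.broadcast(source),
        target.col("ItemIDTarget").contains(source.col("ItemIDSource")),
        "inner");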
Step3:
Use coalesce to decrease the number of partitions, so that a full shuffle is avoided.
NewSet.coalesce(1).select("*")
.write().format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("nullValue", "")
.save(fileAbsolutePath);
Hope my answer helps you.
I have:
a database table with 400 000 000 rows (Cassandra 3)
a list of circa 10 000 keywords
both data sets are expected to grow in time
I need to:
check if a specified column contains a keyword
sum how many rows contained the keyword in the column
Which approach should I choose?
Approach 1 (Secondary index):
Create secondary SASI index on the table
Find matches for a given keyword "on the fly" at any time
However, I am afraid of:
capacity problems - secondary indices consume extra space, and for such a large table it could be too much
performance - I am not sure whether finding a keyword among hundreds of millions of rows can be done in a reasonable time
Approach 2 (Java job - brute force):
Java job that continuously iterates over data
Matches are saved into cache
Cache is updated during the next iteration
// Paginate through data...
String page = null;
do {
PagingState state = page == null ? null : PagingState.fromString(page);
PagedResult<DataRow> res = getDataPaged(query, status, PAGE_SIZE, state);
// Iterate through the current page ...
for (DataRow row : res.getResult()) {
// Skip empty titles
if (row.getTitle().length() == 0) {
continue;
}
// Find match in title
for (String k : keywords) {
if (k.length() > row.getTitle().length()) {
continue;
}
if (row.getTitle().toLowerCase().contains(k.toLowerCase())) {
// TODO: SAVE match
break;
}
}
}
status = res.getResult();
page = res.getPage();
// TODO: Wait here to reduce DB load
} while (page != null);
Problems
It could be very slow to iterate through the whole table. If I waited one second per every 1000 rows, the cycle would take 400 000 000 / 1000 = 400 000 seconds, i.e. about 4.6 days
This would require extra space for the cache; moreover, frequent deletions from the cache would produce tombstones in Cassandra
A better way would be to use a search engine like Solr or Elasticsearch. Full-text search is their speciality. You could easily dump your data from Cassandra into Elasticsearch and implement your Java job on top of Elasticsearch.
EDIT:
With Cassandra you can request your query result as JSON, and Elasticsearch 'speaks' only JSON, so you will be able to transfer your data very easily.
Elasticsearch
SolR
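If you go the Elasticsearch route, counting the rows whose title contains a keyword becomes a single count request. A rough sketch with the high-level REST client (org.elasticsearch.client); the index name rows and the field name title are placeholders:
// Hypothetical sketch: count documents whose "title" field matches a keyword.
RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));

CountRequest countRequest = new CountRequest("rows");              // placeholder index name
countRequest.source(new SearchSourceBuilder()
        .query(QueryBuilders.matchQuery("title", keyword)));       // keyword from your list

long matchingRows = client.count(countRequest, RequestOptions.DEFAULT).getCount();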
I have a database with 300 000 rows, and I need to filter some of the rows with an algorithm.
protected boolean validateMatch(DbMatch m) throws MatchException, NotSupportedSportException{
// expensive part
List<DbMatch> hh = sd.getMatches(DateService.beforeDay(m.getStart()), m.getHt(), m.getCountry(),m.getSportID());
List<DbMatch> ah = sd.getMatches(DateService.beforeDay(m.getStart()), m.getAt(), m.getCountry(),m.getSportID());
....
My Hibernate DAO function for loading the data from MySQL is called twice per match (once for the home team and once for the away team):
public List<DbMatch> getMatches(Date before,String team, String country,int sportID) throws NotSupportedSportException{
//Match_soccer where date between :start and :end
Criteria criteria = session.createCriteria(DbMatch.class);
criteria.add(Restrictions.le("start",before));
criteria.add(Restrictions.disjunction()
.add(Restrictions.eq("ht", team))
.add(Restrictions.eq("at", team)));
criteria.add(Restrictions.eq("country",country));
criteria.add(Restrictions.eq("sportID",sportID));
criteria.addOrder(Order.desc("start") );
return criteria.list();
}
Example of how I try to filter the data:
List<DbMatch> filter(List<DbMatch> mSet) throws MatchException, NotSupportedSportException {
    List<DbMatch> filtred = new ArrayList<>();
    for (DbMatch m : mSet) {
        if (validateMatch(m)) filtred.add(m);
    }
    return filtred;
}
(1) I tried different criteria settings and timed the function with a stopwatch. Running filter(matches) with a matches size of 1000, my program takes 3 min 21 s 659 ms.
(2) If I remove criteria.addOrder(Order.desc("start"));, the program filters in 3 min 12 s 811 ms.
(3) But if I remove criteria.addOrder(Order.desc("start")); and add criteria.setMaxResults(1);, the result is 22 s 311 ms.
Using the last configuration I can filter all my 300 000 records in roughly 22.3 s * 300 ≈ 6 700 s (~1.9 h), whereas with the first configuration I would have to wait about 201.7 s * 300 ≈ 60 500 s (~17 h).
If I want to use the criteria without the order and with the limit, I must be sure that my table is sorted by date in the database, because it is important to get the latest match.
All data is stored on matches table.
Table indexes:
Table, Non_unique, Key_name, Seq_in_index, Column_name, Collation, Cardinality, Sub_part, Packed, Null, Index_type, Comment, Index_comment
matches, 0, PRIMARY, 1, mid, A, 220712, , , , BTREE, ,
matches, 0, UK_kcenwf4m58fssuccpknl1v25v, 1, beid, A, 220712, , , YES, BTREE, ,
UPDATED
After adding ALTER TABLE matches ADD INDEX (sportID, country); the program time decreased to 15 s per 1000 matches. And if I also drop the ORDER BY and add the limit, I need to wait only 4 s per 1000 matches.
What should I do in this situation to improve the program's execution speed?
Your first order of business is to figure out how long each component takes to process the request.
Find the SQL query generated by the ORM, run it manually in MySQL Workbench, and see how long it takes (non-cached). You can also ask MySQL to EXPLAIN the index usage.
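If you are not sure what SQL Hibernate generates, you can make it print every statement. A minimal sketch, assuming you bootstrap Hibernate through a Configuration object (these are standard Hibernate property names):
// Make Hibernate log the SQL it generates, so it can be copied into MySQL
// Workbench and prefixed with EXPLAIN to inspect index usage.
Configuration cfg = new Configuration()
        .setProperty("hibernate.show_sql", "true")        // print each statement
        .setProperty("hibernate.format_sql", "true")      // pretty-print it
        .setProperty("hibernate.use_sql_comments", "true");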
If it's fast enough, then it's your Java code that's taking longer and you need to optimize your algorithm. You can use JConsole to dig further into that.
Once you identify which component is taking longer, you can post your analysis here and we can make suggestions accordingly.