I'm trying to work with OrientDB using the Java API. My aim is to add some vertices to a freshly created database.
First of all I created a connection to the database:
graph = new OrientGraphFactory("PLOCAL:localhost/affiliates", "admin", "admin").getNoTx();
Next, in a loop, I created a number of vertices:
private Vertex createVertex(Company company) {
    Vertex vertex = graph.addVertex(company);
    //vertex.setProperty("id", company.getId());
    if (company.getName() != null) {
        vertex.setProperty("name", company.getName());
    }
    if (company.getInn() != null) {
        vertex.setProperty("inn", company.getInn());
    }
    if (company.getOgrn() != null) {
        vertex.setProperty("ogrn", company.getOgrn());
    }
    return vertex;
}
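For reference, the Blueprints addVertex method takes an id hint rather than a domain object; with OrientDB a string of the form "class:Name" is commonly used to place the vertex into a specific vertex class. A minimal sketch, assuming a Company vertex class is intended:
// Sketch only: create the vertex directly in a hypothetical "Company" class instead of the generic V class.
Vertex vertex = graph.addVertex("class:Company");
vertex.setProperty("name", company.getName());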
After that (in the same session) I counted the created vertices like this:
Integer vertexCount = 0;
Iterable<Vertex> vertices = graph.getVertices();
Iterator<Vertex> iterator = vertices.iterator(); // get the iterator once; calling vertices.iterator() inside the loop would restart the iteration each time
while (iterator.hasNext()) {
    Vertex vertex = iterator.next();
    vertexCount++;
}
log.info("vertexCount = " + vertexCount);
I found the number of vertices I expected (100).
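As an aside, OrientDB's Blueprints graph also exposes a direct counter, which avoids iterating; a sketch, assuming graph is an OrientGraphNoTx:
long total = graph.countVertices();       // all vertices
long onlyV = graph.countVertices("V");    // vertices of one class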
After that I tried to see those vertices in the OrientDB console, but I see nothing:
orientdb {db=affiliates}> classes
CLASSES
+----+-----------+-------------+------------+-----+
|# |NAME |SUPER-CLASSES|CLUSTERS |COUNT|
+----+-----------+-------------+------------+-----+
|0 |_studio | |_studio(12) | 1|
|1 |E | |e(10) | 0|
|2 |OFunction | |ofunction(6)| 0|
|3 |OIdentity | |- | 0|
|4 |ORestricted| |- | 0|
|5 |ORole |[OIdentity] |orole(4) | 3|
|6 |OSchedule | |oschedule(8)| 0|
|7 |OSequence | |osequence(7)| 0|
|8 |OTriggered | |- | 0|
|9 |OUser |[OIdentity] |ouser(5) | 3|
|10 |V | |v(9) | 0|
+----+-----------+-------------+------------+-----+
| |TOTAL | | | 7|
+----+-----------+-------------+------------+-----+
What have I missed?
I have a bunch of data with 20000 rows in a JavaRDD. Now I want to save several files of exactly the same size (e.g. 70 rows per file).
I tried it with the code below, but because 20000 is not exactly divisible, some of the resulting files end up with 69, 70 or 71 rows. The struggle is that I need them all to have the same size, except for the last file (which can have fewer).
Help is appreciated! Thanks in advance, guys!
myString.repartition(286).saveAsTextFile(outputPath);
You can use filterByRange on an RDD keyed by the row index and do something like this (pseudo code; note that filterByRange is inclusive on both bounds):
for i = 0; i < indexedRDD.count; i += 70
    val tempRDD = indexedRDD.filterByRange(i, i + 69).repartition(1)
    tempRDD.saveAsTextFile(outputPath + i.toString)
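Since the question works with a JavaRDD, a rough Java rendering of the same idea might look like the sketch below (illustrative only; myString and outputPath come from the question, everything else is made up):
// Needs: import org.apache.spark.api.java.JavaPairRDD; import scala.Tuple2;
// Sketch: pair every row with its position, then write one file per 70-row slice.
JavaPairRDD<Long, String> indexed = myString
    .zipWithIndex()                                  // (row, index)
    .mapToPair(t -> new Tuple2<>(t._2(), t._1()));   // (index, row)
indexed.cache();                                     // avoid recomputing the lineage on every pass

long total = indexed.count();
for (long start = 0; start < total; start += 70) {
    final long from = start;
    indexed
        .filter(t -> t._1() >= from && t._1() < from + 70)  // one 70-row slice (the last one may be smaller)
        .values()
        .coalesce(1)                                         // a single part file per slice
        .saveAsTextFile(outputPath + "/" + from);
}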
Unfortunately a Scala answer, but it works.
First define a custom partitioner:
class IndexPartitioner[V](n_per_part: Int, rdd: org.apache.spark.rdd.RDD[_ <: Product2[Long, V]], do_cache: Boolean = true) extends org.apache.spark.Partitioner {
val max = {
if (do_cache) rdd.cache()
rdd.map(_._1).max
}
override def numPartitions: Int = math.ceil(max.toDouble/n_per_part).toInt
override def getPartition(key: Any): Int = key match {
case k:Long => (k/n_per_part).toInt
case _ => (key.hashCode/n_per_part).toInt
}
}
Create an RDD of random strings and index it:
val rdd = sc.parallelize(Array.tabulate(1000)(_ => scala.util.Random.alphanumeric.filter(_.isLetter).take(5).mkString))
val rdd_idx = rdd.zipWithIndex.map(_.swap)
Create the partitioner and apply it:
val partitioner = new IndexPartitioner(70, rdd_idx)
val rdd_part = rdd_idx.partitionBy(partitioner).values
Check partition sizes:
rdd_part
.mapPartitionsWithIndex{case (i,rows) => Iterator((i,rows.size))}
.toDF("partition_number","number_of_records")
.show
/**
+----------------+-----------------+
|partition_number|number_of_records|
+----------------+-----------------+
|               0|               70|
|               1|               70|
|               2|               70|
|               3|               70|
|               4|               70|
|               5|               70|
|               6|               70|
|               7|               70|
|               8|               70|
|               9|               70|
|              10|               70|
|              11|               70|
|              12|               70|
|              13|               70|
|              14|               20|
+----------------+-----------------+
*/
One file for each partition:
import sqlContext.implicits._
rdd_part.toDF.write.format("com.databricks.spark.csv").save("/tmp/idx_part_test/")
(+1 for "_SUCCESS")
XXX$ ls /tmp/idx_part_test/ | wc -l
16
I have a Dataset as below,
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of col1, and my resulting dataset should have the order 1, 4, 5, 3, 2 (the order in which these values appear in the initial dataset). But the order is getting shuffled. Is there any way to maintain the same order as the initial dataset? Any suggestion in Spark/SQL would be fine.
This dataset can be created with the sequence below in Spark:
df = sqlCtx.createDataFrame(
    [(1, 'a', 355.27), (4, 'b', 222.98), (5, 'c', 275.00), (3, 'a', 25.00),
     (2, 'c', 753.37)], ('col1', 'col2', 'cost'))
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for .toDF on a local Seq

val df = Seq(1, 4, 4, 5, 2)
  .toDF("a")
  .withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
Note that to do a distinct on one specific column you can use dropDuplicates; if you want to control which row to keep in case of duplicates, use groupBy, as sketched below.
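For instance, keeping the row with the smallest index per group makes that choice explicit. A minimal sketch (shown with the Java API, assuming a Dataset<Row> df with the same a/id columns as the toy example above):
import static org.apache.spark.sql.functions.min;

// Sketch: for each value of "a", keep the row with the smallest index, then restore the original order.
Dataset<Row> firstPerGroup = df
    .groupBy("a")
    .agg(min("id").alias("id"))
    .orderBy("id");
firstPerGroup.show();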
Assuming you are trying to remove the duplicates in col2 (as there are none in col1), so that the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then do all the transformations you want, and then drop it, like in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
This means doing the following:
Step 1: load the data and stuff:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;

import static org.apache.spark.sql.functions.monotonically_increasing_id;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * Keeping the order of rows during transformations.
 *
 * @author jgp
 */
public class KeepingOrderApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    KeepingOrderApp app = new KeepingOrderApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Splitting a dataframe to collect it")
        .master("local")
        .getOrCreate();

    Dataset<Row> df = createDataframe(spark);
    df.show();

    df = df.withColumn("__idx", monotonically_increasing_id());
    df.show();

    df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
    df.show();
  }

  private static Dataset<Row> createDataframe(SparkSession spark) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("col1", DataTypes.IntegerType, false),
        DataTypes.createStructField("col2", DataTypes.StringType, false),
        DataTypes.createStructField("sum", DataTypes.DoubleType, false) });

    List<Row> rows = new ArrayList<>();
    rows.add(RowFactory.create(1, "a", 3555204326.27));
    rows.add(RowFactory.create(4, "b", 22273491.72));
    rows.add(RowFactory.create(5, "c", 219175.0));
    rows.add(RowFactory.create(3, "a", 219175.0));
    rows.add(RowFactory.create(2, "c", 75341433.37));
    return spark.createDataFrame(rows, schema);
  }
}
You could add an index column to your DB and then add an ORDER BY id to your SQL query.
I believe you need to rewrite your query and use GROUP BY instead of DISTINCT, as this answer suggests: SQL: How to keep rows order with DISTINCT?
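A rough illustration of that GROUP BY idea with Spark SQL, as a sketch only (it reuses the hypothetical __idx index column from the earlier answers; first() picks an arbitrary row per group):
// Sketch: de-duplicate on col2 via GROUP BY and order groups by their first appearance.
df.createOrReplaceTempView("t");
Dataset<Row> result = spark.sql(
    "SELECT first(col1) AS col1, col2, first(`sum`) AS `sum`, min(__idx) AS __idx "
        + "FROM t GROUP BY col2 "
        + "ORDER BY __idx")
    .drop("__idx");
result.show();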
I'm new to the Spark Java API. My dataset contains two columns (account, Lib). I want to display the accounts having different Libs. In fact my dataset looks something like this.
ds1
+---------+------------+
| account| Lib |
+---------+------------+
| 222222 | bbbb |
| 222222 | bbbb |
| 222222 | bbbb |
| | |
| 333333 | aaaa |
| 333333 | bbbb |
| 333333 | cccc |
| | |
| 444444 | dddd |
| 444444 | dddd |
| 444444 | dddd |
| | |
| 555555 | vvvv |
| 555555 | hhhh |
| 555555 | vvvv |
I want to get ds2 like this:
+---------+------------+
| account| Lib |
+---------+------------+
| | |
| 333333 | aaaa |
| 333333 | bbbb |
| 333333 | cccc |
| | |
| 555555 | vvvv |
| 555555 | hhhh |
If groups are small you can use window functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
df
.withColumn("cnt", approx_count_distinct("Lib").over(Window.partitionBy("account")).alias("cnt"))
.where(col("cnt") > 1)
If groups are large:
df.join(
df
.groupBy("account")
.agg(countDistinct("Lib").alias("cnt")).where(col("cnt") > 1),
Seq("account"),
"leftsemi"
)
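Since the question mentions the Java API, a rough Java equivalent of the window-function variant might be (a sketch, untested; ds1 is assumed to be the Dataset<Row> from the question):
import static org.apache.spark.sql.functions.approx_count_distinct;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.expressions.Window;

Dataset<Row> result = ds1
    .withColumn("cnt",
        approx_count_distinct("Lib").over(Window.partitionBy("account")))  // distinct Libs per account
    .where(col("cnt").gt(1))  // keep only accounts with more than one distinct Lib
    .drop("cnt");
result.show();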
I have a huge .csv file which has several columns, but the columns of importance to me are USER_ID (user identifier), DURATION (duration of the call), TYPE (incoming or outgoing), DATE, and NUMBER (mobile number).
So what I am trying to do is: replace all null values in the DURATION column with the average duration of all the calls of the same type by the same user (i.e. with the same USER_ID).
I have found the average as follows.
In the query below I am computing the average duration of all the calls of the same type by the same user.
Dataset<Row> filteredData = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull()).and(col(DATE).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .groupBy(col(USER_ID), col(TYPE), col(NORMALIZE_NUMBER))
/*3*/ .agg(sum(DURATION).alias(DURATION_IN_MIN).divide(count(col(USER_ID))));
filteredData.show() gives:
|USER_ID |type |normalized_number|(sum(duration) AS `durationInMin` / count(USER_ID))|
+--------------------------------+--------+-----------------+---------------------------------------------------+
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+435657456354 |0.0 |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+876454354353 |48.6 |
|8a8a8a8a592b4ace01595e099764000c|INCOMING|+132445686765 |15.0 |
|8a8a8a8a592b4ace01592b4ff4b90000|INCOMING|+097645634324 |74.16666666666667 |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+134435657656 |15.0 |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+135879878543 |31.0 |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+768435245243 |11.0 |
|8a8a8a8a592b4ace01592cd8fd160003|INCOMING|+787685534523 |0.0 |
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+098976865745 |61.5 |
|8a8a8a8a592b4ace01592b4ff4b90000|OUTGOING|+123456787644 |43.333333333333336 |
In the query below I am filtering the data and replacing all the null occurrences with 0 in step 2.
Dataset<Row> filteredData2 = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull())
.and(col(DATE).gt(0)).and(col(DURATION).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .withColumn(DURATION, when(col(DURATION).isNull(), 0).otherwise(col(DURATION).cast(LONG)))
/*3*/ .withColumn(DATE, col(DATE).cast(LONG).minus(col(DATE).cast(LONG).mod(ROUND_ONE_MIN)).cast(LONG))
/*4*/ .groupBy(col(USER_ID), col(DURATION), col(TYPE), col(DATE), col(NORMALIZE_NUMBER))
/*5*/ .agg(sum(DURATION).alias(DURATION_IN_MIN))
/*6*/ .withColumn(DAY_TIME, lit(""))
/*7*/ .withColumn(WEEK_DAY, lit(""))
/*8*/ .withColumn(HOUR_OF_DAY, lit(0));
filteredData2.show() gives:
|USER_ID |duration|type |date |normalized_number|durationInMin|DAY_TIME|WEEK_DAY|HourOfDay|
+--------------------------------+--------+--------+-------------+-----------------+-------------+--------+--------+---------+
|8a8a8a8a592b4ace01595e70dcbd0016|25 |INCOMING|1479017220000|+465435534353 |25 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|29 |INCOMING|1482562560000|+545765765775 |29 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|75 |OUTGOING|1483363980000|+124435665755 |75 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|34 |OUTGOING|1483261920000|+098865563645 |34 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|22 |OUTGOING|1481712180000|+232434656765 |22 | | |0 |
|8a8a8a8a592b4ace0159366a56290005|64 |OUTGOING|1482984060000|+875634521325 |64 | | |0 |
|8a8a8a8a592b4ace0159366a56290005|179 |OUTGOING|1482825060000|+876542543554 |179 | | |0 |
|8a8a8a8a592b4ace01595e65901b0013|12 |OUTGOING|1482393360000|+098634563456 |12 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|14 |OUTGOING|1482820860000|+1344365i8787 |14 | | |0 |
|8a8a8a8a592b4ace01592b4ff4b90000|105 |INCOMING|1478772240000|+234326886784 |105 | | |0 |
|8a8a8a8a592b4ace01592b4ff4b90000|453 |OUTGOING|1480944480000|+134435676578 |453 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|42 |OUTGOING|1483193100000|+413247687686 |42 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|41 |OUTGOING|1481696820000|+134345435645 |41 | | |0 |
Please help me to combine these two, or use them to get the required result. I am new to Spark and Spark SQL.
Thanks.
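A minimal sketch of the combination the question describes: compute the average per USER_ID and TYPE, join it back onto the call rows, and fill the null durations from the average (the avgDuration alias is made up; the column constants are the question's own):
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

// Average duration per user and call type (avg() ignores null durations).
Dataset<Row> averages = callLogsDataSet
    .groupBy(col(USER_ID), col(TYPE))
    .agg(avg(DURATION).alias("avgDuration"));

// Join the averages back and replace null durations with the matching average.
Dataset<Row> filled = callLogsDataSet
    .join(averages,
        callLogsDataSet.col(USER_ID).equalTo(averages.col(USER_ID))
            .and(callLogsDataSet.col(TYPE).equalTo(averages.col(TYPE))),
        "left")
    .withColumn(DURATION,
        when(callLogsDataSet.col(DURATION).isNull(), col("avgDuration"))
            .otherwise(callLogsDataSet.col(DURATION)))
    .drop(averages.col(USER_ID))
    .drop(averages.col(TYPE))
    .drop("avgDuration");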
import java.util.ArrayList;
import java.util.List;

public class Tree
{
    private Board board;
    private List<Tree> children;
    private Tree parent;

    public Tree(Board board1)
    {
        this.board = board1;
        this.children = new ArrayList<Tree>();
    }

    public Tree(Tree t1)
    {
    }

    public Tree createTree(Tree tree, boolean isHuman, int depth)
    {
        Player play1 = new Player();
        ArrayList<Board> potBoards = new ArrayList<Board>(play1.potentialMoves(tree.board, isHuman));
        if (board.gameEnd() || depth == 0)
        {
            return null;
        }
        //Tree oldTree = new Tree(board);
        for (int i = 0; i < potBoards.size() - 1; i++)
        {
            Tree newTree = new Tree(potBoards.get(i));
            createTree(newTree, !isHuman, depth - 1);
            tree.addChild(newTree);
        }
        return tree;
    }

    private Tree addChild(Tree child)
    {
        Tree childNode = new Tree(child);
        childNode.parent = this;
        this.children.add(childNode);
        return childNode;
    }
}
Hi there. I'm trying to make a game tree that will be handled by minimax in the future. I think the error happened either in the addChild function or in potentialMoves. potentialMoves returns all the potential moves a player or the computer can make. For example, in Othello a player can go to any of the following
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | |b| | | | |
+-+-+-+-+-+-+-+-+
3| | | |b|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|w| | | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | |b|b|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|w| | | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | | |w|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|b|b| | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | | |w|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|b| | | |
+-+-+-+-+-+-+-+-+
5| | | | |b| | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
for the first turn. The potentialMoves method does not permanently change the board that is being played on; it returns an ArrayList.
I have this in my main:
Tree gameTree = new Tree(boardOthello);
Tree pickTree = gameTree.createTree(gameTree, true, 2);
Does the addChild() function look ok or is there something else I'm missing in my code?