ELKI DBSCAN: How to set dbc.parser? - java

I am doing DBSCAN clustering, and besides latitude and longitude I have one more column that I want to see alongside the cluster results. For example, the data looks like this:
28.6029445 77.3443552 1
28.6029511 77.3443573 2
28.6029436 77.3443458 3
28.6029011 77.3443032 4
28.6028967 77.3443042 5
28.6029087 77.3442829 6
28.6029132 77.3442797 7
Now in the MiniGUI, if I set parser.labelindices to 2 and run the task, the output looks like this:
# Cluster: Cluster 0
ID=63222 28.6031295 77.3407848 441
ID=63225 28.603134 77.3407744 444
ID=63220 28.6031566667 77.3407816667 439
ID=63226 28.6030819 77.3407605 445
ID=63221 28.6032 77.3407616667 440
ID=63228 28.603085 77.34071 447
ID=63215 28.60318 77.3408583333 434
ID=63229 28.6030751 77.3407096 448
So it is still connected to the 3rd column, which I passed as a label. I have checked the clustering result by passing just latitude and longitude, and it is exactly the same. So by passing a column as a 'label' I can retrieve that column alongside latitude and longitude in the cluster results.
Now I want to use this in my Java code:
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(
    FileBasedDatabaseConnection.Parameterizer.INPUT_ID,
    fileLocation);
params.addParameter(
    NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
    2);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
    RStarTreeFactory.class);
But this is giving a NullPointerException. In the MiniGUI, dbc.parser is NumberVectorLabelParser by default, so this should work fine. What am I missing?

I will have a look at the NPE; it should produce a more helpful error message instead.
Most likely, the problem is that this parameter is of type List<Integer>, i.e. you would need to pass a list. Alternatively, you can pass a String, which will be parsed. The following should work just fine:
params.addParameter(
    NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
    "2");
Note that the text writer might (I have not checked this) print labels as-is. So you cannot take the output as an indication that it considered your data set to be 3-dimensional.
The debugging handler -resulthandler LogResultStructureResultHandler -verbose should give you type output:
java -jar elki.jar KDDCLIApplication -dbc.in dbpedia.gz \
-algorithm NullAlgorithm \
-resulthandler LogResultStructureResultHandler -verbose
This should yield output like the following:
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 1941 ms
de.lmu.ifi.dbs.elki.algorithm.NullAlgorithm.runtime: 0 ms
BasicResult: Algorithm Step (main)
StaticArrayDatabase: Database (database)
DBIDView: Database IDs (DBID)
MaterializedRelation: DoubleVector,dim=2 (relation)
MaterializedRelation: LabelList (relation)
SettingsResult: Settings (settings)
In this case, my data set is coordinates from Wikipedia, along with a name for each. I have a 2-dimensional DoubleVector relation, and a LabelList relation storing the object names.
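To get the label column back next to the coordinates in Java, a minimal sketch along these lines should work; this assumes the ELKI 0.6.x API and the params object from the question (with the label-indices fix applied) — exact package layout and TypeUtil constants may differ between versions:

// Assumption: ELKI 0.6.x package layout.
import de.lmu.ifi.dbs.elki.data.DoubleVector;
import de.lmu.ifi.dbs.elki.data.LabelList;
import de.lmu.ifi.dbs.elki.data.type.TypeUtil;
import de.lmu.ifi.dbs.elki.database.Database;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.database.ids.DBIDIter;
import de.lmu.ifi.dbs.elki.database.relation.Relation;
import de.lmu.ifi.dbs.elki.utilities.ClassGenericsUtil;

// Build and initialize the database from the parameterization above.
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
// The parser produces two relations: the 2d vectors and the labels.
Relation<DoubleVector> coords = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<LabelList> labels = db.getRelation(TypeUtil.LABELLIST);
for (DBIDIter it = coords.iterDBIDs(); it.valid(); it.advance()) {
  System.out.println(coords.get(it) + " " + labels.get(it));
}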

Related

explode a Spark array column to multiple columns (Spark SQL)

I have a column whose type, Value, is defined like below:
val Value: ArrayType = ArrayType(
  new StructType()
    .add("unit", StringType)
    .add("value", StringType)
)
and data like this
[[unit1, 25], [unit2, 77]]
[[unit2, 100], [unit1, 40]]
[[unit2, 88]]
[[unit1, 33]]
I know Spark SQL can use functions.explode to turn the data into multiple rows, but what I want is to explode it into multiple columns (or still one column, but with 2 items, for rows that have only 1 item).
so the end result looks like below
unit1 unit2
25 77
40 100
value1 88
33 value2
How could I achieve this?
Addition after the initial post and update:
I want to get result like this (this is more like my final goal).
transformed-column
[[unit1, 25], [unit2, 77]]
[[unit2, 104], [unit1, 40]]
[[unit1, value1], [unit2, 88]]
[[unit1, 33],[unit2,value2]]
where value1 is the result of applying some kind of map/conversion function using the [unit2, 88]. Similarly, value2 is the result of applying the same map/conversion function using the [unit1, 33].
I solved this problem using map_from_entries as suggested by @jxc, and then used a UDF to convert the map of 1 item to a map of 2 items, using business logic to convert between the 2 units.
One thing to note is that the map returned from map_from_entries is a Scala map, so if you use Java, you need to make sure the UDF method takes a Scala map instead.
PS: maybe I did not have to use map_from_entries; instead I could perhaps have made the UDF take an array of StructType.
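For the column-pivot variant in the original question, a minimal sketch of the map_from_entries approach in the Java API (assumptions: Spark 2.4+, a Dataset<Row> named df, and the array column named "Value"):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Turn [[unit1, 25], [unit2, 77]] into a map {unit1 -> 25, unit2 -> 77},
// then look up each unit; missing units come back as null.
Dataset<Row> wide = df
    .withColumn("m", map_from_entries(col("Value")))
    .select(col("m").getItem("unit1").alias("unit1"),
            col("m").getItem("unit2").alias("unit2"));
wide.show();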

Java Mallet LDA keyword distributions

I have used the Java Mallet API for topic modelling with LDA. The API produces the following results:
topic : keyword1 (count), keyword2 (count)
For example
topic 0 : file (12423), test (3123) ...
topic 1 : class (2415), test (314) ...
Is it right that for topic 0, P(file) = 12423/(12423+3123+...) and P(test) = 3123/(12423+3123+...)?
That's one way to evaluate probabilities. You can also add a smoothing parameter (usually 0.01) to each value, and add 0.01 times the size of the vocabulary to the denominator to make it add up to 1.0.
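For concreteness, a small sketch of that smoothed estimate (assumptions: beta = 0.01 as the usual smoothing value, a made-up vocabulary size; the counts are the ones from the question):

// Smoothed topic-word probability: (count + beta) / (topicTotal + beta * V)
double beta = 0.01;            // smoothing parameter
long vocabSize = 50_000;       // V: vocabulary size (placeholder value)
long[] counts = {12423, 3123}; // per-word counts for topic 0, from the question
long topicTotal = 0;
for (long c : counts) topicTotal += c;
for (long c : counts) {
    System.out.println((c + beta) / (topicTotal + beta * vocabSize));
}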

#user_script:1: WRONGTYPE Operation against a key holding the wrong kind of value

Following is my Lua script:
if redis.call('sismember',KEYS[1],ARGV[1])==1
then redis.call('srem',KEYS[1],ARGV[1])
else return 0
end
store = tonumber(redis.call('hget',KEYS[2],'capacity'))
store = store + 1
redis.call('hset',KEYS[2],'capacity',store)
return 1
When I run this script in Java, an exception like
#user_script:1: WRONGTYPE Operation against a key holding the wrong kind of value
is thrown. The Java code is like:
Object obj = jedis.evalsha(sha, 2, userName.getBytes(),
    id.getBytes(), id.getBytes());
where userName is "tau" and id is "002" in my code,
and I test the type of "tau" and "002" as follows,
127.0.0.1:6379> type tau
set
127.0.0.1:6379> type 002
hash
and the contents of them are:
127.0.0.1:6379> hgetall 002
name
"鏁版嵁搴撲粠鍒犲簱鍒拌窇璺?
teacher
"taochq"
capacity
54
127.0.0.1:6379> smembers tau
002
004
001
127.0.0.1:6379>
Now I'm so confused and don't know what's wrong; any help would be appreciated.
The error is quite verbose - you're trying to perform an operation on a key of the wrong type.
Run MONITOR alongside and then run your script - you'll be able to spot the error easily.
Try your script as:
EVAL "if redis.call('sismember',KEYS[1],ARGV[1])==1 \n then redis.call('srem',KEYS[1],ARGV[1]) \n else return 0 \n end \n local store = tonumber(redis.call('hget',KEYS[2],'capacity')) \n store = store + 1 \n redis.call('hset',KEYS[2],'capacity',store) \n return 1" 2 tau 002 002
You'll see if it works. Most likely, userName.getBytes() and id.getBytes() are not returning what you expect. Use MONITOR as Itamar suggests to see what's actually reaching the server.
You'll then get a different issue: Script attempted to create global variable 'store'. Add local to the 5th line:
local store = tonumber(redis.call('hget',KEYS[2],'capacity'))
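For reference, a minimal Java sketch that loads the corrected script and calls it with the String overloads, which sidesteps any getBytes() encoding surprises (assumption: plain Jedis and the same key layout as above):

import redis.clients.jedis.Jedis;

Jedis jedis = new Jedis("127.0.0.1", 6379);
String script =
    "if redis.call('sismember',KEYS[1],ARGV[1])==1\n" +
    "then redis.call('srem',KEYS[1],ARGV[1])\n" +
    "else return 0\n" +
    "end\n" +
    "local store = tonumber(redis.call('hget',KEYS[2],'capacity'))\n" +
    "store = store + 1\n" +
    "redis.call('hset',KEYS[2],'capacity',store)\n" +
    "return 1";
String sha = jedis.scriptLoad(script);                      // cache the script, get its SHA1
Object result = jedis.evalsha(sha, 2, "tau", "002", "002"); // KEYS[1]=tau, KEYS[2]=002, ARGV[1]=002
System.out.println(result);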

FOREACH in cypher - neo4j

I am very new to the Cypher query language and I am working on relationships between nodes.
I have a CSV file of a table containing multiple columns and 1000 rows.
Template of my table is :
cdrType ANUMBER BNUMBER DURATION
2 123 456 10
2 890 456 5
2 123 666 2
2 123 709 7
2 345 789 20
I have used these commands to create nodes and property keys.
LOAD CSV WITH HEADERS FROM "file:///2.csv" AS ROW
CREATE (:ANUMBER {aNumber:ROW.aNumber})
CREATE (:BNUMBER {bNumber:ROW.bNumber})
Now I need to create relationships between all rows in the table, and I think a FOREACH loop is best in my case. I wrote this query, but it gives me an error. The query is:
MATCH (a:ANUMBER),(b:BNUMBER)
FOREACH(i in RANGE(0, length(ANUMBER)) |
CREATE UNIQUE (ANUMBER[i])-[s:CALLED]->(BNUMBER[i]))
and the error is :
Invalid input '[': expected an identifier character, whitespace,
NodeLabel, a property map, ')' or a relationship pattern (line 3,
column 29 (offset: 100)) " CREATE UNIQUE
(a:ANUMBER[i])-[s:CALLED]->(b:BNUMBER[i]))"
I need a relationship for every row, like in my case: 123 -CALLED-> 456, 890 -CALLED-> 456. I need a visual representation of this calling data, showing which number called which one. For this I need to create a relationship for every row.
Anyone have an idea how to solve this?
What about:
LOAD CSV WITH HEADERS FROM "file:///2.csv" AS ROW
CREATE (a:ANUMBER {aNumber:ROW.aNumber} )
CREATE (b:BNUMBER {bNumber:ROW.bNumber} )
MERGE (a)-[:CALLED]->(b);
It's not more complex than that, IMO.
Hope this helps!
Regards,
Tom
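To run that load from Java rather than the Neo4j browser, here is a minimal sketch using the official Neo4j Java driver (assumptions: driver 4.x, a local bolt endpoint, and placeholder credentials):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class LoadCalls {
    public static void main(String[] args) {
        // Placeholder URI and credentials; adjust for your setup.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            session.run("LOAD CSV WITH HEADERS FROM \"file:///2.csv\" AS ROW "
                + "CREATE (a:ANUMBER {aNumber:ROW.aNumber}) "
                + "CREATE (b:BNUMBER {bNumber:ROW.bNumber}) "
                + "MERGE (a)-[:CALLED]->(b)");
        }
    }
}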

Inconsistent counter values between replicas in Cassandra

I've got a 3-machine Cassandra cluster using the rack-unaware placement strategy with a replication factor of 2.
The column family is defined as follows:
create column family UserGeneralStats with comparator = UTF8Type and default_validation_class = CounterColumnType;
Unfortunately after a few days of production use I got some inconsistent values for the counters:
Query on replica 1:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=96545030198)
=> (counter=downloads, value=1013)
=> (counter=previews, value=10304)
Query on replica 2:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=9140386229)
=> (counter=downloads, value=339)
=> (counter=previews, value=1321)
As the standard read repair mechanism doesn't seem to repair the values, I tried to force an anti-entropy repair using nodetool repair. It didn't have any effect on the counter values.
Data inspection showed that the lower values for the counters are the correct ones, so I suspect that either Cassandra or Hector (which I used as the API to call Cassandra from Java) retried some increments.
Any ideas how to repair the data and possibly prevent the situation from happening again?
If neither read repair nor nodetool repair fixes it, it's probably a bug.
Please upgrade to 0.8.3 (out today) and verify it's still present in that version, then you can file a ticket at https://issues.apache.org/jira/browse/CASSANDRA.
