Java Mallet LDA keyword distributions

I have used the Java MALLET API for topic modelling with LDA. The API produces the following results:
topic : keyword1 (count), keyword2 (count)
For example
topic 0 : file (12423), test (3123) ...
topic 1 : class (2415), test (314) ...
Is it right that, for topic 0, P(file) = 12423 / (12423 + 3123 + ...) and P(test) = 3123 / (12423 + 3123 + ...)?

That's one way to estimate the probabilities. You can also add a smoothing parameter (usually 0.01) to each count, and add 0.01 times the size of the vocabulary to the denominator so that the probabilities still add up to 1.0.
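As a rough illustration of the smoothed estimate described above, here is a minimal Python sketch. The function name, the `beta` parameter name and the toy vocabulary size of 2 are assumptions made for the example; they are not part of the MALLET API.

```python
def topic_word_probabilities(counts, vocab_size, beta=0.01):
    """Turn the raw word counts of ONE topic into smoothed probabilities.

    counts:     dict mapping word -> count for the topic
    vocab_size: number of distinct words in the corpus (assumed known)
    beta:       smoothing value added to every word's count
    """
    # Every vocabulary word gets +beta, so the denominator grows by
    # beta * vocab_size and the probabilities over the vocabulary sum to 1.
    denominator = sum(counts.values()) + beta * vocab_size
    return {word: (count + beta) / denominator
            for word, count in counts.items()}


# Toy input: only the two keywords shown in the question, and a pretend
# vocabulary of exactly those two words (an assumption for the example).
topic0 = {"file": 12423, "test": 3123}
smoothed = topic_word_probabilities(topic0, vocab_size=2)
unsmoothed = topic_word_probabilities(topic0, vocab_size=2, beta=0.0)
```

With `beta=0.0` this reduces to the plain count / total ratio from the question. With a real vocabulary, words missing from `counts` would each receive `beta / denominator`.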

Java Micrometer - What to do with metrics of type *_bucket

Quick question regarding metrics of type *_bucket please.
My application generates metrics, like those below:
# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.005592405",} 273.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.006990506",} 797.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.008388607",} 2638.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.009786708",} 3543.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.011184809",} 3932.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.01258291",} 4154.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.013981011",} 4279.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.015379112",} 4380.0
and
# HELP resilience4j_circuitbreaker_calls_seconds Total number of successful calls
# TYPE resilience4j_circuitbreaker_calls_seconds histogram
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001048576",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001398101",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001747626",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002097151",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002446676",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002796201",} 0.0
I believe they are really useful, but unfortunately, I do not know what to do with them.
I tried some queries such as rate(http_server_requests_seconds{_bucket_=\"+Inf\", status=~\"2..\"}[5m]), but they do not seem to bring out anything valuable.
May I ask what is the proper way to use those metrics of type *_bucket, for instance, how to build Grafana dashboards and visuals that are the best suited for those *_bucket please?
Thank you
You can find the 99th or 95th percentile of the latency of a given endpoint from this metric with the histogram_quantile function.
For example, for the 99th percentile:
histogram_quantile(
  0.99,
  sum(
    rate(http_server_requests_seconds_bucket{exception="None", uri="/your-uri"}[5m])
  ) by (le)
)
For the 95th percentile:
histogram_quantile(
  0.95,
  sum(
    rate(http_server_requests_seconds_bucket{exception="None", uri="/your-uri"}[5m])
  ) by (le)
)
More on it, in a nice snippet from this reference:
https://idanlupinsky.com/blog/application-monitoring-with-micrometer-prometheus-grafana-and-cloudwatch/
The histogram is a collection of buckets (or counters), each maintaining the number of events observed that took up to duration specified by the le tag. Let's have a look at a part of the histogram as published by our demo application:
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.067108864",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.089478485",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.111848106",} 92382.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.134217727",} 99050.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.156587348",} 99703.0
...
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.984263336",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="1.0",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="+Inf",} 100000.0
The second line in the listing above indicates there were no requests observed that took up to ~89ms (specified by the le tag). This is expected given the 100ms sleep time when processing requests. Line #3 shows that 92,382 requests were observed whose duration took up to ~111ms. Note that the histogram is cumulative and that the entire count of requests falls in the last bucket with no upper limit le="+Inf".
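To make the cumulative nature of the buckets concrete, here is a small Python sketch that turns the cumulative counts from the listing into per-bucket counts and estimates a quantile by linear interpolation inside the bucket that contains the target rank. It is a simplified illustration of the idea behind histogram_quantile, not the Prometheus implementation (which works on rates over a time window). The function names are made up for the example, and the buckets hidden behind "..." in the listing are simply left out.

```python
import math

# Cumulative (le, count) pairs copied from the listing above.
CUMULATIVE = [
    (0.067108864, 0.0),
    (0.089478485, 0.0),
    (0.111848106, 92382.0),
    (0.134217727, 99050.0),
    (0.156587348, 99703.0),
    (0.984263336, 99987.0),
    (1.0, 99987.0),
    (math.inf, 100000.0),
]


def bucket_counts(cumulative):
    """Convert cumulative counts into the count that falls in each bucket."""
    result = []
    previous = 0.0
    for upper, count in cumulative:
        result.append((upper, count - previous))
        previous = count
    return result


def estimate_quantile(q, cumulative):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets."""
    total = cumulative[-1][1]
    if not 0 < q <= 1 or total <= 0:
        raise ValueError("need 0 < q <= 1 and at least one observation")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in cumulative:
        if count >= rank:
            if math.isinf(upper):
                # No upper limit to interpolate towards: fall back to the
                # highest finite bucket boundary.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return lower_bound


per_bucket = bucket_counts(CUMULATIVE)
p50 = estimate_quantile(0.50, CUMULATIVE)
p99 = estimate_quantile(0.99, CUMULATIVE)
```

Here both `p50` and `p99` land between roughly 89 ms and 134 ms, which matches the 100 ms sleep mentioned in the quoted text.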

DFS performance of Spark GraphX vs simple Java DFS implementation

Considering a graph with 14,000 vertices and 14,000 edges, I wonder why GraphX takes so much more time than a plain Java graph implementation to get all the paths from a vertex to the leaf?
The Java implementation: a few seconds
The GraphX implementation: several minutes
Is Spark GraphX really suitable for this kind of processing?
My system:
i5-7500 @ 3.40GHz,
8GB RAM
The Pregel algorithm:
val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)
val sssp = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)
The same thing happened to me when implementing some algorithms on GraphX. I believe GraphX is well suited to a distributed environment where big graphs are split across multiple machines.
Since you say that you use a single node, have you checked the number of workers used? The number of executors? The amount of memory used by each executor? These configuration parameters definitely play an important role in the performance of your GraphX application.

How to train Matrix Factorization Model in Apache Spark MLlib's ALS Using Training, Test and Validation datasets

I want to use Apache Spark's ALS machine learning algorithm. I found that the best model should be selected to get the best results. I have split the data into three sets, Training, Validation and Test, as suggested on forums.
I've found the following code sample to train a model on these sets.
val ranks = List(8, 12)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
val model = ALS.train(training, rank, numIter, lambda)
val validationRmse = computeRmse(model, validation, numValidation)
if (validationRmse < bestValidationRmse) {
bestModel = Some(model)
bestValidationRmse = validationRmse
bestRank = rank
bestLambda = lambda
bestNumIter = numIter
}
}
val testRmse = computeRmse(bestModel.get, test, numTest)
This code trains a model for each combination of rank, lambda and number of iterations, and compares the RMSE (root mean squared error) on the validation set. These iterations give a better model, which we can say is represented by the (rank, lambda, numIter) combination. But it doesn't do much after that with the test set.
It just computes the RMSE on the test set.
My question is how the model can be further tuned with the test set data.
No, one would never fine tune the model using test data. If you do that, it stops being your test data.
I'd recommend this section of Prof. Andrew Ng's famous course that discusses the model training process: https://www.coursera.org/learn/machine-learning/home/week/6
Depending on your observation of the error values with validation data set, you might want to add/remove features, get more data or make changes in the model, or maybe even try a different algorithm altogether. If the cross-validation and the test rmse look reasonable, then you are done with the model and you could use it for the purpose (some prediction, I would assume) that made you build it in the first place.
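The workflow described here (tune on the validation set, touch the test set once at the end) can be sketched in a few lines of Python. This is a toy illustration, not Spark code: the "model" is just a mean shrunk by a regularization strength, and all names and the synthetic data are assumptions made for the example.

```python
import math
import random


def rmse(prediction, values):
    """Root mean squared error of a constant prediction."""
    return math.sqrt(sum((v - prediction) ** 2 for v in values) / len(values))


def fit_shrunken_mean(values, lam):
    """Toy model: the mean of the data, shrunk toward 0 by lam."""
    return sum(values) / (len(values) + lam)


# Synthetic data, split into training, validation and test sets.
rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(300)]
training, validation, test = data[:200], data[200:250], data[250:]

candidates = [0.0, 1.0, 10.0, 100.0]
best_lam, best_model, best_validation_rmse = None, None, float("inf")
for lam in candidates:
    model = fit_shrunken_mean(training, lam)      # fit on training only
    validation_rmse = rmse(model, validation)     # compare on validation only
    if validation_rmse < best_validation_rmse:
        best_lam, best_model = lam, model
        best_validation_rmse = validation_rmse

# The test set is used exactly once, to report how the chosen model
# generalizes. It never feeds back into the choice of lam.
test_rmse = rmse(best_model, test)
```

If the test RMSE looks unreasonable, the remedy is to go back and change the model or the candidate grid, ideally with a fresh test split, rather than to pick the hyperparameters that score best on the test set.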

ELKI DBSCAN : How to set dbc.parser?

I am doing DBSCAN clustering and I have one more column apart from latitude longitude which I want to see with cluster results. For example data looks like this:
28.6029445 77.3443552 1
28.6029511 77.3443573 2
28.6029436 77.3443458 3
28.6029011 77.3443032 4
28.6028967 77.3443042 5
28.6029087 77.3442829 6
28.6029132 77.3442797 7
Now in the MiniGUI, if I set parser.labelindices to 2 and run the task, the output looks like this:
# Cluster: Cluster 0
ID=63222 28.6031295 77.3407848 441
ID=63225 28.603134 77.3407744 444
ID=63220 28.6031566667 77.3407816667 439
ID=63226 28.6030819 77.3407605 445
ID=63221 28.6032 77.3407616667 440
ID=63228 28.603085 77.34071 447
ID=63215 28.60318 77.3408583333 434
ID=63229 28.6030751 77.3407096 448
So the output still carries the 3rd column, which I passed as a label. I have checked the clustering result by passing just latitude and longitude, and it is exactly the same. So, in a way, by passing a column as a 'label' I can retrieve that column along with the lat/long in the cluster results.
Now I want to use this in my java code
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(
FileBasedDatabaseConnection.Parameterizer.INPUT_ID,
fileLocation);
params.addParameter(
NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
2);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
RStarTreeFactory.class);
But this is giving a NullPointerException. In MiniGui dbc.parser is NumberVectorLabelParser by default. So this should work fine. What am I missing?
I will have a look into the NPE, it should return a more helpful error message instead.
Most likely, the problem is that this parameter is of type List<Integer>, i.e. you would need to pass a list. Alternatively, you can pass a String, which will be parsed. The following should work just fine:
params.addParameter(
NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
"2");
Note that the text writer might (I have not checked this) print labels as is. So you cannot take the output as indication that it considered your data set to be 3 dimensional.
The debugging result handler -resulthandler LogResultStructureResultHandler -verbose should show you the relation types. For example,
java -jar elki.jar KDDCLIApplication -dbc.in dbpedia.gz \
-algorithm NullAlgorithm \
-resulthandler LogResultStructureResultHandler -verbose
should yield an output like this:
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 1941 ms
de.lmu.ifi.dbs.elki.algorithm.NullAlgorithm.runtime: 0 ms
BasicResult: Algorithm Step (main)
StaticArrayDatabase: Database (database)
DBIDView: Database IDs (DBID)
MaterializedRelation: DoubleVector,dim=2 (relation)
MaterializedRelation: LabelList (relation)
SettingsResult: Settings (settings)
In this case, my data set are coordinates from Wikipedia, along with a name each. I have a 2 dimensional DoubleVector relation, and a LabelList relation storing the object names.

How could a distributed queue-like-thing be implemented on top of a RDBMS or NOSQL datastore or other messaging system (e.g., rabbitmq)?

From the wouldn't-it-be-cool-if category of questions ...
By "queue-like-thing" I mean supports the following operations:
append(entry:Entry) - add entry to tail of queue
take(): Entry - remove entry from head of queue and return it
promote(entry_id) - move the entry one position closer to the head; the entry that currently occupies that position moves into the vacated position
demote(entry_id) - the opposite of promote(entry_id)
Optional operations would be something like:
promote(entry_id, amount) - like promote(entry_id) except you specify the number of positions
demote(entry_id, amount) - opposite of promote(entry_id, amount)
of course, if we allow amount to be positive or negative, we can consolidate the promote/demote methods with a single move(entry_id, amount) method
It would be ideal if the following operations could be performed on the queue in a distributed fashion (multiple clients interacting with the queue):
queue = ...
queue.append( a )
queue.append( b )
queue.append( c )
print queue
"a b c"
queue.promote( b.id )
print queue
"b a c"
queue.demote( a.id )
print queue
"b c a"
x = queue.take()
print x
"b"
print queue
"c a"
Are there any data stores that are particularly apt for this use case? The queue should always be in a consistent state even if multiple users are modifying the queue simultaneously.
If it weren't for the promote/demote/move requirement, there wouldn't be much of a problem.
Edit:
Bonus points if there are Java and/or Python libraries to accomplish the task outlined above.
Solution should scale extremely well.
Redis supports lists and ordered sets: http://redis.io/topics/data-types#lists
It also supports transactions and publish/subscribe messaging. So, yes, I would say this can be easily done on Redis.
Update: In fact, about 80% of it has been done many times: http://www.google.co.uk/search?q=python+redis+queue
Several of those hits could be extended to add what you want. You would have to use transactions to implement the promote/demote operations.
It might be possible to use Lua on the server side to create that functionality, rather than having it in client code. Alternatively, you could create a thin wrapper around Redis on the server that implements just the operations you want.
Python: "Batteries Included"
Rather than looking to a data store like RabbitMQ, Redis, or an RDBMS, I think python and a couple libraries have more than enough to solve this problem. Some may complain that this do-it-yourself approach is re-inventing the wheel but I prefer running a hundred lines of python code over managing another data store.
Implementing a Priority Queue
The operations that you define: append, take, promote, and demote, describe a priority queue. Unfortunately python doesn't have a built-in priority queue data type. But it does have a heap library called heapq and priority queues are often implemented as heaps. Here's my implementation of a priority queue meeting your requirements:
import heapq

class PQueue:
    """
    Implements a priority queue with append, take, promote, and demote
    operations.
    """
    def __init__(self):
        """
        Initialize empty priority queue.
        self.toll is max(priority) and max(rowid) in the queue
        self.heap is the heap maintained for take command
        self.rows is a mapping from rowid to items
        self.pris is a mapping from priority to items
        """
        self.toll = 0
        self.heap = list()
        self.rows = dict()
        self.pris = dict()
    def append(self, value):
        """
        Append value to our priority queue.
        The new value is added with lowest priority as an item. Items are
        three-element lists consisting of [priority, rowid, value]. The rowid
        is used by the promote/demote commands.
        Returns the new rowid corresponding to the new item.
        """
        self.toll += 1
        item = [self.toll, self.toll, value]
        # The new priority is the largest so far, so a plain list append
        # keeps the heap invariant intact.
        self.heap.append(item)
        self.rows[self.toll] = item
        self.pris[self.toll] = item
        return self.toll
    def take(self):
        """
        Take the highest priority item out of the queue.
        Returns the value of the item.
        """
        item = heapq.heappop(self.heap)
        del self.pris[item[0]]
        del self.rows[item[1]]
        return item[2]
    def promote(self, rowid):
        """
        Promote an item in the queue.
        The promoted item swaps position with the next highest item.
        Returns the number of affected rows.
        """
        if rowid not in self.rows:
            return 0
        item = self.rows[rowid]
        item_pri, item_row, item_val = item
        next_pri = item_pri - 1
        if next_pri in self.pris:
            iota = self.pris[next_pri]
            iota_pri, iota_row, iota_val = iota
            # Swap rowid and value in place; priorities stay where they
            # are, so the heap order is untouched.
            iota[1], iota[2] = item_row, item_val
            item[1], item[2] = iota_row, iota_val
            self.rows[item_row] = iota
            self.rows[iota_row] = item
            return 2
        return 0
The demote command is nearly identical to the promote command so I'll omit it for brevity. Note that this depends only on python's lists, dicts, and heapq library.
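For readers who want to see the omitted half, here is one way demote could mirror promote. This is a sketch written for Python 3 under the assumption that demote is the exact mirror image of promote; it is not the answer author's code. `SwapQueue` and `_swap` are hypothetical names, and the class repeats just enough of the PQueue structure to be runnable on its own.

```python
import heapq


class SwapQueue:
    """Hypothetical, reduced version of the PQueue above."""

    def __init__(self):
        self.toll = 0    # highest priority/rowid handed out so far
        self.heap = []   # heap of [priority, rowid, value] items
        self.rows = {}   # rowid -> item
        self.pris = {}   # priority -> item

    def append(self, value):
        self.toll += 1
        item = [self.toll, self.toll, value]
        heapq.heappush(self.heap, item)
        self.rows[self.toll] = item
        self.pris[self.toll] = item
        return self.toll

    def take(self):
        item = heapq.heappop(self.heap)
        del self.pris[item[0]]
        del self.rows[item[1]]
        return item[2]

    def _swap(self, rowid, step):
        """Swap the entry with its neighbour at priority + step."""
        if rowid not in self.rows:
            return 0
        item = self.rows[rowid]
        other = self.pris.get(item[0] + step)
        if other is None:
            return 0  # already at the head (or tail) of the queue
        # Exchange rowid and value in place; priorities do not move, so
        # the heap invariant is preserved without re-heapifying.
        item[1], other[1] = other[1], item[1]
        item[2], other[2] = other[2], item[2]
        self.rows[item[1]] = item
        self.rows[other[1]] = other
        return 2

    def promote(self, rowid):
        return self._swap(rowid, -1)

    def demote(self, rowid):
        return self._swap(rowid, +1)


# Replays the a/b/c walk-through from the question.
queue = SwapQueue()
a, b, c = queue.append("a"), queue.append("b"), queue.append("c")
promoted = queue.promote(b)   # queue is now: b a c
demoted = queue.demote(a)     # queue is now: b c a
order = [queue.take(), queue.take(), queue.take()]
```

Sharing one `_swap` helper keeps the two commands symmetric, so a fix to one automatically applies to the other.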
Serving our Priority Queue
Now with the PQueue data type, we'd like to allow distributed interactions with an instance. A great library for this is gevent. Though gevent is relatively new and still beta, it's wonderfully fast and well tested. With gevent, we can set up a socket server listening on localhost:4040 pretty easily. Here's my server code:
from gevent.server import StreamServer

pqueue = PQueue()
def pqueue_server(sock, addr):
    # One request per connection: "<command> [argument]"
    text = sock.recv(1024)
    cmds = text.split(' ')
    if cmds[0] == 'append':
        result = pqueue.append(cmds[1])
    elif cmds[0] == 'take':
        result = pqueue.take()
    elif cmds[0] == 'promote':
        result = pqueue.promote(int(cmds[1]))
    elif cmds[0] == 'demote':
        result = pqueue.demote(int(cmds[1]))
    else:
        result = ''
    sock.sendall(str(result))
    print 'Request:', text, '; Response:', str(result)
# args comes from command-line parsing that is not shown here
if args.listen:
    server = StreamServer(('127.0.0.1', 4040), pqueue_server)
    print 'Starting pqueue server on port 4040...'
    server.serve_forever()
Before that runs in production, you'll of course want better error and buffer handling. But it'll work just fine for rapid prototyping. Notice that this doesn't require any locking around the pqueue object. Gevent doesn't actually run code in parallel, it just gives that impression. The drawback is that more cores won't help, but the benefit is lock-free code.
Don't get me wrong, the gevent StreamServer will process multiple requests at the same time. But it switches between answering requests through cooperative multitasking. This means you have to yield the coroutine's time slice. While gevent's socket I/O functions are designed to yield, our pqueue implementation is not. Fortunately, the pqueue completes its tasks really quickly.
A Client Too
While prototyping, I found it useful to have a client as well. It took some googling to write a client so I'll share that code too:
from gevent import socket as gsocket  # assumed import behind the gsocket alias

if args.client:
    while True:
        msg = raw_input('> ')
        sock = gsocket.socket(gsocket.AF_INET, gsocket.SOCK_STREAM)
        sock.connect(('127.0.0.1', 4040))
        sock.sendall(msg)
        text = sock.recv(1024)
        sock.close()
        print text
To use the new data store, first start the server and then start the client. At the client prompt you ought to be able to do:
> append one
1
> append two
2
> append three
3
> promote 2
2
> promote 2
0
> take
two
Scaling Extremely Well
Given your thinking about a data store, it seems you're really concerned with throughput and durability. But "scale extremely well" doesn't quantify your needs. So I decided to benchmark the above with a test function. Here's the test function:
def test():
    import time
    import urllib2
    import subprocess
    import random
    random = random.Random(0)
    from progressbar import ProgressBar, Percentage, Bar, ETA
    widgets = [Percentage(), Bar(), ETA()]
    def make_name():
        alphabet = 'abcdefghijklmnopqrstuvwxyz'
        return ''.join(random.choice(alphabet)
                       for rpt in xrange(random.randrange(3, 20)))
    def make_request(cmds):
        sock = gsocket.socket(gsocket.AF_INET, gsocket.SOCK_STREAM)
        sock.connect(('127.0.0.1', 4040))
        sock.sendall(cmds)
        text = sock.recv(1024)
        sock.close()
    print 'Starting server and waiting 3 seconds.'
    subprocess.call('start cmd.exe /c python.exe queue_thing_gevent.py -l',
                    shell=True)
    time.sleep(3)
    tests = []
    def wrap_test(name, limit=10000):
        # Decorator: registers a timed loop that calls func `limit` times.
        def wrap(func):
            def wrapped():
                progress = ProgressBar(widgets=widgets)
                for rpt in progress(xrange(limit)):
                    func()
                secs = progress.seconds_elapsed
                print '{0} {1} records in {2:.3f} s at {3:.3f} r/s'.format(
                    name, limit, secs, limit / secs)
            tests.append(wrapped)
            return wrapped
        return wrap
    def direct_append():
        name = make_name()
        pqueue.append(name)
    count = 1000000
    @wrap_test('Loaded', count)
    def direct_append_test(): direct_append()
    def append():
        name = make_name()
        make_request('append ' + name)
    @wrap_test('Appended')
    def append_test(): append()
    ...
    print 'Running speed tests.'
    for tst in tests: tst()
Benchmark Results
I ran 6 tests against the server running on my laptop. I think the results scale extremely well. Here's the output:
Starting server and waiting 3 seconds.
Running speed tests.
100%|############################################################|Time: 0:00:21
Loaded 1000000 records in 21.770 s at 45934.773 r/s
100%|############################################################|Time: 0:00:06
Appended 10000 records in 6.825 s at 1465.201 r/s
100%|############################################################|Time: 0:00:06
Promoted 10000 records in 6.270 s at 1594.896 r/s
100%|############################################################|Time: 0:00:05
Demoted 10000 records in 5.686 s at 1758.706 r/s
100%|############################################################|Time: 0:00:05
Took 10000 records in 5.950 s at 1680.672 r/s
100%|############################################################|Time: 0:00:07
Mixed load processed 10000 records in 7.410 s at 1349.528 r/s
Final Frontier: Durability
Finally, durability is the only problem I didn't completely prototype. But I don't think it's that hard either. In our priority queue, the heap (list) of items has all the information we need to persist the data type to disk. Since gevent also lets us spawn functions as greenlets that run cooperatively alongside the server, I imagined using a function like this:
def save_heap(heap, toll):
    name = 'heap-{0}.txt'.format(toll)
    with open(name, 'w') as temp:
        for val in heap:
            temp.write(str(val))
            gevent.sleep(0)
and adding a save function to our priority queue:
def save(self):
    heap_copy = tuple(self.heap)
    toll = self.toll
    gevent.spawn(save_heap, heap_copy, toll)
You could now copy the Redis model of forking and writing the data store to disk every few minutes. If you need even greater durability then couple the above with a system that logs commands to disk. Together, those are the RDB and AOF persistence methods that Redis uses.
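The command-logging half of that idea could look roughly like the Python 3 sketch below: every mutating command is appended to a file, and a fresh queue is rebuilt by replaying the file. `CommandLog` and `ListQueue` are hypothetical names invented for this example (the stand-in queue only supports append and take), and the one-JSON-array-per-line format is an assumption, not how Redis stores its log.

```python
import json
import os
import tempfile
from collections import deque


class ListQueue:
    """Hypothetical stand-in for the priority queue: append and take only."""

    def __init__(self):
        self.items = deque()

    def append(self, value):
        self.items.append(value)

    def take(self):
        return self.items.popleft()


class CommandLog:
    """Append-only log of commands, one JSON array per line."""

    ALLOWED = {"append", "take"}

    def __init__(self, path):
        self.path = path

    def record(self, command, *args):
        if command not in self.ALLOWED:
            raise ValueError("unknown command: %r" % command)
        with open(self.path, "a") as handle:
            handle.write(json.dumps([command] + list(args)) + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the line to disk before returning

    def replay(self, queue):
        """Re-apply every logged command to queue; returns the count."""
        replayed = 0
        if not os.path.exists(self.path):
            return replayed
        with open(self.path) as handle:
            for line in handle:
                command, *args = json.loads(line)
                if command not in self.ALLOWED:
                    raise ValueError("unknown command: %r" % command)
                getattr(queue, command)(*args)
                replayed += 1
        return replayed


# Log three commands, then rebuild the state in a brand-new queue.
log = CommandLog(os.path.join(tempfile.mkdtemp(), "pqueue.log"))
for command in (("append", "one"), ("append", "two"), ("take",)):
    log.record(*command)
restored = ListQueue()
replayed = log.replay(restored)
```

Calling fsync on every command trades throughput for durability. A real system would likely batch writes or truncate the log after each snapshot.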
WebSphere MQ can do almost all of this.
The promote/demote is almost possible, by removing the message from the queue and putting it back on with a higher/lower priority, or, by using the "CORRELID" as a sequence number.
What's wrong with RabbitMQ? It sounds exactly like what you need.
We extensively use Redis as well in our Production environment, but it doesn't have some of the functionality Queues usually have, like setting a task as complete, or re-sending the task if it isn't completed in some TTL. It does, on the other hand, have other features a Queue doesn't have, like it is a generic storage, and it is REALLY fast.
Use Redisson. It implements the familiar List, Queue, BlockingQueue and Deque Java interfaces in a distributed fashion on top of Redis. Example with a Deque:
Redisson redisson = Redisson.create();
RDeque<SomeObject> queue = redisson.getDeque("anyDeque");
queue.addFirst(new SomeObject());
queue.addLast(new SomeObject());
SomeObject obj = queue.removeFirst();
SomeObject someObj = queue.removeLast();
redisson.shutdown();
Other samples:
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#77-list
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#78-queue
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#710-blocking-queue
If you for some reason decide to use an SQL database as a backend, I would not use MySQL as it requires polling (well and would not use it for lots of other reasons), but PostgreSQL supports LISTEN/NOTIFY for signalling other clients so that they do not have to poll for changes. However, it signals all listening clients at once, so you still would require a mechanism for choosing a winning listener.
As a sidenote I am not sure if a promote/demote mechanism would be useful; it would be better to schedule the jobs appropriately while inserting...
