The new Couchbase SDK makes bulk operations easier to use and more performant use rx-java. But is there any value to using rx for operations on single values?
If we look at a simple CAS / insert operation, ie if the value exists do a cas else do an insert and return the document value
final String id = "id";
final String modified = "modified";
final int numCasRetries = 3;
Observable
.defer(() -> bucket.async().get(id))
.flatMap(document -> {
try {
if (document == null) {
JsonObject content = JsonObject.create();
content.put(modified, new Date().getTime());
document = bucket.insert(JsonDocument.create(id, content));
} else {
document.content().put(modified, new Date().getTime());
document = bucket.replace(document);
}
return Observable.just(document);
} catch (CASMismatchException e) {
return Observable.error(e);
}
})
.retry((count, error) -> {
// Only retry on CASMismatchException
return ((error instanceof CASMismatchException)
&& (count < numCASRetries));
})
.onErrorResumeNext(error -> {
return Observable.error(new Exception(error));
})
.toBlocking()
.single();
So toBlocking will block the calling thread until a result is available. and only one value is written and read from Couchbase at a time. So I do not understand why or even if this code will be any better than
final String id = "id";
final String modified = "modified";
final int numCasRetries = 3;
JsonDocument document = null;
for (int i = 1; i <= numCasRetries; i++) {
document = bucket.get(id);
try {
if (document == null) {
JsonObject content = JsonObject.create();
content.put(modified, new Date().getTime());
document = bucket.insert(JsonDocument.create(id, content));
} else {
document.content().put(modified, new Date().getTime());
document = bucket.replace(document);
}
return document;
} catch (CASMismatchException e) {
if (i == numCasRetries) {
throw e;
}
}
}
If anything I'd argue that in this scenario the rx approach is less readable.
For an operation on a single document where ultimately you need to block, I'd tend to agree that your second example is clearer.
RxJava shines when you heavily use asynchronous processing, especially when you need advanced error handling, retry scenarii, combination of asynchronous flows...
The previous generation of Couchbase Java SDK (1.4.x) just had Future for that, and it didn't provide the elegant, powerful and expressive capabilities we found in RxJava.
Related
I have this path for a MongoDB field main.inner.leaf and every field couldn't be present.
In Java I should write, avoiding null:
String leaf = "";
if (document.get("main") != null &&
document.get("main", Document.class).get("inner") != null) {
leaf = document.get("main", Document.class)
.get("inner", Document.class).getString("leaf");
}
In this simple example I set only 3 levels: main, inner and leaf but my documents are deeper.
So is there a way avoiding me writing all these null checks?
Like this:
String leaf = document.getString("main.inner.leaf", "");
// "" is the deafult value if one of the levels doesn't exist
Or using a third party library:
String leaf = DocumentUtils.getNullCheck("main.inner.leaf", "", document);
Many thanks.
Since the intermediate attributes are optional you really have to access the leaf value in a null safe manner.
You could do this yourself using an approach like ...
if (document.containsKey("main")) {
Document _main = document.get("main", Document.class);
if (_main.containsKey("inner")) {
Document _inner = _main.get("inner", Document.class);
if (_inner.containsKey("leaf")) {
leafValue = _inner.getString("leaf");
}
}
}
Note: this could be wrapped up in a utility to make it more user friendly.
Or use a thirdparty library such as Commons BeanUtils.
But, you cannot avoid null safe checks since the document structure is such that the intermediate levels might be null. All you can do is to ease the burden of handling the null safety.
Here's an example test case showing both approaches:
#Test
public void readNestedDocumentsWithNullSafety() throws IllegalAccessException, NoSuchMethodException, InvocationTargetException {
Document inner = new Document("leaf", "leafValue");
Document main = new Document("inner", inner);
Document fullyPopulatedDoc = new Document("main", main);
assertThat(extractLeafValueManually(fullyPopulatedDoc), is("leafValue"));
assertThat(extractLeafValueUsingThirdPartyLibrary(fullyPopulatedDoc, "main.inner.leaf", ""), is("leafValue"));
Document emptyPopulatedDoc = new Document();
assertThat(extractLeafValueManually(emptyPopulatedDoc), is(""));
assertThat(extractLeafValueUsingThirdPartyLibrary(emptyPopulatedDoc, "main.inner.leaf", ""), is(""));
Document emptyInner = new Document();
Document partiallyPopulatedMain = new Document("inner", emptyInner);
Document partiallyPopulatedDoc = new Document("main", partiallyPopulatedMain);
assertThat(extractLeafValueManually(partiallyPopulatedDoc), is(""));
assertThat(extractLeafValueUsingThirdPartyLibrary(partiallyPopulatedDoc, "main.inner.leaf", ""), is(""));
}
private String extractLeafValueUsingThirdPartyLibrary(Document document, String path, String defaultValue) {
try {
Object value = PropertyUtils.getNestedProperty(document, path);
return value == null ? defaultValue : value.toString();
} catch (Exception ex) {
return defaultValue;
}
}
private String extractLeafValueManually(Document document) {
Document inner = getOrDefault(getOrDefault(document, "main"), "inner");
return inner.get("leaf", "");
}
private Document getOrDefault(Document document, String key) {
if (document.containsKey(key)) {
return document.get(key, Document.class);
} else {
return new Document();
}
}
My problem: I have a model engine that takes a list of parameter configuration, and evaluates a double value that corresponds to the metric associated to that configuration. I have six parameters, and each of them can vary according to a list. I want to find by brute force the best parameter configuration considering the combination that will produce the higher value for the output metric. Since I'm learning Spark, I realized that with the cartesian product operation I can easily generate the combinations, and split the RDD to be processed in parallel. So, I came up with this driver program:
public static void main(String[] args) {
String scriptName = "model.mry";
String scriptStr = null;
try {
scriptStr = new String(Files.readAllBytes(Paths.get(scriptName)));
} catch (IOException ex) {
Logger.getLogger(BruteForceDriver.class.getName()).log(Level.SEVERE, null, ex);
System.exit(1);
}
final String script = scriptStr;
SparkConf conf = new SparkConf()
.setAppName("wordCount")
.setSparkHome("/home/danilo/bin/spark-2.2.0-bin-hadoop2.7")
.setJars(new String[]{"/home/danilo/NetBeansProjects/SparkHello1/target/SparkHello1-1.0.jar",
"/home/danilo/.m2/repository/org/modcs/mercury/4.7/mercury-4.7.jar"})
.setMaster("spark://danilo-desktop:7077");
String baseDir = "/home/danilo/NetBeansProjects/SimulationOptimization/workspace/";
JavaSparkContext sc = new JavaSparkContext(conf);
final int NUM_SERVICES = 6;
final int QTD = 3;
JavaRDD<Service>[] providers = new JavaRDD[NUM_SERVICES];
for (int i = 1; i <= NUM_SERVICES; i++) {
providers[i - 1] = sc.textFile(baseDir + "provider"
+ i
+ ".mat")
.filter((t1) -> !t1.contains("#") && !t1.trim().isEmpty())
.map(Service.createParser("" + i))
.zipWithIndex().filter((t1) -> {
return t1._2 < QTD;
}).keys();
}
JavaPairRDD c = null;
JavaRDD<Service> p = providers[0];
for (int i = 1; i < NUM_SERVICES; i++) {
if (c == null) {
c = p.cartesian(providers[i]);
} else {
c = c.cartesian(providers[i]);
}
}
JavaRDD<List<Service>> cartesian = c.map(new FlattenTuple<>());
final Broadcast<ModelEvaluator> model = sc.broadcast(new ModelEvaluator(script));
JavaPairRDD<Double, List<Service>> results = cartesian.mapToPair(
(t) -> {
try {
double val = model.value().evaluateModel(t);
System.out.println(val);
return new Tuple2<>(val, t);
} catch (Exception ex) {
return null;
}
}
);
results.sortByKey().collect().forEach((t) -> {
System.out.println(t._1 + ", " + t._2);
});
sc.close();
}
The "QTD" variable allows me to control the size the interval which each parameter will vary. For QTD = 3, I'll have 3^6 = 729 combinations. The problem is that it is taking so long to compute all those combinations. I wrote a implementations using only normal Java threads, and the runtime is about 40 seconds. Using my Spark driver program, the runtime more than 6 minutes. Why is my Spark program so slow compared to the plain Java multi-thread program?
Edit:
I put:
results = results.cache();
before sorting the results and now the runtime is 2.5 minutes.
Edit 2:
I created a RDD with the cartesian product of the parameters by hand instead of using the operation provided by the framework. Now my runtime is 1'25''. It does make sense now, since there is some overhead to start the driver and move the jars to the workers.
I'm trying to download photos posted with specific tag in real time. I found real time api pretty useless so I'm using long polling strategy. Below is pseudocode with comments of sublte bugs in it
newMediaCount = getMediaCount();
delta = newMediaCount - mediaCount;
if (delta > 0) {
// if mediaCount changed by now, realDelta > delta, so realDelta - delta photos won't be grabbed and on next poll if mediaCount didn't change again realDelta - delta would be duplicated else ...
// if photo posted from private account last photo will be duplicated as counter changes but nothing is added to recent
recentMedia = getRecentMedia(delta);
// persist recentMedia
mediaCount = newMediaCount;
}
Second issue can be addressed with Set of some sort I gueess. But first really bothers me. I've moved two calls to instagram api as close as possible but is this enough?
Edit
As Amir suggested I've rewritten the code with use of min/max_tag_ids. But it still skips photos. I couldn't find better way to test this than save images on disk for some time and compare result to instagram.com/explore/tags/.
public class LousyInstagramApiTest {
#Test
public void testFeedContinuity() throws Exception {
Instagram instagram = new Instagram(Settings.getClientId());
final String TAG_NAME = "portrait";
String id = instagram.getRecentMediaTags(TAG_NAME).getPagination().getMinTagId();
HashtagEndpoint endpoint = new HashtagEndpoint(instagram, TAG_NAME, id);
for (int i = 0; i < 10; i++) {
Thread.sleep(3000);
endpoint.recentFeed().forEach(d -> {
try {
URL url = new URL(d.getImages().getLowResolution().getImageUrl());
BufferedImage img = ImageIO.read(url);
ImageIO.write(img, "png", new File("D:\\tmp\\" + d.getId() + ".png"));
} catch (Exception e) {
e.printStackTrace();
}
});
}
}
}
class HashtagEndpoint {
private final Instagram instagram;
private final String hashtag;
private String minTagId;
public HashtagEndpoint(Instagram instagram, String hashtag, String minTagId) {
this.instagram = instagram;
this.hashtag = hashtag;
this.minTagId = minTagId;
}
public List<MediaFeedData> recentFeed() throws InstagramException {
TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null);
List<MediaFeedData> dataList = feed.getData();
if (dataList.size() == 0) return Collections.emptyList();
String maxTagId = feed.getPagination().getNextMaxTagId();
if (maxTagId != null && maxTagId.compareTo(minTagId) > 0) dataList.addAll(paginateFeed(maxTagId));
Collections.reverse(dataList);
// dataList.removeIf(d -> d.getId().compareTo(minTagId) < 0);
minTagId = feed.getPagination().getMinTagId();
return dataList;
}
private Collection<? extends MediaFeedData> paginateFeed(String maxTagId) throws InstagramException {
System.out.println("pagination required");
List<MediaFeedData> dataList = new ArrayList<>();
do {
TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, null, maxTagId);
maxTagId = feed.getPagination().getNextMaxTagId();
dataList.addAll(feed.getData());
} while (maxTagId.compareTo(minTagId) > 0);
return dataList;
}
}
Using the Tag endpoints to get the recent media with a desired tag, it returns a min_tag_id in its pagination info, which is tied to the most recently tagged media at the time of your call. As the API also accepts a min_tag_id parameter, you can pass that number from your last query to only receive those media that are tagged after your last query.
So based on whatever polling mechanism you have, you just call the API to get the new recent media if any based on last received min_tag_id.
You will also need to pass a large count parameter and follow the pagination of the response to receive all data without losing anything when the speed of tagging is faster than your polling.
Update:
Based on your updated code:
public List<MediaFeedData> recentFeed() throws InstagramException {
TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null, 100000);
List<MediaFeedData> dataList = feed.getData();
if (dataList.size() == 0) return Collections.emptyList();
// follow the pagination
MediaFeed recentMediaNextPage = instagram.getRecentMediaNextPage(feed.getPagination());
while (recentMediaNextPage.getPagination() != null) {
dataList.addAll(recentMediaNextPage.getData());
recentMediaNextPage = instagram.getRecentMediaNextPage(recentMediaNextPage.getPagination());
}
Collections.reverse(dataList);
minTagId = feed.getPagination().getMinTagId();
return dataList;
}
I want to have Solr always retrieve all documents found by a search (I know Solr wasn't built for that, but anyways) and I am currently doing this with this code:
...
List<Article> ret = new ArrayList<Article>();
QueryResponse response = solr.query(query);
int offset = 0;
int totalResults = (int) response.getResults().getNumFound();
List<Article> ret = new ArrayList<Article>((int) totalResults);
query.setRows(FETCH_SIZE);
while(offset < totalResults) {
//requires an int? wtf?
query.setStart((int) offset);
int left = totalResults - offset;
if(left < FETCH_SIZE) {
query.setRows(left);
}
response = solr.query(query);
List<Article> current = response.getBeans(Article.class);
offset += current.size();
ret.addAll(current);
}
...
This works, but is pretty slow if a query gets over 1000 hits (I've read about that on here. This is being caused by Solr because I am setting the start everytime which - for some reason - takes some time). What would be a nicer (and faster) ways to do this?
To improve the suggested answer you could use a streamed response. This has been added especially for the case that one fetches all results. As you can see in Solr's Jira that guy wants to do the same as you do. This has been implemented for Solr 4.
This is also described in Solrj's javadoc.
Solr will pack the response and create a whole XML/JSON document before it starts sending the response. Then your client is required to unpack all that and offer it as a list to you. By using streaming and parallel processing, which you can do when using such a queued approach, the performance should improve further.
Yes, you will loose the automatic bean mapping, but as performance is a factor here, I think this is acceptable.
Here is a sample unit test:
public class StreamingTest {
#Test
public void streaming() throws SolrServerException, IOException, InterruptedException {
HttpSolrServer server = new HttpSolrServer("http://your-server");
SolrQuery tmpQuery = new SolrQuery("your query");
tmpQuery.setRows(Integer.MAX_VALUE);
final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
SolrDocument tmpDoc;
do {
tmpDoc = tmpQueue.take();
} while (!(tmpDoc instanceof PoisonDoc));
}
private class PoisonDoc extends SolrDocument {
// marker to finish queuing
}
private class MyCallbackHander extends StreamingResponseCallback {
private BlockingQueue<SolrDocument> queue;
private long currentPosition;
private long numFound;
public MyCallbackHander(BlockingQueue<SolrDocument> aQueue) {
queue = aQueue;
}
#Override
public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
// called before start of streaming
// probably use for some statistics
currentPosition = aStart;
numFound = aNumFound;
if (numFound == 0) {
queue.add(new PoisonDoc());
}
}
#Override
public void streamSolrDocument(SolrDocument aDoc) {
currentPosition++;
System.out.println("adding doc " + currentPosition + " of " + numFound);
queue.add(aDoc);
if (currentPosition == numFound) {
queue.add(new PoisonDoc());
}
}
}
}
You might improve performance by increasing FETCH_SIZE. Since you are getting all the results, pagination doesn't make sense unless you are concerned with memory or some such. If 1000 results are liable to cause a memory overflow, I'd say your current performance seems pretty outstanding though.
So I would try getting everything at once, simplifying this to something like:
//WHOLE_BUNCHES is a constant representing a reasonable max number of docs we want to pull here.
//Integer.MAX_VALUE would probably invite an OutOfMemoryError, but that would be true of the
//implementation in the question anyway, since they were still being stored in the list at the end.
query.setRows(WHOLE_BUNCHES);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound(); //If you even still need this figure.
List<Article> ret = response.getBeans(Article.class);
If you need to keep the pagination though:
You are performing this first query:
QueryResponse response = solr.query(query);
and are populating the number of found results from it, but you are not pulling any results with the response. Even if you keep pagination here, you could at least eliminate one extra query here.
This:
int left = totalResults - offset;
if(left < FETCH_SIZE) {
query.setRows(left);
}
Is unnecessary. setRows specifies a Maximum number of rows to return, so asking for more than are available won't cause any problems.
Finally, apropos of nothing, but I have to ask: what argument would you expect setStart to take if not an int?
Use below logic to fetch solr data as batch to optimize performance of solr data fetch query:
public List<Map<String, Object>> getData(int id,Set<String> fields){
final int SOLR_QUERY_MAX_ROWS = 3;
long start = System.currentTimeMillis();
SolrQuery query = new SolrQuery();
String queryStr = "id:" + id;
LOG.info(queryStr);
query.setQuery(queryStr);
query.setRows(SOLR_QUERY_MAX_ROWS);
QueryResponse rsp = server.query(query, SolrRequest.METHOD.POST);
List<Map<String, Object>> mapList = null;
if (rsp != null) {
long total = rsp.getResults().getNumFound();
System.out.println("Total count found: " + total);
// Solr query batch
mapList = new ArrayList<Map<String, Object>>();
if (total <= SOLR_QUERY_MAX_ROWS) {
addAllData(mapList, rsp,fields);
} else {
int marker = SOLR_QUERY_MAX_ROWS;
do {
if (rsp != null) {
addAllData(mapList, rsp,fields);
}
query.setStart(marker);
rsp = server.query(query, SolrRequest.METHOD.POST);
marker = marker + SOLR_QUERY_MAX_ROWS;
} while (marker <= total);
}
}
long end = System.currentTimeMillis();
LOG.debug("SOLR Performance: getData: " + (end - start));
return mapList;
}
private void addAllData(List<Map<String, Object>> mapList, QueryResponse rsp,Set<String> fields) {
for (SolrDocument sdoc : rsp.getResults()) {
Map<String, Object> map = new HashMap<String, Object>();
for (String field : fields) {
map.put(field, sdoc.getFieldValue(field));
}
mapList.add(map);
}
}
using Jsoup, I extract JavaScript part in html file. and store it as java String Object.
and I want to extract function list, variables list in js's function using javax.script.ScriptEngine
JavaScript part has several function section.
ex)
function a() {
var a_1;
var a_2
...
}
function b() {
var b_1;
var b_2;
...
}
function c() {
var c_1;
var c_2;
...
}
My Goals is right below.
List funcList
a
b
c
List varListA
a_1
a_2
...
List varListB
b_1
b_2
...
List varListC
c_1
c_2
...
How can I extract function list and variables list(or maybe values)?
I think you can do this by using javascript introspection after having loaded the javascript in the Engine - e.g. for functions:
ScriptEngine engine;
// create the engine and have it load your javascript
Bindings bind = engine.getBindings(ScriptContext.ENGINE_SCOPE);
Set<String> allAttributes = bind.keySet();
Set<String> allFunctions = new HashSet<String>();
for ( String attr : allAttributes ) {
if ( "function".equals( engine.eval("typeof " + attr) ) ) {
allFunctions.add(attr);
}
}
System.out.println(allFunctions);
I haven't found a way to extract the variables inside functions (local variables) without delving in internal mechanics (and thus unsafe to use) of the javascript scripting engine.
It is pretty tricky. ScriptEngine API seems not good for inspecting the code. So, I have such kind of pretty ugly solution with instance of and cast operators.
Bindings bindings = engine.getBindings(ScriptContext.ENGINE_SCOPE);
for (Map.Entry<String, Object> scopeEntry : bindings.entrySet()) {
Object value = scopeEntry.getValue();
String name = scopeEntry.getKey();
if (value instanceof NativeFunction) {
log.info("Function -> " + name);
NativeFunction function = NativeFunction.class.cast(value);
DebuggableScript debuggableFunction = function.getDebuggableView();
for (int i = 0; i < debuggableFunction.getParamAndVarCount(); i++) {
log.info("First level arg: " + debuggableFunction.getParamOrVarName(i));
}
} else if (value instanceof Undefined
|| value instanceof String
|| value instanceof Number) {
log.info("Global arg -> " + name);
}
}
I had similar issue. Maybe it will be helpfull for others.
I use groove as script lang. My Task was to retrive all invokable functions from the script. And then filter this functions by some criteria.
Unfortunately this approach is usefull only for groovy...
Get script engine:
public ScriptEngine getEngine() throws Exception {
if (engine == null)
engine = new ScriptEngineManager().getEngineByName(scriptType);
if (engine == null)
throw new Exception("Could not find implementation of " + scriptType);
return engine;
}
Compile and evaluate script:
public void evaluateScript(String script) throws Exception {
Bindings bindings = getEngine().getBindings(ScriptContext.ENGINE_SCOPE);
bindings.putAll(binding);
try {
if (engine instanceof Compilable)
compiledScript = ((Compilable)getEngine()).compile(script);
getEngine().eval(script);
} catch (Throwable e) {
e.printStackTrace();
}
}
Get functions from script. I did not found other ways how to get all invokable methods from script except Reflection. Yeah, i know that this approach depends on ScriptEngine implementation, but it's the only one :)
public List getInvokableList() throws ScriptException {
List list = new ArrayList();
try {
Class compiledClass = compiledScript.getClass();
Field clasz = compiledClass.getDeclaredField("clasz");
clasz.setAccessible(true);
Class scrClass = (Class)clasz.get(compiledScript);
Method[] methods = scrClass.getDeclaredMethods();
clasz.setAccessible(false);
for (int i = 0, j = methods.length; i < j; i++) {
Annotation[] annotations = methods[i].getDeclaredAnnotations();
boolean ok = false;
for (int k = 0, m = annotations.length; k < m; k++) {
ok = annotations[k] instanceof CalculatedField;
if (ok) break;
}
if (ok)
list.add(methods[i].getName());
}
} catch (NoSuchFieldException e) {
e.printStackTrace();
} catch (IllegalAccessException e) {
}
return list;
}
In my task i don't need all functions, for this i create custom annotation and use it in the script:
#Retention(RetentionPolicy.RUNTIME)
#Target(ElementType.METHOD)
public #interface CalculatedField {
}
Script example:
import com.vssk.CalculatedField;
def utilFunc(s) {
s
}
#CalculatedField
def func3() {
utilFunc('Testing func from groovy')
}
Method to invoke script function by it's name:
public Object executeFunc(String name) throws Exception {
return ((Invocable)getEngine()).invokeFunction(name);
}