I'm fairly new to Ignite and have a question about the responsibilities of client and server nodes. As far as I understand from the documentation, client nodes are very small machines, so it's not their purpose to perform heavy cache operations. For instance, I need to load data from a persistent store, perform some heavy cache-related computations and put the resulting data into the cache. It looks like this:
I.
//This is on a client node
public class Loader {
    private DataSource dataSource;
    @IgniteInstanceResource
    private Ignite ignite;
    public void load() {
        String key;
        String values;
        //retrieve key and value from the dataSource
        IgniteDataStreamer<String, String> streamer = ignite.dataStreamer("cache");
        String result;
        //process value
        streamer.addData(key, result); //<---------1
    }
}
The question is about //1. Is it the client node's responsibility to process the loaded data and put it into the cache? I actually intend to do the following: create a task for each loaded String key and String value and perform all evaluation and cache-related operations on a server node. Like the following:
II.
public class LoaderJob extends ComputeJobAdapter {
    private String key;
    private String value;
    @Override
    public Object execute() {
        //perform all computation and putting into cache here
        //and return Tuple2(key, result);
    }
}
public class LoaderTask extends ComputeTaskSplitAdapter<Void, Void> {
    //...
    public Void reduce(List<ComputeJobResult> results) throws IgniteException {
        results.stream().forEach(result -> {
            Tuple2<String, String> jobResult = result.getData();
            ignite.dataStreamer("cache").addData(jobResult._1, jobResult._2);
        });
        return null;
    }
}
In the second case, all the client does is load data from the persistent store and then publish tasks to the servers.
What is the common way of doing things like that?
It depends on the amount of data and the computational complexity. With a large amount of data, you can load it directly from a server node, without using a client.
Here is the simplest example for DataStreamer; you only need to add loading the data from your persistent store and do the calculations before using DataStreamer.
It also depends on other things, like the client configuration (CPU, RAM, network) and the connection between the client and server nodes. If the client is configured well, for example like a server, and it's in the same network as the server nodes, then it's not a problem to do the loading and computations on the client and only then stream the data to the cache.
Creating a dedicated job for each piece of data yourself is a bad idea. The streamer already does something like this: data is buffered and sent to the specific node where it will be stored.
client nodes are very small machines, so it's not their purpose to perform some heavy cache operations
This statement is not true. You can give the client JVM enough resources to load the data.
You should create one data streamer on the client side and load data from that machine. The streamer instance is also thread-safe, so you can load data from several threads simultaneously.
IgniteDataStreamer is the fastest way to load data into a cache. So the first case is valid.
I think the second case makes sense if the data is gathered from the persistent store on the server nodes and the client sends only the parameters of the loading.
Related
I have stream of objects with address and list of organizations:
@Data
class TaggedObject {
String address;
List<String> organizations;
}
Is there a way to do the following using apache flink:
Merge organization lists for objects with same address
Send all results to Sink when some event occurs. E.g. when user sends control message to a kafka topic or another DataSource
Keep all objects for future accumulations
I tried using global window and custom trigger:
public class MyTrigger extends Trigger<TaggedObject, GlobalWindow> {
    @Override
    public TriggerResult onElement(TaggedObject element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
        if (element instanceof Control) return TriggerResult.FIRE;
        else return TriggerResult.CONTINUE;
    }
    // remaining Trigger methods omitted
}
But it seems to give only Control element as a result. Other elements were ignored.
If you want a generic control signal that triggers output for ALL addresses, then you'll need to use a broadcast stream. You combine your stream of addresses with your control stream and then perform the appropriate logic (merging organizations for an address, or triggering output) inside of your custom implementation of a KeyedBroadcastProcessFunction.
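The merge-or-emit logic that would live inside such a function can be sketched in plain Java, without any Flink API. The class name OrgMerger and the methods onElement and onControl are hypothetical; in a real job the map would be Flink keyed state and onControl would be driven by the broadcast side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of the merge-and-emit logic (not Flink code).
// OrgMerger, onElement and onControl are hypothetical names.
class OrgMerger {
    // address -> merged organizations; plays the role of keyed state
    private final Map<String, Set<String>> state = new LinkedHashMap<>();

    // Called for every regular element: merge into the state, emit nothing.
    void onElement(String address, List<String> organizations) {
        state.computeIfAbsent(address, a -> new LinkedHashSet<>()).addAll(organizations);
    }

    // Called for a control message: return a snapshot for every address.
    // The state is kept, so later elements keep accumulating.
    Map<String, List<String>> onControl() {
        Map<String, List<String>> out = new LinkedHashMap<>();
        state.forEach((address, orgs) -> out.put(address, new ArrayList<>(orgs)));
        return out;
    }

    public static void main(String[] args) {
        OrgMerger merger = new OrgMerger();
        merger.onElement("addr1", Arrays.asList("A", "B"));
        merger.onElement("addr1", Arrays.asList("B", "C"));
        merger.onElement("addr2", Arrays.asList("D"));
        System.out.println(merger.onControl());
    }
}
```

Because the state is not cleared on a control message, this matches the requirement to keep all objects for future accumulations.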
It seems like you should just key the stream by address and then use a KeyedProcessFunction (with a List- or MapState) to store the different organizations. Then as soon as an event comes in, you can just output the entries of the State.
Kind Regards
Dominik
Is there a way to populate a Map once from the DB (through a Mongo repository) and reuse it when required from multiple classes, instead of hitting the database through the repository each time?
As per your comment, what you are looking for is a Caching mechanism. Caches are components which allow data to live in memory, as opposed to files, databases or other mediums so as to allow for the fast retrieval of information (against a higher memory footprint).
There are probably various tutorials online, but usually caches all have the following behaviour:
1. They are key-value pair structures.
2. Each entity living in the cache also has a Time To Live (TTL), that is, how long it will be considered valid.
You can implement this in the repository layer, so the cache mechanism will be transparent to the rest of your application (but you might want to consider exposing functionality that allows to clear/invalidate part or all the cache).
So basically, when a query comes to your repository layer, check in the cache. If it exists in there, check the time to live. If it is still valid, return that.
If the key does not exist or the TTL has expired, you load the data from the DB and add/overwrite it in the cache. Keep in mind that when you update the data model yourself, you should also invalidate the cache accordingly, so that fresh data is pulled from the DB on the next call.
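A minimal sketch of that lookup flow, assuming a loader function that stands in for the real repository query. The class TtlCache, its constructor parameters and the injectable clock are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Minimal TTL cache sketch; TtlCache and its members are hypothetical names.
class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;    // injectable so expiry can be tested
    private final Function<K, V> loader; // stands in for the DB query

    TtlCache(long ttlMillis, LongSupplier clock, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.loader = loader;
    }

    // Returns the cached value if present and still valid, otherwise reloads it.
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || now >= entry.expiresAtMillis) {
            entry = new Entry<>(loader.apply(key), now + ttlMillis);
            entries.put(key, entry);
        }
        return entry.value;
    }

    // Call these after writing to the DB so stale data is not served.
    void invalidate(K key) { entries.remove(key); }
    void invalidateAll() { entries.clear(); }

    public static void main(String[] args) {
        TtlCache<String, String> cache = new TtlCache<String, String>(
                30_000, System::currentTimeMillis, key -> "value-for-" + key);
        System.out.println(cache.get("a")); // loaded through the loader
        System.out.println(cache.get("a")); // served from the cache
    }
}
```

Under concurrent access two threads may both reload the same expired key; that is harmless here but worth knowing if the query is expensive.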
You can declare the map field as public static, which would allow application-wide access to it via ClassLoadingData.mapField
I think a better solution, if I understood the problem, would be a memoized function, that is, a function that stores the result of its call. Here is a sketch of how this could be done (note that this does not handle possible synchronization problems in a multi-threaded environment):
class ClassLoadingData {
    private static Map<KeyType,ValueType> memoizedData = new HashMap<>();
    public Map<KeyType,ValueType> getMyData() {
        if (memoizedData.isEmpty()) { // you can use more complex if to handle data refresh
            populateData();
        }
        return memoizedData;
    }
    private void populateData() {
        // do your query, and assign result to memoizedData
    }
}
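For the multi-threaded case, one possible variant is sketched below, assuming the query can be wrapped in a Supplier. MemoizedData and refresh are hypothetical names; the pattern is double-checked locking on a volatile field, so the loader runs once until a refresh is requested.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Thread-safe memoization sketch; MemoizedData is a hypothetical name.
class MemoizedData<K, V> {
    private final Supplier<Map<K, V>> loader; // stands in for the repository query
    private volatile Map<K, V> data;          // null until the first load

    MemoizedData(Supplier<Map<K, V>> loader) {
        this.loader = loader;
    }

    // Loads the data on first use; later calls return the stored map.
    Map<K, V> get() {
        Map<K, V> local = data;
        if (local == null) {
            synchronized (this) {
                local = data;
                if (local == null) {
                    // Copy and wrap so callers cannot modify the shared map.
                    local = Collections.unmodifiableMap(new HashMap<>(loader.get()));
                    data = local;
                }
            }
        }
        return local;
    }

    // Forces a reload on the next get(), e.g. after the DB was changed.
    void refresh() {
        data = null;
    }

    public static void main(String[] args) {
        MemoizedData<String, Integer> memo = new MemoizedData<String, Integer>(
                () -> Collections.singletonMap("answer", 42));
        System.out.println(memo.get());
    }
}
```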
Premise: I suggest using an object-relational mapping tool like Hibernate in your Java project to map the object-oriented domain model to a relational database and let the tool handle the cache mechanism implicitly. Hibernate specifically implements a multi-level caching scheme (take a look at the following link for more information: https://www.tutorialspoint.com/hibernate/hibernate_caching.htm).
Regardless of the suggestion above, you can also manually create a singleton class that is used by every class in the project that interacts with the DB:
public class MongoDBConnector {
private static final Logger LOGGER = LoggerFactory.getLogger(MongoDBConnector.class);
private static MongoDBConnector instance;
//Cache period in seconds
public static int DB_ELEMENTS_CACHE_PERIOD = 30;
//Latest cache update time
private DateTime latestUpdateTime;
//The cache data layer from DB
private Map<KType,VType> elements;
private MongoDBConnector() {
}
public static synchronized MongoDBConnector getInstance() {
if (instance == null) {
instance = new MongoDBConnector();
}
return instance;
}
}
Here you can then define a load method that updates the map with the values stored in the DB, and a write method that writes values to the DB, with the following characteristics:
1- These methods should be synchronized in order to avoid issues if multiple calls are performed.
2- The load method should apply cache period logic (maybe with a configurable period) to avoid loading the data from the DB on every method call.
Example: Suppose your cache period is 30s. This means that if 10 reads are performed from different points of the code within 30s, you will load data from the DB only on the first call, while the others will read from the cached map, improving performance.
Note: The longer the cache period, the better the performance of your code, but if the DB is also modified externally (from another tool or manually), the cache will become inconsistent with it. So choose the best value for you.
public synchronized Map<KType, VType> getElements() throws ConnectorException {
    final DateTime currentTime = new DateTime();
    if (latestUpdateTime == null || (Seconds.secondsBetween(latestUpdateTime, currentTime).getSeconds() > DB_ELEMENTS_CACHE_PERIOD)) {
        LOGGER.debug("Cache is expired. Reading values from DB");
        //Read from DB and update cache
        //....
        latestUpdateTime = currentTime;
    }
    return elements;
}
3- The store method should automatically update the cache if the insert succeeds, regardless of whether the cache period has expired:
public synchronized void storeElement(final VType object) throws ConnectorException {
//Insert object on DB ( throws a ConnectorException if insert fails )
//...
//Update cache regardless the cache period
loadElementsIgnoreCachePeriod();
}
Then you can get the elements from any point in your code as follows:
Map<KType,VType> liveElements = MongoDBConnector.getInstance().getElements();
I have a use case where I read newline-delimited JSON elements stored in Google Cloud Storage and process each JSON element. While processing each one, I have to call an external API for de-duplication, to check whether that JSON element was discovered previously. I'm doing a ParDo with a DoFn on each JSON element.
I haven't seen any online tutorial showing how to call an external API endpoint from an Apache Beam DoFn on Dataflow.
I'm using the Java SDK of Beam. Some of the tutorials I studied mentioned using startBundle and finishBundle, but I'm not clear on how to use them.
If you need to check for duplicates in external storage for every JSON record, you can still use a DoFn for that. There are several annotations, like @Setup, @StartBundle, @FinishBundle, etc., that can be used to annotate methods in your DoFn.
For example, if you need to instantiate a client object to send requests to your external database, you might want to do this in the @Setup method (like a POJO constructor) and then use this client object in your @ProcessElement method.
Let's consider a simple example:
static class MyDoFn extends DoFn<Record, Record> {
    // one client per DoFn instance, created in setup() and not serialized
    private transient MyClient client;
    @Setup
    public void setup() {
        client = new MyClient("host");
    }
    @ProcessElement
    public void processElement(ProcessContext c) {
        // process your records
        Record r = c.element();
        // check record ID for duplicates
        if (!client.isRecordExist(r.id())) {
            c.output(r);
        }
    }
    @Teardown
    public void teardown() {
        if (client != null) {
            client.close();
            client = null;
        }
    }
}
Also, to avoid doing a remote call for every record, you can collect the records of a bundle in an internal buffer (Beam splits the input data into bundles) and check for duplicates in batch mode (if your client supports this). For this purpose, you can use @StartBundle and @FinishBundle annotated methods, which are called right before and right after a Beam bundle is processed, respectively.
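The buffering pattern itself can be sketched in plain Java, independent of the Beam API. Everything here is an assumption made for illustration: the class BatchingDeduper, the batch lookup function findExisting (standing in for a client that supports batch checks) and the method names, which only mirror the bundle lifecycle hooks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Plain-Java sketch of per-bundle batching (not Beam code).
// BatchingDeduper and findExisting are hypothetical names.
class BatchingDeduper {
    private final int maxBatchSize;
    // Hypothetical batch API: given candidate IDs, returns those already known.
    private final Function<List<String>, Set<String>> findExisting;
    private final List<String> buffer = new ArrayList<>();
    private final List<String> output = new ArrayList<>();

    BatchingDeduper(int maxBatchSize, Function<List<String>, Set<String>> findExisting) {
        this.maxBatchSize = maxBatchSize;
        this.findExisting = findExisting;
    }

    // Corresponds to the start-of-bundle hook: begin with an empty buffer.
    void startBundle() { buffer.clear(); }

    // Corresponds to per-element processing: buffer, flush when full.
    void processElement(String id) {
        buffer.add(id);
        if (buffer.size() >= maxBatchSize) flush();
    }

    // Corresponds to the end-of-bundle hook: flush whatever is left.
    void finishBundle() { flush(); }

    // One remote call per batch; IDs not known externally are emitted.
    // Duplicates inside the same batch are not detected in this sketch.
    private void flush() {
        if (buffer.isEmpty()) return;
        Set<String> existing = findExisting.apply(new ArrayList<>(buffer));
        for (String id : buffer) {
            if (!existing.contains(id)) output.add(id);
        }
        buffer.clear();
    }

    List<String> output() { return output; }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("a"));
        BatchingDeduper deduper = new BatchingDeduper(2, ids -> known);
        deduper.startBundle();
        for (String id : Arrays.asList("a", "b", "c")) deduper.processElement(id);
        deduper.finishBundle();
        System.out.println(deduper.output());
    }
}
```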
For more complicated examples, I'd recommend taking a look at the Sink implementations in different Beam IOs, like KinesisIO, for instance.
There is an example of calling external system in batches using a stateful DoFn in the following blog post: https://beam.apache.org/blog/2017/08/28/timely-processing.html, might be helpful.
I am building an application in Play Framework that has to do some intense file parsing. This parsing involves parsing multiple files, preferably in parallel.
A user uploads an archive that gets unzipped and the files are stored on the drive.
In that archive there is a file (let's call it main.csv) that has multiple columns. One such column is the name of another file from the archive (like subPage1.csv). This column can be empty, so that not all rows from the main.csv have subpages.
Now, I start an Akka Actor to parse the main.csv file. In this actor, using @Inject, I have another ActorRef:
public class MainParser extends AbstractActor {
    @Inject
    @Named("subPageParser")
    private ActorRef subPageParser;
    public Receive createReceive() {
        ...
        if (column[3] != null) {
            subPageParser.tell(column[3], getSelf());
        }
    }
}
SubPageParser Props:
public static Props getProps(JPAApi jpaApi) {
return new RoundRobinPool(3).props(Props.create((Class<?>) SubPageParser.class, jpaApi));
}
Now, my question is this. Considering that a subPage may take 5 seconds to be parsed, will I be using a single instance of SubPageParser, or will there be multiple instances that do the processing in parallel?
Also, consider another scenario, where the names are stored in the DB, and I use something like this:
List<String> names = dao.getNames();
for (String name: names) {
subPageParser.tell(name, null);
}
In this case, considering that the subPageParser ActorRef is obtained using Guice @Inject as before, will I do parallel processing?
If I am doing processing in parallel, how do I control the number of Actors that are being spawned? If I have 1000 subPages, I don't want 1000 Actors. Also, their lifetime may be an issue.
NOTE:
I have an ActorsModule like this, so that I can use @Inject and not Props:
public class ActorsModule extends AbstractModule implements AkkaGuiceSupport {
@Override
protected void configure() {
bindActor(MainParser.class, "mainparser");
Function<Props, Props> props = p -> SubPageParser.getProps();
bindActor(SubPageParser.class, "subPageParser", props);
}
}
UPDATE: I have modified the code to use a RoundRobinPool. However, this does not work as intended. I specified 3 as the number of instances, but I get a new object for each parse request in the if.
Injecting an actor like you did will lead to one SubPageParser per MainParser. While you might send 1000 messages to it (using tell), they will get processed one by one while the others are waiting in the mailbox to be processed.
With regards to your design, you need to be aware that injecting an actor like that will create another top-level actor rather than create the SubPageParser as a child actor, which would allow the parent actor to control and monitor it. The playframework has support for injecting child actors, as described in their documentation: https://www.playframework.com/documentation/2.6.x/JavaAkka#Dependency-injecting-child-actors
While you could get akka to use a certain number of child actors to distribute the load, I think you should question why you have used actors in the first place. Most problems can be solved with simple Futures. For example you can configure a custom thread pool to run your Futures with and have them do the work at a parallelization level as you wish: https://www.playframework.com/documentation/2.6.x/ThreadPools#Using-other-thread-pools
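A minimal sketch of the Futures approach using only the JDK, assuming a hypothetical parse method that stands in for the real sub-page parser. The fixed-size pool caps how many sub-pages are parsed at once, no matter how many names are submitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Sketch of bounded parallel parsing with futures; SubPageParsing is a hypothetical name.
class SubPageParsing {
    // Hypothetical stand-in for the real sub-page parser.
    static String parse(String fileName) {
        return "parsed:" + fileName;
    }

    // Parses all names with at most `parallelism` tasks running at once.
    // Results are returned in the same order as the input names.
    static List<String> parseAll(List<String> names, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<CompletableFuture<String>> futures = names.stream()
                    .map(name -> CompletableFuture.supplyAsync(() -> parse(name), pool))
                    .collect(Collectors.toList());
            return futures.stream()
                    .map(CompletableFuture::join) // wait for each result
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(Arrays.asList("subPage1.csv", "subPage2.csv"), 3));
    }
}
```

In a Play application the pool would more likely be a shared, configured executor than one created per call; it is created locally here only to keep the sketch self-contained.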
Hello. In my web application I maintain the list of URLs authorized for a user in a HashMap, compare the requested URL against it and respond according to the authorization. This Map has the Role as key and the URLs as value in the form of a List. My problem is: where should I keep this Map?
In Session: It may have hundreds of URLs, and that can increase the burden on the session.
In Cache at Application loading: The URLs may get modified on the fly, and then I need to resync it by restarting the server.
In Cache that updates periodically: An application-level cache that is updated periodically.
I need a well-optimized approach that serves this purpose; please help me with it.
I'd prefer to make it a singleton class and have a thread that updates it periodically. The thread will maintain the state of the cache; it will be started when you get the first instance of the cache.
public class CacheSingleton {
    private static CacheSingleton instance = null;
    private HashMap<String,Role> authMap;
    protected CacheSingleton() {
        // Exists only to defeat instantiation.
        // Start the thread to maintain Your map
    }
    // synchronized so that concurrent first calls create only one instance
    public static synchronized CacheSingleton getInstance() {
        if(instance == null) {
            instance = new CacheSingleton();
        }
        return instance;
    }
    // Add your cache logic here
    // Like getRole,checkURL() ... etc
}
Anywhere in your code you can then get the cached data:
CacheSingleton.getInstance().yourMethod();
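The periodic refresh could look roughly like the sketch below, which uses a ScheduledExecutorService in place of a hand-written thread. RoleUrlCache, isAllowed and the loader supplier (standing in for whatever reads the role-to-URL mapping) are hypothetical names; the application would keep one shared instance of it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Periodically refreshed role -> URLs cache; all names are hypothetical.
class RoleUrlCache {
    private final Supplier<Map<String, List<String>>> loader; // stands in for the DB read
    // Readers always see a complete snapshot; refresh swaps the reference.
    private volatile Map<String, List<String>> urlsByRole = Collections.emptyMap();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "role-url-cache-refresh");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

    RoleUrlCache(Supplier<Map<String, List<String>>> loader, long periodSeconds) {
        this.loader = loader;
        reload(); // initial load so the cache is usable immediately
        scheduler.scheduleAtFixedRate(this::safeReload, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    // Replaces the snapshot with a fresh copy of the loader's result.
    void reload() {
        urlsByRole = Collections.unmodifiableMap(new HashMap<>(loader.get()));
    }

    // A failing reload keeps the previous snapshot and later runs still happen.
    private void safeReload() {
        try {
            reload();
        } catch (RuntimeException e) {
            // keep serving the old data
        }
    }

    boolean isAllowed(String role, String url) {
        return urlsByRole.getOrDefault(role, Collections.emptyList()).contains(url);
    }

    void shutdown() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        Map<String, List<String>> source = new HashMap<>();
        source.put("admin", Arrays.asList("/admin", "/home"));
        RoleUrlCache cache = new RoleUrlCache(() -> source, 60);
        System.out.println(cache.isAllowed("admin", "/admin"));
        cache.shutdown();
    }
}
```

Swapping an immutable snapshot avoids locking on the read path, which matters because every request checks the map.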