I want to read/write a raw device (which is just a file in Linux) asynchronously, and I have been using java.nio.channels.AsynchronousFileChannel.
But it is only 'fake' asynchronous, because AsynchronousFileChannel uses a thread pool to execute the read/write tasks; it actually calls the synchronous read/write interface offered by the OS.
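For reference, the usage pattern I mean is roughly this (a minimal sketch; the device path and buffer size are just placeholders):

AsynchronousFileChannel channel = AsynchronousFileChannel.open(
        Paths.get("/dev/raw/raw1"), StandardOpenOption.READ);
ByteBuffer buffer = ByteBuffer.allocate(4096);
// the Future completes once a pool thread has finished a blocking read
Future<Integer> pending = channel.read(buffer, 0);
int bytesRead = pending.get();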
What I really want is a truly asynchronous implementation, which on Linux means io_submit.
But I can't find one in the JDK or in other libraries such as Guava or Apache Commons.
So my question is this:
In Java, is there an existing implementation of an asynchronous file accessor based on the native io_submit interface?
If not, why don't I see anyone else who needs it?
In Java, is there an existing implementation of an asynchronous file accessor based on the native io_submit interface?
Not in the default Java libraries at the time of writing (2019). I doubt there's much enthusiasm to implement an io_submit() Java wrapper in the default libraries because:
libaio/KAIO is quirky. Linux's KAIO is fraught with constraints, such as only really being asynchronous when doing direct I/O (and even then there are elaborate rules, some beyond the caller's control, that will silently turn a submission synchronous when broken).
There's no guarantee that the libaio library itself will be around, so you would have to bundle it with Java or otherwise reimplement it.
If not, why don't I see anyone else who needs it?
People who need it that badly have created their own wrappers (e.g. see https://github.com/zrlio/jaio ). However, supporting KAIO would be a Linux-only thing and thus not very portable (which goes a bit against a key Java ethos).
I am new to the project, and I am trying to create a connector between Dataflow and a database.
The documentation clearly states that I should use a Source and a Sink, but I see a lot of people directly using a PTransform associated with a PInput or a PDone.
The Source/Sink API is still experimental (which explains why all the examples use PTransform), but it seems easier to integrate with a custom runner (e.g. Spark).
If I refer to the code, both approaches are used, and I cannot see any use case where it would be more advantageous to use the PTransform API.
Is the Source/Sink API supposed to replace the PTransform API?
Did I miss something that clearly differentiates the two approaches?
Is the Source/Sink API stable enough to be considered the right way to code inputs and outputs?
Thanks for your advice!
The philosophy of Dataflow is that PTransform is the main unit of abstraction and composability, i.e., any self-contained data processing task should be encapsulated as a PTransform. This includes the task of connecting to a third-party storage system: ingesting data from somewhere or exporting it to somewhere.
Take, for example, Google Cloud Datastore. In the code snippet:
PCollection<Entity> entities =
    p.apply(DatastoreIO.readFrom(dataset, query));
...
p.apply(some processing)
 .apply(DatastoreIO.writeTo(dataset));
the return type of DatastoreIO.readFrom(dataset, query) is a subclass of PTransform<PBegin, PCollection<Entity>>, and the type of DatastoreIO.writeTo(dataset) is a subclass of PTransform<PCollection<Entity>, PDone>.
It is true that these functions are under the hood implemented using the Source and Sink classes, but to a user who just wants to read or write something to Datastore, that's an implementation detail that usually should not matter (however, see the note at the end of this answer about exposing the Source or Sink class). Any connector, or for that matter, any other data processing task is a PTransform.
Note: Currently connectors that read from somewhere tend to be PTransform<PBegin, PCollection<T>>, and connectors that write to somewhere tend to be PTransform<PCollection<T>, PDone>, but we are considering options to make it easier to use connectors in more flexible ways (for example, reading from a PCollection of filenames).
However, of course, this detail matters to somebody who wants to implement a new connector. In particular, you may ask:
Q: Why do I need the Source and Sink classes at all, if I could just implement my connector as a PTransform?
A: If you can implement your connector by just using the built-in transforms (such as ParDo, GroupByKey etc.), that's a perfectly valid way to develop a connector. However, the Source and Sink classes provide some low-level capabilities that, in case you need them, would be cumbersome or impossible to develop yourself.
For example, BoundedSource and UnboundedSource provide hooks for controlling how parallelization happens (both initial and dynamic work rebalancing - BoundedSource.splitIntoBundles, BoundedReader.splitAtFraction), while these hooks are not currently exposed for arbitrary DoFns.
You could technically implement a parser for a file format by writing a DoFn<FilePath, SomeRecord> that takes the filename as input, reads the file and emits SomeRecord, but this DoFn would not be able to dynamically parallelize reading parts of the file onto multiple workers in case the file turned out to be very large at runtime. On the other hand, FileBasedSource has this capability built-in, as well as handling of glob filepatterns and such.
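As a rough illustration (a sketch only, in the older 1.x SDK style where DoFn subclasses override processElement; the class name and parsing are made up), such a DoFn might look like this:

class NaiveFileParseFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    // the whole file is consumed inside one processElement call, so one huge
    // file cannot be split across workers the way FileBasedSource can split it
    for (String line : java.nio.file.Files.readAllLines(
        java.nio.file.Paths.get(c.element()), java.nio.charset.StandardCharsets.UTF_8)) {
      c.output(line); // real code would parse the line into SomeRecord here
    }
  }
}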
Likewise, you could try implementing a connector to a streaming system by implementing a DoFn that takes a dummy element as input, establishes a connection and streams all elements into ProcessingContext.output(), but DoFns currently don't support writing unbounded amounts of output from a single bundle, nor do they explicitly support the checkpointing and deduplication machinery needed for the strong consistency guarantees Dataflow gives to streaming pipelines. UnboundedSource, on the other hand, supports all this.
Sink (more precisely, the Write.to() PTransform) is also interesting: it is just a composite transform that you could write yourself if you wanted to (i.e. it has no hard-coded support in the Dataflow runner or backend), but it was developed with consideration for typical distributed fault tolerance issues that arise when writing data to a storage system in parallel, and it provides hooks that force you to keep those issues in mind: e.g., because bundles of data are written in parallel, and some bundles may be retried or duplicated for fault tolerance, there is a hook for "committing" just the results of the successfully completed bundles (WriteOperation.finalize).
To summarize: using Source or Sink APIs to develop a connector helps you structure your code in a way that will work well in a distributed processing setting, and the source APIs give you access to advanced capabilities of the framework. But if your connector is a very simple one that needs neither, then you are free to just assemble your connector from other built-in transforms.
Q: Suppose I decide to make use of Source and Sink. Then how do I package my connector as a library: should I just provide the Source or Sink class, or should I wrap it into a PTransform?
A: Your connector should ultimately be packaged as a PTransform, so that the user can just p.apply() it in their pipeline. However, under the hood your transform can use Source and Sink classes.
A common pattern is to expose the Source and Sink classes as well, making use of the Fluent Builder pattern, and letting the user wrap them into a Read.from() or Write.to() transform themselves, but this is not a strict requirement.
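For example, a connector packaged that way might look roughly like this (a sketch in the 1.x SDK style, where a PTransform subclass overrides apply; MyDatabaseSource and MyRecord are hypothetical):

public class MyDatabaseIO {
  // what the user calls: p.apply(MyDatabaseIO.read("my_table"))
  public static ReadTable read(String table) {
    return new ReadTable(table);
  }

  public static class ReadTable extends PTransform<PBegin, PCollection<MyRecord>> {
    private final String table;

    ReadTable(String table) {
      this.table = table;
    }

    @Override
    public PCollection<MyRecord> apply(PBegin input) {
      // under the hood, wrap the custom BoundedSource in the built-in Read transform
      return input.apply(Read.from(new MyDatabaseSource(table)));
    }
  }
}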
I need to implement some kind of inter-process mutex in Java. I'm considering using the FileLock API as recommended in this thread. I'll basically be using a dummy file and locking it in each process.
Is this the best approach? Or is something like this already built into the standard API? (I can't find it.)
For more details see below:
I have written an application which reads some input files and updates some database tables according to what it finds in them (it's more complex, but business logic is irrelevant here).
I need to ensure mutual exclusion between multiple database updates. I tried to implement this with LOCK TABLE, but this is unsupported by the engine I'm using. So, I want to implement the locking support in the application code.
I went for the FileLock API approach and implemented a simple mutex based on:
FileChannel.lock
FileLock.release
All the processes use the same dummy file for acquiring locks.
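Roughly like this (a minimal sketch; the lock file path is just a placeholder):

try (FileChannel channel = FileChannel.open(
        Paths.get("/var/tmp/myapp.lock"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
    FileLock lock = channel.lock(); // blocks until an exclusive lock is granted
    try {
        // critical section: perform the database updates here
    } finally {
        lock.release();
    }
}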
Why bother with files if you have a database at hand? Try using database locking reads like this: https://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
By the way, what database engine do you use? It might help if you list it.
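For illustration, a locking read over JDBC might look roughly like this (a sketch only; the table, column, lock name and dataSource are assumptions, and it requires an engine that supports transactional row locks, such as InnoDB):

try (Connection conn = dataSource.getConnection()) {
    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(
            "SELECT name FROM app_mutex WHERE name = ? FOR UPDATE")) {
        ps.setString(1, "import-job");
        ps.executeQuery(); // the matching row stays locked until commit/rollback
        // ... perform the mutually exclusive updates on this same connection ...
        conn.commit();
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}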
Suppose I want to allow people to run simple console Java programs on my server without the ability to access the file system, the network, or other I/O except via my own highly restricted API. But I don't want to get too deep into operating-system-level restrictions, so for the sake of the current discussion I want to consider code-level sanitization methods.
So suppose I try to achieve this restriction as follows. I will prohibit all "import" statements except for those explicitly whitelisted (let's say "import SanitizedSystemIO." is allowed while "import java.io." is not), and I will prohibit the string "java.*" anywhere in the code. This way the user can write code referencing the File class from SanitizedSystemIO, but he cannot reference java.io.File. The user is thus forced to use my sanitized wrapper APIs, while my own framework code (which will compile and run together with the user's code, for example to provide the IO functionality) can access all regular Java APIs.
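For concreteness, the check I have in mind is roughly this naive sketch (the whitelist and the error handling are just examples):

static final List<String> IMPORT_WHITELIST = Arrays.asList("SanitizedSystemIO.");

static void checkSource(String source) {
    for (String line : source.split("\n")) {
        String trimmed = line.trim();
        if (trimmed.startsWith("import ")
                && IMPORT_WHITELIST.stream().noneMatch(ok -> trimmed.startsWith("import " + ok))) {
            throw new IllegalArgumentException("Disallowed import: " + trimmed);
        }
        if (trimmed.contains("java.")) {
            throw new IllegalArgumentException("Reference to java.* is not allowed: " + trimmed);
        }
    }
}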
Will this approach work? Or is there a way to hack it to get access to the standard Java API?
ETA: OK, first of all, it should of course be java.* strings, not system.* (I think in C#, basically...).
Second, OK, so people say "use a SecurityManager" or "use a ClassLoader". But what, if anything, is wrong with the code analysis approach? One benefit of it, to my mind, is the sheer KISS simplicity: instead of figuring out all the things to check and sanitize in a SecurityManager, we just allow a small whitelist of functionality and block everything else. Implementation-wise this is a trivial exercise for people with minimal knowledge of Java.
And to reiterate my original question: can this be hacked? Is there some Java language construct that would allow access to the underlying API despite such code restrictions?
In your shoes I'd rather run the loaded apps inside a custom ClassLoader.
Maybe I'm mistaken, but if he wants to allow limited access to IO through his own functions, wouldn't SecurityManager prevent those as well? With a custom ClassLoader, he could provide his SanitizedSystemIO while refusing to load the things he doesn't want people to load.
However, checking for strings inside code is definitely not the way to go.
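A rough sketch of the ClassLoader idea (the blocked package prefixes are examples only, and this alone is not a complete sandbox):

class SandboxClassLoader extends ClassLoader {
    SandboxClassLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        // refuse to resolve anything outside the allowed packages
        if (name.startsWith("java.io.") || name.startsWith("java.net.")) {
            throw new ClassNotFoundException("Access to " + name + " is not allowed");
        }
        return super.loadClass(name, resolve);
    }
}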
You need to look at the SecurityManager. It is called by many JVM classes to check, before they perform their work, whether they have the permission needed.
You can implement your own SecurityManager. Tutorial.
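A minimal sketch of what such a SecurityManager could look like (the checks shown are illustrative only; a real one would have to permit the reads the JVM and your own framework need, e.g. for class loading, and only reject calls originating from untrusted code):

class SandboxSecurityManager extends SecurityManager {
    @Override
    public void checkRead(String file) {
        throw new SecurityException("File read not allowed: " + file);
    }

    @Override
    public void checkWrite(String file) {
        throw new SecurityException("File write not allowed: " + file);
    }

    @Override
    public void checkConnect(String host, int port) {
        throw new SecurityException("Network access not allowed: " + host + ":" + port);
    }
}

// installed once at startup, before running untrusted code:
System.setSecurityManager(new SandboxSecurityManager());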
I'm using JCaptcha in a project and needed a behaviour that was not directly available, so I looked into the source code to see if I could extend it to obtain what I want, and found that the store implementation I use (MapCaptchaStore) uses a HashMap as the store... with no synchronization.
I know JCaptcha does not work in a clustered environment, it is not my case, but how about multiple clients at the same time? Is the store implementation synchronized externally or should I roll my own and make sure it is properly synchronized?
TIA!
Judging by reading the source for MapCaptchaStore, this class is NOT thread-safe. I'm not 100% willing to stand behind this answer though, because synchronisation may be happening at a higher level (e.g. all accesses to a single instance of MapCaptchaStore may be synchronised on another object).
You could use another implementation of CaptchaStore. For example, EhcacheCaptchaStore
The basic HashMap implementation of the captcha store is not synchronized, which could lead to some weird behaviour.
Other stores are thread safe, for a simple implementation use FastHashMapCaptchaStore.
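This is not JCaptcha-specific code, just the underlying point: a plain HashMap shared between request threads needs external synchronization, for example:

// unsafe when accessed by multiple request threads without locking:
Map<String, Object> store = new HashMap<>();

// minimally safe alternatives:
Map<String, Object> syncStore = Collections.synchronizedMap(new HashMap<>());
Map<String, Object> concurrentStore = new ConcurrentHashMap<>();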
I'm assuming it is thread-safe, because it has been designed to be integrated with web applications, which will always have multiple clients. It's also a CAPTCHA framework, so they must have tested it with both human and computer clients.
However, I would still recommend testing whether it behaves correctly in a multithreaded environment.
I'm developing a system that allows developers to upload custom groovy scripts and freemarker templates.
I can provide a certain level of security at a very high level with the default Java security infrastructure, i.e. prevent code from accessing the filesystem or network. However, I also need to restrict access to specific methods.
My plan was to modify the Groovy and Freemarker runtimes to read Annotations that would either whitelist or blacklist certain methods, however this would force me to maintain a forked version of their code, which is not desirable.
All I essentially need to be able to do is prevent the execution of specific methods when they are called from Groovy or Freemarker. I've considered a hack that would look at the call stack, but this would be a massive speed hit (and it's quite messy).
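The call-stack hack I have in mind would look roughly like this (the script package prefix is hypothetical), called at the start of every restricted method:

static void rejectScriptCallers() {
    for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
        // if any frame comes from uploaded script code, refuse the call
        if (frame.getClassName().startsWith("com.example.userscripts.")) {
            throw new SecurityException("Method not allowed from uploaded scripts");
        }
    }
}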
Does anyone have any other ideas for implementing this?
You can do it by subclassing GroovyClassLoader and enforcing your constraints within an AST visitor. This post explains how to do it: http://hamletdarcy.blogspot.com/2009/01/groovy-compile-time-meta-magic.html
Also, the code referenced there is in the samples folder of Groovy 1.6 installer.
You should have a look at the groovy-sandbox project from kohsuke. Also have a look at his blog post on this topic and what his solution addresses: sandboxing, but with a performance drawback.
OSGi is great for this. You can partition your code into bundles and set exactly what each bundle exposes, and to what other bundles. Would that work for you?
You might also consider the java-sandbox (http://blog.datenwerke.net/p/the-java-sandbox.html), a recently developed library that allows you to securely execute untrusted code from within Java.
Also see: http://blog.datenwerke.net/2013/06/sandboxing-groovy-with-java-sandbox.html