Custom operations using Stream API - java

I am implementing a batch program in Java. The flow is as follows: I fetch data from a database, convert the rows into custom objects, and then put those objects in a queue. The goal is then to run some profiling logic (for example NLP) on them afterwards. A friend of mine told me I should consider using the Java Stream API, since it supports parallel processing. I am relatively new to Java 8, so my question is: where do I put (or execute) the mentioned profiling logic? Is there a way to create custom operations, or do I have to implement a custom Collector?
Thank you in advance.

As mentioned in the comments by @MarkoTopolnik, map is the solution.
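A minimal sketch of that idea (queue, ProfilingResult and profile(...) are stand-ins for your actual types and logic, not a known API):

// queue holds the custom objects built from the DB rows (any Collection works)
List<ProfilingResult> results = queue.parallelStream() // or .stream() for sequential processing
        .map(obj -> profile(obj))                      // the profiling (e.g. NLP) logic goes into map
        .collect(Collectors.toList());

No custom Collector is needed unless you want a custom way of accumulating the results: map applies your logic to each element, and parallelStream() takes care of the parallelism.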

Related

Reactively reacting to Map updates

I have an application based on Java and Spring (Spring Boot, Spring Reactive, Spring Kafka) which continuously consumes information from a Kafka topic and stores the data by key in a ConcurrentHashMap (via a simple wrapper). The application also contains a REST API for fetching streaming information using a reactive Flux.
I would like to find a way to call the API for data from the map (using a key), where the response is a stream of the value currently associated with that key together with subsequent changes to the value (as updated from the topic), i.e. without closing the stream.
It feels like this should be possible using maybe a PropertyChangeListener combined with Flux.generate, but my reactive skills are too weak to see how I should achieve this. I've made some attempts, but I can't see how to get the generator to emit on PropertyChangeEvents.
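The closest I've gotten is a sketch along these lines, using Flux.create instead of generate, since create supports push-style emission from a listener (ObservableMap is a simplified stand-in for my wrapper, not the real code):

import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.util.concurrent.ConcurrentHashMap;
import reactor.core.publisher.Flux;

class ObservableMap<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final PropertyChangeSupport support = new PropertyChangeSupport(this);

    // called by the Kafka consumer on every record;
    // note: firePropertyChange skips events where the old and new value are equal
    void put(K key, V value) {
        V old = map.put(key, value);
        support.firePropertyChange(key.toString(), old, value);
    }

    // emits the current value for the key (if any), then every subsequent update
    @SuppressWarnings("unchecked")
    Flux<V> watch(K key) {
        return Flux.create(sink -> {
            PropertyChangeListener listener = evt -> sink.next((V) evt.getNewValue());
            support.addPropertyChangeListener(key.toString(), listener);
            sink.onDispose(() -> support.removePropertyChangeListener(key.toString(), listener));
            V current = map.get(key);
            if (current != null) {
                sink.next(current);
            }
        });
    }
}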
Would this be possible?
If anyone could provide me with an example of this, or maybe point me to one online, it would be much appreciated.
BR

Is it recommended to convert an ArrayList into Reactor's Flux for processing data?

One of my colleagues told me that instead of processing the data with the Stream API on a List, I should create a Flux from that List and process my data through it.
But this doesn't make sense to me, because I thought reactive streams were generally useful for blocking I/O, not for in-memory data processing.
Can someone verify whether the approach suggested by my colleague is correct?
And if it is correct, what are the advantages of it over my previous method (performance-wise)?
//Model
class RootObject {
    List<RootNode> rootNodes;
}
//My current code
monoOfRootObject.map(rootObject ->
        rootObject.getRootNodes().stream()
                .(..do some filtering and replacement..));
//Proposed code according to my colleague
monoOfRootObject.map(rootObject ->
        Flux.fromIterable(rootObject.getRootNodes())
                .(..do some filtering and replacement..));
Please help, I am a bit new to Reactor (and to functional programming in general).
Thanks
Yes, you're right. Reactor, and Reactive Streams in general, are useful when you need to deal with asynchronous data and/or concurrency.
For regular filtering and transformation of an in-memory list, a Java Stream is totally fine, and using a Reactive Stream is overkill (and probably also adds overhead performance-wise).

Why would you prefer the Java 8 Stream API instead of direct Hibernate/SQL queries when working with the DB?

Recently I have seen a lot of code in a few projects using streams for filtering objects, like:
library.stream()
.map(book -> book.getAuthor())
.filter(author -> author.getAge() >= 50)
.map(Author::getSurname)
.map(String::toUpperCase)
.distinct()
.limit(15)
.collect(toList());
Are there any advantages of using that instead of a direct HQL/SQL query to the database that returns the already-filtered results?
Isn't the second approach much faster?
If the data originally comes from a DB, it is better to do the filtering in the DB rather than fetching everything and filtering locally.
First, database management systems are good at filtering: it is part of their main job, and they are therefore optimized for it. The filtering can also be sped up by using indexes.
Second, fetching and transmitting many records, and unmarshalling the data into objects, just to throw most of them away during local filtering, is a waste of bandwidth and computing resources.
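For example, the pipeline from the question maps fairly directly onto a single JPA/Hibernate query along these lines (a sketch, assuming Author is a mapped entity with surname and age fields):

List<String> surnames = entityManager.createQuery(
        "select distinct upper(a.surname) from Author a where a.age >= 50",
        String.class)
    .setMaxResults(15)    // the limit(15) from the stream version
    .getResultList();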
At first glance: streams can be made to run in parallel, just by changing the code to use parallelStream(). (Disclaimer: of course it depends on the specific context whether just changing the stream type will lead to correct results; but yes, it can be that easy.)
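For illustration, the sequential pipeline from the question would become parallel like this (whether it is actually faster depends on the data size and the cost of each step):

library.parallelStream()
    .map(Book::getAuthor)
    .filter(author -> author.getAge() >= 50)
    .map(Author::getSurname)
    .map(String::toUpperCase)
    .distinct()    // note: distinct() and limit() are more expensive on ordered parallel streams
    .limit(15)
    .collect(toList());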
Then: streams "invite" the use of lambda expressions. And those in turn lead to the use of invokedynamic bytecode instructions, sometimes gaining performance advantages compared to the "old-school" way of writing such code. (And to clarify a common misunderstanding: invokedynamic is a property of lambdas, not of streams!)
These would be reasons to prefer "stream" solutions nowadays (from a general point of view).
Beyond that: it really depends ... let's have a look at your example input. It looks like you are dealing with ordinary Java POJOs that already reside in memory, within some sort of collection. Processing such objects in memory directly would definitely be faster than going to some off-process database to do the work there!
But, of course: if the above calls, like book.getAuthor(), were doing a "deep dive" and actually talking to an underlying database, then chances are that doing the whole thing in a single query would give you better performance.
The first thing to realize is that you can't tell from this code alone what statement is issued against the database. It might very well be that all the filtering, limiting and mapping is collected, and upon the invocation of collect all that information is used to construct a matching SQL statement (or whatever query language is used), which is then sent to the database.
With this in mind, there are many reasons why stream-like APIs are used.
It is hip. Streams and lambdas are still rather new to most Java developers, so they feel cool when they use them.
If something like the mechanism in the first paragraph is in play, it actually creates a nice DSL for constructing your query statements. Scala's Slick and .NET's LINQ were early examples I know about, although I assume somebody built something like it in Lisp long before I was born.
The streams might be reactive streams and encapsulate a non-blocking API. These APIs are really nice because they don't force you to block resources like threads while you are waiting for results. Using them requires either tons of callbacks or a much nicer stream-based API to process the results.
They are nicer to read than the equivalent imperative code. Maybe the processing done in the stream can't [easily/by the author] be done with SQL. So the alternatives aren't SQL vs. Java (or whatever language you are using), but imperative Java vs. "functional" Java. The latter often reads nicer.
So there are good reasons to use such an API.
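As a present-day Java example of that DSL idea: with jOOQ, the query from the question could be expressed roughly like this (a sketch; create is a jOOQ DSLContext, and AUTHOR is a table constant produced by jOOQ's code generator, so the schema is an assumption):

import static org.jooq.impl.DSL.upper;

List<String> surnames = create.selectDistinct(upper(AUTHOR.SURNAME))
        .from(AUTHOR)
        .where(AUTHOR.AGE.ge(50))
        .limit(15)
        .fetch(upper(AUTHOR.SURNAME)); // executed as a single SQL statement in the DB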
With all that said: It is, in almost all cases, a bad idea to do any sorting/filtering and the like in your application, when you can offload it to the database. The only exception I can currently think of is when you can skip the whole roundtrip to the database, because you already have the result locally (e.g. in a cache).
Well, your question should ideally be: is it better to do reduction/filtering operations in the DB, or to fetch all records and do it in Java using streams?
The answer isn't straightforward and any stats that give a "concrete" answer will not generalize to all cases.
The operations you are talking about are better done in the DB itself, because that is what DBs are designed for: very fast handling of data. Of course, in relational databases there is usually some book-keeping and locking involved to ensure that independent transactions don't end up making the data inconsistent, but even with that, DBs do a pretty good job of filtering data, especially on large data sets.
One case where I would prefer filtering data in Java code rather than in the DB is when you need to extract different features from the same data (see the sketch below). For example, right now you are getting only the author's surname. If you also wanted all books written by the author, the ages of authors, the authors' children, places of birth, etc., then it makes sense to fetch only one "read-only" copy from the DB and use parallel streams to derive the different pieces of information from the same data set.
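A sketch of that idea, reusing Book/Author from the question (fetchAllBooks() stands in for the single DB round trip, and distinct() assumes Author implements equals/hashCode):

List<Book> books = fetchAllBooks(); // one read-only fetch from the DB

List<String> surnames = books.parallelStream()
        .map(Book::getAuthor)
        .map(Author::getSurname)
        .distinct()
        .collect(toList());

IntSummaryStatistics ages = books.parallelStream()
        .map(Book::getAuthor)
        .distinct()
        .mapToInt(Author::getAge)
        .summaryStatistics();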
Unless measured and proven for a specific scenario, either could be good or equally bad. The reasons you usually want to push such queries down to the database are (among other things):
The DB can handle much larger data sets than your Java process.
Queries in a database can be backed by indexes (making them much faster).
On the other hand, if your data is small, using a Stream the way you did is effective. Writing such a Stream pipeline is very readable (once you speak Streams well enough).
Hibernate and other ORMs are usually far more useful for writing entities than for reading, because they allow developers to offload the ordering of specific writes to a framework that will almost never get it wrong.
For reading and reporting, on the other hand (and considering we are talking about a DB here), an SQL query is likely to be better, because there will be no framework in between, and you will be able to tune query performance in terms of the database that will be executing the query rather than in terms of your framework of choice, which gives more flexibility to how that tuning can be done.

Using Stream API for organising application pipeline

As far as I know, the Stream API is intended to be applied to collections. But I like the idea of streams so much that I try to apply them when I can, and also when I shouldn't.
Originally my app had two threads communicating through a BlockingQueue. The first would populate new elements; the second would transform them and save them to disk. It looked like a perfect stream opportunity to me at the time.
Code I ended up with:
Stream.generate(...).flatMap(...).filter(...).forEach(...)
I'd like to put a few maps in there, but it turns out I have to drag one additional field all the way to forEach. So I either have to create a meaningless class with two fields and an obscure name, or use AbstractMap.SimpleEntry to carry both fields through, which doesn't seem like a great deal to me.
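Concretely, the SimpleEntry workaround looks something like this (fetchNext, computeExtra, isWanted and save are placeholders for my actual logic):

Stream.generate(this::fetchNext)            // fetchNext() blocks until elements are available
        .flatMap(List::stream)              // each fetch may yield several elements
        .map(item -> new AbstractMap.SimpleEntry<>(item, computeExtra(item)))
        .filter(entry -> isWanted(entry.getKey()))
        .forEach(entry -> save(entry.getKey(), entry.getValue()));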
Anyway, I rewrote my app and it even seems to work. However, there are some caveats. As I have an infinite stream, 'the thing' can't be stopped. For now I'm starting it on a daemon thread, but this is not a solution. The business logic (like handling connection loss/recovery, which is probably not BL) looks alienated. Maybe I just need a proxy for this.
On the other hand, there is free laziness with the queue population. One thread instead of two (not sure how good this is). And it is hopefully a familiar pattern for other developers.
So my question is: how viable is using the Stream API for organising application flow? Are there more hidden pitfalls? If it's not recommended, what are the alternatives?

Serialize Java objects into Java code

Does somebody know a Java library which serializes a Java object hierarchy into Java code that regenerates this object hierarchy? Like object/XML serialization, only that the output format is not binary/XML but Java code.
Serialised data represents the internal state of objects. There isn't enough information to work out what methods you would need to call on the objects to reproduce that internal state.
There are two obvious approaches:
Encode the serialised data in a literal String and deserialise that.
Use java.beans XML persistence (XMLEncoder), which should be easy enough to process with your favourite XML->Java source technique; see the sketch below.
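A minimal sketch of the java.beans approach (the List here is just a stand-in for a bean-style object hierarchy that XMLEncoder knows how to persist):

import java.beans.XMLEncoder;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BeanToXml {
    public static void main(String[] args) throws Exception {
        List<String> someBean = new ArrayList<>(Arrays.asList("a", "b"));
        try (XMLEncoder encoder = new XMLEncoder(
                new BufferedOutputStream(new FileOutputStream("object.xml")))) {
            encoder.writeObject(someBean); // writes an XML description of the
        }                                  // calls needed to rebuild the object
    }
}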
I am not aware of any libraries that will do this out of the box, but you should be able to take one of the many object-to-XML serialisation libraries and customise the backend code to generate Java instead. It would probably not be much code.
For example, a quick Google search turned up XStream. I've never used it, but it seems to support multiple backends other than XML, e.g. JSON. You can implement your own writer and just write out the Java code needed to recreate the hierarchy.
I'm sure you could do the same with other libraries, in particular if you can hook into a SAX event stream.
See:
HierarchicalStreamWriter
Great question. I was thinking about serializing objects into Java code to make testing easier. The use case would be to load some data into a DB, then generate the code that creates the corresponding object, and later use this code in test methods to initialize data without needing to access the DB.
It is somewhat true that the object state doesn't contain enough info to know how it was created and transformed; however, for simple Java beans there is no reason why this shouldn't be possible.
Do you feel like writing a small library for this purpose? I'll start coding soon!
XStream is a serialization library I used for serialization to XML. It should be possible and rather easy to extend it so that it writes Java code.
