Recently I've been trying to reimplement my data parser using streams in Java, but I can't figure out how to do one specific thing:
Consider object A with a timestamp.
Consider object B which is made of various A objects.
Consider some metric which tells us the time range for object B.
What I have now is a stateful method which goes through the list of A objects; if an A fits into the last B, it goes there, otherwise the method creates a new B instance and starts putting A objects there.
I would like to do this in a streams way:
Take the whole list of A objects as a stream. Now I need to figure out a function which will create "chunks" and accumulate them into B objects. How do I do that?
Thanks
EDIT:
A and B are complex, but I will try to post here some simplified version.
class A {
    private final long time;

    A(long time) {
        this.time = time;
    }

    long getTime() {
        return time;
    }
}
class B {
    // not important, built from the "full" TemporaryB class
    // result of accumulation
}
class TemporaryB {
    private static final long THRESHOLD = 1000; // time range covered by one B

    private final long startingTime;
    private int counter;

    public TemporaryB(A a) {
        this.startingTime = a.getTime();
    }

    boolean fits(A a) {
        return a.getTime() - startingTime < THRESHOLD;
    }

    void add(A a) {
        counter++;
    }
}
class Accumulator {
    private List<B> accumulatedB;
    private TemporaryB temporaryB;

    public void addA(A a) {
        if (temporaryB.fits(a)) {
            temporaryB.add(a);
        } else {
            accumulatedB.add(new B(temporaryB));
            temporaryB = new TemporaryB(a);
        }
    }
}
OK, so this is a very simplified version of what I do now. I don't like it; it's ugly.
In general such a problem is badly suited to the Stream API, as you may need non-local knowledge, which makes parallel processing harder. Imagine that you have new A(1), new A(2), new A(3) and so on up to new A(1000), with THRESHOLD set to 10, so you basically need to combine the input into batches of 10 elements. Here we have the same problem as discussed in this answer: when we split the task into subtasks, the suffix part may not know exactly how many elements are in the prefix part, so it cannot even start combining data into batches until the whole prefix is processed. Your problem is essentially serial.
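For comparison, the plain sequential version is at least compact. Here's a minimal loop sketch using the classes from your question (it assumes, as your Accumulator does, that B exposes a constructor taking the finished TemporaryB):
static List<B> chunk(List<A> input) {
    List<B> result = new ArrayList<>();
    TemporaryB tb = null;
    for (A a : input) {
        if (tb == null) {
            tb = new TemporaryB(a);
        } else if (tb.fits(a)) {
            tb.add(a);
        } else {
            result.add(new B(tb));
            tb = new TemporaryB(a);
        }
    }
    if (tb != null) {
        result.add(new B(tb)); // flush the last chunk
    }
    return result;
}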
On the other hand, there's a solution provided by the new headTail method in my StreamEx library. This method parallelizes badly, but with it you can define almost any operation in just a few lines.
Here's how to solve your problem with headTail:
static StreamEx<TemporaryB> combine(StreamEx<A> input, TemporaryB tb) {
    return input.headTail((head, tail) ->
        tb == null ? combine(tail, new TemporaryB(head)) :
        tb.fits(head) ? combine(tail, tb.add(head)) :
        combine(tail, new TemporaryB(head)).prepend(tb),
        () -> StreamEx.ofNullable(tb));
}
Here I modified your TemporaryB method this way:
TemporaryB add(A a) {
    counter++;
    return this;
}
Sample (assuming Threshold = 1000):
List<A> input = Arrays.asList(new A(1), new A(10), new A(1000), new A(1001),
        new A(1002), new A(1003), new A(2000), new A(2002), new A(2003), new A(2004));
Stream<B> streamOfB = combine(StreamEx.of(input), null).map(B::new);
streamOfB.forEach(System.out::println);
Output (I wrote simple B.toString()):
B [counter=2, startingTime=1]
B [counter=3, startingTime=1001]
B [counter=2, startingTime=2002]
So here you actually have a lazy Stream of B.
Explanation:
StreamEx.headTail takes two lambdas. The first is called at most once, when the input stream is non-empty; it receives the first stream element (head) and a stream containing all the other elements (tail). The second is called at most once, when the input stream is empty, and receives no parameters. Both should produce an output stream which is used instead. So what we have here:
return input.headTail((head, tail) ->
tb == null is the starting case, create new TemporaryB from the head and call self with the tail:
tb == null ? combine(tail, new TemporaryB(head)) :
tb.fits(head) ? Ok, just add the head into existing tb and call self with the tail:
tb.fits(head) ? combine(tail, tb.add(head)) :
Otherwise again create new TemporaryB(head), but also prepend the output stream with the current tb (actually emitting a new element into target stream):
combine(tail, new TemporaryB(head)).prepend(tb),
Input stream is exhausted? Ok, return the last gathered tb if any:
() -> StreamEx.ofNullable(tb));
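As an aside, this pair of lambdas is enough to express many other operations too. For example, here's a takeWhile sketch built on the same headTail method (my illustration, not part of the solution above; returning null from the first lambda yields an empty stream):
static <T> StreamEx<T> takeWhile(StreamEx<T> input, Predicate<T> predicate) {
    // keep emitting elements while the predicate holds, then cut the stream off
    return input.headTail((head, tail) ->
        predicate.test(head) ? takeWhile(tail, predicate).prepend(head) : null);
}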
Note that the headTail implementation guarantees that this solution, while looking recursive, does not consume more than a constant amount of stack and heap. You can check it on thousands of input elements if you doubt it:
Stream<B> streamOfB = combine(LongStreamEx.range(100000).mapToObj(A::new), null).map(B::new);
streamOfB.forEach(System.out::println);
Related
Every 5 minutes, within a 20-minute cycle, I need to retrieve data. Currently I'm using a map data structure.
Is there a better data structure? Every time I read and set the data, I have to write the structure to a file to prevent data loss if the program restarts.
For example, if the initial data in the map is:
{-1:"result1",-2:"result2",-3:"result3",-4:"result4"}
I want to get the last (-4) period's value, which is "result4", and set the new value "result5", so that the updated map will be:
{-1:"result5",-2:"result1",-3:"result2",-4:"result3"}
And again, I want to get the last (-4) period's value, which is now "result3", and set the new value "result6", so the map will be:
{-1:"result6",-2:"result5",-3:"result1",-4:"result2"}
The code:
private static String getAndSaveValue(int a) {
    // read the map from file
    HashMap<Long, String> resultMap = getMapFromFile();
    String value = resultMap.get(-4L);
    // shift every entry down one slot: -4 <- -3, -3 <- -2, -2 <- -1
    for (long i = 4; i >= 2; i--) {
        resultMap.put(-i, resultMap.get(1 - i));
    }
    resultMap.put(-1L, "result" + a);
    // save the map to file
    saveMapToFile(resultMap);
    return value;
}
Based on your requirement, I think the LinkedList data structure will be suitable:
public class Test {
    public static void main(String[] args) {
        LinkedList<String> ls = new LinkedList<String>();
        ls.push("result4");
        ls.push("result3");
        ls.push("result2");
        ls.push("result1");
        System.out.println(ls);
        ls.push("result5"); // pushing new value
        System.out.println("Last value:" + ls.pollLast()); // this will return `result4`
        System.out.println(ls);
        ls.push("result6"); // pushing new value
        System.out.println("Last value:" + ls.pollLast()); // this will give you `result3`
        System.out.println(ls);
    }
}
Output:
[result1, result2, result3, result4]
Last value:result4
[result5, result1, result2, result3]
Last value:result3
[result6, result5, result1, result2]
Judging by your example, you need a FIFO data structure with a bounded size.
There's no bounded general-purpose implementation of the Queue interface in the JDK; only the concurrent implementations can be bounded. But if you're not going to use it in a multithreaded environment, they're not the best choice, because thread safety doesn't come for free: concurrent collections are slower, and can also be confusing for the reader of your code.
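For completeness, the bounded concurrent variant would look something like this (note that its offer() rejects the element instead of evicting the oldest one):
Queue<String> bounded = new ArrayBlockingQueue<>(4);
bounded.offer("result1"); // returns true while there is capacity
// once full, offer() returns false rather than evicting the head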
To achieve your goal, I suggest you use composition, wrapping an ArrayDeque, which is an array-based implementation of Queue and performs much better than LinkedList.
Note that the preferred approach is not to extend ArrayDeque (an IS-A relationship) and override its methods add() and offer(), but to include it in a class as a field (a HAS-A relationship), so that all method calls on an instance of your class are forwarded to the underlying collection. You can find more information on this approach in the item "Favor composition over inheritance" of Effective Java by Joshua Bloch.
public class BoundQueue<T> {
    private Queue<T> queue;
    private int limit;

    public BoundQueue(int limit) {
        this.queue = new ArrayDeque<>(limit);
        this.limit = limit;
    }

    public void offer(T item) {
        if (queue.size() == limit) {
            queue.poll(); // or throw new IllegalStateException() depending on your needs
        }
        queue.add(item);
    }

    public T poll() {
        return queue.poll();
    }

    public boolean isEmpty() {
        return queue.isEmpty();
    }
}
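A quick usage sketch matching the example data from the question (poll first if you need the evicted value back, as in your getAndSaveValue()):
BoundQueue<String> queue = new BoundQueue<>(4);
queue.offer("result1");
queue.offer("result2");
queue.offer("result3");
queue.offer("result4");
String oldest = queue.poll(); // "result1" - read the oldest slot
queue.offer("result5");       // queue now holds result2, result3, result4, result5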
I have a method which performs processing on a stream. Part of that processing needs to be done under the control of a lock - one locked section for processing all the elements - but some of it doesn't (and shouldn't be, because it might be quite time-consuming). So I can't just say:
Stream<V> preprocessed = Stream.of(objects).map(this::preProcess);
Stream<V> toPostProcess;
synchronized (lockObj) {
    toPostProcess = preprocessed.map(this::doLockedProcessing);
}
toPostProcess.map(this::postProcess).forEach(System.out::println);
because the calls to doLockedProcessing would only be executed when the terminal operation forEach is invoked, and that is outside the lock.
So I think I need to make a copy of the stream, using a terminal operation, at each stage so that the right bits are done at the right time. Something like:
Stream<V> preprocessed = Stream.of(objects).map(this::preProcess).copy();
Stream<V> toPostProcess;
synchronized (lockObj) {
    toPostProcess = preprocessed.map(this::doLockedProcessing).copy();
}
toPostProcess.map(this::postProcess).forEach(System.out::println);
Of course, the copy() method doesn't exist, but if it did it would perform a terminal operation on the stream and return a new stream containing all the same elements.
I'm aware of a few ways of achieving this:
(1) Via an array (not so easy if the element type is a generic type):
copy = Stream.of(stream.toArray(String[]::new));
(2) Via a list:
copy = stream.collect(Collectors.toList()).stream();
(3) Via a stream builder:
Stream.Builder<V> builder = Stream.builder();
stream.forEach(builder);
copy = builder.build();
What I want to know is: which of these methods is the most efficient in terms of time and memory? Or is there another way which is better?
I think you have already mentioned all the possible options. There's no other structural way to do what you need. First, you'd have to consume the original stream. Then, create a new stream, acquire the lock and consume this new stream (thus invoking your locked operation). Finally, create yet another stream, release the lock and go on processing this newer stream.
Of all the options you are considering, I would use the third one, because the number of elements it can handle is only limited by memory, meaning it doesn't have an implicit max-size restriction, as e.g. ArrayList has (it can contain about Integer.MAX_VALUE elements).
Needless to say, this would be quite an expensive operation, both in time and space. You could do it as follows:
Stream<V> temp = Stream.of(objects)
    .map(this::preProcess)
    .collect(Stream::<V>builder,
             Stream.Builder::accept,
             (b1, b2) -> b2.build().forEach(b1))
    .build();
synchronized (lockObj) {
    temp = temp
        .map(this::doLockedProcessing)
        .collect(Stream::<V>builder,
                 Stream.Builder::accept,
                 (b1, b2) -> b2.build().forEach(b1))
        .build();
}
temp.map(this::postProcess).forEach(System.out::println);
Note that I've used a single Stream instance temp, so that intermediate streams (and their builders) can be garbage-collected, if needed.
As suggested by @Eugene in the comments, it would be nice to have a utility method to avoid code duplication. Here's such a method:
public static <T> Stream<T> copy(Stream<T> source) {
    return source.collect(Stream::<T>builder,
                          Stream.Builder::accept,
                          (b1, b2) -> b2.build().forEach(b1))
                 .build();
}
Then, you could use this method as follows:
Stream<V> temp = copy(Stream.of(objects).map(this::preProcess));
synchronized (lockObj) {
    temp = copy(temp.map(this::doLockedProcessing));
}
temp.map(this::postProcess).forEach(System.out::println);
I created a benchmark test which compares the three methods. This suggested that using a List as the intermediate store is about 30% slower than using an array or a Stream.Builder, which are similar. I am therefore drawn to using a Stream.Builder because converting to an array is tricky where the element type is a generic type.
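The benchmark code itself isn't shown here; a minimal JMH harness along these lines could reproduce the comparison (a sketch; the class name, element count, and payload are illustrative):
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class StreamCopyBenchmark {
    private List<String> source;

    @Setup
    public void setup() {
        source = IntStream.range(0, 10_000)
                          .mapToObj(i -> "item" + i)
                          .collect(Collectors.toList());
    }

    @Benchmark
    public long viaArray() {
        return Stream.of(source.stream().toArray(String[]::new)).count();
    }

    @Benchmark
    public long viaList() {
        return source.stream().collect(Collectors.toList()).stream().count();
    }

    @Benchmark
    public long viaBuilder() {
        Stream.Builder<String> builder = Stream.builder();
        source.stream().forEach(builder); // Stream.Builder is a Consumer
        return builder.build().count();
    }
}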
I've ended up writing a little function that creates a Collector which uses a Stream.Builder as the intermediate store:
private static <T> Collector<T, Stream.Builder<T>, Stream<T>> copyCollector() {
    return Collector.of(Stream::builder, Stream.Builder::add, (b1, b2) -> {
        b2.build().forEach(b1);
        return b1;
    }, Stream.Builder::build);
}
I can then make a copy of any stream str by doing str.collect(copyCollector()) which feels quite in keeping with the idiomatic usage of streams.
The original code I posted would then look like this:
Stream<V> preprocessed = Stream.of(objects).map(this::preProcess).collect(copyCollector());
Stream<V> toPostProcess;
synchronized (lockObj) {
    toPostProcess = preprocessed.map(this::doLockedProcessing).collect(copyCollector());
}
toPostProcess.map(this::postProcess).forEach(System.out::println);
Wrap doLockedProcessing in synchronization. Here's one way:
class SynchronizedFunction<T, R> implements Function<T, R> {
    private final Function<T, R> function;

    public SynchronizedFunction(Function<T, R> function) {
        this.function = function;
    }

    @Override
    public synchronized R apply(T t) {
        return function.apply(t);
    }
}
Then use that in your stream:
stream.parallel()
    .map(this::preProcess)
    .map(new SynchronizedFunction<>(this::doLockedProcessing))
    .forEach(this::postProcessing);
This will process the locked code serially, but stay parallel otherwise.
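If you'd rather not define a class, the same idea fits in a small factory method that synchronizes on an explicit lock, which matches the lockObj from the question (a sketch; locked() is my name for it):
static <T, R> Function<T, R> locked(Object lock, Function<T, R> function) {
    return t -> {
        synchronized (lock) {
            return function.apply(t);
        }
    };
}

// usage: stream.parallel().map(locked(lockObj, this::doLockedProcessing))...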
Question
How do I execute an action after the last item of an ordered stream is processed but before the stream is closed?
This action should be able to inject zero or more items into the stream pipeline.
Context
I've got a very large file of the form:
MASTER_REF1
 SUBREF1
 SUBREF2
 SUBREF3
MASTER_REF2
MASTER_REF3
 SUBREF1
...
Where SUBREF (if any) is applicable to MASTER_REF and both are complex objects (you can imagine it somewhat like JSON).
On first look I tried something like :
public void process(Path path) throws IOException {
    MyBuilder builder = new MyBuilder();
    Files.lines(path)
        .map(line -> {
            if (line.charAt(0) == ' ') {
                builder.parseSubRef(line);
                return null;
            } else {
                Result result = builder.build();
                builder.parseMasterRef(line);
                return result;
            }
        })
        // eliminate null
        .filter(Objects::nonNull)
        // some processing on results
        .map(Utils::doSomething)
        // terminal op
        .forEachOrdered(System.out::println);
}
[EDIT] Using forEach here was a bad idea; the right way is forEachOrdered.
But, for obvious reasons, the last item is never appended to the stream : it is still being built.
Therefore I'm wondering how to flush it in the stream at the end of line processing.
Your question sounds confusing. The stream is closed when the close() method is called explicitly, or when a try-with-resources construct is used. In your code sample the stream is not closed at all. To perform a custom action before the stream is closed, you can just write something at the end of the try-with-resources statement.
In your case it seems that you want to concatenate some bogus entry to the stream. There's the Stream.concat() method to do this:
Stream.concat(Files.lines(path), Stream.of("MASTER"))
.map(...) // do all your other steps
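Put together with your original pipeline, that might look like this (a sketch reusing your MyBuilder; it assumes the sentinel line can never be mistaken for a real record):
public void process(Path path) throws IOException {
    MyBuilder builder = new MyBuilder();
    // the bogus trailing "MASTER" entry forces the last real record to be built
    Stream.concat(Files.lines(path), Stream.of("MASTER"))
        .map(line -> {
            if (line.charAt(0) == ' ') {
                builder.parseSubRef(line);
                return null;
            }
            Result result = builder.build();
            builder.parseMasterRef(line);
            return result;
        })
        .filter(Objects::nonNull)
        .map(Utils::doSomething)
        .forEachOrdered(System.out::println);
}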
Finally, note that my StreamEx library, which enhances the Stream API, provides partial reduction methods which are well suited to parsing multi-line entries. The same thing can be done using StreamEx.groupRuns(), which combines adjacent elements into intermediate lists according to a given BiPredicate:
public void process(Path path) throws IOException {
    StreamEx.of(Files.lines(path))
        .groupRuns((line1, line2) -> line2.charAt(0) == ' ')
        // Now stream elements are List<String> starting with MASTER and having
        // all subref strings after that
        .map(record -> {
            MyBuilder builder = new MyBuilder();
            builder.parseMasterRef(record.get(0));
            record.subList(1, record.size()).forEach(builder::parseSubRef);
            return builder.build();
        })
        // eliminate null
        .filter(Objects::nonNull)
        // some processing on results
        .map(Utils::doSomething)
        // terminal op
        .forEach(System.out::println);
}
Now you don't need to use side-effect operations.
The primary problem here is that you are streaming - effectively - two types of records and this makes it difficult to manage because streams are primarily for amorphous data.
I would pre-process the file data and collect it into MasterAndSub records. You can then group these by the master field with groupingBy (see the sketch after the code below).
class MasterAndSub {
    final String master;
    final String sub;

    public MasterAndSub(String master, String sub) {
        this.master = master;
        this.sub = sub;
    }
}

/**
 * Allows me to use a final Holder of a mutable value.
 *
 * @param <T>
 */
class Holder<T> {
    T it;

    public T getIt() {
        return it;
    }

    public T setIt(T it) {
        return this.it = it;
    }
}
public void process(Path path) throws IOException {
    final Holder<String> currentMaster = new Holder<>();
    Files.lines(path)
        .map(line -> {
            if (line.charAt(0) == ' ') {
                return new MasterAndSub(currentMaster.getIt(), line);
            } else {
                return new MasterAndSub(currentMaster.setIt(line), null);
            }
        })
...
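One possible continuation of the elided part, grouping the sub lines under their master line (my sketch, not the original ending):
        .collect(Collectors.groupingBy(
            ms -> ms.master,
            Collectors.mapping(ms -> ms.sub, Collectors.toList())));
// note: master-only records contribute a null sub to their list;
// filter those out afterwards if they get in the way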
I have a function that processes vectors. The size of the input vector can be anything up to a few million elements. The problem is that the function can only process vectors no bigger than 100k elements without problems.
I would like to call the function in smaller parts if the vector has too many elements:
Vector<Stuff> process(Vector<Stuff> input) {
    Vector<Stuff> output = new Vector<>();
    while (true) {
        if (input.size() > 50000) {
            // pseudocode: take and remove the first 50k elements as a subvector
            output.addAll(doStuff(input.pop_front_50k_first_ones_as_subvector()));
        } else {
            output.addAll(doStuff(input));
            break;
        }
    }
    return output;
}
How should I do this?
Not sure if a Vector with millions of elements is a good idea, but Vector implements List, and thus there is subList, which provides a lightweight (non-copying) view of a section of the Vector.
You may have to update your code to work with the List interface instead of only the specific implementation Vector, though (because the sublist returned is not a Vector, and it is just good practice in general).
You probably want to rewrite your doStuff method to take a List rather than a Vector argument,
public Collection<Output> doStuff(List<Stuff> v) {
// calculation
}
(and notice that Vector<T> is a List<T>)
and then change your process method to something like
Vector<Stuff> process(Vector<Stuff> input) {
    Vector<Stuff> output = new Vector<>();
    int startIdx = 0;
    while (startIdx < input.size()) {
        int endIdx = Math.min(startIdx + 50000, input.size());
        output.addAll(doStuff(input.subList(startIdx, endIdx)));
        startIdx = endIdx;
    }
    return output;
}
This should work as long as the input Vector isn't being concurrently updated while the process method is running.
If you can't change the signature of doStuff, you're probably going to need to wrap a new Vector around the result of subList:
output.addAll(doStuff(new Vector<Stuff>(input.subList(startIdx, endIdx))));
The problem: Maintain a bidirectional many-to-one relationship among java objects.
Something like the Google/Commons Collections bidi maps, but I want to allow duplicate values on the forward side, and have sets of the forward keys as the reverse side values.
Used something like this:
// maintaining disjoint areas on a gameboard. Location is a space on the
// gameboard; Regions refer to disjoint collections of Locations.
MagicalManyToOneMap<Location, Region> forward = // the game universe
Map<Region, Set<Location>> inverse = forward.getInverse(); // live, not a copy
Location parkplace = Game.chooseSomeLocation(...);
Region mine = forward.get(parkplace); // assume !null; should be O(log n)
Region other = Game.getSomeOtherRegion(...);
// moving a Location from one Region to another:
forward.put(parkplace, other);
// or equivalently:
inverse.get(other).add(parkplace); // should also be O(log n) or so
// expected consistency:
assert ! inverse.get(mine).contains(parkplace);
assert forward.get(parkplace) == other;
// and this should be fast, not iterate every possible location just to filter for mine:
for (Location l : mine) { /* do something clever */ }
The simple java approaches are: 1. To maintain only one side of the relationship, either as a Map<Location, Region> or a Map<Region, Set<Location>>, and collect the inverse relationship by iteration when needed; Or, 2. To make a wrapper that maintains both sides' Maps, and intercept all mutating calls to keep both sides in sync.
1 is O(n) instead of O(log n), which is becoming a problem. I started in on 2 and was in the weeds straightaway. (Know how many different ways there are to alter a Map entry?)
This is almost trivial in the sql world (Location table gets an indexed RegionID column). Is there something obvious I'm missing that makes it trivial for normal objects?
I might misunderstand your model, but if your Location and Region have correct equals() and hashCode() implementations, then the set of Location -> Region mappings is just a classical simple Map (multiple distinct keys can point to the same object value). Region -> Set of Locations is a Multimap (available in Google Collections/Guava). You could compose your own class with the proper add/remove methods to manipulate both submaps.
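A sketch of that composition using Guava's HashMultimap (the class and method names here are my own illustration):
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.SetMultimap;

public class ManyToOne<L, R> {
    private final Map<L, R> forward = new HashMap<>();
    private final SetMultimap<R, L> inverse = HashMultimap.create();

    public void put(L l, R r) {
        R old = forward.put(l, r);
        if (old != null) {
            inverse.remove(old, l); // keep the reverse side consistent
        }
        inverse.put(r, l);
    }

    public R get(L l) {
        return forward.get(l);
    }

    public Set<L> getInverse(R r) {
        return inverse.get(r); // live view, as Multimap.get returns
    }
}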
Maybe overkill, but you could also use an in-memory SQL database (HSQLDB, etc.). It allows you to create indexes on many columns.
I think you could achieve what you need with the following two classes. While it does involve two maps, they are not exposed to the outside world, so there shouldn't be a way for them to get out of sync. As for storing the same "fact" twice, I don't think you'll get around that in any efficient implementation, whether the fact is stored twice explicitly (as it is here) or implicitly (as it would be when your database creates an index to make joins on your two tables more efficient). You can add new things to the MagicSet and it will update both mappings, or you can add things to the MagicMapper, which will then update the inverse map automatically. I haven't been able to run this through a compiler, but it should be enough to get you started. What puzzle are you trying to solve?
public class MagicSet<L, R> {
    private final Map<L, R> forward;
    private final R r;
    private final Set<L> set;

    public MagicSet(Map<L, R> forward, R r) {
        this.forward = forward;
        this.r = r;
        this.set = new HashSet<L>();
    }

    public void add(L l) {
        set.add(l);
        forward.put(l, r);
    }

    public void remove(L l) {
        set.remove(l);
        forward.remove(l);
    }

    public int size() {
        return set.size();
    }

    public boolean contains(L l) {
        return set.contains(l);
    }

    // Caution: do not use the remove method from this iterator. If this class were
    // going to be reused often, you would want to return a wrapped iterator that
    // handled remove properly. In fact, if you did that, you could extend AbstractSet
    // and MagicSet would then fully implement java.util.Set.
    public Iterator<L> iterator() {
        return set.iterator();
    }
}
public class MagicMapper<L, R> { // note that it doesn't implement Map, though it could with some extra work; I don't get the impression you need that, though
    private final Map<L, R> forward;
    private final Map<R, MagicSet<L, R>> inverse;

    public MagicMapper() {
        forward = new HashMap<L, R>();
        inverse = new HashMap<R, MagicSet<L, R>>();
    }

    public R getForward(L key) {
        return forward.get(key);
    }

    // this assumes you want a null if you try to use a key that has no mapping;
    // otherwise you'd return an empty MagicSet
    public MagicSet<L, R> getBackward(R key) {
        return inverse.get(key);
    }

    public void put(L l, R r) {
        R oldVal = forward.get(l);
        // if the L had already belonged to an R, we need to undo that mapping
        MagicSet<L, R> oldSet = inverse.get(oldVal);
        if (oldSet != null) {
            oldSet.remove(l);
        }
        // now get the set the R belongs to, and add to it
        MagicSet<L, R> newSet = inverse.get(r);
        if (newSet == null) {
            newSet = new MagicSet<L, R>(forward, r);
            inverse.put(r, newSet);
        }
        newSet.add(l); // magically updates the "forward" map
    }
}
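A quick sanity check of the intended behavior (my example, mirroring the game-board scenario from the question):
MagicMapper<String, String> regions = new MagicMapper<>();
regions.put("parkplace", "mine");
regions.put("boardwalk", "mine");
regions.put("parkplace", "other"); // move parkplace to another region

System.out.println(regions.getForward("parkplace"));                   // other
System.out.println(regions.getBackward("mine").contains("parkplace")); // false
System.out.println(regions.getBackward("mine").contains("boardwalk")); // true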