I have a requirement to split an input string into output strings (in some order)
by applying a set of regexes to the input string. I am thinking of implementing
this functionality with a cluster of Akka actors, in such a way that I scatter the
regexes and the input string and then gather the processed strings. However, I would
like to know how to keep track of the order while collecting the processed strings back.
I am not certain that "Scatter-Gather" is the correct approach; if another pattern is
better suited, please suggest it.
I guess you will have to provide hints to the gatherer on how to assemble the string in order. You don't mention how the order is established: whether the initial split can define the order or whether the regex processing will define it.
In both cases, you need to keep track of three things:
- the initial source,
- the order of each individual piece,
- the total number of pieces.
You could use a message like:
case class StringSegment(id:String, total:Int, seqNr:Int, payload:String)
The scatterer produces StringSegments based on the input:
def scatter(s:String):List[StringSegment] ...
scatter(input).foreach(msg => processingActor ! msg)
and the gatherer will assemble them together, using the seqNr to know the order and total to know when all the pieces are present.
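For illustration, here is a minimal sketch of the gatherer's bookkeeping, written in plain Java for concreteness (inside an actual Akka actor this logic would live in the receive handler; all names here are illustrative):
import java.util.*;

class Gatherer {
    // one buffer per source string, ordered by seqNr
    private final Map<String, SortedMap<Integer, String>> buffers = new HashMap<>();

    // returns the assembled string once all pieces have arrived, or null meanwhile
    String receive(String id, int total, int seqNr, String payload) {
        SortedMap<Integer, String> buf = buffers.computeIfAbsent(id, k -> new TreeMap<>());
        buf.put(seqNr, payload);
        if (buf.size() < total) {
            return null;                          // still waiting for missing pieces
        }
        buffers.remove(id);
        return String.join("", buf.values());     // TreeMap iterates in seqNr order
    }
}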
I want to run a Dataflow job over multiple inputs from Google Cloud Storage, but the paths I want to pass to the job can't be specified with just the * glob operator.
Consider these paths:
gs://bucket/some/path/20160208/input1
gs://bucket/some/path/20160208/input2
gs://bucket/some/path/20160209/input1
gs://bucket/some/path/20160209/input2
gs://bucket/some/path/20160210/input1
gs://bucket/some/path/20160210/input2
gs://bucket/some/path/20160211/input1
gs://bucket/some/path/20160211/input2
gs://bucket/some/path/20160212/input1
gs://bucket/some/path/20160212/input2
I want my job to work on the files in the 20160209, 20160210 and 20160211 directories, but not on 20160208 (the first) or 20160212 (the last). In reality there are many more dates, and I want to be able to specify an arbitrary range of dates for my job to work on.
The docs for TextIO.Read say:
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.
But I can't get this to work. There's a link to Java Filesystem glob patterns, which in turn links to getPathMatcher(String), which lists all the globbing options. One of them is {a,b,c}, which looks exactly like what I need; however, if I pass gs://bucket/some/path/201602{09,10,11}/* to TextIO.Read#from I get "Unable to expand file pattern".
Maybe the docs mean that only *, ? and [...] are supported; if that is the case, how can I construct a glob that Dataflow will accept and that can match an arbitrary date range like the one described above?
Update: I've figured out that I can write a chunk of code so that I can pass the path prefixes in as a comma-separated list, create an input from each, and use the Flatten transform, but that seems like a very inefficient way of doing it. It looks like the first step reads all input files and immediately writes them out again to the temporary location on GCS, and only when all the inputs have been read and written does the actual processing start. This step is completely unnecessary in the job I'm writing. I want the job to read the first file, start processing it, read the next, and so on. This approach just caused a ton of other problems; I'll try to make it work, but it feels like a dead end because of the initial rewriting.
The docs do, indeed, mean that only *, ?, and [...] are supported. This means that arbitrary subsets or ranges in alphabetical or numeric order cannot be expressed as a single glob.
Here are some approaches that might work for you:
If the date represented in the file path is also present in the records in the files, then the simplest solution is to read them all and use a Filter transform to select the date range you are interested in.
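For example, here is a hedged sketch of that first option (extractDate is a hypothetical accessor for the date embedded in each record, and the exact name of the SDK's Filter factory method may vary between versions):
PCollection<String> allLines = ...;   // all inputs, read however you like
PCollection<String> inRange = allLines.apply(
    Filter.by((String line) -> {
        String date = extractDate(line);          // hypothetical helper
        return date.compareTo("20160209") >= 0
            && date.compareTo("20160211") <= 0;
    }));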
The approach you tried, with many reads as separate TextIO.Read transforms flattened together, is OK for small sets of files; our tf-idf example does this. You can express arbitrary numerical ranges with a small number of globs, so this need not be one read per file (for example, the two-character range "23 through 67" is 2[3-9] plus [3-5][0-9] plus 6[0-7]).
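As a hedged sketch of that second option, with the Dataflow 1.x SDK names as far as I recall them (the two globs below cover your 20160209 through 20160211 example, and p is your Pipeline):
List<String> globs = Arrays.asList(
    "gs://bucket/some/path/20160209/*",
    "gs://bucket/some/path/2016021[01]/*");
PCollectionList<String> parts = PCollectionList.empty(p);
for (String glob : globs) {
    parts = parts.and(p.apply(TextIO.Read.from(glob)));
}
PCollection<String> lines = parts.apply(Flatten.<String>pCollections());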
If the subset of files is more arbitrary then the number of globs/filenames may exceed the maximum graph size, and the last recommendation is to put the list of files into a PCollection and use a ParDo transform to read each file and emit its contents.
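A hedged sketch of that last idea (readLines is a hypothetical helper that opens the named GCS object with whatever storage client you use and returns its lines):
PCollection<String> files = p.apply(Create.of(fileList).withCoder(StringUtf8Coder.of()));
PCollection<String> contents = files.apply(ParDo.of(new DoFn<String, String>() {
    @Override
    public void processElement(ProcessContext c) throws Exception {
        for (String line : readLines(c.element())) {   // hypothetical helper
            c.output(line);
        }
    }
}));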
I hope this helps!
Edit
IMHO: I don't think it is a duplicate, because the two questions try to solve the problem in different ways and, especially, because they call for totally different technical skills (and, finally, because I am asking myself both questions).
Question
How can I aggregate items from an ordered stream, preferably in an intermediate operation?
Context
Following my other question: Java8 stream lines and aggregate with action on terminal line
I've got a very large file of the form:
MASTER_REF1
SUBREF1
SUBREF2
SUBREF3
MASTER_REF2
MASTER_REF3
SUBREF1
...
Where each SUBREF (if any) applies to the preceding MASTER_REF, and both are complex objects (you can imagine them somewhat like JSON).
At first I tried to group the lines with an operation returning null while aggregating and a value when a group of lines was complete (a "group" of lines ends when line.charAt(0) != ' ').
This code is hard to read and requires a .filter(Objects::nonNull).
I think one could achieve this using .collect(groupingBy(...)) or .reduce(...), but those are terminal operations, which is:
not required in my case: lines are ordered and should be grouped by their position, and groups of lines are to be transformed afterwards (map+filter+...+forEach);
nor a good idea: I'm talking about a huge data file, way bigger than the total amount of RAM+swap... a terminal operation would saturate the available resources (as said, by design I need to keep each group in memory because it is to be transformed afterwards).
As I already noted in the answer to the previous question, it's possible to use third-party libraries which provide partial reduction operations. One such library is StreamEx, which I develop myself.
In the StreamEx library, partial reduction is an intermediate stream operation which combines several input elements while some condition is met. Usually the condition is specified via a BiPredicate applied to a pair of adjacent stream elements, which returns true when the elements should be combined together. The simplest way to combine elements is to collect them into a List via the StreamEx.groupRuns() method, like this:
Stream<List<String>> records = StreamEx.of(Files.lines(path))
        .groupRuns((line1, line2) -> !line2.startsWith("MASTER"));
Here we start a new record when the second of two adjacent lines starts with "MASTER" (as in your example). Otherwise we continue the previous record.
Note that such a stream is still lazy: in sequential processing at most one intermediate List<String> is created at a time. Parallel processing is also supported, though turning the Files.lines stream into parallel mode rarely improves performance (at least prior to Java 9).
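Since your groups are to be transformed afterwards, you can keep chaining intermediate operations on this stream; a hedged sketch, where parse and handle stand in for your own logic:
StreamEx.of(Files.lines(path))
        .groupRuns((line1, line2) -> !line2.startsWith("MASTER"))
        .map(record -> parse(record))       // parse: hypothetical List<String> -> domain object
        .filter(Objects::nonNull)
        .forEach(obj -> handle(obj));       // handle: hypothetical per-group action
In sequential mode this still materializes only one group at a time, so memory use stays bounded by the largest single group.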
I have to do an exercise for a parallel computing course.
The task is to use N parallel processes to remove all occurrences of the letters "RTY" from a string.
Normally I'd do it with:
String strAfter = str1.replaceAll("[RTY]", "");
But how can I do it in parallel?
Split, work, merge.
Split the input in the main thread, storing the pieces in an ordered collection (a List rather than a Set, since you'll need indexed access)
Create N worker threads
Have each worker thread pick() a piece at the shared index in a synchronized way, increment the index, and process its entry
When the index reaches the collection's size, glue everything back together. You may want to use a StringBuilder and append() instead of concatenating Strings
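Here is a hedged sketch of this split/work/merge idea, using an ExecutorService and Futures instead of a manually synchronized index; since the futures come back in submission order, ordering is preserved for free:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static String parallelStrip(String input, int n) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(n);
    List<Future<String>> parts = new ArrayList<>();
    int chunk = (input.length() + n - 1) / n;              // ceiling division
    for (int i = 0; i < input.length(); i += chunk) {
        String piece = input.substring(i, Math.min(i + chunk, input.length()));
        // [RTY] matches single characters, so chunk boundaries need no overlap
        parts.add(pool.submit(() -> piece.replaceAll("[RTY]", "")));
    }
    StringBuilder sb = new StringBuilder();
    for (Future<String> f : parts) {
        sb.append(f.get());                                // appended in order
    }
    pool.shutdown();
    return sb.toString();
}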
Split the String into N parts, then make each process work on one chunk. The splitting mechanism should be intelligent enough to handle boundary values. You need to communicate each chunk of the String to the corresponding process using the Send() and Recv() methods for processing, and in the end the updated String should be communicated back in the same manner. You can find the Javadocs here: http://mpj-express.org/docs/javadocs/index.html
My guess is you need to find a way to do this without using single-threaded functions on the entire string. What about breaking the string into N parts, letting each of your N parallel processes run the replace function on its part, and concatenating the results after all the threads have finished?
I'm working on implementing probabilistic matching for person record searching. As part of this, I plan to have blocking performed before any scoring is done. Currently there are a lot of good options for transforming strings so that they can be stored and then searched for, with similar strings matching each other (things like Soundex, Metaphone, etc.).
However, I've struggled to find something similar for purely numeric values. For example, it would be nice to be able to block on a social security number without numbers that are off by a digit or have transposed digits being excluded from the results: 123456789 should have blocking results for 123456780 or 213456789.
Now, there are certainly ways to compare two numeric values to determine how similar they are, but what can I do when there are millions of numbers in the database? It's obviously impractical to compare them all (and that would certainly defeat the point of blocking).
What would be nice would be something where those three SSNs above could somehow be transformed into some other value that would be stored. Purely for example, imagine those three numbers ended up as AAABBCCC after this magical transformation. However, something like 987654321 would be ZZZYYYYXX and 123547698 would be AAABCCBC or something like that.
So, my question is: is there a good transformation for numeric values like the ones that exist for alphabetic values? Or is there some other approach that might make sense (besides some highly complex or low-performing SQL or logic)?
The first thing to realize is that social security numbers are basically strings of digits. You really want to treat them like you would strings rather than numbers.
The second thing to realize is that your blocking function maps a record to a list of strings that identify comparison-worthy sets of items.
Here is some Python code to get you started. (I know you asked for Java, but I think the Python is clear, and you aren't paying me enough to write it in Java :P) The basic idea is to take your input record, simulate roughing it up in multiple ways (to get your blocking keys), and then group by any match on those blocking keys.
import itertools

def transpositions(s):
    for pos in range(len(s) - 1):
        yield s[:pos] + s[pos + 1] + s[pos] + s[pos + 2:]

def substitutions(s):
    for pos in range(len(s)):
        yield s[:pos] + '*' + s[pos+1:]

def all_blocks(s):
    return itertools.chain([s], transpositions(s), substitutions(s))

def are_blocked_candidates(s1, s2):
    return bool(set(all_blocks(s1)) & set(all_blocks(s2)))

assert not are_blocked_candidates('1234', '5555')
assert are_blocked_candidates('1234', '1239')
assert are_blocked_candidates('1234', '2134')
assert not are_blocked_candidates('1234', '1255')
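And since you did ask for Java, here is a hedged, near line-for-line port of the Python above (class and method names are my own):
import java.util.*;

class Blocking {
    static List<String> transpositions(String s) {
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < s.length() - 1; pos++) {
            out.add(s.substring(0, pos) + s.charAt(pos + 1) + s.charAt(pos) + s.substring(pos + 2));
        }
        return out;
    }

    static List<String> substitutions(String s) {
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < s.length(); pos++) {
            out.add(s.substring(0, pos) + '*' + s.substring(pos + 1));
        }
        return out;
    }

    static Set<String> allBlocks(String s) {
        Set<String> keys = new HashSet<>();
        keys.add(s);
        keys.addAll(transpositions(s));
        keys.addAll(substitutions(s));
        return keys;
    }

    static boolean areBlockedCandidates(String s1, String s2) {
        Set<String> shared = allBlocks(s1);
        shared.retainAll(allBlocks(s2));       // set intersection
        return !shared.isEmpty();
    }
}
In the database you would store every key from allBlocks(ssn) alongside the record, so candidate pairs can be found with an equality join on the key column instead of pairwise comparisons.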
I'm writing a JUnit test that checks messages (one message per line) inside one unified String. The format is as follows:
[* Message for Alice *]
Hey, first message
Second message
[* Message for Jim *]
Holler
Are you there?
[* General Messages *]
Welcome everyone!
This is yet another message.
The problem is that the order of the actual string I receive may change (except for the General Messages, which always come at the end of the string). For example, sometimes I get Jim's messages first, so when I try to use assertEquals() the test fails. Unfortunately I don't have access to the code that generates the messages, so I can't make any modifications.
What is the best way to compare these strings and validate that they're the same?
You should re-organize your tests to address arbitrary re-ordering, for example, like this:
Split the string into individual messages
Separate general messages and all other messages
Order expected and actual messages in the same order (e.g. alphabetical)
Compare ordered lists of expected and actual messages. Now that they are ordered the same, they should equal item-by-item
Check that the general messages come after all other messages in the actual message stream.
You'd be better off comparing Sets of messages; the fuzzy string comparison you're after is going to be too tricky...
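A hedged sketch of steps 1 through 4 above, where splitIntoBlocks is an illustrative helper rather than an existing API:
import static org.junit.Assert.assertEquals;
import java.util.*;

static List<String> splitIntoBlocks(String s) {
    List<String> blocks = new ArrayList<>();
    StringBuilder current = null;
    for (String line : s.split("\\R")) {
        if (line.startsWith("[*")) {           // a header starts a new block
            if (current != null) blocks.add(current.toString());
            current = new StringBuilder();
        }
        if (current != null) current.append(line).append('\n');
    }
    if (current != null) blocks.add(current.toString());
    return blocks;
}

// in the test: sort both sides so the arbitrary block order no longer matters
List<String> expected = splitIntoBlocks(expectedString);
List<String> actual = splitIntoBlocks(actualString);
Collections.sort(expected);
Collections.sort(actual);
assertEquals(expected, actual);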