I want to run a Dataflow job over multiple inputs from Google Cloud Storage, but the paths I want to pass to the job can't be specified with just the * glob operator.
Consider these paths:
gs://bucket/some/path/20160208/input1
gs://bucket/some/path/20160208/input2
gs://bucket/some/path/20160209/input1
gs://bucket/some/path/20160209/input2
gs://bucket/some/path/20160210/input1
gs://bucket/some/path/20160210/input2
gs://bucket/some/path/20160211/input1
gs://bucket/some/path/20160211/input2
gs://bucket/some/path/20160212/input1
gs://bucket/some/path/20160212/input2
I want my job to work on the files in the 20160209, 20160210 and 20160211 directories, but not on 20160208 (the first) and 20160212 (the last). In reality there's a lot of more dates, and I want to be able to specify an arbitrary range of dates for my job to work on.
The docs for TextIO.Read say:
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.
But I can't get this to work. There's a link to Java Filesystem glob patterns , which in turn links to getPathMatcher(String), that lists all the globbing options. One of them is {a,b,c}, which looks exactly like what I need, however, if I pass gs://bucket/some/path/201602{09,10,11}/* to TextIO.Read#from I get "Unable to expand file pattern".
Maybe the docs mean that only *, ? and […] are supported, and if that is the case, how can I construct a glob that Dataflow will accept and that can match an arbitrary date range like the one I describe above?
Update: I've figured out that I can write a chunk of code to so that I can pass in the path prefixes as a comma separated list, create an input from each and use the Flatten transform, but that seems like a very inefficient way of doing it. It looks like the first step reads all input files and immediately write them out again to the temporary location on GCS. Only when all the inputs have been read and written the actual processing starts. This step is completely unnecessary in the job I'm writing. I want the job to read the first file, start processing it and read the next, and so on. This just caused a ton other problems, I'll try to make it work, but it feels like a dead end because of the initial rewriting.
The docs do, indeed, mean that only *, ?, and [...] are supported. This means that arbitrary subsets or ranges in alphabetical or numeric order cannot be expressed as a single glob.
Here are some approaches that might work for you:
If the date represented in the file path is also present in the records in the files, then the simplest solution is to read them all and use a Filter transform to select the date range you are interested in.
The approach you tried of many reads in a separates TextIO.Read transforms and flattening them is OK for small sets of files; our tf-idf example does this. You can express arbitrary numerical ranges with a small number of globs so this need not be one read per file (for example the two character range "23 through 67" is 2[3-] plus [3-5][0-9] plus 6[0-7])
If the subset of files is more arbitrary then the number of globs/filenames may exceed the maximum graph size, and the last recommendation is to put the list of files into a PCollection and use a ParDo transform to read each file and emit its contents.
I hope this helps!
Edit
IMHO : I think it is not a duplicate because the two questions are trying to solve the problem in different ways and especially because they provide totally different technological skills (and finally, because I ask myself these two questions).
Question
How to aggregate items from an ordered stream, preferably in an intermediate operation ?
Context
Following my other question : Java8 stream lines and aggregate with action on terminal line
I've got a very large file of the form :
MASTER_REF1
SUBREF1
SUBREF2
SUBREF3
MASTER_REF2
MASTER_REF3
SUBREF1
...
Where SUBREF (if any) is applicable to MASTER_REF and both are complex objects (you can imagine it somewhat like JSON).
On first look I tried to group the lines with an operation returning null while agregating and a value when a group of line could be found (a "group" of lines ends if line.charAt(0)!=' ').
This code is hard to read and requires a .filter(Objects::nonNull).
I think one could achieve this using a .collect(groupingBy(...)) or a .reduce(...) but those are terminal operations which is :
not required in my case : lines are ordered and should be grouped by their position and groups of line are to be transformed afterwards (map+filter+...+foreach);
nor a good idea : I'm talking of a huge data file that is way bigger than the total amount of RAM+SWAP ... a terminal operation would saturate availiable resources (as said, by design I need to keep groups in memory because are to be transformed afterwards)
As I already noted in the answer to the previous question, it's possible to use some third-party libraries which provide partial reduction operations. One of such libraries is StreamEx which I develop by myself.
In StreamEx library the partial reduction operation is the intermediate stream operation which combines several input elements while some condition is met. Usually the condition is specified via BiPredicate applied to the pair of adjacent stream elements which returns true when elements should be combined together. The simplest way to combine elements is to make a List via StreamEx.groupRuns() method like this:
Stream<List<String>> records = StreamEx.of(Files.lines(path))
.groupRuns((line1, line2) -> !line2.startsWith("MASTER"));
Here we start a new record when the second of two adjacent lines starts with "MASTER" (as in your example). Otherwise we continue the previous record.
Note that such stream is still lazy. In sequential processing at most one intermediate List<String> is created at a time. Parallel processing is also supported, though turning the Files.lines stream into parallel mode rarely improves the performance (at least prior to Java-9).
I have a general question on your opinion about my "technique".
There are 2 textfiles (file_1 and file_2) that need to be compared to each other. Both are very huge (3-4 gigabytes, from 30,000,000 to 45,000,000 lines each).
My idea is to read several lines (as many as possible) of file_1 to the memory, then compare those to all lines of file_2. If there's a match, the lines from both files that match shall be written to a new file. Then go on with the next 1000 lines of file_1 and also compare those to all lines of file_2 until I went through file_1 completely.
But this sounds actually really, really time consuming and complicated to me.
Can you think of any other method to compare those two files?
How long do you think the comparison could take?
For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...
Antoher question that just came to my mind: how many lines would you read into the memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it?
I want to read as many as possible (because I think that's faster) but I've ran out of memory quite often.
Thanks in advance.
EDIT
I think I have to explain my problem a bit more.
The purpose is not to see if the two files in general are identical (they are not).
There are some lines in each file that share the same "characteristic".
Here's an example:
file_1 looks somewhat like this:
mat1 1000 2000 TEXT //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT
file_2looks like this:
mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT
TEXT refers to characters and digits that are of no interest for me, mat can go from mat1 - mat50 and are in no order; also there can be 1000x mat2 (but the numbers in the next column are different). I need to find the fitting lines in a way that: matX is the same in both compared lines an the number mentioned in file_2 fits into the range mentioned in file_1.
So in my example I would find one match: line 3 of file_1and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010).
I hope this makes it clear to you!
So my question is: how would you search for the matching lines?
Yes, I use Java as my programming language.
EDIT
I now divided the huge files first so that I have no problems with being out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-)
Nonentheless all your approaches were very helpful to me, thank you for your replies!
I think, your way is rather reasonable.
I can imagine different strategies -- for example, you can sort both files before compare (where is efficient implementation of filesort, and unix sort utility can sort several Gbs files in minutes), and, while sorted, you can compare files sequentally, reading line by line.
But this is rather complex way to go -- you need to run external program (sort), or write comparable efficient implementation of filesort in java by yourself -- which is by itself not an easy task. So, for the sake of simplicity, I think you way of chunked read is very promising;
As for how to find reasonable block -- first of all, it may not be correct what "the more -- the better" -- I think, time of all work will grow asymptotically, to some constant line. So, may be you'll be close to that line faster then you think -- you need benchmark for this.
Next -- you may read lines to buffer like this:
final List<String> lines = new ArrayList<>();
try{
final List<String> block = new ArrayList<>(BLOCK_SIZE);
for(int i=0;i<BLOCK_SIZE;i++){
final String line = ...;//read line from file
block.add(line);
}
lines.addAll(block);
}catch(OutOfMemory ooe){
//break
}
So you read as many lines, as you can -- leaving last BLOCK_SIZE of free memory. BLOCK_SIZE should be big enouth to the rest of you program to run without OOM
In an ideal world, you would be able to read in every line of file_2 into memory (probably using a fast lookup object like a HashSet, depending on your needs), then read in each line from file_1 one at a time and compare it to your data structure holding the lines from file_2.
As you have said you run out of memory however, I think a divide-and-conquer type strategy would be best. You could use the same method as I mentioned above, but read in a half (or a third, a quarter... depending on how much memory you can use) of the lines from file_2 and store them, then compare all of the lines in file_1. Then read in the next half/third/quarter/whatever into memory (replacing the old lines) and go through file_1 again. It means you have to go through file_1 more, but you have to work with your memory constraints.
EDIT: In response to the added detail in your question, I would change my answer in part. Instead of reading in all of file_2 (or in chunks) and reading in file_1 a line at a time, reverse that, as file_1 holds the data to check against.
Also, with regards searching the matching lines. I think the best way would be to do some processing on file_1. Create a HashMap<List<Range>> that maps a String ("mat1" - "mat50") to a list of Ranges (just a wrapper for a startOfRange int and an endOfRange int) and populate it with the data from file_1. Then write a function like (ignoring error checking)
boolean isInRange(String material, int value)
{
List<Range> ranges = hashMapName.get(material);
for (Range range : ranges)
{
if (value >= range.getStart() && value <= range.getEnd())
{
return true;
}
}
return false;
}
and call it for each (parsed) line of file_2.
Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.
This should eliminate a substantial amount of comparisons that wouldn't otherwise match anyway in the naive, brute-force approach. For the sake of argument, lets peg both files at 40 million lines each.
Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it's a beginner's exercise.
That's one pass through two files for a total of 80million lines read, yielding two sets of 50 files of 800,000 lines each on average.
Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
Comparison: Now you just have to iterate once through both pairs of file_1_mat1 and file_2_mat1, without need to keep anything in memory, outputting matches to your output file. Repeat for the rest of the partitions in turn. No need for a final 'merge' step (unless you're processing partitions in parallel).
Even without the sorting stage the naive comparison you're already doing should work faster across 50 pairs of files with 800,000 lines each rather than with two files with 40 million lines each.
there is a tradeoff: if you read a big chunk of the file, you save the disc seek time, but you may have read information you will not need, since the change was encountered on the first lines.
You should probably run some experiments [benchmarks], with varying chunk size, to find out what is the optimal chunk to read, in the average case.
No sure how good an answer this would be - but have a look at this page: http://c2.com/cgi/wiki?DiffAlgorithm - it summarises a few diff algorithms. Hunt-McIlroy algorithm is probably the better implementation. From that page there's also a link to a java implementation of the GNU diff. However, I think an implementation in C/C++ and compiled into native code will be much faster. If you're stuck with java, you may want to consider JNI.
Indeed, that could take a while. You have to make 1,200.000,000 line comparisions.
There are several possibilities to speed that up by an order of magnitute:
One would be to sort file2 and do kind of a binary search on file level.
Another approach: compute a checksum of each line, and search that. Depending on average line length, the file in question would be much smaller and you really can do a binary search if you store the checksums in a fixed format (i.e. a long)
The number of lines you read at once from file_1 does not matter, however. This is micro-optimization in the face of great complexity.
If you want a simple approach: you can hash both of the files and compare the hash. But it's probably faster (especially if the files differ) to use your approach. About the memory consumption: just make sure you use enough memory, using no buffer for this kind a thing is a bad idea..
And all those answers about hashes, checksums etc: those are not faster. You have to read the whole file in both cases. With hashes/checksums you even have to compute something...
What you can do is sort each individual file. e.g. the UNIX sort or similar in Java. You can read the sorted files one line at a time to perform a merge sort.
I have never worked with such huge files but this is my idea and should work.
You could look into hash. Using SHA-1 Hashing.
Import the following
import java.io.FileInputStream;
import java.security.MessageDigest;
Once your text file etc has been loaded have it loop through each line and at the end print out the hash. The example links below will go into more depth.
StringBuffer myBuffer = new StringBuffer("");
//For each line loop through
for (int i = 0; i < mdbytes.length; i++) {
myBuffer.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
}
System.out.println("Computed Hash = " + sb.toString());
SHA Code example focusing on Text File
SO Question about computing SHA in JAVA (Possibly helpful)
Another sample of hashing code.
Simple read each file seperatley, if the hash value for each file is the same at the end of the process then the two files are identical. If not then something is wrong.
Then if you get a different value you can do the super time consuming line by line check.
Overall, It seems that reading line by line by line by line etc would take forever. I would do this if you are trying to find each individual difference. But I think hashing would be quicker to see if they are the same.
SHA checksum
If you want to know exactly if the files are different or not then there isn't a better solution than yours -- comparing sequentially.
However you can make some heuristics that can tell you with some kind of probability if the files are identical.
1) Check file size; that's the easiest.
2) Take a random file position and compare block of bytes starting at this position in the two files.
3) Repeat step 2) to achieve the needed probability.
You should compute and test how many reads (and size of block) are useful for your program.
My solution would be to produce an index of one file first, then use that to do the comparison. This is similar to some of the other answers in that it uses hashing.
You mention that the number of lines is up to about 45 million. This means that you could (potentially) store an index which uses 16 bytes per entry (128 bits) and it would use about 45,000,000*16 = ~685MB of RAM, which isn't unreasonable on a modern system. There are overheads in using the solution I describe below, so you might still find you need to use other techniques such as memory mapped files or disk based tables to create the index. See Hypertable or HBase for an example of how to store the index in a fast disk-based hash table.
So, in full, the algorithm would be something like:
Create a hash map which maps Long to a List of Longs (HashMap<Long, List<Long>>)
Get the hash of each line in the first file (Object.hashCode should be sufficient)
Get the offset in the file of the line so you can find it again later
Add the offset to the list of lines with matching hashCodes in the hash map
Compare each line of the second file to the set of line offsets in the index
Keep any lines which have matching entries
EDIT:
In response to your edited question, this wouldn't really help in itself. You could just hash the first part of the line, but it would only create 50 different entries. You could then create another level in the data structure though, which would map the start of each range to the offset of the line it came from.
So something like index.get("mat32") would return a TreeMap of ranges. You could look for the range preceding the value you are looking for lowerEntry(). Together this would give you a pretty fast check to see if a given matX/number combination was in one of the ranges you are checking for.
try to avoid memory consuming and make it disc consuming.
i mean divide each file into loadable size parts and compare them, this may take some extra time but will keep you safe dealing with memory limits.
What about using source control like Mercurial? I don't know, maybe it isn't exactly what you want, but this is a tool that is designed to track changes between revisions. You can create a repository, commit the first file, then overwrite it with another one an commit the second one:
hg init some_repo
cd some_repo
cp ~/huge_file1.txt .
hg ci -Am "Committing first huge file."
cp ~/huge_file2.txt huge_file1.txt
hg ci -m "Committing second huge file."
From here you can get a diff, telling you what lines differ. If you could somehow use that diff to determine what lines were the same, you would be all set.
That's just an idea, someone correct me if I'm wrong.
I would try the following: for each file that you are comparing, create temporary files (i refer to it as partial file later) on disk representing each alphabetic letter and an additional file for all other characters. then read the whole file line by line. while doing so, insert the line into the relevant file that corresponds to the letter it starts with. since you have done that for both files, you can now limit the comparison for loading two smaller files at a time. a line starting with A for example can appear only in one partial file and there will not be a need to compare each partial file more than once. If the resulting files are still very large, you can apply the same methodology on the resulting partial files (letter specific files) that are being compared by creating files according to the second letter in them. the trade-of here would be usage of large disk space temporarily until the process is finished. in this process, approaches mentioned in other posts here can help in dealing with the partial files more efficiently.
is there a dictionary i can download for java?
i want to have a program that takes a few random letters and sees if they can be rearanged into a real word by checking them against the dictionary
Is there a dictionary i can download
for java?
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
I want to have a program that takes a
few random letters and sees if they
can be rearranged into a real word by
checking them against the dictionary
That is different. How fast do you want this to be? How many words in the dictionary and how many words, up to which length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad but the various implementation are horribly inefficient (it's ok for small dictionaries, but it's an amazing waste when you have several hundred thousands of words).
Now if you just want to solve the specific problem you describe, you can:
parse the dictionary file and create a map : (letters in sorted order, set of matching words)
then for any number of random letters: sort them, see if you have an entry in the map (if you do the entry's value contains all the words that you can do with these letters).
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now you take, say, three random letters and get "hsotrerca", you sort them to get "acehorrst" and using that as a key you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need is sort your letters and then use an O(1) map lookup.
To come with more complicated spell checkings, where there may be errors, then you need something to come up with "candidates" (words that may be correct but mispelled) [like, say, using the soundex, metaphone or double metaphone algos] and then use things like the Levenhstein Edit-distance algorithm to check candidates versus known good words (or the much more complicated tree made of Levenhstein Edit-distance that Google use for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
As a funny sidenote, optimized dictionary representation can store hundreds and even millions of words in less than 10 bit per word (yup, you've read correctly: less than 10 bits per word) and yet allow very fast lookup.
Dictionaries are usually programming language agnostic. If you try to google it without using the keyword "java", you may get better results. E.g. free dictionary download gives under each dicts.info.
OpenOffice dictionaries are easy to parse line-by-line.
You can read it in memory (remember it's a lot of memory):
List words = IOUtils.readLines(new FileInputStream("dicfile.txt")) (from commons-io)
Thus you get a List of all words. Alternatively you can use the Line Iterator, if you encounter memory prpoblems.
If you are on a unix like OS look in /usr/share/dict.
Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html
Check out - http://sourceforge.net/projects/test-dictionary/, it might give you some clue
I am not sure if there are any such libraries available for download! But I guess you can definitely digg through sourceforge.net to see if there are any or how people have used dictionaries - http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary
I've got about 2500 short phrases in a file. I want to be able to find phrases as I type possible substrings of them. My app has a text box and a list of phrases. The text box is initially empty and the list contains all 2500 phrases, since the empty string is a substring of all of them. As I type in the text box, the list updates so that it always only contains phrases which contain the text box's value as a substring.
At the moment I have one of Google's Multimaps, specifically:
LinkedHashMultimap<String, String>
with every single possible substring mapped to its possible matches. This takes a while to load (about a second) and I think it must be taking up quite a bit of space (which may be a concern in the future.) It's very fast with the lookups though.
Is there a way I could do this with some other data structure or strategy that would be quicker to load and take less space (possibly at the expense of the speed of the lookups)?
If your list only contains 2500 elements, a simple loop and checking contains() on all of them should be fast enough.
If it grows bigger and/or is too slow, you can apply some easy optimizations:
Don't search immediately as the user types each character, but introduce some delay. So if he types "foobar" really fast, you only search for "foobar", not first "f" then "fo" then "foo",...
Reuse your previous results: if the user first types "foo" and then extends that to "foobar", don't search in the whole original list again, but search inside the results for "foo" (because everything that contains "foobar" must contain "foo").
In my experience, these these basic optimizations already get you quite far.
Now, if the list grows so big that even that is too slow, some "smarter" optimizations as proposed in other answers here (tries, suffix trees,...) would be needed.
You'll want to look into using the Trie data structure.
Try simply looping over the entire list and calling contains() - doing that 2500 times is probably completely unnoticeable.
You definetely need a Suffix Tree.. (wiki)
(i think this implementation could be ok: link)
EDIT:
I've read your comment, you shouldn't blindly check if the string is a substring somewhere in you phrase, you usually start with a word, not with a space. So maybe it's better to tokenize words inside your phrase?
Are you allowed to do it? Otherwise the best way is to build an automata for every phrase or using similar algorithms (for example the Karp-Rabin string search algorithm).
Wouter Coekaerts has a good approach, but I would go a bit further.
Don't bring up anything when the textbox contains a single character. The results won't be useful. You may find that this is true for two characters as well.
Precompute the results for two characters. When there are two characters bring up the precomputed list.
When a third character is added do the 'contains' search on the list you have currently displayed (anything that doesn't contain c1c2 can't contain c1c2c3). By now the list should be small enough that 'contains' has perfectly adequate performance.
Similarly for four characters etc.
As said above, put in a little delay before starting the search. Or better still arrange for a search to be killed if another character is typed before it finishes.