I am creating a Scrabble game that uses a dictionary. For efficiency, instead of loading the entire dictionary (a txt file) into a data structure (Set, List, etc.), is there any built-in Java class that can help me treat the contents of the file as a String?
Specifically, I want to check whether a word made in the game is a valid dictionary word by doing something simple like fileName.contains(word), instead of keeping a huge, memory-inefficient list and using list.contains(word).
Do you have any ideas on what I could do? If the dictionary has to be in something other than a txt file (e.g. an xml file), I am open to trying that as well.
NOTE: I am not looking for http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29
This method is not part of the standard Java API.
HashSet didn't come to mind; I was stuck on the idea that all contains() methods take O(n) time. Thanks to Bozho for clearing that up; it looks like I will be using a HashSet.
I think your best option is to load all the words into memory, in a HashSet. There, contains(word) is O(1).
If you are fine with having it in memory, keeping it as one big String and calling contains(..) on it is much less efficient than a HashSet.
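For example, here is a minimal sketch of that idea (assuming a plain text dictionary with one word per line; the file name, class name and method names are just placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class Dictionary {
    private final Set<String> words = new HashSet<>();

    public Dictionary(String path) throws IOException {
        // load the whole word list once; placeholder path, e.g. "words.txt"
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line.trim().toLowerCase()); // normalize once at load time
            }
        }
    }

    public boolean isValidWord(String word) {
        return words.contains(word.toLowerCase()); // O(1) lookup
    }
}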
And I have to mention another option: there's a data structure designed for dictionaries, called a Trie. You won't find an implementation in the JDK, though.
A very rough calculation says that with all English words (about 1 million) you will need ~12 megabytes of RAM, which is a few times less than the default memory settings of the JVM (1 million * 6 letters on average * 2 bytes per letter = 12 million bytes, which is ~12 megabytes). (Well, perhaps a bit more to store the hashes.)
If you really insist on not reading it into memory and you want to scan the file for a given word, you can use a java.util.Scanner and its scanner.findWithinHorizon(..). But that would be inefficient - I assume O(n) per lookup, plus I/O overhead.
While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data-structures that can be used including a Trie or Heap.
The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.
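As a rough illustration (not a JDK class, just a sketch assuming lower-case words), a minimal Trie could look like this:

import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            // shared prefixes ("the", "there", "these") reuse the same nodes
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }
}

A real implementation would usually use a more compact child representation (arrays or a ternary search tree) rather than a HashMap per node.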
Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.
All of these data-structures (and more) can be implemented to be file-based (use more IO per lookup -- which could actually be less overall -- but save memory) or implemented directly (e.g. use SQLite and let it do its B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification is easy, and "it just works". SQLite is even used in less powerful systems such as Android.
HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:
Dictionary = 5MB.
Loaded into HashSet (assuming lots of overhead) = 20MB.
Memory usage in relation to other things = Minimal (assumes laptop/desktop)
Time to implement with HashSet = 2 Minutes.
I will have only "lost" 2 Minutes if I decide a HashSet wasn't good enough :-)
Happy coding.
Links to random data-structure implementations (may or may not be suitable):
TernarySearchTrie Reads in a flat file (must be specially constructed?)
TrieTree Has support for creating the Trie file from a flat file. Not sure if this Trie works from disk.
FileHash Hash which uses a file backing.
HashStore Another disk-based hash
WB B-Tree Simple B-tree implementation / "database"
SQLite Small embedded RDBMS.
UTF8String Can be used to significantly reduce the memory requirements of using HashSet<String> when using a Latin dictionary. (String in Java uses UTF-16 encoding which is minimum of two bytes/character.)
You need to compress your data to avoid having to store all those words separately. The way to do so would be a tree in which nodes are letters and leaves mark the end of a word. This way you're not storing repetitive data: for example, "the", "there" and "these" all share the same prefix.
There is a way to make this solution even more memory efficient. (Hint: letter order)
Use the readLine() method of java.io.BufferedReader. That returns a String.
String line = new BufferedReader(new FileReader(file)).readLine();
Related
I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet:
try(BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
List<String> data = reader.lines()
.collect(Collectors.toList());
}
The problem is, the file itself is less than 2GB, but when I look at the memory, the JVM is allocating twice the size of the file.
Looking at the heaviest objects in memory, what I understand is that Java is allocating twice the memory needed for the operation: once to put the content of the file in a byte array, and once more to instantiate the String list.
My question is: can we optimize that and avoid needing twice the memory?
tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there are a good number of caveats to this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String if and only if the characters used allow it (effectively, if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead, process and write away the data as you read it, never having more than a handful of records in memory at once (see the sketch after this list).
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
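As a sketch of the "don't keep it all in memory" option (assuming each line can be handled independently; processLine is just a placeholder for your real per-record work):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamingReader {
    public static void process(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // one line at a time; nothing is accumulated in memory
            reader.lines().forEach(StreamingReader::processLine);
        }
    }

    private static void processLine(String line) {
        // placeholder: parse/aggregate/write the line here instead of storing it
    }
}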
I have a data structure:
ArrayList<String>[] a = new ArrayList[100000];
Each list has about 1000 strings of about 100 characters each.
I'm doing a one-off job with it, and it cost a little more memory than I can bear.
I think I can change less code if I can find ways to reduce the memory cost somewhat, since the overshoot is not too large and it's just a one-off job. So, please tell me all the possible ways you know.
To add some info: the reason I'm using an array of ArrayLists is that the size 100000 is what I know up front, but I don't know the size of each ArrayList before I work through all the data.
And the problem is indeed too much data, so I want to find ways to compress it. It's not an allocation problem; in the end there is simply too much data for the available memory.
it cost a little more memory than I can bear
So, how much is "a little"?
Some quick estimates:
You have collections of strings of 1000 x 100 characters each. That should be about 1000 x 100 x 2 = 200 kB of string data per collection.
If you have 100000 of those, you'll need almost 20 GB for the data alone.
Compared to the 200 kB of each collection's data, the overhead of your data structures is minuscule, even if it were 100 bytes per collection (0.05%).
So, not much to be gained here.
Hence, the only viable ways are:
Data compression of some kind to reduce the size of the 20Gb payload
Use of external storage, e.g. by only reading in strings which are needed at the moment and then discarding them
To me, it is not clear if your memory problem really comes from the data structure you showed (did you profile the program?) or from the total memory usage of the program. As I commented on another answer, resizing an array(list) for instance temporarily requires at least 2x the size of the array(list) for the copying operation. Then notice that you can create memory leaks in Java - or just be holding on to data you actually won't need again.
Edit:
A String in Java consists of an array of chars. Every char occupies two bytes.
You can convert a String to a byte[], where any ASCII character should need one byte only (non-ASCII characters will still need 2 (or more) bytes):
str.getBytes(Charset.forName("UTF-8"))
Then you make a Comparator for byte[] and you're good to go. (Notice though that byte has a range of [-128,127] which makes comparing non-intuitive in this case; you may want to compare (((int)byteValue) & 0xff).)
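For example, a minimal sketch of such a comparator (just the unsigned-byte comparison described above):

import java.util.Comparator;

public class UnsignedByteArrayComparator implements Comparator<byte[]> {
    @Override
    public int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xff; // mask to get the unsigned value (0..255)
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length); // a prefix sorts before the longer array
    }
}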
Why are you using an array when you don't know the size at compile time? Size is the main reason why linked lists are preferable to arrays.
ArrayList<String>[] a = new ArrayList[100000];
Why are you allocating so much memory at once? An ArrayList will resize itself whenever required; you need not do it manually.
I think the structure below will satisfy your requirement:
List<List<String>> yourListOfStringList = new ArrayList<>();
Specifically, my end goal is to store every comma separated word from the file in a List<String> and I was wondering which approach I should take.
Approach 1:
String fileContents = new Scanner(new File("filepath")).useDelimiter("\\Z").next();
List<String> list = Arrays.asList(fileContents.split("\\s*,\\s*"));
Approach 2:
Scanner s = new Scanner(new File("filepath")).useDelimiter(",");
List<String> list = new ArrayList<>();
while (s.hasNext()){
list.add(s.next());
}
s.close();
Approach #1 will read the entire file into memory. This has a couple of performance-related issues:
If the file is big, that uses a lot of memory.
Because of the way that the characters need to be accumulated by the Scanner.next() call, they may need to be copied 2 or even 3 times.
There are other inefficiencies due to the fact that you are using a general pattern matching engine for a very specific purpose.
Approach #3 (which is Approach #1 with the File reading done better) addresses a lot of the efficiency issues, but you still hold the entire file contents in memory.
Approach #2 is best from memory usage perspective because you don't hold the entire file contents as a single string or buffer1. The performance is also likely to be best because (my intuition says) this approach avoids at least one copy of the characters.
However, if this really matters, you should benchmark the alternatives, bearing in mind 2 things:
"Premature optimization" is usually wasted effort. (Or to put it another, the chances are that the performance of this part of your code really doesn't matter. The performance bottleneck is likely somewhere else.)
There a lot of pitfalls for writing Java benchmarks that can lead to bogus performance measures and incorrect conclusions.
The other thing to note is that what you are trying to do (create a list of all "words" in order) does not scale. For a large enough input file, the application will run out of heap space. If you anticipate running this on input files larger than 100Mb or so, it may start to become a concern.
The solution may be to convert your processing into something that is more "stream" based ... so that you don't need to have a list of all words in memory.
This is essentially the same problem as with Approach #1.
1 - unless the file is small and fits into the buffer ... and then the whole question is largely moot.
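Picking up the stream-based suggestion above, here is a minimal sketch of Approach #2 without the list, where each word is handled as soon as it is read (handleWord is just a placeholder for the real processing):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class WordStream {
    public static void main(String[] args) throws FileNotFoundException {
        try (Scanner s = new Scanner(new File("filepath")).useDelimiter("\\s*,\\s*")) {
            while (s.hasNext()) {
                handleWord(s.next()); // process immediately; no List<String> is built
            }
        }
    }

    private static void handleWord(String word) {
        // placeholder: count, filter or write the word out here
    }
}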
If you read the entire file into memory when you don't actually need to you are:
wasting time: nothing is processed until you've read the entire file
wasting space
using a technique that won't scale to large files.
Doing this has nothing to recommend it.
Approach 1:
A String is limited in size: at most Integer.MAX_VALUE characters, or the largest array the runtime can allocate, whichever is smaller.
Hence, prefer Approach 2 if it is a very large file.
I'm having this problem: I'm reading 900 files and, after processing them, my final output will be a HashMap<String, HashMap<String, Double>>. The first String is the file name, the second String is the word, and the Double is the word frequency. The processing order is as follows:
read the first file
read the first line of the file
split the important tokens to a string array
copy the string array to my final map, incrementing word frequencies
repeat for all files
I'm using a BufferedReader. The problem is that after processing the first few files, the hash becomes so big that performance drops sharply. I would like to hear solutions for this. My idea is to create a size-limited hash; once the limit is reached, store it to a file, repeat until everything is processed, and merge all the hashes at the end.
Why not just read one file at a time, and dump that file's results to disk, then read the next file etc? Clearly each file is independent of the others in terms of the mapping, so why keep the results of the first file while you're writing the second?
You could possibly write the results for each file to another file (e.g. foo.txt => foo.txt.map), or you could create a single file with some sort of delimiter between results, e.g.
==== foo.txt ====
word - 1
the - 3
get - 3
==== bar.txt ====
apple - 2
// etc
By the way, why are you using double for the frequency? Surely it should be an integer value...
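A minimal sketch of that one-file-at-a-time idea (countWords below is just a placeholder tokenizer, and the ".map" suffix and "word - count" format are only the examples from above):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class PerFileDump {
    static void processAll(Iterable<Path> inputFiles) throws IOException {
        for (Path input : inputFiles) {
            Map<String, Integer> counts = countWords(input);
            Path output = Paths.get(input.toString() + ".map"); // e.g. foo.txt -> foo.txt.map
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(output))) {
                counts.forEach((word, count) -> out.println(word + " - " + count));
            }
            // counts goes out of scope here, so only one file's results are ever in memory
        }
    }

    private static Map<String, Integer> countWords(Path file) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            for (String token : line.split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}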
The time for a hash map lookup shouldn't increase significantly as it grows. It is possible that your map is skewed because of an unsuitable hash function, or that it is filling up too much. Unless you're using more RAM than you can get from the system, you shouldn't have to break things up.
What I have seen with Java when running huge hash maps (or any collection) with lots of objects in memory is that the VM goes crazy trying to run the garbage collector. It gets to the point where 90% of the time is spent on the JVM kicking off the garbage collector, which takes a while and finds that almost every object still has a reference.
I suggest profiling your application, and if it is the garbage collector, then increasing heap space and tuning the garbage collector. Also, it will help if you can approximate the needed size of your hash maps and provide sufficiently large allocations (see initialCapacity and loadFactor options in the constructor).
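For instance, a small sketch of the pre-sizing idea (the capacities below are placeholders to tune to your own estimates):

import java.util.HashMap;
import java.util.Map;

public class PreSizedMaps {
    // outer map: one entry per file (~900), sized so it rarely needs to rehash
    Map<String, Map<String, Double>> index = new HashMap<>(2048, 0.75f);

    Map<String, Double> newPerFileMap() {
        // inner map: roughly the expected number of distinct words per file
        return new HashMap<>(65_536, 0.75f);
    }
}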
I am trying to rethink your problem:
Since you are trying to construct an inverted index:
Use a Multimap rather than Map<String, Map<String, Integer>>
Multimap<word, (fileName, frequency, ...whatever else you need tomorrow)>
Now, read one file, construct the Multimap and save it on disk. (similar to Jon's answer)
After reading x files, merge all the Multimaps together: putAll(multimap) if you really need one common map of all the values.
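A minimal sketch of that idea with Guava (assuming Guava is on the classpath; the wrapper class and its method names are just illustrative):

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multimap;
import com.google.common.collect.Multiset;

public class InvertedIndexSketch {
    // file name -> every word occurrence in that file (duplicates kept)
    private final Multimap<String, String> wordsByFile = ArrayListMultimap.create();

    void addWord(String fileName, String word) {
        wordsByFile.put(fileName, word);
    }

    void mergeFrom(Multimap<String, String> other) {
        wordsByFile.putAll(other); // merge a per-batch multimap into the common one
    }

    int frequency(String fileName, String word) {
        // turn the occurrences into counts only when needed
        Multiset<String> counts = HashMultiset.create(wordsByFile.get(fileName));
        return counts.count(word);
    }
}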
You could try using this library to improve your performance.
http://high-scale-lib.sourceforge.net/
It is similar to the Java collections API, but designed for high performance. It would be ideal if you can process these results in small batches and then merge them.
Here is an article that will help you with some more inputs.
http://www.javaspecialists.eu/archive/Issue193.html
Why not use a custom class,
public class CustomData {
private String word;
private double frequency;
//Setters and Getters
}
and use your map as
Map<fileName, List<CustomData>>
This way at least you will have only 900 keys in your map.
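A small sketch of how that map could be filled (assuming CustomData gets a (word, frequency) constructor; the addResult method is only illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencyStore {
    private final Map<String, List<CustomData>> byFile = new HashMap<>();

    void addResult(String fileName, String word, double frequency) {
        byFile.computeIfAbsent(fileName, k -> new ArrayList<>())
              .add(new CustomData(word, frequency));
    }
}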
-Ivar
Is there a dictionary I can download for Java?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary.
Is there a dictionary I can download for Java?
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary
That is different. How fast do you want this to be? How many words in the dictionary and how many words, up to which length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad, but the various implementations are horribly inefficient (it's ok for small dictionaries, but it's an amazing waste when you have several hundred thousand words).
Now if you just want to solve the specific problem you describe, you can:
parse the dictionary file and create a map: (letters in sorted order -> set of matching words)
then, for any number of random letters: sort them and see if you have an entry in the map (if you do, the entry's value contains all the words that you can make with these letters).
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now you take some random letters and get, say, "hsotrerca"; you sort them to get "acehorrst", and using that as a key you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need is sort your letters and then use an O(1) map lookup.
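A minimal sketch of that sorted-letters lookup (assuming a plain text dictionary with one word per line; the file name is a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AnagramLookup {
    private final Map<String, Set<String>> wordsBySortedLetters = new HashMap<>();

    public AnagramLookup(String dictionaryPath) throws IOException {
        for (String word : Files.readAllLines(Paths.get(dictionaryPath))) {
            wordsBySortedLetters
                    .computeIfAbsent(sortLetters(word), k -> new HashSet<>())
                    .add(word);
        }
    }

    // returns every dictionary word that uses exactly the given letters
    public Set<String> anagramsOf(String letters) {
        return wordsBySortedLetters.getOrDefault(sortLetters(letters), Collections.emptySet());
    }

    private static String sortLetters(String s) {
        char[] chars = s.toLowerCase().toCharArray();
        Arrays.sort(chars); // "orchestra" and "carthorse" both become "acehorrst"
        return new String(chars);
    }
}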
To handle more complicated spell checking, where there may be errors, you need something to come up with "candidates" (words that may be correct but misspelled) [like, say, using the Soundex, Metaphone or Double Metaphone algorithms] and then use something like the Levenshtein edit-distance algorithm to check candidates against known good words (or the much more complicated tree built from Levenshtein edit distances that Google uses for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
As a funny side note, an optimized dictionary representation can store hundreds of thousands and even millions of words in less than 10 bits per word (yup, you read that correctly: less than 10 bits per word) and yet allow very fast lookup.
Dictionaries are usually programming-language agnostic. If you try to google for one without using the keyword "java", you may get better results. E.g. searching for free dictionary download turns up, among others, dicts.info.
OpenOffice dictionaries are easy to parse line-by-line.
You can read it in memory (remember it's a lot of memory):
List<String> words = IOUtils.readLines(new FileInputStream("dicfile.txt")); // from commons-io
Thus you get a List of all the words. Alternatively you can use the LineIterator if you encounter memory problems.
If you are on a unix like OS look in /usr/share/dict.
Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html
Check out - http://sourceforge.net/projects/test-dictionary/, it might give you some clue
I am not sure if there are any such libraries available for download! But I guess you can definitely dig through sourceforge.net to see if there are any, or how people have used dictionaries - http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary