Tokenize big files to hashtable in Java


I'm having this problem: I'm reading 900 files and, after processing them, my final output will be a HashMap<String, HashMap<String, Double>>. The first string is the file name, the second string is the word, and the double is the word frequency. The processing order is as follows:
read the first file
read the first line of the file
split the important tokens to a string array
copy the string array to my final map, incrementing word frequencies
repeat for all files
I'm using a BufferedReader. The problem is that, after the first files have been processed, the map becomes so big that performance drops badly after a while. I would like to hear solutions for this. My idea is to cap the size of the map and, once the limit is reached, store it in a file; repeat that until everything is processed, then merge all the maps at the end.

Why not just read one file at a time, and dump that file's results to disk, then read the next file etc? Clearly each file is independent of the others in terms of the mapping, so why keep the results of the first file while you're writing the second?
You could possibly write the results for each file to another file (e.g. foo.txt => foo.txt.map), or you could create a single file with some sort of delimiter between results, e.g.
==== foo.txt ====
word - 1
the - 3
get - 3
==== bar.txt ====
apple - 2
// etc
By the way, why are you using double for the frequency? Surely it should be an integer value...
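A minimal sketch of the one-file-at-a-time idea, under some assumptions: tokens are split on non-word characters (the question's "important token" rule is not shown), and the names PerFileWordCount, count and processFile are made up for illustration. Each file's counts live in a local map that can be collected as soon as the results are written to foo.txt.map.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

class PerFileWordCount {

    // Counts the tokens in a stream of lines. Splitting on non-word characters
    // is an assumption; substitute whatever "important token" rule you use.
    static Map<String, Integer> count(Stream<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        lines.forEach(line -> {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        });
        return counts;
    }

    // Processes one input file and writes "word - count" lines to <input>.map.
    // The map is local, so it becomes garbage as soon as this method returns.
    static Path processFile(Path input) {
        Path output = input.resolveSibling(input.getFileName() + ".map");
        try (Stream<String> lines = Files.lines(input, StandardCharsets.UTF_8);
             PrintWriter out = new PrintWriter(
                     Files.newBufferedWriter(output, StandardCharsets.UTF_8))) {
            count(lines).forEach((word, n) -> out.println(word + " - " + n));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return output;
    }
}
```

Calling processFile once per input file keeps at most one file's map in memory at any time.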

The time for a hash map to process shouldn't increase significantly as it grows. It is possible that your map is skewing because of an unsuited hashing function or filling up too much. Unless you're using more RAM than you can get from the system, you shouldn't have to break things up.
What I have seen with Java when running huge hash maps (or any collection) with lots of objects in memory is that the VM goes crazy trying to run the garbage collector. It gets to the point where 90% of the time is spent in the garbage collector, which takes a while and finds that almost every object is still referenced.
I suggest profiling your application, and if it is the garbage collector, then increasing heap space and tuning the garbage collector. Also, it will help if you can approximate the needed size of your hash maps and provide sufficiently large allocations (see initialCapacity and loadFactor options in the constructor).
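As a rough illustration of the initialCapacity/loadFactor point, the helper below sizes a HashMap so that an expected number of entries fits without any rehashing. The class and method names are hypothetical; only the HashMap(int, float) constructor is the real JDK API.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMaps {

    // Returns a HashMap that can hold expectedEntries without rehashing,
    // given the load factor (0.75 is the JDK default).
    static <K, V> Map<K, V> withExpectedSize(int expectedEntries, float loadFactor) {
        int initialCapacity = (int) Math.ceil(expectedEntries / (double) loadFactor);
        return new HashMap<>(initialCapacity, loadFactor);
    }
}
```

This only avoids the cost of repeated resizing; it does not reduce the memory the entries themselves need.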

I am trying to rethink your problem:
Since you are trying to construct an inverted index:
Use Multimap rather than Map<String, Map<String, Integer>>
Multimap<word, frequency, fileName, .some thing else tomorrow>
Now, read one file, construct the Multimap and save it on disk. (similar to Jon's answer)
After reading x files, merge all the Multimaps together: putAll(multimap) if you really need one common map of all the values.

You could try using this library to improve your performance.
http://high-scale-lib.sourceforge.net/
It is similar to the java collections api, but for high performance. It would be ideal if you can batch and merge these results after processing them in small batches.
Here is an article that will help you with some more inputs.
http://www.javaspecialists.eu/archive/Issue193.html

Why not use a custom class,
public class CustomData {
    private String word;
    private double frequency;
    // Setters and Getters
}
and use your map as
Map<fileName, List<CustomData>>
This way, at least, you will have only 900 keys in your map.
-Ivar

Related

Any way to compress java arraylist?

I have a data structure:
ArrayList<String>[] a = new ArrayList[100000];
each list has about 1000 strings with about 100 characters.
I'm doing a one-off job with it, and it cost a little more memory than I can bear.
I think I will have to change less code if I can find ways to reduce the memory cost, as the excess is not too much and it's just a one-off job. So, please tell me all possible ways you know.
Some added info: the reason I'm using an array of ArrayLists is that the size 100000 is what I know now. But I don't know the size of each ArrayList before I have worked through all the data.
And the problem is indeed too much data, so I want to find ways to compress it. It's not an allocation problem. In the end there will be too much data to fit in memory.

it cost a little more memory than I can bear
So, how much is "a little"?
Some quick estimates:
You have collections of strings of 1000x100 characters. That should be about 1000x100x2 = 200 KB of string data.
If you have 100000 of those, you'll need almost 20 GB for the data alone.
Compared to the 200 KB of each collection's data, the overhead of your data structures is minuscule, even if it were 100 bytes for each collection (0.05%).
So, not much to be gained here.
Hence, the only viable ways are:
Data compression of some kind to reduce the size of the 20 GB payload
Use of external storage, e.g. by only reading in strings which are needed at the moment and then discarding them
To me, it is not clear if your memory problem really comes from the data structure you showed (did you profile the program?) or from the total memory usage of the program. As I commented on another answer, resizing an array(list) for instance temporarily requires at least 2x the size of the array(list) for the copying operation. Then notice that you can create memory leaks in Java - or just be holding on to data you actually won't need again.
Edit:
A String in Java consists of an array of chars. Every char occupies two bytes.
You can convert a String to a byte[], where any ASCII character should need one byte only (non-ASCII characters will still need 2 (or more) bytes):
str.getBytes(Charset.forName("UTF-8"))
Then you make a Comparator for byte[] and you're good to go. (Notice though that byte has a range of [-128,127] which makes comparing non-intuitive in this case; you may want to compare (((int)byteValue) & 0xff).)
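A small sketch of that byte[] approach: strings are encoded as UTF-8 and ordered with a comparator that masks each byte with 0xff so it is compared as an unsigned value. The class name ByteStrings and its members are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

class ByteStrings {

    // Encodes a String as UTF-8: one byte per ASCII character.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static final Comparator<byte[]> UNSIGNED_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    };
}
```

Without the mask, any byte above 127 would be negative and sort before the ASCII range.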

Why are you using an array when you don't know the size at compile time? Not knowing the size is the main reason linked lists are preferable over arrays.
ArrayList< String>[] a = new ArrayList[100000];
Why are you allocating so much memory at once initially? An ArrayList will resize itself whenever required; you need not do it manually.
I think the structure below will suffice for your requirement:
List<List<String>> yourListOfStringList = new ArrayList<>();

What are the advantages and disadvantages of reading an entire file into a single String as opposed to reading it line by line?

Specifically, my end goal is to store every comma separated word from the file in a List<String> and I was wondering which approach I should take.
Approach 1:
String fileContents = new Scanner(new File("filepath")).useDelimiter("\\Z").next();
List<String> list = Arrays.asList(fileContents.split("\\s*,\\s*"));
Approach 2:
Scanner s = new Scanner(new File("filepath")).useDelimiter(",");
List<String> list = new ArrayList<>();
while (s.hasNext()) {
    list.add(s.next());
}
s.close();

Approach #1 will read the entire file into memory. This has a couple of performance-related issues:
If the file is big, that uses a lot of memory.
Because of the way that the characters need to be accumulated by the Scanner.next() call, the characters may need to be copied 2 or even 3 times.
There are other inefficiencies due to the fact that you are using a general pattern matching engine for a very specific purpose.
Approach #3 (which is Approach #1 with the File reading done better) addresses a lot of the efficiency issues, but you still hold the entire file contents in memory.
Approach #2 is best from a memory usage perspective because you don't hold the entire file contents as a single string or buffer [1]. The performance is also likely to be best because (my intuition says) this approach avoids at least one copy of the characters.
However, if this really matters, you should benchmark the alternatives, bearing in mind 2 things:
"Premature optimization" is usually wasted effort. (Or to put it another way, the chances are that the performance of this part of your code really doesn't matter. The performance bottleneck is likely somewhere else.)
There are a lot of pitfalls in writing Java benchmarks that can lead to bogus performance measures and incorrect conclusions.
The other thing to note is that what you are trying to do (create a list of all "words" in order) does not scale. For a large enough input file, the application will run out of heap space. If you anticipate running this on input files larger than 100 MB or so, it may start to become a concern.
The solution may be to convert your processing into something that is more "stream" based ... so that you don't need to have a list of all words in memory.
This is essentially the same problem as the problem with Approach #1.
1 - unless the file is small and fits into the buffer ... and then the whole question is largely moot.
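To make the "stream based" suggestion concrete, here is one possible shape for it: each token is handed to a callback as soon as the Scanner produces it, so no list of all words is built. TokenStreamer and forEachToken are hypothetical names, and the delimiter is the one from Approach 1.

```java
import java.util.Scanner;
import java.util.function.Consumer;

class TokenStreamer {

    // Feeds each comma-separated token to the handler as soon as it is read,
    // so no List<String> of all tokens is ever built. Returns the token count.
    static long forEachToken(Readable source, Consumer<String> handler) {
        long count = 0;
        try (Scanner scanner = new Scanner(source).useDelimiter("\\s*,\\s*")) {
            while (scanner.hasNext()) {
                handler.accept(scanner.next());
                count++;
            }
        }
        return count;
    }
}
```

For a file, the source could be a java.io.FileReader wrapped in a BufferedReader; whether this helps depends on whether the downstream processing can really work one token at a time.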

If you read the entire file into memory when you don't actually need to you are:
wasting time: nothing is processed until you've read the entire file
wasting space
using a technique that won't scale to large files.
Doing this has nothing to recommend it.

Approach 1:
Approach 1 is limited by String's maximum size: a String can hold at most Integer.MAX_VALUE characters, or fewer if the largest array the JVM can allocate at runtime is smaller.
Hence, prefer Approach 2 if it is a very large file.

JVM Tuning of Java Class

My java class reads in a 60MB file and produces a HashMap of a HashMap with over 300 million records.
HashMap<Integer, HashMap<Integer, Double>> pairWise =
new HashMap<Integer, HashMap<Integer, Double>>();
I already tuned the VM arguments to be:
-Xms512M -Xmx2048M
But the system still fails with:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.createEntry(HashMap.java:869)
at java.util.HashMap.addEntry(HashMap.java:856)
at java.util.HashMap.put(HashMap.java:484)
at com.Kaggle.baseline.BaselineNew.createSimMap(BaselineNew.java:70)
at com.Kaggle.baseline.BaselineNew.<init>(BaselineNew.java:25)
at com.Kaggle.baseline.BaselineNew.main(BaselineNew.java:315)
How big a heap will it take to run without failing with an OOME?

Your dataset is ridiculously large to process in memory. What follows is not a final solution, just an optimization.
You're using boxed primitives, which is a very painful thing to look at.
According to this question, a boxed integer can be 20 bytes larger than an unboxed integer. This is not what I call memory efficient.
You can optimize this with specialized collections, which don't box the primitive values. One project providing these is Trove. You could use a TIntDoubleHashMap instead of your HashMap<Integer, Double> and a TIntObjectHashMap instead of your HashMap<Integer, …>.
Therefore your type would look like this:
TIntObjectHashMap<TIntDoubleHashMap> pairWise =
new TIntObjectHashMap<TIntDoubleHashMap>();
Now, do the math.
300.000.000 boxed Doubles, each 24 bytes, use 7.200.000.000 bytes of memory, that is 7.2 GB.
If you store 300.000.000 primitive doubles, taking 8 bytes each, you only need 2.400.000.000 bytes, which is 2.4 GB.
Congrats, you saved around 67% of the memory you previously used for storing your numbers!
Note that this calculation is rough, depends on the platform and implementation, and does not account for the memory used for the HashMap/T*Maps.

Your data set is large enough that holding all of it in memory at one time is not going to happen.
Consider storing the data in a database and loading partial data sets to perform manipulation.
Edit: My assumption was that you were going to do more than one pass on the data. If all you are doing is loading it and performing one action on each item, then Lex Webb's suggestion (comment below) is a better solution than a database. If you are performing more than one action per item, then database appears to be a better solution. The database does not need to be an SQL database, if your data is record oriented a NoSQL database might be a better fit.

You are using the wrong data structures for data of this volume. Java adds significant overhead in memory and time for every object it creates -- and at the 300 million object level you're looking at a lot of overhead. You should consider leaving this data in the file and use random access techniques to address it in place -- take a look at memory mapped files using nio.
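A sketch of what the memory-mapped route could look like, assuming the values are stored as fixed-width 8-byte doubles in a binary file (the question's 60MB input would first have to be converted to such a layout). MappedDoubles is a made-up name; FileChannel.map and asDoubleBuffer are standard java.nio calls. A single mapping cannot exceed Integer.MAX_VALUE bytes, so data of this volume would need several mappings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedDoubles {

    // Maps a file of `count` 8-byte doubles and exposes it as a DoubleBuffer.
    // The file is created or extended as needed; the mapped data lives outside
    // the Java heap and stays valid after the channel is closed.
    static DoubleBuffer map(Path file, int count) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer bytes = channel.map(
                    FileChannel.MapMode.READ_WRITE, 0, (long) count * Double.BYTES);
            return bytes.asDoubleBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Values are then read and written by index with get(int) and put(int, double), and the operating system decides which pages are actually resident.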

How to compare large text files?

I have a general question on your opinion about my "technique".
There are 2 textfiles (file_1 and file_2) that need to be compared to each other. Both are very huge (3-4 gigabytes, from 30,000,000 to 45,000,000 lines each).
My idea is to read several lines (as many as possible) of file_1 to the memory, then compare those to all lines of file_2. If there's a match, the lines from both files that match shall be written to a new file. Then go on with the next 1000 lines of file_1 and also compare those to all lines of file_2 until I went through file_1 completely.
But this sounds actually really, really time consuming and complicated to me.
Can you think of any other method to compare those two files?
How long do you think the comparison could take?
For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...
Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it?
I want to read as many as possible (because I think that's faster) but I've run out of memory quite often.
Thanks in advance.
EDIT
I think I have to explain my problem a bit more.
The purpose is not to see if the two files in general are identical (they are not).
There are some lines in each file that share the same "characteristic".
Here's an example:
file_1 looks somewhat like this:
mat1 1000 2000 TEXT //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT
file_2 looks like this:
mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT
TEXT refers to characters and digits that are of no interest to me. mat can go from mat1 to mat50, in no particular order; there can also be 1000x mat2 (but the numbers in the next column are different). I need to find the fitting lines such that matX is the same in both compared lines and the number mentioned in file_2 fits into the range mentioned in file_1.
So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010).
I hope this makes it clear to you!
So my question is: how would you search for the matching lines?
Yes, I use Java as my programming language.
EDIT
I now divided the huge files first so that I have no problems with being out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-)
Nonetheless, all your approaches were very helpful to me. Thank you for your replies!

I think your way is rather reasonable.
I can imagine different strategies. For example, you can sort both files before comparing them (there are efficient file-sort implementations, and the Unix sort utility can sort files of several GB in minutes) and, once they are sorted, compare the files sequentially, reading line by line.
But this is a rather complex way to go: you need to run an external program (sort), or write a comparably efficient file sort in Java yourself, which is not an easy task. So, for the sake of simplicity, I think your way of chunked reading is very promising.
As for how to find a reasonable block size: first of all, "the more, the better" may not hold. I think the total time will approach some constant asymptotically as the block grows, so you may get close to that limit sooner than you think. You need a benchmark for this.
Next -- you may read lines to buffer like this:
final List<String> lines = new ArrayList<>();
try {
    final List<String> block = new ArrayList<>(BLOCK_SIZE);
    for (int i = 0; i < BLOCK_SIZE; i++) {
        final String line = ...; // read line from file
        block.add(line);
    }
    lines.addAll(block);
} catch (OutOfMemoryError ooe) {
    // out of memory: break out of the reading loop
}
So you read as many lines as you can, leaving the last BLOCK_SIZE worth of memory free. BLOCK_SIZE should be big enough for the rest of your program to run without an OOM.

In an ideal world, you would be able to read in every line of file_2 into memory (probably using a fast lookup object like a HashSet, depending on your needs), then read in each line from file_1 one at a time and compare it to your data structure holding the lines from file_2.
As you have said you run out of memory however, I think a divide-and-conquer type strategy would be best. You could use the same method as I mentioned above, but read in a half (or a third, a quarter... depending on how much memory you can use) of the lines from file_2 and store them, then compare all of the lines in file_1. Then read in the next half/third/quarter/whatever into memory (replacing the old lines) and go through file_1 again. It means you have to go through file_1 more, but you have to work with your memory constraints.
EDIT: In response to the added detail in your question, I would change my answer in part. Instead of reading in all of file_2 (or in chunks) and reading in file_1 a line at a time, reverse that, as file_1 holds the data to check against.
Also, with regard to searching for the matching lines: I think the best way would be to do some processing on file_1. Create a HashMap<String, List<Range>> that maps a String ("mat1" - "mat50") to a list of Ranges (just a wrapper for a startOfRange int and an endOfRange int) and populate it with the data from file_1. Then write a function like (ignoring error checking)
boolean isInRange(String material, int value)
{
    List<Range> ranges = hashMapName.get(material);
    for (Range range : ranges)
    {
        if (value >= range.getStart() && value <= range.getEnd())
        {
            return true;
        }
    }
    return false;
}
and call it for each (parsed) line of file_2.
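One way the Range wrapper and the population step could look, as a sketch only: it assumes file_1 lines have the whitespace-separated form "mat1 1000 2000 TEXT" shown in the question, and RangeIndex and addFile1Line are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RangeIndex {

    // Minimal Range wrapper: just a start and an end, both inclusive.
    static final class Range {
        private final int start;
        private final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int getStart() { return start; }
        int getEnd() { return end; }
    }

    private final Map<String, List<Range>> rangesByMaterial = new HashMap<>();

    // Parses one file_1 line of the assumed form "mat1 1000 2000 TEXT".
    void addFile1Line(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        Range range = new Range(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
        rangesByMaterial.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(range);
    }

    // Same check as isInRange above, except that an unknown material yields false.
    boolean isInRange(String material, int value) {
        List<Range> ranges = rangesByMaterial.getOrDefault(material, Collections.emptyList());
        for (Range range : ranges) {
            if (value >= range.getStart() && value <= range.getEnd()) {
                return true;
            }
        }
        return false;
    }
}
```

With the question's sample data, ("mat3", 10009) is in range while ("mat3", 200) and ("mat1", 999) are not.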

Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.
This should eliminate a substantial amount of comparisons that wouldn't otherwise match anyway in the naive, brute-force approach. For the sake of argument, let's peg both files at 40 million lines each.
Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it's a beginner's exercise.
That's one pass through two files for a total of 80 million lines read, yielding two sets of 50 files of 800,000 lines each on average.
Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
Comparison: Now you just have to iterate once through both pairs of file_1_mat1 and file_2_mat1, without need to keep anything in memory, outputting matches to your output file. Repeat for the rest of the partitions in turn. No need for a final 'merge' step (unless you're processing partitions in parallel).
Even without the sorting stage the naive comparison you're already doing should work faster across 50 pairs of files with 800,000 lines each rather than with two files with 40 million lines each.
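The partitioning step might be sketched as follows. It assumes the first whitespace-separated column is the partition key (mat1 ... mat50) and keeps one open writer per key, which is fine for about 50 partitions; Partitioner and its parameters are illustrative names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class Partitioner {

    // Splits `input` into one file per first-column value (e.g. file_1_mat3)
    // inside `outputDir`, holding only one line in memory at a time.
    // Returns the partition files keyed by that first-column value.
    static Map<String, Path> partition(Path input, Path outputDir, String prefix) {
        Map<String, BufferedWriter> writers = new HashMap<>();
        Map<String, Path> files = new TreeMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                String key = trimmed.split("\\s+", 2)[0];
                BufferedWriter writer = writers.get(key);
                if (writer == null) {
                    Path file = outputDir.resolve(prefix + "_" + key);
                    writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                    writers.put(key, writer);
                    files.put(key, file);
                }
                writer.write(trimmed);
                writer.newLine();
            }
            for (BufferedWriter writer : writers.values()) {
                writer.close(); // flushes; a failure here is reported below
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Best-effort cleanup if something failed part-way (sketch only).
            for (BufferedWriter writer : writers.values()) {
                try { writer.close(); } catch (IOException ignored) { }
            }
        }
        return files;
    }
}
```

Running it once with prefix "file_1" and once with "file_2" gives the matching pairs of partition files described above.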

There is a tradeoff: if you read a big chunk of the file, you save disk seek time, but you may have read information you will not need, since the change was encountered in the first lines.
You should probably run some experiments [benchmarks], with varying chunk size, to find out what is the optimal chunk to read, in the average case.

Not sure how good an answer this would be, but have a look at this page: http://c2.com/cgi/wiki?DiffAlgorithm - it summarises a few diff algorithms. The Hunt-McIlroy algorithm is probably the better implementation. From that page there's also a link to a Java implementation of GNU diff. However, I think an implementation in C/C++ compiled into native code will be much faster. If you're stuck with Java, you may want to consider JNI.

Indeed, that could take a while. You have to make roughly 1,200,000,000,000,000 line comparisons (about 30 million lines times about 40 million lines).
There are several possibilities to speed that up by an order of magnitude:
One would be to sort file2 and do kind of a binary search on file level.
Another approach: compute a checksum of each line, and search that. Depending on average line length, the file in question would be much smaller and you really can do a binary search if you store the checksums in a fixed format (i.e. a long)
The number of lines you read at once from file_1 does not matter, however. This is micro-optimization in the face of great complexity.
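The checksum idea could be sketched like this, with CRC32 standing in for "a checksum" (the answer does not name one) and the checksums kept in a sorted long[] for binary search. LineChecksumIndex and mightContain are hypothetical names. With only 32 bits and tens of millions of lines, collisions will happen, so a hit is a candidate that still has to be confirmed against the actual text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Stream;
import java.util.zip.CRC32;

class LineChecksumIndex {

    // One 8-byte entry per line, sorted so it can be binary-searched.
    private final long[] sortedChecksums;

    LineChecksumIndex(Stream<String> lines) {
        this.sortedChecksums = lines
                .mapToLong(LineChecksumIndex::checksum)
                .sorted()
                .toArray();
    }

    // CRC32 of the line's UTF-8 bytes, as a non-negative long.
    static long checksum(String line) {
        CRC32 crc = new CRC32();
        crc.update(line.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // True if some indexed line has the same checksum. Checksums can collide,
    // so a hit must still be confirmed by comparing the actual text.
    boolean mightContain(String line) {
        return Arrays.binarySearch(sortedChecksums, checksum(line)) >= 0;
    }
}
```

The index can be built from Files.lines(path) for one file and probed once per line of the other. This finds identical lines only; it does not address the range matching added in the question's edit.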

If you want a simple approach: you can hash both of the files and compare the hashes. But it's probably faster (especially if the files differ) to use your approach. About the memory consumption: just make sure you use enough memory; using no buffer for this kind of thing is a bad idea.
And all those answers about hashes, checksums etc: those are not faster. You have to read the whole file in both cases. With hashes/checksums you even have to compute something...

What you can do is sort each individual file. e.g. the UNIX sort or similar in Java. You can read the sorted files one line at a time to perform a merge sort.

I have never worked with such huge files but this is my idea and should work.
You could look into hashing, for example using SHA-1.
Import the following
import java.io.FileInputStream;
import java.security.MessageDigest;
Once your text file has been loaded, feed each line to the MessageDigest and, at the end, print out the hash. The example links below go into more depth.
// mdbytes is assumed to hold the finished digest, e.g. the result of MessageDigest.digest()
StringBuffer myBuffer = new StringBuffer("");
// Convert each digest byte to two hex characters
for (int i = 0; i < mdbytes.length; i++) {
    myBuffer.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
}
System.out.println("Computed Hash = " + myBuffer.toString());
SHA Code example focusing on Text File
SO Question about computing SHA in JAVA (Possibly helpful)
Another sample of hashing code.
Simply read each file separately; if the hash value for each file is the same at the end of the process, then the two files are identical. If not, then something is different.
Then, if you get a different value, you can do the super time-consuming line-by-line check.
Overall, it seems that reading line by line would take forever. I would do that only if you are trying to find each individual difference. But I think hashing would be quicker to see if they are the same.
SHA checksum
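For completeness, a self-contained sketch of hashing a whole stream with SHA-1 and rendering the digest as hex. FileDigest and sha1Hex are made-up names; MessageDigest.getInstance("SHA-1") is the standard JDK call. For a file, the stream could come from Files.newInputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FileDigest {

    // Reads the stream in 8 KB chunks and returns its SHA-1 digest as
    // 40 lowercase hex characters. The caller closes the stream.
    static String sha1Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Two files with different digests are certainly different; equal digests only tell you the files are (almost certainly) identical, not which lines match.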

If you want to know exactly if the files are different or not then there isn't a better solution than yours -- comparing sequentially.
However you can make some heuristics that can tell you with some kind of probability if the files are identical.
1) Check file size; that's the easiest.
2) Take a random file position and compare block of bytes starting at this position in the two files.
3) Repeat step 2) to achieve the needed probability.
You should compute and test how many reads (and size of block) are useful for your program.
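The three steps could be sketched as below: compare sizes, then compare a number of blocks at random positions. SampledFileCompare and its parameters are hypothetical, and a result of true only means that no difference was seen in the sampled blocks.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Random;

class SampledFileCompare {

    // Returns false as soon as a difference is found. A result of true means
    // "no difference seen in the sampled blocks", not proof of equality.
    static boolean probablyEqual(Path a, Path b, int samples, int blockSize, long seed) {
        try (RandomAccessFile fileA = new RandomAccessFile(a.toFile(), "r");
             RandomAccessFile fileB = new RandomAccessFile(b.toFile(), "r")) {
            long length = fileA.length();
            if (length != fileB.length()) {
                return false; // step 1: different sizes, so different files
            }
            if (length == 0) {
                return true;
            }
            Random random = new Random(seed);
            byte[] blockA = new byte[blockSize];
            byte[] blockB = new byte[blockSize];
            for (int s = 0; s < samples; s++) {
                // step 2: pick a random position and compare a block there
                long position = Math.floorMod(random.nextLong(), length);
                int n = (int) Math.min(blockSize, length - position);
                fileA.seek(position);
                fileA.readFully(blockA, 0, n);
                fileB.seek(position);
                fileB.readFully(blockB, 0, n);
                for (int i = 0; i < n; i++) {
                    if (blockA[i] != blockB[i]) {
                        return false;
                    }
                }
            }
            return true; // step 3: all sampled blocks matched
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The number of samples and the block size are the knobs to benchmark, as suggested above.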

My solution would be to produce an index of one file first, then use that to do the comparison. This is similar to some of the other answers in that it uses hashing.
You mention that the number of lines is up to about 45 million. This means that you could (potentially) store an index which uses 16 bytes per entry (128 bits) and it would use about 45,000,000*16 = ~685MB of RAM, which isn't unreasonable on a modern system. There are overheads in using the solution I describe below, so you might still find you need to use other techniques such as memory mapped files or disk based tables to create the index. See Hypertable or HBase for an example of how to store the index in a fast disk-based hash table.
So, in full, the algorithm would be something like:
Create a hash map which maps Long to a List of Longs (HashMap<Long, List<Long>>)
Get the hash of each line in the first file (Object.hashCode should be sufficient)
Get the offset in the file of the line so you can find it again later
Add the offset to the list of lines with matching hashCodes in the hash map
Compare each line of the second file to the set of line offsets in the index
Keep any lines which have matching entries
EDIT:
In response to your edited question, this wouldn't really help in itself. You could just hash the first part of the line, but it would only create 50 different entries. You could then create another level in the data structure though, which would map the start of each range to the offset of the line it came from.
So something like index.get("mat32") would return a TreeMap of ranges. You could look up the range preceding the value you are looking for with lowerEntry() (or floorEntry(), if a value equal to a range start should also match). Together this would give you a pretty fast check to see if a given matX/number combination was in one of the ranges you are checking for.
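A possible shape for that two-level index, as a sketch: the outer map is keyed by material and the inner TreeMap maps each range start to its end. It uses floorEntry rather than lowerEntry so that a value equal to a range start counts as a match, and it assumes the ranges of one material do not overlap (with overlapping ranges the nearest start is not necessarily the containing range). RangeLookup is a made-up name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class RangeLookup {

    // material -> (range start -> range end, inclusive)
    private final Map<String, TreeMap<Integer, Integer>> index = new HashMap<>();

    // If two ranges share a start, the longer one is kept.
    void addRange(String material, int start, int end) {
        index.computeIfAbsent(material, k -> new TreeMap<>())
             .merge(start, end, Math::max);
    }

    // Assumes the ranges of one material do not overlap: only the range with
    // the greatest start <= value is checked.
    boolean contains(String material, int value) {
        TreeMap<Integer, Integer> ranges = index.get(material);
        if (ranges == null) {
            return false;
        }
        Map.Entry<Integer, Integer> candidate = ranges.floorEntry(value);
        return candidate != null && value <= candidate.getValue();
    }
}
```

Each lookup is one hash probe plus an O(log n) tree search, instead of a scan over all ranges of that material.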

Try to avoid consuming memory and consume disk instead.
I mean: divide each file into parts of a loadable size and compare them. This may take some extra time, but it will keep you safe from memory limits.

What about using source control like Mercurial? I don't know, maybe it isn't exactly what you want, but this is a tool that is designed to track changes between revisions. You can create a repository, commit the first file, then overwrite it with another one and commit the second one:
hg init some_repo
cd some_repo
cp ~/huge_file1.txt .
hg ci -Am "Committing first huge file."
cp ~/huge_file2.txt huge_file1.txt
hg ci -m "Committing second huge file."
From here you can get a diff, telling you what lines differ. If you could somehow use that diff to determine what lines were the same, you would be all set.
That's just an idea, someone correct me if I'm wrong.

I would try the following: for each file that you are comparing, create temporary files on disk (I refer to them as partial files below), one for each letter of the alphabet plus an additional file for all other characters. Then read the whole file line by line and, while doing so, append each line to the partial file that corresponds to the letter it starts with. Once you have done that for both files, you can limit the comparison to loading two smaller files at a time. A line starting with A, for example, can appear in only one partial file, and there is no need to compare any partial file more than once. If the resulting files are still very large, you can apply the same method to the partial files being compared, splitting them according to their second letter. The trade-off here is the temporary use of a large amount of disk space until the process is finished. The approaches mentioned in other posts here can help in dealing with the partial files more efficiently.

How to treat file contents as String

I am creating a Scrabble game that uses a dictionary. For efficiency, instead of loading the entire dictionary (via txt file) to a Data Structure (Set, List etc.) is there any built in java class that can help me treat the contents of the file as String.
Specifically what I want to do is check whether a word made in the game is a valid word of the dictionary by doing something simple like fileName.contains (word) instead of having a huge list that is memory inefficient and using list.contains (word).
Do you guys have any idea on what I may be able to do. If the dictionary file has to be in something other than a txt file (e.g. xml file), I am open to try that as well.
NOTE: I am not looking for http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29
This method is not a part of the java API.
HashSet didn't come to mind, I was stuck in the idea that all contains () methods used O(n) time, thanks to Bozho for clearing that with me, looks like I will be using a HashSet.

I think your best option is to load them all in memory, in a HashSet. There contains(word) is O(1).
If you are fine with having it in memory, having it as String on which to call contains(..) is much less efficient than a HashSet.
And I have to mention another option - there's a data structure to represent dictionaries - it's called Trie. You can't find an implementation in the JDK though.
A very rough calculation says that with all English words (1 million) you will need ~12 megabytes of RAM, which is a few times less than the default memory settings of the JVM. (1 million * 6 letters on average * 2 bytes per letter = 12 million bytes, which is ~12 megabytes). (Well, perhaps a bit more to store hashes)
If you really insist on not reading it into memory, and you want to scan the file for a given word, you can use a java.util.Scanner and its scanner.findWithHorizon(..). But that would be inefficient - I assume O(n), plus I/O overhead.
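The HashSet option could be as small as the sketch below. It assumes one word per line in the dictionary file and case-insensitive lookups; WordList, load and isWord are illustrative names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Stream;

class WordList {

    private final Set<String> words = new HashSet<>();

    // Assumes one word per line; blank lines are ignored and case is folded.
    WordList(Stream<String> lines) {
        lines.map(String::trim)
             .filter(line -> !line.isEmpty())
             .map(line -> line.toLowerCase(Locale.ROOT))
             .forEach(words::add);
    }

    // Loads the dictionary from a UTF-8 text file.
    static WordList load(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return new WordList(lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Expected O(1) lookup, independent of the dictionary size.
    boolean isWord(String candidate) {
        return words.contains(candidate.trim().toLowerCase(Locale.ROOT));
    }

    int size() {
        return words.size();
    }
}
```

The file is read once at startup; every later check is a single hash lookup.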

While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data-structures that can be used including a Trie or Heap.
The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.
Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.
All of these data-structures (and more) can be implemented to be file-based (use more IO per lookup -- which could actually be less overall -- but save memory) or implemented directly (e.g. use SQLite and let it do its B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification is easy, and "it just works". SQLite is even used in less powerful systems such as Android.
HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:
Dictionary = 5MB.
Loaded into HashSet (assuming lots of overhead) = 20MB.
Memory usage in relation to other things = Minimal (assumes laptop/desktop)
Time to implement with HashSet = 2 Minutes.
I will have only "lost" 2 Minutes if I decide a HashSet wasn't good enough :-)
Happy coding.
Links to random data-structure implementations (may or may not be suitable):
TernarySearchTrie Reads in a flat file (must be specially constructed?)
TrieTree Has support for creating the Trie file from a flat file. Not sure if this Trie works from disk.
FileHash Hash which uses a file backing.
HashStore Another disk-based hash
WB B-Tree Simple B-tree implementation / "database"
SQLite Small embedded RDBMS.
UTF8String Can be used to significantly reduce the memory requirements of using HashSet<String> when using a Latin dictionary. (String in Java uses UTF-16 encoding which is minimum of two bytes/character.)

You need to compress your data to avoid having to store all those words. The way to do so would be a tree in which nodes are letters and leaves reflect the end of a word. This way you're not storing repetitive data such as "the", "there", and "these", where those words all have the same prefix.
There is a way to make this solution even more memory efficient. (Hint: letter order)
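A bare-bones version of such a letter tree (a trie), for illustration only. Because a word can end at an inner node ("the" inside "these"), this sketch marks word ends with a flag rather than relying on leaves. It uses a HashMap per node for simplicity, which is not memory-efficient; whether a trie actually saves memory depends heavily on the node representation.

```java
import java.util.HashMap;
import java.util.Map;

class Trie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true if a word ends at this node
    }

    private final Node root = new Node();

    // Walks (and creates) one node per letter; shared prefixes share nodes.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    // True only for complete words, not for mere prefixes.
    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }
}
```

After adding "the", "there" and "these", the letters t-h-e are stored once and shared by all three words.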

Use the readLine() method of java.io.BufferedReader. It returns a String.
String line = new BufferedReader(new FileReader(file)).readLine();
