Creating an inverted index with limited memory in Java

I'm curious how to create an inverted index on data that doesn't fit into memory. Right now I'm reading a directory and indexing the files based on their contents, using a HashMap to store the index. The code below is a snippet from a function that I call on an entire directory. What do I do if the directory is massive and the HashMap can't hold all the entries? Yes, this does sound like premature optimization; I'm just having fun. I don't want to use Lucene, so please don't mention it; I'm tired of seeing it as the majority answer to anything about indexing. The HashMap is my only constraint; everything else is stored in files so it can be referenced easily later on.
I'm just curious how I can do this, since the map stores the index like so:
keyword -> file1,file2,file3,etc..(locations)
keyword2 -> file9,file11,file13,etc..(locations)
My thought was to create a file that could somehow update itself to match the format above, but I feel that's not efficient.
Code Snippet
br = new BufferedReader(new FileReader(file));
while ((line = br.readLine()) != null) {
for (String _word : line.split("\\W+")) {
word = _word.toLowerCase();
if (!ignore_words.contains(word)) {
fileLocations = index.get(word);
if (fileLocations == null) {
fileLocations = new LinkedList<Long>();
index.put(word, fileLocations);
}
fileLocations.add(file_offset);
}
}
}
br.close();
Update:
So I managed to come up with something, but performance-wise I feel it is slow, especially with a large amount of data. I basically created a file that just holds a word and its offset on each line, one line per occurrence of the word. Let's name it index.txt.
It has a format like so:
word1:offset
word2:offset
word1:offset <-encountered again.
word3:offset
etc...
I then created a file for each word and appended the offset to that file each time the word was encountered in index.txt.
So basically the format of the word files is like so:
word1.txt -- Format
word1:offset1:offset2:offset3:offset4...and so on
Each time word1 is encountered in index.txt, its offset is appended to the end of word1.txt.
Finally, I go through all the word files I created and overwrite index.txt, so that the final index file looks like so:
word1:offset1:offset2:offset3:offset4:...
word2:offset9:offset11:offset13:offset14:...
etc..
To finish up, I delete all the word files.
The nasty code snippet for this is below; it's a fair amount.
public void createIndex(String word, long file_offset)
{
PrintWriter writer;
try {
writer = new PrintWriter(new FileWriter(this.file,true));
writer.write(word + ":" + file_offset + "\n");
writer.close();
}
catch (IOException ioe)
{
ioe.printStackTrace();
}
}
public void mergeFiles()
{
String line;
String wordLine;
String[] contents;
String[] wordContents;
BufferedReader reader;
BufferedReader mergeReader;
PrintWriter writer;
PrintWriter mergeWriter;
try {
reader = new BufferedReader(new FileReader(this.file));
while((line = reader.readLine()) != null)
{
contents = line.split(":");
writer = new PrintWriter(new FileWriter(
new File(contents[0] + ".txt"),true));
if(this.words.get(contents[0]) == null)
{
this.words.put(contents[0], contents[0]);
writer.write(contents[0] + ":");
}
writer.write(contents[1] + ":");
writer.close();
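}
// Close the index reader before the index file is overwritten below.
reader.close();
//This could be put in its own method below.
mergeWriter = new PrintWriter(new FileWriter(this.file));
for(String word : this.words.keySet())
{
mergeReader = new BufferedReader(
new FileReader(new File(word + ".txt")));
while((wordLine = mergeReader.readLine()) != null)
{
mergeWriter.write(wordLine + "\n");
}
// Close each word file so it can be deleted afterwards.
mergeReader.close();
}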
mergeWriter.close();
deleteFiles();
}
catch(IOException ioe)
{
ioe.printStackTrace();
}
}
public void deleteFiles()
{
File toDelete;
for(String word : this.words.keySet())
{
toDelete = new File(word + ".txt");
if(toDelete.exists())
{
toDelete.delete();
}
}
}
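One common alternative to the per-word files in the update (not something the question itself uses) is to buffer postings in a sorted map, spill the map to a sorted "run" file whenever an assumed memory budget is reached, and merge the runs at the end, as in an external merge sort. The sketch below illustrates that idea under stated assumptions: SpillingIndexer, maxPostings and the run-file naming are hypothetical, and words are assumed to contain no ':' (which holds for tokens produced by split("\\W+")).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical helper: builds an on-disk inverted index from sorted runs. */
class SpillingIndexer {
    private final TreeMap<String, List<Long>> buffer = new TreeMap<>();
    private final List<Path> runs = new ArrayList<>();
    private final Path dir;
    private final int maxPostings;   // assumed memory budget, counted in postings
    private int postings = 0;

    SpillingIndexer(Path dir, int maxPostings) {
        this.dir = dir;
        this.maxPostings = maxPostings;
    }

    /** Records one occurrence; spills to disk once the budget is reached. */
    void add(String word, long offset) throws IOException {
        buffer.computeIfAbsent(word, k -> new ArrayList<>()).add(offset);
        if (++postings >= maxPostings) flush();
    }

    /** Writes the buffer as one sorted run ("word:off1:off2") and clears it. */
    private void flush() throws IOException {
        if (buffer.isEmpty()) return;
        Path run = Files.createTempFile(dir, "run", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, List<Long>> e : buffer.entrySet()) {
                w.write(e.getKey());
                for (long off : e.getValue()) w.write(":" + off);
                w.newLine();
            }
        }
        runs.add(run);
        buffer.clear();
        postings = 0;
    }

    /** One open run plus its current line, split into word and postings. */
    private static final class Cursor {
        final BufferedReader reader;
        final int order;   // run creation order, keeps offsets in insertion order
        String word, rest;
        Cursor(BufferedReader reader, int order) { this.reader = reader; this.order = order; }
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) { reader.close(); return false; }
            int i = line.indexOf(':');
            word = line.substring(0, i);
            rest = line.substring(i);   // keeps the leading ':'
            return true;
        }
    }

    /** Flushes what is left and k-way merges all runs into one index file. */
    void finish(Path out) throws IOException {
        flush();
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
            int c = a.word.compareTo(b.word);
            return c != 0 ? c : Integer.compare(a.order, b.order);
        });
        for (int i = 0; i < runs.size(); i++) {
            Cursor c = new Cursor(Files.newBufferedReader(runs.get(i), StandardCharsets.UTF_8), i);
            if (c.advance()) heap.add(c);
        }
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String current = null;
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                if (!c.word.equals(current)) {   // first run holding this word
                    if (current != null) w.newLine();
                    current = c.word;
                    w.write(current);
                }
                w.write(c.rest);                 // append this run's offsets
                if (c.advance()) heap.add(c);
            }
            if (current != null) w.newLine();
        }
        for (Path run : runs) Files.delete(run);
    }
}
```

With this layout each run is written once and read once, rather than reopening a file for every occurrence of a word, and only one line per run has to be in memory during the merge.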

Related

How to split single text file into multiple with character as delimiter

I have a text document that has multiple separate entries all compiled into one .log file.
The format of the file looks something like this.
$#UserID#$
Date
User
UserInfo
SteamFriendID
=========================
<p>Message</p>
$#UserID#$
Date
User
UserInfo
SteamFriendID
========================
<p>Message</p>
$#UserID#$
Date
User
UserInfo
SteamFriendID
========================
<p>Message</p>
I'm trying to take everything in between the instances of "$#UserID#$" and write each entry to a separate text file.
So far, from what I've found, I tried implementing it with a StringBuilder, something like this:
FileReader fr = new FileReader("Path to raw file.");
int idCount = 1;
FileWriter fw = new FileWriter("Path to parsed files" + idCount);
BufferedReader br = new BufferedReader(fr);
//String line, date, user, userInfo, steamID;
StringBuilder sb = new StringBuilder();
//br.readLine();
while ((line = br.readLine()) != null) {
if(line.substring(0,1).contains("$#")) {
if (sb.length() != 0) {
File file = new File("Path to parsed logs" + idCount);
PrintWriter pw = new PrintWriter(file, "UTF-8");
pw.println(sb.toString());
pw.close();
//System.out.println(sb.toString());
Sb.delete(0, sb.length());
idCount++;
}
continue;
}
sb.append(line + "\r\n");
}
But this only gives me the first two entries in separate parsed files and leaves the third one out for some reason.
The other way I was thinking about doing it was reading in all the lines using .readAllLines(), store the list as an array, loop through the lines to find "$#", get that line's index & then recursively write the lines starting at the index given.
Does anyone know of a better way to do this, or would be willing to explain to me why I'm only getting two of the three entries parsed?
The quick fix is to write the contents of the StringBuilder once more after your while loop, like this:
public static void main(String[] args) {
try {
int idCount = 1;
FileReader fr = new FileReader("<path to desired file>");
BufferedReader br = new BufferedReader(fr);
//String line, date, user, userInfo, steamID;
StringBuilder sb = new StringBuilder();
//br.readLine();
String line = "";
while ((line = br.readLine()) != null) {
if(line.startsWith("$#")) {
if (sb.length() != 0) {
writeFile(sb.toString(), idCount);
System.out.println(sb);
sb.setLength(0);
idCount++;
}
continue;
}
sb.append(line + "\r\n");
}
if (sb.length() != 0) {
writeFile(sb.toString(), idCount);
System.out.println(sb);
idCount++;
}
} catch (IOException e) {
e.printStackTrace();
}
}
private static void writeFile(String content, int id) throws IOException
{
File file = new File("<path to desired dir>\\ID_" + id + ".txt");
file.createNewFile();
PrintWriter pw = new PrintWriter(file, "UTF-8");
pw.println(content);
pw.close();
}
I've changed two additional things:
The condition line.substring(0,1).contains("$#") did not work properly: the substring call returns only one character but is compared against two, so it is never true. I changed it to use the startsWith method.
After the content of the StringBuilder was written to the file, it was not reset or emptied, so the second and third files also contained every previous block (the third file would then equal the input). That is now done with sb.setLength(0);.
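The first point can be checked in isolation. The following is a small sketch, with MarkerCheck, brokenCheck and fixedCheck as made-up names, that compares the original condition with the startsWith replacement on a header line from the question:

```java
class MarkerCheck {
    /** The original test: a 1-character substring can never contain "$#". */
    static boolean brokenCheck(String line) {
        return line.substring(0, 1).contains("$#");
    }

    /** The replacement used in the answer. */
    static boolean fixedCheck(String line) {
        return line.startsWith("$#");
    }

    public static void main(String[] args) {
        String header = "$#UserID#$";
        System.out.println(header.substring(0, 1));  // prints "$"
        System.out.println(brokenCheck(header));     // prints "false"
        System.out.println(fixedCheck(header));      // prints "true"
    }
}
```

A side benefit of startsWith is that it simply returns false for an empty line, whereas substring(0, 1) throws a StringIndexOutOfBoundsException on one.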

Split file into multiple files

I want to split a text file.
I want to split it into pieces of 50 lines each.
For example, if the file has 1010 lines, I should get 21 files.
I know how to count the number of files and the number of lines, but as soon as I write the files, it doesn't work.
I use Camel Simple (Talend), but it's Java code.
private void ExtractOrderFromBAC02(ProducerTemplate producerTemplate, InputStream content, String endpoint, String fileName, HashMap<String, Object> headers){
ArrayList<String> list = new ArrayList<String>();
BufferedReader br = new BufferedReader(new InputStreamReader(content));
String line;
long numSplits = 50;
int sourcesize=0;
int nof=0;
int number = 800;
try {
while((line = br.readLine()) != null){
sourcesize++;
list.add(line);
}
System.out.println("Lines in the file: " + sourcesize);
double numberFiles = (sourcesize/numSplits);
int numberFiles1=(int)numberFiles;
if(sourcesize<=50) {
nof=1;
}
else {
nof=numberFiles1+1;
}
System.out.println("No. of files to be generated :"+nof);
for (int j=1;j<=nof;j++) {
number++;
String Filename = ""+ number;
System.out.println(Filename);
StringBuilder builder = new StringBuilder();
for (String value : list) {
builder.append("/n"+value);
}
producerTemplate.sendBodyAndHeader(endpoint, builder.toString(), "CamelFileName",Filename);
}
}
} catch (IOException e) {
e.printStackTrace();
}
finally{
try {
if(br != null)br.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
For people who don't know Camel, this line is used to send the file:
producerTemplate.sendBodyAndHeader(endpoint, line.toString(), "CamelFileName", Filename);
endpoint ==> Destination (it's ok with another code)
line.toString () ==> Values
And then the file name (it's ok with another code)
You count the lines first:
while((line = br.readLine()) != null){
sourcesize++; }
After that you're at the end of the file, so the second loop reads nothing:
for (int i=1;i<=numSplits;i++) {
while((line = br.readLine()) != null){
You have to seek back to the start of the file before reading again.
But that wastes time and I/O, because you'd read the file twice.
It's better to read the file once, put the lines in a List<String> (resizable), and do the split using the lines stored in memory.
EDIT: it seems you followed my advice and ran into the next issue (which arguably deserved its own question). This loop builds a buffer containing all the lines:
for (String value : list) {
builder.append("/n"+value);
}
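You have to index into the list to build the small files:
for (int k = 0; k < numSplits && current_line < list.size(); k++) {
builder.append(list.get(current_line++)).append("\n");
}
current_line is the global line counter for your file, declared outside the loop over the output files. That way each file gets its own 50 lines :) Note also that the newline escape is "\n"; "/n" appends those two literal characters.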
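The index-based split described above can also be written with List.subList. This is only a sketch: LineChunker and chunk are hypothetical names, and sending each chunk through Camel is left out.

```java
import java.util.ArrayList;
import java.util.List;

class LineChunker {
    /** Splits lines into consecutive chunks of at most chunkSize lines. */
    static List<List<String>> chunk(List<String> lines, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, lines.size());
            chunks.add(lines.subList(start, end));  // a view, no copying
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 1010; i++) {
            lines.add("line " + i);
        }
        List<List<String>> chunks = chunk(lines, 50);
        System.out.println(chunks.size());  // prints 21
        // The body of one output file: the chunk's lines joined with newlines.
        String body = String.join("\n", chunks.get(0));
        System.out.println(body.split("\n").length);  // prints 50
    }
}
```

Deriving the number of files from the chunks also avoids an off-by-one in the question's count: with nof = numberFiles1 + 1, a file of exactly 100 lines would yield three files, the last one empty.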

Words read from file are null

I am attempting to read some words off of the file "words.txt", then use them in other classes of my program when it runs. This is what I have found on the internet, and it doesn't seem to be working properly.
public static List<String> wordsList;
public static void refreshWords(){
String fileName = "words.txt";
String line = null;
try {
FileReader fileReader =
new FileReader(fileName);
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
for(String tempWord : line.split(" ")){
wordsList.add(tempWord);
}
}
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println(
"Unable to open file '" +
fileName + "'");
}
catch(IOException ex) {
System.out.println(
"Error reading file '"
+ fileName + "'");
}
}
public static List<String> getListOfWords(){
return wordsList;
}
From the error message, which appears before the program even gets going and aborts the whole thing, I can tell that the error comes from adding tempWord to wordsList. I would assume that tempWord is null, but I can't find a reason why it would be.
All that I have in the file are a bunch of random words that I thought of off the top of my head, formatted like the following:
this game turtle forest soccer football ball java list annoyed
What you are using there is the old way of doing it (before Java 7).
With Java 7 / 8, reading a file is much easier. So rather than looking for bugs, I'd rewrite this using the new API:
List<String> lines = Files.readAllLines(yourFile.toPath(), StandardCharsets.UTF_8);
See Files.readAllLines(Path, Charset)
Also, in your question, you are splitting lines into words. That's highly unusual; word lists are almost always one word per line.
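Assuming the snippet in the question is complete, there is also a more direct cause: wordsList is declared but never assigned, so it is still null when wordsList.add(tempWord) runs, and that call throws a NullPointerException. tempWord itself is not null. A minimal sketch that combines an initialized list with Files.readAllLines follows; WordLoader and loadWords are hypothetical names, and UTF-8 is an assumption about the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class WordLoader {
    /** Reads a whitespace-separated word file into a freshly created list. */
    static List<String> loadWords(Path file) throws IOException {
        List<String> words = new ArrayList<>();  // initialized before any add()
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {           // blank lines yield one empty token
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

In the question's code the smallest change would be to initialize the field, for example public static List<String> wordsList = new ArrayList<>();.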

Updating a single line on a text file with a Java method

I know questions like this one have been asked before, but this question has to do with the specifics of the code that I have written. I am trying to update a single line of a file so that the change persists after the program terminates and the data can be loaded again. The method that I am writing currently looks like this (Eclipse reports no compile errors):
public static void editLine(String fileName, String name, int element,
String content) throws IOException {
try {
// Open the file specified in the fileName parameter.
FileInputStream fStream = new FileInputStream(fileName);
BufferedReader br = new BufferedReader(new InputStreamReader(
fStream));
String strLine;
StringBuilder fileContent = new StringBuilder();
// Read line by line.
while ((strLine = br.readLine()) != null) {
String tokens[] = strLine.split(" ");
if (tokens.length > 0) {
if (tokens[0].equals(name)) {
tokens[element] = content;
String newLine = tokens[0] + " " + tokens[1] + " "
+ tokens[2];
fileContent.append(newLine);
fileContent.append("\n");
} else {
fileContent.append(strLine);
fileContent.append("\n");
}
}
/*
* File Content now has updated content to be used to override
* content of the text file
*/
FileWriter fStreamWrite = new FileWriter(fileName);
BufferedWriter out = new BufferedWriter(fStreamWrite);
out.write(fileContent.toString());
out.close();
// Close InputStream.
br.close();
}
} catch (IOException e) {
System.out.println("COULD NOT UPDATE FILE!");
System.exit(0);
}
}
If you could look at the code and let me know what you would suggest, that would be wonderful, because currently I am only getting my catch message.
Okay. First off, StringBuilder fileContent = new StringBuilder(); is bad practice here, as the file could well be larger than the available memory. You should not keep much of the file in memory at all. Instead, read into a buffer, process the buffer (adjusting it if necessary), and write the buffer to a new file. When done, delete the old file and rename the new one to the old one's name. Hope this helps.
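As for the catch message itself: in the question's code the write and br.close() sit inside the while loop, so the reader is already closed when readLine() is called for the second line, and that call throws an IOException. The sketch below follows the approach described above (stream to a second file, then replace the original). LineEditor is a hypothetical name, UTF-8 is assumed, and lines with too few tokens are copied unchanged, which is my choice and not something the question specifies.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class LineEditor {
    /** Replaces token `element` on every line whose first token equals `name`. */
    static void editLine(Path file, String name, int element, String content)
            throws IOException {
        // Write to a temporary file next to the original, one line at a time.
        Path tmp = Files.createTempFile(file.toAbsolutePath().getParent(), "edit", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] tokens = line.split(" ");
                if (tokens.length > element && tokens[0].equals(name)) {
                    tokens[element] = content;
                    line = String.join(" ", tokens);
                }
                out.write(line);
                out.newLine();
            }
        }
        // Both streams are closed here; only now replace the original file.
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Only one line is held in memory at a time, and the original file stays intact until the new one has been written completely. One caveat: the temporary file may carry different permissions than the original.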

How to read a file in Android

I have a text file called "high.txt". I need the data inside it for my Android app, but I have absolutely no idea how to read it into an ArrayList of Strings. I tried the normal way of doing it in Java, but apparently that doesn't work in Android, since it can't find the file. So how do I go about doing this? I have put the file in my res folder. How do you take the input stream that you get from opening the file within Android and read it into an ArrayList of Strings? I am stuck on that part.
Basically it would look something like this:
3. What do you do for an upcoming test?
L: make sure I know what I'm studying and really review and study for this thing. Its what Im good at. Understand the material really well.
CL: Time to study. I got this, but I really need to make sure I know it,
M: Tests can be tough, but there are tips and tricks. Focus on the important, interesting stuff. Cram in all the little details just to get past this test.
CR: -sigh- I don't like these tests. Hope I've studied enough to pass or maybe do well.
R: Screw the test. I'll study later, day before should be good.
This is for a sample question and all the lines will be stored as separate strings in the array list.
If you put the text file in your assets folder you can use code like this which I've taken and modified from one of my projects:
public static void importData(Context context) {
try {
BufferedReader br = new BufferedReader(new InputStreamReader(context.getAssets().open("high.txt")));
String line;
while ((line = br.readLine()) != null) {
String[] columns = line.split(",");
Model model = new Model();
model.date = DateUtil.getCalendar(columns[0], "dd/MM/yyyy");
model.name = columns[1];
dbHelper.insertModel(model);
}
} catch (IOException e) {
e.printStackTrace();
}
}
Within the loop you can do anything you need with the columns; this example creates an object from each row and saves it in the database.
For this example the text file would look something like this:
15/04/2013,Bob
03/03/2013,John
21/04/2013,Steve
If you want to read a file from external storage, use the method below.
public void readFileFromExternal(){
String path = Environment.getExternalStorageDirectory().getPath()
+ "/AppTextFile.txt";
try {
BufferedReader reader = new BufferedReader(new FileReader(path));
String line, results = "";
while( ( line = reader.readLine() ) != null)
{
results += line;
}
reader.close();
Log.d("FILE","Data in your file : " + results);
} catch (Exception e) {
// Log the failure instead of silently swallowing it.
Log.e("FILE", "Could not read file", e);
}
}
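//find all files from folder /assets/txt/
// Start with an empty array so the loop below compiles and is skipped if list() fails.
String[] elements = new String[0];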
try {
elements = getAssets().list("txt");
} catch (IOException e) {
e.printStackTrace();
}
//for every files read text per line
for (String fileName : elements) {
Log.d("xxx", "File: " + fileName);
try {
InputStream open = getAssets().open("txt/" + fileName);
InputStreamReader inputStreamReader = new InputStreamReader(open);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
String line = "";
while ((line = bufferedReader.readLine()) != null) {
Log.d("xxx", line);
}
} catch (IOException e) {
e.printStackTrace();
}
}
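Since the question asks specifically for an ArrayList of Strings, the reading loop from the answers above can be wrapped in a small helper. This is a plain-Java sketch: StreamLines and readLines are hypothetical names, and UTF-8 is an assumption about the file's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;

class StreamLines {
    /** Reads every line of the stream into a list and closes the stream. */
    static ArrayList<String> readLines(InputStream in) throws IOException {
        ArrayList<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

On Android the stream would come from the assets folder, as in the first answer: readLines(context.getAssets().open("high.txt")).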
