I'm having a problem with a BufferedWriter. I am reading in a 50,000-word word list, running a stemming algorithm over it, and creating a new word list that contains just the word stems. Instead of containing any stems, however, the new file literally just contains:
-
Here is my code:
public static void main(String[] args) {
BufferedReader reader=null;
BufferedWriter writer=null;
try {
writer = new BufferedWriter(new FileWriter(new File("src/newwordlist.txt")));
HashSet<String> db = new HashSet<String>();
reader = new BufferedReader(new InputStreamReader(new FileInputStream("src/wordlist"),"UTF-8"));
String word;
int i=0;
while ((word=reader.readLine())!=null) {
i++;
Stemmer s= new Stemmer();
s.addword(word);
s.stem();
String stem =s.toString();
if(!db.contains(stem)){
db.add(stem);
writer.write(stem);
//System.out.println(stem);
}
}
System.out.println("Reduced file from " + i + " words to " + db.size());
reader.close();
writer.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
The output I get on the console is:
Reduced file from 58110 words to 28201
So I know it's working. I've also tried changing writer.write(stem); to writer.write("hi"); and I still get the same output in newwordlist.txt.
I know it's no fault of the Stemmer class: I've tried printing the stem string (where I commented out the code) and that produced the correct output on the console, so the fault must be with the writer, but I don't understand what it is.
Edit 1
I simplified the code to:
BufferedReader reader=null;
BufferedWriter writer=null;
try {
writer = new BufferedWriter(new FileWriter(new File("src/newwordlist.txt")));
HashSet<String> db = new HashSet<String>();
reader = new BufferedReader(new InputStreamReader(new FileInputStream("src/wordlist.txt"),"UTF-8"));
String word;
int i=0;
while ((word=reader.readLine())!=null) {
i++;
if(!db.contains(word)){
db.add(word);
writer.write("hi");
}
}
System.out.println("Reduced file from " + i + " words to " + db.size());
reader.close();
writer.close();
} catch (IOException e1) {
e1.printStackTrace();
}
Now I get this console output:
Reduced file from 58110 words to 58109
But the output file is still blank.
I would expect the code as given in the Question to produce a file that consists of one line, consisting of all of the "stems" concatenated. (Or in the "hi" version, one line consisting of "hihihi...." repeated a large number of times.)
It is conceivable that whatever you are using to view the file cannot cope with an input file that consists of many thousands of characters ... and no end-of-line.
Change
writer.write(stem);
to
writer.write(stem);
writer.write(EOL);
where EOL is the platform specific end-of-line sequence.
Assuming you are using Java 7 or later, it would be better to use try-with-resources to make sure that the output stream is always closed / flushed, even if there is an error:
public static void main(String[] args) {
try (BufferedReader reader = new BufferedReader(
new InputStreamReader(new FileInputStream("src/wordlist"), "UTF-8"));
BufferedWriter writer = new BufferedWriter(new FileWriter(
new File("src/newwordlist.txt")))) {
HashSet<String> db = new HashSet<>();
String EOL = System.getProperty("line.separator");
String word;
int i = 0;
while ((word = reader.readLine()) != null) {
i++;
Stemmer s = new Stemmer();
s.addword(word);
s.stem();
String stem = s.toString();
if (db.add(stem)) {
writer.write(stem);
writer.write(EOL);
}
}
System.out.println("Reduced file from " + i + " words to " + db.size());
} catch (IOException e1) {
e1.printStackTrace();
}
}
(I tidied up a couple of other things too ...)
The reason you get Reduced file from 58110 words to 58109 console output is that you only have one System.out.println statement after the loop.
The writer writes words only to the output file src/newwordlist.txt and not to the console. If you want your program to print words to the console as well, add a System.out.println(word) after writer.write("hi");
Hope this helps...
Works for me. Is this your exact class, did you edit it before pasting in?
wordlist;
the
cat
sat
on
the
mat
newwordlist.txt;
thecatsatonmat
My Stemmer just returns the word you gave it.
public class Stemmer {
private String word;
public void addword(String word) {
this.word = word;
}
public void stem() {
// TODO Auto-generated method stub
}
@Override
public String toString() {
return word;
}
}
According to the Java documentation you need to use BufferedWriter.write() as follows:
write(string,offset,length);
so try:
writer.write(stem,0,stem.length());
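For comparison, BufferedWriter also inherits the single-argument write(String) from java.io.Writer, so the two calls should produce the same output. A minimal sketch to check this, assuming a made-up class name (WriteOverloads) and using a StringWriter in place of the real file:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

class WriteOverloads {
    // Writes the stem with the one-argument overload inherited from Writer.
    static String viaWriteString(String stem) throws IOException {
        StringWriter sink = new StringWriter();
        try (BufferedWriter writer = new BufferedWriter(sink)) {
            writer.write(stem);
        }
        return sink.toString();
    }

    // Writes the stem with the explicit (string, offset, length) overload.
    static String viaWriteRange(String stem) throws IOException {
        StringWriter sink = new StringWriter();
        try (BufferedWriter writer = new BufferedWriter(sink)) {
            writer.write(stem, 0, stem.length());
        }
        return sink.toString();
    }

    public static void main(String[] args) throws IOException {
        // Compares the text produced by the two overloads.
        System.out.println(viaWriteString("stem").equals(viaWriteRange("stem")));
    }
}
```

If both give the same text, the choice of overload is unlikely to be what leaves the file looking empty.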
When I run your edited code I get one line with
hihihihihihihihihihihihihi ............
As expected.
Perhaps you intended to add newline characters like this.
if(!db.contains(word)){
db.add(word);
writer.write(word);
writer.write("\n");
}
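If the goal is one word per line, BufferedWriter.newLine() writes the platform's line separator, which may display better in some viewers than a hard-coded "\n". A minimal sketch, assuming a made-up class name (LineWriterDemo) and a StringWriter standing in for the output file:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LineWriterDemo {
    // Writes each word the first time it is seen, one word per line.
    static void writeUnique(List<String> words, Writer out) throws IOException {
        Set<String> db = new HashSet<>();
        BufferedWriter writer = new BufferedWriter(out);
        for (String word : words) {
            if (db.add(word)) {      // add() returns false for duplicates
                writer.write(word);
                writer.newLine();    // platform-specific end of line
            }
        }
        writer.flush();              // push buffered text through to out
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeUnique(Arrays.asList("the", "cat", "sat", "on", "the", "mat"), out);
        System.out.print(out);
    }
}
```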
Related
I'm curious how to create an inverted index on data that doesn't fit into memory. Right now I'm reading a directory of files and indexing each file based on its contents, using a HashMap to store the index. The code below is a snippet from a function that I call on an entire directory. What do I do if the directory is massive and the HashMap can't hold all the entries? Yes, this does sound like premature optimization; I'm just having fun. I don't want to use Lucene, so please don't mention it; I'm tired of seeing it as the majority answer to "index" questions. The HashMap is my only constraint; everything else is stored in files so I can easily reference it later.
I'm just curious how I can do this, since the map stores entries like so:
keyword -> file1,file2,file3,etc..(locations)
keyword2 -> file9,file11,file13,etc..(locations)
My thought was to create a file that could somehow update itself to match the format above, but I feel that's not efficient.
Code Snippet
br = new BufferedReader(new FileReader(file));
while ((line = br.readLine()) != null) {
for (String _word : line.split("\\W+")) {
word = _word.toLowerCase();
if (!ignore_words.contains(word)) {
fileLocations = index.get(word);
if (fileLocations == null) {
fileLocations = new LinkedList<Long>();
index.put(word, fileLocations);
}
fileLocations.add(file_offset);
}
}
}
br.close();
Update:
So I managed to come up with something, but performance-wise I feel this is slow, especially with a large amount of data. I basically created a file that has a word and its offset on each line, one line per occurrence of the word. Let's name it index.txt.
It has a format like so:
word1:offset
word2:offset
word1:offset <-encountered again.
word3:offset
etc...
I then created a separate file for each word and appended the offset to that file each time the word was encountered in index.txt.
So basically the format of each word file is like so:
word1.txt -- Format
word1:offset1:offset2:offset3:offset4...and so on
Each time word1 is encountered in index.txt, its offset is appended to the end of word1.txt.
Then finally, I go through all the word files I created and overwrite the index.txt file with the final output in the index file looking like so
word1:offset1:offset2:offset3:offset4:...
word2:offset9:offset11:offset13:offset14:...
etc..
Then to finish it up, I delete all the word files.
The nasty code snippet for this is below; it's a fair amount.
public void createIndex(String word, long file_offset)
{
PrintWriter writer;
try {
writer = new PrintWriter(new FileWriter(this.file,true));
writer.write(word + ":" + file_offset + "\n");
writer.close();
}
catch (IOException ioe)
{
ioe.printStackTrace();
}
}
public void mergeFiles()
{
String line;
String wordLine;
String[] contents;
String[] wordContents;
BufferedReader reader;
BufferedReader mergeReader;
PrintWriter writer;
PrintWriter mergeWriter;
try {
reader = new BufferedReader(new FileReader(this.file));
while((line = reader.readLine()) != null)
{
contents = line.split(":");
writer = new PrintWriter(new FileWriter(
new File(contents[0] + ".txt"),true));
if(this.words.get(contents[0]) == null)
{
this.words.put(contents[0], contents[0]);
writer.write(contents[0] + ":");
}
writer.write(contents[1] + ":");
writer.close();
}
//This could be put in its own method below.
mergeWriter = new PrintWriter(new FileWriter(this.file));
for(String word : this.words.keySet())
{
mergeReader = new BufferedReader(
new FileReader(new File(word + ".txt")));
while((wordLine = mergeReader.readLine()) != null)
{
mergeWriter.write(wordLine + "\n");
}
mergeReader.close();
}
mergeWriter.close();
deleteFiles();
}
catch(IOException ioe)
{
ioe.printStackTrace();
}
}
public void deleteFiles()
{
File toDelete;
for(String word : this.words.keySet())
{
toDelete = new File(word + ".txt");
if(toDelete.exists())
{
toDelete.delete();
}
}
}
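One common way around the memory limit is sort-based (block) indexing: fill the in-memory map up to some size cap, flush it to disk as a sorted "run", start a fresh map, and merge the runs at the end. The sketch below assumes the same word:offset1:offset2 line format as above; the class and method names (RunMerger, writeRun, mergeRuns) are made up for illustration and are not part of the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RunMerger {
    // Writes one in-memory block as a sorted run: "word:offset1:offset2" per line.
    static void writeRun(TreeMap<String, List<Long>> block, Writer out) throws IOException {
        BufferedWriter writer = new BufferedWriter(out);
        for (Map.Entry<String, List<Long>> e : block.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey());
            for (long offset : e.getValue()) {
                sb.append(':').append(offset);
            }
            writer.write(sb.toString());
            writer.newLine();
        }
        writer.flush();
    }

    // Merges two sorted runs into one sorted run, joining the offsets of equal words.
    // Only one line per run is held in memory at a time.
    static void mergeRuns(Reader a, Reader b, Writer out) throws IOException {
        BufferedReader ra = new BufferedReader(a);
        BufferedReader rb = new BufferedReader(b);
        BufferedWriter writer = new BufferedWriter(out);
        String la = ra.readLine();
        String lb = rb.readLine();
        while (la != null || lb != null) {
            String merged;
            if (lb == null) {
                merged = la; la = ra.readLine();
            } else if (la == null) {
                merged = lb; lb = rb.readLine();
            } else {
                int cmp = word(la).compareTo(word(lb));
                if (cmp < 0) {
                    merged = la; la = ra.readLine();
                } else if (cmp > 0) {
                    merged = lb; lb = rb.readLine();
                } else {
                    // Same word in both runs: append b's offsets to a's line.
                    merged = la + lb.substring(word(lb).length());
                    la = ra.readLine();
                    lb = rb.readLine();
                }
            }
            writer.write(merged);
            writer.newLine();
        }
        writer.flush();
    }

    // Returns the part of the line before the first ':' (the whole line if none).
    static String word(String line) {
        int i = line.indexOf(':');
        return i < 0 ? line : line.substring(0, i);
    }
}
```

With more than two runs, the same merge can be applied repeatedly, or generalized to a k-way merge with a priority queue. Either way the index is written sequentially, which avoids opening one file per word.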
I have a text file with an integer on each line, ordered from least to greatest, and I want to put them in a new text file with any duplicate numbers removed.
I've managed to read in the text file and print the numbers on the screen, but I'm unsure on how to actually write them in a new file, with duplicates removed?
public static void main(String[] args)
{
try
{
FileReader fr = new FileReader("sample.txt");
BufferedReader br = new BufferedReader(fr);
String str;
while ((str = br.readLine()) != null) {
out.println(str + "\n");
}
br.close();
}
catch (IOException e) {
out.println("File not found");
}
}
When reading the file, you could add the numbers to a Set, which is a data structure that doesn't allow duplicate values (just Google for "java collections" for more details)
Then you iterate through this Set, writing the numbers to a FileOutputStream (google for "java io" for more details)
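A minimal sketch of that idea, under a few assumptions: the class name (UniqueLines) and the output file name sample-unique.txt are made up, and a FileWriter wrapped in a BufferedWriter is used instead of a raw FileOutputStream because the data is text. A LinkedHashSet keeps the numbers in the order they were read, so already-sorted input stays sorted.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueLines {
    // Copies each distinct non-blank line from in to out, keeping first-seen order.
    // Returns the number of distinct lines written.
    static int copyUnique(Reader in, Writer out) throws IOException {
        Set<String> seen = new LinkedHashSet<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) {
                seen.add(line);      // duplicates are silently ignored by the Set
            }
        }
        BufferedWriter writer = new BufferedWriter(out);
        for (String s : seen) {
            writer.write(s);
            writer.newLine();
        }
        writer.flush();
        return seen.size();
    }

    public static void main(String[] args) throws IOException {
        // "sample-unique.txt" is a hypothetical output name.
        try (Reader in = new FileReader("sample.txt");
             Writer out = new FileWriter("sample-unique.txt")) {
            System.out.println(copyUnique(in, out) + " unique numbers written");
        }
    }
}
```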
Instead of printing each of the numbers, add them to an Array. After you've added all the integers, you can cycle through the array to remove duplicates (sample code for this can be found fairly easily).
Once you have an array, use BufferedWriter to write to an output file. Example code for how to do this can be found here: https://www.mkyong.com/java/how-to-write-to-file-in-java-bufferedwriter-example/
Alternatively, use a Set, and BufferedWriter should still work in the same way.
Assuming the input file is already ordered:
public class Question42475459 {
public static void main(final String[] args) throws IOException {
final String inFile = "sample.txt";
try (final Scanner scanner = new Scanner(new BufferedInputStream(new FileInputStream(inFile)), "UTF-8");
BufferedWriter writer = new BufferedWriter(new FileWriter(inFile + ".out", false))) {
String lastLine = null;
while (scanner.hasNext()) {
final String line = scanner.next();
if (!line.equals(lastLine)) {
writer.write(line);
writer.newLine();
lastLine = line;
}
}
}
}
}
I have tried doing it like this:
import java.io.*;
public class ConvertChar {
public static void main(String args[]) {
Long now = System.nanoTime();
String nomCompletFichier = "C:\\Users\\aahamed\\Desktop\\test\\test.xml";
Convert(nomCompletFichier);
Long inter = System.nanoTime() - now;
System.out.println(inter);
}
public static void Convert(String nomCompletFichier) {
FileWriter writer = null;
BufferedReader reader = null;
try {
File file = new File(nomCompletFichier);
reader = new BufferedReader(new FileReader(file));
String oldtext = "";
while (reader.ready()) {
oldtext += reader.readLine() + "\n";
}
reader.close();
// replace a word in a file
// String newtext = oldtext.replaceAll("drink", "Love");
// To replace a line in a file
String newtext = oldtext.replaceAll("&(?!amp;)", "&amp;");
writer = new FileWriter(file);
writer.write(newtext);
writer.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
}
However the code above takes more time to execute than creating two different files:
import java.io.*;
public class ConvertChar {
public static void main(String args[]) {
Long now = System.nanoTime();
String nomCompletFichier = "C:\\Users\\aahamed\\Desktop\\test\\test.xml";
Convert(nomCompletFichier);
Long inter = System.nanoTime() - now;
System.out.println(inter);
}
private static void Convert(String nomCompletFichier) {
BufferedReader br = null;
BufferedWriter bw = null;
try {
File file = new File(nomCompletFichier);
File tempFile = File.createTempFile("buffer", ".tmp");
bw = new BufferedWriter(new FileWriter(tempFile, true));
br = new BufferedReader(new FileReader(file));
while (br.ready()) {
bw.write(br.readLine().replaceAll("&(?!amp;)", "&amp;") + "\n");
}
bw.close();
br.close();
file.delete();
tempFile.renameTo(file);
} catch (IOException e) {
// writeLog("Erreur lors de la conversion des caractères : " + e.getMessage(), 0);
} finally {
try {
bw.close();
} catch (Exception ignore) {
}
try {
br.close();
} catch (Exception ignore) {
}
}
}
}
Is there any way to do what the second version does without creating a temp file, while also reducing the execution time? I am optimizing this code.
The main reason why your first program is slow is probably that it's building up the string oldtext incrementally. The problem with that is that each time you add another line to it it may need to make a copy of it. Since each copy takes time roughly proportional to the length of the string being copied, your execution time will scale like the square of the size of your input file.
You can check whether this is your problem by trying with files of different lengths and seeing how the runtime depends on the file size.
If so, one easy way to get around the problem is Java's StringBuilder class which is intended for exactly this task: building up a large string incrementally.
The main culprit in your first example is that you're building oldtext inefficiently using String concatenations, as explained here. This allocates a new string for every concatenation. Java provides you StringBuilder for building strings:
StringBuilder builder = new StringBuilder();
while(reader.ready()){
builder.append(reader.readLine());
builder.append("\n");
}
String oldtext = builder.toString();
You can also do the replacement when you're building your text in StringBuilder. Another problem with your code is that you shouldn't use ready() to check if there is some content left in the file - check the result of readLine(). Finally, closing the stream should be in a finally or try-with-resources block. The result could look like this:
StringBuilder builder = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
String line = reader.readLine();
while (line != null) {
builder.append(line.replaceAll("&(?!amp;)", "&amp;"));
builder.append('\n');
line = reader.readLine();
}
}
String newText = builder.toString();
Writing to a temporary file is a good solution too, though. The amount of I/O, which is the slowest to handle, is the same in both cases - read the full content once, write result once.
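For reference, the temporary-file variant can be written quite compactly with java.nio.file. This is only a sketch under some assumptions: the class name (EscapeAmpersands) is made up, the file is assumed to be UTF-8 (the original code uses the platform default charset), and the pattern is compiled once instead of on every line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.regex.Pattern;

class EscapeAmpersands {
    // Matches an '&' that is not already the start of "&amp;".
    private static final Pattern BARE_AMP = Pattern.compile("&(?!amp;)");

    // Escapes every bare '&' in a single line.
    static String escapeLine(String line) {
        return BARE_AMP.matcher(line).replaceAll("&amp;");
    }

    // Streams the file line by line into a temp file, then swaps it into place.
    static void convert(Path file) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "buffer", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(escapeLine(line));
                writer.newLine();
            }
        }
        // Both streams are closed here, so the temp file can replace the original.
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        convert(Paths.get(args[0]));
    }
}
```

Creating the temp file in the same directory as the target keeps the final move on one file system, and only one line is held in memory at a time.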
I am creating a registry snapshot with the command:
Runtime.getRuntime().exec("REG EXPORT HKLM " + pathVariable + "\\HKLM.txt /y");
I am then parsing through this file trying to group the registry entries into a single String as they are often broken up over multiple lines. When I use this bit of code I am always getting the "NUL" character for every even character.
String line, concatLine;
Scanner scanner;
try {
scanner = new Scanner(myFile);
line = null;
concatLine = "";
while(scanner.hasNextLine()){
line = scanner.nextLine();
if(line !=null && !(line.isEmpty())){
concatLine += line;
}
else if(!(concatLine.equals(""))){
System.out.println(concatLine);
concatLine = "";
}
}
} catch (IOException e) {//Catch I/O Exceptions
System.err.println(e);
}
I am looking at the file before scanning it in NP++ and there are no "NUL" characters, but if I write these concatenated lines to a file the entire file has them between each expected character.
In my search to understand the problem I came across Java reading and writing practices, which is definitely worth looking over. Apart from that, it seems like the early comments were correct. If the file is opened as a UTF-16 stream, and written as such, then the output is without the null characters. By the way, you will also need to deal with escaped newlines in the registry dump, because if you don't you will end up with things like "00,00,\ 00," where you should have "00,00,00,".
Here is an example:
import java.io.*;
import java.util.*;
import static java.lang.System.out;
public class ReadReg {
public static void main(String[] argv){
String line=null; StringBuilder sb = new StringBuilder();
Scanner scanner;
FileOutputStream fos;
BufferedOutputStream bos; OutputStreamWriter fosw;
try {
scanner = new Scanner(new File("hklm-hw.txt"), "UTF-16");
fos = new FileOutputStream("hklm-hw.cat.txt");
bos = new BufferedOutputStream(fos);
fosw = new OutputStreamWriter(bos, "UTF-16");
while (scanner.hasNextLine()) {
sb.append( line = scanner.nextLine());
if (line.isEmpty()) {
sb.append("\n");
}
}
if (null != scanner.ioException()) {
out.format("scanner ioe:\n\t%s\n", scanner.ioException().getMessage());
//scanner.ioException().printStackTrace();
}
fosw.write( sb.toString(), 0, sb.length());
fosw.flush();
fosw.close();
scanner.close();
} catch (IOException io) {
io.printStackTrace();
}
}
}
Output:
$ javac ReadReg.java && java ReadReg ; file *
hklm-hw.cat.txt: Big-endian UTF-16 Unicode text, with very long lines
hklm-hw.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
ReadReg.class: compiled Java class data, version 50.0 (Java 1.6)
ReadReg.java: C source, ASCII text
1. My Java code takes almost 10-15 minutes to run (the input file is a list of queries, 7200+ lines long). How do I make it run in a shorter time and get the same results?
2. How do I make my code count only the characters a-z, A-Z and 0-9?
3. If I don't do #2, some characters in my output are shown as "?". How do I solve this issue?
// no parameters are used in the main method
public static void main(String[] args) {
// assumes a text file named test.txt in a folder under the C:\file\test.txt
Scanner s = null;
BufferedWriter out = null;
try {
// create a scanner to read from the text file test.txt
FileInputStream fstream = new FileInputStream("C:\\user\\query.txt");
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
// Write to the file
out = new BufferedWriter(new FileWriter("C:\\user\\outputquery.txt"));
// keep getting the next String from the text, separated by white space
// and print each token in a line in the output file
//while (s.hasNext()) {
// String token = s.next();
// System.out.println(token);
// out.write(token + "\r\n");
//}
String strLine="";
String str="";
while ((strLine = br.readLine()) != null) {
str+=strLine;
}
String st=str.replaceAll(" ", "");
char[]third =st.toCharArray();
System.out.println("Character Total");
for(int counter =0;counter<third.length;counter++){
//String ch= "a";
char ch= third[counter];
int count=0;
for ( int i=0; i<third.length; i++){
// if (ch=="a")
if (ch==third[i])
count++;
}
boolean flag=false;
for(int j=counter-1;j>=0;j--){
//if(ch=="b")
if(ch==third[j])
flag=true;
}
if(!flag){
System.out.println(ch+" "+count);
out.write(ch+" "+count);
}
}
// close the output file
out.close();
} catch (IOException e) {
// print any error messages
System.out.println(e.getMessage());
}
// optional to close the scanner here, the close can occur at the end of the code
finally {
if (s != null) {
// close the input file
s.close();
}
}
}
For something like this I would NOT recommend Java. Although it's entirely possible, it is much easier with GAWK or something similar. GAWK also has Java-like syntax, so it's easy to pick up. You should check it out.
SO isn't really the place to ask such a broad how-do-I-do-this-question but I will refer you to the following page on regular expression and text match in Java. Also, check out the Javadocs for regexes.
If you follow that link you should get what you want, else you could post a more specific question back on SO.
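For the character-count question itself, much of the running time likely comes from the nested loops (every character is compared against every other character) and from building the input with repeated string concatenation. A single pass with a map avoids both. A minimal sketch, assuming a made-up class name (CharCounts); it keeps only ASCII letters and digits and, unlike the original, prints in sorted rather than first-seen order:

```java
import java.util.Map;
import java.util.TreeMap;

class CharCounts {
    // Counts ASCII letters and digits in one pass; every other character is skipped.
    static Map<Character, Integer> count(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean wanted = (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '0' && ch <= '9');
            if (wanted) {
                counts.merge(ch, 1, Integer::sum);   // start at 1, else add 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one "character count" pair per line.
        for (Map.Entry<Character, Integer> e : count("ab a?b 1").entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

To process a file, count(line) could be called for each line read and the per-line maps added together, or the loop body moved into the read loop. A LinkedHashMap in place of the TreeMap would preserve first-seen order.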