Stop word removal went wrong - java

For some IR purposes, I would like to extract a text snippet and remove the stop words before analyzing it. To do so, I made a txt file of stop words and am trying to remove those useless words with the following code:
private static void stopWordRemowal() throws FileNotFoundException, IOException {
    Set<String> stopWords = new LinkedHashSet<String>();
    BufferedReader br = new BufferedReader(new FileReader("StopWord.txt"));
    for (String line; (line = br.readLine()) != null;)
        stopWords.add(line.trim());
    BufferedReader br2 = new BufferedReader(new FileReader("text"));
    FileOutputStream theNewWords = new FileOutputStream(temp);
    for (String readReady; (readReady = br2.readLine()) != null;) {
        StringTokenizer tokenizer = new StringTokenizer(readReady);
        String temp = tokenizer.nextToken();
        if (!stopWords.equals(temp)) {
            theNewWords.write(temp.getBytes());
            theNewWords.write(System.getProperty("line.separator").getBytes());
        }
    }
}
But in fact it does not work well. Consider the following example text snippet:
Text summarization is the process of extracting salient information from the source text and to present that
information to the user in the form of summary
The output looks like this:
Text
summarization
is
the
process
of
extracting
salient
information
from
the
source
text
and
to
present
that
information
to
the
user
in
the
form
of
summary
It has almost no effect, but I do not know why.

You should use the contains method of Set, not the equals method:
if(!stopWords.contains(temp)) // does the set contain my string temp?
instead of
if(!stopWords.equals(temp)) // is the set equal to a string? never true
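To illustrate the fix, here is a minimal sketch that applies contains to every token of a line rather than only the first one. The class name StopWordFilter and the method removeStopWords are hypothetical, and the case-insensitive matching is an assumption that the question does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper, not taken from the original post.
class StopWordFilter {

    // Returns the tokens of `text` that are not in `stopWords`.
    // Assumption: stop words are stored in lower case and matched case-insensitively.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        // Visit every token on the line, not just the first one.
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // contains() asks "is this word in the set?"; equals() would compare
            // the whole set with a single String, which is never true.
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "the", "of"));
        // prints [Text, summarization, process, extracting]
        System.out.println(removeStopWords("Text summarization is the process of extracting", stopWords));
    }
}
```

A HashSet is enough here because the order of the stop words does not matter; only the lookup does.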

Related

How to read a list from a text file in java

I am trying to read data from a text file that has multiple lines; the image below shows my text file.
The user gives a keyword, which is the first string of one of the lists in the text file. I want to print the list (line) corresponding to that keyword. For example, if I give the keyword 59d2211ec3671594c987d008f89f043e97670a5ba6f08fe073e465116c35b440
then I want to store [59d2211ec3671594c987d008f89f043e97670a5ba6f08fe073e465116c35b440, id4, id6, id1] as a list.
I have tried using the following function to read the text file and return the data, but it gives me the wrong output.
public static List<String> readLines(File file) throws Exception {
if (!file.exists()) {
return new ArrayList<String>();
}
BufferedReader reader = new BufferedReader(new FileReader(file));
List<String> results = new ArrayList<String>();
String line = reader.readLine();
while (line != null) {
results.add(line);
line = reader.readLine();
}
return results;
}
Can someone guide me on how to implement this the right way?
"if I am giving the keyword"
Well, if you're given a value, then you should be using it:
readLines(File file, String keyword)
"store as a list"
Now, I assume you mean splitting the line into columns. If that's the case, you need to return at least a List<List<String>>.
However, if not, you can do that later, after collecting the lines containing the keyword, like so:
List<String> results = new ArrayList<String>();
String line = reader.readLine();
while (line != null) {
if (line.contains(keyword)) results.add(line);
line = reader.readLine();
}
return results;
FWIW, I would suggest having a look at Java 8 stream functions, including filter, map, and toList
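As a rough sketch of the stream-based approach, the following combines Files.lines with map and filter. The class name KeywordLookup and its methods are hypothetical, and the field separator (commas and/or whitespace, optionally wrapped in square brackets) is an assumption, since the actual file layout is only visible in the question's image.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not taken from the original post.
class KeywordLookup {

    // Splits one line into its fields.
    // Assumption: fields are separated by commas and/or whitespace.
    static List<String> splitFields(String line) {
        return Arrays.stream(line.replaceAll("[\\[\\]]", "").trim().split("[,\\s]+"))
                .filter(field -> !field.isEmpty())
                .collect(Collectors.toList());
    }

    // Returns the fields of the first line whose first field equals `keyword`.
    static Optional<List<String>> findRow(Stream<String> lines, String keyword) {
        return lines
                .map(KeywordLookup::splitFields)
                .filter(fields -> !fields.isEmpty() && fields.get(0).equals(keyword))
                .findFirst();
    }

    // File-based entry point; try-with-resources closes the underlying file.
    static Optional<List<String>> findRow(Path file, String keyword) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return findRow(lines, keyword);
        }
    }
}
```

Comparing the first field with equals, rather than calling line.contains(keyword), avoids matching a line that merely mentions the keyword in a later column.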

Putting a text file into an ArrayList, but if word exist it skips it

I'm struggling a bit here. I'm trying to add each word from a text file to an ArrayList, and every time the reader comes across a word it has already seen, it should skip it. (Does that make sense?)
I don't even know where to start. I think I need one loop that adds the words of the text file to the ArrayList and one that checks whether the word is already in the list. Any ideas?
PS: I just started with Java.
This is what I've done so far; I don't even know if I'm on the right path.
public String findWord(){
int text = 0;
int i = 0;
while sc.hasNextLine()){
wordArray[i] = sc.nextLine();
}
if wordArray[i].contains() {
}
i++;
}
A List (an ArrayList or otherwise) is not the best data structure to use; a Set is better. In pseudo code:
define a Set
for each word
    if adding to the set returns false, skip it
    else do whatever you want to do with the (first time encountered) word
The add() method of Set returns true if the set changed as a result of the call, which only happens if the word isn't already in the set, because sets disallow duplicates.
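The pseudo code can be sketched in Java roughly as follows. The class name UniqueWords and the method firstOccurrences are hypothetical; the sketch assumes that words are separated by whitespace, which is what Scanner.next() uses by default.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper, not taken from the original post.
class UniqueWords {

    // Collects each word the first time it appears, in order of first appearance.
    static List<String> firstOccurrences(Scanner scanner) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        while (scanner.hasNext()) {
            String word = scanner.next();
            // add() returns false when the word is already in the set,
            // so repeated words are skipped.
            if (seen.add(word)) {
                unique.add(word);
            }
        }
        return unique;
    }
}
```

To read from a file, pass new Scanner(new File(path)) instead of a Scanner over a String. If only the distinct words are needed and not a List, a LinkedHashSet alone would keep them in insertion order.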
I once made a similar program; it read through a text file and counted how many times each word came up.
I'd start by importing a Scanner as well as the file classes (these imports need to be at the top of the Java class):
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.File;
import java.io.PrintStream;
import java.util.Scanner;
Then you can create a File, as well as a Scanner reading from that file; make sure to adjust the path to the file accordingly. The new PrintStream is not necessary, but when dealing with a large amount of data I don't like to flood the console.
public static void main(String[] args) throws FileNotFoundException {
File file=new File("E:/Youtube analytics/input/input.txt");
Scanner scanner = new Scanner(file); //will read from the file above
PrintStream out = new PrintStream(new FileOutputStream("E:/Youtube analytics/output/output.txt"));
System.setOut(out);
}
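After this you can use scanner.next() to get the next word, so you would write something like this:
String[] array = new String[MaxAmountOfWords]; // array with room for every distinct word
int numberOfWords = 0;
String currentWord = "";
while (scanner.hasNext()) {
    currentWord = scanner.next();
    // isNotInArray is a helper you still have to write: it should return true
    // when currentWord is not among the first numberOfWords entries of array
    if (isNotInArray(currentWord)) {
        array[numberOfWords] = currentWord;
        numberOfWords++; // only advance when a word was actually stored
    }
}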
If you don't understand any of this or need further guidance to progress, let me know. It is hard to help if we don't know exactly where you are at.
You can try this:
public List<String> getAllWords(String filePath) throws IOException {
    String line;
    List<String> allWords = new ArrayList<String>();
    try (BufferedReader reader = new BufferedReader(new FileReader(new File(filePath)))) {
        // read each line of the file
        while ((line = reader.readLine()) != null) {
            // get each word in the line by splitting on runs of non-word characters
            for (String word : line.split("\\W+")) {
                // skip empty tokens and words that are already in the list
                if (!word.isEmpty() && !allWords.contains(word)) {
                    allWords.add(word);
                }
            }
        }
    }
    return allWords;
}
Best solution is to use a Set. But if you still want to use a List, here goes:
Suppose the file has the following data:
Hi how are you
I am Hardi
Who are you
Code will be:
List<String> list = new ArrayList<>();
// Get the file.
FileInputStream fis = new FileInputStream("C:/Users/hdinesh/Desktop/samples.txt");
//Construct BufferedReader from InputStreamReader
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String line = null;
// Loop through each line in the file
while ((line = br.readLine()) != null) {
// Regex for finding just the words
String[] strArray = line.split("[ ]");
for (int i = 0; i< strArray.length; i++) {
if (!list.contains(strArray[i])) {
list.add(strArray[i]);
}
}
}
br.close();
System.out.println(list.toString());
If your text file has sentences with special characters, you will have to write a regex for that.
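One possible regex for that case is sketched below, assuming a "word" is any run of letters or digits and everything else is a separator. The class name WordSplitter is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the original post.
class WordSplitter {

    // Returns the distinct words of `line`, splitting on every run of
    // characters that are neither letters nor decimal digits.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String token : line.split("[^\\p{L}\\p{Nd}]+")) {
            // split() yields an empty first token when the line starts with a separator.
            if (!token.isEmpty() && !result.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Note that this pattern also splits on apostrophes and hyphens, so "don't" becomes "don" and "t"; add those characters to the class if they should stay inside words.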

White Spaces Java Text File

Is it possible to remove the white spaces of a String in a text file in Java? I have tried my approach, but it isn't working.
public static void main(String[] args) throws IOException {
File f = new File("ejer2.txt");
BufferedReader br = new BufferedReader(new FileReader(f));
String linea = br.readLine();
linea.replaceAll("\\s", "");
while (linea != null) {
System.out.println(linea);
linea = br.readLine();
}
br.close();
}
The only way I can get the white spaces out of the String is by calling the replaceAll method of the String class when I print the line in the while loop, but I'm trying to take them out of the String in the file, and I'm not sure if this is possible.
Try this:
linea = linea.replaceAll("\\s+","");
EDIT: It is because you didn't save the value of your new string in your variable linea. You have to assign it.
If you want to actually replace the spaces in the file, you need to write to the file instead of just reading from it.
You'll need to move the linea.replaceAll call inside your while loop.
You'll need to store all these lines as well - I suggest using a StringBuilder and appending everything you read to the builder (after you run replaceAll).
You'll also need to write the final text to the text file. I suggest using a PrintStream,
e.g. PrintStream out = new PrintStream(new FileOutputStream("finalFile.txt"));
then out.print(yourStringBuilder.toString());
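Put together, those steps might look like the sketch below. The class name WhitespaceStripper and its methods are hypothetical; the sketch assumes the file fits in memory, uses the platform default charset, and ends every output line with "\n".

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.PrintStream;
import java.util.Scanner;

// Hypothetical helper, not taken from the original post.
class WhitespaceStripper {

    // Reads `source` line by line and returns the text with all whitespace
    // inside each line removed; the line breaks themselves are kept.
    static String stripWhitespace(Readable source) {
        StringBuilder builder = new StringBuilder();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNextLine()) {
                // Strings are immutable: replaceAll returns a new String.
                builder.append(scanner.nextLine().replaceAll("\\s+", "")).append('\n');
            }
        }
        return builder.toString();
    }

    // Reads `input` completely, then writes the stripped text to `output`.
    static void rewrite(File input, File output) throws FileNotFoundException {
        String stripped = stripWhitespace(new FileReader(input));
        try (PrintStream out = new PrintStream(new FileOutputStream(output))) {
            out.print(stripped);
        }
    }
}
```

Because the input is read completely and closed before the output is opened, passing the same File for both arguments overwrites the original file with the stripped text.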

java read properties and xml file using stringbuilder

I need to read a set of XML and properties files and parse the data. Currently I am using an InputStream and a StringBuilder to do this, but the result does not have the same layout as the input file. I do not want to remove the white space and new lines. How do I achieve this?
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
}
String s = sb5.toString();
My output is:
#test 123 #test2 345
Expected output is:
#test
123
#test2
345
Any thoughts ? Thanks
br.readLine() consumes the line breaks, so you need to add them to your StringBuilder after appending the line.
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
sb5.append("\n");
}
If you want an extremely simple solution for reading a file to a String, Apache Commons-IO has a method for performing such a task (org.apache.commons.io.FileUtils).
FileUtils.readFileToString(File file, String encoding);
The readLine() method doesn't include the EOL character (\n). So when appending the string to the builder, you need to add the EOL char yourself, like sb5.append(line5+"\n");
The various readLine methods discard the newline from the input.
From the BufferedReader docs:
Returns: A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached
A solution may be as simple as adding back a newline to your StringBuilder for every readLine: sb5.append(line5 + "\n");.
A better alternative is to read into an intermediate buffer first, using the read method and supplying your own char[]. You can still use StringBuilder.append, and you get a String that matches the file contents.
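A minimal sketch of that buffer-based approach is shown here. The class name ExactReader and the buffer size of 4096 chars are arbitrary choices, and wrapping the IOException in an UncheckedIOException is only done to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not taken from the original post.
class ExactReader {

    // Reads everything from `reader` into a String, keeping the line
    // terminators exactly as they appear in the input.
    static String readAll(Reader reader) {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[4096]; // buffer size is an arbitrary choice
        try {
            int count;
            // read() returns the number of chars placed in the buffer,
            // or -1 once the end of the stream has been reached.
            while ((count = reader.read(buffer)) != -1) {
                builder.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return builder.toString();
    }
}
```

With the question's code this would be called as ExactReader.readAll(new InputStreamReader(is)). Unlike appending "\n" after each readLine, it does not turn "\r\n" into "\n" and does not add a line break at the end of a file that has none.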

Java create strings from Buffered Reader and compare Strings

I am using Java + Selenium 1 to test a web application.
I have to read through a text file line by line using BufferedReader.readLine and compare the data that was found to another String.
Is there a way to assign each line to a unique String? I think it would be something like this:
FileInputStream fstream = new FileInputStream("C:\\write.txt");
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
String[] strArray = null;
int p=0;
// Read File Line By Line
while ((strLine = br.readLine()) != null) {
strArray[p] = strLine;
assertTrue(strArray[p].equals(someString));
p=p+1;
}
The problem with this is that you don't know how many lines there are, so you can't size your array correctly. Use a List<String> instead.
In order of decreasing importance:
1. You don't need to store the Strings in an array at all, as pointed out by Perception.
2. You don't know how many lines there are, so as pointed out by Qwerky, if you do need to store them you should use a resizable collection like ArrayList.
3. DataInputStream is not needed: you can just wrap your FileInputStream directly in an InputStreamReader.
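If the lines do need to be stored, the last two points could be combined as in this sketch. The class name LineCollector is hypothetical, and UTF-8 is an assumed encoding that the question does not mention.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the original post.
class LineCollector {

    // Reads all lines from `in` into a resizable list.
    // Assumption: the input is UTF-8 encoded.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        // No DataInputStream: the InputStreamReader wraps the stream directly.
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line); // the list grows as needed, no size required up front
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

For the question's file this would be called as LineCollector.readLines(new FileInputStream("C:\\write.txt")), after which each line can be compared with equals.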
You may want to try something like:
public final static String someString = "someString";

public boolean isMyFileOk(String filename) throws FileNotFoundException {
    // wrap the name in a File; new Scanner(String) would scan the name itself
    Scanner sc = new Scanner(new File(filename));
    boolean fileOk = true;
    // stop at the first line that does not match
    while (sc.hasNextLine() && fileOk) {
        String line = sc.nextLine();
        fileOk = isMyLineOk(line);
    }
    sc.close();
    return fileOk;
}

public boolean isMyLineOk(String line) {
    return line.equals(someString);
}
The Scanner class is usually a great class for reading files :)
And as suggested, you can check one line at a time instead of loading them all into memory before processing them. This may not be an issue if your file is relatively small, but you had better keep your code scalable, especially when it does the exact same thing :)
