Java: Searching for specific word in a text file

I have a large text file containing many of the most popular names. The user enters a specific name, and I want to print the line that contains it. My problem is that if the user enters a name like Alex, every line with a name that contains Alex (Alexander, Alexis, Alexia) gets printed, when I only want the line for Alex. How should I change "if(line.contains(name)){" to fix this?
Each line contains the name, its popularity ranking, and the number of people with that name.
try {
    line = reader.readLine();
    while (line != null) {
        if (line.contains(name)) {
            // Append the matching line; the next line is read once, below
            text += line;
        }
        line = reader.readLine();
    }
} catch (Exception e) {
    System.out.println("Error");
}
System.out.println(text);

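A shorter way is to use Java 8 streams. Note that this filter only finds the name when it has a space on both sides, so it misses a name at the very start or end of a line:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class Test2 {
    public static void main(String[] args) {
        String fileName = "c://lines.txt";
        String name = "nametosearch";
        // try-with-resources closes the file when the stream is done
        try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
            stream.filter(line -> line.contains(" " + name + " ")).forEach(System.out::println);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}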

You can use a regex with word boundaries for this task:
// Pattern.quote() keeps any special characters in the name literal
final String regex = "\\b" + Pattern.quote(name) + "\\b";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(line);
// find() returns false when there is no match; calling group() after a failed find() throws IllegalStateException
if (matcher.find()) {
    text += line;
}
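As a self-contained illustration of the word-boundary idea, here is a minimal sketch. The class name WholeWordSearch, the method matchingLines, and the sample lines are made up for the example; the real file format may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WholeWordSearch {
    // Returns the lines that contain `name` as a whole word.
    public static List<String> matchingLines(List<String> lines, String name) {
        // Pattern.quote keeps regex metacharacters in the user's input literal.
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(name) + "\\b");
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (wholeWord.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in the form: name, rank, count.
        List<String> lines = Arrays.asList("Alex 12 5000", "Alexander 30 2100", "Alexis 45 1800");
        System.out.println(matchingLines(lines, "Alex")); // prints [Alex 12 5000]
    }
}
```

Compiling the pattern once and reusing it for every line avoids recompiling the regex inside the read loop.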

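Replace
line.contains(name)
with
line.equals(name)
Note that equals only matches when the whole line is exactly the name, so it will not help if the line also holds the ranking and the count.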

Related

Buffer Reader code to read input file

I have a text file named "message.txt" that is read with a BufferedReader. Each line of the file contains a "word" and its "meaning", as in this example:
"PS:Primary school"
where PS is the word and Primary school is the meaning.
As the file is read, each line is split at the ":" into "word" and "meaning".
If the "meaning" equals the given input string "f_msg3", then "f_msg3" is shown in the text view "txtView". Otherwise "f_msg" is shown.
But the if condition does not work properly. For example, if "f_msg3" is "Primary school", the text view should show "Primary school", yet it shows "f_msg" instead. ("f_msg3" does not contain any stray characters.)
Can someone explain where I have gone wrong?
try {
BufferedReader file = new BufferedReader(new InputStreamReader(getAssets().open("message.txt")));
String line = "";
while ((line = file.readLine()) != null) {
try {
/*separate the line into two strings at the ":" */
StringTokenizer tokens = new StringTokenizer(line, ":");
String word = tokens.nextToken();
String meaning = tokens.nextToken();
/*compare the given input with the meaning of the read line */
if(meaning.equalsIgnoreCase(f_msg3)) {
txtView.setText(f_msg3);
} else {
txtView.setText(f_msg);
}
} catch (Exception e) {
txtView.setText("Cannot break");
}
}
} catch (IOException e) {
txtView.setText("File not found");
}

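Try this:
...
// Collapse every run of whitespace into a single space before comparing
meaning = meaning.replaceAll("\\s+", " ");
/* compare the given input with the meaning of the read line */
if (meaning.equalsIgnoreCase(f_msg3)) {
    txtView.setText(f_msg3);
} else {
    txtView.setText(f_msg);
}
...
Otherwise, comment out the else branch and it will work: as written, every later line that does not match overwrites the text view with f_msg.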

I don't see any obvious error in your code. It may just be a matter of cleaning the string (removing leading and trailing spaces, newlines and so on) before comparing it.
Try trimming meaning, for example:
...
String meaning = tokens.nextToken();
if(meaning != null) {
meaning = meaning.trim();
}
if(f_msg3.equalsIgnoreCase(meaning)) {
txtView.setText(f_msg3);
} else {
txtView.setText(f_msg);
}
...

StringTokenizer is a legacy class whose use is discouraged in new code, and it brings more machinery than this task needs. String.split does the job:
String[] pair = line.split("\\s*\\:\\s*", 2);
if (pair.length == 2) {
String word = pair[0];
String meaning = pair[1];
...
}
This splits the line into at most 2 parts (the optional second parameter) using a regular expression. \s* matches any run of whitespace, such as tabs and spaces, including none.
You could also load everything into a Properties object. In a properties file the format key=value is the convention, but key:value is also allowed. Some escaping might then be needed.
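Putting the split and trim suggestions together, a lookup could be sketched as below. The class name MeaningLookup and the sample lines are assumptions made for the example, and the Android text view is left out so the code runs anywhere. Returning at the first hit also avoids the situation in the question's loop, where setText is called for every line and a later non-matching line replaces an earlier match.

```java
import java.util.Arrays;
import java.util.List;

public class MeaningLookup {
    // Returns true if any "word:meaning" line has the given meaning (case-insensitive).
    public static boolean containsMeaning(List<String> lines, String wanted) {
        for (String line : lines) {
            // Split at the first ':' only, dropping whitespace around it.
            String[] pair = line.split("\\s*:\\s*", 2);
            if (pair.length == 2 && pair[1].trim().equalsIgnoreCase(wanted.trim())) {
                return true; // stop at the first hit so later lines cannot override it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical file content, one entry per line.
        List<String> lines = Arrays.asList("PS:Primary school", "HS : High school ");
        System.out.println(containsMeaning(lines, "Primary school")); // prints true
        System.out.println(containsMeaning(lines, "high school"));    // prints true
        System.out.println(containsMeaning(lines, "College"));        // prints false
    }
}
```

In the original code, the caller would then set the text view once, after the loop, based on the returned value.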

ArrayList<String> vals = new ArrayList<>(); // typed list, so the String loops below compile
String jmeno = "Adam";
vals.add("Honza");
vals.add("Petr");
vals.add("Jan");
if(!(vals.contains(jmeno))){
vals.add(jmeno);
}else{
System.out.println("Adam is already in the list");
}
for (String jmena : vals){
System.out.println(jmena);
}
try (BufferedReader br = new BufferedReader(new FileReader("dokument.txt")))
{
String aktualni = br.readLine();
int pocetPruchodu = 0;
while (aktualni != null)
{
String[] znak = aktualni.split(";");
System.out.println(znak[pocetPruchodu] + " " +znak[pocetPruchodu + 1]);
aktualni = br.readLine();
}
br.close();
}
catch (IOException e)
{
System.out.println("Operation failed");
}
try (BufferedWriter bw = new BufferedWriter(new FileWriter("dokument2.txt")))
{
int pocetpr = 0;
while (pocetpr < vals.size())
{
bw.write(vals.get(pocetpr));
bw.append(" ");
pocetpr++;
}
bw.close();
}
catch (IOException e)
{
System.out.println("Operation failed");
}

Splitting a text file into multiple files by specific character sequence

I have a file with the following format.
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" marks the beginning of the chunk of text for document ID 1, and ".I 2" marks the beginning of the chunk for document ID 2.
I need to read the text between ".I 1" and ".I 2" and save it as a separate file such as "DOC_ID_1.txt", then read the text between ".I 2" and ".I 3" and save it as "DOC_ID_2.txt", and so on. Assume the number of ".I #" markers is not known in advance.
I have tried this but cannot finish it. Any help will be appreciated.
String inputDocFile="C:\\Dropbox\\Data\\cran.all.1400";
try {
File inputFile = new File(inputDocFile);
FileReader fileReader = new FileReader(inputFile);
BufferedReader bufferedReader = new BufferedReader(fileReader);
String line=null;
String outputDocFileSeperatedByID="DOC_ID_";
//Pattern docHeaderPattern = Pattern.compile(".I ", Pattern.MULTILINE | Pattern.COMMENTS);
ArrayList<ArrayList<String>> result = new ArrayList<> ();
int docID =0;
try {
StringBuilder sb = new StringBuilder();
line = bufferedReader.readLine();
while (line != null) {
if (line.startsWith(".I"))
{
result.add(new ArrayList<String>());
result.get(docID).add(".I");
line = bufferedReader.readLine();
while(line != null && !line.startsWith(".I")){
line = bufferedReader.readLine();
}
++docID;
}
else line = bufferedReader.readLine();
}
} finally {
bufferedReader.close();
}
} catch (IOException ex) {
Logger.getLogger(ReadFile.class.getName()).log(Level.SEVERE, null, ex);
}

You want to find the lines that match ".I n".
The regex you need is: ^\.I \d+$
^ anchors the match at the beginning of the line, so a line with whitespace or text before the ".I" will not match.
\. matches a literal dot (an unescaped . would match any character).
\d+ matches one or more digits, so multi-digit document IDs are handled too.
$ anchors the match at the end of the line, so a line with extra characters after the digits will not match.
Now read the file line by line and keep track of the file to which the current line should be written.
Reading a file line by line is much easier in Java 8 with Files.lines():
private String currentFile = "root.txt";
public static final String REGEX = "^\\.I \\d+$";
public void foo() throws Exception {
    Path path = Paths.get("path/to/your/input/file.txt");
    // try-with-resources closes the underlying file when the stream is done
    try (Stream<String> lines = Files.lines(path)) {
        lines.forEach(line -> {
            if (line.matches(REGEX)) {
                // Extract the number and update currentFile
                currentFile = "DOC_ID_" + line.substring(3) + ".txt";
                System.out.println("Current file is now: " + currentFile);
            } else {
                System.out.println("Writing this line to " + currentFile + ": " + line);
                //Files.write(...);
            }
        });
    }
}
Note: to extract the number I use a plain substring(), which is brittle but easy to understand. A better way is a Pattern and a Matcher.
Use this regex: "\\.I (\\d+)" (the same as before, but with parentheses around the part you want to capture). Then:
Pattern pattern = Pattern.compile("\\.I (\\d+)");
Matcher matcher = pattern.matcher(".I 3");
if (matcher.find()) {
    System.out.println(matcher.group(1)); // prints "3"
}
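The header-matching idea can be tried without touching the disk. The sketch below groups lines in memory by the ".I n" header that precedes them; the class name DocSplitter, the method split, and the sample lines are assumptions made for the example, and the header pattern here escapes the dot and accepts multi-digit IDs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocSplitter {
    // A header line is exactly ".I", one space, and one or more digits.
    private static final Pattern HEADER = Pattern.compile("^\\.I (\\d+)$");

    // Maps each output file name to the text that follows its ".I n" header.
    public static Map<String, StringBuilder> split(List<String> lines) {
        Map<String, StringBuilder> docs = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // New document: start a fresh buffer keyed by the captured ID.
                current = new StringBuilder();
                docs.put("DOC_ID_" + m.group(1) + ".txt", current);
            } else if (current != null) {
                current.append(line).append('\n');
            }
            // Lines before the first header are ignored.
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(".I 1", ".T", "first title", ".I 2", ".T", "second title");
        split(lines).forEach((name, body) -> System.out.println(name + ":\n" + body));
    }
}
```

To produce real files, each entry could be written with Files.write(Paths.get(name), body.toString().getBytes(StandardCharsets.UTF_8)). For a very large input, writing each chunk as soon as the next header appears would keep memory use low.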

Look up regular expressions; Java has built-in libraries for them.
https://docs.oracle.com/javase/tutorial/essential/regex/
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
These links give you a starting point. In effect, you can run a pattern match against the text with a counter and store everything between one match and the next. That text can then be written to a separate file using the Formatter class, documented here:
http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class Test {
    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        String inputFile = "C:\\logs\\test.txt";
        StringBuilder sb = new StringBuilder();
        int count = 1;
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(new File(inputFile)))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(".I")) {
                    // A new document starts: flush the previous one, if any
                    if (sb.length() != 0) {
                        writeDoc(sb, count);
                        count++;
                    }
                    continue;
                }
                // Keep the line break that readLine() strips
                sb.append(line).append(System.lineSeparator());
            }
            // Write the last document, which has no ".I" line after it
            if (sb.length() != 0) {
                writeDoc(sb, count);
            }
        }
    }

    // Writes the buffered text to DOC_ID_<count>.txt and empties the buffer
    private static void writeDoc(StringBuilder sb, int count) throws IOException {
        File file = new File("C:\\logs\\DOC_ID_" + count + ".txt");
        try (PrintWriter writer = new PrintWriter(file, "UTF-8")) {
            writer.print(sb);
        }
        sb.setLength(0);
    }
}

How to apply regex to entire file, not just line after line?

I want to apply my regular expression to all lines of the text file together, not one line at a time.
Currently it matches only when the whole match sits on a single line. If the match continues on the next line, it is not found at all.
class Parser {
public static void main(String[] args) throws IOException {
Pattern patt = Pattern.compile("(include|"
+ "integrate|"
+ "driven based on|"
+ "facilitate through|"
+ "contain|"
+ "using|"
+ "equipped|"
+ "integrate|"
+ "implement|"
+ "utilized to facilitate|"
+ "comprise){1}"
+ "[\\s\\w\\,\\(\\)\\;\\:]*\\."); //Regex
BufferedReader r = new BufferedReader(new FileReader("E:/test/test.txt")); // read the file
String line;
PrintWriter pWriter = null;
while ((line = r.readLine()) != null) {
Matcher matcher = patt.matcher(line);
while (matcher.find()) {
try{
pWriter = new PrintWriter(new BufferedWriter(new FileWriter("E:/test/test1.txt", true)));//append any given input
pWriter.println(matcher.group()); //write the result of matcher to the new file
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
if (pWriter != null){
pWriter.flush();
pWriter.close();
}
}
System.out.println(matcher.group());
}
}
}
}

Change while ((line = r.readLine()) != null) to this:
StringBuilder file = new StringBuilder(); // All lines of the file, joined together
while ((line = r.readLine()) != null) {
    file.append(line).append('\n'); // readLine() strips the line break, so add it back
}
Matcher matcher = patt.matcher(file);
while (matcher.find()) {
    /* Your output code goes here. */
}
This happens because you are matching each line individually. For what you want, the regex has to be applied to the whole file at once.
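To see the difference between per-line and whole-text matching, here is a small sketch. It uses a shortened stand-in for the pattern in the question (three keywords instead of the full list), and the class name WholeTextMatch and the sample text are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeTextMatch {
    // Shortened stand-in for the question's pattern: a keyword, then text up to the next '.'.
    private static final Pattern PATT =
            Pattern.compile("(include|contain|comprise)[\\s\\w,();:]*\\.");

    // Returns every match found in the given text.
    public static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PATT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String firstLine = "The kits include a cable,";
        String secondLine = "a charger and a case. Nothing else.";
        // Matched line by line, the sentence is never seen in full.
        System.out.println(findAll(firstLine).size() + findAll(secondLine).size()); // prints 0
        // Joined with the line break restored, the match spans both lines.
        System.out.println(findAll(firstLine + "\n" + secondLine).size()); // prints 1
    }
}
```

The match can cross the line break because \s inside the character class also matches newline characters.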

Currently the matcher is applied per line; it needs to be applied to the whole file to work as intended.
Quantifiers are greedy: once a keyword is found, the character class consumes everything up to the next character it cannot match, so a single match can span the whole string unless a . (or another character outside the class) stops it:
...
+ "comprise){1}"
+ "[\\s\\w\\,\\(\\)\\;\\:]*\\."); //Regex
The last line matches any run of whitespace, word characters and the listed punctuation, so pretty much anything except a '.'. Also, the {1} and most of the backslashes are superfluous (the characters are inside []):
...
+ "comprise)"
+ "[\\s\\w,();:]*\\."); //Regex
If you don't care about the newline characters, just remove them first and it should work (I see no other way if you have something like "com\nprise" and want to match that):
s = s.replaceAll("\\n+", "");

java string matching from a large text file issue

I would like to implement string matching on a large text file:
1. Replace all the non-alphanumeric characters.
2. Count the occurrences of a specific term in the text file, for example the term "tom". The matching is not case sensitive, so "Tom" should be counted. However, "tomorrow" should not be counted.
Code template one:
try {
in = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
} catch (FileNotFoundException e1) {
System.out.println("Not found the text file: "+inputFile);
}
Scanner scanner = null;
try {
while (( line = in.readLine())!=null){
String newline=line.replaceAll("[^a-zA-Z0-9\\s]", " ").toLowerCase();
scanner = new Scanner(newline);
while (scanner.hasNext()){
String term = scanner.next();
if (term.equalsIgnoreCase(args[1]))
countstr++;
}
}
} catch (IOException e) {
e.printStackTrace();
}
Code template two:
try {
in = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
} catch (FileNotFoundException e1) {
System.out.println("Not found the text file: "+inputFile);
}
Scanner scanner = null;
try {
while (( line = in.readLine())!=null){
String newline=line.replaceAll("[^a-zA-Z0-9\\s]", " ").toLowerCase();
String[] strArray=newline.split(" ");//split by blank space
for (int i = 0; i < strArray.length; i++) {
if (strArray[i].equalsIgnoreCase(args[1]))
countstr++;
}
}
} catch (IOException e) {
e.printStackTrace();
}
Running the two versions gives different results; the Scanner version appears to give the correct one. But for a large text file, the Scanner version runs much slower than the split version. Can anyone explain why and suggest a more efficient solution?

In your first approach you don't need a second scanner. Creating a new Scanner for every line is not a good choice for a large file.
Your line is already converted to lowercase, so lowercase the key once, outside the loop, and use equals inside the loop.
Or match each line against a word-boundary regex:
String key = "\\b" + Pattern.quote("Tom".toLowerCase()) + "\\b";
Pattern p = Pattern.compile(key);
word = word.toLowerCase().replaceAll("[^a-zA-Z0-9\\s]", " ");
Matcher m = p.matcher(word);
// Loop over find() so that every occurrence on the line is counted
while (m.find()) {
    countstr++;
}
Personally I would choose the BufferedReader approach for a large file:
String key = "\\b" + Pattern.quote(args[0].toLowerCase()) + "\\b";
Pattern p = Pattern.compile(key);
try (final BufferedReader br = Files.newBufferedReader(inputFile,
        StandardCharsets.UTF_8)) {
    for (String line; (line = br.readLine()) != null;) {
        // Normalize the line, then count every whole-word occurrence
        line = line.toLowerCase().replaceAll("[^a-zA-Z0-9\\s]", " ");
        Matcher m = p.matcher(line);
        while (m.find()) {
            countstr++;
        }
    }
}
This sample targets Java 7; adapt it as required.
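For reference, whole-word counting can be sketched in a self-contained form. The class name TermCounter and the sample lines are assumptions for the example, and it uses the CASE_INSENSITIVE flag rather than lowercasing each line, which is a swap from the approach above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermCounter {
    // Counts whole-word, case-insensitive occurrences of `term` across all lines.
    public static int count(Iterable<String> lines, String term) {
        // Compile once and reuse for every line; quote() keeps the term literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
        int total = 0;
        for (String line : lines) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical input: "tomorrow" must not be counted.
        System.out.println(count(Arrays.asList("Tom met tom, not tomorrow.", "TOM!"), "tom")); // prints 3
    }
}
```

Because \b already treats punctuation as a word boundary, the punctuation does not have to be stripped first. Digits and underscores count as word characters, so "tom_1" would not be matched.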

Regex in Java is not matching anything from text file

I have a text file with several lines
Category: Type of problem you're having
Description: Overview of the problem
How To Fix: Directions to fix your problem (has carriage
returns, sometimes)
Related Links: Additional Resources
(There are no numbers in my list; adding them was the only way I could think of to make it neater.)
I've been trying to get my code to recognize all of the information between "How To Fix:" and "Related Links" when it spans more than one line. I know from my research that I have to use either (?s) or Pattern.DOTALL, but neither of them seems to be working. I'm fairly new to regex, so I expect it is something elementary. Here is my code:
String fileName = System.getProperty("user.home") + "/Desktop/Test.txt";
try {
FileReader fr = new FileReader(fileName);
BufferedReader br = new BufferedReader(fr);
sc1 = new Scanner(br);
String findingRegex = "(Description:.*)";
String recommRegex = "(?<=How To Fix:)(.*)(?=Related Links)";//regex I'm trying to use
Pattern pFinding = Pattern.compile(findingRegex);
Pattern pRecomm = Pattern.compile(recommRegex, Pattern.DOTALL);
while (sc1.hasNextLine()) {
String clean = sc1.nextLine().trim();
String clean2 = clean.replaceAll("\\\\x\\p{XDigit}{2}", "");
Matcher mFinding = pFinding.matcher(clean2);
Matcher mRecomm = pRecomm.matcher(clean2);
while (mFinding.find()) {
System.out.println(mFinding);
}
while (mRecomm.find()){
System.out.println(mRecomm); //nothing prints?
}
}
br.close();
fr.close();
System.out.println("The following data was imported: ");
try {
tbl.displayAll();
} catch (NullPointerException npe) {
System.out.println("You have no data.");
}
} catch (FileNotFoundException fnfe) {
System.out.println("File named Test.txt was not located on your desktop. Program Terminated.");
System.exit(0);
} catch (IOException ioe) {
System.out.println("The import operation failed. Program Terminated");
System.exit(0);
} finally {
sc1.close();
}
Lastly, I tested my Regex here and it worked as expected?
MY SOLUTION:
String findingRegex = "(?<=Description:)(.*)(?=How To Fix)";
String recommRegex = "(?<=How To Fix:)(.*)(?=Related Links)";
Pattern pFinding = Pattern.compile(findingRegex, Pattern.DOTALL);
Pattern pRecomm = Pattern.compile(recommRegex, Pattern.DOTALL);
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null){
sb.append(line);
sb.append(System.lineSeparator());
line = br.readLine();
}
String newFile = sb.toString();
Matcher mFinding = pFinding.matcher(newFile);
Matcher mRecomm = pRecomm.matcher(newFile);
while (mFinding.find()) {
    System.out.println(mFinding.group(1).trim()); // print the captured text, not the Matcher object
}
while (mRecomm.find()) {
    System.out.println(mRecomm.group(1).trim());
}

Here:
String clean = sc1.nextLine().trim();
You are breaking your input up by line. But then you're trying to match multiple lines. There aren't multiple lines to match, because you only kept the one.
You could read the entire file into memory first, and then match against it. Or you could do something like
StringBuilder sb = new StringBuilder();
int state = 0;
while (sc1.hasNextLine()) {
String line = sc1.nextLine();
if (line.contains("How To Fix:")) {
state = 1;
}
if (state == 1) {
sb.append(line);
}
if (line.contains("Related Links:")) {
state = 0;
}
}
(You'll need to modify this if you need to match more than once per file.)
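The read-everything-first alternative can be sketched as follows. The class name SectionExtractor and the sample text are assumptions for the example. The pattern is the one from the question with a reluctant .*? in place of .*, so that a file with several records stops at the first "Related Links" after each "How To Fix:".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionExtractor {
    // DOTALL lets '.' match line breaks, so the section may span several lines.
    private static final Pattern HOW_TO_FIX =
            Pattern.compile("(?<=How To Fix:)(.*?)(?=Related Links)", Pattern.DOTALL);

    // Returns the text of the first "How To Fix" section, or null if there is none.
    public static String howToFix(String wholeFile) {
        Matcher m = HOW_TO_FIX.matcher(wholeFile);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical file content, already read into one string.
        String text = "Category: X\nDescription: Y\n"
                + "How To Fix: step one\nstep two\nRelated Links: Z\n";
        System.out.println(howToFix(text));
    }
}
```

The key point is that the matcher receives the whole file as one string. The DOTALL flag has no effect when each input is a single line.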
