java string matching from a large text file issue

I would like to implement string matching over a large text file:
1. Replace all the non-alphanumeric characters.
2. Count the occurrences of a specific term in the text file. For example, when matching the term "tom", the match is not case sensitive, so "Tom" should be counted. However, "tomorrow" should not be counted.
code template one:
try {
in = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
} catch (FileNotFoundException e1) {
System.out.println("Not found the text file: "+inputFile);
}
Scanner scanner = null;
try {
while (( line = in.readLine())!=null){
String newline=line.replaceAll("[^a-zA-Z0-9\\s]", " ").toLowerCase();
scanner = new Scanner(newline);
while (scanner.hasNext()){
String term = scanner.next();
if (term.equalsIgnoreCase(args[1]))
countstr++;
}
}
} catch (IOException e) {
e.printStackTrace();
}
code template two:
try {
in = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
} catch (FileNotFoundException e1) {
System.out.println("Not found the text file: "+inputFile);
}
Scanner scanner = null;
try {
while (( line = in.readLine())!=null){
String newline=line.replaceAll("[^a-zA-Z0-9\\s]", " ").toLowerCase();
String[] strArray=newline.split(" ");//split by blank space
for (int i = 0; i < strArray.length; i++) {
if (strArray[i].equalsIgnoreCase(args[1]))
countstr++;
}
}
} catch (IOException e) {
e.printStackTrace();
}
Running the two versions gives different results, and the Scanner version appears to be the correct one. For a large text file, however, the Scanner version runs much slower than the split version. Can anyone tell me the reason and suggest a more efficient solution?

In your first approach you don't need a Scanner in addition to the BufferedReader. Creating a Scanner for every line is not a good choice for large input.
Your line is already converted to lowercase, so you only need to lowercase the key once, outside the loop, and use equals inside the loop.
Or match each line against a regular expression:
String key = String.valueOf(".*?\\b" + "Tom".toLowerCase() + "\\b.*?");
Pattern p = Pattern.compile(key);
word = word.toLowerCase().replaceAll("[^a-zA-Z0-9\\s]", "");
Matcher m = p.matcher(word);
if (m.find()) {
countstr++;
}
Personally, I would choose the BufferedReader approach for a large file.
String key = String.valueOf(".*?\\b" + args[0].toLowerCase() + "\\b.*?");
Pattern p = Pattern.compile(key);
try (final BufferedReader br = Files.newBufferedReader(inputFile,
StandardCharsets.UTF_8)) {
for (String line; (line = br.readLine()) != null;) {
// processing the line.
line = line.toLowerCase().replaceAll("[^a-zA-Z0-9\\s]", "");
Matcher m = p.matcher(line);
if (m.find()) {
countstr++;
}
}
}
The sample targets Java 7; adapt it if required.
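One likely reason the two templates in the question disagree: the replaceAll call keeps every whitespace character, including tabs, but split(" ") only splits on single spaces, so a token such as "tom\tjerry" stays in one piece, while Scanner splits on any whitespace. A minimal sketch of a counter that avoids this follows. The class and method names are hypothetical and not from the original posts, and it assumes that only ASCII letters and digits form words, as in the question's regex.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original question or answer.
class TermCounter {

    // Counts whole-word, case-insensitive occurrences of term in the text read from reader.
    static long count(Reader reader, String term) throws IOException {
        String key = term.toLowerCase(Locale.ROOT); // lowercase the key once, outside the loop
        long count = 0;
        try (BufferedReader br = new BufferedReader(reader)) {
            for (String line; (line = br.readLine()) != null; ) {
                // Any run of characters other than a-z and 0-9 (spaces, tabs, punctuation)
                // separates two words, so "tomorrow" never equals "tom".
                for (String word : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                    if (word.equals(key)) {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

For a file, the reader could come from Files.newBufferedReader(path, StandardCharsets.UTF_8), as in the answer above. Unlike the per-line find() in that answer, this counts every occurrence on a line rather than at most one per line.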

Related

How to remove a new-line character (\n) from copy-pasted text in Java?

I have code that removes some characters (spaces, tabs) from a string entered by the user and then shows the text:
System.out.println("Text:");
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
try {
String text = bufferedReader.readLine();
text = text.replaceAll("\n", "");
text = text.replaceAll(" ", "");
text = text.replaceAll("\t", "");
System.out.println(text);
} catch (IOException e) {
}
But when I paste a text of several lines:
First Substring Introduced
Second Substring Introduced
Third Substring Introduced
it shows just the substring before the first newline, like:
FirstSubstringIntroduced
I want to obtain the following result for the whole pasted text:
FirstSubstringIntroducedSecondSubstringIntroducedThirdSubstringIntroduced

You are reading just one line, the first one:
String text = bufferedReader.readLine(); //just one line
That's why you got that output that only shows the first line processed. You should make a loop in order to read all of the lines you are entering:
while((text=bufferedReader.readLine())!=null)
{
text = text.replaceAll("\n", "");
text = text.replaceAll(" ", "");
text = text.replaceAll("\t", "");
System.out.print(text);
}
The first iteration will print FirstSubstringIntroduced, the second SecondSubstringIntroduced, and so on, until all the lines are processed.
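The loop can also be wrapped in a small helper that returns the joined text instead of printing it. This is only a sketch: the class and method names are made up for illustration. Note that readLine() already drops the line terminator, and that on interactive input it returns null only at end of input (for example Ctrl-D on Unix or Ctrl-Z on Windows).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper for illustration; the names are not from the original post.
class WhitespaceStripper {

    // Reads every line from reader and joins them with spaces and tabs removed.
    static String stripAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() has already removed "\n" or "\r\n" from the line
                sb.append(line.replace(" ", "").replace("\t", ""));
            }
        }
        return sb.toString();
    }
}
```

With console input, the reader would be new InputStreamReader(System.in), as in the question.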

Try aggregating all lines together, after removing tab and space from each line:
StringBuilder sb = new StringBuilder();
String text = "";
try {
while ((text = br.readLine()) != null) {
text = text.replaceAll("[\t ]", "");
sb.append(text);
}
}
catch (IOException e) {
}
System.out.println(sb);
The issue here is that your BufferedReader is reading one line at a time.
As an alternative, and closer to your current solution, you could just use System.out.print, which does not automatically print a newline, instead of System.out.println:
try {
while ((text = br.readLine()) != null) {
text = text.replaceAll("[\t ]", "");
System.out.print(text);
}
}
catch (IOException e) {
}

Note that String#replaceAll expects a regular expression. String#replace replaces all occurrences of the first argument with the second argument (which is what you want).
System.out.println(text.replace("\n", "").replace("\r", ""));
The method names are a little bit confusing.
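The difference is easiest to see with a character that is special in a regular expression, such as the dot. The sketch below uses a hypothetical class name and compares the two methods.

```java
// Hypothetical demo class for illustration.
class ReplaceDemo {

    // Removes LF and CR literally; no regular expression is involved.
    static String stripLineBreaks(String text) {
        return text.replace("\n", "").replace("\r", "");
    }

    // Same result, written as a single regex character class.
    static String stripLineBreaksRegex(String text) {
        return text.replaceAll("[\\r\\n]", "");
    }

    // replace treats "." as a literal dot: "a.b" becomes "ab".
    static String literalDot(String text) {
        return text.replace(".", "");
    }

    // replaceAll treats "." as "any character": every character of "a.b" is removed.
    static String regexDot(String text) {
        return text.replaceAll(".", "");
    }
}
```

Both methods replace every occurrence; only the interpretation of the first argument differs.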

public static void main(String args[]) {
System.out.println("Text:");
StringBuilder stringBuilder = new StringBuilder();
try (InputStreamReader inputStreamReader = new InputStreamReader(System.in);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
Scanner scanner = new Scanner(bufferedReader);
) {
while (scanner.hasNext()) {
stringBuilder.append(scanner.next());
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println(stringBuilder.toString());
}
I do think this is what you need.

How Can I Read the Movie Names Only? [closed]

I have data like this:
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
Suppose the link part is on the same line as the movie name. I am
only interested in the movie numbers in the leftmost part and the movie names.
How can I read this file in Java and return something like:
1|Toy Story
2|GoldenEye
Thanks for helping in advance.

Pretty easy, just split on " (" and remember to escape it using \\.
public static void main(String[] args) {
String result = movie("1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0");
System.out.println(result); //prints 1|Toy Story
}
public static String movie(String movieString){
return movieString.split(" \\(")[0];
}

You can use regular expressions to extract the part that you want.
It is assumed that a movie title only contains word characters or spaces.
List<String> movieInfos = Arrays.asList(
"1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0",
"2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0",
"3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0",
"4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0",
"5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0"
);
Pattern pattern = Pattern.compile("^(\\d+)\\|([\\w\\s]+) \\(\\d{4}\\).*$");
for (String movieInfo : movieInfos) {
Matcher matcher = pattern.matcher(movieInfo);
if (matcher.matches()) {
String id = matcher.group(1);
String title = matcher.group(2);
System.out.println(String.format("%s|%s", id, title));
} else {
System.out.println("Unexpected data");
}
}

This works only if all the lines are formatted like that.
private static final String FILENAME = "pathToFile";
public static void main(String[] args) {
BufferedReader br = null;
FileReader fr = null;
ArrayList<String> output = new ArrayList<>();
try {
//br = new BufferedReader(new FileReader(FILENAME));
fr = new FileReader(FILENAME);
br = new BufferedReader(fr);
String currentLine;
while ((currentLine= br.readLine()) != null) {
String movie = currentLine.split(" \\(")[0];
output.add(movie);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null)
br.close();
if (fr != null)
fr.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}

Assuming the file format is the same as in your example, read the file line by line, split each line on the "(" parenthesis, and print the first element of the resulting array.
static void readMovieNamesFromFile(String fileName) {
try (BufferedReader br = new BufferedReader(new FileReader(new File(fileName)))) {
String line;
while( (line = br.readLine()) != null){
System.out.println((line.split("\\(")[0]).trim());
}
} catch (IOException e) {
e.printStackTrace();
}
}

Assuming you are reading t.txt
File file = new File("t.txt");
try {
Scanner in = new Scanner(file);
while(in.hasNextLine())
{
String arr[] = in.nextLine().split("\\|");
if(arr.length > 1)
{
System.out.println(arr[0] +"|"+arr[1].split("\\(")[0]);
System.out.println();
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
Will give you as an output
1|Toy Story
2|GoldenEye
3|Four Rooms
4|Get Shorty
5|Copycat
There are two things you have to take care of here.
(Here we assume we are reading the first line.)
Split by |. Since | is a regex metacharacter, you have to escape it with \\. Hence in.nextLine().split("\\|");
Now arr[0] will contain 1 and arr[1] will contain Toy Story (1995). So we split arr[1] on "(". You need the first part, hence arr[1].split("\\(")[0] (you have to escape again, as "(" is also a metacharacter).
PS: the if(arr.length > 1) check is there to skip blank lines so that you don't end up with an ArrayIndexOutOfBoundsException.
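The two steps above can be folded into one small method. This is a sketch with hypothetical names. It assumes the year is always the last parenthesized group of the title field, and it uses lastIndexOf so that a title that itself contains parentheses would keep them.

```java
// Hypothetical helper for illustration; the names are not from the original posts.
class MovieLineParser {

    // Returns "id|title" for a record such as "1|Toy Story (1995)|...",
    // or null when the line has no second field (for example a blank line).
    static String idAndTitle(String line) {
        String[] fields = line.split("\\|", 3); // only the first two fields are needed
        if (fields.length < 2) {
            return null;
        }
        String title = fields[1];
        int paren = title.lastIndexOf(" ("); // start of the assumed year suffix
        if (paren >= 0) {
            title = title.substring(0, paren);
        }
        return fields[0] + "|" + title.trim();
    }
}
```

Called once per line read from the file, it yields 1|Toy Story for the first sample record.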

You can store the record in a String.
For example:
String name = "1|Toy Story (1995)|01-Jan-1995"; // one movie record
Then loop over the characters and stop at the first "(":
for (int i = 0; i < name.length(); i++) {
    if (name.charAt(i) == '(') { // stop when the "(" after the name is reached
        break;
    }
    System.out.print(name.charAt(i));
}
There are other ways to solve it as well.

Buffer Reader code to read input file

I have a text file named "message.txt" which is read using a BufferedReader. Each line of the text file contains both a "word" and a "meaning", as in this example:
"PS:Primary school"
where PS - word, Primary school - meaning
When the file is being read, each line is tokenized to "word" and "meaning" from ":".
If the "meaning" is equal to the given input string called "f_msg3", "f_msg3" is displayed on the text view called "txtView". Otherwise, it displays "f_msg" on the text view.
But the "if condition" is not working properly in this code. For example if "f_msg3" is equal to "Primary school", the output on the text view must be "Primary school". But it gives the output as "f_msg" but not "f_msg3". ("f_msg3" does not contain any unnecessary strings.)
Can someone explain where I have gone wrong?
try {
BufferedReader file = new BufferedReader(new InputStreamReader(getAssets().open("message.txt")));
String line = "";
while ((line = file.readLine()) != null) {
try {
/*separate the line into two strings at the ":" */
StringTokenizer tokens = new StringTokenizer(line, ":");
String word = tokens.nextToken();
String meaning = tokens.nextToken();
/*compare the given input with the meaning of the read line */
if(meaning.equalsIgnoreCase(f_msg3)) {
txtView.setText(f_msg3);
} else {
txtView.setText(f_msg);
}
} catch (Exception e) {
txtView.setText("Cannot break");
}
}
} catch (IOException e) {
txtView.setText("File not found");
}

Try this
............
meaning = meaning.replaceAll("\\s+", " ");
/*compare the given input with the meaning of the read line */
if(meaning.equalsIgnoreCase(f_msg3)) {
txtView.setText(f_msg3);
} else {
txtView.setText(f_msg);
}
............
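Otherwise, comment out the else part and it will work: the loop calls setText for every line, so a later non-matching line overwrites the result of an earlier match.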

I don't see any obvious error in your code; maybe it is just a matter
of cleaning the string (i.e. removing leading and trailing spaces, newlines and so on) before comparing it.
Try trimming meaning, e.g. like this:
...
String meaning = tokens.nextToken();
if(meaning != null) {
meaning = meaning.trim();
}
if(f_msg3.equalsIgnoreCase(meaning)) {
txtView.setText(f_msg3);
} else {
txtView.setText(f_msg);
}
...

A StringTokenizer takes care of numbers (the cause of your error) and other "tokens", so it might be considered to introduce too much complexity.
String[] pair = line.split("\\s*\\:\\s*", 2);
if (pair.length == 2) {
String word = pair[0];
String meaning = pair[1];
...
}
This splits the line into at most 2 parts (the optional second parameter) using a regular expression. \s* matches any run of whitespace, such as tabs and spaces, including none.
You could also load all in a Properties. In a properties file the format key=value is convention, but also key:value is allowed. However then some escaping might be needed.
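Putting the split-with-limit idea together with trimming, the lookup can be written so that it stops at the first match instead of deciding the result line by line. This is a sketch only: the class and method names are hypothetical, and it is plain Java rather than Android code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper for illustration; the names are not from the original posts.
class MeaningLookup {

    // Returns true if any "word:meaning" line has a meaning equal to target,
    // ignoring case and surrounding whitespace.
    static boolean containsMeaning(Reader reader, String target) throws IOException {
        String wanted = target.trim();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] pair = line.split("\\s*:\\s*", 2); // at most two parts
                if (pair.length == 2 && pair[1].trim().equalsIgnoreCase(wanted)) {
                    return true; // stop at the first match
                }
            }
        }
        return false;
    }
}
```

In the question's code, the text view would then be set once after the lookup, to f_msg3 when the result is true and to f_msg otherwise, rather than inside the loop.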

ArrayList<String> vals = new ArrayList<>();
String jmeno = "Adam";
vals.add("Honza");
vals.add("Petr");
vals.add("Jan");
if(!(vals.contains(jmeno))){
vals.add(jmeno);
}else{
System.out.println("Adam is already in the list");
}
for (String jmena : vals){
System.out.println(jmena);
}
try (BufferedReader br = new BufferedReader(new FileReader("dokument.txt")))
{
String aktualni = br.readLine();
int pocetPruchodu = 0;
while (aktualni != null)
{
String[] znak = aktualni.split(";");
System.out.println(znak[pocetPruchodu] + " " +znak[pocetPruchodu + 1]);
aktualni = br.readLine();
}
br.close();
}
catch (IOException e)
{
System.out.println("Operation failed");
}
try (BufferedWriter bw = new BufferedWriter(new FileWriter("dokument2.txt")))
{
int pocetpr = 0;
while (pocetpr < vals.size())
{
bw.write(vals.get(pocetpr));
bw.append(" ");
pocetpr++;
}
bw.close();
}
catch (IOException e)
{
System.out.println("Operation failed");
}

Using trim() in Java to remove parts of an output

I have some code I wrote that outputs a batch file output to a jTextArea. Currently the batch file outputs an active directory query for the computer name, but there is a bunch of stuff that outputs as well that I want to be removed from the output from the variable String trimmedLine. Currently it's still outputting everything else and I can't figure out how to get only the computer name to appear.
Output: "CN=FDCD111304,OU=Workstations,OU=SIM,OU=Accounts,DC=FL,DC=NET"
I want the output to instead just show only this:
FDCD111304
Can anyone show me how to fix my code to only output the computer name and nothing else?
Look at console output (Ignore top line in console output)
btnPingComputer.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent arg0) {
String line;
BufferedWriter bw = null;
BufferedWriter writer =null;
try {
writer = new BufferedWriter(new FileWriter(tempFile));
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
String lineToRemove = "OU=Workstations";
String s = null;
Process p = null;
try {
p = Runtime.getRuntime().exec("c:\\computerQuery.bat");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
StringBuffer sbuffer = new StringBuffer(); // new trial
BufferedReader in = new BufferedReader(new InputStreamReader(p
.getInputStream()));
try {
while ((line = in.readLine()) != null) {
System.out.println(line);
textArea.append(line);
textArea.append(String.format(" %s%n", line));
sbuffer.append(line + "\n");
s = sbuffer.toString();
String trimmedLine = line.trim();
if(trimmedLine.equals(lineToRemove)) continue;
writer.write(line + System.getProperty("line.separator"));
}
fw.write("commandResult is " + s);
String input = "CN=FDCD511304,OU=Workstations,OU=SIM,OU=Accounts,DC=FL,DC=NET";
Pattern pattern = Pattern.compile("(.*?)\\=(.*?)\\,");
Matcher m = pattern.matcher(input);
while(m.find()) {
String currentVar = m.group().substring(3, m.group().length() - 1);
System.out.println(currentVar); //store or do whatever you want
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally
{
try {
fw.close();
}
catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
try {
in.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
});

You could also use javax.naming.ldap.LdapName when dealing with distinguished names. It also handles escaping, which is tricky with regex alone (e.g. cn=foo\,bar,dc=fl,dc=net is a perfectly valid DN).
String dn = "CN=FDCD111304,OU=Workstations,OU=SIM,OU=Accounts,DC=FL,DC=NET";
LdapName ldapName = new LdapName(dn);
String commonName = (String) ldapName.getRdn(ldapName.size() - 1).getValue();
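The LdapName constructor throws the checked InvalidNameException, so the snippet needs a surrounding method that declares or handles it. A slightly more defensive sketch, which looks the CN up by attribute type instead of by position, could look like this. The class and method names are hypothetical.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical helper for illustration; the names are not from the original posts.
class DnParser {

    // Returns the value of the first CN component found, or null if the DN has none.
    // getRdns() lists the components starting from the rightmost one.
    static String commonName(String dn) throws InvalidNameException {
        LdapName name = new LdapName(dn);
        for (Rdn rdn : name.getRdns()) {
            if ("CN".equalsIgnoreCase(rdn.getType())) {
                return String.valueOf(rdn.getValue());
            }
        }
        return null;
    }
}
```

Looking up by type avoids depending on the CN being the leftmost component of the name.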

Well I would personally use the split() function to first get the parts split up and then parse out again. So my (probably unprofessional and buggy code) would be
String args[] = line.split(",");
String args2[] = args[0].split("=");
String computerName = args2[1];
And that would be where this is:
while ((line = in.readLine()) != null) {
System.out.println(line);
String trimmedLine = line.trim();
if (trimmedLine.equals(lineToRemove))
continue;
writer.write(line
+ System.getProperty("line.separator"));
textArea.append(trimmedLine);
textArea.append(String.format(" %s%n", line));
}

You can use a different regular expression and Matcher.matches() to find only the value you're looking for:
String str = "CN=FDCD111304,OU=Workstations,OU=SIM,OU=Accounts,DC=FL,DC=NET";
Pattern pattern = Pattern.compile("(?:.*,)?CN=([^,]+).*");
Matcher matcher = pattern.matcher(str);
if(matcher.matches()) {
System.out.println(matcher.group(1));
} else {
System.out.println("No value for CN found");
}
FDCD111304
That regular expression will find the value for CN regardless of where in the string it is. The first group is to discard anything in front of CN= (we use a group starting with ?: here to indicate that the contents of the group should not be kept), then we match CN=, then the value, which may not contain a comma and then the rest of the string (which we don't care about).
You can also use a different regex and Matcher.find() to get both the keys and values and choose which keys to act on:
String str = "CN=FDCD111304,OU=Workstations,OU=SIM,OU=Accounts,DC=FL,DC=NET";
Pattern pattern = Pattern.compile("([^=]+)=([^,]+),?");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
String key = matcher.group(1);
String value = matcher.group(2);
if("CN".equals(key) || "DC".equals(key)) {
System.out.printf("%s: %s%n", key, value);
}
}
CN: FDCD111304
DC: FL
DC: NET

Try using substring to chop off the parts you don't require, creating a new string.

There are a few options. The simplest and dumbest:
str.substring(str.indexOf("=") + 1, str.indexOf(","))
A second, more flexible approach would be to build a HashMap, which would be helpful in the future for reading other values.
Edit: Second method
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.HashMap;
public class HelloWorld{
public static void main(String []args){
String input = "CN=FDCD111304,OU=Workstations,OU=SIM,OU=Accounts,DC=FL,DC=NET";
Pattern pattern = Pattern.compile("(.*?)\\=(.*?)\\,");
Matcher m = pattern.matcher(input);
while(m.find()) {
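String currentVar = m.group().substring(0, m.group().length() - 1); // drop the trailing comma
System.out.println(currentVar); //store or do whatever you want
}
}
}
This one will print every comma-terminated pair, such as CN=FDCD111304; you can split each pair on '=' and store it in a key/value container like a HashMap, or just in a list. Note that the last pair (DC=NET) has no trailing comma, so this pattern does not match it.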
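The key/value container idea can be sketched as follows. Because a DN can repeat a key (OU and DC appear several times in the example), each key maps to a list of values. The class and method names are hypothetical, and the sketch assumes values contain no escaped commas.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it does not handle escaped commas such as "\,".
class DnPairs {

    // A key without '=' or ',' followed by '=' and a value that runs up to the next comma.
    private static final Pattern PAIR = Pattern.compile("([^=,]+)=([^,]*)");

    // Collects all values per key, keeping the order in which keys first appear.
    static Map<String, List<String>> parse(String dn) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(dn);
        while (m.find()) {
            result.computeIfAbsent(m.group(1).trim(), k -> new ArrayList<>())
                  .add(m.group(2).trim());
        }
        return result;
    }
}
```

For the computer name, parse(input).get("CN").get(0) would then return the first CN value. This pattern does not require a trailing comma, so the final DC=NET pair is included as well.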

Regex in Java is not matching anything from text file

I have a text file with several lines
Category: Type of problem you're having
Description: Overview of the problem
How To Fix: Directions to fix your problem (has carriage
returns, sometimes)
Related Links: Additional Resources
**There are no numbers in my list; it was the only way I could think of to make it neater...*
I've been trying to get my code to recognize all of the information between "How To Fix:" and "Related Links" when it has more than one line. I know from my research that I have to use either (?s) or Pattern.DOTALL; however, neither of them seems to be working. I'm fairly new to regex, so I expect it is something elementary. Here is my code:
String fileName = System.getProperty("user.home") + "/Desktop/Test.txt";
try {
FileReader fr = new FileReader(fileName);
BufferedReader br = new BufferedReader(fr);
sc1 = new Scanner(br);
String findingRegex = "(Description:.*)";
String recommRegex = "(?<=How To Fix:)(.*)(?=Related Links)";//regex I'm trying to use
Pattern pFinding = Pattern.compile(findingRegex);
Pattern pRecomm = Pattern.compile(recommRegex, Pattern.DOTALL);
while (sc1.hasNextLine()) {
String clean = sc1.nextLine().trim();
String clean2 = clean.replaceAll("\\\\x\\p{XDigit}{2}", "");
Matcher mFinding = pFinding.matcher(clean2);
Matcher mRecomm = pRecomm.matcher(clean2);
while (mFinding.find()) {
System.out.println(mFinding);
}
while (mRecomm.find()){
System.out.println(mRecomm); //nothing prints?
}
}
br.close();
fr.close();
System.out.println("The following data was imported: ");
try {
tbl.displayAll();
} catch (NullPointerException npe) {
System.out.println("You have no data.");
}
} catch (FileNotFoundException fnfe) {
System.out.println("File named Test.txt was not located on your desktop. Program Terminated.");
System.exit(0);
} catch (IOException ioe) {
System.out.println("The import operation failed. Program Terminated");
System.exit(0);
} finally {
sc1.close();
}
Lastly, I tested my Regex here and it worked as expected?
MY SOLUTION:
String findingRegex = "(?<=Description:)(.*)(?=How To Fix)";
String recommRegex = "(?<=How To Fix:)(.*)(?=Related Links)";
Pattern pFinding = Pattern.compile(findingRegex, Pattern.DOTALL);
Pattern pRecomm = Pattern.compile(recommRegex, Pattern.DOTALL);
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null){
sb.append(line);
sb.append(System.lineSeparator());
line = br.readLine();
}
String newFile = sb.toString();
Matcher mFinding = pFinding.matcher(newFile);
Matcher mRecomm = pRecomm.matcher(newFile);
while (mFinding.find()) {
System.out.println(mFinding);
}
while (mRecomm.find()){
System.out.println(mRecomm);
}

Here:
String clean = sc1.nextLine().trim();
You are breaking your input up by line. But then you're trying to match multiple lines. There aren't multiple lines to match, because you only kept the one.
You could read the entire file into memory first, and then match against it. Or you could do something like
StringBuilder sb = new StringBuilder();
int state = 0;
while (sc1.hasNextLine()) {
String line = sc1.nextLine();
if (line.contains("How To Fix:")) {
state = 1;
}
if (state == 1) {
sb.append(line);
}
if (line.contains("Related Links:")) {
state = 0;
}
}
(You'll need to modify this if you need to match more than once per file.)
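For the first option, reading the whole file and matching against it, a sketch that also copes with several records per file is shown below. The class and method names are hypothetical. It uses a reluctant .*? because a greedy .* with DOTALL would run from the first "How To Fix:" to the last "Related Links:" in the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from the original posts.
class SectionExtractor {

    // DOTALL lets '.' match line terminators; ".*?" stops at the nearest "Related Links:".
    private static final Pattern HOW_TO_FIX =
            Pattern.compile("How To Fix:(.*?)Related Links:", Pattern.DOTALL);

    // Returns the trimmed text of every "How To Fix" section found in content.
    static List<String> howToFix(String content) {
        List<String> sections = new ArrayList<>();
        Matcher m = HOW_TO_FIX.matcher(content);
        while (m.find()) {
            sections.add(m.group(1).trim());
        }
        return sections;
    }
}
```

The content string could be obtained with new String(Files.readAllBytes(path), StandardCharsets.UTF_8), which assumes the file fits in memory and is UTF-8 encoded.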
