Regex pattern in Java matching a single letter instead of the complete word - java

I am new to Java and have been trying to write some code with the following requirement: regex patterns are saved in a file; the program reads the file, stores the contents in an ArrayList, then compares them against a String variable to find a match. But when I run it, it matches a single letter instead of the whole word. Below is the code.
import java.io.*;
import java.util.Scanner;
import java.util.ArrayList;
import java.util.regex.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches {
public void findfile( String path ){
File f = new File(path);
if(f.exists() && !f.isDirectory()) {
System.out.println("file found.....!!!!");
if(f.length() == 0 ){
System.out.println("file is empty......!!!!");
}}
else {
System.out.println("file missing");
}
}
public void readfilecontent(String path, String sql){
try{Scanner s = new Scanner(new File(path));
ArrayList<String> list = new ArrayList<String>();
while (s.hasNextLine()){
list.add(s.nextLine());
}
s.close();
System.out.println(list);
Pattern p = Pattern.compile(list.toString(),Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(sql);
if (m.find()){
System.out.println("match found");
System.out.println(m.group());
}
else {System.out.println("match not found"); }
}
catch (FileNotFoundException ex){}
}
public static void main( String args[] ) {
String path = "/code/sql.pattern";
String sql = "select * from schema.test";
RegexMatches regex = new RegexMatches();
regex.findfile(path);
regex.readfilecontent(path,sql);
}
}
The file sql.pattern contains:
\\buser\\b
\\border\\b
I am expecting it not to match anything and to print "match not found"; instead it prints "match found" and m.group() prints the letter s. Could anyone please help?
Thanks in advance.

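There are two problems here. The first is the double backslash: a line read from a file is not a Java string literal, so \\b in the file is read as a literal backslash followed by b rather than as the word boundary \b. Write the patterns in the file with a single backslash.
The second is passing list.toString() to Pattern.compile. toString() wraps the elements in '[' and ']' and separates them with ', ', so the whole string is compiled as one character class that matches any single character it lists. That is why the single letter s from "select" is reported as the match. Compile each line as its own pattern instead, as in the code below: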
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches {
public void findfile(String path) {
File f = new File(path);
if (f.exists() && !f.isDirectory()) {
System.out.println("file found.....!!!!");
if (f.length() == 0) {
System.out.println("file is empty......!!!!");
}
} else {
System.out.println("file missing");
}
}
public void readfilecontent(String path, String sql) {
try {
Scanner s = new Scanner(new File(path));
ArrayList<String> list = new ArrayList<String>();
while (s.hasNextLine()) {
list.add(s.nextLine());
}
s.close();
System.out.println(list);
list.stream().forEach(regex -> {
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(sql);
if (m.find()) {
System.out.println("match found for regex " + regex );
System.out.println("matched substring: "+ m.group());
} else {
System.out.println("match not found for regex " + regex);
}
});
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
}
public static void main(String args[]) {
String path = "/code/sql.pattern";
String sql = "select * from schema.test";
RegexMatches regex = new RegexMatches();
regex.findfile(path);
regex.readfilecontent(path, sql);
}
}
with /code/sql.pattern containing:
\buser\b
\border\b
\bfrom\b
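To see the character-class effect in isolation, here is a minimal sketch. The class name ListToStringDemo and the helper firstMatch are made up for illustration, and the patterns are held in memory rather than read from a file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class ListToStringDemo {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String sql = "select * from schema.test";

        // Lines as read from the question's file: two backslashes before each b.
        List<String> fromFile = Arrays.asList("\\\\buser\\\\b", "\\\\border\\\\b");

        // toString() adds the brackets and the ", " separator,
        // so the result compiles as a single character class.
        System.out.println(fromFile.toString());
        System.out.println(firstMatch(fromFile.toString(), sql)); // s

        // Single-backslash patterns compiled one at a time act as whole-word matches.
        for (String regex : Arrays.asList("\\buser\\b", "\\border\\b", "\\bfrom\\b")) {
            System.out.println(regex + " -> " + firstMatch(regex, sql));
        }
    }
}
```

Only \bfrom\b finds a match in the sample SQL; the other two patterns return null.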

Related

Counting the number of occurrences of each word in a PDF file in Java

I am writing a Java program using PDFBox that reads any PDF file and counts how many times each word appears in it, but for some reason nothing appears when I run the program. I expect it to print each word with its number of occurrences next to it. Thanks in advance.
Here is my code:
package lab8;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.Scanner;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class Extractor {
public static void main(String[] args) throws FileNotFoundException {
Map<String, Integer> frequencies = new TreeMap<String, Integer>();
PDDocument pd;
File input = new File("C:\\Users\\Ammar\\Desktop\\Application.pdf");
Scanner in = new Scanner(input);
try {
pd = PDDocument.load(input);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setEndPage(20);
String text = stripper.getText(pd);
while (in.hasNext()) {
String word = clean(in.next());
if (word != "") {
Integer count = frequencies.get(word);
if (count == null) {
count = 1;
} else {
count = count + 1;
}
frequencies.put(word, count);
}
}
for (String key : frequencies.keySet()) {
System.out.println(key + ": " + frequencies.get(key));
}
if (pd != null) {
pd.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
private static String clean(String s) {
String r = "";
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (Character.isLetter(c)) {
r = r + c;
}
}
return r.toLowerCase();
}
}
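I have tried to fix the logic. The question's code runs the Scanner over the raw PDF file and never uses the extracted text; the version below counts words and characters in the string returned by PDFTextStripper instead.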
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class Extractor {
public static void main(String[] args) throws FileNotFoundException {
Map<String, Integer> wordFrequencies = new TreeMap<String, Integer>();
Map<Character, Integer> charFrequencies = new TreeMap<Character, Integer>();
PDDocument pd;
File input = new File("C:\\Users\\Ammar\\Desktop\\Application.pdf");
try {
pd = PDDocument.load(input);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setEndPage(20);
String text = stripper.getText(pd);
for(int i=0; i<text.length(); i++)
{
char c = text.charAt(i);
int count = charFrequencies.get(c) != null ? (charFrequencies.get(c)) + 1 : 1;
charFrequencies.put(c, count);
}
// Split on any run of whitespace so words separated by line breaks are not glued together.
String[] texts = text.split("\\s+");
for (String txt : texts) {
int count = wordFrequencies.get(txt) != null ? (wordFrequencies.get(txt)) + 1 : 1;
wordFrequencies.put(txt, count);
}
System.out.println("Printing the number of words");
for (String key : wordFrequencies.keySet()) {
System.out.println(key + ": " + wordFrequencies.get(key));
}
System.out.println("Printing the number of characters");
for (char charKey : charFrequencies.keySet()) {
System.out.println(charKey + ": " + charFrequencies.get(charKey));
}
if (pd != null) {
pd.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Try this code. If you still have a problem that you cannot resolve, I can try to help.
In your own code you can also use a StringTokenizer by passing it the extracted string, i.e.
StringTokenizer st = new StringTokenizer(stripper.getText(pd));
Then loop with while (st.hasMoreTokens()) and read each word with String word = clean(st.nextToken()); this also works fine. Also replace word != "" with !word.isEmpty(), since != compares object references, not string contents.
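The counting step can be sketched on its own, without PDFBox. The sketch assumes the text has already been extracted into a String (for example by stripper.getText(pd)); the class name WordCountSketch and the method countWords are illustrative only, and clean mirrors the helper in the question.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the input string stands in for text extracted from a PDF.
class WordCountSketch {
    // Keeps letters only and lower-cases the result, like clean() in the question.
    static String clean(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString().toLowerCase();
    }

    // Counts each cleaned word; TreeMap keeps the keys in alphabetical order.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String token : text.split("\\s+")) {
            String word = clean(token);
            if (!word.isEmpty()) {
                frequencies.merge(word, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        String text = "The cat saw the other cat.\nThe end.";
        System.out.println(countWords(text)); // {cat=2, end=1, other=1, saw=1, the=3}
    }
}
```

Map.merge replaces the explicit null check from the question: it stores 1 for a new word and adds 1 to the existing count otherwise.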

How to remove extra line breaks in a string

I have text like this in my String s (which I have already read from a .txt file):
trump;Donald Trump;trump#yahoo.eu
obama;Barack Obama;obama#google.com
bush;George Bush;bush#inbox.com
clinton,Bill Clinton;clinton#mail.com
Then I'm trying to cut off everything except the e-mail address and print it to the console:
String f1[] = null;
f1=s.split("(.*?);");
for (int i=0;i<f1.length;i++) {
System.out.print(f1[i]);
}
and I get output like this:
trump#yahoo.eu
obama#google.com
bush#inbox.com
clinton#mail.com
How can I avoid this? I mean, how can I get the output text without line breaks?
Try the approach below. I read your file with a Scanner as well as with a BufferedReader, and in both cases I don't get any line breaks, because nextLine() and readLine() return each line without its line terminator. file.txt is the file that contains the text, and the splitting logic remains the same as yours.
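import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;
public class CC {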
public static void main(String[] args) throws IOException {
Scanner scan = new Scanner(new File("file.txt"));
while (scan.hasNext()) {
String f1[] = null;
f1 = scan.nextLine().split("(.*?);");
for (int i = 0; i < f1.length; i++) {
System.out.print(f1[i]);
}
}
scan.close();
BufferedReader br = new BufferedReader(new FileReader(new File("file.txt")));
String str = null;
while ((str = br.readLine()) != null) {
String f1[] = null;
f1 = str.split("(.*?);");
for (int i = 0; i < f1.length; i++) {
System.out.print(f1[i]);
}
}
br.close();
}
}
You may just remove all line breaks, as shown in the code below:
String f1[] = null;
f1=s.split("(.*?);");
for (int i=0;i<f1.length;i++) {
System.out.print(f1[i].replaceAll("\r", "").replaceAll("\n", ""));
}
This replaces each of them with the empty string.
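The line breaks survive the split because . does not match a line terminator by default, so the delimiter (.*?); can never swallow the newline that follows each e-mail address. A minimal sketch of the same idea follows, using the \R linebreak matcher (available since Java 8) so that \r\n, \n, and \r are all handled in one call. The class and method names are made up for illustration.

```java
// Hypothetical sketch; names are illustrative, not from the original answers.
class StripLineBreaks {
    // Splits as in the question, strips any line break from each piece, and joins the pieces.
    static String emailsWithoutBreaks(String s) {
        StringBuilder out = new StringBuilder();
        for (String piece : s.split("(.*?);")) {
            out.append(piece.replaceAll("\\R", ""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "trump;Donald Trump;trump#yahoo.eu\n"
                 + "obama;Barack Obama;obama#google.com\r\n"
                 + "bush;George Bush;bush#inbox.com";
        // Addresses are printed back to back, with no separator, as in the answers above.
        System.out.println(emailsWithoutBreaks(s));
    }
}
```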
Instead of split, you might match an email-like format: one or more characters that are neither a semicolon nor whitespace, using the negated character class [^\\s;]+, followed by a # and again one or more characters that are neither a semicolon nor whitespace.
final String regex = "[^\\s;]+#[^\\s;]+";
final String string = "trump;Donald Trump;trump#yahoo.eu \n"
+ " obama;Barack Obama;obama#google.com \n"
+ " bush;George Bush;bush#inbox.com \n"
+ " clinton,Bill Clinton;clinton#mail.com";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
final List<String> matches = new ArrayList<String>();
while (matcher.find()) {
matches.add(matcher.group());
}
System.out.println(String.join("", matches));
package com.test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
String s = "trump;Donald Trump;trump#yahoo.eu "
+ "obama;Barack Obama;obama#google.com "
+ "bush;George Bush;bush#inbox.com "
+ "clinton;Bill Clinton;clinton#mail.com";
String spaceStrings[] = s.split("[\\s,;]+");
String output="";
for(String word:spaceStrings){
if(validate(word)){
output+=word;
}
}
System.out.println(output);
}
public static final Pattern VALID_EMAIL_ADDRESS_REGEX = Pattern.compile(
"^[A-Z0-9._%+-]+#[A-Z0-9.-]+\\.[A-Z]{2,6}$",
Pattern.CASE_INSENSITIVE);
public static boolean validate(String emailStr) {
Matcher matcher = VALID_EMAIL_ADDRESS_REGEX.matcher(emailStr);
return matcher.find();
}
}
Just remove any '\n' that may appear at the start or end of each piece.
Write it this way:
String f1[] = null;
f1=s.split("(.*?);");
for (int i=0;i<f1.length;i++) {
f1[i] = f1[i].replace("\n", "");
System.out.print(f1[i]);
}

Java regex - get line number from matching text

This is based on my previous question.
In my case I want to get the line number from a regex match. E.g.:
name : andy
birth : jakarta, 1 jan 1990
number id : 01011990 01
age : 26
study : Informatics engineering
I want to get the numbers of the lines that match the number pattern [0-9]+. I would like output like this:
line 2
line 3
line 4
This will do it for you. I modified the regular expression to ".*[0-9].*"
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;
import java.util.regex.Pattern;
import java.util.concurrent.atomic.AtomicInteger;
class RegExLine
{
public static void main(String[] args)
{
new RegExLine().run();
}
public void run()
{
String fileName = "C:\\Path\\to\\input\\file.txt";
AtomicInteger atomicInteger = new AtomicInteger(0);
try (Stream<String> stream = Files.lines(Paths.get(fileName)))
{
stream.forEach(s ->
{
atomicInteger.getAndIncrement();
if(Pattern.matches(".*[0-9].*", s))
{
System.out.println("line "+ atomicInteger);
}
});
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
Use a Scanner to iterate over all lines of your input, and use a Matcher object to check each line against the regex pattern.
String s = "name : andy\n" +
"birth : jakarta, 1 jan 1990\n" +
"number id : 01011990 01\n" +
"age : 26\n" +
"study : Informatics engineering";
Scanner sc = new Scanner(s);
int lineNr = 1;
while (sc.hasNextLine()) {
String line = sc.nextLine();
Matcher m = Pattern.compile(".*[0-9].*").matcher(line);
if(m.matches()){
System.out.println("line " + lineNr);
}
lineNr++;
}
You could simply have the following:
public static void main(String[] args) throws IOException {
int i = 1;
Pattern pattern = Pattern.compile(".*[0-9]+.*");
try (BufferedReader br = new BufferedReader(new FileReader("..."))) {
String line;
while ((line = br.readLine()) != null) {
if (pattern.matcher(line).matches()) {
System.out.println("line " + i);
}
i++;
}
}
}
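This code simply opens a BufferedReader on the given file path and iterates over each line in it (until readLine() returns null, indicating the end of the file). If the line matches the pattern ".*[0-9]+.*", meaning the line contains at least one digit, the line number is printed.
Use a Matcher object to check for the regex pattern. Note that this version counts matches rather than tracking line numbers: it prints the right numbers for the sample input only because the matching lines are consecutive and start at line 2.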
public static void main( String[] args )
{
String s = "name : andy\n" + "birth : jakarta, 1 jan 1990\n" + "number id : 01011990 01\n" + "age : 26\n"
+ "study : Informatics engineering";
try
{
Pattern pattern = Pattern.compile( ".*[0-9].*" );
Matcher matcher = pattern.matcher( s );
int line = 1;
while ( matcher.find() )
{
line++;
System.out.println( "line :" + line );
}
}
catch ( Exception e )
{
e.printStackTrace();
}
}
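When the whole text is matched in one pass, the line number can instead be derived from the position of each match by counting the newlines before it. The following is a minimal sketch under the assumption that no match spans a line break (true for [0-9]+). The class name MatchLineNumbers and the method matchingLines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; assumes a match never spans a line break.
class MatchLineNumbers {
    // Returns the 1-based numbers of the lines in text that contain a match of regex.
    static List<Integer> matchingLines(String regex, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        int line = 1; // line number at position pos
        int pos = 0;  // how far newlines have been counted so far
        while (m.find()) {
            // Advance the line counter over every newline before this match.
            for (; pos < m.start(); pos++) {
                if (text.charAt(pos) == '\n') {
                    line++;
                }
            }
            // Record each line once, even if it holds several matches.
            if (result.isEmpty() || result.get(result.size() - 1) != line) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "name : andy\n"
                 + "birth : jakarta, 1 jan 1990\n"
                 + "number id : 01011990 01\n"
                 + "age : 26\n"
                 + "study : Informatics engineering";
        for (int line : matchingLines("[0-9]+", s)) {
            System.out.println("line " + line);
        }
    }
}
```

For the sample text this prints line 2, line 3, and line 4, and unlike a plain match counter it still reports correct numbers when non-matching lines sit between matching ones.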

Regex in Java with matches stored into an ArrayList

I have the following code, written to store and display all words that begin with the letter a and end with z. First, I am getting an error from my regex pattern; second, the contents (Strings) stored in the ArrayList are not displayed.
import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;
public class RegexSimple2{
public static void main(String[] args) {
try{
Scanner myfis = new Scanner("D:\\myfis2.txt");
ArrayList <String> foundaz = new ArrayList<String>();
while(myfis.hasNext()){
String line = myfis.nextLine();
String delim = " ";
String [] words = line.split(delim);
for ( String s: words){
if(!s.isEmpty()&& s!=null){
Pattern pi = Pattern.compile("[a|A][a-z]*[z]");
Matcher ma = pi.matcher(s);
boolean search = false;
while (ma.find()){
search = true;
foundaz.add(s);
}
if(!search){
System.out.println("Words that start with a and end with z have not been found");
}
}
}
}
if(!foundaz.isEmpty()){
for(String s: foundaz){
System.out.println("The word that start with a and ends with z is:" + s + " ");
}
}
}
catch(Exception ex){
System.out.println(ex);
}
}
}
You need to change how you are reading the file in: new Scanner("D:\\myfis2.txt") scans the path string itself, not the file it names. In addition, change the regex to [aA].*z. The .* matches zero or more of any character. See the minor changes I made below:
import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;
public class Test {
public static void main(String[] args) {
try {
BufferedReader myfis = new BufferedReader(new FileReader("D:\\myfis2.txt"));
ArrayList<String> foundaz = new ArrayList<String>();
String line;
while ((line = myfis.readLine()) != null) {
String delim = " ";
String[] words = line.split(delim);
for (String s : words) {
if (!s.isEmpty() && s != null) {
Pattern pi = Pattern.compile("[aA].*z");
Matcher ma = pi.matcher(s);
if (ma.find()) {
foundaz.add(s);
}
}
}
}
if (!foundaz.isEmpty()) {
System.out.println("The words that start with a and ends with z are:");
for (String s : foundaz) {
System.out.println(s);
}
}
} catch (Exception ex) {
System.out.println(ex);
}
}
}
Input was:
apple
applez
Applez
banana
Output was:
The words that start with a and ends with z are:
applez
Applez
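One caveat about the pattern: find() looks for [aA].*z anywhere inside the word, so a word such as bazaz would also be accepted. If the word must begin with a and end with z, matches() (which requires the whole input to match) is the stricter check. Here is a minimal sketch on an in-memory line; the class name and the filter method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; works on a string instead of the file from the question.
class StartsWithAEndsWithZ {
    // Compiled once and reused for every word.
    private static final Pattern A_TO_Z = Pattern.compile("[aA].*z");

    // Returns the words in line that start with a or A and end with z.
    static List<String> filter(String line) {
        List<String> found = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            // matches() succeeds only if the entire word fits the pattern.
            if (A_TO_Z.matcher(word).matches()) {
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(filter("apple applez Applez banana bazaz")); // [applez, Applez]
    }
}
```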
import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;
public class RegexSimple2
{
public static void main(String[] args) {
try
{
Scanner myfis = new Scanner(new File("D:\\myfis2.txt"));
ArrayList <String> foundaz = new ArrayList<String>();
while(myfis.hasNext())
{
String line = myfis.nextLine();
String delim = " ";
String [] words = line.split(delim);
for (String s : words) {
if (!s.isEmpty() && s != null)
{
Pattern pi = Pattern.compile("[aA].*z");
Matcher ma = pi.matcher(s);
if (ma.find()) {
foundaz.add(s);
}
}
}
}
if(foundaz.isEmpty())
{
System.out.println("No matching words have been found!");
}
if(!foundaz.isEmpty())
{
System.out.print("The words that start with a and ends with z are:\n");
for(String s: foundaz)
{
System.out.println(s);
}
}
}
catch(Exception ex)
{
System.out.println(ex);
}
}
}

Issue reading from a file and using a 2D array to sort the data

I'm making a province sorter, and the requirement is that I must leave the main class as is and make a private class called Munge. I've been at this for hours and changed my code hundreds of times. Basically, it reads from a text file that looks like this:
Hamilton, Ontario
Toronto, Ontario
Edmonton, Alberta
Red Deer, Alberta
St John's, Newfoundland
and needs to produce output like this:
Alberta; Edmonton, Red Deer
Ontario; Hamilton, Toronto
Newfoundland; St John's
My main class is unchangeable and looks like this:
public class Lab5 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
if(args.length < 2) {
System.err.println("Usage: java -jar lab5.jar infile outfile");
System.exit(99);
}
Munge dataSorter = new Munge(args[0], args[1]);
dataSorter.openFiles();
dataSorter.readRecords();
dataSorter.writeRecords();
dataSorter.closeFiles();
}
}
and the Munge class I've made looks like this:
package lab5;
import java.io.File;
import java.util.Scanner;
import java.util.Formatter;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
public class Munge
{
private String inFileName, outFileName;
private Scanner inFile;
private Formatter outFile;
private int line = 0;
private String[] data;
public Munge(String inFileName, String outFileName)
{
this.inFileName = inFileName;
this.outFileName = outFileName;
data = new String[100];
}
public void openFiles()
{
try
{
inFile = new Scanner(new File(inFileName));
File file = new File("input.txt");
SortedMap<String, List<String>> map = new TreeMap<String, List<String>>();
Scanner scanner = new Scanner(file).useDelimiter("\\n");
while (scanner.hasNext()) {
String newline = scanner.next();
if (newline.contains(",")) {
String[] parts = newline.split(",");
String city = parts[0].trim();
String province = parts[1].trim();
List<String> cities = map.get(province);
if (cities == null) {
cities = new ArrayList<String>();
map.put(province, cities);
}
if (!cities.contains(city)) {
cities.add(city);
}
}
}
for (String province : map.keySet()) {
StringBuilder sb = new StringBuilder();
sb.append(province).append(": ");
List<String> cities = map.get(province);
for (String city : cities) {
sb.append(city).append(", ");
}
sb.delete(sb.length() - 2, sb.length());
String output = sb.toString();
System.out.println(output);
}
}
catch(FileNotFoundException exception)
{
System.err.println("File not found.");
System.exit(1);
}
catch(SecurityException exception)
{
System.err.println("You do not have access to this file.");
System.exit(1);
}
try
{
outFile = new Formatter(outFileName);
}
catch(FileNotFoundException exception)
{
System.err.println("File not found.");
System.exit(1);
}
catch(SecurityException exception)
{
System.err.println("You do not have access to this file.");
System.exit(1);
}
}
public void readRecords()
{
while(inFile.hasNext())
{
data[line] = inFile.nextLine();
System.out.println(data[line]);
line++;
}
}
public void writeRecords()
{
for(int i = 0; i < line; i++)
{
String tokens[] = data[i].split(", ");
Arrays.sort(tokens);
for(int j = 0; j < tokens.length; j++)
outFile.format("%s\r\n", tokens[j]);
}
}
public void closeFiles()
{
if(inFile != null)
inFile.close();
if(outFile != null)
outFile.close();
}
}
You'll have to excuse my brackets; they're formatted correctly in NetBeans, but I had to move the bottom ones over to keep them in the code block.
As I think this is homework I'll avoid giving you a solution but give some hints of what to do.
When you have read a line it consists of City, Province. So the first thing you need to do is split the string into two parts. The second part is the province and the first is the city. You need to make a collection for each province and store the city in the correct province collection.
Once you have that you sort the names of the found provinces, and iterate through them. Sort the cities for the province and then output the province name and each city name.
Useful classes could be HashMap, TreeMap, List, and Collections (which has sort methods).
Hope that helps to get you further, otherwise try to be more specific where you are stuck.
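The grouping step described in the hints can be sketched as follows. It works on an in-memory list of lines rather than the files from the question; the class name ProvinceGrouping and the method group are hypothetical, and the output separators are copied from the expected output in the question. A TreeMap yields the provinces in alphabetical order, which differs slightly from the order shown in the question (Newfoundland comes before Ontario).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the grouping step only; file handling is left out.
class ProvinceGrouping {
    // Turns "City, Province" lines into "Province; City, City" lines.
    static List<String> group(List<String> lines) {
        // TreeMap keeps the provinces sorted by name.
        Map<String, List<String>> byProvince = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            if (parts.length < 2) {
                continue; // skip lines without a comma
            }
            String city = parts[0].trim();
            String province = parts[1].trim();
            byProvince.computeIfAbsent(province, k -> new ArrayList<>()).add(city);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byProvince.entrySet()) {
            Collections.sort(e.getValue()); // sort the cities within each province
            out.add(e.getKey() + "; " + String.join(", ", e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hamilton, Ontario", "Toronto, Ontario", "Edmonton, Alberta",
                "Red Deer, Alberta", "St John's, Newfoundland");
        for (String row : group(lines)) {
            System.out.println(row);
        }
    }
}
```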
