Error while replacing a string with a symbol in Java

I'm solving this problem:
problem
And what I did is this:
import java.io.*;
import static java.lang.System.exit;
import java.util.*;
//Driver for Abbreviations
public class AbbreviationsDriver {
//string of message
private static String message = "";
//List of Abbreviations
private static String[] AbbreviationsList;
//Abbreviations list file
private static File AbbreviationsListFile = new File("abbreviations.txt");
//message file
private static File inputMessageFile = new File("sample_msg.txt");
//output message file
private static File outputMessageFile = new File("sample_output.txt");
//main method
public static void main(String[] args) throws FileNotFoundException {
setAbbreviations(readFileList(AbbreviationsListFile));
System.out.println("list of abbriviations:\n" + Arrays.toString(AbbreviationsList));
setMessage(readFile(inputMessageFile));
System.out.println("\nMessage in input file:\n" + message);
writeFile(outputMessageFile,addTags(message, AbbreviationsList));
System.out.println("\nMessage with tag in output file:\n" + addTags(message, AbbreviationsList));
}
//method to add tags
public static String addTags(String toTag, String[] abbreviations){
for(String abbreviation:abbreviations)
if(toTag.contains(abbreviation)){
toTag = toTag.replaceAll(abbreviation, "<" + abbreviation + ">");
}
return toTag;
}
//method to read the file list
public static String[] readFileList(File fileInput){
String input = "";
try{
Scanner inputStream = new Scanner(fileInput);
while(inputStream.hasNextLine()){
input = input + inputStream.nextLine()+ "<String>";
}
inputStream.close();
// System.out.println("list in string: " + input);
return input.split("<String>");
}
catch(Exception exception){
System.out.println("error in getting string array from file:\t" + exception.getMessage());
exit(0);
return new String[] {""};
}
}
//method to read the file
public static String readFile(File fileInput){
String inputFile = "";
try{
Scanner inputStatement = new Scanner(fileInput);
while(inputStatement.hasNextLine()){
inputFile = inputFile + inputStatement.nextLine();
}
inputStatement.close();
return inputFile;
}
catch(Exception exception){
System.out.println("error in getting message from file:\t" + exception.getMessage());
exit(0);
return "";
}
}
//method to write the output file
public static void writeFile(File fileName, String outString){
try{
PrintWriter outputStatement = new PrintWriter(fileName);
outputStatement.print(outString);
outputStatement.close();
}
catch(Exception exception){
System.out.println("error in setting message of file:\t" + exception.getMessage());
exit(0);
}
}
//method to set abbreviations
public static void setAbbreviations(String[] newAbbreviationsList){
AbbreviationsList = newAbbreviationsList;
}
//setter to set message
public static void setMessage(String newMessage){
message = newMessage;
}
//input string
public static String inputString(){
return new Scanner(System.in).nextLine();
}
}
abbreviations.txt is here:
lol
:)
iirc
4
u
ttfn
and sample_msg.txt is here:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
But when I compile and run it, this error message comes out:
list of abbriviations:
[lol, :), iirc, 4, u, ttfn]
Message in input file:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 0
:)
^
at java.util.regex.Pattern.error(Pattern.java:1969)
at java.util.regex.Pattern.compile(Pattern.java:1706)
at java.util.regex.Pattern.<init>(Pattern.java:1352)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at java.lang.String.replaceAll(String.java:2223)
at AbbreviationsDriver.addTags(AbbreviationsDriver.java:44)
at AbbreviationsDriver.main(AbbreviationsDriver.java:36)
Process finished with exit code 1
I don't know how to solve this error because I've never seen this error before.
Please help me!

You are passing the wrong kind of argument to replaceAll(): its first parameter must be a regular expression. For your purpose a regex is not needed, so use the replace() method instead.
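As a minimal sketch of that suggestion (the class and method names here are made up for illustration), String.replace treats both arguments as plain text, so the ")" in ":)" causes no trouble:

```java
public class ReplaceDemo {
    // Hypothetical helper: wraps every literal occurrence of token in angle brackets.
    static String tagLiteral(String text, String token) {
        // String.replace does no regex parsing, so metacharacters such as ")" are safe
        return text.replace(token, "<" + token + ">");
    }

    public static void main(String[] args) {
        System.out.println(tagLiteral("Hope you are having fun! :)", ":)"));
    }
}
```

Note that a plain replace() has no notion of word boundaries, so an abbreviation such as u would also be tagged inside words like your.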

You got the error because ) is treated as a metacharacter in regex, so it must either be escaped or be paired with its opening counterpart (.
Solution
You need to treat abbreviations with metacharacters and strings without metacharacters differently. For strings with metacharacters (e.g. :) where ) is a metacharacter), you should use String#replace while for the strings without metacharacter you should use String#replaceAll.
When you use String#replaceAll, wrap the abbreviation in a capturing group that includes word boundaries, e.g. (\bu\b), so that only a u that appears as a whole word is processed. Then replace the match with <$1>, where $1 refers to the first capturing group (the code below has only one), so the text matched by (\bu\b) is replaced by <u>.
Demo:
public class Main {
public static void main(String[] args) {
String[] abbrWithoutMetaChars = { "lol", "iirc", "4", "u", "ttfn" };
String[] abbrWithMetaChars = { ":)" };
// Test string
String str = "How are u today? iirc, this is your first free day. Hope you are having fun! :)";
// Replace all abbr. without meta chars
for (String abbreviation : abbrWithoutMetaChars) {
str = str.replaceAll("(\\b" + abbreviation + "\\b)", "<$1>");
}
// Replace all abbr. with meta chars
for (String abbreviation : abbrWithMetaChars) {
str = str.replace(abbreviation, "<" + abbreviation + ">");
}
System.out.println(str);
}
}
Output:
How are <u> today? <iirc>, this is your first free day. Hope you are having fun! <:)>

The problem is actually tricky. For example, in the list of abbreviations, u should be interpreted as a word and not a letter, since in your expected output you don't surround the letter u in the word your with angle brackets but only the u that appears by itself. Hence your code needs to locate the abbreviation as a single word in the input.
Also, iirc appears in the abbreviations list but in the input you have Iirc (with a capital I) and in the expected output it should appear as <Iirc> and not as <iirc>. In other words you should ignore case when locating the abbreviation but you need to keep the case after surrounding the abbreviation with angle brackets.
Then you have :) in the abbreviations list but ) has special meaning in regular expression syntax so your code also needs to handle that situation.
All the above implies that you need to analyze the contents of the abbreviations list file in order to turn a raw abbreviation into a valid regular expression that you can then use to locate the abbreviation in the input text.
If you assume that the abbreviations list may contain every possible abbreviation, you would probably need a large amount of code to handle each one properly. Rather than do that, I just concentrated on your sample list which divides easily into two groups:
simple words
punctuation only
Note that the second group is also known as emoticons and some emoticons contain both letters and punctuation which my code, below, does not handle. As I said, my solution only pertains to your sample list of abbreviations.
Here is the code and below the code are some notes regarding it. Please note that I took the liberty of not just fixing your code, but refactoring it as well.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
//Driver for Abbreviations
public class AbbreviationsDriver {
//Abbreviations list file
private static Path abbreviationsListPath = Paths.get("abbreviations.txt");
//message file
private static Path inputPath = Paths.get("sample_msg.txt");
//output message file
private static File outputMessageFile = new File("sample_output.txt");
//main method
public static void main(String[] args) throws FileNotFoundException {
List<String> abbreviationsList = readFileList(abbreviationsListPath);
System.out.println("List of abbreviations: " + abbreviationsList);
String message = readFile(inputPath);
System.out.println("\nMessage in input file:\n" + message);
String result = addTags(message, abbreviationsList);
writeFile(outputMessageFile, result);
System.out.println("\nMessage with tag in output file:\n" + result);
}
//method to add tags
public static String addTags(String toTag, List<String> abbreviations) {
for (String abbreviation : abbreviations) {
String regex;
if (abbreviation.contains(")")) {
regex = "(\\Q" + abbreviation + "\\E)";
}
else {
regex = "(?i)(\\b" + abbreviation + "\\b)";
}
toTag = toTag.replaceAll(regex, "<$1>");
}
return toTag;
}
//method to read the file list
public static List<String> readFileList(Path path) {
List<String> list;
try {
list = Files.readAllLines(path);
}
catch (IOException exception) {
list = List.of();
System.out.println("Failed to load: " + path);
exception.printStackTrace();
}
return list;
}
//method to read the file
public static String readFile(Path path) {
String inputFile;
try {
inputFile = Files.readString(path);
}
catch (IOException exception) {
System.out.println("Failed to read: " + path);
exception.printStackTrace();
inputFile = "";
}
return inputFile;
}
//method to write the output file
public static void writeFile(File fileName, String outString) {
try {
PrintWriter outputStatement = new PrintWriter(fileName);
outputStatement.print(outString);
outputStatement.close();
}
catch (Exception exception) {
System.out.println("Failed to write file: " + fileName);
exception.printStackTrace();
}
}
}
I use interface Path rather than class File so that I can use methods of class Files to read the text files that contain the abbreviations list and the input. Hence my code works with interface List rather than with an array of String.
Passing class members to methods as method parameters defeats the purpose of having a class member in the first place. Hence I removed the members message and AbbreviationsList.
The actual work of locating the abbreviations in the input and surrounding them with angle brackets, all occurs in method addTags. Here I handle each separate group of abbreviations. If the abbreviation contains the character ), I quote it by surrounding it with quote markers \Q and \E. (Refer to javadoc of class Pattern). Otherwise the abbreviation is a regular word, so I surround it with the word boundary marker \b. I also enclose each regular expression in parentheses so as to make it a capturing group. Note that the second regular expression begins with (?i) which means to ignore case. Hence iirc will match Iirc.
The replacement string is <$1>. The $1 is replaced with the string that was actually matched so any abbreviation found in the input will be replaced by the matched string surrounded with angle brackets.
Finally, here is the output when running the above code and using your sample data.
List of abbreviations: [lol, :), iirc, 4, u, ttfn]
Message in input file:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
Message with tag in output file:
How are <u> today? <Iirc>, this is your first free day. Hope you are having fun! <:)>

There are several ways to do this. Either you use regular expressions, or you do things the old-fashioned way by parsing word-by-word. Others have pointed out problems with your current code, due to using strings that contain regular expression metacharacters. In particular,
String doesNotWork = "I am :)".replaceAll(":)", "happy"); // invalid regex
This can be solved by quoting the string, so that metacharacters are treated as literals. Pattern.quote(":)") returns the pattern \Q:)\E (written "\\Q:)\\E" as a Java string literal): \Q and \E delimit a whole quoted substring, whereas a single \ quotes only the next character, and only if it is non-alphabetic, because a backslash followed by a letter introduces one of the many regex escapes and classes:
String worksAsExpected = "I am :)".replaceAll(Pattern.quote(":)"), "happy");
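The same care applies to the replacement side, where $ and \ have special meaning. A small sketch, with a hypothetical helper name, that quotes both the search text and the replacement:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: replaces target with replacement, both taken literally.
    static String replaceLiteral(String text, String target, String replacement) {
        return text.replaceAll(
                Pattern.quote(target),                   // metacharacters in target become literals
                Matcher.quoteReplacement(replacement));  // "$" and "\" in replacement become literals
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("I am :)", ":)", "$happy"));
    }
}
```

Without Matcher.quoteReplacement, the "$h" in the replacement would be parsed as a group reference and rejected.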
An efficient way to process the text is to do it in a single pass. This can be achieved by joining the quoted literals with | (alternation):
String regex = Stream.of("lol iirc 4".split(" "))
.map(s -> Pattern.quote(s)) // quotes each emoticon
.collect(Collectors.joining("|")); // joins with |
Matcher m = Pattern.compile(regex).matcher(input);
This yields surprisingly compact code, with nothing hardcoded. Finished code:
import java.util.regex.*;
import java.util.stream.*;
public class T {
public static String mark(
String[] needles, String startMark, String endMark, String input) {
String regex = Stream.of(needles)
.map(s -> s.matches("\\p{Alpha}+") ? // quotes each
"\\b" + Pattern.quote(s) + "\\b" : // to avoid yo<u>r
Pattern.quote(s)) // to handle emoticons
.collect(Collectors.joining("|")); // joins with |
Matcher m = Pattern.compile(regex).matcher(input);
StringBuffer output = new StringBuffer();
while (m.find()) {
// quoteReplacement keeps any $ or \ in the match or the markers literal
m.appendReplacement(output, Matcher.quoteReplacement(startMark + m.group() + endMark));
}
m.appendTail(output);
return output.toString();
}
public static void main(String ... args) {
System.out.println(mark(
"lol iirc 4 u ttfn :)".split(" "), // abbreviations
"<", ">", // markers to mark them with
"How are u today? iirc, this is your first free day. "
+ "Hope you are having fun! :)"));
}
}
I used @Arvind's trick of placing word-boundary metacharacters (\\b) only on alphabetical needles. This stops every u inside a word from being marked, but may yield strange results for 4: any number with a 4 in it will have that digit marked. Ultimately, natural language processing is hard. Regular expressions are great for very regular inputs.
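One possible way to handle the 4 case is sketched below. It is a variation, not the code above: instead of \b it adds a lookbehind or lookahead only at an end of the needle that is a letter or digit, so 4 is not marked inside 42 while :) still matches next to punctuation. The names guard and BoundaryDemo are hypothetical, and the needles are assumed to be non-empty ASCII strings.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BoundaryDemo {
    // Hypothetical helper: quotes one needle and guards whichever end is a letter or digit
    static String guard(String needle) {
        boolean startsAlnum = Character.isLetterOrDigit(needle.charAt(0));
        boolean endsAlnum = Character.isLetterOrDigit(needle.charAt(needle.length() - 1));
        return (startsAlnum ? "(?<!\\p{Alnum})" : "")   // no letter/digit directly before
                + Pattern.quote(needle)
                + (endsAlnum ? "(?!\\p{Alnum})" : "");  // no letter/digit directly after
    }

    static String mark(String[] needles, String input) {
        String regex = Arrays.stream(needles)
                .filter(s -> !s.isEmpty())              // skip blank lines from the list file
                .map(BoundaryDemo::guard)
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement("<" + m.group() + ">"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String[] needles = { "lol", ":)", "iirc", "4", "u", "ttfn" };
        System.out.println(mark(needles, "How are u today? Iirc, I owe you 42 for 4 days. :)"));
    }
}
```

With this input the standalone u, Iirc, 4 and :) are wrapped, while the u in you and the 4 in 42 are left alone.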

Related

Java - extract from line based on regex

Small question regarding a Java job to extract information out of lines from a file please.
Setup, I have a file, in which one line looks like this:
bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
The file contains many such lines (as described above).
In each line, there are two particular pieces of information I am interested in: the primaryKey and the country.
In my example, ZAPDBHV7120D41A and USA
Each line of the file is guaranteed to contain the primaryKey exactly once and the country exactly once, separated by a comma. The pair can appear anywhere in the line (at the start, in the middle, at the end, etc.).
The primary key is a combination of alphabet in caps [A, B, C, ... Y, Z] and numbers [0, 1, 2, ... 9]. It has no particular predefined length.
The primary key is always in between primaryKey="({primaryKey},{country},
Meaning, the actual primaryKey is found after the string primaryKey, equals sign, double quote, open parenthesis, and before a comma, the three-letter country, and another comma.
I would like to write a program, in which I can extract all the primary key, as well as all countries from the file.
Input:
bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
bla++blabla()bla=bla+blablaprimaryKey="(AA45555DBMW711DD4100,ARG,bla
[...]
Result:
The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG
Therefore, I tried following:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;
public class RegexExtract {
public static void main(String[] args) throws Exception {
final String csvFile = "my_file.txt";
try (final BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
String line;
while ((line = br.readLine()) != null) {
Pattern.matches("", line); // extract primaryKey and country based on regex
String primaryKey = ""; // extract the primary from above
String country = ""; // extract the country from above
System.out.println("The primaryKey is " + primaryKey);
System.out.println("The country is " + country);
}
}
}
}
But I am having a hard time constructing the regular expression needed to match and extract.
May I ask what is the correct code in order to extract from the line based on above information?
Thank you
Explanations after the code.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexExtract {
public static void main(String[] args) {
Path path = Paths.get("my_file.txt");
try (BufferedReader br = Files.newBufferedReader(path)) {
Pattern pattern = Pattern.compile("primaryKey=\"\\(([A-Z0-9]+),([A-Z]+)");
String line = br.readLine();
while (line != null) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
String primaryKey = matcher.group(1);
String country = matcher.group(2);
System.out.println("The primaryKey is " + primaryKey);
System.out.println("The country is " + country);
}
line = br.readLine();
}
}
catch (IOException xIo) {
xIo.printStackTrace();
}
}
}
Running the above code produces the following output (using the two sample lines in your question).
The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG
The regular expression looks for the following [literal] string
primaryKey="(
The double quote is escaped since it is within a string literal.
The opening parenthesis is escaped because it is a metacharacter and the double backslash is required since Java does not recognize \( in a string literal.
Then the regular expression groups together the string of consecutive capital letters and digits that follow the previous literal up to (but not including) the comma.
Then there is a second group of capital letters up to the next comma.
Refer to the Regular Expressions lesson in Oracle's Java tutorials.
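As a variation on the same pattern, named groups can make the extraction a little easier to read. This is only a sketch: the group names key and country and the class name are my own, and it additionally assumes the country is exactly three capital letters followed by a comma, as described in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    // Same idea as the pattern above, with named groups instead of numbered ones
    private static final Pattern KEY_AND_COUNTRY =
            Pattern.compile("primaryKey=\"\\((?<key>[A-Z0-9]+),(?<country>[A-Z]{3}),");

    // Returns {primaryKey, country}, or null when the line does not contain the pair
    static String[] extract(String line) {
        Matcher m = KEY_AND_COUNTRY.matcher(line);
        return m.find() ? new String[] { m.group("key"), m.group("country") } : null;
    }

    public static void main(String[] args) {
        String line = "bla,bla42bla()bla=bla+blablaprimaryKey=\"(ZAPDBHV7120D41A,USA,blablablablablabla";
        String[] found = extract(line);
        System.out.println("The primaryKey is " + found[0]);
        System.out.println("The country is " + found[1]);
    }
}
```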

Need Regular Expression to parse multi-line environmental variables

I want to parse a file that is a list of environmental variables similar to this example:
TPS_LIB_DIR = "$DEF_VERSION_DIR\lib\ver215";
TPS_PH_DIR = "$DEF_VERSION_DIR";
TPS_SCHEMA_DIR = "~TPS_DIR\Supersedes\code;" +
"~TPR_DIR\..\Supersedes\code;" +
"~TPN_DIR\..\..\Supersedes\code;" +
"$TPS_VERSION_DIR";
TPS_LIB_DIR = "C:\prog\lib";
BASE_DIR = "C:\prog\base";
SPARS_DIR = "C:\prog\spars";
SIGNALFILE_DIR = "E:\SIGNAL_FILES";
SIGNALFILE2_DIR = "E:\SIGNAL_FILES2";
SIGNALFILE3_DIR = "E:\SIGNAL_FILES2";
I came up with this regular expression that matches the single line definitions fine, but it will not match the multi-line definitions.
(\w+)\s*=\s*(.*);[\r\n]+
Does anyone know of a regular expression which will parse all lines in this file where the environmental variable name is in group 1 and the value (on right side of =) is in group 2? Even better would be if the multiple paths were in separate groups, but I can handle that part manually.
UPDATE:
Here is what I ended up implementing. The first pattern "Pattern p" matches the individual environmental variable blocks. The second pattern, "Pattern valpattern" parses the one or more values for each environmental variable. Hope someone finds this useful.
private static void parse(File filename) {
Pattern p = Pattern.compile("(\\w+)\\s*=\\s*([\\s\\S]+?\";)");
Pattern valpattern = Pattern.compile("\\s*\"(.+)\"\\s*");
try {
String str = readFile(filename, StandardCharsets.UTF_8);
Matcher matcher = p.matcher(str);
while(matcher.find()) {
String key = matcher.group(1);
Matcher valmatcher = valpattern.matcher(matcher.group(2));
System.out.println(key);
while(valmatcher.find()) {
System.out.println("\t" + valmatcher.group(1).replaceAll(System.getProperty("line.separator"), ""));
}
}
} catch (IOException e) {
System.out.println("Error: ProcessENV.parse -- problem parsing file: " + filename + System.lineSeparator());
e.printStackTrace();
}
}
static String readFile(File file, Charset encoding) throws IOException {
byte[] encoded = Files.readAllBytes(file.toPath());
return new String(encoded, encoding);
}
It is simpler to split on '=' and '";'.
[ c.strip().split(' = ') for c in s.split('";') ]
Or with double comprehension to get the individual paths:
[[name.strip(), *[x.strip(' "\t\r\n') for x in value.split('" +')]] for name, value in (c.split(' = ', 1) for c in s.split('";') if c.strip())]
Split could be done with re, adding \s* to remove the trailing spaces:
r = re.split(r'\s*=\s*|";\s*', text, flags=re.MULTILINE)
The even elements r[::2] would be the variable names and the odd elements r[1::2] the values;
then get rid of the extra whitespace in the values.
You can use the following regex:
(\w+)\s*=\s*([\s\S]+?)";
It starts by matching Group 1, one or more word characters, then zero or more whitespace characters, an equals sign, zero or more whitespace characters, then Group 2, one or more of any character (non-greedy), and finally a closing double quote and a semicolon.
That will match all the lines.
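A short Java sketch of how that regex could be applied to the whole file content follows. The class name and the choice of a map are assumptions for illustration; note that a map keeps only the last value for a repeated name such as TPS_LIB_DIR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EnvBlockDemo {
    // Group 1 is the variable name; group 2 is the raw value up to the closing ";
    private static final Pattern BLOCK =
            Pattern.compile("(\\w+)\\s*=\\s*([\\s\\S]+?)\";");

    static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps file order
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            result.put(m.group(1), m.group(2)); // a repeated name overwrites the earlier value
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "A_DIR = \"$BASE\";\nB_DIR = \"x;\" +\n  \"y\";\n";
        parse(text).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Group 2 still contains the opening quote and, for multi-line values, the inner quotes and + signs, so the individual paths need a second pass such as the valpattern in the question's update.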

How to extract the parameters from the output of a formated string in Java

I am trying to parse the output of a program and extract the parameters used to generated these results. The output are in the form of sentences generated from the format function in Python e.g.:
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. is generated from Opening browser '%s' to base url '%s'
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. is generated from Clicking element '%s'.
I want to extract the initial input parameters in the format function. My function would look something like:
private List<String> extractParameters(String output, String format){
// code would come here
}
The function takes as input the generated string and the format string that was used to generate it (e.g. "Clicking element '%s'.") and returns an ordered list of the parameters that were used (e.g. "xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]").
I started working on a method using regex, but I have many formats to manage and, not being a regex expert, the solution I am moving towards is really ugly and unmaintainable. So the question is:
Is there an elegant way to achieve my goal in Java?
Regex should do the trick but you should be sure they are optimized and well written. For your above examples I made a simple line analyzer based on regex patterns:
class RegexLineAnalyzer {
private List<Pattern> patterns = new ArrayList<>();
public RegexLineAnalyzer() {
patterns.add(Pattern.compile("^Opening browser '(.+)' to base url '(.+)'", Pattern.CASE_INSENSITIVE));
patterns.add(Pattern.compile("^Clicking element '(.+)'", Pattern.CASE_INSENSITIVE));
// add other patterns
}
public List<String> extractParameters(String line) {
for (Pattern pattern : patterns) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
List<String> parameters = new ArrayList<>(matcher.groupCount());
for (int i = 0; i < matcher.groupCount(); i++) {
parameters.add(matcher.group(i + 1));
}
return parameters;
}
}
return Collections.emptyList();
}
}
I assume that log files are split on lines. How to read and split files by lines efficiently you can find on this page.
Example usage of above analyzer could be like below:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
List<String> lines = new ArrayList<>();
lines.add("Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.");
lines.add("Clicking element 'xpath=.//a[contains(normalize-space(#class), \"cc-btn cc-dismiss\")]'.");
RegexLineAnalyzer regexLineAnalyzer = new RegexLineAnalyzer();
for (String line : lines) {
System.out.println(line + " => " + regexLineAnalyzer.extractParameters(line));
}
}
}
Prints:
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. => [Google Chrome, https://https://stackoverflow.com]
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. => [xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]]
EDITED
I thought you had a list of patterns that you could match against each line. If you instead need to guess the pattern and then analyse it to find the arguments, you can use a simpler solution based on the split function. We have to assume that each line contains an even number of ' characters. We would have a problem with lines like: Jon's browser is 'IE' or User last name is 'O'Reilly', or even User's last name is 'O'Reilly'. A simple implementation could look like this:
class SplitLineAnalyzer {
public List<String> extractParameters(String line) {
final String regex = "'";
final String[] split = line.split(regex);
if (split.length % 2 == 0) {
System.out.println("Line contains unexpected number of parts. Hard to guess pattern for line = " + line);
return Collections.emptyList();
}
List<String> args = new ArrayList<>();
for (int i = 1; i < split.length; i += 2) {
args.add(split[i]);
split[i] = "%s";
}
Arrays.stream(split).reduce((s1, s2) -> s1 + regex + s2).ifPresent(s -> System.out.println("Possible pattern: " + s));
return args;
}
}
Example usage:
public class Main {
public static void main(String[] args) throws Exception {
List<String> lines = new ArrayList<>();
lines.add("Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.");
lines.add("Clicking element 'xpath=.//a[contains(normalize-space(#class), \"cc-btn cc-dismiss\")]'.");
lines.add("'Firefox' is used by user 'Tom'.");
lines.add("Lines like this' could be broken.");
lines.add("User's first name is 'Jerry'.");
lines.add("User's last name is 'O'Reilly'");
SplitLineAnalyzer regexLineAnalyzer = new SplitLineAnalyzer();
for (String line : lines) {
System.out.println(line + " => " + regexLineAnalyzer.extractParameters(line));
System.out.println("");
}
}
}
Prints:
Possible pattern: Opening browser '%s' to base url '%s'.
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. => [Google Chrome, https://https://stackoverflow.com]
Possible pattern: Clicking element '%s'.
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. => [xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]]
Possible pattern: '%s' is used by user '%s'.
'Firefox' is used by user 'Tom'. => [Firefox, Tom]
Line contains unexpected number of parts. Hard to guess pattern for line = Lines like this' could be broken.
Lines like this' could be broken. => []
Line contains unexpected number of parts. Hard to guess pattern for line = User's first name is 'Jerry'.
User's first name is 'Jerry'. => []
Line contains unexpected number of parts. Hard to guess pattern for line = User's last name is 'O'Reilly'
User's last name is 'O'Reilly' => []
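When the format string is known, as in the signature from the question, another option is to turn the format itself into a regex. The following is a rough sketch under stated assumptions: only %s placeholders are supported (no %d, %% and so on), and a parameter that itself contains the literal text that follows it in the format can be split in the wrong place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormatParameterExtractor {
    static List<String> extractParameters(String output, String format) {
        // Quote every literal piece of the format and put a capturing group where each %s was
        StringBuilder regex = new StringBuilder();
        String[] literals = format.split("%s", -1); // -1 keeps a trailing empty piece
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("(.*?)");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        Matcher m = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(output);
        List<String> parameters = new ArrayList<>();
        if (m.matches()) { // the whole output must fit the format
            for (int g = 1; g <= m.groupCount(); g++) {
                parameters.add(m.group(g));
            }
        }
        return parameters; // empty when the output does not fit the format
    }

    public static void main(String[] args) {
        System.out.println(extractParameters(
                "Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.",
                "Opening browser '%s' to base url '%s'."));
    }
}
```

Because the literal parts are passed through Pattern.quote, characters such as . or ( in a format need no manual escaping.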

Read CSV file and write to another CSV - ArrayIndexOutOfBoundsException and pattern difficulties

I'm creating a Java program that reads data from one CSV file and saves it, with a few changes, to another CSV file:
a) In the 3rd column of the output file I must extract only the price, in a specific format (e.g. 4.99, 2522.78), from the 4th column of the input file.
b) In the 4th column of the output file I must extract the date, in the format DD.MM.YYYY, from the 5th column of the input file, if there is one.
c) The last three rows of the input file have no last column, so when I read the first row without it, an exception is thrown.
There is a little more, but those are the difficulties to overcome. Could you help me? I have a pattern, but I just don't know how to use it on a table like mine.
Code:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RwCSV {
private static final String SOURCE_FILE = "/home/krystian/Pulpit/products.csv";
private static final String RESULT_FILE = "/home/krystian/Pulpit/result3.csv";
private static final String DELIMITER1 = ";";
private static final String DELIMITER2 = "|";
//Pattern pattern;
public static void main(String[] args) {
try (
BufferedReader br = new BufferedReader(new FileReader(SOURCE_FILE));
FileWriter fw = new FileWriter(RESULT_FILE)) {
String line;
while ((line = br.readLine()) != null) {
String[] values = line.split(DELIMITER1);
String[] result = new String[5];
Pattern p = Pattern.compile("\\d+.\\d\\d");
Matcher m = p.matcher(values[3]);
//System.out.println(values[4]);
result[0] = "'"+values[0]+"'";
result[1] = "'"+values[1]+"?id="+values[2]+"'";
result[2] = "'"+values[3]+"'";
result[3] = "'"+values[3]+"'";
result[4] = "'"+values[4]+"'"; //throws exception java.lang.ArrayIndexOutOfBoundsException
for (int i = 0; i < result.length; i++) {
fw.write(result[i].replace("\"", ""));
if (i != result.length - 1) {
fw.write(DELIMITER2);
}
if (values.length<5) {continue;}
}
fw.write("\n");
}
} catch (FileNotFoundException ex) {
System.out.println("File not found.");
} catch (IOException ex) {
ex.printStackTrace(System.out);
}
catch (NullPointerException ex) {
}
}
}
Input file:
"Product Name";"Link";"SKU";"Selling-Price";"description"
"Product #1";"http://mapofmetal.com";"AT-23";"USD 1,232.99";"This field contains no date!"
"Product #2";"http://mapofmetal.com";"BU-322";"USD 8654.56";"Here a date: 20.09.2014"
"Product #3";"http://mapofmetal.com";"FFZWE";"EUR 1255,59";"Another date: 31.4.1999"
"Product #4";"http://mapofmetal.com";234234;"345,99 €";"Again no date in this field."
"Product #5";"http://mapofmetal.com";"UDMD-4";"$34.00";"Here are some special characters: öäüß"
"Product #6";"http://mapofmetal.com";"33-AAU43";"431.333,0 EUR";"American date: 12-23-2003"
"Product #7";"http://mapofmetal.com";"33-AAU44";"431.333,0 EUR";"One more date: 1.10.2014"
"Product #8";"http://mapofmetal.com";"33-AAU45";"34,99";
"Product #9";"http://mapofmetal.com";"UZ733-2";234.99;
"Product #10";"http://mapofmetal.com";"42-H2G2";42;
Output file row pattern (must be changed separator and quote-character):
'Product #2'|'http://mapofmetal.com?id=BU-322'|'8654.56'|'20.09.2014'
About the ArrayIndexOutOfBounds
Your problem seems to be that when the input ends with ;, the 5th element gets discarded. For example:
"abc;def;".split(";") -> ["abc", "def"]
Instead of what you would like, ["abc", "def", ""]
To get that effect, either pass the number of elements you expect as the second parameter to .split(), for example:
"abc;def;".split(";", 3) -> ["abc", "def", ""]
Or a negative value:
"abc;def;".split(";", -1) -> ["abc", "def", ""]
This is explained in the docs.
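A small runnable check of this behaviour, using a row shaped like the last rows of the input file (the class and method names are only for illustration):

```java
public class SplitLimitDemo {
    // A negative limit tells split to keep trailing empty strings
    static String[] columns(String line) {
        return line.split(";", -1);
    }

    public static void main(String[] args) {
        String row = "\"Product #8\";\"http://mapofmetal.com\";\"33-AAU45\";\"34,99\";";
        System.out.println(row.split(";").length);  // trailing empty column is dropped
        System.out.println(columns(row).length);    // trailing empty column is kept
    }
}
```

With the negative limit, values[4] exists for every row and is simply an empty string when the description is missing.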
About extracting the price
Extracting the price is tricky because you have multiple formats:
USD 1,232.99
EUR 1255,59
345,99 €
$34.00
34,99
The biggest problem there is the comma, which sometimes should be ignored, other times it's a decimal point.
Here's something that will work with the example you gave, but is likely not exhaustive, and you would need to improve on it depending on the other possible inputs you might have:
String price;
if (values[3].startsWith("EUR ") || values[3].endsWith(" €")) {
// ignore non-digits and non-commas, and replace commas with dots
price = values[3].replaceAll("[^\\d,]", "").replaceAll(",", ".");
} else {
// ignore non-digits and non-dots
price = values[3].replaceAll("[^\\d.]", "");
}
Then there's this format I'm not sure what to make of:
431.333,0 EUR
I think you need better specs for the input format.
It's unnecessarily hard and error-prone to work with such inconsistent input.
Depending on how long you want to use this code, there are quick and more robust options.
An easy one is to wrap the access to values[4] in a try/catch and insert a default value in the catch block when the column is not present in the file.
Your products file only has 4 columns starting at "Product #8", so you are trying to access values[4] when that array index doesn't exist.
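As an alternative to try/catch, the array length can be checked before reading the fifth column, and the same place is a natural spot for the date extraction the question asks about. The sketch below is only one possible reading of the requirements: it assumes dates are written with dots as D.M.YYYY or DD.MM.YYYY, pads day and month to two digits, and does not check that the date exists in the calendar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateColumnDemo {
    // Day and month with one or two digits, a four-digit year, separated by dots
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{1,2})\\.(\\d{1,2})\\.(\\d{4})\\b");

    // Returns the date as DD.MM.YYYY, or "" when the column or the date is missing
    static String extractDate(String[] values) {
        if (values.length < 5) {
            return ""; // row has no description column
        }
        Matcher m = DATE.matcher(values[4]);
        if (!m.find()) {
            return "";
        }
        return String.format("%02d.%02d.%s",
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), m.group(3));
    }

    public static void main(String[] args) {
        String[] values = { "Product #7", "http://mapofmetal.com", "33-AAU44",
                "431.333,0 EUR", "One more date: 1.10.2014" };
        System.out.println(extractDate(values));
    }
}
```

A date written with dashes, such as the "American date" row, is not matched by this pattern and yields an empty string.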

Taking specific part of line

Hi, I've got a log file containing traceroutes and pings.
I've separated these by using
if (scanner.nextLine ().startsWith ("64 bytes")){}
so I can work with just the pings for now.
All I'm interested in from the ping is time=XX
example data line =
64 bytes from ziva.zarnet.ac.zw (209.88.89.132): icmp_seq=119 ttl=46 time=199 ms
I have been reading other people's similar questions and I'm not sure how to apply them to mine.
I literally need just the numbers, as I will be putting them into a CSV file so I can make a graph of the data.
Edit: Using Robin's solution, my pings are now being printed on screen, except it's only doing every other one and missing the first.
while (scanner.hasNextLine ()) {
//take only pings.
if (scanner.nextLine ().startsWith ("64 bytes")){
String line = scanner.nextLine ();
String pingAsString = line.substring (line.lastIndexOf ("=") + 1, (line.length () - "ms".length ()));
Double ping = Double.valueOf (pingAsString);
System.out.println ("PING AS STRING = "+ping);
}
}
OK SORTED. THAT JUST NEEDED TO MOVE LINE ASSIGNMENT. CAPS. but made it clear. :D
Try using a regular expression to pull out the piece of data you need:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegExTest {
public static void main(String[] args) {
String test = "line= 14103 64 bytes from ziva.zarnet.ac.zw (209.88.89.132): icmp_seq=119 ttl=46 time=199 ms";
// build the regular expression string
String regex = ".*time=(\\d+).*";
// compile the regular expression into a Pattern we can use on the test string
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(test);
// if the regular expression matches, grab the value matching the
// expression in the first set of parentheses: "(\d+)"
if (matcher.matches()) {
System.out.println(matcher.group(1));
}
}
}
Or you can just use the available methods on String if you do not want to perform reg-ex magic
String line = ...
String pingAsString = line.substring( line.lastIndexOf("=")+1, (line.length() - " ms".length() ) );
Integer ping = Integer.valueOf( pingAsString );
Scanner scanner = new Scanner (new File ("./sample.log"));
while (scanner.hasNext ())
{
String line = scanner.nextLine ();
if (line.startsWith ("64 bytes")) {
String ms = line.replaceAll (".*time=([0-9]+) ms", "$1");
System.out.println ("ping = " + ms);
} // else System.out.println ("fail " + line);
}
Your problem is that you call:
if (scanner.nextLine ().startsWith ("64 bytes")){
which means the line is read but not assigned to a variable. The result is immediately tested with startsWith, but then you call nextLine again and, of course, get the next line:
String line = scanner.nextLine ();
That is the second line.
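Putting the fix together, here is a minimal sketch that reads each line exactly once and pulls the time value out with a regex. The class and method names are made up, and the pattern assumes the value is a plain number followed by " ms", as in the sample line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PingTimes {
    // Captures the number after "time=", with an optional fractional part
    private static final Pattern TIME = Pattern.compile("time=(\\d+(?:\\.\\d+)?) ms");

    static List<Double> pingTimes(Scanner scanner) {
        List<Double> times = new ArrayList<>();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine(); // read once, then reuse the variable
            if (!line.startsWith("64 bytes")) {
                continue; // not a ping reply
            }
            Matcher m = TIME.matcher(line);
            if (m.find()) {
                times.add(Double.valueOf(m.group(1)));
            }
        }
        return times;
    }

    public static void main(String[] args) {
        String log = "traceroute to ziva.zarnet.ac.zw\n"
                + "64 bytes from ziva.zarnet.ac.zw (209.88.89.132): icmp_seq=119 ttl=46 time=199 ms\n";
        System.out.println(pingTimes(new Scanner(log)));
    }
}
```

For a real log, the Scanner would be created from the file instead of from a String.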
