Java - extract from line based on regex


Small question regarding a Java job to extract information out of lines from a file please.
Setup, I have a file, in which one line looks like this:
bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
The file contains many such lines (as described above).
Each line holds two pieces of information I am interested in: the primaryKey and the country.
In my example, these are ZAPDBHV7120D41A and USA.
Each line of the file contains the primaryKey exactly once and the country exactly once, separated by a comma. The pair can appear anywhere in the line (at the start, in the middle, at the end, etc.).
The primary key is a combination of uppercase letters [A, B, C, ... Y, Z] and digits [0, 1, 2, ... 9]. It has no predefined length.
The primary key always appears in the form primaryKey="({primaryKey},{country},
That is, the actual primaryKey comes right after the string primaryKey="( and is followed by a comma, the three-letter country, and another comma.
I would like to write a program that extracts all the primary keys and all the countries from the file.
Input:
bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
bla++blabla()bla=bla+blablaprimaryKey="(AA45555DBMW711DD4100,ARG,bla
[...]
Result:
The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG
Therefore, I tried the following:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;
public class RegexExtract {
public static void main(String[] args) throws Exception {
final String csvFile = "my_file.txt";
try (final BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
String line;
while ((line = br.readLine()) != null) {
Pattern.matches("", line); // extract primaryKey and country based on regex
String primaryKey = ""; // extract the primary from above
String country = ""; // extract the country from above
System.out.println("The primaryKey is " + primaryKey);
System.out.println("The country is " + country);
}
}
}
}
But I am having a hard time constructing the regular expression needed to match and extract.
May I ask what is the correct code in order to extract from the line based on above information?
Thank you

Explanations after the code.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexExtract {
public static void main(String[] args) {
Path path = Paths.get("my_file.txt");
try (BufferedReader br = Files.newBufferedReader(path)) {
Pattern pattern = Pattern.compile("primaryKey=\"\\(([A-Z0-9]+),([A-Z]+)");
String line = br.readLine();
while (line != null) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
String primaryKey = matcher.group(1);
String country = matcher.group(2);
System.out.println("The primaryKey is " + primaryKey);
System.out.println("The country is " + country);
}
line = br.readLine();
}
}
catch (IOException xIo) {
xIo.printStackTrace();
}
}
}
Running the above code produces the following output (using the two sample lines in your question).
The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG
The regular expression looks for the following [literal] string
primaryKey="(
The double quote is escaped since it is within a string literal.
The opening parenthesis is escaped because it is a metacharacter and the double backslash is required since Java does not recognize \( in a string literal.
Then the regular expression groups together the string of consecutive capital letters and digits that follow the previous literal up to (but not including) the comma.
Then there is a second group of capital letters up to the next comma.
Refer to the Regular Expressions lesson in Oracle's Java tutorials.
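As a variation on the pattern above, here is a minimal sketch that uses named groups and additionally requires the country to be exactly three capital letters followed by a comma, as the question describes. The class name KeyCountryExtractor, the method extract and the group names key and country are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyCountryExtractor {
    // Same literal prefix as above; the groups are named, and the country
    // must be exactly three capital letters followed by a comma.
    private static final Pattern KEY_AND_COUNTRY =
            Pattern.compile("primaryKey=\"\\((?<key>[A-Z0-9]+),(?<country>[A-Z]{3}),");

    /** Returns {primaryKey, country}, or an empty Optional if the line does not match. */
    static Optional<String[]> extract(String line) {
        Matcher m = KEY_AND_COUNTRY.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group("key"), m.group("country") });
    }

    public static void main(String[] args) {
        String line = "bla,bla42bla()bla=bla+blablaprimaryKey=\"(ZAPDBHV7120D41A,USA,blablablablablabla";
        extract(line).ifPresent(result -> {
            System.out.println("The primaryKey is " + result[0]);
            System.out.println("The country is " + result[1]);
        });
    }
}
```

Returning an Optional makes the "no match on this line" case explicit instead of silently skipping the line.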


Error while replace string with symbol in Java

I'm solving this problem:
problem
And what I did is this:
import java.io.*;
import static java.lang.System.exit;
import java.util.*;
//Driver for Abbreviations
public class AbbreviationsDriver {
//string of message
private static String message = "";
//List of Abbreviations
private static String[] AbbreviationsList;
//Abbreviations list file
private static File AbbreviationsListFile = new File("abbreviations.txt");
//message file
private static File inputMessageFile = new File("sample_msg.txt");
//output message file
private static File outputMessageFile = new File("sample_output.txt");
//main method
public static void main(String[] args) throws FileNotFoundException {
setAbbreviations(readFileList(AbbreviationsListFile));
System.out.println("list of abbriviations:\n" + Arrays.toString(AbbreviationsList));
setMessage(readFile(inputMessageFile));
System.out.println("\nMessage in input file:\n" + message);
writeFile(outputMessageFile,addTags(message, AbbreviationsList));
System.out.println("\nMessage with tag in output file:\n" + addTags(message, AbbreviationsList));
}
//method to add tags
public static String addTags(String toTag, String[] abbreviations){
for(String abbreviation:abbreviations)
if(toTag.contains(abbreviation)){
toTag = toTag.replaceAll(abbreviation, "<" + abbreviation + ">");
}
return toTag;
}
//method to read the file list
public static String[] readFileList(File fileInput){
String input = "";
try{
Scanner inputStream = new Scanner(fileInput);
while(inputStream.hasNextLine()){
input = input + inputStream.nextLine()+ "<String>";
}
inputStream.close();
// System.out.println("list in string: " + input);
return input.split("<String>");
}
catch(Exception exception){
System.out.println("error in getting string array from file:\t" + exception.getMessage());
exit(0);
return new String[] {""};
}
}
//method to read the file
public static String readFile(File fileInput){
String inputFile = "";
try{
Scanner inputStatement = new Scanner(fileInput);
while(inputStatement.hasNextLine()){
inputFile = inputFile + inputStatement.nextLine();
}
inputStatement.close();
return inputFile;
}
catch(Exception exception){
System.out.println("error in getting message from file:\t" + exception.getMessage());
exit(0);
return "";
}
}
//method to write the output file
public static void writeFile(File fileName, String outString){
try{
PrintWriter outputStatement = new PrintWriter(fileName);
outputStatement.print(outString);
outputStatement.close();
}
catch(Exception exception){
System.out.println("error in setting message of file:\t" + exception.getMessage());
exit(0);
}
}
//method to set abbreviations
public static void setAbbreviations(String[] newAbbreviationsList){
AbbreviationsList = newAbbreviationsList;
}
//setter to set message
public static void setMessage(String newMessage){
message = newMessage;
}
//input string
public static String inputString(){
return new Scanner(System.in).nextLine();
}
}
abbreviations.txt is here:
lol
:)
iirc
4
u
ttfn
and sample_msg.txt is here:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
but when I compile and run, the error message comes out:
list of abbriviations:
[lol, :), iirc, 4, u, ttfn]
Message in input file:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 0
:)
^
at java.util.regex.Pattern.error(Pattern.java:1969)
at java.util.regex.Pattern.compile(Pattern.java:1706)
at java.util.regex.Pattern.<init>(Pattern.java:1352)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at java.lang.String.replaceAll(String.java:2223)
at AbbreviationsDriver.addTags(AbbreviationsDriver.java:44)
at AbbreviationsDriver.main(AbbreviationsDriver.java:36)
Process finished with exit code 1
I don't know how to solve this error because I've never seen this error before.
Please help me!

You are passing the wrong kind of argument to replaceAll(): its first parameter must be a regex. For your purpose a regex is not needed, so use the replace() method instead.
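A minimal sketch of that suggestion, assuming plain literal matching is acceptable. The class LiteralReplaceDemo and the method tagLiteral are hypothetical names.

```java
final class LiteralReplaceDemo {
    /** Wraps every literal occurrence of the abbreviation in angle brackets. */
    static String tagLiteral(String text, String abbreviation) {
        // String.replace treats both arguments as plain text, not as a regex,
        // so the ")" in ":)" needs no escaping here.
        return text.replace(abbreviation, "<" + abbreviation + ">");
    }

    public static void main(String[] args) {
        System.out.println(tagLiteral("Hope you are having fun! :)", ":)"));
    }
}
```

Because the match is purely literal, an abbreviation such as u would also be tagged inside longer words like your, so this on its own only suits abbreviations that cannot occur inside a word.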

You got the error because ) is a metacharacter in regex, so it must either be escaped or be paired with its opening counterpart.
Solution
You need to treat abbreviations with metacharacters and strings without metacharacters differently. For strings with metacharacters (e.g. :) where ) is a metacharacter), you should use String#replace while for the strings without metacharacter you should use String#replaceAll.
When you use String#replaceAll, you should create a capturing group which includes word boundaries e.g. (\bu\b) so that only those u will be processed which appear as a word. Finally, you should replace the capturing group with <$1> where $1 refers to the first (in the code given below, there is only one capturing group) capturing group e.g. (\bu\b) will be replaced by <u>.
Demo:
public class Main {
public static void main(String[] args) {
String[] abbrWithoutMetaChars = { "lol", "iirc", "4", "u", "ttfn" };
String[] abbrWithMetaChars = { ":)" };
// Test string
String str = "How are u today? iirc, this is your first free day. Hope you are having fun! :)";
// Replace all abbr. without meta chars
for (String abbreviation : abbrWithoutMetaChars) {
str = str.replaceAll("(\\b" + abbreviation + "\\b)", "<$1>");
}
// Replace all abbr. with meta chars
for (String abbreviation : abbrWithMetaChars) {
str = str.replace(abbreviation, "<" + abbreviation + ">");
}
System.out.println(str);
}
}
Output:
How are <u> today? <iirc>, this is your first free day. Hope you are having fun! <:)>

The problem is actually tricky. For example, in the list of abbreviations, u should be interpreted as a word and not a letter, since in your expected output you don't surround the letter u in the word your with angle brackets but only the u that appears by itself. Hence your code needs to locate the abbreviation as a single word in the input.
Also, iirc appears in the abbreviations list but in the input you have Iirc (with a capital I) and in the expected output it should appear as <Iirc> and not as <iirc>. In other words you should ignore case when locating the abbreviation but you need to keep the case after surrounding the abbreviation with angle brackets.
Then you have :) in the abbreviations list but ) has special meaning in regular expression syntax so your code also needs to handle that situation.
All the above implies that you need to analyze the contents of the abbreviations list file in order to turn a raw abbreviation into a valid regular expression that you can then use to locate the abbreviation in the input text.
If you assume that the abbreviations list may contain every possible abbreviation, you would probably need a large amount of code to handle each one properly. Rather than do that, I just concentrated on your sample list which divides easily into two groups:
simple words
punctuation only
Note that the second group is also known as emoticons and some emoticons contain both letters and punctuation which my code, below, does not handle. As I said, my solution only pertains to your sample list of abbreviations.
Here is the code, and below the code are some notes regarding it. Please note that I took the liberty of not just fixing your code, but refactoring it as well.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
//Driver for Abbreviations
public class AbbreviationsDriver {
//Abbreviations list file
private static Path abbreviationsListPath = Paths.get("abbreviations.txt");
//message file
private static Path inputPath = Paths.get("sample_msg.txt");
//output message file
private static File outputMessageFile = new File("sample_output.txt");
//main method
public static void main(String[] args) throws FileNotFoundException {
List<String> abbreviationsList = readFileList(abbreviationsListPath);
System.out.println("List of abbreviations: " + abbreviationsList);
String message = readFile(inputPath);
System.out.println("\nMessage in input file:\n" + message);
String result = addTags(message, abbreviationsList);
writeFile(outputMessageFile, result);
System.out.println("\nMessage with tag in output file:\n" + result);
}
//method to add tags
public static String addTags(String toTag, List<String> abbreviations) {
for (String abbreviation : abbreviations) {
String regex;
if (abbreviation.contains(")")) {
regex = "(\\Q" + abbreviation + "\\E)";
}
else {
regex = "(?i)(\\b" + abbreviation + "\\b)";
}
toTag = toTag.replaceAll(regex, "<$1>");
}
return toTag;
}
//method to read the file list
public static List<String> readFileList(Path path) {
List<String> list;
try {
list = Files.readAllLines(path);
}
catch (IOException exception) {
list = List.of();
System.out.println("Failed to load: " + path);
exception.printStackTrace();
}
return list;
}
//method to read the file
public static String readFile(Path path) {
String inputFile;
try {
inputFile = Files.readString(path);
}
catch (IOException exception) {
System.out.println("Failed to read: " + path);
exception.printStackTrace();
inputFile = "";
}
return inputFile;
}
//method to write the output file
public static void writeFile(File fileName, String outString) {
try {
PrintWriter outputStatement = new PrintWriter(fileName);
outputStatement.print(outString);
outputStatement.close();
}
catch (Exception exception) {
System.out.println("Failed to write file: " + fileName);
exception.printStackTrace();
}
}
}
I use interface Path rather than class File so that I can use methods of class Files to read the text files that contain the abbreviations list and the input. Hence my code works with interface List rather than with an array of String.
Passing class members to methods as method parameters defeats the purpose of having a class member in the first place. Hence I removed the members message and AbbreviationsList.
The actual work of locating the abbreviations in the input and surrounding them with angle brackets, all occurs in method addTags. Here I handle each separate group of abbreviations. If the abbreviation contains the character ), I quote it by surrounding it with quote markers \Q and \E. (Refer to javadoc of class Pattern). Otherwise the abbreviation is a regular word, so I surround it with the word boundary marker \b. I also enclose each regular expression in parentheses so as to make it a capturing group. Note that the second regular expression begins with (?i) which means to ignore case. Hence iirc will match Iirc.
The replacement string is <$1>. The $1 is replaced with the string that was actually matched so any abbreviation found in the input will be replaced by the matched string surrounded with angle brackets.
Finally, here is the output when running the above code and using your sample data.
List of abbreviations: [lol, :), iirc, 4, u, ttfn]
Message in input file:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
Message with tag in output file:
How are <u> today? <Iirc>, this is your first free day. Hope you are having fun! <:)>

There are several ways to do this. Either you use regular expressions, or you do things the old-fashioned way by parsing word-by-word. Others have pointed out problems with your current code, due to using strings that contain regular expression metacharacters. In particular,
String doesNotWork = "I am :)".replaceAll(":)", "happy"); // invalid regex
This can be solved by quoting the string with Pattern.quote, so that metacharacters are treated as literals. For ":)" it returns the pattern that would be written as the Java literal "\\Q:)\\E": \Q and \E delimit a quoted substring, whereas a single \ quotes only the next character, and only when that character is non-alphabetic (a backslash before a letter instead introduces one of the many regex escapes and classes):
String worksAsExpected = "I am :)".replaceAll(Pattern.quote(":)"), "happy");
The most efficient way to process text is to do a single pass. This can be achieved by combining literal expressions with |s:
String regex = Stream.of("lol iirc 4".split(" "))
.map(s -> Pattern.quote(s)) // quotes each emoticon
.collect(Collectors.joining("|")); // joins with |
Matcher m = Pattern.compile(regex).matcher(input);
This yields surprisingly compact code, with nothing hardcoded. Finished code:
import java.util.regex.*;
import java.util.stream.*;
public class T {
public static String mark(
String[] needles, String startMark, String endMark, String input) {
String regex = Stream.of(needles)
.map(s -> s.matches("\\p{Alpha}+") ? // quotes each
"\\b" + Pattern.quote(s) + "\\b" : // to avoid yo<u>r
Pattern.quote(s)) // to handle emoticons
.collect(Collectors.joining("|")); // joins with |
Matcher m = Pattern.compile(regex).matcher(input);
StringBuffer output = new StringBuffer();
while (m.find()) {
m.appendReplacement(output, startMark + m.group() + endMark);
}
m.appendTail(output);
return output.toString();
}
public static void main(String ... args) {
System.out.println(mark(
"lol iirc 4 u ttfn :)".split(" "), // abbreviations
"<", ">", // markers to mark them with
"How are u today? iirc, this is your first free day. "
+ "Hope you are having fun! :)"));
}
}
I used @Arvind's trick of placing word-boundary metacharacters (\\b) only on alphabetical needles. This prevents every u inside a word from being marked, but it may yield strange results for 4: any number containing a 4 will have that digit marked. Ultimately, natural language processing is hard. Regular expressions are great for very regular inputs.
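On Java 9 or later, the find/appendReplacement loop can be written more compactly with Matcher.replaceAll and a replacement function. The following is a sketch of the same single-pass idea, not a drop-in replacement for the code above. The class name SinglePassMarker is made up, and Matcher.quoteReplacement is added so that markers containing $ or \ are inserted literally.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SinglePassMarker {
    static String mark(String[] needles, String startMark, String endMark, String input) {
        String regex = Arrays.stream(needles)
                .map(s -> s.matches("\\p{Alpha}+")
                        ? "\\b" + Pattern.quote(s) + "\\b" // whole words only, to avoid yo<u>r
                        : Pattern.quote(s))                // emoticons, digits: literal match
                .collect(Collectors.joining("|"));         // one alternation, one pass
        return Pattern.compile(regex)
                .matcher(input)
                // quoteReplacement keeps "$" and "\" in the markers or the match literal
                .replaceAll(mr -> Matcher.quoteReplacement(startMark + mr.group() + endMark));
    }

    public static void main(String[] args) {
        System.out.println(mark(
                "lol iirc 4 u ttfn :)".split(" "),
                "<", ">",
                "How are u today? iirc, this is your first free day. Hope you are having fun! :)"));
    }
}
```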

How to extract number suffix from a filename

In Java I have a filename example ABC.12.txt.gz, I want to extract number 12 from the filename. Currently I am using last index method and extracting substring multiple times.

You could try using pattern matching
import java.util.regex.Pattern;
import java.util.regex.Matcher;
// ... Other features
String fileName = "..."; // Filename with number extension
Pattern pattern = Pattern.compile("^.*?(\\d+).*$"); // Lazy prefix, so the group captures the whole first run of digits ("12"), not just its last digit
// Then try matching
Matcher matcher = pattern.matcher(fileName);
String numberExt = "";
if(matcher.matches()) {
numberExt = matcher.group(1);
} else {
// The filename has no numeric value in it.
}
// Use your numberExt here.
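If the number always sits between two dots, as in ABC.12.txt.gz, a narrower pattern avoids picking up digits elsewhere in the name. This is only a sketch under that assumption. FileNumber and extract are hypothetical names.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FileNumber {
    // A run of digits enclosed by dots, e.g. the "12" in "ABC.12.txt.gz".
    private static final Pattern DOTTED_NUMBER = Pattern.compile("\\.(\\d+)\\.");

    /** Returns the first dot-delimited number in the filename, if there is one. */
    static OptionalInt extract(String fileName) {
        Matcher m = DOTTED_NUMBER.matcher(fileName);
        // Integer.parseInt throws NumberFormatException if the digits exceed the int range.
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("ABC.12.txt.gz"));
    }
}
```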

You can just separate every numeric part from alphanumeric ones by using a regular expression:
public static void main(String args[]) {
String str = "ABC.12.txt.gz";
String[] parts = str.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");
// view the resulting parts
for (String s : parts) {
System.out.println(s);
}
// do what you want with those values...
}
This will output
ABC.
12
.txt.gz
Then take the parts you need and do what you have to do with them.

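If the filename contains only one run of digits, we can simply strip every non-digit character. Note that digits from several places would be concatenated (for example, ABC.12.v3.gz would give 123).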
String fileName="ABC.12.txt.gz";
String numberOnly= fileName.replaceAll("[^0-9]", "");

Read RGB values stored in a CSV file separated by a comma delimiter

I am using a FileReader to read the CSV file. The second column of the CSV file is an RGB value such as rgb(255,255,255), but the columns in the CSV file are separated by commas. If I split on the comma delimiter, it reads "rgb(255,", so how do I read the whole RGB value? The code is pasted below. Thanks!
FileReader reader = new FileReader(todoTaskFile);
BufferedReader in = new BufferedReader(reader);
int columnIndex = 1;
String line;
while ((line = in.readLine()) != null) {
if (line.trim().length() != 0) {
String[] dataFields = line.split(",");
//System.out.println(dataFields[0]+dataFields[1]);
if (!taskCount.containsKey(dataFields[columnIndex])) {
taskCount.put(dataFields[columnIndex], 1);
} else {
int oldCount = taskCount.get(dataFields[columnIndex]);
taskCount.put(dataFields[columnIndex],oldCount + 1);
}
}
}

I would strongly suggest not using custom methods to parse CSV input. There are dedicated libraries that do it for you.
@Ashraful Islam posted a good way to parse the value from a "cell" (I reused it), but the raw "cell" value should be obtained differently. This sketch shows how to do it with the Apache Commons CSV library (org.apache.commons.csv).
package csvparsing;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class GetRGBFromCSV {
public static void main(String[] args) throws IOException {
Reader in = new FileReader(GetRGBFromCSV.class.getClassLoader().getResource("sample.csv").getFile());
Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in); // remove ".withFirstRecordAsHeader()" if the CSV file has no header row
for (CSVRecord record : records) {
String color = record.get("Color"); // use ".get(1)" to get value from second column if there's no header in csv file
System.out.println(color);
Pattern RGB_PATTERN = Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)", Pattern.CASE_INSENSITIVE);
Matcher m = RGB_PATTERN.matcher(color);
if (m.find()) {
Integer red = Integer.parseInt(m.group(1));
Integer green = Integer.parseInt(m.group(2));
Integer blue = Integer.parseInt(m.group(3));
System.out.println(red + " " + green + " " + blue);
}
}
}
}
This is a custom valid CSV input which would probably make regex-based solutions behave unexpectedly:
Name,Color
"something","rgb(100,200,10)"
"something else","rgb(10,20,30)"
"not the value rgb(1,2,3) you are interested in","rgb(10,20,30)"
There are lots of cases you might forget to take into account when you write a custom parser: quoted and unquoted strings, delimiters within quotes, escaped quotes within quotes, different delimiters (, or ;), multiple columns, etc. A third-party CSV parser takes care of those things for you. You shouldn't reinvent the wheel.

String line = "rgb(25,255,255)";
line = line.replace(")", "");
line = line.replace("rgb(", "");
String[] vals = line.split(",");
Convert the values in vals to integers with Integer.parseInt and then you can use them.
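The snippet above assumes the rgb(...) cell has already been isolated from the row. For the simple layout in the question, one possible way to isolate it without a CSV library is to split only on commas that are not inside parentheses. This is a sketch that assumes no quoted fields and no nested parentheses. RgbCsvSplit, splitRow and the sample row are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class RgbCsvSplit {
    // Split on a comma unless it is followed, with no parenthesis in between,
    // by a closing ")". Such a comma sits inside rgb(...).
    private static final Pattern COMMA_OUTSIDE_PARENS = Pattern.compile(",(?![^()]*\\))");

    static String[] splitRow(String line) {
        // Limit -1 keeps trailing empty columns.
        return COMMA_OUTSIDE_PARENS.split(line, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRow("Write report,rgb(255,255,255),open")));
    }
}
```

For anything beyond this simple layout (quotes, escaped delimiters), a real CSV parser remains the safer choice.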

Here is how you can do this:
Pattern RGB_PATTERN = Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)");
String line = "rgb(25,255,255)";
Matcher m = RGB_PATTERN.matcher(line);
if (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
Here
\\d{1,3} => matches a run of 1 to 3 digits
(\\d{1,3}) => matches a run of 1 to 3 digits and captures it as a group
Because ( and ) are metacharacters, the literal parentheses have to be escaped.
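To go from the captured groups to usable numbers, the strings can be parsed with Integer.parseInt. The sketch below wraps this in a helper. RgbParser and parse are hypothetical names, and the range check is an added assumption (the pattern itself would accept 999).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RgbParser {
    private static final Pattern RGB =
            Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)");

    /** Parses "rgb(r,g,b)" into {r, g, b}; throws IllegalArgumentException otherwise. */
    static int[] parse(String cell) {
        Matcher m = RGB.matcher(cell.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an rgb(...) value: " + cell);
        }
        int[] rgb = new int[3];
        for (int i = 0; i < 3; i++) {
            rgb[i] = Integer.parseInt(m.group(i + 1)); // groups are numbered from 1
            if (rgb[i] > 255) { // \d{1,3} alone would also accept values up to 999
                throw new IllegalArgumentException("Component out of range: " + rgb[i]);
            }
        }
        return rgb;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("rgb(25,255,255)")));
    }
}
```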

Read CSV file and write to another CSV - ArrayIndexOutOfBoundsException and pattern difficulties

I'm creating a Java program that reads data from one CSV file and saves it, with small changes, to another CSV file:
a) In the 3rd column of the output file I must extract only the price in a specific format (e.g. 4.99, 2522.78) from the 4th column of the input file.
b) In the 4th column of the output file I must extract the date in the format DD.MM.YYYY from the 5th column of the input file, if there is one.
c) The last three rows of the input file have no last column. When I read the first row without a last column, an exception is thrown.
There is a little more, but these are the difficulties to overcome. Could you help me? I have a pattern, but I just don't know how to use it on a table like mine.
Code:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RwCSV {
private static final String SOURCE_FILE = "/home/krystian/Pulpit/products.csv";
private static final String RESULT_FILE = "/home/krystian/Pulpit/result3.csv";
private static final String DELIMITER1 = ";";
private static final String DELIMITER2 = "|";
//Pattern pattern;
public static void main(String[] args) {
try (
BufferedReader br = new BufferedReader(new FileReader(SOURCE_FILE));
FileWriter fw = new FileWriter(RESULT_FILE)) {
String line;
while ((line = br.readLine()) != null) {
String[] values = line.split(DELIMITER1);
String[] result = new String[5];
Pattern p = Pattern.compile("\\d+.\\d\\d");
Matcher m = p.matcher(values[3]);
//System.out.println(values[4]);
result[0] = "'"+values[0]+"'";
result[1] = "'"+values[1]+"?id="+values[2]+"'";
result[2] = "'"+values[3]+"'";
result[3] = "'"+values[3]+"'";
result[4] = "'"+values[4]+"'"; //throws exception java.lang.ArrayIndexOutOfBoundsException
for (int i = 0; i < result.length; i++) {
fw.write(result[i].replace("\"", ""));
if (i != result.length - 1) {
fw.write(DELIMITER2);
}
if (values.length<5) {continue;}
}
fw.write("\n");
}
} catch (FileNotFoundException ex) {
System.out.println("File not found.");
} catch (IOException ex) {
ex.printStackTrace(System.out);
}
catch (NullPointerException ex) {
}
}
}
Input file:
"Product Name";"Link";"SKU";"Selling-Price";"description"
"Product #1";"http://mapofmetal.com";"AT-23";"USD 1,232.99";"This field contains no date!"
"Product #2";"http://mapofmetal.com";"BU-322";"USD 8654.56";"Here a date: 20.09.2014"
"Product #3";"http://mapofmetal.com";"FFZWE";"EUR 1255,59";"Another date: 31.4.1999"
"Product #4";"http://mapofmetal.com";234234;"345,99 €";"Again no date in this field."
"Product #5";"http://mapofmetal.com";"UDMD-4";"$34.00";"Here are some special characters: öäüß"
"Product #6";"http://mapofmetal.com";"33-AAU43";"431.333,0 EUR";"American date: 12-23-2003"
"Product #7";"http://mapofmetal.com";"33-AAU44";"431.333,0 EUR";"One more date: 1.10.2014"
"Product #8";"http://mapofmetal.com";"33-AAU45";"34,99";
"Product #9";"http://mapofmetal.com";"UZ733-2";234.99;
"Product #10";"http://mapofmetal.com";"42-H2G2";42;
Expected output row format (the separator and quote character must be changed):
'Product #2'|'http://mapofmetal.com?id=BU-322'|'8654.56'|'20.09.2014'

About the ArrayIndexOutOfBounds
Your problem seems to be that when the input ends with ;, the 5th element gets discarded. For example:
"abc;def;".split(";") -> ["abc", "def"]
Instead of what you would like, ["abc", "def", ""]
To get that result, either pass the number of elements you expect as the second argument to .split(), for example:
"abc;def;".split(";", 3) -> ["abc", "def", ""]
Or a negative value:
"abc;def;".split(";", -1) -> ["abc", "def", ""]
This is explained in the docs.
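A small runnable sketch of that difference, using one of the short rows from the question. SplitLimitDemo and fields are hypothetical names.

```java
final class SplitLimitDemo {
    /** Splits on ";" and keeps trailing empty fields (negative limit). */
    static String[] fields(String line) {
        return line.split(";", -1);
    }

    public static void main(String[] args) {
        String row = "\"Product #8\";\"http://mapofmetal.com\";\"33-AAU45\";\"34,99\";";
        // Default split drops the trailing empty string.
        System.out.println(row.split(";").length);
        // A negative limit keeps it, so index 4 exists and holds "".
        System.out.println(fields(row).length);
    }
}
```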
About extracting the price
Extracting the price is tricky because you have multiple formats:
USD 1,232.99
EUR 1255,59
345,99 €
$34.00
34,99
The biggest problem there is the comma, which sometimes should be ignored, other times it's a decimal point.
Here's something that works for the examples that carry a currency marker, but it is not exhaustive (a bare 34,99 falls into the else branch and comes out as 3499), so you would need to improve on it depending on the other inputs you might have:
String price;
if (values[3].startsWith("EUR ") || values[3].endsWith(" €")) {
// ignore non-digits and non-commas, and replace commas with dots
price = values[3].replaceAll("[^\\d,]", "").replaceAll(",", ".");
} else {
// ignore non-digits and non-dots
price = values[3].replaceAll("[^\\d.]", "");
}
Then there's this format I'm not sure what to make of:
431.333,0 EUR
I think you need better specs for the input format.
It's unnecessarily hard and error-prone to work with such inconsistent input.

Depending on how long you want to use this code, there are quick options and more robust ones.
An easy one is to wrap the access to values[4] in a try/catch and substitute a default value in the catch block when the column is missing from the file.
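An alternative to catching the exception is to check the array length before reading the field. This is a sketch of that swapped-in approach. SafeFields and fieldOrDefault are made-up names.

```java
final class SafeFields {
    /** Returns values[index], or the fallback when the row has no such column. */
    static String fieldOrDefault(String[] values, int index, String fallback) {
        return index >= 0 && index < values.length ? values[index] : fallback;
    }

    public static void main(String[] args) {
        // A row from the question whose last column is missing.
        String[] values = "\"Product #10\";\"http://mapofmetal.com\";\"42-H2G2\";42;".split(";");
        System.out.println("[" + fieldOrDefault(values, 4, "") + "]");
    }
}
```

A bounds check keeps the normal "short row" case out of exception handling, which is usually easier to read than a try/catch around every array access.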

Your products file has only 4 columns starting at "Product #8", so when you access values[4], that array index doesn't exist.

Read string after " " and before "(" using split in java

I have a txt file with lines:
1st line - 20-01-01 Abs Def est (xabcd)
2nd line - 290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)
3rd line - 20-1-1 Absfghfgjhgj (xabcd ghj 5676gyj)
I want to keep 3 different String arrays:
[0]20-01-01 [1]290-01-01 [2] 20-1-1
[0]Abs Def est [1]Abs Def est ghj gfhj [2] Absfghfgjhgj
[0]xabcd [1]xabcd fgjh fgjh [2] xabcd ghj 5676gyj
Using String[] array1 = myLine.split(" ") I only get the piece 20-01-01, but I also want to keep the other 2 Strings.
EDIT: I want to do this using regular expressions (the text file is large).
This is my piece of code:
Please help, I have searched but did not find anything.
Thx.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Comparator;
import java.util.Date;
import java.util.Set;
import java.util.TreeSet;
public class Holiday implements Comparable<Date>{
Date date;
String name;
public Holiday(Date date, String name){
this.date=date;
this.name=name;
}
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream(new File("c:/holidays.txt"));
InputStreamReader isr = new InputStreamReader(fis, "windows-1251");
BufferedReader br = new BufferedReader(isr);
TreeSet<Holiday> tr=new TreeSet<>();
System.out.println(br.readLine());
String myLine = null;
while ( (myLine = br.readLine()) != null)
{
String[] array1 = myLine.split(" "); //OR use this
//String array1 = myLine.split(" ")[0];//befor " " read 1-st string
//String array2 = myLine.split("")[1];
//Holiday h=new Holiday(array1, name)
//String array1 = myLine.split(" ");
// check to make sure you have valid data
// String[] array2 = array1[1].split(" ");
System.out.println(array1[0]);
}
}
@Override
public int compareTo(Date o) {
// TODO Auto-generated method stub
return 0;
}
}

Pattern p = Pattern.compile("(.*?) (.*?) (\\(.*\\))");
Matcher m = p.matcher("20-01-01 Abs Def est (abcd)");
if (!m.matches()) throw new Exception("Invalid string");
String s1 = m.group(1); // 20-01-01
String s2 = m.group(2); // Abs Def est
String s3 = m.group(3); // (abcd)
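Since the question asks for the third part without the surrounding parentheses, here is a sketch of a variant that moves the parentheses outside the capturing group and anchors the first part to the first whitespace-free token. HolidayLineParser and parse are hypothetical names, and it assumes every line ends with one parenthesized part.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HolidayLineParser {
    // group 1: first token (the date), group 2: free text,
    // group 3: the text inside the trailing parentheses
    private static final Pattern LINE =
            Pattern.compile("^(\\S+)\\s+(.*?)\\s*\\((.*)\\)\\s*$");

    /** Returns {date, name, text in parentheses}; throws if the line has another shape. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)")));
    }
}
```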

Use a StringTokenizer, which splits on whitespace (space, tab, newline, carriage return, form feed) by default.

You seem to be splitting based on whitespace. Each element of the string array would contain the individual whitespace-separate substrings, which you can then piece back together later on via string concatenation.
For instance,
array1[0] would be 20-01-01
array1[1] would be Abs
array1[2] would be Def
so on and so forth.
Another option is to use Java regular expressions, but that may only be worthwhile if your input text file has consistent formatting and there are a lot of lines to process. Regular expressions are very powerful, but require some experience.

Match required text data by regular expression.
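The regexp below ensures there are exactly 3 words in the middle and 1 word in the brackets. The sample string has four words in the middle, so find() returns false and nothing is printed; remove hhh to see the three groups.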
String txt = "20-01-01 Abs Def est hhh (abcd)";
Pattern p = Pattern.compile("(\\d\\d-\\d\\d-\\d\\d) (\\w+ \\w+ \\w+) ([(](\\w)+[)])");
Matcher matcher = p.matcher(txt);
if (matcher.find()) {
String s1 = matcher.group(1);
String s2 = matcher.group(2);
String s3 = matcher.group(3);
System.out.println(s1);
System.out.println(s2);
System.out.println(s3);
}
However if you need more flexibility you may want to use code provided by Lence Java.
