I'm creating a Java program that reads data from one CSV file and saves it, with small changes, to another CSV file:
a) In the 3rd column of the output file I must extract only the price in a specific format (e.g. 4.99, 2522.78) from the 4th column of the input file.
b) In the 4th column of the output file I must extract the date in the format DD.MM.YYYY from the 5th column of the input file, if there is one.
c) The last three rows of the input file have no last column. When I read the first row without the last column, an exception is thrown.
There is a little more, but those are the difficulties to overcome. Could you help me? I have a pattern, but I just don't know how to use it on a table like mine.
Code:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RwCSV {
private static final String SOURCE_FILE = "/home/krystian/Pulpit/products.csv";
private static final String RESULT_FILE = "/home/krystian/Pulpit/result3.csv";
private static final String DELIMITER1 = ";";
private static final String DELIMITER2 = "|";
//Pattern pattern;
public static void main(String[] args) {
try (
BufferedReader br = new BufferedReader(new FileReader(SOURCE_FILE));
FileWriter fw = new FileWriter(RESULT_FILE)) {
String line;
while ((line = br.readLine()) != null) {
String[] values = line.split(DELIMITER1);
String[] result = new String[5];
Pattern p = Pattern.compile("\\d+.\\d\\d");
Matcher m = p.matcher(values[3]);
//System.out.println(values[4]);
result[0] = "'"+values[0]+"'";
result[1] = "'"+values[1]+"?id="+values[2]+"'";
result[2] = "'"+values[3]+"'";
result[3] = "'"+values[3]+"'";
result[4] = "'"+values[4]+"'"; //throws exception java.lang.ArrayIndexOutOfBoundsException
for (int i = 0; i < result.length; i++) {
fw.write(result[i].replace("\"", ""));
if (i != result.length - 1) {
fw.write(DELIMITER2);
}
if (values.length<5) {continue;}
}
fw.write("\n");
}
} catch (FileNotFoundException ex) {
System.out.println("File not found.");
} catch (IOException ex) {
ex.printStackTrace(System.out);
}
catch (NullPointerException ex) {
}
}
}
Input file:
"Product Name";"Link";"SKU";"Selling-Price";"description"
"Product #1";"http://mapofmetal.com";"AT-23";"USD 1,232.99";"This field contains no date!"
"Product #2";"http://mapofmetal.com";"BU-322";"USD 8654.56";"Here a date: 20.09.2014"
"Product #3";"http://mapofmetal.com";"FFZWE";"EUR 1255,59";"Another date: 31.4.1999"
"Product #4";"http://mapofmetal.com";234234;"345,99 €";"Again no date in this field."
"Product #5";"http://mapofmetal.com";"UDMD-4";"$34.00";"Here are some special characters: öäüß"
"Product #6";"http://mapofmetal.com";"33-AAU43";"431.333,0 EUR";"American date: 12-23-2003"
"Product #7";"http://mapofmetal.com";"33-AAU44";"431.333,0 EUR";"One more date: 1.10.2014"
"Product #8";"http://mapofmetal.com";"33-AAU45";"34,99";
"Product #9";"http://mapofmetal.com";"UZ733-2";234.99;
"Product #10";"http://mapofmetal.com";"42-H2G2";42;
Output file row pattern (the separator and the quote character must be changed):
'Product #2'|'http://mapofmetal.com?id=BU-322'|'8654.56'|'20.09.2014'
About the ArrayIndexOutOfBounds
Your problem seems to be that when the input ends with ;, the 5th element gets discarded. For example:
"abc;def;".split(";") -> ["abc", "def"]
Instead of what you would like, ["abc", "def", ""]
To get that result, either pass the number of elements you expect as a second parameter to .split(), for example:
"abc;def;".split(";", 3) -> ["abc", "def", ""]
Or a negative value:
"abc;def;".split(";", -1) -> ["abc", "def", ""]
This is explained in the docs.
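A quick way to check this behaviour on one of the short rows from the question is the sketch below. The class name SplitDemo and the helper fields() are made up for illustration; only String.split comes from the JDK.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits a line on ';' and keeps trailing empty columns (negative limit).
    static String[] fields(String line) {
        return line.split(";", -1);
    }

    public static void main(String[] args) {
        String line = "\"Product #8\";\"http://mapofmetal.com\";\"33-AAU45\";\"34,99\";";
        System.out.println(line.split(";").length);              // 4: trailing empty string dropped
        System.out.println(fields(line).length);                 // 5: trailing empty string kept
        System.out.println(Arrays.toString(fields("abc;def;"))); // [abc, def, ]
    }
}
```

With the negative limit, values[4] exists for every row and is simply an empty string when the description is missing.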
About extracting the price
Extracting the price is tricky because you have multiple formats:
USD 1,232.99
EUR 1255,59
345,99 €
$34.00
34,99
The biggest problem there is the comma: sometimes it is a thousands separator that should be ignored, other times it is the decimal separator.
Here's something that will work with the example you gave, but is likely not exhaustive, and you would need to improve on it depending on the other possible inputs you might have:
String price;
if (values[3].startsWith("EUR ") || values[3].endsWith(" €")) {
// ignore non-digits and non-commas, and replace commas with dots
price = values[3].replaceAll("[^\\d,]", "").replaceAll(",", ".");
} else {
// ignore non-digits and non-dots
price = values[3].replaceAll("[^\\d.]", "");
}
Then there's this format I'm not sure what to make of:
431.333,0 EUR
I think you need better specs for the input format.
It's unnecessarily hard and error-prone to work with such inconsistent input.
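Until the input format is specified properly, one possible rule is sketched below. It rests on an assumption that is not in the question: the last '.' or ',' is treated as the decimal separator only when one or two digits follow it, and every other separator is treated as grouping. The class PriceNormalizer and the method normalize() are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class PriceNormalizer {
    // ASSUMPTION: the last '.' or ',' is the decimal separator only if one or
    // two digits follow it; all other dots and commas are grouping separators.
    static String normalize(String raw) {
        String s = raw.replaceAll("[^\\d.,]", ""); // keep digits, dots and commas only
        int last = Math.max(s.lastIndexOf('.'), s.lastIndexOf(','));
        String intPart = s;
        String fracPart = "";
        if (last >= 0) {
            int digitsAfter = s.length() - last - 1;
            if (digitsAfter == 1 || digitsAfter == 2) {
                intPart = s.substring(0, last);
                fracPart = s.substring(last + 1);
            }
        }
        intPart = intPart.replaceAll("[.,]", ""); // drop grouping separators
        if (intPart.isEmpty()) {
            intPart = "0"; // no digits at all, e.g. a header cell
        }
        String plain = fracPart.isEmpty() ? intPart : intPart + "." + fracPart;
        // Always print two decimals, e.g. 42 becomes 42.00
        return new BigDecimal(plain).setScale(2, RoundingMode.HALF_UP).toPlainString();
    }

    public static void main(String[] args) {
        String[] samples = {"USD 1,232.99", "EUR 1255,59", "345,99 \u20ac", "$34.00", "431.333,0 EUR", "42"};
        for (String raw : samples) {
            System.out.println(raw + " -> " + normalize(raw));
        }
    }
}
```

Under this rule 431.333,0 EUR becomes 431333.00, which may or may not be what the data is supposed to mean, so it is worth confirming with whoever produces the file.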
Depending on how long you want to use this code, there are quick and more robust options.
An easy one is to wrap the access to values[4] in a try/catch and insert a default value in the catch block when the column is not present in the file.
Starting at "Product #8", the rows of your products file have no value in the 5th column, so split() returns only 4 elements. You are then trying to access values[4], an array index that doesn't exist.
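As an alternative to catching the exception, a length check avoids it in the first place. The sketch below combines that with one way to pull the date out of the description column. The names RowHelpers, column() and extractDate() are invented for this example, and padding day and month to two digits is an assumption based on the DD.MM.YYYY format the question asks for. The pattern does not check that the date is a real calendar date.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowHelpers {
    // day.month.year with a 1-2 digit day and month and a 4-digit year
    private static final Pattern DATE =
            Pattern.compile("(?<!\\d)(\\d{1,2})\\.(\\d{1,2})\\.(\\d{4})(?!\\d)");

    // Returns the column at the given index, or "" when the row is too short.
    static String column(String[] values, int index) {
        return index < values.length ? values[index] : "";
    }

    // Returns the first date found, padded to DD.MM.YYYY, or "" when there is none.
    static String extractDate(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return "";
        }
        return String.format(Locale.ROOT, "%02d.%02d.%s",
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), m.group(3));
    }

    public static void main(String[] args) {
        String[] shortRow = "\"Product #10\";\"http://mapofmetal.com\";\"42-H2G2\";42;".split(";");
        System.out.println("[" + column(shortRow, 4) + "]");
        System.out.println(extractDate("Here a date: 20.09.2014"));
        System.out.println(extractDate("Another date: 31.4.1999"));
        System.out.println("[" + extractDate("American date: 12-23-2003") + "]");
    }
}
```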
I'm solving this problem:
problem
And what I did is this:
import java.io.*;
import static java.lang.System.exit;
import java.util.*;
//Driver for Abbreviations
public class AbbreviationsDriver {
//string of message
private static String message = "";
//List of Abbreviations
private static String[] AbbreviationsList;
//Abbreviations list file
private static File AbbreviationsListFile = new File("abbreviations.txt");
//message file
private static File inputMessageFile = new File("sample_msg.txt");
//output message file
private static File outputMessageFile = new File("sample_output.txt");
//main method
public static void main(String[] args) throws FileNotFoundException {
setAbbreviations(readFileList(AbbreviationsListFile));
System.out.println("list of abbriviations:\n" + Arrays.toString(AbbreviationsList));
setMessage(readFile(inputMessageFile));
System.out.println("\nMessage in input file:\n" + message);
writeFile(outputMessageFile,addTags(message, AbbreviationsList));
System.out.println("\nMessage with tag in output file:\n" + addTags(message, AbbreviationsList));
}
//method to add tags
public static String addTags(String toTag, String[] abbreviations){
for(String abbreviation:abbreviations)
if(toTag.contains(abbreviation)){
toTag = toTag.replaceAll(abbreviation, "<" + abbreviation + ">");
}
return toTag;
}
//method to read the file list
public static String[] readFileList(File fileInput){
String input = "";
try{
Scanner inputStream = new Scanner(fileInput);
while(inputStream.hasNextLine()){
input = input + inputStream.nextLine()+ "<String>";
}
inputStream.close();
// System.out.println("list in string: " + input);
return input.split("<String>");
}
catch(Exception exception){
System.out.println("error in getting string array from file:\t" + exception.getMessage());
exit(0);
return new String[] {""};
}
}
//method to read the file
public static String readFile(File fileInput){
String inputFile = "";
try{
Scanner inputStatement = new Scanner(fileInput);
while(inputStatement.hasNextLine()){
inputFile = inputFile + inputStatement.nextLine();
}
inputStatement.close();
return inputFile;
}
catch(Exception exception){
System.out.println("error in getting message from file:\t" + exception.getMessage());
exit(0);
return "";
}
}
//method to write the output file
public static void writeFile(File fileName, String outString){
try{
PrintWriter outputStatement = new PrintWriter(fileName);
outputStatement.print(outString);
outputStatement.close();
}
catch(Exception exception){
System.out.println("error in setting message of file:\t" + exception.getMessage());
exit(0);
}
}
//method to set abbreviations
public static void setAbbreviations(String[] newAbbreviationsList){
AbbreviationsList = newAbbreviationsList;
}
//setter to set message
public static void setMessage(String newMessage){
message = newMessage;
}
//input string
public static String inputString(){
return new Scanner(System.in).nextLine();
}
}
abbreviations.txt is here:
lol
:)
iirc
4
u
ttfn
and sample_msg.txt is here:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
but when I compile and run, the error message comes out:
list of abbriviations:
[lol, :), iirc, 4, u, ttfn]
Message in input file:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 0
:)
^
at java.util.regex.Pattern.error(Pattern.java:1969)
at java.util.regex.Pattern.compile(Pattern.java:1706)
at java.util.regex.Pattern.<init>(Pattern.java:1352)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at java.lang.String.replaceAll(String.java:2223)
at AbbreviationsDriver.addTags(AbbreviationsDriver.java:44)
at AbbreviationsDriver.main(AbbreviationsDriver.java:36)
Process finished with exit code 1
I don't know how to solve this error because I've never seen this error before.
Please help me!
You are passing the wrong kind of argument to replaceAll(). Its first parameter must be a regex. For your purpose a regex is not needed, so use the replace() method instead.
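To illustrate the difference, here is a small sketch. ReplaceDemo, tagLiteral() and tagQuoted() are names made up for the example; String.replace, Pattern.quote and Matcher.quoteReplacement are standard JDK methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    // String.replace treats both arguments as literal text, so ":)" is safe.
    static String tagLiteral(String text, String abbreviation) {
        return text.replace(abbreviation, "<" + abbreviation + ">");
    }

    // Same result with replaceAll: quote the pattern and the replacement
    // so that neither is interpreted as regex syntax.
    static String tagQuoted(String text, String abbreviation) {
        return text.replaceAll(Pattern.quote(abbreviation),
                Matcher.quoteReplacement("<" + abbreviation + ">"));
    }

    public static void main(String[] args) {
        System.out.println(tagLiteral("Hope you are having fun! :)", ":)"));
        System.out.println(tagQuoted("Hope you are having fun! :)", ":)"));
    }
}
```

Both calls print the message with the smiley wrapped as <:)>. Note that plain replace() has no notion of word boundaries, so it would also tag the u inside "you".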
You got the error because ) is a metacharacter in a regex, so it must either be escaped or be paired with its opening counterpart.
Solution
You need to treat abbreviations with metacharacters and strings without metacharacters differently. For strings with metacharacters (e.g. :) where ) is a metacharacter), you should use String#replace while for the strings without metacharacter you should use String#replaceAll.
When you use String#replaceAll, you should create a capturing group which includes word boundaries e.g. (\bu\b) so that only those u will be processed which appear as a word. Finally, you should replace the capturing group with <$1> where $1 refers to the first (in the code given below, there is only one capturing group) capturing group e.g. (\bu\b) will be replaced by <u>.
Demo:
public class Main {
public static void main(String[] args) {
String[] abbrWithoutMetaChars = { "lol", "iirc", "4", "u", "ttfn" };
String[] abbrWithMetaChars = { ":)" };
// Test string
String str = "How are u today? iirc, this is your first free day. Hope you are having fun! :)";
// Replace all abbr. without meta chars
for (String abbreviation : abbrWithoutMetaChars) {
str = str.replaceAll("(\\b" + abbreviation + "\\b)", "<$1>");
}
// Replace all abbr. with meta chars
for (String abbreviation : abbrWithMetaChars) {
str = str.replace(abbreviation, "<" + abbreviation + ">");
}
System.out.println(str);
}
}
Output:
How are <u> today? <iirc>, this is your first free day. Hope you are having fun! <:)>
The problem is actually tricky. For example, in the list of abbreviations, u should be interpreted as a word and not a letter, since in your expected output you don't surround the letter u in the word your with angle brackets but only the u that appears by itself. Hence your code needs to locate the abbreviation as a single word in the input.
Also, iirc appears in the abbreviations list but in the input you have Iirc (with a capital I) and in the expected output it should appear as <Iirc> and not as <iirc>. In other words you should ignore case when locating the abbreviation but you need to keep the case after surrounding the abbreviation with angle brackets.
Then you have :) in the abbreviations list but ) has special meaning in regular expression syntax so your code also needs to handle that situation.
All the above implies that you need to analyze the contents of the abbreviations list file in order to turn a raw abbreviation into a valid regular expression that you can then use to locate the abbreviation in the input text.
If you assume that the abbreviations list may contain every possible abbreviation, you would probably need a large amount of code to handle each one properly. Rather than do that, I just concentrated on your sample list which divides easily into two groups:
simple words
punctuation only
Note that the second group is also known as emoticons and some emoticons contain both letters and punctuation which my code, below, does not handle. As I said, my solution only pertains to your sample list of abbreviations.
Here is the code and below the code are some notes regarding it. Please note that I took the liberty of not just fixing your code, but refactoring it as well.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
//Driver for Abbreviations
public class AbbreviationsDriver {
//Abbreviations list file
private static Path abbreviationsListPath = Paths.get("abbreviations.txt");
//message file
private static Path inputPath = Paths.get("sample_msg.txt");
//output message file
private static File outputMessageFile = new File("sample_output.txt");
//main method
public static void main(String[] args) throws FileNotFoundException {
List<String> abbreviationsList = readFileList(abbreviationsListPath);
System.out.println("List of abbreviations: " + abbreviationsList);
String message = readFile(inputPath);
System.out.println("\nMessage in input file:\n" + message);
String result = addTags(message, abbreviationsList);
writeFile(outputMessageFile, result);
System.out.println("\nMessage with tag in output file:\n" + result);
}
//method to add tags
public static String addTags(String toTag, List<String> abbreviations) {
for (String abbreviation : abbreviations) {
String regex;
if (abbreviation.contains(")")) {
regex = "(\\Q" + abbreviation + "\\E)";
}
else {
regex = "(?i)(\\b" + abbreviation + "\\b)";
}
toTag = toTag.replaceAll(regex, "<$1>");
}
return toTag;
}
//method to read the file list
public static List<String> readFileList(Path path) {
List<String> list;
try {
list = Files.readAllLines(path);
}
catch (IOException exception) {
list = List.of();
System.out.println("Failed to load: " + path);
exception.printStackTrace();
}
return list;
}
//method to read the file
public static String readFile(Path path) {
String inputFile;
try {
inputFile = Files.readString(path);
}
catch (IOException exception) {
System.out.println("Failed to read: " + path);
exception.printStackTrace();
inputFile = "";
}
return inputFile;
}
//method to write the output file
public static void writeFile(File fileName, String outString) {
try {
PrintWriter outputStatement = new PrintWriter(fileName);
outputStatement.print(outString);
outputStatement.close();
}
catch (Exception exception) {
System.out.println("Failed to write file: " + fileName);
exception.printStackTrace();
}
}
}
I use interface Path rather than class File so that I can use methods of class Files to read the text files that contain the abbreviations list and the input. Hence my code works with interface List rather than with an array of String.
Passing class members to methods as method parameters defeats the purpose of having a class member in the first place. Hence I removed the members message and AbbreviationsList.
The actual work of locating the abbreviations in the input and surrounding them with angle brackets, all occurs in method addTags. Here I handle each separate group of abbreviations. If the abbreviation contains the character ), I quote it by surrounding it with quote markers \Q and \E. (Refer to javadoc of class Pattern). Otherwise the abbreviation is a regular word, so I surround it with the word boundary marker \b. I also enclose each regular expression in parentheses so as to make it a capturing group. Note that the second regular expression begins with (?i) which means to ignore case. Hence iirc will match Iirc.
The replacement string is <$1>. The $1 is replaced with the string that was actually matched so any abbreviation found in the input will be replaced by the matched string surrounded with angle brackets.
Finally, here is the output when running the above code and using your sample data.
List of abbreviations: [lol, :), iirc, 4, u, ttfn]
Message in input file:
How are u today? Iirc, this is your first free day. Hope you are having fun! :)
Message with tag in output file:
How are <u> today? <Iirc>, this is your first free day. Hope you are having fun! <:)>
There are several ways to do this. Either you use regular expressions, or you do things the old-fashioned way by parsing word-by-word. Others have pointed out problems with your current code, due to using strings that contain regular expression metacharacters. In particular,
String doesNotWork = "I am :)".replaceAll(":)", "happy"); // invalid regex
This can be solved by quoting the string with Pattern.quote, so that metacharacters are treated as literals. For ":)" it returns the pattern that would be written as "\\Q:)\\E" in Java source: \Q and \E delimit a quoted substring, whereas a single \ only quotes the character that follows it, and only when that character is non-alphabetic (a backslash before a letter introduces a regex construct such as \d or \b):
String worksAsExpected = "I am :)".replaceAll(Pattern.quote(":)"), "happy");
The most efficient way to process text is to do a single pass. This can be achieved by combining literal expressions with |s:
String regex = Stream.of("lol iirc 4".split(" "))
.map(s -> Pattern.quote(s)) // quotes each emoticon
.collect(Collectors.joining("|")); // joins with |
Matcher m = Pattern.compile(regex).matcher(input);
This yields surprisingly compact code, with nothing hardcoded. Finished code:
import java.util.regex.*;
import java.util.stream.*;
public class T {
public static String mark(
String[] needles, String startMark, String endMark, String input) {
String regex = Stream.of(needles)
.map(s -> s.matches("\\p{Alpha}+") ? // quotes each
"\\b" + Pattern.quote(s) + "\\b" : // to avoid yo<u>r
Pattern.quote(s)) // to handle emoticons
.collect(Collectors.joining("|")); // joins with |
Matcher m = Pattern.compile(regex).matcher(input);
StringBuffer output = new StringBuffer();
while (m.find()) {
m.appendReplacement(output, startMark + m.group() + endMark);
}
m.appendTail(output);
return output.toString();
}
public static void main(String ... args) {
System.out.println(mark(
"lol iirc 4 u ttfn :)".split(" "), // abbreviations
"<", ">", // markers to mark them with
"How are u today? iirc, this is your first free day. "
+ "Hope you are having fun! :)"));
}
}
I used @Arvind's trick of placing word-boundary metacharacters (\\b) only on alphabetical needles. This stops every u inside a word from being marked, but may yield strange results for 4s: any number with a 4 in it will get that digit marked. Ultimately, natural language processing is hard. Regular expressions are great for very regular inputs.
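One possible way to soften the problem with 4s is sketched below. It is a variation on the code above, not a definitive fix: needles made only of letters and digits are guarded with lookarounds so they do not match inside a longer alphanumeric run, matching is case-insensitive so that Iirc is found, and the replacement is passed through Matcher.quoteReplacement in case a needle contains $ or \. The class name Marker is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class Marker {
    static String mark(String[] needles, String startMark, String endMark, String input) {
        String regex = Stream.of(needles)
            .map(s -> s.matches("\\p{Alnum}+")
                // letters/digits: must not touch another letter or digit
                ? "(?<!\\p{Alnum})" + Pattern.quote(s) + "(?!\\p{Alnum})"
                // anything else (e.g. emoticons): plain literal match
                : Pattern.quote(s))
            .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        StringBuffer output = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the match from being interpreted
            m.appendReplacement(output, Matcher.quoteReplacement(startMark + m.group() + endMark));
        }
        m.appendTail(output);
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(mark(
            "lol iirc 4 u ttfn :)".split(" "), "<", ">",
            "How are u today? Iirc, this is your 42nd day :) 4 u"));
    }
}
```

With this input the standalone 4 and u are marked, while the 4 in 42nd and the u in your are left alone.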
My code reads and writes the file, but instead of putting every value on a new line it prints all values on one line.
// 2 points
static void Q1(String inputFilename, String outputFilename) {
// You are given a csv file (inputFilename) with all the data on a single line. Separate the
// values by commas and write each value on a separate line in a new file (outputFilename)
String data = "";
try {
for(String s :Files.readAllLines(Paths.get(inputFilename))){
data = data + s;
}
Files.write(Paths.get(outputFilename), data.getBytes());
} catch (IOException e) {
e.printStackTrace();
}
}
As such the grader says:
Incorrect on input: [data/oneLine0.csv, output0.txt]
Expected output : overwrought plastic bomb
wrapped litter basket
obstetric matter of law
diabetic stretching
spatial marathi
continental prescott
reproductive john henry o'hara
hollow beta blocker
stereotyped national aeronautics and space administration
irremediable st. olaf
brunet fibrosis
embarrassed dwarf elm
superficial harrier
disparaging whetstone
consecrate agony
impacted lampoon
nefarious textile
some other organisation
Your output : overwrought plastic bomb,wrapped litter basket,obstetric matter of law,diabetic stretching,spatial marathi,continental prescott,reproductive john henry o'hara,hollow beta blocker,stereotyped national aeronautics and space administration,irremediable st. olaf,brunet fibrosis,embarrassed dwarf elm,superficial harrier,disparaging whetstone,consecrate agony,impacted lampoon,nefarious textile,some other organisation
String data = "";
try {
// input file has all data on one line, for loop isn't necessary here
// input file has elements separated by comma characters
for(String s : Files.readAllLines(Paths.get(inputFilename))){
data = data + s;
}
String[] separated = data.split(",");// does not handle embedded commas well
data = "";
// output file should have each comma separated value on its own line
for (String t : separated) {
data = data + t + System.getProperty("line.separator");
}
Files.write(Paths.get(outputFilename), data.getBytes());
} catch (IOException e) {
e.printStackTrace();
}
First of all, you need to turn each comma in the CSV line into a line break. I'd suggest using
s = s.replace(",", "\n"); Note that replace() returns a new String, so its result must be assigned back to s. Additionally, you must append a \n to each line so that the last value also ends with a line break: s += "\n"; This yields the code:
// 2 points
static void Q1(String inputFilename, String outputFilename) {
// You are given a csv file (inputFilename) with all the data on a single line. Separate the
// values by commas and write each value on a separate line in a new file (outputFilename)
String data = "";
try {
for(String s :Files.readAllLines(Paths.get(inputFilename))){
s = s.replace(",", "\n"); // put every value on its own line
s += "\n";
data = data + s;
}
Files.write(Paths.get(outputFilename), data.getBytes());
} catch (IOException e) {
e.printStackTrace();
}
}
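Another option, sketched here under the assumption that the values never contain embedded commas, is to collect the values in a list and let Files.write(Path, Iterable) add the line separators. The class OneValuePerLine and the helper splitValues() are names invented for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OneValuePerLine {
    // Splits every input line on commas and returns the values in order.
    static List<String> splitValues(List<String> lines) {
        List<String> values = new ArrayList<>();
        for (String line : lines) {
            values.addAll(Arrays.asList(line.split(",")));
        }
        return values;
    }

    static void q1(String inputFilename, String outputFilename) throws IOException {
        List<String> values = splitValues(Files.readAllLines(Paths.get(inputFilename)));
        // Files.write(Path, Iterable) writes each element followed by a line separator.
        Files.write(Paths.get(outputFilename), values);
    }

    public static void main(String[] args) {
        System.out.println(splitValues(Arrays.asList("wrapped litter basket,spatial marathi")));
    }
}
```

This avoids building one large String by hand and leaves the choice of line separator to the platform.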
I am using a FileReader to read a CSV file. The second column of the file is an RGB value such as rgb(255,255,255), but the columns of the file are also separated by commas. If I split on the comma delimiter, I get "rgb(255,". How do I read the whole RGB value? The code is pasted below. Thanks!
FileReader reader = new FileReader(todoTaskFile);
BufferedReader in = new BufferedReader(reader);
int columnIndex = 1;
String line;
while ((line = in.readLine()) != null) {
if (line.trim().length() != 0) {
String[] dataFields = line.split(",");
//System.out.println(dataFields[0]+dataFields[1]);
if (!taskCount.containsKey(dataFields[columnIndex])) {
taskCount.put(dataFields[columnIndex], 1);
} else {
int oldCount = taskCount.get(dataFields[columnIndex]);
taskCount.put(dataFields[columnIndex],oldCount + 1);
}
}
I would strongly suggest not to use custom methods to parse CSV input. There are special libraries that do it for you.
@Ashraful Islam posted a good way to parse the value from a "cell" (I reused it), but getting the raw value of this "cell" must be done in a different way. This sketch shows how to do it using the Apache Commons CSV library (org.apache.commons.csv).
package csvparsing;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class GetRGBFromCSV {
public static void main(String[] args) throws IOException {
Reader in = new FileReader(GetRGBFromCSV.class.getClassLoader().getResource("sample.csv").getFile());
Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in); // remove ".withFirstRecordAsHeader()" if the csv file has no header row
for (CSVRecord record : records) {
String color = record.get("Color"); // use ".get(1)" to get value from second column if there's no header in csv file
System.out.println(color);
Pattern RGB_PATTERN = Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)", Pattern.CASE_INSENSITIVE);
Matcher m = RGB_PATTERN.matcher(color);
if (m.find()) {
Integer red = Integer.parseInt(m.group(1));
Integer green = Integer.parseInt(m.group(2));
Integer blue = Integer.parseInt(m.group(3));
System.out.println(red + " " + green + " " + blue);
}
}
}
}
This is a custom valid CSV input which would probably make regex-based solutions behave unexpectedly:
Name,Color
"something","rgb(100,200,10)"
"something else","rgb(10,20,30)"
"not the value rgb(1,2,3) you are interested in","rgb(10,20,30)"
There are lots of cases you might forget to take into account when you write your own parser: quoted and unquoted strings, delimiters within quotes, escaped quotes within quotes, different delimiters (, or ;), multiple columns, etc. A third-party CSV parser takes care of those things for you. You shouldn't reinvent the wheel.
line = "rgb(25,255,255)";
line = line.replace(")", "");
line = line.replace("rgb(", "");
String[] vals = line.split(",");
Then parse the values in vals with Integer.parseInt and you can use them.
Here is how you can do this :
Pattern RGB_PATTERN = Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)");
String line = "rgb(25,255,255)";
Matcher m = RGB_PATTERN.matcher(line);
if (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
Here
\\d{1,3} => matches a run of 1 to 3 digits
(\\d{1,3}) => matches a run of 1 to 3 digits and captures the match in a group
Because ( and ) are metacharacters, we have to escape them to match literal parentheses.
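If pulling in a CSV library is not an option, the line itself can be split on commas that are not inside parentheses, and the RGB pattern applied to the resulting field. This is only a sketch: it assumes there are no nested parentheses and no quoted fields, and the names RgbColumn, splitFields() and parseRgb() are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RgbColumn {
    // A comma that is NOT followed by a ')' before any other parenthesis,
    // i.e. a comma outside rgb(...). ASSUMPTION: no nested parentheses or quotes.
    private static final Pattern OUTSIDE_PARENS = Pattern.compile(",(?![^()]*\\))");
    private static final Pattern RGB = Pattern.compile(
            "rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)", Pattern.CASE_INSENSITIVE);

    static String[] splitFields(String line) {
        return OUTSIDE_PARENS.split(line, -1); // -1 keeps trailing empty fields
    }

    static int[] parseRgb(String field) {
        Matcher m = RGB.matcher(field.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an rgb(...) value: " + field);
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    public static void main(String[] args) {
        String[] fields = splitFields("task1,rgb(255,255,255),done");
        System.out.println(Arrays.toString(fields));
        System.out.println(Arrays.toString(parseRgb(fields[1])));
    }
}
```

In the question's loop, dataFields[columnIndex] would then hold the complete rgb(255,255,255) string.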
My program displays the matching results, but I want to sort the results as complete match (100%), half a match and so on.
My text file contains the following lines:
Red car
Red
Car
So if I search for "red car", I get the following results:
Red car
Red
Car
So what I want to do is to sort the found results as follows:
"red car" 100% match
"red" 40% match
"car" 40% match
Any help is appreciated. My code is as follows:
public static void main(String[] args) {
// TODO code application logic here
String strLine;
try{
// Open the file that is the first
// command line parameter
FileInputStream fstream = new FileInputStream("C:\\textfile.txt");
// Get the object of DataInputStream
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
Scanner input = new Scanner (System.in);
System.out.print("Enter Your Search: "); // String key="red or yellow";
String key = input.nextLine();
while ((strLine = br.readLine()) != null) {
Pattern p = Pattern.compile(key); // regex pattern to search for
Matcher m = p.matcher(strLine); // src of text to search
boolean b = false;
while(b = m.find()) {
System.out.println( " " + m.group()); // returns index and match
// Print the content on the console
}
}
//Close the input stream
in.close();
}catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
Assuming you are searching for "Red" or "Yellow", and or is the only logical operator you need (no 'and' or 'xor') and you don't want to use any wildcards or regular-expressions in what you search for, then I would simply loop through, trying to match each String in turn against the line. In pseudo-code, something like:
foreach (thisLine: allLinesInTheFile) {
numOfCharsMatching = 0
foreach (thisString: allSearchStrings) {
if (thisLine.contains(thisString)) {
numOfCharsMatching = numOfCharsMatching + thisString.length
}
}
score = ( numOfCharsMatching / thisLine.length ) * 100
}
If you don't want spaces to count in your score, then you'd need to remove them from the thisString.length (and not allow them in your search terms)
One other problem is that numOfCharsMatching will be incorrect if matches can overlap (e.g. when searching for 'row' or 'brown' in the line 'brown', it will say that 8 characters match, which is longer than the line itself). You could use a BitSet to track which characters have been involved in a match, something like:
foreach (thisLine: allLinesInTheFile) {
whichCharsMatch = new BitSet()
foreach (thisString: allSearchStrings) {
if (thisLine.contains(thisString)) {
whichCharsMatch.set(startPositionOfMatch, endPositionOfMatch, true)
}
}
score = ( whichCharsMatch.cardinality() / thisLine.length ) * 100
}
Have a look at the BitSet javadoc, particularly the set and cardinality methods
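In real Java the BitSet idea could look roughly like the sketch below. MatchScore and score() are names made up for the example, and the comparison is made case-insensitive by lower-casing both sides, which is an assumption (the question searches for "red car" in a file containing "Red car").

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Locale;

class MatchScore {
    // Percentage of characters in the line covered by at least one search term.
    static double score(String line, List<String> terms) {
        String haystack = line.toLowerCase(Locale.ROOT);
        if (haystack.isEmpty()) {
            return 0.0;
        }
        BitSet covered = new BitSet(haystack.length());
        for (String term : terms) {
            String needle = term.toLowerCase(Locale.ROOT);
            if (needle.isEmpty()) {
                continue;
            }
            // Mark every occurrence; overlapping matches set the same bits only once.
            int from = haystack.indexOf(needle);
            while (from >= 0) {
                covered.set(from, from + needle.length());
                from = haystack.indexOf(needle, from + 1);
            }
        }
        return 100.0 * covered.cardinality() / haystack.length();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("red", "car");
        for (String line : new String[] {"Red car", "Red", "Car"}) {
            System.out.println(line + ": " + score(line, terms));
        }
    }
}
```

Because the space in "Red car" is not covered by either term, that line scores below 100; whether spaces should count is the design decision mentioned above.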
I want to filter a string.
Basically when someone types a message, I want certain words to be filtered out, like this:
User types: hey guys lol omg -omg mkdj*Omg*ndid
I want the filter to run and:
Output: hey guys lol - mkdjndid
And I need the filtered words to be loaded from an ArrayList that contains several words to filter out. Now at the moment I am doing if(message.contains(omg)) but that doesn't work if someone types zomg or -omg or similar.
Use replaceAll with a regex built from the bad word:
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
This passes your test case:
public static void main( String[] args ) {
List<String> badWords = Arrays.asList( "omg", "black", "white" );
String message = "hey guys lol omg -omg mkdj*Omg*ndid";
for ( String badWord : badWords ) {
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
}
System.out.println( message );
}
try:
input.replaceAll("(\\*?)[oO][mM][gG](\\*?)", "").split(" ")
Dave gave you the answer already, but I will emphasize one point here. You will face a problem if you implement your algorithm with a simple for-loop that just replaces every occurrence of the filtered word. As an example, if you filter the word ass in the word 'classic' and replace it with 'butt', the resulting word will be 'clbuttic', which doesn't make any sense. Thus, I would suggest using a word list, like the ones stored on Linux under the /usr/share/dict/ directory, to check whether the word is valid or needs filtering.
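A rough sketch of that idea follows. The dictionary here is a tiny hard-coded Set standing in for a real word list, and DictionaryAwareFilter and filter() are hypothetical names. Each space-separated token is left untouched when its letters form a known word, otherwise every bad word is stripped from it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class DictionaryAwareFilter {
    static String filter(String message, List<String> badWords, Set<String> dictionary) {
        List<String> kept = new ArrayList<>();
        for (String token : message.split(" ", -1)) {
            // Compare only the letters of the token against the dictionary.
            String bare = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (!dictionary.contains(bare)) {
                for (String badWord : badWords) {
                    // (?i) = ignore case; Pattern.quote keeps the word literal
                    token = token.replaceAll("(?i)" + Pattern.quote(badWord), "");
                }
            }
            kept.add(token);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // ASSUMPTION: a real application would load this from a word list file.
        Set<String> dictionary = new HashSet<>(Arrays.asList("classic", "bass"));
        List<String> badWords = Arrays.asList("omg", "ass");
        System.out.println(filter("a classic omg -omg", badWords, dictionary));
    }
}
```

The trade-off is that the result is only as good as the word list: a valid word that is missing from it will still be mangled.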
I don't quite get what you are trying to do.
I ran into this same problem and solved it in the following way:
1) Have a google spreadsheet with all words that I want to filter out
2) Directly download the google spreadsheet into my code with the loadConfigs method (see below)
3) Replace all l33tsp33k characters with their respective alphabet letters
4) Remove all characters except letters from the sentence
5) Run an algorithm that efficiently checks all the possible combinations of words within a string against the list. Note that this part is key: you don't want to loop over your ENTIRE list every time to see if your word is in the list. In my case, I found every combination within the string input and checked it against a hashmap (O(1) lookup). This way the runtime grows with the length of the string input, not with the size of the list.
6) Check whether the word is used in combination with a good word (e.g. bass contains *ss), in which case it is ignored. The good words are also loaded through the spreadsheet
7) In our case we are also posting the filtered words to Slack, but you can remove that line obviously.
We are using this in our own games and it's working like a charm. Hope you guys enjoy.
https://pimdewitte.me/2016/05/28/filtering-combinations-of-bad-words-out-of-string-inputs/
public static HashMap<String, String[]> words = new HashMap<String, String[]>();
public static void loadConfigs() {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL("https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv").openConnection().getInputStream()));
String line = "";
int counter = 0;
while((line = reader.readLine()) != null) {
counter++;
String[] content = null;
try {
content = line.split(",");
if(content.length == 0) {
continue;
}
String word = content[0];
String[] ignore_in_combination_with_words = new String[]{};
if(content.length > 1) {
ignore_in_combination_with_words = content[1].split("_");
}
words.put(word.replaceAll(" ", ""), ignore_in_combination_with_words);
} catch(Exception e) {
e.printStackTrace();
}
}
System.out.println("Loaded " + counter + " words to filter out");
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Iterates over a String input and checks whether a cuss word was found in a list, then checks if the word should be ignored (e.g. bass contains the word *ss).
* @param input
* @return
*/
public static ArrayList<String> badWordsFound(String input) {
if(input == null) {
return new ArrayList<>();
}
// remove leetspeak
input = input.replaceAll("1","i");
input = input.replaceAll("!","i");
input = input.replaceAll("3","e");
input = input.replaceAll("4","a");
input = input.replaceAll("@","a");
input = input.replaceAll("5","s");
input = input.replaceAll("7","t");
input = input.replaceAll("0","o");
ArrayList<String> badWords = new ArrayList<>();
input = input.toLowerCase().replaceAll("[^a-zA-Z]", "");
for(int i = 0; i < input.length(); i++) {
for(int fromIOffset = 1; fromIOffset < (input.length()+1 - i); fromIOffset++) {
String wordToCheck = input.substring(i, i + fromIOffset);
if(words.containsKey(wordToCheck)) {
// for example, if you want to say the word bass, that should be possible.
String[] ignoreCheck = words.get(wordToCheck);
boolean ignore = false;
for(int s = 0; s < ignoreCheck.length; s++ ) {
if(input.contains(ignoreCheck[s])) {
ignore = true;
break;
}
}
if(!ignore) {
badWords.add(wordToCheck);
}
}
}
}
for(String s: badWords) {
Server.getSlackManager().queue(s + " qualified as a bad word in a username");
}
return badWords;
}