Removing punctuation is not working in Java with string replacement - java

So, what I'm trying to do is compile a single word list with no repeats out of 8 separate dictionary word lists. Some of the dictionaries have punctuation in them to separate the words. Below is what I have that pertains to the punctuation removal. I've tried several different solutions that I've found on stack overflow regarding regex expressions, as well as the one I've left in place in my code. For some reason, none of them are removing the punctuation from the source dictionaries. Can someone tell me what it is I've done wrong here and possibly how to fix it? I'm at a loss and had a coworker check it and he says this ought to be working as well.
int i = 1;
boolean checker = true;
Scanner inputWords;
PrintWriter writer = new PrintWriter(
"/home/htarbox/Desktop/fullDictionary.txt");
String comparison, punctReplacer;
ArrayList<String> compilation = new ArrayList<String>();
while (i <9)
{
inputWords = new Scanner(new File("/home/htarbox/Desktop/"+i+".txt"));
while(inputWords.hasNext())
{
punctReplacer = inputWords.next();
punctReplacer.replaceAll("[;.:\"()!?\\t\\n]", "");
punctReplacer.replaceAll(",", "");
punctReplacer.replaceAll("\u201C", "");
punctReplacer.replaceAll("\u201D", "");
punctReplacer.replaceAll("’", "'");
System.out.println(punctReplacer);
compilation.add(punctReplacer);
}
}
inputWords.close();
}
i = 0;

The line
punctReplacer.replaceAll(",", "");
returns a new String with your replacement (which you're ignoring). It doesn't modify the existing String. As such you need:
punctReplacer = punctReplacer.replaceAll(",", "");
Strings are immutable. Once created you can't change them, and any String manipulation method returns a new String.

As strings are immutable, you have to reassign your variable:
punctReplacer = punctReplacer.replaceAll("[;.:\"()!?\\t\\n]", "");
(By the way, immutable means that the value cannot be changed once it has been set, so with String you always have to reassign the variable if you want it to hold a changed value.)
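Putting the two answers together, the cleaning step might be sketched as follows. The names WordCleaner and clean are made up for illustration, and collecting the words in a LinkedHashSet (to drop repeats while keeping order) is an assumption about what the asker wants, not part of the original code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class WordCleaner {
    // Returns a cleaned copy of the word; the argument itself is never modified.
    static String clean(String word) {
        // Every replace call returns a new String, so the result must be kept.
        String cleaned = word.replaceAll("[;.:,\"()!?\\t\\n\u201C\u201D]", "");
        // Normalise the typographic apostrophe to a plain one.
        return cleaned.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        // A set silently ignores words it already contains.
        Set<String> words = new LinkedHashSet<>();
        for (String raw : new String[] {"hello,", "\u201Cworld\u201D", "hello", "don\u2019t!"}) {
            words.add(clean(raw));
        }
        System.out.println(words);
    }
}
```

The single character class folds the separate replaceAll calls from the question into one pass; keeping them separate would work just as well as long as each result is assigned back.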

Related

Iterate through a dictionary array

I have a String array containing a poem which has deliberate spelling mistakes. I am trying to iterate through the String array to identify the spelling mistakes by comparing the String array to a String array containing a dictionary. If possible I would like a suggestion that allows me to continue using nested for loops
for (int i = 0; i < poem2.length; i++) {
boolean found = false;
for (int j = 0; j < dictionary3.length; j++) {
if (poem2[i].equals(dictionary3[j])) {
found = true;
break;
}
}
if (found==false) {
System.out.println(poem2[i]);
}
}
The output is printing out the correctly spelt words as well as the incorrectly spelt ones and I am aiming to only print out the incorrectly spelt ones. Here is how I populate the 'dictionary3' and 'poem2' arrays:
char[] buffer = null;
try {
BufferedReader br1 = new BufferedReader(new
java.io.FileReader(poem));
int bufferLength = (int) (new File(poem).length());
buffer = new char[bufferLength];
br1.read(buffer, 0, bufferLength);
br1.close();
} catch (IOException e) {
System.out.println(e.toString());
}
String text = new String(buffer);
String[] poem2 = text.split("\\s+");
char[] buffer2 = null;
try {
BufferedReader br2 = new BufferedReader(new java.io.FileReader(dictionary));
int bufferLength = (int) (new File(dictionary).length());
buffer2 = new char[bufferLength];
br2.read(buffer2, 0, bufferLength);
br2.close();
} catch (IOException e) {
System.out.println(e.toString());
}
String dictionary2 = new String(buffer);
String[] dictionary3 = dictionary2.split("\n");
Your basic problem is in line
String dictionary2 = new String(buffer);
where you were trying to convert the characters representing the dictionary, stored in buffer2, but you used buffer (without the 2 suffix). This style of variable naming suggests that you need either a loop or, in this case, a separate method that returns the array of words held in a given file (you could also pass the delimiter to split on as a method parameter).
So dictionary2 held the characters from buffer, which represented the poem, not the dictionary data.
Another problem is
String[] dictionary3 = dictionary2.split("\n");
because you are splitting only on \n, but some operating systems, such as Windows, use \r\n as the line separator sequence. So your dictionary array may contain words like foo\r instead of foo, which will cause poem2[i].equals(dictionary3[j]) to always fail.
To avoid this problem you can split on \\R (available since Java 8) or \r?\n|\r.
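The effect is easy to reproduce in isolation. The sketch below uses a made-up class name (LineSplitDemo) and a hard-coded string standing in for the dictionary file contents.

```java
import java.util.Arrays;

class LineSplitDemo {
    // Splits on any line break (\n, \r\n or \r); \R needs Java 8 or later.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String windowsText = "foo\r\nbar\r\nbaz";
        // Splitting on "\n" alone leaves a trailing \r on every word but the last.
        System.out.println(windowsText.split("\n")[0].equals("foo")); // false
        System.out.println(Arrays.toString(splitLines(windowsText))); // [foo, bar, baz]
    }
}
```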
There are other problems in your code, such as closing the resource inside the try section. If an exception is thrown earlier, close() will never be invoked, leaving the resource open. To solve this, close resources in a finally section (which is always executed after try, regardless of whether an exception is thrown), or better, use try-with-resources.
BTW you can simplify/clarify your code responsible for reading words from files
List<String> poem2 = new ArrayList<>();
// try-with-resources closes the scanner even if reading fails
try (Scanner scanner = new Scanner(new File(yourFileLocation))) {
    while (scanner.hasNext()) { // has more words
        poem2.add(scanner.next());
    }
}
For dictionary instead of List you should use Set/HashSet to avoid duplicates (usually sets also have better performance when checking if they contain some elements or not). Such collections already provide methods like contains(element) so you wouldn't need that inner loop.
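A minimal sketch of that Set-based lookup, assuming both files hold whitespace-separated words. The class and method names (SpellCheck, readWords, misspelled) are invented for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

class SpellCheck {
    // Reads whitespace-separated words; try-with-resources closes the Scanner
    // even if an exception is thrown while reading.
    static List<String> readWords(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    // Returns the poem words that are missing from the dictionary, in poem order.
    static List<String> misspelled(List<String> poem, Set<String> dictionary) {
        List<String> unknown = new ArrayList<>();
        for (String word : poem) {
            if (!dictionary.contains(word)) { // replaces the inner loop
                unknown.add(word);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        System.out.println(misspelled(Arrays.asList("teh", "cat", "sat"), dictionary)); // [teh]
    }
}
```

With real files, the dictionary set could be built with new HashSet<>(readWords(dictionaryPath)), where dictionaryPath is whatever path the dictionary lives at.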
I copied your code and ran it, and I noticed two issues. Good news is, both are very quick fixes.
#1
When I printed out everything in dictionary3 after it is read in, it is the exact same as everything in poem2. This line in your code for reading in the dictionary is the problem:
String dictionary2 = new String(buffer);
You're using buffer, which was the variable you used to read in the poem. Therefore, buffer contains the poem and your poem and dictionary end up the same. I think you want to use buffer2 instead, which is what you used to read in the dictionary:
String dictionary2 = new String(buffer2);
When I changed that, the dictionary and poem appear to have the proper entries.
#2
The other problem, as Pshemo pointed out in their answer (which is completely correct, and a very good answer!) is that you are splitting on \n for the dictionary. The only thing I would say differently from Pshemo here is that you should probably split on \\s+ just like you did for the poem, to stay consistent. In fact, when I debugged, I noticed that the dictionary words all have "\r" appended to the end, probably because you were splitting on \n. To fix this, change this line:
String[] dictionary3 = dictionary2.split("\n");
To this:
String[] dictionary3 = dictionary2.split("\\s+");
Try changing those two lines, and let us know if that resolves your issue. Best of luck!
Convert your dictionary to an ArrayList and use contains instead.
Something like this should work:
if (dictionary3.contains(poem2[i]))
found = true;
else
found = false;
With this method you can also get rid of the nested loop, as the contains method handles that for you.
You can convert your dictionary array to an ArrayList as follows:
new ArrayList<>(Arrays.asList(array))

Creating new directory not possible with String concat

I am trying to create a new directory using Java, but I realized that mkdir() doesn't work with strings that are built with the concat() method or the '+' operator.
For example:
String keyword = "golden+retriever";
String folderName = removeChar(keyword);
String strDirectory = "C:/Users/Administrator/Desktop/"+folderName;
File newFolder = new File(strDirectory);
newFolder.mkdir();
The above code does not create the folder, but it works correctly if I use the directory without the '+' operator, like this:
String strDirectory = "C:/Users/Administrator/Desktop/goldenretriever";
File newFolder = new File(strDirectory);
newFolder.mkdir();
Why is that? Is there any way to successfully create a directory using the '+' operator or the concat() method?
Update:
The '+' in the String is not a typo. The removeChar() method simply removes the '+' in order to create a folder without special characters.
Below is the code for removeChar():
public static String removeChar(String s)
{
StringBuffer buff = new StringBuffer(s.length());
buff.setLength(s.length());
int current = 0;
for (int i=0; i<s.length(); i++)
{
char cur = s.charAt(i);
if(cur != '+')
{
buff.setCharAt(current++, cur);
}
}
return buff.toString();
}
Please make sure that your path is valid; Windows does allow + in a folder name, so the + itself is not the problem.
For example, suppose the parent path is C:/TestWS/Test/TestingDir/ and your keyword is
String keyword = "golden+retriever";
mkdir() does not create missing parent directories, so make sure that C:/TestWS/Test/TestingDir/ already exists before creating the final keyword directory (or call mkdirs(), which creates the parents as well).
If TestingDir does not exist and you try to create "golden+retriever" inside it, mkdir() will not create any directory.
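The difference between File.mkdir() and File.mkdirs() can be checked with a short sketch. The class name MakeDirs, the helper ensureDirectory and the temporary folder names are all made up for illustration; the demo writes under the system temp directory and assumes that location is writable.

```java
import java.io.File;

class MakeDirs {
    // Creates the directory and any missing parents; returns true if it exists afterwards.
    static boolean ensureDirectory(File dir) {
        // mkdirs() returns false when nothing was created (e.g. the directory
        // already exists), so check isDirectory() as well.
        return dir.mkdirs() || dir.isDirectory();
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir"), "mkdir-demo-" + System.nanoTime());
        File nested = new File(base, "TestingDir/golden+retriever");
        // mkdir() only creates the last path element, so it fails while the parents are missing.
        System.out.println(nested.mkdir());
        // mkdirs() creates the parents too; the '+' in the name is not a problem.
        System.out.println(ensureDirectory(nested));
    }
}
```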
What's wrong with your code is that removeChar returns excess trailing characters after the keyword (unused null characters left in the buffer, which trim() also strips). What you can do is trim the string before creating the file.
Try:
File newFolder = new File(strDirectory.trim());
And also I recommend you check if the folder exists first before creating it
if(!newFolder.exists()) {
newFolder.mkdir();
}
Although the original poster has gone with another more efficient solution to solve the problem, I'd like to point out why their original code was failing.
They created a new StringBuffer at the top and set its length equal to the length of the original string s. If any pluses are removed, the resulting string is shorter than the original, so the buffer keeps unused trailing characters. To make sure the StringBuffer has the correct length, you have to set the length to the number of characters actually written when finished.
To fix the original code would require changing:
return buff.toString();
To:
buff.setLength(current);
return buff.toString();
One could have reworked the code so the new StringBuffer buff started out blank and new characters are simply appended as needed. That could have been done with something like:
public static String removeChar(String s)
{
StringBuffer buff = new StringBuffer();
for (int i=0; i<s.length(); i++)
{
char cur = s.charAt(i);
if(cur != '+')
buff.append(cur);
}
return buff.toString();
}
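The padding behaviour described above can be seen directly in a small sketch. PaddingDemo and removePlus are hypothetical names; String.replace is shown only as one simpler alternative, not necessarily the solution the original poster chose.

```java
class PaddingDemo {
    // Simpler alternative to removeChar: String.replace treats its arguments
    // literally (no regex), so "+" needs no escaping here.
    static String removePlus(String s) {
        return s.replace("+", "");
    }

    public static void main(String[] args) {
        StringBuffer buff = new StringBuffer();
        buff.setLength(3);       // extends the buffer with three '\u0000' characters
        buff.setCharAt(0, 'a');
        String padded = buff.toString();
        System.out.println(padded.length());        // 3, although only one character was written
        System.out.println(padded.trim().length()); // 1, trim() strips trailing characters <= '\u0020'
        System.out.println(removePlus("golden+retriever")); // goldenretriever
    }
}
```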

Concatenate a new string at the beginning of an existing string

I want to concatenate a new string to the start of an existing string. For example,
the current string is "" and I always want to prepend the new string to my old string:
String msg="Java One",temp;
for(int i=msg.length()-2;i>0;i--){
here I make a loop starting from the end of msg; after the loop finishes, temp should contain "Java One", built up in this order
e
ne
one
a one
va one
}
and so on
I want always to concatenate the new string to start of my old string
This is very simple, but not very efficient:
String oldString = "";
for (...) {
// Prepare your new string
String newString = ... ;
// Add the new string at the beginning of the old string
oldString = newString + oldString;
}
You can use String#substring(int,int) to get different substrings in each iteration.
for(int i=msg.length()-1;i>=0;i--){
System.out.println(msg.substring(i,msg.length()));
}
Of course you can store each generated substring and do what you wish with it.
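Note that in older Java versions (before Java 7 update 6) this approach was likely to be more efficient because, although new String objects are created, substring shared the same underlying char[] among all of them. Newer versions copy the characters for each substring, so that advantage no longer applies.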
Also note that we are iterating from msg.length()-1 (and not -2, as the original code in the question) and while i >= 0 (and not i > 0, as in the original question)
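If the goal really is to build the string up by prepending, StringBuilder.insert(0, ...) avoids creating a chain of intermediate concatenations. A minimal sketch, with PrependDemo and suffixes as made-up names:

```java
import java.util.ArrayList;
import java.util.List;

class PrependDemo {
    // Builds the growing suffixes of msg by inserting one character at the front each time.
    static List<String> suffixes(String msg) {
        List<String> result = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (int i = msg.length() - 1; i >= 0; i--) {
            sb.insert(0, msg.charAt(i)); // prepend instead of append
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : suffixes("Java One")) {
            System.out.println(s);
        }
    }
}
```

Each insert at index 0 shifts the existing characters, so for very long strings appending in reverse and calling reverse() once at the end may be the cheaper choice.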

How do I concatenate input in java?

I am trying to concatenate and parse at the same time. I am currently making an Excel-like program where I can say a1 = "Hello" + "World" and have cell A1 show HelloWorld. I just need to know how to parse the plus sign and join those two words. Please tell me if you need more code to understand this, such as the runner.
This is my ParseInput class:
public class ParseInput {
private static String inputs;
static int col;
private static int row;
private static String operation;
private static Value field;
public static void parseInput(String input){
//splits the input at each regular expression match. \w is used for letters and \d && \D for integers
inputs = input;
Scanner tokens = new Scanner(inputs);
String none0 = tokens.next();
@SuppressWarnings("unused")
String none1 = tokens.next();
operation = tokens.nextLine().substring(1);
String[] holder = new String[2];
String regex = "(?<=[\\w&&\\D])(?=\\d)";
holder = none0.split(regex);
row = Integer.parseInt(holder[1]);
col = 0;
int counter = -1;
char temp = holder[0].charAt(0);
char check = 'a';
while(check <= temp){
if(check == temp){
col = counter +1;
}
counter++;
check = (char) (check + 1);
}
System.out.println(col);
System.out.println(row);
System.out.println(operation);
setField(Value.parseValue(operation));
Spreadsheet.changeCell(row, col, field);
}
public static Value getField() {
return field;
}
public static void setField(Value field) {
ParseInput.field = field;
}
}
This is actually a pretty complicated problem unless you can constrain input to a very small subset of what Excel accepts. If not then you'll probably want to look into something like ANTLR. However, assuming the above input then you'll want to do something like:
1. Split the string on the equal sign into s1 and s2.
2. Split s2 on the plus sign into s3 and s4.
3. Trim all the strings and remove the quotes around s3 and s4.
4. Concatenate s3 and s4 and assign the result to your datastore, indexed by s1.
Depending on how complex your concatenation needs are you can either use string concatenation or a StringBuilder:
result = "" + s3 + s4; // string concatenation
result = new StringBuilder().append(s3).append(s4).toString(); // StringBuilder
Let me know if you have any questions about any of the steps detailed above.
Details on (1) above, assuming input is a1 = "Hello" + "World":
String[] strings = input.split("=");
String s1 = strings[0].trim(); // a1
String s2 = strings[1].trim(); // "Hello" + "World"
strings = s2.split("\\+"); // "+" is a regex metacharacter, so it must be escaped
String s3 = strings[0].trim().replaceAll("^\"", "").replaceAll("\"$", ""); // Hello
String s4 = strings[1].trim().replaceAll("^\"", "").replaceAll("\"$", ""); // World
String field = s3 + s4;
String colString = s1.replaceAll("[\\d]", ""); // a
String rowString = s1.replaceAll("[\\D]", ""); // 1
int col = colString.charAt(0) - 'a'; // 0
int row = Integer.parseInt(rowString);
Spreadsheet.changeCell(row, col, field);
I suggest you to implement your custom grammar using a parser generator like JavaCC.
Here you can find a simple tutorial.
I believe this is the better solution because in this way you can handle every expression you need.
Are you sure you want to use all the classes you are using? To parse something like "a=b+c+d.." (assuming you are not trying to validate the input), the easiest and possibly most efficient way is to use the split method of java.lang.String.
Then join whatever is required using StringBuilder.
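As a rough sketch of that split-then-join idea: ConcatParser and evaluate are hypothetical names, and the code assumes the input contains an '=' and that every operand is a double-quoted literal containing no '+' or '='. It does no validation.

```java
class ConcatParser {
    // Evaluates the right-hand side of input such as: a1 = "Hello" + "World"
    static String evaluate(String input) {
        // Limit 2 keeps everything after the first '=' together.
        String[] sides = input.split("=", 2);
        StringBuilder result = new StringBuilder();
        for (String operand : sides[1].split("\\+")) { // '+' must be escaped in a regex
            String trimmed = operand.trim();
            // Strip one leading and one trailing double quote, if present.
            result.append(trimmed.replaceAll("^\"|\"$", ""));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("a1 = \"Hello\" + \"World\"")); // HelloWorld
    }
}
```

Anything richer than this (quoted plus signs, cell references, nesting) is where a real tokenizer or parser generator starts to pay off.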
You need to design and implement a parser and an evaluator. And before that, you need to design the language that your parser/evaluator is going to evaluate.
How to do it.
If your language is really simple, you can get away with parsing it by hand, using something like StringTokenizer to do the tokenization.
Otherwise, you are probably best off learning to use a Java "parser generator" such as JavaCC or ANTLR.
Either way, you need to do some background reading to understand all of the terminology. You could start with Wikipedia and/or the tutorial material from one of the parser generators. Alternatively, there are good textbooks on this topic.
In addition to what Abdullah said, if you really want to save every ounce of memory you can, you should use StringBuilder instead of String concatenation. I believe I read somewhere that String concatenation creates a new String object for each concatenation, while StringBuilder appends everything into a single buffer. It shouldn't matter too much, though.
Early on I wrote an equation evaluator in your style. It cost me a lot of code and complexity because I didn't know about expression trees. With them, you will be able to add more capabilities to your parser easily, in plain Java code. You will find plenty of examples of using expression trees.

Need help parsing strings in Java

I am reading in a csv file in Java and, depending on the format of the string on a given line, I have to do something different with it. The three different formats contained in the csv file are (using random numbers):
833
"79, 869"
"56-57, 568"
If it is just a single number (833), I want to add it to my ArrayList. If it is two numbers separated by a comma and surrounded by quotation marks ("79, 869"), I want to parse out the first of the two numbers (79) and add it to the ArrayList. If it is three numbers surrounded by quotation marks, where the first two numbers are separated by a dash and the third by a comma ("56-57, 568"), then I want to parse out the third number (568) and add it to the ArrayList.
I am having trouble using str.contains() to determine if the string on a given line contains a dash or not. Can anyone offer me some help? Here is what I have so far:
private static void getFile(String filePath) throws java.io.IOException {
BufferedReader reader = new BufferedReader(new FileReader(filePath));
String str;
while ((str = reader.readLine()) != null) {
if(str.endsWith("\"")){
if (str.contains(charDash)){
System.out.println(str);
}
}
}
}
Thanks!
I recommend using the version of indexOf that actually takes a char rather than a string, since this method is much faster. (It is a simple loop, without a nested loop.)
I.e.
if (str.indexOf('-')!=-1) {
System.out.println(str);
}
(Note the single quotes, so this is a char, rather than a string.)
But then you have to split the line and parse the individual values. At present, you are testing if the whole line ends with a quote, which is probably not what you want.
The following code works for me (note: I wrote it with no optimization in mind - it's just for testing purposes):
public static void main(String args[]) {
ArrayList<String> numbers = GetNumbers();
}
private static ArrayList<String> GetNumbers() {
String str1 = "833";
String str2 = "79, 869";
String str3 = "56-57, 568";
ArrayList<String> lines = new ArrayList<String>();
lines.add(str1);
lines.add(str2);
lines.add(str3);
ArrayList<String> numbers = new ArrayList<String>();
for (Iterator<String> s = lines.iterator(); s.hasNext();) {
String thisString = s.next();
if (thisString.contains("-")) {
numbers.add(thisString.substring(thisString.indexOf(",") + 2));
} else if (thisString.contains(",")) {
numbers.add(thisString.substring(0, thisString.indexOf(",")));
} else {
numbers.add(thisString);
}
}
return numbers;
}
Output:
833
79
568
Although it gets a lot of hate these days, I still really like the StringTokenizer for this kind of stuff. You can set it up to return the tokens and, at least to me, it makes the processing trivial without interacting with regexes
You'd have to create it using ",- as your delimiters, then just kick it off in a loop.
st=new StringTokenizer(line, "\",-", true);
Then you set up a loop:
while(st.hasMoreTokens()) {
String token=st.nextToken();
Each case becomes its own little part of the loop:
// Use punctuation to set flags that tell you how to interpret the numbers.
if(token.equals("\"")) { // compare strings with equals, not ==
isQuoted = !isQuoted;
} else if(token.equals(",")) {
...
} else if(...) {
...
} else { // The punctuation has been dealt with, must be a number group
// Apply flags to determine how to parse this number.
}
I realize that StringTokenizer is outdated now, but I'm not really sure why. Parsing regular expressions can't be faster and the syntax is--well split is a pretty sweet syntax I gotta admit.
I guess if you and everyone you work with is really comfortable with Regular Expressions you could replace that with split and just iterate over the resultant array but I'm not sure how to get split to return the punctuation--probably that "+" thing from other answers but I never trust that some character I'm passing to a regular expression won't do something utterly unexpected.
will
if (str.indexOf(charDash.toString()) > -1){
System.out.println(str);
}
do the trick?
By the way, this is marginally faster than contains, because contains is itself implemented with indexOf.
Will this work?
if(str.contains("-")) {
System.out.println(str);
}
I wonder if the charDash variable is not what you are expecting it to be.
I think three regexes would be your best bet - because with a match, you also get the bit you're interested in. I suck at regex, but something along the lines of:
.*\-.*, (.+)
.*, (.+)
and
(.+)
ought to do the trick (in order, because the final pattern matches anything including the first two).
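A runnable take on the three-pattern idea is sketched below. The patterns are tightened relative to the ones above so that the captured group is the number the question asks for (the first number in the two-number case, the third in the dash case); that adaptation, the class name CsvNumberPicker and the method pick are assumptions for illustration, and the quoted formats are assumed to appear with their surrounding quotation marks, as in the file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvNumberPicker {
    // Tried in order, most specific first; group 1 is the number to keep.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("\"\\d+-\\d+,\\s*(\\d+)\""), // "56-57, 568" keeps 568
        Pattern.compile("\"(\\d+),\\s*\\d+\""),      // "79, 869" keeps 79
        Pattern.compile("(\\d+)")                    // 833 keeps 833
    };

    // Returns the wanted number, or null if the line matches none of the formats.
    static String pick(String line) {
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(line.trim());
            if (m.matches()) { // matches() requires the whole line to fit the pattern
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"833", "\"79, 869\"", "\"56-57, 568\""}) {
            System.out.println(pick(line));
        }
    }
}
```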
