How to match the word exactly with regex? - java

I might be asking this question incorrectly, but what I would like to do is the following:
Given a large String, possibly many hundreds of lines long, match and replace a word exactly, and make sure that no part of any other word is matched or replaced.
For example :
Strings to Find = Mac Apple Microsoft Matt Damon I.B.M. Hursley
Replacement Strings = MacO AppleO MicrosoftO MattDamonP I.B.M.O HursleyL
Input String (with some of the escape characters included for clarity) =
"A file to test if it finds different\r\n
bits and bobs like Mac, Apple and Microsoft.\n
I.B.M. in Hursley does sum cool stuff!Wow look it's "Matt Damon"\r\n
Testing something whichwillerrorMac"\n
OUTPUT
"A file to test if it finds different
bits and bobs like MacO, AppleO and MicrosoftO.
I.B.M.O in HursleyL does sum cool stuff!Wow look it's "Matt DamonP"
Testing something whichwillerrorMac"
I have tried a regex with word boundaries, but it still produces 'whichwillerrorMacO' on the last line.
I have also tried using the StringTokenizer class with various delimiters to replace words, but some of the words I am trying to replace contain these delimiters.
Is there a regex that would solve this problem?

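Replacing \b(Mac|Apple)\b with $1O will not touch whichwillerrorMac, because there is no word boundary between the r and the M. It will match the Mac in whichwill-Mac though, since - is not a word character.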
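One caveat with \b is that a search term such as I.B.M. ends in a non-word character, so a trailing \b does not behave the way you might expect. A minimal sketch of one way around that, assuming it is acceptable to define "exact" as "not directly preceded or followed by a word character", uses lookarounds instead of \b. The class and method names (ExactWordReplacer, replaceExact) are made up for illustration and are not from the question.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactWordReplacer {
    // Replaces every key of the map with its value, but only where the key is
    // not glued to a letter, digit or underscore on either side.
    static String replaceExact(String input, Map<String, String> replacements) {
        StringBuilder alternatives = new StringBuilder();
        for (String word : replacements.keySet()) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            // Pattern.quote makes characters like '.' in "I.B.M." literal.
            alternatives.append(Pattern.quote(word));
        }
        Pattern pattern = Pattern.compile("(?<!\\w)(?:" + alternatives + ")(?!\\w)");
        Matcher matcher = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = replacements.get(matcher.group());
            // quoteReplacement protects any '$' or '\' in the replacement text.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Passing a map such as Mac -> MacO and I.B.M. -> I.B.M.O leaves whichwillerrorMac alone, because the M there is preceded by a word character.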

Related

Why is my String array length 3 instead of 2?

I'm trying to understand regex. I wanted to make a String[] using split to show me how many letters are in a given string expression?
import java.util.*;
import java.io.*;

public class Main {
    public static String simpleSymbols(String str) {
        String result = "";
        String[] alpha = str.split("[\\+\\w\\+]");
        int alphaLength = alpha.length;
        // System.out.print(alphaLength);
        String[] charCount = str.split("[a-z]");
        int charCountLength = charCount.length;
        System.out.println(charCountLength); // prints 3 for "+d+=3=+s+"
        return result; // added so the method compiles; the original snippet had no return
    }
}
My input string is "+d+=3=+s+". I split the string to count the number of letters in string. The array length should be two but I'm getting three. Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
So, a few things pop out to me:
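First, your regex is close, but note that the square brackets turn it into a character class, so [\\+\\w\\+] matches a single character that is either a + or a word character rather than the sequence +, letter, +. If you're ever worried about how your regex will perform, you can use https://regexr.com/ to check it out. Just put your regex on the top and enter your string in the bottom to see whether it is matching correctly.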
Second, upon close inspection, I see you're using the split function. While it is convenient for quickly splitting strings, you need to be careful as to what you are splitting on. In this case, you're removing all of the strings that you were initially looking at, which would make it impossible to find. If you print it out, you would notice that the following shows (for an input string of +d+=3=+s+):
+
+=3=+
+
This shows that you accidentally cut out what you were looking for in the first place. Now, there are several ways of fixing this, depending on what your criteria are.
Now, if what you wanted was just to separate on all +s, and it doesn't matter whether you find only what is directly bounded by +s, then split works well. Because + is a regex metacharacter it has to be escaped, so do str.split("\\+"), and this will return an array containing the following (for +d+=3=+s+), preceded by one empty string because the input starts with a +:
d
=3=
s
However, you can see that this poses a few problems. First, it doesn't strip out the =3= that we don't want, and second, it does not truly give us values that are surrounded by a +_+ format, where the underscore represents the string/char you're looking for.
Seeing as you're using \\w, you intend to find word characters that are surrounded by +s. Note that \\w also matches digits and underscores, so if you're only looking for letters, I would suggest using a class like [a-z] or [a-zA-Z] to be more specific. You can also add a * (0 or more) or a + (1 or more) after the class to dictate how many characters you're looking for.
I won't give you the answer outright, but I'll give you a clue as to what to move towards. Try using a pattern and a matcher to find the regex that you listed above and then if you find a match, make sure to store it somewhere :)
Also, for future reference, you should always start a function name with a lower case, at least in Java. Only constants and class names should start in a capital :)
I am trying to use split to count the number of letters in that string. The array length should be two, but I'm getting three.
The regex passed to split is used as the delimiter, and the delimiters themselves do not appear in the result. In your case "str.split("[a-z]")" uses each lowercase letter as a delimiter, which cuts your input into three substrings: "(+)|d|(+=3=+)|s|(+)".
If you really want to count the letters using "split", split on runs of non-letters instead with 'str.split("[^a-z]+")', but be aware that the result still starts with an empty string because the input begins with a delimiter, so empty entries have to be skipped when counting. I would recommend using "java.util.regex.Matcher.find()" to find all the letters instead.
Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
Similarly, check the functions in "java.util.regex.Matcher".
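As a rough illustration of the Matcher-based approach suggested above, here is a minimal sketch. It assumes that "letter" means a lowercase a-z character and that the +b+ check only needs to find one such occurrence somewhere in the input; the class and method names (LetterCounter, countLetters, containsPlusLetterPlus) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterCounter {
    // Counts lowercase letters by walking over every match of [a-z].
    static int countLetters(String s) {
        Matcher matcher = Pattern.compile("[a-z]").matcher(s);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // True when the input contains a single letter with a '+' on each side.
    static boolean containsPlusLetterPlus(String s) {
        return Pattern.compile("\\+[a-z]\\+").matcher(s).find();
    }
}
```

For the input +d+=3=+s+ from the question, countLetters finds the two letters d and s, with no empty strings to filter out.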

Regex to match a fixed sub string in a String

I am trying to write a regular expression to verify the presence of a specific number in a fixed position in a String.
String: 109300300330066611111111100000000017000656052086116020170111Name 1
Number to find: 111111111 (Starting from position 17)
I have written the following regular expression:
^.{16}(?<Ones>111111111)(.*)
My understanding is:
Let first 16 characters be whatever they are
Use the Named Capturing Group to grab the specific word
Let the rest of the characters be whatever they are
I am new to regex, is there any issue with the above approach?
Can it be done in other/better way?
I am using Java 8.
Without more details of why you're doing what you're doing, there's just one possible improvement I can see. You repeated any character 16 times at the beginning of the string rather than writing out 16 .s, which is nice and readable, but then, it would be nice to do the same for the repeated 1s:
^.{16}(?<Ones>1{9})(.*)
Otherwise, the string of 1s is hard to understand without the coder manually counting how many there are in the regex.
If you want to hard-code the ones, you know the starting position, and you just want to know whether they are there, using a regex seems unnecessary. You can use this:
String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
if (s.indexOf("111111111", 16) == 16) doSomething();
Another possible solution without regex:
if (s.substring(16, 25).equals("111111111")) doSomething();
Otherwise your regex looks good.
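For comparison, the regex and non-regex checks discussed in this thread can be put side by side. This is only a sketch: the class and method names (FixedPositionCheck, hasOnesAt16, onesViaRegex) are invented for illustration, and it uses String.startsWith(prefix, offset), which was not mentioned in the answers but avoids an exception when the input is shorter than 25 characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedPositionCheck {
    private static final Pattern ONES = Pattern.compile("^.{16}(?<Ones>1{9})(.*)");

    // Plain String check: are the nine ones at index 16 (the 17th character)?
    static boolean hasOnesAt16(String s) {
        return s.startsWith("111111111", 16);
    }

    // Regex check: returns the captured group, or null when there is no match.
    static String onesViaRegex(String s) {
        Matcher matcher = ONES.matcher(s);
        return matcher.find() ? matcher.group("Ones") : null;
    }
}
```

The startsWith variant simply returns false for short input, whereas substring(16, 25) would throw a StringIndexOutOfBoundsException.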

Java Splitting a string with multiple delimiters, some of which are 2-character sequences

long-time reader here but first-time poster! I am working on a college project that involves using Java to manipulate transcriptions of traditional music melodies written in the text-based abc notation standard (see here for a quick explainer on the abc standard, if you are interested).
I want to take the body of a whole tune transcription which is represented as a String, and split it into individual bars (i.e. into an array of Strings, one String for each bar). The abc standard has a number of different symbols and combinations of symbols that are used to delimit bars. These symbols are:
|
|]
||
[|
|:
:|
::
My idea was to use a regular expression with the String.split() method to break the tuneBody String below into the arrayOfBars array of Strings. My regex is below, and is intended to try to find any of the above symbols that can be used to delimit a bar in the music.
import java.util.Arrays;
public class TroubleshootRegex
{
//Split the tuneBody into individual bars
public static void main(String[] args)
{
//The musical notes from an abc tune transcription
String tuneBody = "|:G3 GAB|A3 ABd|edd gdd|edB dBA|\nGAG GAB|ABA ABd|edd gdd|BAF G3:|\nB2B d2d|ege dBA|B2B dBG|ABA AGA|\nBAB d^cd|ege dBd|gfg aga|bgg g3:|";
//The body of the tune after being split into individual bars
String[] arrayOfBars;
//This regex is my attempt to look for all the possible bar delimiters defined in the abc standard
String abcBarDelimiters = "[\\|]|\\|\\||\\[\\||\\|:|:\\||::|\\|]";
arrayOfBars = tuneBody.split(abcBarDelimiters);
System.out.println(Arrays.toString(arrayOfBars));
}
}
Unfortunately, when I run the above, I end up with a couple of issues. One of the issues is that I get an empty string at the start of the array, but a bit of research shows me that that's a known issue so I'll figure out a way to work around that. The bigger issue though that I can't seem to figure out on my own is that I end up with a colon included in the first bar of the music, whereas this should be filtered out as part of the initial delimiter when splitting the string if everything worked as intended. i.e. I want the initial "|:" delimiter from tuneBody to be removed during the string splitting. Here's the output:
[, :G3 GAB, A3 ABd, edd gdd, edB dBA,
GAG GAB, ABA ABd, edd gdd, BAF G3,
B2B d2d, ege dBA, B2B dBG, ABA AGA,
BAB d^cd, ege dBd, gfg aga, bgg g3]
I'm assuming that means that I probably have some kind of problem in my regex, but for the life of me I can't seem to figure out what the actual problem is, and I'm starting to go cross-eyed looking at it! It seems that it is matching the single pipe character at the start as a delimiter, rather than matching the character sequence |:
I'd be massively grateful if anyone who actually knows a bit about regexes can tell me why mine doesn't seem to do what I want, or how to get it to see the |: sequence as a whole as a delimiter, rather than a delimiter followed by a colon.
Thanks in advance!
One of the issues is that I get an empty string at the start of the array, but a bit of research shows me that that's a known issue so I'll figure out a way to work around that.
The problem is that your string starts with a delimiter so it will create an empty string as the first element of the split. The same would happen if you have two consecutive delimiters, e.g. ...|::|.... To solve that you could remove the empty strings you don't want, e.g. by using a list instead of an array.
The bigger issue though that I can't seem to figure out on my own is that I end up with a colon included in the first bar of the music, whereas this should be filtered out as part of the initial delimiter when splitting the string if everything worked as intended. i.e. I want the initial "|:" delimiter from tuneBody to be removed during the string splitting.
The problem is that the single pipe is the first alternative in your regex. Alternatives are tried from left to right and the first one that matches wins, so the lone pipe matches the | in |: and the colon is left behind. To fix that, it should be sufficient to put the single pipe at the end.
You can also simplify your regex since you don't need character classes. Thus this should work:
String abcBarDelimiters = "\\|\\||\\[\\||\\|:|:\\||::|\\|\\]|\\|";
To make the regex easier on a beginner's eyes, try the following:
public static void main(String[] args) {
//The musical notes from an abc tune transcription
String tuneBody = "|:G3 GAB|A3 ABd|edd gdd|edB dBA|\nGAG GAB|ABA ABd|edd gdd|BAF G3:|\nB2B d2d|ege dBA|B2B dBG|ABA AGA|\nBAB d^cd|ege dBd|gfg aga|bgg g3:|";
//The body of the tune after being split into individual bars
String re1 = "\\|[\\]\\||:]?"; // |, |], ||, |:
String re2 = "\\[\\|"; // [|
String re3 = ":[\\|:]"; // :|, ::
String abcBarDelimiters = "(" + re1 + "|" + re2 + "|" + re3 + ")";
String[] arrayOfBars = tuneBody.split(abcBarDelimiters);
System.out.println(Arrays.toString(arrayOfBars));
}
... and as Thomas already said, the empty string at the beginning is due to the input starting with a delimiter.
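Putting the two points from this thread together (longer delimiters before the single pipe, and dropping the empty strings), a minimal sketch might look like the following. The class and method names (AbcBarSplitter, splitBars) are hypothetical, and trimming each bar to remove the line breaks is an assumption about what is wanted, not something the question asked for.

```java
import java.util.ArrayList;
import java.util.List;

class AbcBarSplitter {
    // Two-character delimiters come first so that "|:" is consumed as a whole;
    // the single pipe is the last alternative.
    private static final String DELIMITERS = "\\|\\||\\[\\||\\|:|:\\||::|\\|\\]|\\|";

    static List<String> splitBars(String tuneBody) {
        List<String> bars = new ArrayList<>();
        for (String piece : tuneBody.split(DELIMITERS)) {
            String bar = piece.trim(); // assumption: surrounding newlines are not wanted
            if (!bar.isEmpty()) {      // skip the empty string caused by a leading delimiter
                bars.add(bar);
            }
        }
        return bars;
    }
}
```

Using a List here avoids having to know in advance how many empty strings the split will produce.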

Replacing substrings in String

I am 16 and trying to learn Java. I have a paper that my uncle gave me that has things to do in Java. One of these things is to write and execute a program that will accept an extended message as a string such as
Each time she saw the painting, she was happy
and replace the word she with the word he.
Each time he saw the painting, he was happy.
This part is simple, but he wants me to be able to take any form of she and replace it with he (she to he, She to He, she? to he?, she. to he., she' to he' and so on). Can someone help me make a program to accomplish this?
I have this
public static void main(String[] args) {
Scanner keyboard = new Scanner(System.in);
System.out.println("Write Sentence");
String original = keyboard.nextLine();
String changeWord = "he";
String modified = original.replaceAll("she", changeWord);
System.out.println(modified);
}
If this isn't the right site to find answers like this, can you redirect me to a site that answers such questions?
The best way to do this is with regular expressions (regex). Regex allow you to match patterns or classes of words so you can deal with general cases. Consider the cases you have already listed:
(she to he, She to He, she? to he?, she. to he., she' to he' and so on)
What is common between these cases? Can you think of some general rule(s) that would apply to all such transformations?
But also consider some cases you haven't listed: for example, as you've written it now, your code will change the word "ashes" to "ahes" because "ashes" contains "she". A properly written regular expression allows you to avoid this.
Before delving into regex, try and express, in plain English, a rule or set of rules for what you want to replace and what it should be replaced with.
Then, learn some regex and attempt to apply those rules.
Lastly, try and write some tests (i.e. using JUnit) for various cases so you can see which cases your code is working for and which cases it isn't working for.
Once you have done this, if something still doesn't work, feel free to post a new question here showing us your code and explaining what doesn't work. We'll be happy to help.
I would recommend the following regular expressions to solve this. It seems you have to search and replace the uppercase S and the lowercase s separately:
String modified = original
.replaceAll("(she)(\\W)", "he$2")
.replaceAll("(She)(\\W)", "He$2");
Explanation :
The pattern (she) will match the word she and store it as the first captured group of characters
The pattern (\\W) will match one non-word character, i.e. anything other than a letter, digit or underscore (e.g. ', .), and store it as the second captured group of characters
Both of these patterns must match consecutive parts of the input string for replaceAll to replace something.
"he$2" puts in the resulting string the word he followed by the second captured group of characters (in our case the group has only one character)
The above means that the regular expression will match a pattern like She'll and replace it with He'll, but it will not match a pattern like Sherlock because here She is followed by an alphabetic character r. Note that because (\\W) has to consume a character, a she at the very end of the input is not replaced, and nothing stops the she inside a longer word such as ashe. from being matched.
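A variation on the same idea, sketched here under the assumption that only the two spellings she and She need handling, uses the word-boundary anchor \b on both sides instead of capturing a following character. The class and method names (PronounReplacer, sheToHe) are made up for this example.

```java
class PronounReplacer {
    // \b matches the position between a word character and a non-word character
    // (or the start/end of the input), so nothing extra has to be captured.
    static String sheToHe(String text) {
        return text
                .replaceAll("\\bshe\\b", "he")
                .replaceAll("\\bShe\\b", "He");
    }
}
```

Because \b is zero-width, this also handles a she at the very end of the sentence and leaves words such as ashes or Sherlock untouched.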

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why this happens and how I can optimize the regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
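As for why it is crashing, it's the second * around the group. The contents of the group can already match any run of these characters (or nothing at all), so when the overall match fails the engine tries an exponential number of ways to divide the same text between repetitions of the group, which is known as catastrophic backtracking. I'm not sure why you need that, since it sounds like you are asking for: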
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it no longer spins forever, but I'd still go for the more specific regex using the look-behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$
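To show how the look-behind version could be used for the stated requirement (replacing whatever is on the right-hand side of the =), here is a minimal sketch. The class and method names (ValueReplacer, replaceValue) are hypothetical, and it assumes each input is a single key=value line.

```java
import java.util.regex.Matcher;

class ValueReplacer {
    // Replaces the text after the last '=' with newValue.
    // The pattern has no nested quantifiers, so it cannot backtrack exponentially.
    static String replaceValue(String line, String newValue) {
        return line.replaceFirst(
                "(?<==)[\\s\\w.-]*$",
                Matcher.quoteReplacement(newValue)); // keep '$' and '\' literal
    }
}
```

A line without an = is returned unchanged, because the look-behind never succeeds.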
