Using regex to remove quote - java

I found a good example, but I cannot adapt it to my problem.
I would like to remove only the quotes (") that enclose a field in a CSV line such as:
" kkl ";"aa bb D";;12 "AA";;"SSS"-;" gg 12";" vv";"sdqs ";
Expected result:
kkl ;aa bb D;;12 "AA";;"SSS"-; gg 12; vv;sdqs ;
I use the Pattern and Matcher classes.

This solution assumes that there are no escaped quotes (\") inside the quoted string:
.replaceAll("(?<=^|;)\"([^\"]*?)\"(?=;|$)", "$1")
I assume that you also want to strip the " in cases such as "sdfkjhksdf" and ;;;"dffff".
Another solution uses a possessive quantifier, whose effect relies on the assumption that " does not appear inside the quoted portion:
.replaceAll("(?<=^|;)(?:\"(.*?)\"){1}+(?=;|$)", "$1")

A small modification to @nhahtdh's regex, to keep it from greedily matching across a CSV field boundary:
.replaceAll("(?<=^|;)\"([^;]*)\"(?=;|$)", "$1");
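As a rough check of the patterns above, the following sketch applies the first answer's regex and the [^;]* variant to the sample line from the question. The class and method names are made up for illustration, and the input is assumed to contain no escaped quotes.

```java
final class CsvQuoteStripper {
    // First answer: the quoted part may not contain a quote itself.
    static String stripNoInnerQuotes(String line) {
        return line.replaceAll("(?<=^|;)\"([^\"]*?)\"(?=;|$)", "$1");
    }

    // Variant: the quoted part may contain quotes, but never the ; separator.
    static String stripNoInnerSeparator(String line) {
        return line.replaceAll("(?<=^|;)\"([^;]*)\"(?=;|$)", "$1");
    }

    public static void main(String[] args) {
        String line = "\" kkl \";\"aa bb D\";;12 \"AA\";;\"SSS\"-;\" gg 12\";\" vv\";\"sdqs \";";
        // Both variants leave 12 "AA" and "SSS"- untouched on this line.
        System.out.println(stripNoInnerQuotes(line));
        System.out.println(stripNoInnerSeparator(line));
        // A field with a quote inside it is only unwrapped by the second variant.
        System.out.println(stripNoInnerQuotes("\"a\"b\";x"));
        System.out.println(stripNoInnerSeparator("\"a\"b\";x"));
    }
}
```

The two variants agree on the sample line and differ only for fields that contain a quote between the enclosing quotes.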

Regular expression for UK postcode also matches UUID

I am having problems with the following UK postcode regex:
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})
It works for UK postcodes as intended, e.g.
AB11AB
However, it also seems to match UUIDs, e.g.
c25d4f64-2336-4a5d-b94c-14dc12xxxa58
Is there any way to make the regular expression ignore UUIDs?
Please find an example here:
https://regex101.com/r/dI6gD9/19
Option 1
We could simply add start and end anchors so that UUIDs fail to match, and change the capturing groups to non-capturing ones, if that is acceptable:
^(?:[Gg][Ii][Rr]\s+0[Aa]{2})|(?:(?:([A-Za-z][0-9]{1,2})|(?:(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(?:(?:[A-Za-z][0-9][A-Za-z])|(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s*[0-9][A-Za-z]{2})$
The expression can most likely be simplified (for example, by dropping redundant non-capturing groups). I have also allowed for extra whitespace, just in case.
Option 2
Another option is to add word boundaries, which I am guessing would make it very unlikely for the pattern to match a UUID in our data. We can also add the i flag:
(?i)(?:\bgir\b\s+\b0a{2}\b)|\b(?:[a-z][0-9]{1,2}|[a-z][a-hj-y][0-9]{1,2}|[a-z][0-9][a-z]|[a-z][a-hj-y][0-9][a-z]?)\s*[0-9][a-z]{2}\b
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PostcodeTest {
    public static void main(String[] args) {
        final String regex = "^(?:[Gg][Ii][Rr]\\s+0[Aa]{2})|(?:(?:([A-Za-z][0-9]{1,2})|(?:(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(?:(?:[A-Za-z][0-9][A-Za-z])|(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\\s*[0-9][A-Za-z]{2})$";
        // One UUID and one postcode, each on its own line
        final String string = "c25d4f64-2336-4a5d-b94c-14dc12xxxa58\n"
                + "AB11AB";
        // MULTILINE makes ^ and $ match at the start and end of every line
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.
Your regex is fine; you just need to anchor it to the start and end of the string. Because alternation binds more loosely than the anchors, wrap the whole pattern in a group, then put a ^ before the group and a $ after it.
^(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2}))$
https://regex101.com/r/jwLqLx/1
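One detail worth checking when anchoring a pattern that has a top-level alternation: in ^A|B$ the ^ belongs only to A and the $ only to B. The sketch below compares an unanchored, an ungrouped, and a grouped version of the postcode regex from the question. The class and constant names are invented for this illustration; in Java, matches() is another way to test the whole input.

```java
import java.util.regex.Pattern;

final class PostcodeAnchors {
    // Postcode pattern from the question, unchanged.
    static final String POSTCODE =
        "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
        + "|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\\s?[0-9][A-Za-z]{2})";

    // No anchors: find() accepts a postcode-shaped substring anywhere.
    static final Pattern UNANCHORED = Pattern.compile(POSTCODE);
    // ^ applies only to the first alternative and $ only to the second.
    static final Pattern UNGROUPED = Pattern.compile("^" + POSTCODE + "$");
    // The non-capturing group makes both anchors apply to every alternative.
    static final Pattern GROUPED = Pattern.compile("^(?:" + POSTCODE + ")$");

    public static void main(String[] args) {
        String uuid = "c25d4f64-2336-4a5d-b94c-14dc12xxxa58";
        System.out.println(UNANCHORED.matcher(uuid).find());       // true: a substring looks like a postcode
        System.out.println(UNGROUPED.matcher("xxAB11AB").find());  // true: the second alternative is not tied to ^
        System.out.println(GROUPED.matcher("xxAB11AB").find());    // false
        System.out.println(GROUPED.matcher("AB11AB").find());      // true
        // matches() always tests the whole input, with or without anchors.
        System.out.println(Pattern.matches(POSTCODE, uuid));       // false
    }
}
```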
You are using the correct regex, the one issued by the UK government.
Below I have added examples of how to use it:
Match full string:
When testing a full string, don't use the global flag, because the regex will then find occurrences within the string rather than test whether the whole string matches.
So don't use the global and multi-line flags.
Notice the gm part in
/your_regex/gm
Try it in this example on regex101.com, where I have already disabled the global and multi-line flags for you.
Match in log file:
For log files, add word boundaries around your regex.
Notice the \b parts in
/\byour_regex\b/gm
Try it in this example which shows this behaviour in an example log file.
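In Java, the log-file case could look roughly like the sketch below. It uses a case-insensitive form of the same postcode alternatives wrapped in \b...\b and collects every hit with find(). The class name, method name, and sample log lines are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogPostcodeFinder {
    // Case-insensitive form of the postcode alternatives, wrapped in word boundaries.
    private static final Pattern POSTCODE_IN_TEXT = Pattern.compile(
        "\\b(?:gir 0a{2}|(?:[a-z][0-9]{1,2}|[a-z][a-hj-y][0-9]{1,2}"
        + "|[a-z][0-9][a-z]|[a-z][a-hj-y][0-9][a-z]?)\\s?[0-9][a-z]{2})\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns every postcode-shaped token found in the given text, in order.
    static List<String> findPostcodes(String log) {
        List<String> found = new ArrayList<>();
        Matcher m = POSTCODE_IN_TEXT.matcher(log);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String log = "user c25d4f64-2336-4a5d-b94c-14dc12xxxa58 moved to AB11AB today";
        System.out.println(findPostcodes(log)); // [AB11AB]
    }
}
```

The hyphens in the UUID are non-word characters, so each UUID segment has to match as a whole, and none of the segments in this sample has the shape of a postcode.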

Java Regex complex ID expression filtering

I am using Java to implement PDF to plain text conversion. Right now I am facing the problem of filtering out ID expressions from the String representation of the text.
The idea is to capture IDs as whole words longer than 4 characters and remove them. IDs must consist of both letters and digits, in any order. They can contain optional special symbols such as :.- and are generally all uppercase, except for a few cases where there is one and (for now) exactly one lowercase letter in them. IDs can appear anywhere in a sentence, and there are multiple sentences in the String. I am also trying to capture the preceding space (if there is one) so that no double space remains after I remove the ID. It is acceptable to split the expression into several pieces if it gets too complex.
I've created a small test snippet to show exactly what does and does not need to be caught by the regular expression, and to show my progress so far. I am using the standard java.util.regex package.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[a-z]{0,1}[\\d\\.]*";
//"[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[[a-z]{0,1}[\\d\\.]+]*" //for comma removal
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("*");
System.out.println(testString);
It may be necessary to remove IDs together with their commas, so it would be great if the revised expression was capable of capturing commas or omitting them via minor alterations like the alternative regex I've provided.
My current solution filters out everything that needs to be filtered but also most of the things it shouldn't. It appears the rule that there must be at least one capital letter and one digit in the word isn't working, possibly because I need to use Lookahead/Lookbehind/Grouping, sadly none of which I managed to get to work properly. I also suspect the use of [] is completely incorrect in my example, but this is the only way I managed to get it to (mostly) work for now. Please help me.
My colleague and I were able to solve this issue in an elegant way. Below is a snippet from my current solution. I hope one day this proves useful to someone.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "(?i)(?=[\\dA-Z\\(\\)\\.:-]*\\d)(?=[\\dA-Z\\(\\)\\.:-]*[A-Z])[\\dA-Z\\(\\)\\.:-]{5,}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("");
System.out.println(testString);
//Clean-up extra spaces and unneeded commas
//testString = testString.replaceAll("\\s{2,}", " ").replaceAll("(\\s\\.)|(\\s\\,)", "");
testString = testString.replaceAll("[ ]{2,}", " ").replaceAll("([ ]\\.)|([ ]\\,)", "");
System.out.println(testString);
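To make the two lookaheads in that solution easier to follow, here is a small sketch that packages the same regex in a helper and runs it on individual tokens. The class, constant, and method names are hypothetical; the pattern itself is the one from the snippet above.

```java
import java.util.regex.Pattern;

final class IdFilter {
    // Characters that may appear in an ID (letters match in either case because of (?i)).
    private static final String ID_CHARS = "[\\dA-Z\\(\\)\\.:-]";

    // First lookahead: the run contains a digit. Second lookahead: it contains a letter.
    // Only then are 5 or more ID characters consumed.
    private static final Pattern ID = Pattern.compile(
        "(?i)(?=" + ID_CHARS + "*\\d)(?=" + ID_CHARS + "*[A-Z])" + ID_CHARS + "{5,}");

    static String stripIds(String text) {
        return ID.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println("[" + stripIds("(L1:3.CI)") + "]");      // [] : letters, digits, 9 characters
        System.out.println("[" + stripIds("K-12") + "]");           // [K-12] : only 4 characters
        System.out.println("[" + stripIds("781-338-3000") + "]");   // [781-338-3000] : no letter
        System.out.println("[" + stripIds("(DLCS)") + "]");         // [(DLCS)] : no digit
    }
}
```

One thing to keep in mind: because of the (?i) flag, an ordinary lowercase word with digits attached, such as hello123, is treated as an ID as well.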

RegEx pattern with unusual unicode character and word boundaries

I'm stuck on a problem with regex patterns and I hope somebody can explain it to me:
The task is to match object names and remove them from a description that is stored in one of the object's fields. I tried the following expression:
final String description= object.getDescrition();
final Matcher descriptionMatcher=
Pattern.compile("\\b" + object.getName() + "\\b", Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE)
.matcher(description);
Everything works fine until the code encounters a "registered trademark" symbol appended to the name: String name = ObjectName®
If I remove the last word boundary, the name is matched again. What is the reason for this behaviour, and how can I improve this code so that it handles every such special case?
Note: the trademark sign is not separated from the object name by a space.
In this case, change your pattern to:
"\\b\\Q" + object.getName() + "\\E(?<=\\b|®)"
If you need to deal with more complex cases, use alternations in lookarounds instead of word boundaries. Example:
"(?<=\\s|^)\\Q" + object.getName() + "\\E(?=\\s|$)"
or
"(?<=\\s|^)" + Pattern.quote(object.getName()) + "(?=\\s|$)"
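The ® character is not considered a word character, so there is no word boundary between it and a following space or punctuation mark, and the trailing \b in your Pattern cannot match.
A quick and dirty solution, if this is the only special case and the name itself is passed without the ®, would be to accept either a word boundary or the symbol after the name:
Pattern.compile("\\b" + object.getName() + "(?:\\b|®)")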
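A minimal sketch of the lookaround approach, assuming names are delimited by whitespace or by the start or end of the description. NameRemover and its methods are illustrative names, not part of the original code, and the ® is written as \u00AE to avoid source-encoding issues.

```java
import java.util.regex.Pattern;

final class NameRemover {
    // Removes the name wherever it is surrounded by whitespace or the string edges.
    static String removeName(String description, String name) {
        Pattern p = Pattern.compile(
            "(?<=\\s|^)" + Pattern.quote(name) + "(?=\\s|$)",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(description).replaceAll("");
    }

    // The original approach, for comparison: \b on both sides of the quoted name.
    static boolean wordBoundaryFinds(String description, String name) {
        return Pattern.compile("\\b" + Pattern.quote(name) + "\\b")
                .matcher(description).find();
    }

    public static void main(String[] args) {
        String text = "Buy ObjectName\u00AE today";
        System.out.println(wordBoundaryFinds(text, "ObjectName\u00AE")); // false: no boundary between the symbol and the space
        System.out.println(removeName(text, "ObjectName\u00AE"));        // "Buy  today"
    }
}
```

A name that is followed directly by punctuation, such as a comma, would not be matched by this whitespace-based version; the lookahead would need to allow those characters too.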

Regex replace word while preserving spaces/punctuation

I am trying to go through a document and change all instances of a name using regular expressions in Java. My code looks something like this:
Pattern replaceWordPattern = Pattern.compile("(^|\\s)" + replaceWord + "^|\\W");
followed by:
String line = matcher.replaceAll("Alice");
The problem is that this does not preserve the spaces or punctuation or other non-word characters that followed. If I had "Jack jumped" it becomes "Alicejumped". Does anyone know a way to fix this?
\W consumes the space after the replaceWord. Replace ^|\\W with the word boundary \\b, which does not consume any characters. Consider doing the same for the first delimiter group, as I suspect you do not want to consume anything there either.
Pattern replaceWordPattern = Pattern.compile("\\b" + replaceWord + "\\b");
If the semantics of word boundaries do not suit you, consider using lookahead and lookbehind constructs, which do not consume input either.
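Both suggestions can be sketched as follows. The helper names are made up for this example; Pattern.quote and Matcher.quoteReplacement are added on the assumption that the word and its replacement should be treated as plain text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordReplacer {
    // Word-boundary version: \b matches a position and consumes no characters.
    static String replaceWord(String text, String word, String replacement) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Lookaround version: the word must not touch a word character on either side.
    static String replaceWithLookarounds(String text, String word, String replacement) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(word) + "(?!\\w)");
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("Jack jumped", "Jack", "Alice"));                     // Alice jumped
        System.out.println(replaceWord("Jack's hat, Jackson, Jack.", "Jack", "Alice"));      // Alice's hat, Jackson, Alice.
        // \b needs a word character next to it, so it fails for words that end in symbols.
        System.out.println(replaceWord("I like C++ a lot", "C++", "Java"));                  // I like C++ a lot
        System.out.println(replaceWithLookarounds("I like C++ a lot", "C++", "Java"));       // I like Java a lot
    }
}
```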
You're missing brackets on the second non-whitespace character expression:
Pattern replaceWordPattern = Pattern.compile("[^|\\s]" + replaceWord + "[^|\\W]");

regular expression for key=(value) syntax

I am currently writing a Java program that uses regular expressions, but I am struggling as I am fairly new to regex.
KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!##\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]/ ]*";
CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)";
The target syntax is key(value)+key(value)+key(value). The key is alphanumeric and the value may be any combination of characters.
This has worked so far. However, I have a problem with '(' and ')' in the value. If I place '(' or ')' in the value, the value includes everything that follows.
e.g. number(abc(kk)123)+status(open) returns key:number, value:abc(kk)123)+status(open
It is supposed to be two key-value pairs.
Can you suggest how to improve the expression above?
Not possible with regular expressions at all, sorry. If you want to count opening and closing parentheses, regular expressions are, in general, not good enough. The language you are trying to parse is not a regular language.
Of course, there may be ways around that limitation. We cannot know that if you give us as little context as you did.
Get the matched groups at index 1 and 2:
([a-zA-Z0-9]+)\((.*?)\)(?=\+|$)
Here is an online demo
The above regex pattern looks for )+ as the delimiter between key(value) pairs.
Note: The above regex pattern will not work if the value contains )+, for example number(abc(kk)+123+4+4)+status(open)
Sample code:
String str = "number(abc(kk)123)+status(open)";
Pattern p = Pattern.compile("([a-zA-Z0-9]+)\\((.*?)\\)(?=\\+|$)");
Matcher m = p.matcher(str);
while (m.find()) {
System.out.println(m.group(1) + ":" + m.group(2));
}
output:
number:abc(kk)123
status:open
Someone posted an answer with a working regex: ([a-zA-z0-9]+)\((.*?)\)(?=\+|$). This works great. When I came back after testing it on an online regex tester site, the post was gone. Is it the right solution? I am wondering why the answer was deleted.
See this golfed regex:
([^\W_]+)\((.*?)\)(?![^+])
You can use the shorthand character class [^\W_] instead of [a-zA-Z0-9].
You can use a negative lookahead assertion (?![^+]) to match without backtracking.
However, this is not a practical solution, as )+ within inner elements will break it: number(abc(kk)+5+123+4+4)+status(open)
This is a case where Java, whose regex implementation doesn't support recursion, is at a disadvantage. As I mentioned in this thread, the practical approach would be to use a workaround (copy-paste regex), or build your own finite state machine to parse it.
Also, you have a typographical error in your original regex: [a-zA-z0-9]+ contains the range "A-z". You meant to type "A-Z".
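As an illustration of the hand-written parser idea, here is a minimal sketch that scans the input once and tracks the parenthesis depth. It assumes that parentheses inside values are balanced and that chunks are separated by a single +; KeyValueParser and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeyValueParser {
    // Parses key(value)+key(value)... into an ordered map; later duplicates overwrite earlier keys.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        while (i < input.length()) {
            int open = input.indexOf('(', i);
            if (open < 0) {
                throw new IllegalArgumentException("missing '(' after position " + i);
            }
            String key = input.substring(i, open);
            if (!key.matches("[a-zA-Z0-9]+")) {
                throw new IllegalArgumentException("invalid key: " + key);
            }
            // Walk forward until the parenthesis opened at 'open' is closed again.
            int depth = 1;
            int j = open + 1;
            while (j < input.length() && depth > 0) {
                char c = input.charAt(j);
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                j++;
            }
            if (depth != 0) {
                throw new IllegalArgumentException("unbalanced parentheses in value of " + key);
            }
            result.put(key, input.substring(open + 1, j - 1));
            i = j;
            // Skip the + that separates this chunk from the next one.
            if (i < input.length()) {
                if (input.charAt(i) != '+') {
                    throw new IllegalArgumentException("expected '+' at position " + i);
                }
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("number(abc(kk)+5+123+4+4)+status(open)"));
        // {number=abc(kk)+5+123+4+4, status=open}
    }
}
```

Because the depth counter decides where a value ends, a )+ inside nested parentheses no longer splits the value.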
I'll make a small assumption: that you are able to append a + to the end of your input,
i.e. number(abc(kk)123)+status(open)+
If that is possible, you can make it work with:
KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!##\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]\\(\\)/ ]*?";
CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)\\+";
The change on line 2 adds the escaped ( and ) to the character class and replaces * with *?.
The ? turns off greedy matching so that the shortest possible match is kept (reluctant quantifier).
On line 3, an escaped + is added at the end of the expression so that the literal + separates the key(value) fields.
