I have data coming in a txt file delimited by pipes. The unfortunate thing is 2 fields can have multiple values. To separate these multiples, the sender used pipes again, but put quotes around it. My regex worked for months until a certain rare situation...
Regex currently:
([^\|]*)\|"?([^"]*)"?\|([^\|]*)\|"?([^"]*)"?
And it worked for the following situation which happens most of the time:
abc|"part1|part2"|abc|"tool1|tool2"
But this case is where the ([^"]*) jumps ahead and takes all from the blank to the end of the quotes:
abc||abc|"tool1|tool2"
So I realize I must account for when there is a pipe next instead of a quote.
I'm just not sure how.
P.S. For those PIG people that might be looking at this, I removed a backslash from each escape, to make it look more like Java, but in PIG you need 2, fyi.
In your expression you need to specify that the part between |s can be either quoted or not quoted. You can do it as follows:
(("[^"]*")|((?!")[^|]*))
Now you can repeat this part several times with |s in between, as you need.
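Here is a minimal Java sketch of that idea, assuming exactly four fields per line as in the question. The class name PipeFields and the method parse are made up for illustration, and the surrounding quotes are kept in the captured values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PipeFields {
    // One field: either a quoted run (which may contain pipes)
    // or an unquoted run that contains no pipe and does not start with a quote.
    private static final String FIELD = "(\"[^\"]*\"|(?!\")[^|]*)";

    // Assumption: a line always has exactly four fields.
    private static final Pattern LINE = Pattern.compile(
            FIELD + "\\|" + FIELD + "\\|" + FIELD + "\\|" + FIELD);

    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        String[] fields = new String[4];
        for (int i = 0; i < 4; i++) {
            fields[i] = m.group(i + 1); // quoted fields keep their quotes
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" / ", parse("abc|\"part1|part2\"|abc|\"tool1|tool2\"")));
        System.out.println(String.join(" / ", parse("abc||abc|\"tool1|tool2\"")));
    }
}
```

With this, the blank second field in abc||abc|"tool1|tool2" comes back as an empty string instead of swallowing the rest of the line.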
I have a CSV splitter with following regex for splitting a string with comma.
String[] splitData = splitCSV.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
It works so far for strings like 123, "foo", "bar", "no, split, here", but when it encounters an inch sign (") as in the following, it cannot split correctly.
"123, 1.0" xyz"
I need it to split into 123 and 1.0" xyz
Hope someone can provide a solution for this. Thank you.
A couple of points here:
You should be using an existing CSV processing library, not creating your own with a regex. There are many available for Java, see this question as a starting point. This is a solved problem; there's no reason to reinvent it.
The scenario you mention would be invalid* data. A quote should be escaped within a string, usually by using two quotes together. Having one unescaped quote makes the file invalid; and furthermore there is usually no reliable way to tell what the file "should" be once you have these sorts of errors. What to do about it:
If the file is within your control, correct it. Use a standard escape format for quotes within a string.
If the file is not within your control, you should handle errors separately rather than including this in your core processing. Either preprocess the file looking for errors, or use error handling available in a CSV library to do something with the lines that come back as having an incorrect format. If the errors are limited to a predictable issue that you know ahead of time, you might be able to correct them. But in most cases errors like this lead you to have to reject the lines.
*Technically there is no CSV standard, so anything goes. But this would be a data error in any reasonable format. And in the real world this almost always occurs because someone didn't think the file format through, not because they intentionally planned it this way.
What you have here is an unusual dialect of CSV.
Although there is no formalised standard for CSV, there are broadly two approaches to quotes:
Quotes are not special. That is: 7" single, 12" album is two items: 7" single and 12" album. In this dialect, items containing , are problematic.
Quotes are special. That is: "you, me","me, you" is two items: you, me and me, you. In this dialect, you can put quotes around an entry in order to have a , within an item. However, it makes items containing " problematic, as you have found.
The typical answer to the " problem in the second approach, is to escape quotes. So the item 7" single would appear in the CSV as "7\" single". This of course means that \ becomes a problem, but that's easily solved the same way. AC\DC 7" single appears in the CSV as "AC\\DC 7\" single".
If you can adopt one of these conventional approaches, then do so. Then you can either use an existing CSV library, or roll your own. Although a regex can consume these formats, my opinion is that it's not the clearest way to write code to consume CSV: I've found that a more explicit state machine (e.g. a switch (state) statement) is nice and clear.
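To give an idea of what I mean by an explicit state machine, here is a rough Java sketch for the second dialect with backslash escapes. The class name, the state names and the error handling are my own choices for illustration, not taken from any library, and it assumes one record per line with no whitespace around the commas.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvStateMachine {
    private enum State { START, UNQUOTED, QUOTED, ESCAPED, AFTER_QUOTE }

    static List<String> parseLine(String line) {
        List<String> items = new ArrayList<>();
        StringBuilder item = new StringBuilder();
        State state = State.START;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case START: // at the beginning of an item
                    if (c == '"') state = State.QUOTED;
                    else if (c == ',') items.add("");
                    else { item.append(c); state = State.UNQUOTED; }
                    break;
                case UNQUOTED: // inside an item without quotes
                    if (c == ',') { items.add(item.toString()); item.setLength(0); state = State.START; }
                    else item.append(c);
                    break;
                case QUOTED: // inside a quoted item; commas are ordinary here
                    if (c == '\\') state = State.ESCAPED;
                    else if (c == '"') state = State.AFTER_QUOTE;
                    else item.append(c);
                    break;
                case ESCAPED: // the character after a backslash is taken literally
                    item.append(c);
                    state = State.QUOTED;
                    break;
                case AFTER_QUOTE: // only a comma may follow a closing quote
                    if (c != ',') throw new IllegalArgumentException("Unexpected '" + c + "' at index " + i);
                    items.add(item.toString());
                    item.setLength(0);
                    state = State.START;
                    break;
            }
        }
        if (state == State.QUOTED || state == State.ESCAPED) {
            throw new IllegalArgumentException("Unterminated quoted item");
        }
        items.add(item.toString()); // the last item has no trailing comma
        return items;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("\"AC\\\\DC 7\\\" single\",\"you, me\",plain"));
    }
}
```

Each case reads as one rule of the format, which I find easier to review and extend than an equivalent regex.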
If you can't change your input format, the puzzle you have to solve is, when you encounter a ", is it a metacharacter (part of a pair of quotes surrounding an item) or is it a real character that's part of the item?
As owner of the format, it's up to you to decide what the rule is. Perhaps a " should only be considered a metacharacter if it's next to a ,. But even that causes problems if you allow a mixture of quoted and unquoted items:
"A Town Called Malice", The Jam, 7", £6.99
So, you must come up with your own rules, that work in your domain, and write explicit code to handle that situation. One approach is to pre-process the input into canonical CSV so that it's again suitable for a conventional CSV parser.
The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.
The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like:
(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)
The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!
In my code, I have tried a couple of ways to match regex:
if ( (regex findFirstIn log).nonEmpty ) { do something }
val allGroups = regex.findAllIn(log).matchData.toList
if (allGroups.nonEmpty) { do something }
if (regex.pattern.matcher(log).matches()){do something}
All three suffer from poor performance when the regex mentioned above is added to the list of regexes. Any suggestions to improve regex performance or change the regex itself?
The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.
I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.
The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group
(?>(.+?)#011)
Once the group has matched, the engine never backtracks into it. It is done and moves on to the next group.
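For the splitting route, a minimal sketch in Java (scala.util.matching.Regex delegates to java.util.regex, so the same call is available from Scala). The class and method names are placeholders of mine.

```java
import java.util.regex.Pattern;

final class SplitOnFlattenedTab {
    // The separator that rsyslog substituted for the tab character.
    private static final String SEPARATOR = "#011";

    static String[] fields(String log) {
        // Pattern.quote makes the separator a literal;
        // the limit of -1 keeps trailing empty fields instead of dropping them.
        return log.split(Pattern.quote(SEPARATOR), -1);
    }

    public static void main(String[] args) {
        for (String field : fields("first field#011second#0field#011")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

You can then check the length of the resulting array to decide whether the line has the expected number of columns.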
Negating a string
The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:
((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011
Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).
An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:
((?: (?!#011). )+)#011
It's all pretty complicated and most probably less performant than simply using the atomic group.
Optimizations
Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's room for improvement (by a factor of 1.8):
( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011
It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.
A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.
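In case it helps, this is how the optimized expression could be used from Java, with the readability blanks removed (I did not use the COMMENTS flag, since in that mode an unescaped # starts a comment). The class and method names are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnrolledFieldMatcher {
    // ( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011 without the blanks
    private static final Pattern FIELD =
            Pattern.compile("([^#]*+(?:#(?!011)[^#]*)*+)#011");

    // Collects every field that is terminated by the #011 separator.
    static List<String> terminatedFields(String log) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(log);
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(terminatedFields("a b#011c#0d#011tail"));
    }
}
```

Note that text after the last #011 is not reported, because the pattern requires the trailing separator.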
I'm trying to make sure that a string contains between 0 and 3 lines, and that for a given line that is present that it contains 0 to 100 characters. It would need to be a valid expression for JavaScript and Java. Like many people doing RegEx I'm copying from various spots on the Internet.
Working backwards I think ^.{0,100}$ gets me the "line contains 0 to 100 characters", but trying to group that as (^.{0,100}$){0,3} doesn't work.
The new line character is probably part of my problem, so I ended up with something like .{0,100}(?:\n.{0,100}){0,2} trying to say "a line of 0 to 100 characters optionally followed by 0 to 2 instances of a new line and 0 to 100 more characters", but that also failed.
Up until now I got those expressions from other people. Using an online test tool I finally monkeyed this together: ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ which appears to work.
So, my question is, am I missing any pitfalls in ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ given what I'm after? Furthermore, even if that does work is it the best expression to use?
I think what you have will work fine. You can make the line break part a little more compact if you want, and you don't need ^ and $ if you are using matches():
String regex = ".{0,100}(?:[\r\n]+.{0,100}){0,2}";
EDIT
After some more thought I realized the newline suggestion above will match 4 (or more) lines as long as a couple of them are empty. So, we are back to your suggested example. Oh well, at least the start and end anchors can be omitted.
String regex = ".{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}";
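A quick way to try it out on the Java side, as a sketch (the class and method names are made up for the example):

```java
import java.util.regex.Pattern;

final class LineLimit {
    // Up to three lines of up to 100 characters each;
    // a line break is \r\n, \r or \n.
    private static final Pattern UP_TO_THREE_LINES =
            Pattern.compile(".{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}");

    static boolean isValid(String text) {
        // matches() tests the whole input, so ^ and $ are not needed.
        return UP_TO_THREE_LINES.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("one\ntwo\nthree"));
        System.out.println(isValid("one\ntwo\nthree\nfour"));
        System.out.println(isValid("a\n\n\nb"));
    }
}
```

The last call is the case that broke the [\r\n]+ version: four lines, two of them empty.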
I'm not very good at regular expressions but would this work?
^.{0,100}\n?(.{0,100}\n)?.{0,100}?$
Again, I'm still new to regex, so if there is an error (which is likely), please tell me.
If the code we have looks like
for(...){
}
after reformatting I'd like it to look like
for(...)
{
}
as well for all functions, methods, classes etc.
I found something similar in another question on Stack Overflow, but it was a regular expression that had to be typed into the Vim command line every time. I am looking for something I can put in my vimrc file (if possible) so that it is available every time I open Vim.
Well this is the one I've found:
:%s/^(\s*).*\zs{\s*$/\r\1{/
in http://stackoverflow.com/questions/4463211/is-there-a-way-to-reformat-braces-automatically-with-vim but the problem is that it adds a new line even if the bracket is already in the right place, and I still don't know how to map it to a key combination.
(edited with a more accurate pattern)
This should do the trick:
nnoremap <F9> :%s/^\(\s*\).\+\zs{\ze\s*$/\r\1{<cr>
But it doesn't really sound "safe" to me.
Instead, you could do:
nnoremap <F9> :%s/^\(\s*\).\+\zs{\ze\s*$/\r\1{/c<cr>
which will ask for a confirmation for each match.
Or record a macro and play it back using :global.
edit
Your pattern, :%s/^(\s*).*\zs{\s*$/\r\1{/, is wrong because:
the capture parentheses are not properly escaped, (\s*) instead of \(\s*\)
.* matches any number of characters, including zero, which is why the substitution also fires on lines that contain only a {.
Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why this happens and how I can optimize the regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
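As for why it is crashing, it's the second * around the group: when a quantified group contains parts that are themselves quantified, the engine can split the same run of characters between repetitions in exponentially many ways, and it tries all of them before giving up on a starting position. I'm not sure why you need that, since it sounds like you are asking for: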
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
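For what it's worth, here is roughly how I'd use the look-behind version to replace the right-hand side. The class and method names are just for the example, and it assumes you only want to replace the value at the very end of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceRightHandSide {
    // Look-behind for '=', then the value up to the end of the input.
    private static final Pattern VALUE =
            Pattern.compile("(?<=\\=)([\\s\\w\\-.]*)$");

    static String replaceValue(String input, String newValue) {
        // quoteReplacement guards against '$' or '\' in the new value
        // being interpreted as group references.
        return VALUE.matcher(input).replaceFirst(Matcher.quoteReplacement(newValue));
    }

    public static void main(String[] args) {
        System.out.println(replaceValue("paginationInput.entriesPerPage=5", "10"));
    }
}
```

Because the character class is only quantified once, there is no nested repetition for the engine to get lost in.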
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$