Reading text from a file using regular expressions

I have a text file containing numbers and characters, broken into 3 columns, and I can't figure out what regular expression I need. The columns are separated by ; and after the third column the record ends and the next one starts on a new line. I know the majority of my code is working properly, and I've narrowed the problem down to this section of code.
I've tried looking up Java regular expressions, but I can't find anything that does what I'm trying to accomplish.
while ((line = br.readLine()) != null) {
// Searches the file that matches a specific value
if (!line.isEmpty() || line.matches("Need regular expression here that skips over the two columns and reads the last")) {
if (isValid(line)) {
System.out.println(line + "IS Valid");
} else {
System.out.println(line + "IS NOT VALID");
}
}
}
In the console after reading the file it should say
"12345";"12";"tacobell#yahoo.com"; IS valid
"123456";"31";"Taco . bell#yahoo.com"; IS NOT VALID
It must contain the whole line when writing out to the console not just the third column.

^[^;]*;[^;]*;([^ ]*);$
That will give you a match only if the third column contains no spaces (so it will match "12345";"12";"tacobell#yahoo.com";, but it will not match "123456";"31";"Taco . bell#yahoo.com";).
The parentheses are a capture group, so you can extract that column by grabbing group #1 (not group #0) from the capture results.
The ^ at the beginning means that this pattern has to start at the beginning of a line, and the $ at the end means that this pattern has to end at the end of a line. If that's not the case for your input, you will have to adjust it. For example, if you had trailing whitespace after the last column, you might do:
^[^;]*;[^;]*;([^ ]*);[ ]*$
If you had trailing whitespace and the last semicolon was optional, you'd do the following (the capture group now excludes ; as well as the space, because otherwise the greedy group would swallow the optional trailing semicolon):
^[^;]*;[^;]*;([^ ;]*);?[ ]*$
One last thing: I'm using [ ] to indicate whitespace, but that only includes the basic space character. It doesn't include tabs, newlines, or any other type of whitespace. It's better to use \s if you want to include all of those, but in Java string syntax you have to escape the backslash, so it would look like this:
Pattern.compile("^[^;]*;[^;]*;([^ ;]*);?\\s*$")
This is the reason why well-designed programming languages have a specialized regular expression syntax. It gets even crazier if you want to match a literal backslash:
Pattern.compile("\\\\")
In Javascript, this would just be:
/\\/
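Putting the pieces together, here is a minimal sketch of how the pattern could be used on the two sample lines from the question. It assumes every line ends with a semicolon, optionally followed by whitespace; the class name ThirdColumnCheck and the helper thirdColumn are made up for illustration and are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThirdColumnCheck {
    // Two columns of anything but ';', then a third column without spaces,
    // a closing ';' and optional trailing whitespace.
    private static final Pattern LINE =
            Pattern.compile("^[^;]*;[^;]*;([^ ]*);\\s*$");

    // Hypothetical helper: returns the third column, or null when the line does not match.
    static String thirdColumn(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "\"12345\";\"12\";\"tacobell#yahoo.com\";",
            "\"123456\";\"31\";\"Taco . bell#yahoo.com\";"
        };
        for (String line : lines) {
            // Print the whole line, not just the captured column.
            boolean valid = thirdColumn(line) != null;
            System.out.println(line + (valid ? " IS valid" : " IS NOT VALID"));
        }
    }
}
```

The whole line is still available for printing because the regex is only used to decide validity; group 1 is there if the third column is needed on its own.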

Using regular expression, how to remove matching sequence at the beginning and ending of the text but keeping what's in the middle?

My problem is very simple, but I can't figure out the correct regular expression to use.
I have the following variable (Java):
String text = "\033[1mYO\033[0m"; // this is ANSI for bold text in the Terminal
My goal is to remove the ANSI codes with a single regular expression (I just want to keep the plain text at the middle). I cannot modify the text in any way and those ANSI codes will always be at the same place (so one at the beginning, one at the end, though sometimes it's possible that there is none).
With this regular expression, I will remove them using replaceAll method :
String plainText = text.replaceAll(unknownRegex, "");
Any idea on what the unknown regex could be?

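Well, you can use a single regex that has the ANSI codes optionally at the beginning and end, captures anything in between, and replaces the entire string with the value of the group: text.replaceAll("^(?:\\\\\\d+\\[1m)?(.*?)(?:\\\\\\d+\\[0m)?$", "$1") (this might not capture every ANSI code; adjust if needed). Note that this pattern expects a literal backslash followed by digits in the text. In the Java literal "\033[1m" the \033 is an octal escape for the single ESC character, so for that string the escape character has to be matched instead (for example with \\033 in the pattern).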
Breaking the expression down (note that the example above escapes backslashes for Java strings so they are doubled):
^ is the start of the string
(?:\\\d+\[1m)? matches an optional \<at least 1 digit>[1m
(.*?) matches any text but as little as possible, and captures it into group 1
(?:\\\d+\[0m)? matches an optional \<at least 1 digit>[0m
$ is the end of the input
In the replacement $1 refers to the value of capturing group 1 which is (.*?) in the expression.

Found the answer thanks to a comment that has since disappeared.
Actually, I just need a group to capture what's in the middle of the string, and then use it ($1) to replace the whole thing:
String plainText = text.replaceAll("\\033\\[.*m(.+)\\033\\[.*m", "$1")
I'm not sure this will remove every ANSI code, but it is enough for what I want to do.
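As an alternative sketch (not the approach used above), each escape sequence can be removed individually instead of capturing the middle. This assumes the codes are plain SGR sequences of the form ESC [ digits/semicolons m, which covers the bold and reset codes in the question but not every ANSI sequence. The class and method names are made up for illustration.

```java
final class AnsiStrip {
    // \x1B is the ESC character; [0-9;]* covers parameters such as "1", "0" or "1;31".
    static String stripAnsi(String text) {
        return text.replaceAll("\\x1B\\[[0-9;]*m", "");
    }

    public static void main(String[] args) {
        String text = "\033[1mYO\033[0m"; // bold "YO", as in the question
        System.out.println(stripAnsi(text)); // prints YO
    }
}
```

Because the replacement simply does nothing when no sequence is present, this also handles input that has no ANSI codes at all.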

How do I remove delimiter restovers from a scanner? (Java)

I admit, not the best title.
I'm having the following problem. I need to use my scanner and parse every word (without the delimiters) to separate strings.
Example: Poker; Blackjack; LasVegas, NewYork to Poker Blackjack LasVegas NewYork
Now, for the first part, I would just use a delimiter like so: sc.useDelimiter("; ") which would work fine.
Second part is where I get trouble. If I switch to sc.useDelimiter(", ") after I'm done with Blackjack, I would still include that first ; and a whitespace so the string would output ; LasVegas.
I tried going over it by first resetting the delimiter and eating up the first token which is kind of a bad way of solving it, but then the string would still turn out to be "whitespace"LasVegas instead of LasVegas.
Would really appreciate some help.

There are a number of ways to deal with this, depending on your actual requirements [1]:
Don't change the delimiter. The token after "Blackjack" will be "LasVegas, NewYork to Poker Blackjack LasVegas NewYork". Create another scanner to parse that token. (Or use String::split.)
Use a delimiter regex that will match either delimiter; e.g. "[;,]\\s*".
Parse like this:
String line = scanner.nextLine();
String[] parts = line.split(";\\s*");
String[] parts2 = parts[2].split(",\\s*");
This is assuming that ; is a primary delimiter and , is a secondary delimiter.
Change the input file syntax so that it uses only one delimiter character. (This assumes that you are free to do that, AND that an alternative syntax would "make more sense".)
[1] Obviously, we cannot infer the syntax of the file that you are trying to parse from a single line of input. Or, in general, from a single example input file.

Using a regular expression to match both types of punctuation, including any trailing whitespace, should do the trick.
sc.useDelimiter("[;,]\\s*");
                     ^^^^ Followed by 0 or more whitespace chars
                 ^^^^ Either of these
With multi-line input, this will not separate the last token on a line (NewYork in this case) from the first token on the next line if there is no semicolon or comma after it. If these 4-tuples of games & cities come in this format (where no delimiter comes after the last token) then you can additionally match a newline character:
sc.useDelimiter("\\n|[;,]\\s*");
                     ^^^^^^^^ semi/comma delimiters
                    ^ OR
                 ^^^ New-line character
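A small sketch of the combined delimiter in use, reading from a String rather than a file. The class name DelimiterDemo, the helper tokens and the second input line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class DelimiterDemo {
    // Collects every token, splitting on a newline or on ';' / ',' plus any following whitespace.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter("\\n|[;,]\\s*");
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Poker; Blackjack; LasVegas, NewYork\nRoulette; Craps; Reno, Boston";
        for (String token : tokens(input)) {
            System.out.println(token);
        }
    }
}
```

Files with Windows line endings would leave a trailing \r on the last token of each line, so the newline part of the delimiter may need to become \\r?\\n in that case.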

How to filtrate a long string (dynamic) with regex?

I have stored the response from a web application in a string. The string contains several URLs, and it is dynamic: there could be anything from 10 to 1000 URLs.
I work in performance engineering, but this time I have to code a plugin in Java, and I am far from an expert in programming.
The problem is that my response string contains a lot of gibberish that I don't need, and I don't know how to filter it out. In my print/request I only want to send the URLs.
I've come this far:
responseData = "http://xxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-65354-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment1_4_av.ts?null=" +
"#EXTINF:10.000, " +
"http://xxxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-65365-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment2_4_av.ts?null=" +
"#EXTINF:fgsgsmoregiberish, " +
"http://xxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-6353-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment2_4_av.ts?null=";
pattern = "^(http://.*\\.ts)";
pr = Pattern.compile(pattern);
math = pr.matcher(responseData);
if (math.find()) {
System.out.println(math.group());
// in this print, I get everything from the response. I only want the URLS (dynamic. could be different names, but they all start with http and end with .ts).
}
else {
System.out.println("No Math");
}

Depending on what your URLs look like, you can use this naive pattern, which works for your examples and stops before the ? (written as a Java string):
\\bhttps?://[^?\\s]+
To ensure there is a .ts at the end, you can change it to:
\\bhttps?://[^?\\s]+\\.ts
or
\\bhttps?://[^?\\s]+\\.ts(?=[\\s?]|\\z)
to check that the end of the path is reached.
Note that these patterns don't deal with URLs that contain spaces between double quotes.
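To get every URL rather than just the first one, the matcher has to be advanced in a loop. A minimal sketch using the second pattern above; the class name, the helper extract and the example.com URLs are placeholders, not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlExtractor {
    // http(s) URL up to the ".ts" that precedes the query string or whitespace.
    private static final Pattern TS_URL =
            Pattern.compile("\\bhttps?://[^?\\s]+\\.ts");

    static List<String> extract(String response) {
        List<String> urls = new ArrayList<>();
        Matcher m = TS_URL.matcher(response);
        while (m.find()) {        // find() moves on to the next match on every call
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String response = "http://example.com/a/segment1_4_av.ts?null="
                + "#EXTINF:10.000, "
                + "http://example.com/a/segment2_4_av.ts?null=";
        extract(response).forEach(System.out::println);
    }
}
```

Note that there is no ^ anchor here: with the anchor from the question, only a URL at the very start of the string could ever match.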

Just make your regex lazy with .*? instead of greedy .*, i.e.:
pr = Pattern.compile("(https?.*?\\.ts)");
Regex demo:
https://regex101.com/r/nQ5pA7/1
Regex Explanation:
(https?.*?\.ts)
Match the regex below and capture its match into backreference number 1 «(https?.*?\.ts)»
    Match the character string “http” literally (case sensitive) «http»
    Match the character “s” literally (case sensitive) «s?»
        Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
    Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.*?»
        Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
    Match the character “.” literally «\.»
    Match the character string “ts” literally (case sensitive) «ts»

Use the following regex pattern:
(((http|ftp|https):\/{2})+(([0-9a-z_-]+\.)+([a-z]{2,4})(:[0-9]+)?((\/([~0-9a-zA-Z\#\+\%#\.\/_-]+))?(\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?))\b
Explanation:
Scheme (http, https or ftp) followed by //: ((http|ftp|https):\/{2})
The + after that group allows it to repeat one or more times.
Host name labels, each ending in a dot: ([0-9a-z_-]+\.)+
Top-level domain: ([a-z]{2,4})
Optional port number (? means zero or one occurrence): (:[0-9]+)?
Optional path and query string: ((\/([~0-9a-zA-Z\#\+\%#\.\/_-]+))?(\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?

regex certain character can exist or not but nothing after that

I'm new to regex and I'm trying to do a search on a couple of strings.
I want to check whether a certain character, in this case : (a colon), exists in the strings.
If : does not exist in the string it should still match, but if : exists there should be nothing after it; only spaces and newlines are allowed.
I have this pattern, but it does not seem to work as I want it to.
(.*)(:?\s*\n*)
Thank you.

If I understand your question correctly, ^[^:]*(:\s*)?$
Let's break this down a bit:
^ Starting anchor; without this, the match can restart itself every time it sees another colon, or non-whitespace following a colon.
[^:]* Match any number of characters that AREN'T colon characters; this way, if the entire string is non-colon characters, the string is treated as a valid match.
(:\s*)? If at any point we do see a colon, all following characters must be white space until the end of the string; the grouping parens and following ? act to make this an all-or-nothing conditional statement.
$ Ending anchor; without this, the regex won't know that if it sees a colon the following whitespace MUST persist until the end of the string.
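A quick way to check the breakdown above is to run the pattern against a few strings. This is only a sketch; the class name ColonCheck, the method isAllowed and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

final class ColonCheck {
    // No colon at all, or a single colon followed only by whitespace up to the end.
    private static final Pattern ALLOWED = Pattern.compile("^[^:]*(:\\s*)?$");

    static boolean isAllowed(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "no colon here", "label:", "label: \n", "label: value", "a:b:" };
        for (String s : samples) {
            System.out.println("[" + s.replace("\n", "\\n") + "] -> " + isAllowed(s));
        }
    }
}
```

The first three samples are accepted and the last two are rejected, since they have non-whitespace text or a second colon after the first colon.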

here is a pattern which should work
/^([^:]*|([^:]*:\s*))$/
you can use the pipe to manage alternatives

Another way is:
^[^:]*(|:[\n]*)$
^[^:]* => starts with anything except :
(|:[\n]*)$ => ends either with exactly nothing OR ':' followed by line breaks (replace [\n] with \s to allow spaces after the colon as well)

Regular Expression to select first five CSVs from a string

I have a CSV string like apple404, orange pie, wind\,cool, sun\\mooon, earth, in Java. To be precise, each value of the CSV string could be anything, provided commas and backslashes are escaped using a backslash.
I need a regular expression to find the first five values. After some googling I came up with the following, but it won't allow escaped commas within the values.
Pattern pattern = Pattern.compile("([^,]+,){0,5}");
Matcher matcher = pattern.matcher("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth,");
if (matcher.find()) {
System.out.println(matcher.group());
} else {
System.out.println("No match found.");
}
Does anybody know how to make it work for escaped commas within values?

The following negative lookbehind-based regex will work:
Pattern pattern = Pattern.compile("(?:.*?(?<!(?:(?<!\\\\)\\\\)),){0,5}");
However, for full-fledged CSV parsing it is better to use a dedicated CSV parser like JavaCSV.

You can use String.split() here. By specifying the limit as 6, the first five elements (index 0 to 4) are always the first five column values from your CSV string. Any extra column values overflow into index 5.
The regex (?<!\\\\), makes sure the CSV string is only split at a comma that is not preceded by a \.
// The parentheses are required: without them, split() would apply only to the second literal.
String[] cols = ("apple404, orange pie, wind\\,cool, sun\\\\mooon, earth, " +
"mars, venus, pluto").split("(?<!\\\\),", 6);
System.out.println(cols.length); // 6
System.out.println(Arrays.toString(cols)); // needs import java.util.Arrays
// [apple404,  orange pie,  wind\,cool,  sun\\mooon,  earth,  mars, venus, pluto]
System.out.println(cols[4]); // 5th = " earth" (the leading space is kept)
System.out.println(cols[5]); // 6th, to be discarded = " mars, venus, pluto"

This regular expression works well. It also properly recognizes not only backslash-escaped commas, but also backslash-escaped backslashes. Also, the matches it produces do not contain the commas.
/(?:\\\\|\\,|[^,])*/g
(I am using standard regular expression notation with the understanding that you would replace the delimiters with quote marks and double all backslashes when representing this regular expression within a Java string literal.)
example input
"apple404, orange pie, wind\,cool, sun\\,mooon, earth"
produces this output
"apple404"
" orange pie"
" wind\,cool"
" sun\\"
"mooon"
Note that the double backslash after "sun" is escaped and therefore does not escape the following comma.
The way this regular expression works is by atomizing the input into longest sequences first, beginning with double backslashes (treating them as one possible two-character alternative), followed by escaped commas (a second possible two-character alternative), followed by any non-comma character. Any number of these atoms are matched, stopping before the next unescaped comma.
In order to obtain the first N fields, one may simply slice the array of matches from the previous answer or surround the main expression in additional parentheses, include an optional comma in order to match the contents between fields, anchor it to the beginning of the string to prevent the engine from returning further groups of N fields, and quantify it (with N = 5 here):
/^((?:\\\\|\\,|[^,])*,?){0,5}/g
Once again, I am using standard regular expression notation, but here I will also do the trivial exercise of quoting this as a Java string:
"^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}"
This is the only solution on this page so far which actually answers both parts of the precise requirements specified by the OP, "...commas and backslash are escaped using a back slash." For the input fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,field6\\,, it properly matches only the first five fields fi\,eld1\\,field2\\,field3\\,field4\\,field5\\,.
Note: my first answer made the same assumption that is implicitly part of the OP's original code and example data, which required a comma to be following every field. The problem was that if input is exactly 5 fields or less, and the last field not followed by a comma (equivalently, by an empty field), then final field would not be matched. I did not like this, and so I updated both of my answers so that they do not require following commas.
The shortcoming with this answer is that it follows the OP's assumption that values between commas contain "anything" plus escaped commas or escaped backslashes (i.e., no distinction between strings in double quotes, etc., but only recognition of escaped commas and backslashes). My answer fulfills the criteria of that imaginary scenario. But in the real world, someone would expect to be able to use double quotes around a CSV field in order to include commas within a field without using backslashes.
So I echo the words of @anubhava and suggest that a "real" CSV parser should always be used when handling CSV data. Doing otherwise is just being a script kiddie and not in any way truly "handling" CSV data.
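For completeness, here is a minimal sketch of the N = 5 pattern from this answer used from Java, run on the sample string from the question with one extra field appended. The class name FirstFiveFields and the helper firstFive are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstFiveFields {
    // Up to five fields; each field is a run of escaped backslashes, escaped commas
    // or non-comma characters, optionally followed by the separating comma.
    private static final Pattern FIRST_FIVE =
            Pattern.compile("^((?:\\\\\\\\|\\\\,|[^,])*,?){0,5}");

    static String firstFive(String csv) {
        Matcher m = FIRST_FIVE.matcher(csv);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Java literal for: apple404, orange pie, wind\,cool, sun\\mooon, earth, mars
        String csv = "apple404, orange pie, wind\\,cool, sun\\\\mooon, earth, mars";
        System.out.println(firstFive(csv));
        // prints: apple404, orange pie, wind\,cool, sun\\mooon, earth,
    }
}
```

The result is the matched prefix as a single string, separators included; splitting it into individual values is a separate step.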
