Regex to match the pattern before and after an underscore - Java

I have a file naming convention, {referenceId}_{flavor name}.mp4, in Kaltura.
If you are familiar with Kaltura, please tell me which slugRegex I could use for this naming convention so that it supports pre-encoded file ingestion.
I have to extract the referenceId and the flavor name from the file name.
I'm using
/(?P)_(?P)[.]\w{3,}/

var filename = "referenceId_flavor-name.mp4";
var parts = filename.match(/([^_]+)_([^.]+)\.(\w{3})/i);
// parts is an array with 4 elements
// ["referenceId_flavor-name.mp4", "referenceId", "flavor-name", "mp4"];

var file = 'refID_name.mp4',
parts = file.match(/^([^_]+)_(.+)\.mp4/);
This returns the array:
[
'refID_name.mp4', //the whole match is always match 0
'refID', //sub-match 1
'name' //sub-match 2
]

/**
* Parse file name according to defined slugRegex and set the extracted parsedSlug and parsedFlavor.
* The following expressions are currently recognized and used:
* - (?P<referenceId>\w+) - will be used as the drop folder file's parsed slug.
* - (?P<flavorName>\w+) - will be used as the drop folder file's parsed flavor.
* - (?P<userId>\[\w\#\.]+) - will be used as the drop folder file entry's parsed user id.
* @return bool true if file name matches the slugRegex or false otherwise
*/
private function parseRegex(DropFolderContentFileHandlerConfig $fileHandlerConfig, $fileName, &$parsedSlug, &$parsedFlavor, &$parsedUserId)
{
    $matches = null;
    $slugRegex = $fileHandlerConfig->getSlugRegex();
    if(is_null($slugRegex) || empty($slugRegex))
    {
        $slugRegex = self::DEFAULT_SLUG_REGEX;
    }
    $matchFound = preg_match($slugRegex, $fileName, $matches);
    KalturaLog::debug('slug regex: ' . $slugRegex . ' file name:' . $fileName);
    if ($matchFound)
    {
        $parsedSlug = isset($matches[self::REFERENCE_ID_WILDCARD]) ? $matches[self::REFERENCE_ID_WILDCARD] : null;
        $parsedFlavor = isset($matches[self::FLAVOR_NAME_WILDCARD]) ? $matches[self::FLAVOR_NAME_WILDCARD] : null;
        $parsedUserId = isset($matches[self::USER_ID_WILDCARD]) ? $matches[self::USER_ID_WILDCARD] : null;
        KalturaLog::debug('Parsed slug ['.$parsedSlug.'], Parsed flavor ['.$parsedFlavor.'], parsed user id ['. $parsedUserId .']');
    }
    if(!$parsedSlug)
        $matchFound = false;
    return $matchFound;
}
This is the code that deals with the regex. I used /(?P<referenceId>.+)_(?P<flavorName>.+)[.]\w{3,}/, following this tutorial.
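Since the title asks about Java, here is a minimal sketch of the same extraction with java.util.regex. It is only an illustration, not a Kaltura slugRegex: Java writes named groups as (?<name>...), whereas the PHP code above relies on the PCRE form (?P<name>...). The class name SlugParser and the method parse are hypothetical, and the character classes [^_]+ and [^.]+ are an assumption (they split at the first underscore and stop at the first dot).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlugParser {
    // Java syntax for named groups is (?<name>...); PCRE/PHP also accepts (?P<name>...).
    // [^_]+ ends at the first underscore, [^.]+ ends at the first dot (assumed convention).
    private static final Pattern SLUG = Pattern.compile(
            "^(?<referenceId>[^_]+)_(?<flavorName>[^.]+)\\.\\w{3,}$");

    /** Returns {referenceId, flavorName}, or empty if the name does not fit the convention. */
    public static Optional<String[]> parse(String fileName) {
        Matcher m = SLUG.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group("referenceId"), m.group("flavorName") });
    }

    public static void main(String[] args) {
        // prints "referenceId / flavor-name"
        parse("referenceId_flavor-name.mp4")
                .ifPresent(p -> System.out.println(p[0] + " / " + p[1]));
    }
}
```

Using [^_]+ instead of .+ for the first group keeps the split at the first underscore, so a flavor name that itself contains an underscore stays in one piece.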


Karate DSL assert on nested json

{"serviceName":"Legal Entity account for given input account.","requestTime":1545426348945,"responseTime":1545426348949,"timeTaken":4,"responseCode":0,"responseMessage":"Success","pageSize":100,"pageNumber":0,"accounts":{"transferDate":1549429200000,"migrationWave":"5","searchedLEAccount":{"accountNumber":"41477514","cbdNumber":"12345678","bic":"CHASGBXxX","poolAccount":"Y","sweepMasterAccount":"Y","status":"DORMANT","branchId":"000000071","branchName":"LONDON","leAccountType":"OLD"},"linkedLEAccount":{"accountNumber":"6541245045","cbdNumber":"854321","bic":"CHASLUY","status":"DORMANT","branchId":"000000055","branchName":"S.A","leAccountType":"NEW"}}}
I am trying to grab all accountNumber and validate if they are numbers. What am I doing wrong?
When method Post
Then status 200
And match response != null
And match response contains {serviceName: 'Legal Entity account for given input account.' }
And match response.accounts.searchedLEAccount contains { accountNumber: '#notnull' }
And match response.accounts.searchedLEAccount contains { accountNumber: '#present' }
And match response.accounts.searchedLEAccount contains { accountNumber: '#number' }
In one line:
* match each $..accountNumber == '#regex \\d+'
Tip: read the docs carefully and understand Json-Path.
Here's the full example which you can paste into a new Scenario and see working:
* def response =
"""
{
"serviceName":"Legal Entity account for given input account.",
"requestTime":1545426348945,
"responseTime":1545426348949,
"timeTaken":4,
"responseCode":0,
"responseMessage":"Success",
"pageSize":100,
"pageNumber":0,
"accounts":{
"transferDate":1549429200000,
"migrationWave":"5",
"searchedLEAccount":{
"accountNumber":"41477514",
"cbdNumber":"12345678",
"bic":"CHASGBXxX",
"poolAccount":"Y",
"sweepMasterAccount":"Y",
"status":"DORMANT",
"branchId":"000000071",
"branchName":"LONDON",
"leAccountType":"OLD"
},
"linkedLEAccount":{
"accountNumber":"6541245045",
"cbdNumber":"854321",
"bic":"CHASLUY",
"status":"DORMANT",
"branchId":"000000055",
"branchName":"S.A",
"leAccountType":"NEW"
}
}
}
"""
* match each $..accountNumber == '#regex \\d+'

Why doesn't preserveOriginal work as described in the Javadoc?

I have the following configuration:
@AnalyzerDef(name = "autocompleteNGramAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = WordDelimiterFilterFactory.class,
            params = @Parameter(name = "preserveOriginal", value = "1"))
preserveOriginal doc:
/**
 * Causes original words are preserved and added to the subword list (Defaults to false)
 *
 * "500-42" => "500" "42" "500-42"
 */
Based on this, I added the following word:
500-42
I rebuilt the index, reopened Luke, and saw the following:
only the tokens 500 and 42; there is no 500-42 token.
Why?
Your WordDelimiterFilterFactory only works on tokens that are provided to it, which may not be the original text.
In your case, you use a StandardTokenizer, so by the time WordDelimiterFilterFactory starts processing the string, it has already been split into two tokens (500 and 42).

Java regex for google maps url?

I want to parse all Google Maps links inside a String. The format is as follows:
1st example
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z
https://www.google.com/maps/place//#38.8976763,-77.0387185,17z
https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z
https://www.google.com/maps/place/#38.8976763,-77.0387185,17z
https://google.com/maps/place/#38.8976763,-77.0387185,17z
http://google.com/maps/place/#38.8976763,-77.0387185,17z
https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z
These are all valid Google Maps URLs (linking to the White House).
Here is what I tried
String gmapLinkRegex = "(http|https)://(www\\.)?google\\.com(\\.\\w*)?/maps/(place/.*)?#(.*z)[^ ]*";
Pattern patternGmapLink = Pattern.compile(gmapLinkRegex , Pattern.CASE_INSENSITIVE);
Matcher m = patternGmapLink.matcher(s);
while (m.find()) {
logger.info("group0 = {}" , m.group(0));
String place = m.group(4);
place = StringUtils.stripEnd(place , "/"); // remove tailing '/'
place = StringUtils.stripStart(place , "place/"); // remove header 'place/'
logger.info("place = '{}'" , place);
String latLngZ = m.group(5);
logger.info("latLngZ = '{}'" , latLngZ);
}
It works in simple cases, but it is still buggy.
For example:
It needs post-processing to grab the optional place information.
And it cannot handle one line that contains two URLs, such as:
s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z " +
" and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
There should be two URLs, but the regex matches the whole line.
The requirements:
The whole URL should be matched in group(0) (including the trailing data part in the 1st example).
In the 1st example, if the zoom level (17z) is removed, it is still a valid Google Maps URL, but my regex cannot match it.
Extracting the optional place information should be easier.
Lat/Lng extraction is a must; the zoom level is optional.
It should be able to parse multiple URLs in one line.
It should be able to process maps.google.com(.xx)/maps; I tried (www|maps\.)? but it still seems buggy.
Any suggestions for improving this regex? Thanks a lot!
The dot-asterisk
.*
will always allow anything to the end of the last url.
You need "tighter" regexes, which match a single URL but not several with anything in between.
The "[^ ]*" might include the next URL if it is separated by something other than " ", which includes line break, tab, shift-space...
I propose (sorry, not tested in Java) using "anything but #", "digit, minus, comma or dot", and "an optional special string followed by a tailored character set, repeated many times".
"(http|https)://(www\.)?google\.com(\.\w*)?/maps/(place/[^#]*)?#([0123456789\.,-]*z)(\/data=[\!:\.\-0123456789abcdefmsx]+)?"
I tested the one above on a perl-regex compatible engine (np++).
Please adapt it yourself if I guessed anything wrong. The explicit list of digits can probably be replaced by "\d"; I tried to minimise assumptions about the regex flavor.
In order to match "URL" or "URL and URL", store the regex in a variable, then build "(URL and )*URL", replacing "URL" with the regex variable. (Assuming this is possible in Java.) If the question is how to then retrieve the multiple matches: that is Java, and I cannot help. Let me know and I will delete this answer, so as not to provoke deserved downvotes ;-)
(Edited to catch the data part in, previously not seen, first example, first line; and the multi URLs in one line.)
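On the Java side, multiple matches in one line are usually retrieved by calling Matcher.find() in a loop, provided the pattern cannot run past the end of a single URL. The following is a minimal sketch along the lines of the "tighter regex" advice above, not a complete Google Maps URL grammar. The class name GmapLinks, the method extract, and the group names are hypothetical. The pattern accepts either # or @ before the coordinates: the sample links in this thread show #, while live Google Maps links commonly use @. Treating the /data= part as "any run of non-whitespace" is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GmapLinks {
    // No ".*" anywhere: every part is limited to characters that cannot span two URLs.
    private static final Pattern LINK = Pattern.compile(
            "https?://(?:www\\.|maps\\.)?google\\.com(?:\\.\\w+)?/maps/place/"
            + "(?<place>[^\\s/@#]*)/*[@#]"                            // optional place name
            + "(?<lat>-?\\d+(?:\\.\\d+)?),(?<lng>-?\\d+(?:\\.\\d+)?)" // latitude,longitude
            + "(?:,(?<zoom>\\d+(?:\\.\\d+)?)z)?"                      // optional zoom level
            + "(?:/data=\\S*)?",                                      // optional data part
            Pattern.CASE_INSENSITIVE);

    /** Returns one {url, place, lat, lng, zoom} array per link; zoom is null when absent. */
    public static List<String[]> extract(String text) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) { // find() resumes after the previous match
            links.add(new String[] {
                m.group(0), m.group("place"), m.group("lat"), m.group("lng"), m.group("zoom") });
        }
        return links;
    }

    public static void main(String[] args) {
        String s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z "
                + " and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
        for (String[] link : extract(s)) {
            System.out.println(link[0] + " -> lat=" + link[2] + ", lng=" + link[3]);
        }
    }
}
```

Because the place name is captured by its own group, no post-processing with stripStart or stripEnd is needed; an empty string simply means that no place was given.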
I wrote this regex to validate google maps links:
"(http:|https:)?\\/\\/(www\\.)?(maps.)?google\\.[a-z.]+\\/maps/?([\\?]|place/*[^#]*)?/*#?(ll=)?(q=)?(([\\?=]?[a-zA-Z]*[+]?)*/?#{0,1})?([0-9]{1,3}\\.[0-9]+(,|&[a-zA-Z]+=)-?[0-9]{1,3}\\.[0-9]+(,?[0-9]+(z|m))?)?(\\/?data=[\\!:\\.\\-0123456789abcdefmsx]+)?"
I tested with the following list of google maps links:
String location1 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location2 = "https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z";
String location3 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location4 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298";
String location5 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z";
String location6 = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location7 = "https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location8 = "https://www.google.com/maps/place/#38.8976763,-77.0387185,17z";
String location9 = "https://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location10 = "http://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location11 = "https://www.google.com/maps/place/#/data=!4m2!3m1!1s0x3135abf74b040853:0x6ff9dfeb960ec979";
String location12 = "https://maps.google.com/maps?q=New+York,+NY,+USA&hl=no&sll=19.808054,-63.720703&sspn=54.337928,93.076172&oq=n&hnear=New+York&t=m&z=10";
String location13 = "https://www.google.com/maps";
String location14 = "https://www.google.fr/maps";
String location15 = "https://google.fr/maps";
String location16 = "http://google.fr/maps";
String location17 = "https://www.google.de/maps";
String location18 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location19 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location20 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location21 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location22 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location23 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location24 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location25 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location26 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location27 = "http://google.com/maps/bylatlng?lat=21.01196022&lng=105.86298748";
String location28 = "https://www.google.com/maps/place/C%C3%B4ng+vi%C3%AAn+Th%E1%BB%91ng+Nh%E1%BA%A5t,+354A+%C4%90%C6%B0%E1%BB%9Dng+L%C3%AA+Du%E1%BA%A9n,+L%C3%AA+%C4%90%E1%BA%A1i+H%C3%A0nh,+%C4%90%E1%BB%91ng+%C4%90a,+H%C3%A0+N%E1%BB%99i+100000,+Vi%E1%BB%87t+Nam/#21.0121535,105.8443773,13z/data=!4m2!3m1!1s0x3135ab8ee6df247f:0xe6183d662696d2e9";

How to dynamically update absolute path

Given the below incoming path, e.g.
C:\cresttest\parent_3\child_3_1\child_3_1_.txt
How can I insert a new directory into the middle of the path above to get the result below?
C:\cresttest\NEW_PATH\parent_3\child_3_1\child_3_1_.txt
Currently I am using multiple substring calls to take the incoming path apart, but the incoming paths are random and dynamic. Placing my new directory with substring requires more lines of code and unnecessary processing. Is there an API or an easier way to insert my new directory into the middle of the absolute path?
By using java.nio.file.Path, you could do the following:
Path incomingPath = Paths.get("C:\\cresttest\\parent_3\\child_3_1\\child_3_1_.txt");
// subpath() returns a relative path without the root, so re-attach the root:
// C:\ + cresttest, then add NEW_PATH to it
Path subPathWithAddition = incomingPath.getRoot().resolve(incomingPath.subpath(0, 1)).resolve("NEW_PATH");
// Appending parent_3\child_3_1\child_3_1_.txt to C:\cresttest\NEW_PATH
Path finalPath = subPathWithAddition.resolve(incomingPath.subpath(1, incomingPath.getNameCount()));
You could then get the path URI by calling finalPath.toUri()
Note: this doesn't depend on any names in your path, it depends on the directory depth though, which you could edit in the subpath calls.
Note 2: you could probably reduce the amount of Path instances you make to one, I made three to improve readability.
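One detail worth keeping in mind with the Path approach: subpath() returns a relative path that no longer carries the root, so the root has to be re-attached through getRoot(). The sketch below wraps this in a small helper with a configurable depth. The class name PathInsert and the method insertDir are hypothetical, and the demo uses a Unix-style path so that it behaves the same on any platform.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathInsert {
    /**
     * Inserts newDir after the first `depth` name elements of the path.
     * Name elements exclude the root, so for /a/b/c element 0 is "a".
     */
    public static Path insertDir(Path path, int depth, String newDir) {
        int count = path.getNameCount();
        if (depth < 1 || depth >= count) {
            throw new IllegalArgumentException("depth must be between 1 and " + (count - 1));
        }
        Path head = path.subpath(0, depth);     // relative, root is dropped
        Path tail = path.subpath(depth, count); // the remainder, also relative
        Path root = path.getRoot();             // null when the input is relative
        Path base = (root == null) ? head : root.resolve(head);
        return base.resolve(newDir).resolve(tail);
    }

    public static void main(String[] args) {
        Path in = Paths.get("/cresttest/parent_3/child_3_1/child_3_1_.txt");
        System.out.println(insertDir(in, 1, "NEW_PATH"));
    }
}
```

With depth 1 the new directory lands right after the first directory below the root, which corresponds to the result the question asks for.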
You may simply insert a path at the second backslash like this:
String path="C:\\cresttest\\parent_3\\child_3_1\\child_3_1_.txt";
final String slash="\\\\";
path=path.replaceFirst(slash+"[^"+slash+"]+"+slash, "$0NEW_PATH"+slash);
System.out.println(path);
This replaces the first occurrence of \\arbitrarydirname\\ with itself (referred to via $0) followed by NEW_PATH\\.
The separator’s source code representation looks a bit odd ("\\\\") as a backslash has to be escaped twice when writing regular expression in a Java String literal.
If you want your operation to be platform independent, you may replace that line with
final String slash = Pattern.quote(FileSystems.getDefault().getSeparator());
Of course, then, the input path must be in the right format for the platform as well.
You can use this simple regex replace:
path = path.replaceAll(":.\\w+", "$0\\\\NEW_PATH");
Your code would be simpler if you used / instead of \ for your path delimiters. For example, compare:
String path = "C:\\cresttest\\parent_3\\child_3_1\\child_3_1_.txt";
path = path.replaceAll(":.\\w+", "$0\\\\NEW_PATH");
with
String path = "C:/cresttest/parent_3/child_3_1/child_3_1_.txt";
path = path.replaceAll(":.\\w+", "$0/NEW_PATH");
Java can handle either delimiter on Windows, but on Linux only / works, so to make your code portable and more readable, prefer using /.
Just for fun, not sure whether this is what you wanted
public static String addFolderToPath(String originalPath, String newFolderName, int position){
    String returnString = "";
    String[] pathArray = originalPath.split("\\\\");
    for(int i = 0; i<pathArray.length; i++){
        returnString = returnString.concat(i==position ? "\\" + newFolderName : "");
        returnString = returnString.concat(i!=0 ? "\\" + pathArray[i] : "" + pathArray[i]);
    }
    return returnString;
}
Call:
System.out.println(addFolderToPath("c:\\abc\\def\\ghi\\jkl", "test", 1));
System.out.println(addFolderToPath("c:\\abc\\def\\ghi\\jkl", "test", 2));
System.out.println(addFolderToPath("c:\\abc\\def\\ghi\\jkl", "test", 3));
System.out.println(addFolderToPath("c:\\abc\\def\\ghi\\jkl", "test", 4));
Run:
c:\test\abc\def\ghi\jkl
c:\abc\test\def\ghi\jkl
c:\abc\def\test\ghi\jkl
c:\abc\def\ghi\test\jkl

string.matches(".*") returns false

In my program, I have a string (obtained from an external library) which doesn't match any regular expression.
String content = // extract text from PDF
assertTrue(content.matches(".*")); // fails
assertTrue(content.contains("S P E C I A L")); // passes
assertTrue(content.matches("S P E C I A L")); // fails
Any idea what might be wrong? When I print content to stdout, it looks ok.
Here is the code for extracting text from the PDF (I am using iText 5.0.1):
PdfReader reader = new PdfReader(source);
PdfTextExtractor extractor = new PdfTextExtractor(reader,
new SimpleTextExtractingPdfContentRenderListener());
return extractor.getTextFromPage(1);
By default, the . does not match line breaks. So my guess is that your content contains a line break.
Also note that matches will match the entire string, not just a part of it: it does not do what contains does!
Some examples:
String s = "foo\nbar";
System.out.println(s.matches(".*")); // false
System.out.println(s.matches("foo")); // false
System.out.println(s.matches("foo\nbar")); // true
System.out.println(s.matches("(?s).*")); // true
The (?s) in the last example will cause the . to match line breaks as well. So (?s).* will match any string.
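To make the difference between "whole string" and "somewhere in the string" explicit, here is a small sketch using Pattern and Matcher directly. The class name MatchesDemo, its two helper methods, and the sample text are hypothetical stand-ins for the extracted PDF content. Pattern.DOTALL is the flag equivalent of the inline (?s).

```java
import java.util.regex.Pattern;

public class MatchesDemo {
    /** True if the regex matches anywhere in the text (regex counterpart of contains). */
    public static boolean containsPattern(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    /** True if the regex matches the entire text, with '.' also matching line breaks. */
    public static boolean matchesAcrossLines(String text, String regex) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(text).matches();
    }

    public static void main(String[] args) {
        String content = "header\nS P E C I A L\nfooter"; // made-up multi-line text

        System.out.println(content.matches(".*"));                     // false
        System.out.println(matchesAcrossLines(content, ".*"));         // true
        System.out.println(content.matches("S P E C I A L"));          // false
        System.out.println(containsPattern(content, "S P E C I A L")); // true
    }
}
```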
