string.matches(".*") returns false - java

In my program, I have a string (obtained from an external library) which doesn't match any regular expression.
String content = // extract text from PDF
assertTrue(content.matches(".*")); // fails
assertTrue(content.contains("S P E C I A L")); // passes
assertTrue(content.matches("S P E C I A L")); // fails
Any idea what might be wrong? When I print content to stdout, it looks ok.
Here is the code for extracting text from the PDF (I am using iText 5.0.1):
PdfReader reader = new PdfReader(source);
PdfTextExtractor extractor = new PdfTextExtractor(reader,
new SimpleTextExtractingPdfContentRenderListener());
return extractor.getTextFromPage(1);

By default, the . does not match line breaks. So my guess is that your content contains a line break.
Also note that matches will match the entire string, not just a part of it: it does not do what contains does!
Some examples:
String s = "foo\nbar";
System.out.println(s.matches(".*")); // false
System.out.println(s.matches("foo")); // false
System.out.println(s.matches("foo\nbar")); // true
System.out.println(s.matches("(?s).*")); // true
The (?s) in the last example will cause the . to match line breaks as well. So (?s).* will match any string.
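The same behaviour can be checked with java.util.regex directly. The sketch below is only illustrative: the class name DotAllDemo and the sample string are made up, and Pattern.DOTALL is the flag form of the inline (?s). It also shows find(), which looks for a match anywhere in the input and is therefore closer to what contains does.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // True for any input: with DOTALL the dot also matches line terminators.
    static boolean matchesAnything(String input) {
        return Pattern.compile(".*", Pattern.DOTALL).matcher(input).matches();
    }

    // True if the regex matches somewhere inside the input (substring search).
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String s = "foo\nbar"; // hypothetical content containing a line break
        System.out.println(s.matches(".*"));         // false
        System.out.println(matchesAnything(s));      // true
        System.out.println(containsMatch(s, "foo")); // true
    }
}
```

If the extracted PDF text contains line breaks, matchesAnything(content) should succeed where content.matches(".*") fails.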

Related

How to avoid backslash before comma in CSVFormat

I am creating a CSV file using CSVFormat in Java. The problem I am facing, in both the header and the values, is that whenever the string is long and contains a comma, the API inserts a \ before the comma. As a result the header is not formed correctly, and the values in the CSV file spill over into the next cell. Here is the code I have so far:
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT.withHeader("\""+SampleEnum.MY_NAME.getHeader()+"\"", "\""+SampleEnum.MY_TITLE.getHeader()+"\"",
"\""+SampleEnum.MY_ID.getHeader()+"\"", "\""+SampleEnum.MY_NUMBER.getHeader()+"\"", "\""+SampleEnum.MY_EXTERNAL_KEY.getHeader()+"\"",
"\""+SampleEnum.DATE.getHeader()+"\"","\""+SampleEnum.MY_ACTION.getHeader()+"\"",
"\"\"\""+SampleEnum.MY__DEFI.getHeader()+"\"\"\"", SampleEnum.MY_ACTION.getHeader(),
SampleEnum.CCHK.getHeader(), SampleEnum.DISTANCE_FROM_LOCATION.getHeader(),
SampleEnum.TCOE.getHeader(), SampleEnum.HGTR.getHeader(),SampleEnum._BLANK.getHeader(),
SampleEnum.LOCATION_MAP.getHeader(), SampleEnum.SUBMISSION_ID.getHeader())
.withDelimiter(',').withEscape('\\').withQuote('"').withTrim().withQuoteMode(QuoteMode.MINIMAL)
)) {
sampleModel.forEach(sf -> {
try {
csvPrinter.printRecord(sf.getMyName(),
sf.getMyTitle(),
sf.getMyID(),
sf.getMyNo(),
So now the problem is that I am getting output like this:
"\"Name:\"","\"Title\"","\"ID #:\"","\"Store #:\"","\"Store #: External Key\"","\"Date:\"","\"\"\"It's performance issue in detail to include dates,times, circumstances, etc.\"\"\""
I am getting a \ before each comma, and when this happens in a value, the rest of the text shifts to the next cell.
My required output is:
"Name:","Title:","Employee ID #:","Store #:","Store #: CurrierKey","Date:","Stage of Disciplinary Action:","""Describe your view about the company, times, circumstances, etc.""",
I have been reading
https://commons.apache.org/proper/commons-csv/jacoco/org.apache.commons.csv/CSVFormat.java.html
but I am unable to work out the fix. Please help.
This happens because you are using QuoteMode.NONE which has the following Javadoc:
Never quotes fields. When the delimiter occurs in data, the printer prefixes it with the escape character. If the escape character is not set, format validation throws an exception.
You can use QuoteMode.MINIMAL to quote only fields which contain special characters (e.g. the field delimiter, quote character or a character of the line separator string).
I suggest that you use CSVFormat.DEFAULT and then configure everything yourself if you cannot use one of the other formats. Check if the backslash (\) is really the right escape character for your use case. Normally it would be a double quote ("). Also, you probably want to remove all the double quotes from your header definition as they get added automatically (if necessary) based on your configuration.
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withHeader("AAAA", "BB\"BB", "CC,CC", "DD'DD")
.withDelimiter(',')
.withEscape('\\') // <- maybe you want '"' instead
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.MINIMAL)
)) {
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "ZZ'ZZ");
}
System.out.println(out);
AAAA,"BB\"BB","CC,CC",DD'DD
WWWW,"XX\"XX","YY,YY",ZZ'ZZ
After your edit, it seems like you want all fields to be quoted with a double quote as escape character. Therefore, you can use QuoteMode.ALL and .withEscape('"') like this:
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withHeader("AAAA", "BB\"BB", "CC,CC", "\"DD\"", "1")
.withDelimiter(',')
.withEscape('"')
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.ALL)
)) {
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "\"DD\"", "2");
}
System.out.println(out);
"AAAA","BB""BB","CC,CC","""DD""","1"
"WWWW","XX""XX","YY,YY","""DD""","2"
In your comment, you state that you only want double quotes when required and triple quotes for one field only. Then, you can use QuoteMode.MINIMAL and .withEscape('"') as suggested in the first example. The triple quotes get generated when you surround your input of that field with double quotes (once because there is a special character and the field needs to be quoted, the second one because you added your explicit " and the third one is there to escape your explicit quote).
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withHeader("AAAA", "BB\"BB", "CC,CC", "\"DD\"", "1")
.withDelimiter(',')
.withEscape('"')
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.MINIMAL)
)) {
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "\"DD\"", "2");
}
System.out.println(out);
AAAA,"BB""BB","CC,CC","""DD""",1
WWWW,"XX""XX","YY,YY","""DD""",2
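To see where the triple quotes come from without the library, here is a minimal sketch of the quoting rule described above. It is not Commons CSV's implementation (the real printer has more rules than this); the class and method names are hypothetical, and the only assumptions are a comma delimiter and a double quote used both as quote and as escape character.

```java
class MinimalQuoting {
    // Quote a field only when it contains a comma, a double quote or a line break.
    // Embedded double quotes are escaped by doubling them.
    static String quoteIfNeeded(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quoteIfNeeded("\"DD\"")); // """DD"""
        System.out.println(quoteIfNeeded("CC,CC"));  // "CC,CC"
        System.out.println(quoteIfNeeded("1"));      // 1
    }
}
```

The input "DD" (with its own quotes) has each quote doubled and is then wrapped in one more pair, which gives three quotes on each side.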
As per the chat, you want total control over when the header has quotes and when it does not. There is no combination of QuoteMode and escape character that will give the desired result. Consequently, I suggest that you manually construct the header:
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withDelimiter(',').withEscape('"')
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.MINIMAL))
) {
out.append(String.join(",", "\"AAAA\"", "\"BBBB\"", "\"CC,CC\"", "\"\"\"DD\"\"\"", "1"));
out.append("\n");
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "\"DD\"", "2");
}
System.out.println(out);
"AAAA","BBBB","CC,CC","""DD""",1
WWWW,"XX""XX","YY,YY","""DD""",2

using antlr4 for checking user input in a java code

I have a .g4 file for my grammar and it works fine.
In my Java program, user input must follow the rules defined in the .g4 file.
How can I use the grammar in my Java code to check whether the user input is valid?
BTW, my IDE is IntelliJ IDEA.
Here is my ANTLR grammar:
grammar CFG;
/*
* Parser Rules
*/
cfg: (rull NewLine)+;
rull: Variable TransitionOperator sententialForm (Or sententialForm)*;
sententialForm: ((Variable | Literal)+) | Landa;
/*
* Lexer Rules
*/
Literal: [a-z];
Variable: [A-Z];
TransitionOperator: '->';
Or: '|';
OpenParenthesis: '(';
CloseParenthesis: ')';
// Star: '*';
// Plus: '+';
Landa: 'λ';
WhiteSpace: ' ' -> skip;
NewLine: '\n';
That's pretty easy to do: set up your parsing pipeline as usual:
using Antlr4.Runtime;
using Antlr4.Runtime.Tree;
public void MyParseMethod() {
String input = "your text to parse here";
ICharStream stream = CharStreams.fromstring(input);
ITokenSource lexer = new CFGLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
CFGParser parser = new CFGParser(tokens);
// parser.BuildParseTree = true;
IParseTree tree = parser.cfg();
}
(here written in C#) and once the parse run is done, check getNumberOfSyntaxErrors() on the parser to see if there was an error in the input. For more fine-grained handling, set up your own error listener and collect the produced errors.

Java regex for google maps url?

I want to parse all google map links inside a String. The format is as follows :
1st example
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z
https://www.google.com/maps/place//#38.8976763,-77.0387185,17z
https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z
https://www.google.com/maps/place/#38.8976763,-77.0387185,17z
https://google.com/maps/place/#38.8976763,-77.0387185,17z
http://google.com/maps/place/#38.8976763,-77.0387185,17z
https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z
These are all valid google map URLs (linking to White House)
Here is what I tried
String gmapLinkRegex = "(http|https)://(www\\.)?google\\.com(\\.\\w*)?/maps/(place/.*)?#(.*z)[^ ]*";
Pattern patternGmapLink = Pattern.compile(gmapLinkRegex , Pattern.CASE_INSENSITIVE);
Matcher m = patternGmapLink.matcher(s);
while (m.find()) {
logger.info("group0 = {}" , m.group(0));
String place = m.group(4);
place = StringUtils.stripEnd(place , "/"); // remove tailing '/'
place = StringUtils.stripStart(place , "place/"); // remove header 'place/'
logger.info("place = '{}'" , place);
String latLngZ = m.group(5);
logger.info("latLngZ = '{}'" , latLngZ);
}
It works in simple situations, but it is still buggy.
For example:
It needs post-processing to extract the optional place information.
It also cannot split a line that contains two URLs, such as:
s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z " +
" and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
There should be two URLs, but the regex matches the whole line.
The requirements:
The whole URL should be matched in group(0), including the trailing data part in the 1st example.
In the 1st example, if the zoom level (17z) is removed, it is still a valid Google Maps URL, but my regex cannot match it.
Extracting the optional place info should be easier.
Lat/lng extraction is a must; the zoom level is optional.
It should be able to parse multiple URLs in one line.
It should be able to process maps.google.com(.xx)/maps. I tried (www|maps\.)? but it still seems buggy.
Any suggestions to improve this regex? Thanks a lot!
The dot-asterisk
.*
will always match anything up to the end of the last URL.
You need "tighter" regexes, which match a single URL but not several with anything in between.
The "[^ ]*" might include the next URL if it is separated by something other than " ", such as a line break, tab, shift-space...
I propose (sorry, not tested in Java) using "anything but #", "digit, minus, comma or dot", and "optional special string followed by a tailored charset, many times".
"(http|https)://(www\.)?google\.com(\.\w*)?/maps/(place/[^#]*)?#([0123456789\.,-]*z)(\/data=[\!:\.\-0123456789abcdefmsx]+)?"
I tested the one above on a perl-regex compatible engine (np++).
Please adapt it yourself if I guessed anything wrong. The explicit list of digits can probably be replaced by "\d"; I tried to minimise assumptions about the regex flavor.
In order to match "URL" or "URL and URL", store the regex in a variable, then build "(URL and )*URL", replacing "URL" with the regex variable (assuming this is possible in Java). If the question is how to then retrieve the multiple matches: that is Java, and I cannot help. Let me know and I will delete this answer, so as not to provoke deserved downvotes ;-)
(Edited to catch the data part in, previously not seen, first example, first line; and the multi URLs in one line.)
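For the Java side of retrieving several matches from one line, a loop over Matcher.find() is enough once the pattern no longer contains an unbounded ".*". The following is a sketch under assumptions, not a complete Google Maps URL grammar: the URL shape is inferred only from the examples in the question, and the class name, group names and pattern are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapsLinkFinder {
    // Assumed shape: scheme, optional "www." or "maps." prefix, optional country
    // suffix, optional place segment, "#lat,lng", optional ",<zoom>z", optional data tail.
    static final Pattern LINK = Pattern.compile(
            "https?://(?:www\\.|maps\\.)?google\\.com(?:\\.\\w+)?/maps/"
            + "(?:place/(?<place>[^#\\s]*))?"
            + "#(?<lat>-?\\d+\\.\\d+),(?<lng>-?\\d+\\.\\d+)"
            + "(?:,(?<zoom>\\d+)z)?"
            + "(?:/data=[!:.\\-0-9a-zA-Z]+)?",
            Pattern.CASE_INSENSITIVE);

    // Collects every complete URL (group 0) found in the text.
    static List<String> findLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z "
                + " and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
        for (String link : findLinks(s)) {
            System.out.println(link);
        }
    }
}
```

Inside the loop, m.group("lat"), m.group("lng") and m.group("place") give the individual parts; m.group("zoom") is null when the zoom level is absent.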
I wrote this regex to validate google maps links:
"(http:|https:)?\\/\\/(www\\.)?(maps.)?google\\.[a-z.]+\\/maps/?([\\?]|place/*[^#]*)?/*#?(ll=)?(q=)?(([\\?=]?[a-zA-Z]*[+]?)*/?#{0,1})?([0-9]{1,3}\\.[0-9]+(,|&[a-zA-Z]+=)-?[0-9]{1,3}\\.[0-9]+(,?[0-9]+(z|m))?)?(\\/?data=[\\!:\\.\\-0123456789abcdefmsx]+)?"
I tested with the following list of google maps links:
String location1 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location2 = "https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z";
String location3 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location4 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298";
String location5 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z";
String location6 = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location7 = "https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location8 = "https://www.google.com/maps/place/#38.8976763,-77.0387185,17z";
String location9 = "https://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location10 = "http://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location11 = "https://www.google.com/maps/place/#/data=!4m2!3m1!1s0x3135abf74b040853:0x6ff9dfeb960ec979";
String location12 = "https://maps.google.com/maps?q=New+York,+NY,+USA&hl=no&sll=19.808054,-63.720703&sspn=54.337928,93.076172&oq=n&hnear=New+York&t=m&z=10";
String location13 = "https://www.google.com/maps";
String location14 = "https://www.google.fr/maps";
String location15 = "https://google.fr/maps";
String location16 = "http://google.fr/maps";
String location17 = "https://www.google.de/maps";
String location18 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location19 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location20 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location21 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location22 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location23 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location24 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location25 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location26 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location27 = "http://google.com/maps/bylatlng?lat=21.01196022&lng=105.86298748";
String location28 = "https://www.google.com/maps/place/C%C3%B4ng+vi%C3%AAn+Th%E1%BB%91ng+Nh%E1%BA%A5t,+354A+%C4%90%C6%B0%E1%BB%9Dng+L%C3%AA+Du%E1%BA%A9n,+L%C3%AA+%C4%90%E1%BA%A1i+H%C3%A0nh,+%C4%90%E1%BB%91ng+%C4%90a,+H%C3%A0+N%E1%BB%99i+100000,+Vi%E1%BB%87t+Nam/#21.0121535,105.8443773,13z/data=!4m2!3m1!1s0x3135ab8ee6df247f:0xe6183d662696d2e9";

Java jcabi xpath returns unescaped text

Consider the following:
String s = "<tag>This has a &lt;a href=\"#\"&gt;link&lt;a&gt;.</tag>";
final XML xml = new XMLDocument(s);
String extractedText = xml.xpath("//tag/text()").get(0);
System.out.println(extractedText); // Output: This has a <a href="#">link<a>.
System.out.println(s.contains(extractedText)); // Output: false!
System.out.println(s.contains("This has a &lt;a href=\"#\"&gt;link&lt;a&gt;.")); // Output: true
I have an XML file given as a string with some escaped HTML. Using the jcabi library, I get the text of the relevant elements (in this case everything in <tag>s). However, what I get isn't actually what's in the original string: I'm expecting &lt; and &gt; but am getting < and > instead. The original string paradoxically does not contain the substring that I extracted from it.
How can I get the actual text and not an unescaped version?
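An XPath text() result is the character data after the parser has decoded entities, so &lt; in the source comes back as <. One possible workaround, sketched below, is to re-escape the extracted text before comparing it with the source. The class and method names are hypothetical and this is not part of jcabi.

```java
class XmlTextEscaper {
    // Re-escape the characters that are escaped in XML character data.
    // '&' is replaced first so the entities produced below are not escaped again.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String source = "<tag>This has a &lt;a href=\"#\"&gt;link&lt;a&gt;.</tag>";
        String extracted = "This has a <a href=\"#\">link<a>."; // what text() returns
        System.out.println(source.contains(extracted));         // false
        System.out.println(source.contains(escape(extracted))); // true
    }
}
```

This only reproduces the source exactly when the source used these three named entities; numeric character references or CDATA sections would not round-trip this way.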

regex match pattern before and after an underscore

I have a file naming convention {referenceId}_{flavor name}.mp4 in Kaltura.
If you are familiar with Kaltura, please tell me which slugRegex I could use for this naming convention so that it supports pre-encoded file ingestion.
I have to extract referenceId and filename from it.
I'm using
/(?P<referenceId>.+)_(?P<flavorName>.+)[.]\w{3,}/
var filename = "referenceId_flavor-name.mp4";
var parts = filename.match(/([^_]+)_([^.]+)\.(\w{3})/i);
// parts is an array with 4 elements
// ["referenceId_flavor-name.mp4", "referenceId", "flavor-name", "mp4"];
var file = 'refID_name.mp4',
parts = file.match(/^([^_]+)_(.+)\.mp4/);
Returns array:
[
'refID_name.mp4', //the whole match is always match 0
'refID', //sub-match 1
'name' //sub-match 2
]
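For comparison, the same split can be sketched in Java with named groups. Note that Java writes named groups as (?<name>...) rather than the PCRE form (?P<name>...) that the slugRegex uses. The class name, the ext group and the pattern below are illustrative assumptions, not Kaltura code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlavorFileName {
    // referenceId: everything before the first underscore;
    // flavorName: everything between that underscore and the final extension.
    static final Pattern NAME = Pattern.compile(
            "^(?<referenceId>[^_]+)_(?<flavorName>.+)\\.(?<ext>\\w{3,})$");

    // Returns {referenceId, flavorName}, or null if the name does not fit the convention.
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("referenceId"), m.group("flavorName") };
    }

    public static void main(String[] args) {
        String[] parts = parse("referenceId_flavor-name.mp4");
        System.out.println(parts[0]); // referenceId
        System.out.println(parts[1]); // flavor-name
    }
}
```

Using [^_]+ for the reference id keeps the split at the first underscore; with .+ on both sides the split would happen at the last underscore instead.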
/**
* Parse file name according to defined slugRegex and set the extracted parsedSlug and parsedFlavor.
* The following expressions are currently recognized and used:
* - (?P<referenceId>\w+) - will be used as the drop folder file's parsed slug.
* - (?P<flavorName>\w+) - will be used as the drop folder file's parsed flavor.
* - (?P<userId>\[\w\#\.]+) - will be used as the drop folder file entry's parsed user id.
* @return bool true if file name matches the slugRegex or false otherwise
*/
private function parseRegex(DropFolderContentFileHandlerConfig $fileHandlerConfig, $fileName, &$parsedSlug, &$parsedFlavor, &$parsedUserId)
{
$matches = null;
$slugRegex = $fileHandlerConfig->getSlugRegex();
if(is_null($slugRegex) || empty($slugRegex))
{
$slugRegex = self::DEFAULT_SLUG_REGEX;
}
$matchFound = preg_match($slugRegex, $fileName, $matches);
KalturaLog::debug('slug regex: ' . $slugRegex . ' file name:' . $fileName);
if ($matchFound)
{
$parsedSlug = isset($matches[self::REFERENCE_ID_WILDCARD]) ? $matches[self::REFERENCE_ID_WILDCARD] : null;
$parsedFlavor = isset($matches[self::FLAVOR_NAME_WILDCARD]) ? $matches[self::FLAVOR_NAME_WILDCARD] : null;
$parsedUserId = isset($matches[self::USER_ID_WILDCARD]) ? $matches[self::USER_ID_WILDCARD] : null;
KalturaLog::debug('Parsed slug ['.$parsedSlug.'], Parsed flavor ['.$parsedFlavor.'], parsed user id ['. $parsedUserId .']');
}
if(!$parsedSlug)
$matchFound = false;
return $matchFound;
}
is the code that deals with the regex. I used /(?P<referenceId>.+)_(?P<flavorName>.+)[.]\w{3,}/, following this tutorial.
