Why preserveOriginal doesn't work as described in java doc? - java

I have the following configuration:
#AnalyzerDef(name = "autocompleteNGramAnalyzer",
tokenizer = #TokenizerDef(factory = StandardTokenizerFactory.class),
filters = {
#TokenFilterDef(factory = WordDelimiterFilterFactory.class,
params = #Parameter(name = "preserveOriginal", value = "1"))
preserveOriginal doc:
/** * Causes original words are preserved and added to the subword
list (Defaults to false) * * "500-42" => "500" "42"
"500-42" */
According this one I have added following word:
500-42
I rebuild index, reopen Luke and see following:
only 500 and 42 tokens where are no 500-42
Why?

Your WordDelimiterFilterFactory only works on tokens that are provided to it, which may not be the original text.
In your case, you use a StandardTokenizer, so by the time WordDelimiterFilterFactory starts processing the string, it has already been split into two tokens (500 and 42).

Related

Java regex for google maps url?

I want to parse all google map links inside a String. The format is as follows :
1st example
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z
https://www.google.com/maps/place//#38.8976763,-77.0387185,17z
https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z
https://www.google.com/maps/place/#38.8976763,-77.0387185,17z
https://google.com/maps/place/#38.8976763,-77.0387185,17z
http://google.com/maps/place/#38.8976763,-77.0387185,17z
https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z
These are all valid google map URLs (linking to White House)
Here is what I tried
String gmapLinkRegex = "(http|https)://(www\\.)?google\\.com(\\.\\w*)?/maps/(place/.*)?#(.*z)[^ ]*";
Pattern patternGmapLink = Pattern.compile(gmapLinkRegex , Pattern.CASE_INSENSITIVE);
Matcher m = patternGmapLink.matcher(s);
while (m.find()) {
logger.info("group0 = {}" , m.group(0));
String place = m.group(4);
place = StringUtils.stripEnd(place , "/"); // remove tailing '/'
place = StringUtils.stripStart(place , "place/"); // remove header 'place/'
logger.info("place = '{}'" , place);
String latLngZ = m.group(5);
logger.info("latLngZ = '{}'" , latLngZ);
}
It works in simple situation , but still buggy ...
for example
It need post-process to grab optional place information
And it cannot extract one line with two urls such as :
s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z " +
" and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
It should be two urls , but the regex matches the whole line ...
The points :
The whole URL should be matched in group(0) (including the tailing data part in 1st example),
in the 1st example , if the zoom level : 17z is removed , it is still a valid gmap URL , but my regex cannot match it.
Easier to extract optional place info
Lat / Lng extraction is must , zoom level is optional.
Able to parse multiple urls in one line
Able to process maps.google.com(.xx)/maps , I tried (www|maps\.)? but seems still buggy
Any suggestion to improve this regex ? Thanks a lot !
The dot-asterisk
.*
will always allow anything to the end of the last url.
You need "tighter" regexes, which match a single URL but not several with anything in between.
The "[^ ]*" might include the next URL if it is separated by something other than " ", which includes line break, tab, shift-space...
I propose (sorry, not tested on java), to use "anything but #" and "digit, minus, comma or dot" and "optional special string followed by tailored charset, many times".
"(http|https)://(www\.)?google\.com(\.\w*)?/maps/(place/[^#]*)?#([0123456789\.,-]*z)(\/data=[\!:\.\-0123456789abcdefmsx]+)?"
I tested the one above on a perl-regex compatible engine (np++).
Please adapt yourself, if I guessed anything wrong. The explicit list of digits can probably be replaced by "\d", I tried to minimise assumptions on regex flavor.
In order to match "URL" or "URL and URL", please use a variable storing the regex, then do "(URL and )*URL", replacing "URL" with regex var. (Asuming this is possible in java.) If the question is how to then retrieve the multiple matches: That is java, I cannot help. Let me know and I delete this answer, not to provoke deserved downvotes ;-)
(Edited to catch the data part in, previously not seen, first example, first line; and the multi URLs in one line.)
I wrote this regex to validate google maps links:
"(http:|https:)?\\/\\/(www\\.)?(maps.)?google\\.[a-z.]+\\/maps/?([\\?]|place/*[^#]*)?/*#?(ll=)?(q=)?(([\\?=]?[a-zA-Z]*[+]?)*/?#{0,1})?([0-9]{1,3}\\.[0-9]+(,|&[a-zA-Z]+=)-?[0-9]{1,3}\\.[0-9]+(,?[0-9]+(z|m))?)?(\\/?data=[\\!:\\.\\-0123456789abcdefmsx]+)?"
I tested with the following list of google maps links:
String location1 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location2 = "https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z";
String location3 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location4 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298";
String location5 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z";
String location6 = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location7 = "https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location8 = "https://www.google.com/maps/place/#38.8976763,-77.0387185,17z";
String location9 = "https://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location10 = "http://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location11 = "https://www.google.com/maps/place/#/data=!4m2!3m1!1s0x3135abf74b040853:0x6ff9dfeb960ec979";
String location12 = "https://maps.google.com/maps?q=New+York,+NY,+USA&hl=no&sll=19.808054,-63.720703&sspn=54.337928,93.076172&oq=n&hnear=New+York&t=m&z=10";
String location13 = "https://www.google.com/maps";
String location14 = "https://www.google.fr/maps";
String location15 = "https://google.fr/maps";
String location16 = "http://google.fr/maps";
String location17 = "https://www.google.de/maps";
String location18 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location19 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location20 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location21 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location22 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location23 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location24 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location25 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location26 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location27 = "http://google.com/maps/bylatlng?lat=21.01196022&lng=105.86298748";
String location28 = "https://www.google.com/maps/place/C%C3%B4ng+vi%C3%AAn+Th%E1%BB%91ng+Nh%E1%BA%A5t,+354A+%C4%90%C6%B0%E1%BB%9Dng+L%C3%AA+Du%E1%BA%A9n,+L%C3%AA+%C4%90%E1%BA%A1i+H%C3%A0nh,+%C4%90%E1%BB%91ng+%C4%90a,+H%C3%A0+N%E1%BB%99i+100000,+Vi%E1%BB%87t+Nam/#21.0121535,105.8443773,13z/data=!4m2!3m1!1s0x3135ab8ee6df247f:0xe6183d662696d2e9";

How to handle foreign addresses in Java application that calls Nominatim Webservice

The following code produces a string that has question marks as the display name when I insert an Iranian address(?????, ???????). However if I put the same url into my browser, it returns Tehran, Iran instead of question marks. I know that it has something to do with encoding but how do I get the English text as the browser returns in my java application?
String rawAddress = "Tehran";
String address = URLEncoder.encode(rawAddress, "utf-8");
String geocodeURL = "http://nominatim.openstreetmap.org/search?format=json&limit=1&polygon=0&addressdetails=0&email=myemail#gmail.com&languagecodes=en&q=";
String formattedUrl = geocodeURL + address;
URL theGeocodeUrl = new URL(formattedUrl);
System.out.println("HERE " +theGeocodeUrl.toString());
InputStream is = theGeocodeUrl.openStream();
final ObjectMapper mapper = new ObjectMapper();
final List<Object> dealData = mapper.readValue(is, List.class);
System.out.println(dealData.get(0).toString());
I tried the following code but it produced this: تهران, �ايران‎ for the display name which should be Tehran, Iran.
System.out.println(new String(dealData.get(0).toString().getBytes("UTF-8")));
Use "accept-language" in the URL parameter for Nominatim to specify the preferred language of Nominatim's results, overriding whatever default the HTTP header may set. From the documentation:
accept-language= <browser language string>
Preferred language order for showing search results, overrides the
value specified in the "Accept-Language" HTTP header. Either uses
standard rfc2616 accept-language string or a simple comma separated
list of language codes.

How to read data from CSV if contains more than excepted separators?

I use CsvJDBC for read data from a CSV. I get CSV from web service request, so not loaded from file. I adjust these properties:
Properties props = new java.util.Properties();
props.put("separator", ";"); // separator is a semicolon
props.put("fileExtension", ".txt"); // file extension is .txt
props.put("charset", "UTF-8"); // UTF-8
My sample1.txt contains these datas:
code;description
c01;d01
c02;d02
my sample2.txt contains these datas:
code;description
c01;d01
c02;d0;;;;;2
It is optional for me deleted headers from CSV. But not optional for me change semi-colon separator.
EDIT: My query for resultSet: SELECT * FROM myCSV
I want to read code column in sample1.txt and sample2.txt with:
resultSet.getString(1)
and read full description column with many semi-colons (d0;;;;;2). Is it possible with CsvJdbc driver or need to change driver?
Thank you any advice!
This is a problem that occurs when you have messy, invalid input, which you need to try to interpret, that's being read by a too-high-level package that only handles clean input. A similar example is trying to read arbitrary HTML with an XML parser - close, but no cigar.
You can guess where I'm going: you need to pre-process your input.
The preprocessing may be very easy if you can make some assumptions about the data - for example, if there are guaranteed to be no quoted semi-colons in the first column.
You could try supercsv. We have implemented such a solution in our project. More on this can be found in http://supercsv.sourceforge.net/
and
Using CsvBeanReader to read a CSV file with a variable number of columns
Finally this problem solved without a CSVJdbc or SuperCSV driver. These drivers works fine. There are possible query data form CSV file and many features content. In my case I don't need query data from CSV. Unfortunately, sometimes the description column content one or more semi-colons and which it is my separator.
First I check code in answer of #Maher Abuthraa and modified to:
private String createDescriptionFromResult(ResultSet resultSet, int columnCount) throws SQLException {
if (columnCount > 2) {
StringBuilder data_list = new StringBuilder();
for (int ii = 2; ii <= columnCount; ii++) {
data_list.append(resultSet.getString(ii));
if (ii != columnCount)
data_list.append(";");
}
// data_list has all data from all index you are looking for ..
return data_list.toString();
} else {
// use standard way
return resultSet.getString(2);
}
}
The loop started from 2, because 1 column is code and only description column content many semi-colons. The CSVJdbc driver split columns by separator ; and these semi-colons disappears from columns data. So, I explicit add semi-colons to description, except the last column, because it is not relevant in my case.
This code work fine. But not solved my all problem. When I adjusted two columns in header of CSV I get error in row, which content more than two semi-colons. So I try adjust ignore of headers or add many column name (or simple ;) to a header. In superCSV ignore of headers option work fine.
My colleague opinion was: you are don't need CSV driver, because try load CSV which not would be CSV, if separator is sometimes relevant data.
I think my colleague has right and I loaded CSV data whith following code:
InputStream in = null;
try {
in = new ByteArrayInputStream(csvData);
List lines = IOUtils.readLines(in, "UTF-8");
Iterator it = lines.iterator();
String line = "";
while (it.hasNext()) {
line = (String) it.next();
String description = null;
String code = null;
String[] columns = line.split(";");
if (columns.length >= 2) {
code = columns[0];
String[] dest = new String[columns.length - 1];
System.arraycopy(columns, 1, dest, 0, columns.length - 1);
description = org.apache.commons.lang.StringUtils.join(dest, ";");
(...)
ok.. my solution to go and read all fields if columns are more than 2 ... like:
int ccc = meta.getColumnCount();
if (ccc > 2) {
ArrayList<String> data_list = new ArrayList<String>();
for (int ii = 1; ii < ccc; ii++) {
data_list.add(resultSet.getString(i));
}
//data_list has all data from all index you are looking for ..
} else {
//use standard way
resultSet.getString(1);
}
If the table is defined to have as many columns as there could be semi-colons in the source, ignoring the initial column definitions, then the excess semi-colons would be consumed by the database driver automatically.
The most likely reason for them to appear in the final column is because the parser returns the balance of the row to the terminator in the field.
Simply increasing the number of columns in the table to match the maximum possible in the input will avoid the need for custom parsing in the program. Try:
code;description;dummy1;dummy2;dummy3;dummy4;dummy5
c01;d01
c02;d0;;;;;2
Then, the additional ';' delimiters will be consumed by the parser correctly.

regex match patern before and after a underscore

I have a file name convention {referenceId}_{flavor name}.mp4 in kaltura.
or if you are familiar with kaltura then tell me the slugRegex i could use for this naming convention that would support pre-encoded file ingestion
I have to extract referenceId and filename from it.
I'm using
/(?P)_(?P)[.]\w{3,}/
var filename = "referenceId_flavor-name.mp4";
var parts = filename.match(/([^_]+)_([^.]+)\.(\w{3})/i);
// parts is an array with 4 elements
// ["referenceId_flavor-name.mp4", "referenceId", "flavor-name", "mp4];
var file = 'refID_name.mp4',
parts = file.match(/^([^_]+)_(.+)\.mp4/, file);
Returns array:
[
'refID_name.mp4', //the whole match is always match 0
'refID', //sub-match 1
'name' //sub-match 2
]
/**
* Parse file name according to defined slugRegex and set the extracted parsedSlug and parsedFlavor.
* The following expressions are currently recognized and used:
* - (?P<referenceId>\w+) - will be used as the drop folder file's parsed slug.
* - (?P<flavorName>\w+) - will be used as the drop folder file's parsed flavor.
* - (?P<userId>\[\w\#\.]+) - will be used as the drop folder file entry's parsed user id.
* #return bool true if file name matches the slugRegex or false otherwise
*/
private function parseRegex(DropFolderContentFileHandlerConfig $fileHandlerConfig, $fileName, &$parsedSlug, &$parsedFlavor, &$parsedUserId)
{
$matches = null;
$slugRegex = $fileHandlerConfig->getSlugRegex();
if(is_null($slugRegex) || empty($slugRegex))
{
$slugRegex = self::DEFAULT_SLUG_REGEX;
}
$matchFound = preg_match($slugRegex, $fileName, $matches);
KalturaLog::debug('slug regex: ' . $slugRegex . ' file name:' . $fileName);
if ($matchFound)
{
$parsedSlug = isset($matches[self::REFERENCE_ID_WILDCARD]) ? $matches[self::REFERENCE_ID_WILDCARD] : null;
$parsedFlavor = isset($matches[self::FLAVOR_NAME_WILDCARD]) ? $matches[self::FLAVOR_NAME_WILDCARD] : null;
$parsedUserId = isset($matches[self::USER_ID_WILDCARD]) ? $matches[self::USER_ID_WILDCARD] : null;
KalturaLog::debug('Parsed slug ['.$parsedSlug.'], Parsed flavor ['.$parsedFlavor.'], parsed user id ['. $parsedUserId .']');
}
if(!$parsedSlug)
$matchFound = false;
return $matchFound;
}
is the code that deals with the regex. I used /(?P<referenceId>.+)_(?P<flavorName>.+)[.]\w{3,}/ and following this tutorial enter link description here

dynamically read/add value to the parameter of conf file with Properties

I have a message like below in my conf file.
text.message = Richard has to go to School in 01/06/2012 / 1days.
All highlighted field will be variable.
I want to read this text.me string and insert the value from my java using Properties.
I know how to read the whole string using Prop, but don't know how to read like above String which will be like.
text.message = #name# has to go to #place# in #date# / #days#.
how can I read the above string from the conf using Properties and insert data dynamically?
It can be either date or days in the string. How I can turn on and off between those parameters?
Thanks ahead.
You can use the MessageFormat API for this.
Kickoff example:
text.message = {0} has to go to {1} in {2,date,dd/MM/yyyy} / {3}
with
String message = properties.getProperty("text.message");
String formattedMessage = MessageFormat.format(message, "Richard", "School", new Date(), "1days");
System.out.println(formattedMessage); // Richard has to go to School in 31/05/2012 / 1days
You can use the MessageFormat class, which replaces dynamic placeholders in a string with the desired values.
For example, the following code...
String pattern = "{0} has to go to {1} in {2,date} / {3,number,integer} days.";
String result = MessageFormat.format(pattern, "Richard", "school", new Date(), 5);
System.out.println(result);
...will produce the following output:
Richard has to go to school in 31-May-2012 / 5 days.
You can simply get the pattern from your Properties object, then apply the MessageFormat translation.

Categories