Regular expression, omit bracket - java

I need help figuring out a regular expression that removes all the data between {{ and }}.
Below is the corpus:
{{for|the American actor|Russ Conway (actor)}}
{{Use dmy dates|date=November 2012}}
{{Infobox musical artist <!-- See Wikipedia:WikiProject_Musicians -->
| birth_name = Trevor Herbert Stanford
| birth_date = {{birth date|1925|09|2|df=y}}
| birth_place = [[Bristol]], [[England]], UK
| death_date = {{death date and age|2000|11|16|1925|09|02|df=y}}
| death_place = [[Eastbourne]], [[Sussex]], England, UK
| origin =
}}
record|hits]].<ref name="British Hit Singles & Albums"/>
{{reflist}}
==External links==
*[http://www.russconway.co.uk/ Russ Conway]
*{{YouTube|TnIpQhDn4Zg|Russ Conway playing Side Saddle}}
{{Authority control|VIAF=41343596}}
<!-- Metadata: see [[Wikipedia:Persondata]] -->
{{Persondata
| NAME =Conway, Russ
}}
{{DEFAULTSORT:Conway, Russ}}
[[Category:1925 births]]
Below is the desired output, with all the curly braces removed along with the text inside them:
record|hits]].<ref name="British Hit Singles & Albums"/>
==External links==
*[http://www.russconway.co.uk/ Russ Conway]
*
<!-- Metadata: see [[Wikipedia:Persondata]] -->
[[Category:1925 births]]
P.S. I have omitted the whitespace in the output; I will take care of that.

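This takes care of nested {{ }} by repeatedly removing the innermost {...} pairs until none are left. Note that it also removes any single-brace {...} block:
Matcher m = Pattern.compile("\\{[^{}]*\\}").matcher(input);
while (m.find()) {
    // remove every innermost brace pair, then rescan the shortened text
    input = m.replaceAll("");
    m.reset(input);
}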
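A variant of the same idea can be sketched as a small helper that only touches double-brace blocks, so single-brace text survives. The class and method names here (TemplateStripper, strip) are made up for illustration, and the sketch assumes that templates never contain a lone { or } inside them:

```java
import java.util.regex.Pattern;

final class TemplateStripper {
    // Matches an innermost {{ ... }} block, i.e. one with no braces inside it.
    private static final Pattern INNERMOST = Pattern.compile("\\{\\{[^{}]*\\}\\}");

    // Repeatedly removes innermost blocks until the text stops changing,
    // which peels nested templates from the inside out.
    static String strip(String input) {
        String previous;
        do {
            previous = input;
            input = INNERMOST.matcher(input).replaceAll("");
        } while (!input.equals(previous));
        return input;
    }

    public static void main(String[] args) {
        String sample = "a {{Infobox | birth_date = {{birth date|1925|09|2}} | origin = }} b {single} c";
        System.out.println(strip(sample));
    }
}
```

Each pass removes one nesting level, so the loop runs once per level plus a final pass that finds nothing to remove.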

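string.replaceAll("\\{\\{[\\s\\S]*?\\}\\}","");
removes each {{ }} block that has no other {{ }} nested inside it. Because the lazy quantifier stops at the first }}, the tail of a template that contains a nested template (such as the Infobox above) is left behind in the output. With the nested templates handled separately, it will produce: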
record|hits]].<ref name="British Hit Singles & Albums"/>
==External links==
*[http://www.russconway.co.uk/ Russ Conway]
*
<!-- Metadata: see [[Wikipedia:Persondata]] -->
[[Category:1925 births]]

Customizing libPhoneNumber from Google

I'm using libphonenumber from Google and want to customize some of its data. I cloned the project and modified resources/PhoneNumberMetadata.xml.
I changed the mobile number length for Egypt from 10 to 15 digits:
<territory id="EG" countryCode="20" internationalPrefix="00" nationalPrefix="0"
mobileNumberPortableRegion="true">
<availableFormats>
<numberFormat pattern="(\d)(\d{7,8})" nationalPrefixFormattingRule="$NP$FG">
<leadingDigits>[23]</leadingDigits>
<format>$1 $2</format>
</numberFormat>
<numberFormat pattern="(\d{2})(\d{6,7})" nationalPrefixFormattingRule="$NP$FG">
<leadingDigits>
1[35]|
[4-6]|
8[2468]|
9[235-7]
</leadingDigits>
<format>$1 $2</format>
</numberFormat>
<numberFormat pattern="(\d{3})(\d{3})(\d{4})" nationalPrefixFormattingRule="$NP$FG">
<leadingDigits>[189]</leadingDigits>
<format>$1 $2 $3</format>
</numberFormat>
</availableFormats>
<generalDesc>
<nationalNumberPattern>
[189]\d{8,9}|
[24-6]\d{8}|
[135]\d{7}
</nationalNumberPattern>
</generalDesc>
<!-- Subscriber numbers starting with 5 are also permitted for the area codes 040, with 5, 6
and 7 for the area code 050, with 5 and 7 for 082, with 6 for 084, with 7 for 086 and
092 and with 5 and 6 for 96. -->
<fixedLine>
<possibleLengths national="8,9" localOnly="6,7"/>
<exampleNumber>234567890</exampleNumber>
<nationalNumberPattern>
(?:
15\d|
57[23]
)\d{5,6}|
(?:
13[23]|
(?:
2[2-4]|
3
)\d|
4(?:
0[2-5]|
[578][23]|
64
)|
5(?:
0[2-7]|
5\d
)|
6[24-689]3|
8(?:
2[2-57]|
4[26]|
6[237]|
8[2-4]
)|
9(?:
2[27]|
3[24]|
52|
6[2356]|
7[2-4]
)
)\d{6}
</nationalNumberPattern>
</fixedLine>
<mobile>
<possibleLengths national="15"/>
<exampleNumber>100123456712345</exampleNumber>
<nationalNumberPattern>1[0-25]\d{13}</nationalNumberPattern>
</mobile>
<tollFree>
<possibleLengths national="10"/>
<exampleNumber>8001234567</exampleNumber>
<nationalNumberPattern>800\d{7}</nationalNumberPattern>
</tollFree>
<premiumRate>
<possibleLengths national="10"/>
<exampleNumber>9001234567</exampleNumber>
<nationalNumberPattern>900\d{7}</nationalNumberPattern>
</premiumRate>
</territory>
Then I built the project and made my own project depend on the new JAR, but mobile numbers are still validated as 10 digits, not 15.
This is the code I wrote:
public static void main(String[] args) {
    PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
    try {
        Phonenumber.PhoneNumber egyNumber = phoneUtil.parse("152234567891234", "EG");
        boolean isValidNumber = phoneUtil.isValidNumber(egyNumber);
        System.out.println(isValidNumber);
    } catch (NumberParseException e) {
        e.printStackTrace();
    }
}
This code returns false, but it should return true.
Note: the library uses a binary file for each country, but I think it's encoded.
I found that Google provides a way to customize the metadata for any country; the steps are described here:
https://github.com/google/libphonenumber/blob/master/making-metadata-changes.md
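As a quick sanity check that does not involve the library at all, the patterns from the modified XML can be tried against the test number with plain java.util.regex. This is only a sketch: the class name EgyptPatternCheck is made up, the patterns are copied from the XML above with whitespace removed, and it is an assumption (not confirmed here) that the library also checks the generalDesc pattern during validation.

```java
import java.util.regex.Pattern;

final class EgyptPatternCheck {
    // Copied from <generalDesc><nationalNumberPattern> above, whitespace removed.
    static final Pattern GENERAL_DESC =
            Pattern.compile("[189]\\d{8,9}|[24-6]\\d{8}|[135]\\d{7}");
    // Copied from the modified <mobile><nationalNumberPattern> above.
    static final Pattern MOBILE = Pattern.compile("1[0-25]\\d{13}");

    static boolean matchesGeneralDesc(String nationalNumber) {
        return GENERAL_DESC.matcher(nationalNumber).matches();
    }

    static boolean matchesMobile(String nationalNumber) {
        return MOBILE.matcher(nationalNumber).matches();
    }

    public static void main(String[] args) {
        String number = "152234567891234";
        System.out.println("mobile pattern:      " + matchesMobile(number));
        System.out.println("generalDesc pattern: " + matchesGeneralDesc(number));
    }
}
```

The mobile pattern accepts the 15-digit number while the unchanged generalDesc pattern does not, so if validation does consult generalDesc, that pattern would need the same change before the metadata is regenerated following the linked guide.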

javacc (ph-javacc-maven-plugin) generates java switch with case `\`

I'm new to JavaCC. I tried to process an existing JavaCC grammar (the JSR341, EL 3.0 grammar). It generates (almost) correct Java. However, the generated code contains an illegal switch statement. I'm using the ph-javacc-maven-plugin.
private int jjMoveStringLiteralDfa0_0() {
    switch (curChar) {
        case '#':
            return jjMoveStringLiteralDfa1_0(0x8L);
        case '$':
            return jjMoveStringLiteralDfa1_0(0x4L);
        case '\': // should be '\\'
            return jjStartNfaWithStates_0(0, 4, 2);
        default:
            return jjMoveNfa_0(7, 0);
    }
}
This is the offending grammar section from JSR341 (although I'm not sure it's the grammar itself that's causing the problem):
<DEFAULT> TOKEN :
{
< LITERAL_EXPRESSION:
((~["\\", "$", "#"])
| ("\\" ("\\" | "$" | "#"))
| ("$" ~["{", "$"])
| ("#" ~["{", "#"])
)+
| "$"
| "#"
>
|
< START_DYNAMIC_EXPRESSION: "${" > {stack.push(DEFAULT);}:
IN_EXPRESSION
|
< START_DEFERRED_EXPRESSION: "#{" > {stack.push(DEFAULT);}:
IN_EXPRESSION
}
<DEFAULT> SKIP : { "\\" }
I played around with the options (JAVA_UNICODE_ESCAPE, UNICODE_INPUT) and the grammar, but without result.
Question: how do I make JavaCC generate a valid Java switch statement, i.e., with '\\' instead of '\'?
The observed behaviour is an issue and will be solved in parser-generator-cc 1.1.0.
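For illustration, this is roughly the kind of escaping a code generator has to apply before writing a character into a case label. It is a hypothetical helper (CharLiteral.toJavaCharLiteral is a made-up name), not the plugin's actual code:

```java
final class CharLiteral {
    // Returns the Java source text of a char literal for c,
    // escaping the characters that cannot appear unescaped between single quotes.
    static String toJavaCharLiteral(char c) {
        switch (c) {
            case '\\': return "'\\\\'"; // backslash must be doubled
            case '\'': return "'\\''";  // single quote must be escaped
            case '\n': return "'\\n'";
            case '\r': return "'\\r'";
            case '\t': return "'\\t'";
            default:   return "'" + c + "'";
        }
    }

    public static void main(String[] args) {
        System.out.println("case " + toJavaCharLiteral('#') + ":");
        System.out.println("case " + toJavaCharLiteral('\\') + ":");
    }
}
```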

SparkSQL + Java: Pojo to Tabular Format while working with Datasets

I'm pretty new to Spark SQL. While implementing a training task I ran into the following issue and could not find an answer (the examples below are a bit contrived, but should be fine for demonstration purposes).
My app reads a Parquet file and creates a dataset based on its content:
DataFrame input = sqlContext.read().parquet("src/test/resources/integration/input/source.gz.parquet");
Dataset<Row> dataset = input.as(RowEncoder$.MODULE$.apply(input.schema()));
The dataset.show() call results in:
+------------+----------------+--------+
+ Names + Gender + Age +
+------------+----------------+--------+
| Jack, Jill | Male, Female | 30, 25 |
Then I convert the dataset into a new dataset with the Person type inside:
public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData
        .flatMap((Row sourceRow) -> {
            // code to parse an input row and split person data goes here
            Person person1 = new Person(name1, gender1, age1);
            Person person2 = new Person(name2, gender2, age2);
            return Arrays.asList(person1, person2);
        }, Encoders.bean(Person.class));
}
where
public abstract class Human implements Serializable {
    protected String name;
    protected String gender;
    // getters/setters go here
    // default constructor + constructor with the name and gender params
}
public class Person extends Human {
    private String age;
    // getters/setters for the age param go here
    // default constructor + constructor with the age, name and gender params
    // overridden toString() method which returns the string: (<name>, <gender>, <age>)
}
Finally, when I show the dataset's content I expect to see
+------------+----------------+--------+
+ name + gender + age +
+------------+----------------+--------+
| Jack | Male | 30 |
| Jill | Female | 25 |
However, I see
+-------------------+----------------+--------+
+ name + gender + age +
+-------------------+----------------+--------+
|(Jack, Male, 30) | | |
|(Jill, Female, 25) | | |
This is the result of the toString() method, although the header is correct.
I believe something is wrong with the Encoder, because if I use Encoders.javaSerialization(T) or Encoders.kryo(T) it shows
+------------------+
+ value +
+------------------+
|(Jack, Male, 30) |
|(Jill, Female, 25)|
What worries me most is that incorrect usage of encoders could result in incorrect SerDe and/or performance penalties.
I cannot see anything special in the Spark Java examples that I can find...
Could you please suggest what I am doing wrong?
UPDATE 1
Here are my project's dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.6.2</version>
</dependency>
SOLUTION
As abaghel suggested, I upgraded to version 2.0.2 (be aware that version 2.0.0 has a bug on Windows), used Dataset instead of DataFrame everywhere in my code (it seems the DataFrame class is no longer part of the Java API starting from 2.0.0), and used the iterator-based flatMap function to transform from Row to Person.
Just to share: using the TraversableOnce-based flatMap on version 1.6.2 did not work for me, as it threw a 'MyPersonConversion$function1 not Serializable' exception.
Now everything is working as expected.
Which version of Spark are you using? The flatMap call you provided does not compile with version 2.2.0; the required return type is Iterator<Person>. Use the FlatMapFunction below and you will get the desired output.
public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData.flatMap(row -> {
        String[] nameArr = row.getString(0).split(",");
        String[] genArr = row.getString(1).split(",");
        String[] ageArr = row.getString(2).split(",");
        Person person1 = new Person(nameArr[0], genArr[0], ageArr[0]);
        Person person2 = new Person(nameArr[1], genArr[1], ageArr[1]);
        return Arrays.asList(person1, person2).iterator();
    }, Encoders.bean(Person.class));
}
// Call the function
Dataset<Person> dataset1 = transformToPerson(dataset);
dataset1.show();
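The row-splitting part can be tried out without Spark. The sketch below is plain Java with a minimal stand-in Person bean (public no-arg constructor plus getters and setters, the shape a bean encoder generally expects); the names PersonRowSplitter and split are made up for illustration. Unlike split(",") above, it also drops the spaces around the commas, so " Jill" becomes "Jill":

```java
import java.util.ArrayList;
import java.util.List;

final class PersonRowSplitter {
    /** Minimal stand-in for the Person bean from the question. */
    public static final class Person {
        private String name;
        private String gender;
        private String age;

        public Person() { }

        public Person(String name, String gender, String age) {
            this.name = name;
            this.gender = gender;
            this.age = age;
        }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getGender() { return gender; }
        public void setGender(String gender) { this.gender = gender; }
        public String getAge() { return age; }
        public void setAge(String age) { this.age = age; }
    }

    // Splits the three comma-separated columns of one source row into Person objects.
    static List<Person> split(String names, String genders, String ages) {
        String[] n = names.trim().split("\\s*,\\s*");
        String[] g = genders.trim().split("\\s*,\\s*");
        String[] a = ages.trim().split("\\s*,\\s*");
        // Use the shortest column so a ragged row cannot cause an index error.
        int count = Math.min(n.length, Math.min(g.length, a.length));
        List<Person> people = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            people.add(new Person(n[i], g[i], a[i]));
        }
        return people;
    }

    public static void main(String[] args) {
        for (Person p : split("Jack, Jill", "Male, Female", "30, 25")) {
            System.out.println(p.getName() + " | " + p.getGender() + " | " + p.getAge());
        }
    }
}
```

Inside the flatMap lambda, the returned list's iterator() would play the role of Arrays.asList(person1, person2).iterator().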

Parse response from Wikipedia API

I am trying to parse the response from the Wikipedia API (MediaWiki). The URLs I am using are of the form:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=Argo_(2012_film)
The response from the API has the Wikipedia content inside an XML tag, which looks like this (this is just an incomplete sample):
{{Use mdy dates|date=October 2012}} {{Infobox film | name = Argo | image =
Argo2012Poster.jpg | alt = <!-- See: WP:ALT --> | caption = Theatrical release poster |
tagline = "The movie was fake. The mission was real." | director = [[Ben Affleck]] |
producer = [[Grant Heslov]]<br />Ben Affleck<br />[[George Clooney]] | based on = {{Based
on|''The Master of Disguise''|[[Tony Mendez|Antonio J. Mendez]]}}<br />{{Based on|''The
Great Escape''|[[Joshuah Bearman]]}} | screenplay = [[Chris Terrio]] | starring = Ben
Affleck<br />[[Bryan Cranston]]<br />[[Alan Arkin]]<br />[[John Goodman]] | music =
[[Alexandre Desplat]] | cinematography = [[Rodrigo Prieto]] | editing = [[William
Goldenberg]] | studio = [[Graham King|GK Films]]<br />[[Smokehouse Pictures]] | distributor =
[[Warner Bros.]] | released = {{Film date|2012|08|31|Telluride Film
Festival|2012|10|12|United States}} | runtime = 120 minutes<ref> ...continued
This does not look like JSON or XML. How do I parse it?
If you want to get the content parsed as HTML, add &rvparse to the query.
For example when you execute the query
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=Argo_%282012_film%29&rvparse
the response contains something like (after skipping the infobox):
<i><b>Argo</b></i> is a 2012 American <a href="/wiki/Political_thriller"
title="Political thriller">political thriller</a> film directed by Ben Affleck.
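Building the query URL from a page title might look like the sketch below. The class and method names (WikiQueryUrl, build) are made up; only the URL parameters come from the queries above. It uses the URLEncoder.encode(String, Charset) overload, which needs Java 10 or later, and it does not perform the HTTP request itself:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class WikiQueryUrl {
    private static final String BASE =
            "https://en.wikipedia.org/w/api.php?action=query&prop=revisions"
            + "&rvprop=content&format=xml&titles=";

    // Builds the revisions query for one title; parseToHtml appends &rvparse
    // so that the content comes back as HTML instead of wiki markup.
    static String build(String title, boolean parseToHtml) {
        String url = BASE + URLEncoder.encode(title, StandardCharsets.UTF_8);
        return parseToHtml ? url + "&rvparse" : url;
    }

    public static void main(String[] args) {
        System.out.println(build("Argo_(2012_film)", true));
    }
}
```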

JavaCC: How to handle tokens that contain common words

I'm trying to create a parser for source code like this:
[code table 1.0]
code table code_table_name
id = 500
desc = "my code table one"
end code table
... and here below is the grammar I defined:
PARSER_BEGIN(CodeTableParser)
...
PARSER_END(CodeTableParser)
/* skip spaces */
SKIP: {
" "
| "\t"
| "\r"
| "\n"
}
/* reserved words */
TOKEN [IGNORE_CASE]: {
<CODE_TAB_HEADER: "[code table 1.0]">
| <CODE_TAB_END: "end" (" ")+ <CODE_TAB_BEGIN>>
| <CODE_TAB_BEGIN: <IDENT> | "code" (" ")+ "table">
| <ID: "id">
| <DESC: "desc">
}
/* token images */
TOKEN: {
<NUMBER: (<DIGIT>)+>
| <IDENT: (<ALPHA>)+>
| <VALUE: (<ALPHA> ["[", "]"])+>
| <STRING: <QUOTED>>
}
TOKEN: {
<#ALPHA: ["A"-"Z", "a"-"z", "0"-"9", "$", "_", "."]>
| <#DIGIT: ["0"-"9"]>
| <#QUOTED: "\"" (~["\""])* "\"">
}
void parse():
{
}
{
expression() <EOF>
}
void expression():
{
Token tCodeTab;
}
{
<CODE_TAB_HEADER>
<CODE_TAB_BEGIN>
tCodeTab = <IDENT>
(
<ID>
<DESC>
)*
<CODE_TAB_END>
}
The problem is that the parser correctly identifies the CODE_TAB_BEGIN token ("code table"), but it doesn't identify the IDENT token ("code_table_name"), since it contains words already contained in CODE_TAB_BEGIN (i.e. "code"). The parser complains that "code is followed by invalid character _".
Given that, I'm wondering what I'm missing to make the parser work correctly. I'm a newbie and any help would be really appreciated ;-)
Thanks,
j3d
Your lexer will never produce an IDENT because the production
<CODE_TAB_BEGIN: <IDENT> | "code" (" ")+ "table">
says that every IDENT can be a CODE_TAB_BEGIN and, as this production comes first, it beats the production for IDENT by the first match rule. (RTFFAQ)
Replace that production by
<CODE_TAB_BEGIN: "code" (" ")+ "table">
You will run into trouble with ID and DESC, but this gets you past the second line of input.
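The effect of the rule can be imitated in plain Java. This is only a toy simulation, not JavaCC itself: each token is a regular expression, the longest match wins, and on a tie the earliest rule wins. All names (FirstMatchDemo, firstToken, ORIGINAL, FIXED) are made up for illustration, and the two alternatives of the original CODE_TAB_BEGIN production are listed as separate rows:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstMatchDemo {
    // Each row is {token name, regex}, in the order the productions appear.
    static final String[][] ORIGINAL = {
        {"CODE_TAB_BEGIN", "[A-Za-z0-9$_.]+"}, // the <IDENT> alternative
        {"CODE_TAB_BEGIN", "code +table"},
        {"ID", "id"},
        {"DESC", "desc"},
        {"IDENT", "[A-Za-z0-9$_.]+"},
    };

    static final String[][] FIXED = {
        {"CODE_TAB_BEGIN", "code +table"},
        {"ID", "id"},
        {"DESC", "desc"},
        {"IDENT", "[A-Za-z0-9$_.]+"},
    };

    // Returns the name of the token matched at the start of input:
    // longest match wins, and on equal length the earlier rule is kept.
    static String firstToken(String[][] rules, String input) {
        String best = null;
        int bestLength = 0;
        for (String[] rule : rules) {
            Matcher m = Pattern.compile(rule[1], Pattern.CASE_INSENSITIVE).matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule[0];
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(firstToken(ORIGINAL, "code_table_name"));
        System.out.println(firstToken(FIXED, "code_table_name"));
        System.out.println(firstToken(FIXED, "code table code_table_name"));
    }
}
```

With the original rules the identifier is claimed by CODE_TAB_BEGIN because it ties with IDENT on length and comes first; with the replaced production the same input is reported as IDENT.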
