Matching a content within a large text using reg ex in java


Problem : I need to match a content within a large text (Wikipedia dump consisting of xml pages) in java.
Content required: Infobox
Reg ex used : "\\{\\{Infobox(.*?)\\}\\}"
Issue: the lazy pattern above stops at the first occurrence of }} inside the infobox, and if I remove the ? character the greedy pattern runs to the last occurrence of }} in the text. I want to extract just the infobox, so the }} should match the end of the infobox.
Ex info box:
{{infobox RPG
|title= Amber Diceless Roleplaying Game
|image= [[Image:Amber DRPG.jpg|200px]]
|caption= Cover of the main ''Amber DRPG'' rulebook (art by [[Stephen Hickman]])
|designer= [[Erick Wujcik]]
|publisher= [[Phage Press]]<br>[[Guardians of Order]]
|date= 1991
|genre= [[Fantasy]]
|system= Custom (direct comparison of statistics without dice)
|footnotes=
}}
Code snippet:
String regex = "\\{\\{Infobox(.*?)\\}\\}";
Pattern p1 = Pattern.compile(regex, Pattern.DOTALL);
Matcher m1 = p1.matcher(xmlPage.getText());
String workgroup = "";
while(m1.find()){
workgroup = m1.group();
}

The solution depends on the nesting depth of {{ .. }} blocks inside the infobox block. If the inner blocks don't nest, that is, there are {{ ... }} blocks but NOT {{ .. {{ .. }} .. }} blocks, then you can try the regex: infobox([^\\{]*(\\{\\{[^\\}]*\\}\\})*.*?)\\}\\}
I tested this on the string: "A {{ start {{infobox abc {{ efg }} hij }}end }} B" and was able to match " abc {{ efg }} hij "
If the nesting of {{ .. }} blocks is deeper, a regex won't help, because you can't tell the regex engine how big the inner block is. To handle that you need to count the opening {{ and closing }} sequences and extract the string once they balance. That means you are better off reading the text one character at a time and processing it.
Explanation of regex:
We start with infobox and then open the group capture parenthesis. We then look for a string of characters which are NOT {.
Following that we look for zero or more "groups" of the form {{ .. }} (BUT with no nested blocks there-in). Nesting is not allowed here because we use [^\\}] to look for the end of the block by only allowing non-} characters inside the block.
Finally we accept the characters just prior to the closing }}.
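The counting approach described above can be sketched as follows. This is a minimal illustration, not code from the original answer: the class name InfoboxExtractor and the method extractInfoboxes are hypothetical, and the sketch assumes an infobox always opens with the literal text {{infobox (in any letter case) and that braces are balanced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts infobox blocks by counting {{ and }} pairs.
final class InfoboxExtractor {

    private static final String OPENER = "{{infobox";

    // Returns every "{{infobox ... }}" block, including any nested {{ ... }} templates.
    static List<String> extractInfoboxes(String text) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Case-insensitive check for the opener at position i.
            if (!text.regionMatches(true, i, OPENER, 0, OPENER.length())) {
                i++;
                continue;
            }
            int start = i;
            int depth = 0;
            int end = -1;
            while (i < text.length()) {
                if (text.startsWith("{{", i)) {
                    depth++;            // one level deeper
                    i += 2;
                } else if (text.startsWith("}}", i)) {
                    depth--;            // one level closed
                    i += 2;
                    if (depth == 0) {   // the infobox itself just closed
                        end = i;
                        break;
                    }
                } else {
                    i++;
                }
            }
            if (end < 0) {
                break;                  // unbalanced braces: give up on the rest
            }
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "A {{ start {{infobox abc {{ efg }} hij }}end }} B";
        System.out.println(extractInfoboxes(sample));
    }
}
```

Because the scan keeps only a depth counter, it copes with any nesting depth, which is exactly what the regex cannot do.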

You should try this regex:
String regex = "\\{\\{[Ii]nfobox([^\\}].*\\n+)*\\}\\}";
or
Pattern pattern = Pattern.compile("\\{\\{[Ii]nfobox([^\\}].*\\n+)*\\}\\}");
Explanation: the above regular expression looks for
1. \\{\\{ - matches the opening {{
2. [Ii]nfobox - matches Infobox or infobox
3. ([^\\}].*\\n+)* - matches the body of the infobox line by line, as long as no line starts with }
----3.a. [^\\}] - matches any single character except }
----3.b. .* - matches any character except a line break, any number of times
----3.c. \\n+ - matches a newline one or more times
4. \\}\\} - matches the closing }}

If your xmlPage.getText() returns content similar to this:
{{infobox ... }}{{infobox .... {{ nested stuff }} }}{{infobox ... }}
where you have both multiple infoboxes on the same level and nested blocks (and the nesting level can be anything), then you can't use a regexp to parse the content. Why? Because the structure behaves like HTML or XML, which means it is not a regular language. You can find multiple answers on the topic "regexp and html" with a good explanation of this problem. For example here:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
But if you can guarantee that you won't have multiple infoboxes on the same level, only nested ones, then you can parse the document by removing the ? (which makes the quantifier greedy).

public static void extractValuesTest(String[] args) {
String payloadformatstr= "selected card is |api:card_number| with |api:title|";
String receivedInputString= "siddiselected card is 1234567 with dbs card";
int firstIndex = payloadformatstr.indexOf("|");
List<String> slotSplits= extarctString(payloadformatstr, "\\|(.*?)\\|");
String[] mainSplits = payloadformatstr.split("\\|(.*?)\\|");
int mainsplitLength = mainSplits.length;
int slotNumber=0;
Map<String,String> parsedValues = new HashMap<>();
String replaceString="";
int receivedstringLength = receivedInputString.length();
for (String slot : slotSplits) {
String[] slotArray = slot.split(":");
int processLength = slotArray !=null ? slotArray.length : 0;
String slotType = null;
String slotKey = null;
if(processLength == 2){
slotType = slotArray[0];
slotKey = slotArray[1];
}
/*String slotBefore= (firstIndex != 0 && slotNumber < mainsplitLength) ? mainSplits[slotNumber]:"";
String slotAfter= (firstIndex != 0 && slotNumber+1 < mainsplitLength) ? mainSplits[slotNumber+1]:"";
int startIndex = receivedInputString.indexOf(slotBefore)+slotBefore.length();
int endIndex = receivedInputString.indexOf(slotAfter);
String extractedValue = receivedInputString.substring(startIndex, endIndex);*/
String slotBefore= (firstIndex != 0 && slotNumber < mainsplitLength) ? mainSplits[slotNumber]:null;
String slotAfter= (firstIndex != 0 && slotNumber+1 < mainsplitLength) ? mainSplits[slotNumber+1]:null;
int startIndex = StringUtils.isEmpty(slotBefore) ? 0:receivedInputString.indexOf(slotBefore)+slotBefore.length();
//int startIndex = receivedInputString.indexOf(slotBefore)+slotBefore.length();
int endIndex = StringUtils.isEmpty(slotAfter) ? receivedstringLength: receivedInputString.indexOf(slotAfter);
String extractedValue = (endIndex != receivedstringLength) ? receivedInputString.substring(startIndex, endIndex):
receivedInputString.substring(startIndex);
System.out.println("Extracted value is "+extractedValue);
parsedValues.put(slotKey, extractedValue);
replaceString+=slotBefore+(extractedValue != null ? extractedValue:"");
//String extractedValue = extarctSlotValue(receivedInputString,slotBefore,slotAfter);
slotNumber++;
}
System.out.println(replaceString);
System.out.println(parsedValues);
}
public static void replaceTheslotsWithValues(String payloadformatstr,String receivedInputString,String slotPattern,String statPatternOfSlot) {
payloadformatstr= "selected card is |api:card_number| with |api:title|.";
receivedInputString= "selected card is 1234567 with dbs card.";
slotPattern="\\|(.*?)\\|";
statPatternOfSlot="|";
int firstIndex = payloadformatstr.indexOf(statPatternOfSlot);
List<String> slotSplits= extarctString(payloadformatstr, slotPattern);
String[] mainSplits = payloadformatstr.split(slotPattern);
int mainsplitLength = mainSplits.length;
int slotNumber=0;
Map<String,String> parsedValues = new HashMap<>();
String replaceString="";
for (String slot : slotSplits) {
String[] slotArray = slot.split(":");
int processLength = slotArray !=null ? slotArray.length : 0;
String slotType = null;
String slotKey = null;
if(processLength == 2){
slotType = slotArray[0];
slotKey = slotArray[1];
}
String slotBefore= (firstIndex != 0 && slotNumber < mainsplitLength) ? mainSplits[slotNumber]:"";
String slotAfter= (firstIndex != 0 && slotNumber+1 < mainsplitLength) ? mainSplits[slotNumber+1]:"";
int startIndex = receivedInputString.indexOf(slotBefore)+slotBefore.length();
int endIndex = receivedInputString.indexOf(slotAfter);
String extractedValue = receivedInputString.substring(startIndex, endIndex);
System.out.println("Extracted value is "+extractedValue);
parsedValues.put(slotKey, extractedValue);
replaceString+=slotBefore+(extractedValue != null ? extractedValue:"");
//String extractedValue = extarctSlotValue(receivedInputString,slotBefore,slotAfter);
slotNumber++;
}
System.out.println(replaceString);
System.out.println(parsedValues);
}

Related

Length of String within tags in java

We need to find the length of the tag names within the tags in java
{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}
so the length of Student tag is 7 and that of subject tag is 7 and that of marks is 5.
I am trying to split the tags and then find the length of each string within the tag.
But the code I am trying gives me only the first tag name and not others.
Can you please help me on this?
I am very new to java. Please let me know if this is a very silly question.
Code part:
System.out.println(
getParenthesesContent("{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}"));
public static String getParenthesesContent(String str) {
return str.substring(str.indexOf('{')+1,str.indexOf('}'));
}

You can use Pattern with this regex \\{([a-zA-Z]*)\\} :
String text = "{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}";
Matcher matcher = Pattern.compile("\\{([a-zA-Z]*)\\}").matcher(text);
while (matcher.find()) {
System.out.println(
String.format(
"tag name = %s, Length = %d ",
matcher.group(1),
matcher.group(1).length()
)
);
}
Outputs
tag name = Student, Length = 7
tag name = Subject, Length = 7
tag name = Marks, Length = 5

You might want to give a try to another regex:
String s = "{Abc}{Defg}100{Hij}100{/Klmopr}{/Stuvw}"; // just a sample String
Pattern p = Pattern.compile("\\{\\W*(\\w++)\\W*\\}");
Matcher m = p.matcher(s);
while(m.find()) {
System.out.println(m.group(1) + ", length: " + m.group(1).length());
}
Output you get:
Abc, length: 3
Defg, length: 4
Hij, length: 3
Klmopr, length: 6
Stuvw, length: 5
If you need to use charAt() to walk over the input String, you might want to consider using something like this (I made some explanations in the comments to the code):
String s = "{Student}{Subject}{Marks}100{/Marks}{/Subject}{/Student}";
ArrayList<String> tags = new ArrayList<>();
for(int i = 0; i < s.length(); i++) {
StringBuilder sb = new StringBuilder(); // Use StringBuilder and its append() method to append Strings (it's more efficient than "+=")
if(s.charAt(i) == '{') { // If start of tag is found...
while(i < s.length() && !(Character.isLetter(s.charAt(i)))) { // Skip characters that are not letters (stop at the end of the input)
i++;
}
while(i < s.length() && Character.isLetter(s.charAt(i))) { // Append the letters that are found until the tag name ends
sb.append(s.charAt(i));
i++;
}
if(!(tags.contains(sb.toString()))) { // Add final String to ArrayList only if it not contained here yet
tags.add(sb.toString());
}
}
}
for(String tag : tags) { // Printing Strings contained in ArrayList and their length
System.out.println(tag + ", length: " + tag.length());
}
Output you get:
Student, length: 7
Subject, length: 7
Marks, length: 5

Yes, use a regular expression: find the pattern and apply it.

How to get exact match keyword from the given string using java?

I'm trying to match the exact keyword AdvanceJava against the given inputText string, but both the if and the else-if branches execute; I want only the AdvanceJava keyword to be matched.
String inputText = ("iwanttoknowrelatedtoAdvancejava").toLowerCase().replaceAll("\\s", "");
String match = "java";
List keywordsList = new ArrayList<>();//where keywordsList{advance,core,programming} -> keywordlist fetch
// from database
Enumeration e = Collections.enumeration(keywordsList);
int size = keywordsList.size();
while (e.hasMoreElements()) {
for (int i = 0; i < size; i++) {
String s1 = (String) keywordsList.get(i);
if (inputText.contains(s1) && inputText.contains(match)) {
System.out.println("Yes we providing " + s1);
} else if (!inputText.contains(s1) && inputText.contains(match)) {
System.out.println("Yes we are working on java");
}
}
break;
}
Thanks

You can simply do this by using the Pattern and Matcher classes:
Pattern p = Pattern.compile("java");
Matcher m = p.matcher("Print this");
m.find();
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
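As a minimal sketch of calling find() and group() repeatedly, the helper below collects every match into a list. The class name FindAllMatches, the method findAll, and the sample pattern are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects all matches of a regex in an input string.
final class FindAllMatches {

    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        // Each successful find() advances past the previous match.
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // A keyword (advance or core), optional whitespace, then "java".
        System.out.println(findAll("(?:advance|core)\\s*java",
                "I want core Java and Advance java"));
    }
}
```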

Here's how you can achieve what you seek using pattern matching.
In the first example I have taken your input text as it is. This only improves your algorithm which has O(n^2) performance.
String inputText = ("iwanttoknowrelatedtoAdvancejava").toLowerCase().replaceAll("\\s", "");
String match = "java";
List<String> keywordsList = Arrays.asList("advance", "core", "programming");
for (String keyword : keywordsList) {
Pattern p = Pattern.compile(keyword.concat(match));
Matcher m = p.matcher(inputText);
//System.out.println(m.find());
if (m.find()) {
System.out.println("Yes we are providing " + keyword.concat(match));
}
}
But we can improve this into a better implementation. Here's a more generic version of the above. This code doesn't manipulate the input text before matching; instead, it uses a more general regular expression that ignores whitespace between the words and matches case-insensitively.
String inputText = "i want to know related to Advance java";
String match = "java";
List<String> keywordsList = Arrays.asList("advance", "core", "programming");
for (String keyword : keywordsList) {
Pattern p = Pattern.compile(MessageFormat.format("(?i)({0}\\s*{1})", keyword, match));
Pattern p1 = Pattern.compile(MessageFormat.format("(?i)({0})", match));
Matcher m = p.matcher(inputText);
Matcher m1 = p1.matcher(inputText);
//System.out.println(m.find());
if(m.find()) {
System.out.println("Yes we are providing " + keyword.concat(match));
} else if(m1.find()) {
System.out.println("Yes we are working with " + match);
}
}

@sithum - Thanks, but the output shows both branches of the if/else being executed. Please refer to the screenshot I attached here.
I applied the following logic and it works fine. Please refer to it, thanks.
String inputText = ("iwanttoknowrelatedtoAdvancejava").toLowerCase().replaceAll("\\s", "");
String match = "java";
List<String> keywordsList = session.createSQLQuery("SELECT QUESTIONARIES_RAISED FROM QUERIES").list(); // Fetch values from database (advance,core,programming)
String uniqueKeyword=null;
String commonKeyword= null;
int size =keywordsList.size();
for(int i=0;i<size;i++){
String s1 = (String) keywordsList.get(i);//get values one by one from list
if(inputText.contains(match)){
if(inputText.contains(s1) && inputText.contains(match)){
Queries q1 = new Queries();
q1.setQuestionariesRaised(s1); //set matched keyword to getter setter method
keywordsList1=session.createQuery("from Queries sentence where questionariesRaised='"+q1.getQuestionariesRaised()+"'").list(); // based on matched keyword fetch according to matched keyword sentence which stored in database
for(Queries ob : keywordsList1){
uniqueKeyword= ob.getSentence().toString();// Store fetched sentence to on string variable
}
break;
}else {
commonKeyword= "java only";
}
}
}}
if(uniqueKeyword!= null){
System.out.println("Yes we providing......................" + uniqueKeyword);
}else if(commonKeyword!= null){
System.out.println("Yes we providing " + commonKeyword);
}else{
}

Java Split method strings into method name and argument

I am writing a small programming language for a game I am making, this language will be for allowing users to define their own spells for the wizard entity outside the internal game code. I have the language written down, but I'm not entirely sure how to change a string like
setSpellName("Fireball")
setSplashDamage(32,5)
into an array which would have the method name and the arguments after it, like
{"setSpellName","Fireball"}
{"setSplashDamage","32","5"}
How could I do this using java's String.split or string regex's?
Thanks in advance.

Since you're only interested in the function name and parameters I'd suggest scanning up to the first instance of ( and then to the last ) for the params, as so.
String input = "setSpellName(\"Fireball\")";
String functionName = input.substring(0, input.indexOf('('));
String[] params = input.substring(input.indexOf('(') + 1, input.lastIndexOf(')')).split(","); // text between the first ( and the last )

To capture the String
setSpellName("Fireball")
Do something like this:
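String[] line = argument.split("\\("); // ( is a regex metacharacter, so it must be escaped
Gets you "setSpellName" at line[0] and "Fireball") at line[1]
Get rid of the last parenthesis like this
line[1] = line[1].replace(")", "").trim(); // replace() takes a literal, so no escaping is needed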
Build your JSON with the two "cleaned" Strings.
There's probably a better way with Regex, but this is the quick and dirty way.
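One possible regex-based version is sketched below. It is an illustration under stated assumptions, not the original poster's code: the class name CommandParser and the method parse are hypothetical, each line is assumed to hold exactly one call of the form name(arg1,arg2,...), and commas inside quoted arguments are not supported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser: turns name(arg1,arg2) into {"name", "arg1", "arg2"}.
final class CommandParser {

    // Group 1: the method name. Group 2: everything between the first ( and the last ).
    private static final Pattern CALL = Pattern.compile("\\s*(\\w+)\\s*\\((.*)\\)\\s*");

    static String[] parse(String line) {
        Matcher m = CALL.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a call: " + line);
        }
        List<String> parts = new ArrayList<>();
        parts.add(m.group(1));
        String args = m.group(2).trim();
        if (!args.isEmpty()) {
            for (String arg : args.split(",")) {
                arg = arg.trim();
                // Strip one pair of surrounding double quotes, if present.
                if (arg.length() >= 2 && arg.startsWith("\"") && arg.endsWith("\"")) {
                    arg = arg.substring(1, arg.length() - 1);
                }
                parts.add(arg);
            }
        }
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", parse("setSpellName(\"Fireball\")")));
        System.out.println(String.join(" | ", parse("setSplashDamage(32,5)")));
    }
}
```

Using matches() rather than find() rejects lines with trailing junk, which seems a reasonable default for a small scripting language.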

With String.indexOf() and String.substring(), you can parse out the function and parameters. Once you have parsed them out, put quotes around each of them. Then combine them all back together, delimited by commas and wrapped in curly braces.
public static void main(String[] args) throws Exception {
List<String> commands = new ArrayList() {{
add("setSpellName(\"Fireball\")");
add("setSplashDamage(32,5)");
}};
for (String command : commands) {
int openParen = command.indexOf("(");
String function = String.format("\"%s\"", command.substring(0, openParen));
String[] parameters = command.substring(openParen + 1, command.indexOf(")")).split(",");
for (int i = 0; i < parameters.length; i++) {
// Surround parameter with double quotes
if (!parameters[i].startsWith("\"")) {
parameters[i] = String.format("\"%s\"", parameters[i]);
}
}
String combine = String.format("{%s,%s}", function, String.join(",", parameters));
System.out.println(combine);
}
}
Results:
{"setSpellName","Fireball"}
{"setSplashDamage","32","5"}

This is a solution using regex, use this Regex "([\\w]+)\\(\"?([\\w]+)\"?\\)":
String input = "setSpellName(\"Fireball\")";
String pattern = "([\\w]+)\\(\"?([\\w]+)\"?\\)";
Pattern r = Pattern.compile(pattern);
String[] matches;
Matcher m = r.matcher(input);
if (m.find()) {
System.out.println("Found value: " + m.group(1));
System.out.println("Found value: " + m.group(2));
String[] params = m.group(2).split(",");
if (params.length > 1) {
matches = new String[params.length + 1];
matches[0] = m.group(1);
System.out.println(params.length);
for (int i = 0; i < params.length; i++) {
matches[i + 1] = params[i];
}
System.out.println(String.join(" :: ", matches));
} else {
matches = new String[2];
matches[0] = m.group(1);
matches[1] = m.group(2);
System.out.println(String.join(", ", matches));
}
}
([\\w]+) is the first group to get the function name.
\\(\"?([\\w]+)\"?\\) is the second group to get the parameters.

complex regular expression in Java

I have a rather complex (to me it seems rather complex) problem that I'm using regular expressions in Java for:
I can get any text string that must be of the format:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
I started with a regular expression for extracting the text between the M:/:D:/:C:/:Q: as:
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
And that works fine if the <either a url or string> is just an alphanumeric string. But it all falls apart when the embedded string is a url of the format:
tcp://someurl.something:port
Can anyone help me adjust the above reg exp to extract the text after :D: to be either a url or a alpha-numeric string?
Here's an example:
public static void main(String[] args) {
String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
boolean matchFound = false;
ArrayList<String> values = new ArrayList<>();
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
Matcher m3 = Pattern.compile(pattern2).matcher(name);
while (m3.find()) {
matchFound = true;
String m = m3.group(2);
System.out.println("regex found match: " + m);
values.add(m);
}
}
In the above example, my results would be:
myString1
tcp://someurl.com:8989
myString2
1
Note that the strings can be of variable length and are alphanumeric, but some other characters are allowed (such as the :// and . characters of the URL format).

You mention that the format is constant:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
Capture groups can do this for you with the pattern:
"M:(.*):D:(.*):C:(.*):Q:(.*)"
Or you can do a String.split() with a pattern of "M:|:D:|:C:|:Q:". However, the split will return an empty element at the first index. Everything else will follow.
public static void main(String[] args) throws Exception {
System.out.println("Regex: ");
String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
}
}
System.out.println();
System.out.println("String.split(): ");
String[] pieces = data.split("M:|:D:|:C:|:Q:");
for (String piece : pieces) {
System.out.println(piece);
}
}
Results:
Regex:
<some text>
tcp://someurl.something:port
<some more text>
<a number>
String.split():
<some text>
tcp://someurl.something:port
<some more text>
<a number>

To extract the URL/text part you don't need the regular expression. Use
int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);

Assuming you need to do some validation along with the parsing:
break the regex into different parts like this:
String m_regex = "[\\w.]+"; // in Java a . inside [] is just a plain dot
String url_regex = "."; // placeholder: there are many URL regexes online, pick your favorite
String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // url or a sequence of alphanumeric characters
String c_regex = "[\\w.]+"; // I'm assuming you want this to be a bit more restrictive; not sure
String q_regex = "\\d+"; // what sort of number exactly? assuming any string of digits here
String regex = "M:(?<M>" + m_regex + "):"
+ "D:(?<D>" + d_regex + "):"
+ "C:(?<C>" + c_regex + "):" // each named group needs a unique name
+ "Q:(?<Q>" + q_regex + ")";
Pattern p = Pattern.compile(regex);
It might be a good idea to keep the compiled pattern in a static field and build it in a static block, so that the temporary regex strings don't clutter the class as basically useless fields.
Then you can retrieve each part by its name:
Matcher m = p.matcher( input );
if (m.matches()) {
String m_part = m.group( "M" );
...
String q_part = m.group( "Q" );
}
You can go a step further by making a RegexGroup interface, where each implementing object represents one part of the regex and holds its name and the actual regex. You do lose the simplicity, though, which makes it harder to understand at a quick glance. (I wouldn't do this; I'm just pointing out that it's possible and has its own benefits.)
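A compact, self-contained sketch of the named-group approach is shown here. The class name MessageParser and the method parse are hypothetical, and a lazy .+? stands in for a real URL regex in the D part, which is an assumption made only to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for M:<text>:D:<url or text>:C:<text>:Q:<number>.
final class MessageParser {

    // Compiled once and kept in a static field.
    // Assumption: ".+?" is a stand-in for a proper URL regex in the D part.
    private static final Pattern FORMAT = Pattern.compile(
            "M:(?<M>[\\w.]+):D:(?<D>.+?):C:(?<C>[\\w.]+):Q:(?<Q>\\d+)");

    // Returns {M, D, C, Q}, or null when the input does not match the format.
    static String[] parse(String input) {
        Matcher m = FORMAT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("M"), m.group("D"), m.group("C"), m.group("Q") };
    }

    public static void main(String[] args) {
        String[] parts = parse("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1");
        System.out.println(String.join("\n", parts));
    }
}
```

Because matches() must consume the whole input, the lazy D group grows until the remaining :C: and :Q: parts fit, so the colons inside the URL stay in the D part.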

How can you parse the string which has a text qualifier

How can I parse a String str = "abc, \"def,ghi\"";
such that I get the output as
String[] strs = {"abc", "\"def,ghi\""}
i.e. an array of length 2.
Should I use a regular expression, or is there a method in the Java API or any other open-source
project which lets me do this?
Edited
To give context about the problem: I am reading a text file which has a list of records, one on each line. Each record has a list of fields separated by a delimiter (comma or semicolon). Now I have a requirement to support a text qualifier, something like what Excel or OpenOffice supports. Suppose I have the record
abc, "def,ghi"
Here , is my delimiter and " is my text qualifier, so when I parse this string I should get two fields, abc and def,ghi, not {abc,def,ghi}
Hope this clears my requirement.
Thanks
Shekhar

The basic algorithm is not too complicated:
public static List<String> customSplit(String input) {
List<String> elements = new ArrayList<String>();
StringBuilder elementBuilder = new StringBuilder();
boolean isQuoted = false;
for (char c : input.toCharArray()) {
if (c == '\"') {
isQuoted = !isQuoted;
// continue; // changed according to the OP comment - \" shall not be skipped
}
if (c == ',' && !isQuoted) {
elements.add(elementBuilder.toString().trim());
elementBuilder = new StringBuilder();
continue;
}
elementBuilder.append(c);
}
elements.add(elementBuilder.toString().trim());
return elements;
}

This question seems appropriate: Split a string ignoring quoted sections
Along that line, http://opencsv.sourceforge.net/ seems appropriate.

Try this -
String str = "abc, \"def,ghi\"";
String regex = "([,]) | (^[\"\\w*,\\w*\"])";
for(String s : str.split(regex)){
System.out.println(s);
}

Try:
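List<String> res = new LinkedList<String>();
String[] chunks = str.split("\"", -1); // limit -1 keeps trailing empty chunks, so balanced quotes give an odd count
if (chunks.length % 2 == 0) {
// Mismatched quotes!
}
for (int i = 0; i < chunks.length; i++) {
if (i % 2 == 0) {
// Even chunks lie outside the quotes: split them on the delimiter
res.addAll(Arrays.asList(chunks[i].split(",")));
} else {
// Odd chunks lie between quotes: keep them whole
res.add(chunks[i]);
}
}
This will only split up the portions that are not between quotes.
Call trim() if you want to get rid of the whitespace, and skip any pieces that are empty after trimming.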
