In Java, trying to extract an xmlns value using a regular expression

I have been trying for a few hours to get this right, and I really can't seem to do it...
Given a string
"xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\""
what is the correct expression to "save" the http://www.openarchives.org/OAI/2.0/oai-identifier bit?
Thanks in advance, really having trouble getting this right.
String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
Pattern p = Pattern.compile(".*\\\"(.*)\\\".*");
Matcher m = p.matcher(validXML);
System.out.println(m.group(1));
It is not printing anything. Be aware that this attempt was just to get the string inside the quotes; I was going to worry about the other part once I got that working... Too bad I never got that working. Thanks

Regular expressions can be expensive, so don't use them when you don't need to. There are plenty of other ways to parse a string.
String validXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
String start = "xmlns:oai-identifier=\"";
// The closing quote alone marks the end, so the last attribute (followed by '>') works too
String end = "\"";
int location = validXml.indexOf(start);
String result;
// indexOf returns -1 when nothing is found; 0 is a valid position
if (location >= 0) {
    result = validXml.substring(location + start.length());
    int endIndex = result.indexOf(end);
    if (endIndex >= 0) {
        result = result.substring(0, endIndex);
    }
    else {
        throw new Exception("Could not find end!");
    }
}
else {
    throw new Exception("Could not find start!");
}
System.out.println(result);

I think the problem might be that the first .* in your regular expression is too greedy and is matching more characters than you'd like.
Try changing ".*\\\"(.*)\\\".*" to be "xmlns.*=\"(.*)\".*" and see whether that works.
If it doesn't work at first, you can also try reinstating the quote escaping. Off the top of my head, I don't think the quotes need escaping in the regex itself, but I'm not 100% sure.
Note also that this will only match a single namespace declaration, not each one in the validXML variable in your example. You'll have to split the string in order to use this on an arbitrary number of xmlns:.*= attributes.
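As a rough sketch of how the greediness issue can be sidestepped: instead of .*, a negated character class such as [^"]* stops at the first closing quote, and calling find() in a loop picks up every declaration without splitting the string. The class name XmlnsRegexSketch and the helper extractNamespaces are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlnsRegexSketch {
    // Group 1 is the prefix, group 2 the URI.
    // [^"]* cannot run past the closing quote, unlike a greedy .*
    private static final Pattern XMLNS =
            Pattern.compile("xmlns:([\\w.-]+)=\"([^\"]*)\"");

    // Hypothetical helper: collects prefix -> URI in document order.
    static Map<String, String> extractNamespaces(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = XMLNS.matcher(xml);
        // find() must be called before group(); each call moves to the next match
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
                + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
                + "xmlns:mingo-identifier=\"http://www.google.com\" "
                + "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
                + "</feed>";
        extractNamespaces(validXML)
                .forEach((prefix, uri) -> System.out.println(prefix + " -> " + uri));
    }
}
```

Note that the original attempt also called m.group(1) without calling m.find() or m.matches() first, which by itself prevents any output.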

Since you are reading XML, you might be using DOM, so you can extract the namespace from the prefix name using lookupNamespaceURI() once you parse the document with the setNamespaceAware() option set to true:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(validXML)));
String namespace = doc.lookupNamespaceURI("oai-identifier");
It's simpler and you don't have to do any string parsing.
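For completeness, here is the same idea as a self-contained sketch with its imports. The class name NamespaceLookupSketch and the helper lookup are assumptions for the example; the lookup is done on the document element, which is where the declarations in the sample string live.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceLookupSketch {
    // Hypothetical helper: returns the URI bound to the prefix on the root element, or null.
    static String lookup(String xml, String prefix) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without this, the parser does not track namespace declarations
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().lookupNamespaceURI(prefix);
        } catch (Exception e) {
            // Wrapped so callers do not have to handle checked parser exceptions
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
                + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
                + "xmlns:mingo-identifier=\"http://www.google.com\">"
                + "</feed>";
        System.out.println(lookup(validXML, "oai-identifier"));
    }
}
```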

Related

Need help to form a regex in java

I want to find a regex and count its occurrences in the page source using Java. The value I am trying to search for is given in the program below.
There might be one or more spaces between tags. I am not able to form a regex for this value. Can someone please help me find the regex for this value?
My program, which checks the regex, is given below:
String regx=""<img height=""1"" width=""1"" style=""border-style:none;"" alt="""" src=""//api.adsymptotic.com/api/s/trackconversion?_pid=12170&_psign=3841da8d95cc1dbcf27a696f27ccab0b&_aid=1376&_lbl=RT_LampsPlus_Retargeting_Pixel""/>";
WebDriver driver = new FirefoxDriver();
driver.navigate().to("abc.xom");
int count=0, found=0;
source = driver.getPageSource();
source = source.replaceAll("\\s+", " ").trim();
pattern = Pattern.compile(regx);
matcher = pattern.matcher(source);
while(matcher.find())
{
count++;
found=1;
}
if(found==0)
{
System.out.println("Maximiser not found");
pixelData[rowNumber][2] = String.valueOf(count) ;
pixelData[rowNumber][3] = "Fail";
}
else
{
System.out.println("Maximiser is found" + count);
pixelData[rowNumber][2] = String.valueOf(count) ;
pixelData[rowNumber][3] = "Pass";
}
count=0; found=0;
Hard to tell without the original text and expected result, but your Pattern clearly won't compile as is.
You should single-escape double quotes (\") and double-escape regex special characters (e.g. \\?) for your code and your Pattern to compile.
Something along the lines of:
String regx="<img height=\"1\" width=\"1\" style=\"border-style:none;\" " +
"alt=\"\" src=\"//api.adsymptotic.com/api/s/trackconversion" +
"\\?_pid=12170&_psign=3841da8d95cc1dbcf27a696f27ccab0b" +
"&_aid=1376&_lbl=RT_LampsPlus_Retargeting_Pixel\"/>";
Also consider scraping markup with an appropriate library (e.g. jsoup for HTML) instead of regex.
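If the goal is only to count how often a fixed piece of text occurs, one option is to let Pattern.quote do the escaping so that characters like ? are treated literally. The sketch below assumes exactly that; the class LiteralCountSketch, the helper countOccurrences and the example.invalid URL are placeholders, not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralCountSketch {
    // Hypothetical helper: counts non-overlapping occurrences of a literal string.
    static int countOccurrences(String source, String literal) {
        // Pattern.quote turns the whole literal into a pattern with no special characters
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(source);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Placeholder tag; only the Java string quotes need escaping here
        String tag = "<img height=\"1\" src=\"//example.invalid/track?_pid=1&_aid=2\"/>";
        String source = "<p>" + tag + "</p> <div>" + tag + "</div>";
        System.out.println(countOccurrences(source, tag)); // prints 2
    }
}
```

Because a quoted literal matches spaces exactly, the whitespace normalisation done in the question (replaceAll("\\s+", " ")) would still be needed on the page source.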

regex matcher check in if logic not working

Hi, you can see my code below. I have some strings in my code (country, Rank and Grank). Initially they are null, but if the regex matches, the value should change. However, even when the regex matches, the value does not change; it is always null. If I remove all the if statements and append the string, it works fine, but if no match is found it throws an exception. Please let me know how I can check this in the if logic.
System.err.println(content);
Pattern c = Pattern.compile("NAME=\"(.*)\" RANK");
Pattern r = Pattern.compile("\" RANK=\"(.*)\"");
Pattern gr = Pattern.compile("\" TEXT=\"(.*)\" SOURCE");
Matcher co = c.matcher(content);
Matcher ra = r.matcher(content);
Matcher gra = gr.matcher(content);
co.find();
ra.find();
gra.find();
String country = null;
String Rank = null;
String Grank = null;
if (co.matches()) {
country = co.group(1);
}
if (ra.matches()) {
Rank = ra.group(1);
}
if (gra.matches()) {
Grank = gra.group(1);
}
You have to escape a single \ - use double \\ then it should work.
Tried this?
while (co.find()) {
System.out.print("Start index: " + co.start());
System.out.print(" End index: " + co.end() + " ");
System.out.println(co.group());
}
Personally, I can't make your program work with or without the if, so it's not a problem of logic; the pattern just doesn't match the string for me.
So I changed it to get something working; maybe you can use it :)
String content = "NAME=\"salut\" RANK=\"pouet\" TEXT=\"text\" SOURCE";
System.out.println(content);
System.out.println(content.replaceAll(("NAME=\"(.*)\"\\sRANK=\"(.*)\"\\sTEXT=\"(.*)\" SOURCE"), "$1---$2---$3"));
Output
NAME="salut" RANK="pouet" TEXT="text" SOURCE
salut---pouet---text
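One likely cause of the values staying null in the question: Matcher.matches() only succeeds when the entire input matches the pattern, whereas find() looks for the pattern anywhere in the input. Using the result of find() as the if condition avoids both the null values and the exception. The sketch below assumes the sample content string from the answer above; the class FindVsMatchesSketch and the helper firstGroup are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatchesSketch {
    // Hypothetical helper: returns group 1 of the first match, or null if there is none.
    static String firstGroup(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        // find() succeeds on a partial match; matches() would need the whole string to match
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "NAME=\"salut\" RANK=\"pouet\" TEXT=\"text\" SOURCE";
        // [^"]* keeps each capture inside its own pair of quotes
        String country = firstGroup("NAME=\"([^\"]*)\" RANK", content);
        String rank = firstGroup(" RANK=\"([^\"]*)\"", content);
        String grank = firstGroup(" TEXT=\"([^\"]*)\" SOURCE", content);
        String missing = firstGroup("FOO=\"([^\"]*)\"", content);
        System.out.println(country + " / " + rank + " / " + grank + " / " + missing);
        // prints: salut / pouet / text / null
    }
}
```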

Replace String in Java with regex and replaceAll

Is there a simple solution to parse a String by using regex in Java?
I have to adapt an HTML page. Therefore I have to parse several strings, e.g.:
href="/browse/PJBUGS-911"
=>
href="PJBUGS-911.html"
The pattern of the strings is only different corresponding to the ID (e.g. 911). My first idea looks like this:
String input = "";
String output = input.replaceAll("href=\"/browse/PJBUGS\\-[0-9]*\"", "href=\"PJBUGS-???.html\"");
I want to replace everything except the ID. How can I do this?
Would be nice if someone can help me :)
You can capture substrings that were matched by your pattern, using parentheses. And then you can use the captured things in the replacement with $n where n is the number of the set of parentheses (counting opening parentheses from left to right). For your example:
String output = input.replaceAll("href=\"/browse/PJBUGS-([0-9]*)\"", "href=\"PJBUGS-$1.html\"");
Or if you want:
String output = input.replaceAll("href=\"/browse/(PJBUGS-[0-9]*)\"", "href=\"$1.html\"");
This does not use regexp. But maybe it still solves your problem.
output = "href=\"" + input.substring(input.lastIndexOf("/") + 1, input.lastIndexOf("\"")) + ".html\"";
This is how I would do it:
public static void main(String[] args)
{
    String text = "href=\"/browse/PJBUGS-911\" blahblah href=\"/browse/PJBUGS-111\" " +
        "blahblah href=\"/browse/PJBUGS-34234\"";
    Pattern ptrn = Pattern.compile("href=\"/browse/(PJBUGS-[0-9]+?)\"");
    Matcher mtchr = ptrn.matcher(text);
    while(mtchr.find())
    {
        String match = mtchr.group(0);
        String insMatch = mtchr.group(1);
        String repl = match.replaceFirst(match, "href=\"" + insMatch + ".html\"");
        System.out.println("orig = <" + match + "> repl = <" + repl + ">");
    }
}
This just shows the regex and replacements, not the final formatted text, which you can get by using Matcher.replaceAll:
String allRepl = mtchr.replaceAll("href=\"$1.html\"");
If just interested in replacing all, you don't need the loop -- I used it just for debugging/showing how regex does business.

How should I get this specific text from str?

I have a lot of strings in a database like this: "\\LDDESKTOP\news\1455Bloomberg Document # 180784.txt". I want to get the file name after the last backslash.
I do this in the normal way:
str.substring(str.lastIndexOf("\\")+1)
But it doesn't work, because a single backslash is treated as an escape character. Is there a way in Java, as in Python, to tell the compiler to treat it as a plain string, like str=r'.......'?
Or how can I change the string to "\\\\LDDESKTOP\\news\\1455Bloomberg Document # 180784.txt", so that I can pass it to a File object to read this file?
How should I do this? Or are there other ways to solve this?
Thanks.
The column named path (varchar(150)) in the news table holds values like this: "\LDDESKTOP\news\1362Bloomberg Document # 180691.txt"
I do a normal select on the path.
The code:
public List<String> getNewsFileName(String startTime,String endTime) {
    List<String> newsFileNames = new ArrayList<String>();
    String tableName = ConfigFile.getConfig("configuration.txt","SQLServerTable");
    String sql = "select Path from [" + tableName + "] where localtime >= '" + startTime + "' and localtime <= '" + endTime + "'";
    try {
        if(connection==null) {
            InvertedIndex.logger.log(Level.SEVERE, "Database connection has not been initialized");
            System.exit(-1);
        }
        stmt=connection.createStatement();
        ResultSet rs = stmt.executeQuery(sql);
        while(rs.next()) {
            String path=rs.getString(1);
            newsFileNames.add(path);
        }
    } catch (SQLException e) {
        InvertedIndex.logger.log(Level.SEVERE,"Fail to store news");
    }
    return newsFileNames;
}
In Java string literals, the backslash starts an escape sequence (such as \n for a line break), so a literal backslash has to be escaped itself.
To get a single backslash character in a string literal, you write two backslashes: \\.
String string = new String("\\\\LDDESKTOP\\news\\1455Bloomberg Document # 180784.txt");
String str = string.substring(string.lastIndexOf("\\")+1);
System.out.println(str);
This prints
1455Bloomberg Document # 180784.txt
Edit 1:
Once you have the string, you can pass it back using the same escape character.
String string = "\\\\LDDESKTOP\\news\\" + str;
This outputs the original
\\LDDESKTOP\news\1455Bloomberg Document # 180784.txt
Edit 2:
Based on what you asked, in order to transform all single backslashes into double backslashes you must use both the escape sequence and the string "replace" method.
If you have this string:
String string = new String("\\\\LDDESKTOP\\news\\1455Bloomberg Document # 180784.txt");
You need to call this code to "double" every backslash:
String newString = string.replace("\\", "\\\\");
This produces the following. Note that this is how the result would be written as a Java string literal, with every backslash escaped, not what gets printed:
\\\\\\\\LDDESKTOP\\\\news\\\\1455Bloomberg Document # 180784.txt
The string itself will look like this:
\\\\LDDESKTOP\\news\\1455Bloomberg Document # 180784.txt
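To make the literal-versus-value distinction concrete, here is a small sketch. The doubling of backslashes is only needed when the path is typed into Java source; a string that arrives at runtime (for example from ResultSet.getString) already contains single backslashes and can be searched directly. The class LastSegmentSketch and the helper fileNameOf are hypothetical names for this example.

```java
class LastSegmentSketch {
    // Hypothetical helper: text after the last backslash, or the whole string if there is none.
    static String fileNameOf(String path) {
        // '\\' in source is one backslash character; lastIndexOf returns -1 when absent,
        // so the + 1 then yields 0 and the whole string is returned
        return path.substring(path.lastIndexOf('\\') + 1);
    }

    public static void main(String[] args) {
        // Written with doubled backslashes in source; the String itself holds single ones
        String path = "\\\\LDDESKTOP\\news\\1455Bloomberg Document # 180784.txt";
        System.out.println(path);             // \\LDDESKTOP\news\1455Bloomberg Document # 180784.txt
        System.out.println(fileNameOf(path)); // 1455Bloomberg Document # 180784.txt
    }
}
```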
Try this code:
String st = "\\LDDESKTOP\news\1455Bloomberg Document # 180784.txt";
st = st.replace("\n", "\\n");
st = st.replace("\\", "\\\\");
String str = st.substring(st.lastIndexOf("\\")+1);
Test it. Note that "\n" is a line break.
Thanks for all the effort you have made. I think I have finally found the answer.
Instead of dealing with the string in the Java program, I process it directly with SQL functions.
Here is what I do:
SELECT substring(path,len(path)-charindex('\',reverse(path))+2,charindex('\',reverse(path)))
FROM News
This really does a good job!

Regular expression not returning the .group() value

I'm new to Java and to using regular expressions. The method seems to be OK, and it's finding results in the subject string, but when I try to get the actual string using .group(), it's empty. Here's the code:
public String TestRegularExpression(){
    try{
        Pattern regex = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher regexMatcher = regex.matcher(sourceCode);
        while (regexMatcher.find()) {
            results += "<li>" + regexMatcher.group() + "</li>";
            matches ++;
        }
    } catch (PatternSyntaxException ex) {
        results = "<li><strong class='ibm-important'>Syntax error in the regular expression</strong></li>";
    }
    if(results == null){results = "<li><strong class='ibm-important'>No meta tags found</strong></li>";}
    return "<h3>" + h3Title + " (" + matches + " found)</h3><ul>" + results + "</ul>";
}
Any help will be much appreciated!!!
Couldn't it be that you're just not seeing the output? If you output the match directly to HTML without quoting it, that'll just insert the META tag in the HTML code, and the web browser won't render it.
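If that is what is happening, escaping the match before placing it inside the list item should make it visible. Below is a minimal sketch with a hand-written escaper; escapeHtml and the class EscapeSketch are hypothetical names, and a real project might prefer an escaping utility from whatever library it already uses.

```java
class EscapeSketch {
    // Hypothetical helper: replaces the characters that HTML treats specially.
    static String escapeHtml(String s) {
        // '&' must be replaced first, otherwise the entities added below get escaped again
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String match = "<meta name=\"description\" content=\"x\">";
        // The browser now shows the tag as text instead of interpreting it
        System.out.println("<li>" + escapeHtml(match) + "</li>");
        // prints: <li>&lt;meta name=&quot;description&quot; content=&quot;x&quot;&gt;</li>
    }
}
```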
