Parsing HTTP XML Response Using Regex In Java - java

I am making an API call and need to extract a specific piece of data from the response: the DocumentID of the document whose Description is "Invoice", which in the response below is 110107.
I have already written a method that gets the contents of a single tag:
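public synchronized String getTagFromHTTPResponseAsString(String tag, String body) throws IOException {
    // Pattern.quote guards against regex metacharacters in the tag name
    final String quotedTag = Pattern.quote(tag);
    final Pattern pattern = Pattern.compile("<" + quotedTag + ">(.+?)</" + quotedTag + ">");
    final Matcher matcher = pattern.matcher(body);
    // find() returns false when the tag is absent; calling group(1) without a match throws IllegalStateException
    return matcher.find() ? matcher.group(1) : null;
} // end getTagFromHTTPResponseAsString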
However, this response contains multiple elements with the same tag, and I need one specific element. Here is the response:
<?xml version="1.0" encoding="utf-8"?>
<Order TrackingID="351535" TrackingNumber="TEST-843245" xmlns="">
<ErrorMessage />
<StatusDocuments>
<StatusDocument NUM="1">
<DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
<FileName>4215.pdf</FileName>
<Type>Sales Contract</Type>
<Description>Uploaded Document</Description>
<DocumentID>110098</DocumentID>
<DocumentPlaceHolder />
</StatusDocument>
<StatusDocument NUM="2">
<DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
<FileName>Apex_Shortcuts.pdf</FileName>
<Type>Other</Type>
<Description>Uploaded Document</Description>
<DocumentID>110100</DocumentID>
<DocumentPlaceHolder />
</StatusDocument>
<StatusDocument NUM="3">
<DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
<FileName>CRAddend.pdf</FileName>
<Type>Other</Type>
<Description>Uploaded Document</Description>
<DocumentID>110104</DocumentID>
<DocumentPlaceHolder />
</StatusDocument>
<StatusDocument NUM="4">
<DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
<FileName>test.pdf</FileName>
<Type>Other</Type>
<Description>Uploaded Document</Description>
<DocumentID>110102</DocumentID>
<DocumentPlaceHolder />
</StatusDocument>
<StatusDocument NUM="5">
<DocumentDate>7/14/2017 6:55:00 AM</DocumentDate>
<FileName>Invoice.pdf</FileName>
<Type>Invoice</Type>
<Description>Invoice</Description>
<DocumentID>110107</DocumentID>
<DocumentPlaceHolder />
</StatusDocument>
</StatusDocuments>
</Order>
I built and tested a regular expression on https://regex101.com/ and it works there, but I cannot get it to work in my Java code:
<Description>Invoice<\/Description>
<DocumentID>(.*?)<\/DocumentID>

Try it with Jsoup
Example:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class InvoiceIdExtractor {
    public static void main(String[] args) throws Exception {
        String xml = "yourXML";
        Document doc = Jsoup.parse(xml);
        // Select every StatusDocument element, then keep the one described as "Invoice"
        Elements statusDocuments = doc.select("StatusDocument");
        for (Element e : statusDocuments) {
            if (e.select("Description").text().equals("Invoice")) {
                System.out.println(e.select("DocumentID").text());
            }
        }
    }
}
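If adding a dependency is not an option, the same lookup can probably be done with the XPath support that ships with the JDK. This is only a sketch: the class and method names are made up for illustration, and it assumes the response is well-formed XML with no namespaces, as in the sample above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class InvoiceIdXPath {

    // Returns the DocumentID of the first StatusDocument whose Description is
    // "Invoice", or an empty string when no such element exists.
    static String findInvoiceDocumentId(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expression = "//StatusDocument[Description='Invoice']/DocumentID";
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Raised for malformed XML or an invalid expression
            throw new IllegalArgumentException("Could not evaluate XPath on response", e);
        }
    }

    public static void main(String[] args) {
        String xml =
              "<Order TrackingID=\"351535\">\n"
            + "  <StatusDocuments>\n"
            + "    <StatusDocument NUM=\"1\">\n"
            + "      <Description>Uploaded Document</Description>\n"
            + "      <DocumentID>110098</DocumentID>\n"
            + "    </StatusDocument>\n"
            + "    <StatusDocument NUM=\"5\">\n"
            + "      <Description>Invoice</Description>\n"
            + "      <DocumentID>110107</DocumentID>\n"
            + "    </StatusDocument>\n"
            + "  </StatusDocuments>\n"
            + "</Order>";
        System.out.println(findInvoiceDocumentId(xml));
    }
}
```

The predicate [Description='Invoice'] does the filtering that the loop in the Jsoup example does by hand, so line breaks and indentation between elements do not matter.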

To solve this, I used a StringBuilder to convert the response into a single string and then used this piece of code to get the DocumentID:
// Create the pattern and matcher
Pattern p = Pattern.compile("<Description>Invoice</Description><DocumentID>(.*?)</DocumentID>");
Matcher m = p.matcher(responseText);
// If an occurrence of the pattern was found in the string...
if (m.find()) {
    // ...then the group() methods can be used.
    System.out.println("group0 = " + m.group(0)); // whole matched expression
    System.out.println("group1 = " + m.group(1)); // contents of the first capturing group
    // Set the documentID for the Invoice; group(1) is only valid after a successful find()
    documentID = m.group(1);
}
This is probably not the best way to do it, but it works for now. I will come back and clean this up with a more correct solution based on the suggestions given here.
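A likely reason the regex101 pattern does not carry over is that there it spans two lines, while a Java string literal written on one line has no line break between the two tags. Allowing optional whitespace between them with \\s* should remove the need to flatten the response first. A minimal sketch, with a hypothetical class and method name:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceIdRegex {

    // \\s* tolerates the line break and indentation between the two elements;
    // (.*?) is lazy so it stops at the first closing DocumentID tag.
    private static final Pattern INVOICE_ID = Pattern.compile(
            "<Description>Invoice</Description>\\s*<DocumentID>(.*?)</DocumentID>");

    // Returns the captured DocumentID, or Optional.empty() when nothing matches.
    static Optional<String> findInvoiceDocumentId(String body) {
        Matcher m = INVOICE_ID.matcher(body);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String body =
              "<Description>Uploaded Document</Description>\n"
            + "<DocumentID>110102</DocumentID>\n"
            + "<Description>Invoice</Description>\n"
            + "<DocumentID>110107</DocumentID>\n";
        System.out.println(findInvoiceDocumentId(body).orElse("not found"));
    }
}
```

Returning an Optional forces the caller to handle the no-match case instead of hitting an IllegalStateException from group(1).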

Related

Regex get text between tags

I am trying to get the text between tags in Java.
`
<td colspan="2" style="font-weight:bold;">HELLO TOTO</td>
<td>Function :</td>
`
I would like to use a regex to extract "HELLO TOTO" but not "Function :".
I have already tried something like this:
`
String btwTags = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td>\n" + "<td>Function :</td>";
Pattern pattern = Pattern.compile("<td(.*?)>(.*?)</td>");
Matcher matcher = pattern.matcher(btwTags);
while (matcher.find()) {
String group = matcher.group();
System.out.println(group);
}
`
but the result is the same as the input.
Any ideas?
I also tried the regex (?<=<td>)(.*?)(?=</td>), but it only catches "Function :".
I don't know how to allow for attributes after the opening <td ...>.
Thanks in advance.
Don't use regex to parse HTML; it's a very bad idea.
To see why, check this link:
RegEx match open tags except XHTML self-contained tags
You can use Jsoup to achieve this:
String html; // your html code
Document doc = Jsoup.parse(html);
System.out.println(doc.select("td[colspan=2]").text());
You can use a regex for very basic HTML parsing. Here's the simplest Java regex I could find:
"(?i)<td[^>]+>([^<]+)<\\/td>"
It matches the first td tag with attributes and a value. "HELLO TOTO" is in group 1.
Here's an example.
For anything more complex, a parser like Jsoup would be better.
But even a parser could fail if the HTML isn't valid or if the structure for which you wrote the code has been changed.
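For reference, here is a small sketch of that pattern in use. The attempt in the question printed the input back because matcher.group() returns the whole match; the text itself is in the capturing group. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdText {

    // [^>]+ requires at least one character after "<td", so a bare <td> is skipped;
    // group 1 captures the text up to the next '<'.
    private static final Pattern TD_WITH_ATTRIBUTES =
            Pattern.compile("(?i)<td[^>]+>([^<]+)</td>");

    // Collects the text of every td element that carries attributes.
    static List<String> extractText(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = TD_WITH_ATTRIBUTES.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1)); // group(1), not group(): only the captured text
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td>\n"
                + "<td>Function :</td>";
        System.out.println(extractText(html)); // prints [HELLO TOTO]
    }
}
```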
Here is a solution that doesn't use a hand-written Pattern; I hope it is helpful.
public class Solution{
public static void main(String ...args){
String str = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td><td>Function :</td>";
String [] garray = str.split(">|</td>");
for(int i = 1;i < garray.length;i+=2){
System.out.println(garray[i]);
}
}
}
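Output:
HELLO TOTO
Function :
I am just using the split function to break the string at the given delimiters. Hand-written regexes can be slow and are often confusing, although note that String.split itself takes a regular expression.
Cheers, happy coding.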

RSS Feed - Parse/Extract src image tag inside Description tag in JAVA

This extends the question
How to extract an image src from RSS feed
to Java. That question was answered for iOS, but I could not find a working solution for Java.
I know how to parse a top-level RSS tag directly, but parsing a tag nested inside another tag, as in the example below, is more complicated:
<description>
<![CDATA[
<img width="745" height="410" src="http://example.com/image.png" class="attachment-large wp-post-image" alt="alt tag" style="margin-bottom: 15px;" />description text
]]>
</description>
How can I extract just the src attribute?
Take a look at jsoup. I think it's what you need.
EDIT:
private String extractImageUrl(String description) {
Document document = Jsoup.parse(description);
Elements imgs = document.select("img");
for (Element img : imgs) {
if (img.hasAttr("src")) {
return img.attr("src");
}
}
// no image URL
return "";
}
You could try a regular expression to get the value.
Take a look at this small example; I hope it helps.
For more information about regular expressions, see:
http://www.tutorialspoint.com/java/java_regular_expressions.htm
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test{
public static void main(String []args){
String regularExpression = "src=\"([^\"]*)\""; // [^\"]* stops at the closing quote instead of relying on a following class attribute
String html = "<description> <![CDATA[ <img width=\"745\" height=\"410\" src=\"http://example.com/image.png\" class=\"attachment-large wp-post-image\" alt=\"alt tag\" style=\"margin-bottom: 15px;\" />description text ]]> </description>";
// Create a Pattern object
Pattern pattern = Pattern.compile(regularExpression);
// Now create matcher object.
Matcher matcher = pattern.matcher(html);
if (matcher.find( )) {
System.out.println("Found value: " + matcher.group(1) );
// Prints: Found value: http://example.com/image.png
}
}
}

How to check if html document contains string

What would be a fast way to check whether the page at a URL contains a given string? I tried jsoup and pattern matching, but is there a faster way?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupTest {
public static void main(String[] args) throws Exception {
String url = "https://en.wikipedia.org/wiki/Hawaii";
Document doc = Jsoup.connect(url).get();
String html = doc.html();
Pattern pattern = Pattern.compile("<h2>Contents</h2>");
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
System.out.println("Found it");
}
}
}
It depends. If your pattern is really only a simple substring to be found verbatim in the page content, then both methods you suggest are overkill. In that case, fetch the page without parsing it. You can still use Jsoup to fetch the page; just don't start the parser:
Connection con = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii");
Response res = con.execute();
String rawPageStr = res.body();
if (rawPageStr.contains("<h2>Contents</h2>")){
//do whatever you need to do
}
If the pattern really is a regular expression, use this:
Pattern pattern = Pattern.compile("<h2>\\s*Contents\\s*</h2>");
Matcher matcher = pattern.matcher(rawPageStr);
This only makes sense if you do not need to parse much more of the page. However, if you actually want to perform a structured search of the DOM via CSS selectors, Jsoup is not a bad choice, although a SAX-based approach like TagSoup could probably be a bit faster.
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii").get();
Elements h2s = doc.select("h2");
for (Element h2 : h2s){
if (h2.text().equals("Contents")){
//do whatever & more
}
}

Changing XML values in Android/Java

I want to take an XML file as input which contains the following:
<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
<int name="count" value="10" />
</map>
and then read the value and change it from 10 to any other integer.
How can I do this in Android/Java? I'm new to Android and Java, and all the tutorials available on the internet are way too complicated.
Thank you.
You can change the value by matching a pattern and replacing the matched digits, as below:
String xmlString = "<int name=\"count\" value=\"10\" />";
int newValue = 100;
Pattern pattern = Pattern.compile("(<int name=\"count\" value=\")([0-9]*)(\" />)");
Matcher matcher = pattern.matcher(xmlString);
if (matcher.find()) {
    // Splice the new value over group 2 only; calling String.replace on the
    // matched digits would also change any other "10" elsewhere in the string.
    xmlString = xmlString.substring(0, matcher.start(2)) + newValue + xmlString.substring(matcher.end(2));
}
System.out.println(xmlString);
You can find your answer here. It is like parsing JSON: you can convert the string (read from the file) into an object and then work with its parameters.
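As a rough illustration of the "convert it into an object" idea, the standard DOM and Transformer classes (in the javax.xml packages, which to my knowledge are also available on Android) can parse the XML, change the attribute, and serialize it back. The class and method names below are hypothetical, and the sketch works on a string rather than a file to keep it short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IntValueUpdater {

    // Sets the value attribute of every <int> element whose name attribute
    // equals the given name, and returns the updated XML as a string.
    static String setIntValue(String xml, String name, int newValue) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList ints = doc.getElementsByTagName("int");
            for (int i = 0; i < ints.getLength(); i++) {
                Element element = (Element) ints.item(i);
                if (name.equals(element.getAttribute("name"))) {
                    element.setAttribute("value", Integer.toString(newValue));
                }
            }

            // Serialize the modified tree back to text
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Parsing and transforming throw several checked exceptions; wrap them
            throw new IllegalArgumentException("Could not update XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<map>\n<int name=\"count\" value=\"10\" />\n</map>";
        System.out.println(setIntValue(xml, "count", 100));
    }
}
```

Unlike the string-replacement approach, this does not depend on attribute order or spacing inside the tag.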

attributes pattern matcher takes a long time

I have a regex that captures the src and the remaining attributes of every image in the content:
<img *((.|\s)*?) *src *= *['"]([^'"]*)['"] *((.|\s)*?) */*>
If the content I am matching against looks like
<img src=src1"/> <img src=src2"/>
find(index) hangs, and I see the following in the thread dump:
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
Is there a solution or a workaround for solving this issue?
A workaround is to use an HTML parser such as Jsoup, for example:
Document doc =
Jsoup.parse("<html><img src=\"src1\"/> <img src=\"src2\"/></html>");
Elements elements = doc.select("img[src]");
for (Element element: elements) {
System.out.println(element.attr("src"));
System.out.println(element.attr("alt"));
System.out.println(element.attr("height"));
System.out.println(element.attr("width"));
}
It looks like what you've got is an "evil regex", which is not uncommon when you try to construct a complicated regex to match one thing (src) within another thing (img). In particular, evil regexes usually arise when you apply repetition to a complex subexpression, which you are doing with (.|\s)*?.
A better approach would be to use two regexes; one to match all <img> tags, and then another to match the src attribute within it.
My Java's rusty, so I'll just give you the pseudocode solution:
foreach( imgTag in input.match( /<img .*?>/ig ) ) {
src = imgTag.match( /\bsrc *= *(['\"])(.*?)\1/i );
// if you want to get other attributes, you can do that the same way:
alt = imgTag.match( /\balt *= *(['\"])(.*?)\1/i );
// even better, you can get all the attributes in one go:
attrs = imgTag.match( /\b(\w+) *= *(['\"])(.*?)\2/g );
// attrs is now an array where the first group is the attr name
// (alt, height, width, src, etc.) and the second group is the
// attr value
}
Note the use of a backreference to match the appropriate type of closing quote (i.e., this will match both src='abc' and src="abc"). Also note that the quantifiers are lazy here (*? instead of just *); this is necessary to prevent too much from being consumed.
EDIT: even though my Java's rusty, I was able to crank out an example. Here's the solution in Java:
import java.util.regex.*;
public class Regex {
public static void main( String[] args ) {
String input = "<img alt=\"altText\" src=\"src\" height=\"50\" width=\"50\"/> <img alt='another image' src=\"foo.jpg\" />";
Pattern attrPat = Pattern.compile( "\\b(\\w+) *= *(['\"])(.*?)\\2" );
Matcher imgMatcher = Pattern.compile( "<img .*?>" ).matcher( input );
while( imgMatcher.find() ) {
String imgTag = imgMatcher.group();
System.out.println( imgTag );
Matcher attrMatcher = attrPat.matcher( imgTag );
while( attrMatcher.find() ) {
// group(1) is the attribute name, group(3) the value between the matched quotes
System.out.format( "\tattr: %s, value: %s\n", attrMatcher.group(1), attrMatcher.group(3) );
}
}
}
}
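If a single pattern is preferred, one option (a sketch, not a drop-in replacement for the original regex) is to swap each (.|\s)*? for a negated character class such as [^>]*, so a match attempt can never run past the end of the current tag. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {

    // [^>]* cannot cross the closing '>' of the tag, which bounds backtracking
    // to a single tag; \\1 matches the same quote character that opened the value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*(['\"])(.*?)\\1[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the src value of every img tag that has a quoted src attribute.
    static List<String> extractSrcs(String html) {
        List<String> srcs = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            srcs.add(matcher.group(2));
        }
        return srcs;
    }

    public static void main(String[] args) {
        String wellFormed = "<img src=\"a.png\" alt='x'/> <img alt=\"y\" src='b.jpg'>";
        String malformed = "<img src=src1\"/> <img src=src2\"/>";
        System.out.println(extractSrcs(wellFormed)); // prints [a.png, b.jpg]
        System.out.println(extractSrcs(malformed));  // prints []
    }
}
```

On the malformed input from the question, each attempt fails as soon as it reaches the closing '>' rather than retrying combinations of the two alternatives in (.|\s).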
