Jsoup check if string is valid HTML

I am having difficulties with the Jsoup parser. How can I tell whether a given string is valid HTML?
String input = "Your vote was successfully added.";
boolean isValid = Jsoup.isValid(input);
// isValid = true
The isValid flag is true because Jsoup first uses HtmlTreeBuilder: if any of the html, head or body tags is missing, it adds them itself. Then it uses the Cleaner class and checks the result against the given Whitelist.
Is there a simple way to check whether a string is valid HTML without Jsoup trying to turn it into HTML?
My example is an AJAX response that arrives with the "text/html" content type. It then goes to the parser, Jsoup adds these tags, and as a result the response is not displayed properly.
Thanks for your help.

First of all, the solution proposed by Reuben does not work as expected. The pattern has to be compiled with the Pattern.DOTALL flag, because the input HTML may (and probably will) contain newline characters.
So it should be something like this:
Pattern htmlPattern = Pattern.compile(".*\\<[^>]+>.*", Pattern.DOTALL);
boolean isHTML = htmlPattern.matcher(input).matches();
I also think that this pattern should look for a specific HTML tag rather than just any tag. Next: the bare tag is not the only valid form. The tag may also carry attributes, and this has to be handled as well.
In the end I chose to modify the Jsoup source. If HtmlTreeBuilder (actually the BeforeHtml state) tries to add an <html> element, I throw a ParseException, and then I am sure that the input was not a valid HTML file.
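To see why the DOTALL flag matters, here is a minimal sketch using only java.util.regex. The class and method names (HtmlTagCheck, containsTag, containsTagSingleLineOnly) are made up for illustration. Keep in mind that this only detects something tag-shaped; it does not validate HTML.

```java
import java.util.regex.Pattern;

class HtmlTagCheck {
    // With DOTALL, '.' also matches line terminators, so multi-line input can match.
    private static final Pattern WITH_DOTALL =
            Pattern.compile(".*<[^>]+>.*", Pattern.DOTALL);
    // Without DOTALL, '.' stops at line terminators and matches() fails on multi-line input.
    private static final Pattern WITHOUT_DOTALL =
            Pattern.compile(".*<[^>]+>.*");

    static boolean containsTag(String input) {
        return WITH_DOTALL.matcher(input).matches();
    }

    static boolean containsTagSingleLineOnly(String input) {
        return WITHOUT_DOTALL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String multiLine = "<div>\nYour vote was successfully added.\n</div>";
        System.out.println(containsTagSingleLineOnly(multiLine)); // false
        System.out.println(containsTag(multiLine));               // true
        System.out.println(containsTag("Your vote was successfully added.")); // false
        System.out.println(containsTag("1 < 2 and 3 > 2"));       // true (a false positive)
    }
}
```

The last line shows the limit of this approach: any text with a '<' followed later by a '>' is reported as HTML.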

Use a regex to check whether the String contains HTML:
boolean isHTML = input.matches(".*\\<[^>]+>.*");
If your String contains an HTML tag, this returns true, for example:
String input = "<html><body></body></html>";
But for String input = "Hello World <>"; it returns false.

Related

get <img> value from a string in java

I'm parsing data from a JSON file. I have data like this:
String Content = <p><img class="alignleft size-full wp-image-56999" alt="abdullah" src="http://www.some.com/wp-content/uploads/2013/12/imageName.jpg" width="348" height="239" />Text</p>
<p>Text</p> <p>Text</p><p>The post Some Text appeared first on Some Webiste</p>
Now I want to split this string into two pieces. First, I want to get this URL from src:
http://www.some.com/wp-content/uploads/2013/12/imageName.jpg
and store it in a variable. I also want to remove the last line (The post ... appeared first on ...) and store the remaining text in another variable.
So, the questions are:
Is it possible to do that?
If so, how can I achieve it?

In Java
Get a Document object (SAXReader comes from dom4j):
Document originalDoc = new SAXReader().read(new StringReader("<div>data</div>"));
Then you can parse it (read this tutorial):
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
In JavaScript
To get the attribute:
var url = document.getElementsByTagName('img')[0].getAttribute('src');
If you have a string and want a document object, use jQuery:
var stringValue = '<div>data</div>';
var myObject = $(stringValue);

Use String.substring(firstIndex, lastIndex) to get the link from the src attribute.
Also learn to use an HTML parser like Jsoup; it will be useful in the near future.

If it's a well-structured string, you can parse it using any DOM parser and extract the data from it.
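As a rough sketch of the DOM parser suggestion, the JDK's built-in XML parser can handle the sample, assuming the fragment is well-formed XML (every tag closed, no HTML-only entities such as &nbsp;). The class name ImgSrcExtractor, its method names, and the wrapping <root> element are made up for this example; real-world HTML usually needs a tolerant parser such as Jsoup instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ImgSrcExtractor {
    // Wraps the fragment in a hypothetical <root> element so the XML has a single root.
    static Document parseFragment(String fragment) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    // Returns the src attribute of the first <img>, or null if there is none.
    static String firstImgSrc(Document doc) {
        NodeList imgs = doc.getElementsByTagName("img");
        return imgs.getLength() == 0 ? null : ((Element) imgs.item(0)).getAttribute("src");
    }

    // Joins the text of every <p> except the last one (the "appeared first on" line).
    static String textWithoutLastParagraph(Document doc) {
        NodeList paragraphs = doc.getElementsByTagName("p");
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength() - 1; i++) {
            parts.add(paragraphs.item(i).getTextContent());
        }
        return String.join("\n", parts);
    }

    public static void main(String[] args) {
        String content = "<p><img alt=\"abdullah\" "
                + "src=\"http://www.some.com/wp-content/uploads/2013/12/imageName.jpg\" "
                + "width=\"348\" height=\"239\" />Text</p>\n"
                + "<p>Text</p> <p>Text</p><p>The post Some Text appeared first on Some Webiste</p>";
        Document doc = parseFragment(content);
        System.out.println(firstImgSrc(doc));
        System.out.println(textWithoutLastParagraph(doc));
    }
}
```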

How to change the width and height of an html file using java

I want to change width="xyz", where xyz can be any value, to width="300". I researched regular expressions, and this is what I am using:
String holder = "width=\"340\"";
String replacer="width=\"[0-9]*\"";
theWeb.replaceAll(replacer,holder);
where theWeb is the string. But the value is not getting replaced. Any help would be appreciated.

Your regex is correct. One thing you might be forgetting is that Java strings are immutable: string methods never modify the current string, they only return a new string with the transformation applied. Try this instead:
String replacement = "width=\"340\"";
String regex = "width=\"[0-9]*\"";
String newWeb = theWeb.replaceAll(regex, replacement); // newWeb holds the new text; theWeb is unchanged

It is better to use Jsoup for manipulating HTML and extracting data from it.
See this link for more details:
http://jsoup.org/

Java: Carriage returns populating a var in js code?

I am not sure if this is possible, but I'm trying to find a front-end solution to this situation:
I am setting a JavaScript variable to a dynamic tag, populated by backend Java code:
var myString = '#myDynamicContent#';
However, there are some situations, in which the content from the output contains a carriage return; which breaks the code:
var mystring = '<div>
Carriage Return happened above and below.
</div>';
Is there any way I can resolve this problem on the front end? Or is it too late in the script to do something about it, because the dynamic tag will be filled in before any JavaScript runs (so the script is already broken by that point)?

I'm sure my JS could be cleaned up (just thought this was a fun problem), but you could search out the comment in the JS.
Let's say your JS looks like this (notice I added a tag to the comment so we know we're going after the correct one, and there is a div just for testing):
<script id="testScript">
/*<captureMe><div>
Carriage Return happened above and below.
</div>
*/
var foo = 'bar';
</script>
<div id='test'>What do I see:</div>
Just use this to grab the comment:
var something = $("#testScript").html();
var newSomething = '';
newSomething = something.substr(something.indexOf("/*<captureMe>")+13);
newSomething = newSomething.substr(0, newSomething.indexOf("*/"));
$('#test').append('<br>'+newSomething); // just proving we captured the output, will not render returns or newline as expected by HTML
Technically, it works :), scripting-scripting...
Charbs

JavaScript supports strings that can span multiple lines by putting a backslash (\) at the end of the line, for example:
var myString = 'foo\
bar';
So you should be able to do a Java replace when you write in your server-side variable:
var myString = '#myDynamicContent.replaceAll("\\n", "\\\\n")#';

Replace the \n and/or \r with \\n and/or \\r respectively ... but it has to be done in the server-side language (in your case Java); it can't be done in JavaScript.
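A minimal sketch of doing that escaping on the Java side before the value is written into the page. The class and method names (JsStringEscaper, escapeForJsLiteral) are made up for illustration, and it only covers backslashes, quotes, and line breaks; it does not handle a literal </script> inside the content or the Unicode line separators U+2028 and U+2029, which a production escaper should also consider.

```java
class JsStringEscaper {
    // Escapes a raw string so it can sit between quotes in a JavaScript string literal.
    static String escapeForJsLiteral(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break; // backslash first, so later escapes stay intact
                case '\'': out.append("\\'");  break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break; // real newline becomes the two characters \ and n
                case '\r': out.append("\\r");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String content = "<div>\r\nCarriage Return happened above and below.\r\n</div>";
        // Prints a single line that is safe to place inside '...' in a script.
        System.out.println("var myString = '" + escapeForJsLiteral(content) + "';");
    }
}
```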

Building off of @Charbs' answer, you could avoid the JavaScript comments if you give your script tag a different MIME type, so the browser won't try to evaluate it as JavaScript:
<script id="testScript" type="text/notjs" style="display:none">#myDynamicContent#</script>
And then just grab it like this (using jQuery):
var myString = $('#testScript').text();

To me it looks like you're doing token replacement instead of using a template engine. If you like token replacement, you might like Snippetory too, as it produces similar code but has a number of additional features. Using
var myString = '{v:myDynamicContent enc="string"}'
would create
var mystring = '<div>\r\n Carriage Return happened above and below.\r\n </div>'
And thus solve your problem. But you would have to change your code behind, too.

How do I manipulate strings with regex?

I'm fairly new to Java and I'm trying to get a part of a string:
Say I have a URL and I want a specific part of it, such as a filename:
String url = "http://example.com/filename02563.zip";
The 02563 will be generated at random every time, and it's not always 5 characters long.
I want to have Java find what's between "m/" (from .com/) and the end of the line to get the filename alone.
Now consider this example:
Say I have an HTML file that I want to extract a snippet from. Below is an example of the extracted snippet:
<applet name=someApplet id=game width="100%" height="100%" archive=someJarFile0456799.jar code=classInsideAJarFile.class mayscript>
I want to extract the jar filename, so I want to get the text between "ve=" and ".jar". The extension will always be ".jar", so including this is not important.
How would I do this? If possible, could you comment the code so I understand what's happening?

Use the Java URI class, which gives you access to the individual components.
URI uri = new URI("http://example.com/filename02563.zip"); // throws URISyntaxException
String filename = uri.getPath(); // "/filename02563.zip" (includes the leading slash)
Granted, this will need a little more work if the resource does not reside in the root path.

You can use the lastIndexOf() and substring() methods from the String class to extract a specific piece of a String:
String url = "http://example.com/filename02563.zip";
String filename = url.substring(url.lastIndexOf("/") + 1); //+1 skips ahead of the '/'
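The two suggestions can be combined: take the path from URI first, then cut after the last slash. This is only a sketch; the class and method names (UrlFileName, fileName) are invented here, and it assumes a hierarchical URL such as http://... (for opaque URIs like mailto: addresses getPath() returns null).

```java
import java.net.URI;

class UrlFileName {
    // Returns the last path segment of the URL, ignoring any query string or fragment.
    static String fileName(String url) {
        String path = URI.create(url).getPath();          // e.g. "/dir/filename02563.zip"
        return path.substring(path.lastIndexOf('/') + 1); // +1 skips the '/'
    }

    public static void main(String[] args) {
        System.out.println(fileName("http://example.com/filename02563.zip"));
        System.out.println(fileName("http://example.com/downloads/filename02563.zip?ref=a/b"));
    }
}
```

Going through getPath() first means a slash inside the query string does not confuse lastIndexOf().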

You already have answers for your first question, so this one is for the second. Normally I would use an XML parser, but your example is not valid XML, so I will solve it with a regex (as you wanted).
String url = "<applet name=someApplet id=game width=\"100%\" height=\"100%\" archive=someJarFile0456799.jar code=classInsideAJarFile.class mayscript>";
Pattern pattern = Pattern.compile("(?<=archive=).*?(?= )");
Matcher m = pattern.matcher(url);
if (m.find()) {
    System.out.println(m.group());
}
output:
someJarFile0456799.jar
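An alternative sketch that uses a capturing group instead of lookarounds, so it does not depend on a space following the attribute and can drop the ".jar" extension as the question asked. The class and method names (ArchiveAttribute, jarBaseName) are made up for this example, and the optional quote handling is an assumption about how the attribute might be written.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArchiveAttribute {
    // Group 1 captures the archive value up to, but not including, ".jar".
    // The optional quote covers both archive=foo.jar and archive="foo.jar".
    private static final Pattern ARCHIVE =
            Pattern.compile("\\barchive=\"?([^\\s\">]+?)\\.jar");

    // Returns the jar name without its extension, or null if no archive attribute is found.
    static String jarBaseName(String html) {
        Matcher m = ARCHIVE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String applet = "<applet name=someApplet id=game width=\"100%\" height=\"100%\" "
                + "archive=someJarFile0456799.jar code=classInsideAJarFile.class mayscript>";
        System.out.println(jarBaseName(applet)); // someJarFile0456799
    }
}
```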

Extracting an "encompassing" string based on a term within the string

I have a Java function that extracts a string from the HTML page source of any website. The function accepts the site name along with a term to search for. This search term is always contained within script tags. What I need to do is pull the entire JavaScript block (everything within the tags) that contains the search term.
Here is an example -
<script type="text/javascript">
//Roundtrip
rtTop = Number(new Date());
document.documentElement.className += ' jsenabled';
</script>
For the javascript snippet above, my search term would be "rtTop". Once found, I want my function to return the string containing everything within the script tags.
Any novel solution? Thanks.

You could use a regular expression along the lines of
String someHTML = "..."; // get your HTML from wherever
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>", Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
String result = null;
if (myMatcher.find()) { // group() throws IllegalStateException if nothing matched
    result = myMatcher.group(1);
}

I wish I could just comment on JacobM's answer, but I think I need more stackCred.
You could use an HTML parser; that's usually the better solution. That said, for limited scopes, I often use regex. It's a mean beast though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^>]*
That will allow you to match even if the "type" attribute isn't present or if the tag has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
* UPDATE *
Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.
String someHTML = "..."; // get your HTML from wherever
String lKeyword = "rtTop";
// ((?!</).)* matches any run of characters that never crosses a closing tag
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)" + Pattern.quote(lKeyword) + "(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern, Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
if (myMatcher.find()) {
    String lPreKeyword = myMatcher.group(3);
    String lPostKeyword = myMatcher.group(5);
    String result = lPreKeyword + lKeyword + lPostKeyword;
}
Like I said, parsing HTML via regex can get real ugly real fast.
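For reference, here is a trimmed-down sketch of the same idea as a reusable method. The class and method names (ScriptFinder, scriptContaining) are invented for this example. It stops at </script> specifically rather than at any closing tag, and it quotes the keyword so regex metacharacters in the search term are taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptFinder {
    // Returns the body of the first <script> element containing keyword, or null if none does.
    static String scriptContaining(String html, String keyword) {
        // Consumes characters one at a time, refusing to step over a closing </script> tag.
        String body = "(?:(?!</script>).)*";
        Pattern pattern = Pattern.compile(
                "<script[^>]*>(" + body + Pattern.quote(keyword) + body + ")</script>",
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script>\n"
                + "<script type=\"text/javascript\">\n"
                + "//Roundtrip\n"
                + "rtTop = Number(new Date());\n"
                + "</script>";
        // Prints only the second script's body; the first script is skipped.
        System.out.println(scriptContaining(html, "rtTop"));
    }
}
```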
