Dividing a one-line HTML file into a well-formed HTML file - java

I have an HTML file in which all tags are in one line. I would like to separate each tag and put it on its own line. The end goal is to have a well-formed HTML file.
e.g.
<html><head><title>StackOverflow</title></head><body></body></html>
would be converted into:
<html>
<head>
<title>
StackOverflow
</title>
</head>
<body>
</body>
</html>
Is there an existing Java library that handles this already?

Your problem has nothing to do with well-formed HTML. Having all the tags on one line doesn't mean that the HTML is not well formed.
What you actually need is a formatter, which will make your HTML more human-readable.
You could take a look at JTidy, which can optionally check the syntax as well.
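If the input happens to be well-formed XML (as the example in the question is), the JDK's own XML Transformer can add line breaks without any extra library. This is only a sketch under that assumption, not a replacement for JTidy: real-world HTML with unclosed tags will be rejected, and the class name XmlIndenter is made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: re-serializes well-formed markup with line breaks. */
class XmlIndenter {

    /** Returns the input re-serialized with indentation enabled. */
    static String indent(String xml) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(indent(
            "<html><head><title>StackOverflow</title></head><body></body></html>"));
    }
}
```

The exact indentation depends on the JDK version, and empty elements may be written in self-closing form, so a dedicated HTML formatter is still the safer choice for pages served to browsers.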

Related

JSP data to be downloaded to Excel sheet using ActiveQuery results in character problems

Downloading data with Active Query from a JSP page with some parameters leads to character problems. Special characters in the German language, for example ö, ä, ß, are printed as ö, ä and ß.
Debugging the JSP page in Java shows that the result returned by the JSP page is correct. So the problem seems to be due to a conversion within Excel after the download, most probably because of an unsupported charset.
I tried to convert the result string in JSP to different charsets, but the problem still persists.
Does anyone know a solution?
Thank You very much in advance!
Did you try setting the encoding of the page?
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
...
If you can't find a solution on the Microsoft side, I'd recommend this alternative here:
http://poi.apache.org/
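For background, one common way umlauts get garbled is that text written as UTF-8 is read back with a single-byte charset such as ISO-8859-1. Whether that is what happens in this particular download is an assumption; the sketch below only demonstrates the effect, and the class name MojibakeDemo is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of a charset mismatch between writer and reader. */
class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1. */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its own encoding.
        String umlauts = "\u00f6\u00e4\u00df"; // o-umlaut, a-umlaut, sharp s
        String garbled = misread(umlauts);
        // Each non-ASCII character turns into two characters.
        System.out.println(umlauts.length() + " characters became " + garbled.length());
    }
}
```

With this mismatch, ö (U+00F6) becomes the two characters Ã¶. The fix is to make the producer and the consumer agree on a single charset, which is what the page directive above aims at on the JSP side.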

How to parse a webpage that includes Javascript? [duplicate]

I've got a webpage that creates a table using JavaScript. Right now I'm using JSoup in my Java project to parse the webpage. However, JSoup can't run JavaScript, so the table isn't generated and the source of the webpage is incomplete.
How can I include the HTML code created by that script in order to parse its content using JSoup? Can you provide a simple example? Thank you!
Webpage example:
<!doctype html>
<html>
<head>
<title>A blank HTML5 page</title>
<meta charset="utf-8" />
</head>
<body>
<script>
var table = document.createElement("table");
var tr = document.createElement("tr");
table.appendChild(tr);
document.body.appendChild(table);
</script>
<p>First paragraph</p>
</body>
</html>
The output should be:
<!DOCTYPE html>
<html>
<head>
<title>
A blank HTML5 page
</title>
<meta charset="utf-8"></meta>
</head>
<body>
<script>
var table = document.createElement("table");
var tr = document.createElement("tr");
table.appendChild(tr);
document.body.appendChild(table);
</script>
<table>
<tr></tr>
</table>
<p>
First paragraph
</p>
</body>
</html>
JSoup doesn't include the table tag because it can't execute JavaScript. How can I achieve this?
First possibility
You have some options outside Jsoup, i.e. employing a "real" browser and interacting with it. An excellent choice for this would be Selenium WebDriver. With Selenium you can use different browsers as the back end, and in your case the very lightweight HtmlUnit may already be enough. If more complicated JavaScript is called, there is often no other choice than running a full browser. Luckily, PhantomJS is out there and its footprint is not too bad (headless and all).
Second possibility
Another approach could be that you grab the javascript source with JSoup and start a JavaScript interpreter within Java. For that you could use Rhino. However, if you go that path you might as well use HtmlUnit directly, which is probably a bit less bulky.

Unable to understand how Playframework works

I installed Play Framework and have a question. I looked at the helloworld tutorial, but it seems to use Groovy.
@(message: String)
@main("Welcome to Play 2.1") {
@play20.welcome(message, style = "Java")
}
The first line is the function definition. What does play20 stand for? I am really new to Scala and I can't make head or tail of it.
@(title: String)(content: Html)
<!DOCTYPE html>
<html>
<head>
<title>@title</title>
<link rel="stylesheet" media="screen" href="@routes.Assets.at("stylesheets/main.css")">
<link rel="shortcut icon" type="image/png" href="@routes.Assets.at("images/favicon.png")">
<script src="@routes.Assets.at("javascripts/jquery-1.9.0.min.js")" type="text/javascript"></script>
</head>
<body>
@content
</body>
</html>
This is just standard HTML which accepts html content and a title string. But how is this file getting called from the index.scala.html?
The @play20.welcome() part calls a Scala method, not that different from Java.
As for the HTML templates, they're compiled into Scala classes as well, a bit like JSP is compiled into servlets.
The example you are referring to sounds like it's about Play 1, while the framework you are trying out is Play 2, which is a rather different thing. Play 2 has its own template engine.
The @ symbol signals the start of a Scala expression, like <?php ?> or <% %> in other languages. The only difference is that there is no closing symbol, because the template engine stops parsing at the end of the expression and automatically returns to treating the template as HTML.
play20 is an object that is in scope for the template engine, much as the classes in java.lang (e.g. String) are in scope in a regular Java file.
In this case play20 is like a class with a static method in Java.
In this tutorial you have a good, simple introduction to how to use the Play 2 framework

Play! framework. template "include"

I'm planning my website structure as following:
header.scala.html
XXX
footer.scala.html
Now, instead of "XXX" there should be a specific page (e.g. "UsersView.scala.html").
What I need is to include (as in other well-known languages) the source of the footer and the
header in the middle page's code.
So my questions are:
How do you include a page in another with Scala templating?
Do you think it's a good paradigm for a Play! framework based website?
Just call another template like a method. If you want to include footer.scala.html:
@footer()
A common pattern is to create a template that contains the boilerplate, and takes a parameter of type Html. Let's say:
main.scala.html
@(content: Html)
@header()
// boilerplate
@content
// more boilerplate
@footer()
In fact, you don't really need to separate out header and footer with this approach.
Your UsersView.scala.html then looks like this:
@main {
// all your users page html here.
}
You're wrapping the UsersView with main by passing it in as a parameter.
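As a rough analogy in plain Java, the layout behaves like a method that takes the page content as an argument and returns the full page. This is not Play's generated code, and the names LayoutSketch, mainLayout and usersView are made up for illustration.

```java
/** Hypothetical plain-Java analogy of a layout template wrapping a page. */
class LayoutSketch {

    /** Plays the role of main.scala.html: boilerplate around a content parameter. */
    static String mainLayout(String title, String content) {
        return "<!DOCTYPE html>\n"
             + "<html>\n"
             + "<head>\n<title>" + title + "</title>\n</head>\n"
             + "<body>\n" + content + "\n</body>\n"
             + "</html>\n";
    }

    /** Plays the role of UsersView.scala.html: passes its own markup to the layout. */
    static String usersView() {
        return mainLayout("Users", "<h1>Users</h1>");
    }

    public static void main(String[] args) {
        System.out.print(usersView());
    }
}
```

The page decides what goes inside, and the layout decides what goes around it, so the header and footer markup lives in one place only.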
You can see examples of this in the samples
My usual main template is a little more involved and looks roughly like this:
@(title: String)(headInsert: Html = Html.empty)(content: Html)(implicit user: Option[User] = None)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>@title</title>
// bootstrap stuff here
@headInsert
</head>
<body>
@menu(user)
<div id="mainContainer" class="container">
@content
</div>
</body>
</html>
This way a template can pass in a head insert and title, and make a user available, as well as content of course.
Play provides a very convenient way to implement that!
Layout part from the official docs:
First we have a base.html (that's what we call it in Django -_-)
// views/main.scala.html
@(title: String)(content: Html)
<!DOCTYPE html>
<html>
<head>
<title>@title</title>
</head>
<body>
<section class="content">@content</section>
</body>
</html>
How to use the base.html?
@main(title = "Home") {
<h1>Home page</h1>
}
More information here

How can I get content of HTML <body>

when I have html:
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
How can I get the content of the body with a DOM parser in Java:
text
<div>
text2
<div>
text3
</div>
</div>
because the getTextContent method returns: text text2 text3, so without the tags.
It is possible with SAX, but is it possible with DOM too?
getTextContent is behaving as I would expect: getting the textual content of the HTML fragment. Can you check the API docs for the DOM parser and see if there's a similar method with a name like getHtmlContent?
You would need to parse the document into a DOM and serialise only the portion of the DOM you wanted. Using the DOM Level 3 LS interfaces you can serialise the outer-XML of a single node with:
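// Needs org.w3c.dom.ls.DOMImplementationLS and org.w3c.dom.ls.LSSerializer; node must not be the Document itself.
DOMImplementationLS implementation = (DOMImplementationLS) node.getOwnerDocument().getImplementation();
LSSerializer serializer = implementation.createLSSerializer();
String html = serializer.writeToString(node);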
To get the inner-XML you would need to writeToString each child node in turn (eg. into a StringBuffer).
Depending on what DOM implementation you are using there may be alternative non-standard methods. There may also be risks with serialising HTML as XML, if that's what you're doing... eg. a standard XML serialiser may output a self-closing tag for an empty tag, which can confuse browsers parsing the output as legacy-HTML.
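Putting the pieces together, the inner-XML approach could look roughly like the sketch below. It assumes the input is well-formed XML (XHTML-style markup) and uses the JDK's built-in DOM implementation; the class and method names (BodyInnerXml, innerXml, bodyContent) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

/** Hypothetical helper: serializes the children of a node, tags included. */
class BodyInnerXml {

    /** Concatenates the serialized form of every child of the given node. */
    static String innerXml(Node parent) {
        DOMImplementationLS ls =
            (DOMImplementationLS) parent.getOwnerDocument().getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        // Without this, each serialized piece may start with an XML declaration.
        serializer.getDomConfig().setParameter("xml-declaration", false);
        StringBuilder sb = new StringBuilder();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            sb.append(serializer.writeToString(child));
        }
        return sb.toString();
    }

    /** Parses the markup and returns what is inside the first body element. */
    static String bodyContent(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Assumes a body element exists; item(0) is null otherwise.
            return innerXml(doc.getElementsByTagName("body").item(0));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bodyContent(
            "<html><head></head><body>text<div>text2<div>text3</div></div></body></html>"));
    }
}
```

The caveat about self-closing tags applies here as well: an empty element in the body may be written in self-closing form.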
