Matching multiline text using a regular expression in Java

My input sample is:
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">
<link rel=File-List href="detailedFoot_files/filelist.xml">
I want to select the whole opening html tag and replace it with something else, so I am using the regular expression
<html.*>
If I use this regular expression with Pattern.DOTALL, the whole input text is replaced.
I can't figure out how to do it. Any kind of help will be appreciated.

This regex seems to capture what you're looking for.
pattern = "\<html[^>]*>?(.*)"

If you want to replace only the opening html tag, the following will do it:
String replaced = Pattern.compile("<html[^>]+>", Pattern.DOTALL)
        .matcher(input).replaceFirst("my replacement for html tag");
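The original <html.*> fails with DOTALL because the greedy .* runs on to the last > in the input. A negated character class stops at the first > instead, and it matches line breaks on its own, so no flag is needed. Below is a minimal, self-contained sketch of that approach; the class and method names (HtmlTagReplacer, replaceOpeningHtmlTag) are made up for illustration, and it assumes no attribute value of the tag contains a > character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTagReplacer {
    // [^>]* matches any character except '>', including line breaks,
    // so the match ends at the first '>' after "<html".
    private static final Pattern OPENING_HTML_TAG = Pattern.compile("<html\\b[^>]*>");

    // Hypothetical helper: replaces the first opening html tag in the input.
    static String replaceOpeningHtmlTag(String input, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return OPENING_HTML_TAG.matcher(input)
                .replaceFirst(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String input = "<html xmlns:v=\"urn:schemas-microsoft-com:vml\"\n"
                + "xmlns=\"http://www.w3.org/TR/REC-html40\">\n"
                + "<head>\n"
                + "<meta name=ProgId content=Word.Document>\n";
        System.out.println(replaceOpeningHtmlTag(input, "<html>"));
    }
}
```

For anything beyond a one-off replacement like this, an HTML parser is usually the safer choice.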

Related

I am trying to include quotes in the href command for HTML and it won't work [duplicate]

How do I escape double quotes in an event handler in HTML?
For example, how do I properly escape the bar, which is a string literal, in the following code?
<button onclick="foo("bar")")>Click Me</button>
I can't use single quotes for the attribute value since I'm using XHTML. I could use single quotes for string literals, but I'd like to be consistent.
<button onclick="foo(&quot;bar&quot;);">Click Me</button>
And, you can mix them indeed in XHTML, try this in the W3 validator:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>foo</title>
</head>
<body>
<div onclick='bar("foo");'></div>
</body>
</html>
Some tutorials say single quotes are not valid, but they are incorrect.
XHTML prefers double quotes around attribute values, but you can still use single quotes inside the value. The following, for example, is valid XHTML 1.0 Strict:
<button onclick="foo('bar')">Click Me</button>
I would suggest looking into progressive enhancement and moving away from the behavioral attributes.

Replace custom tags in html with Java

Is there a Java library that can help me replace custom tags in HTML?
For example, here is a simple template:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>
<p>$welcome_title</p>
<p>$email_body</p>
<p>$footer_text</p>
</div>
</body>
</html>
Can I replace these custom tags ($welcome_title, $email_body, $footer_text) with values from Java?
The idea is to have a template with tags that can be replaced at runtime with values from Java objects :)
Also, if there is such a library, maybe it could generate a PDF document straight from the HTML.
Thanks :)
In the Java world you can use https://www.thymeleaf.org/ or https://freemarker.apache.org/
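If pulling in a template engine is more than you need, the $name placeholders in the question can also be filled with plain java.util.regex. This is a rough sketch of that alternative, not the Thymeleaf or FreeMarker API; the class name PlaceholderTemplate and the method render are invented here, and the values are inserted without HTML escaping.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderTemplate {
    // Matches $name, where name consists of letters, digits, or underscores.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$(\\w+)");

    // Hypothetical helper: substitutes every known placeholder in the template.
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unknown placeholders are left in the output unchanged.
            String value = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("welcome_title", "Welcome!");
        values.put("email_body", "Your account is ready.");
        values.put("footer_text", "See you soon.");
        String template = "<p>$welcome_title</p>\n<p>$email_body</p>\n<p>$footer_text</p>";
        System.out.println(render(template, values));
    }
}
```

A real template engine adds escaping, loops, and conditionals, which is why the libraries above are the better fit once templates grow.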

Remove text from a node but not descendant nodes

I have an XML document containing HTML data, and I am trying to remove the free text inside the 'BODY' tag without removing the contents of the child 'DIV' tag. So far I have used removeChild(), which also removed everything else inside BODY.
Then I tried filtering on NODE_TYPE==3 to remove only the text content, but I get NODE_TYPE==1 when running it.
When I use setTextContent(), it replaces the whole tag's content with my input string.
This is what my XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<HTML>
<HEAD>
<META content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</HEAD>
<BODY>
<DIV class="WordSection1">
<P>Enter Text here</P> <P>COMPLETED</P>
</DIV>
TEXT I WANT TO REMOVE
</BODY>
</HTML>
After changes, I need output like this :
<?xml version="1.0" encoding="UTF-8"?>
<HTML>
<HEAD>
<META content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</HEAD>
<BODY>
<DIV class="WordSection1">
<P>Enter Text here</P> <P>COMPLETED</P>
</DIV>
</BODY>
</HTML>
Any suggestions?
I understand you're using the 'old' org.w3c.dom library that comes with Java. Assuming you read the document content into a Document doc, you could do:
Node textNode = doc.getDocumentElement().getLastChild().getPreviousSibling().getLastChild();
doc.getDocumentElement().getLastChild().getPreviousSibling().removeChild(textNode);
...although this is not robust against changes to the input XML.
You might want to try a different XML API (e.g. JDOM). The old one often doesn't make your life very easy.
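A less position-dependent variant with the same org.w3c.dom API is to walk the direct children of BODY and remove only those whose type is Node.TEXT_NODE (3). BODY itself is an element (type 1), which would explain the NODE_TYPE==1 result if the check was made on BODY rather than on its children. The sketch below assumes the document is parsed from a string; DirectTextRemover, removeDirectText, and parse are illustrative names. It also removes whitespace-only text nodes (indentation) directly under the element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirectTextRemover {
    // Removes text nodes that are direct children of parent; descendants are untouched.
    static void removeDirectText(Element parent) {
        // Collect first: the NodeList is live, so removing while iterating skips nodes.
        List<Node> toRemove = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                toRemove.add(child);
            }
        }
        for (Node node : toRemove) {
            parent.removeChild(node);
        }
    }

    // Hypothetical helper: parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<HTML><BODY><DIV class=\"WordSection1\">"
                + "<P>Enter Text here</P> <P>COMPLETED</P></DIV>"
                + "TEXT I WANT TO REMOVE</BODY></HTML>");
        Element body = (Element) doc.getElementsByTagName("BODY").item(0);
        removeDirectText(body);
        System.out.println(body.getTextContent());
    }
}
```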

How to split an HTML file into multiple according to length of characters in java

Please help me split a large HTML file into multiple HTML files using Java.
It is a tricky algorithm. I have tried, but only got so far. Please help me.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<link href="template.css" rel="stylesheet" type="text/css"/>
<link href="page-template.xpgt" rel="stylesheet" type="application/vnd.adobe-page-template+xml"/>
</head>
<body>
<div class="story">
<p class="cn">2</p>
<p class="img"><img src="images/common.jpg" alt=""/></p>
<p class="ct"> some text!</p>
<p class="tx"><span class="dropcap"> some text</span> some text!</p>
<p class="tx"> some text!</p>
<p class="img"><img src="images/ch02-fig1.jpg" alt=""/></p>
<p class="tx"> some text some text some text some text.</p>
<p class="img"><img src="images/ch02-fig2.jpg" alt=""/></p>
<p class="tx"> some text some text some text some text </p>
<p class="tx"> some text some text some text </p>
<p class="tx"> some text some text some text some text.</p>
<p class="img"><img src="images/ch02-fig3.jpg" alt=""/></p>
<p class="tx"> some text!</p>
<p class="tx">
</p>
</div>
</body>
</html>
This is my HTML file. It should be split according to the count of "some text"!
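You can use the following logic:
// Requires java.nio.file.*, java.nio.charset.StandardCharsets, java.util.List and java.util.regex.*
List<String> lines = Files.readAllLines(
        FileSystems.getDefault().getPath("yourhtmlfile"),
        StandardCharsets.UTF_8);
// Compile the pattern once, outside the loop; sometext_pattern is your own regex string.
Pattern splitPattern = Pattern.compile(sometext_pattern);
for (String htmlData : lines) {
    Matcher match = splitPattern.matcher(htmlData);
    while (match.find()) {
        String lineToBeSplit = match.group();
    }
    // ...
}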
"lineToBeSplit" will have the split data.
Your question is pretty vague :).
On splitting a String (HTML in this case):
The easiest way is to read the HTML file as text into a String, then use the String.split() method to split the string around the desired regex. For example, .split("/div") gives you a crude approach where your HTML is broken up into "divs" (assuming you even have divs in your HTML). However, this will work badly for nested divs.
On reading/writing files: Reading a plain text file in Java
You will also find a heap of HTML parsers on the net that will most likely work ten times better in your case.
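Since the question's title asks for splitting by character length, one rough way to combine the two answers is to collect whole <p>...</p> blocks with a regex and group them into chunks that stay under a size limit. This is only a sketch under assumptions: ParagraphChunker, chunk, and maxChars are names invented here, paragraphs are assumed not to be nested, and a single paragraph longer than the limit still becomes its own chunk. Wrapping each chunk in the head and body markup of the original file is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphChunker {
    // (?s) lets '.' span line breaks; the reluctant .*? stops at the first </p>.
    private static final Pattern PARAGRAPH = Pattern.compile("(?s)<p\\b.*?</p>");

    // Groups consecutive paragraphs into chunks of at most maxChars characters.
    static List<String> chunk(String html, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String paragraph = m.group();
            // Start a new chunk when adding this paragraph would exceed the limit.
            if (current.length() > 0 && current.length() + paragraph.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(paragraph);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String html = "<div class=\"story\">\n"
                + "<p class=\"cn\">2</p>\n"
                + "<p class=\"ct\"> some text!</p>\n"
                + "<p class=\"tx\"> some text some text some text some text.</p>\n"
                + "</div>";
        for (String part : chunk(html, 60)) {
            System.out.println(part);
            System.out.println("----");
        }
    }
}
```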

How to get the xpath for whole website in java or javascript

Hi, I want to get the XPath of every element of a whole website when I give a URL. For example,
I have an HTML file like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<h4>Test</h4>
<input type="text" id="firstname" name="first" value="" />
</body>
</html>
I need output in the following format:
/html/body/h4
//*[@id="firstname"]
How can this be done using JavaScript or Java?
It is certainly possible to generate XPath expressions for elements in a document, but such an algorithm would usually implement one approach: if the result for the h4 element is /html/body/h4, then the result for the input element would be /html/body/input. Also, your posted sample is an XHTML document whose elements are in the namespace http://www.w3.org/1999/xhtml, which requires the use of a prefix in the XPath expressions with XPath 1.0. So far your requirements are not well defined.
To give you a sample, see http://home.arcor.de/martin.honnen/javascript/storingSelection1.html for a function makeXPath which you could call for all elements in document.body.getElementsByTagName('*'). That function might do in a text/html context; it ignores namespaces. It also uses the DOM Level 3 XPath API, which is not supported in IE. So take that as an idea on how to approach the problem, not as a complete answer to implement your requirement.
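On the Java side, the same idea can be sketched with org.w3c.dom: visit every element and build a path from its ancestors, adding a positional index only where siblings share a name. This is not the makeXPath function from the link; XPathLister, pathOf, and allPaths are names invented for this sketch. It assumes a well-formed document parsed without namespace awareness (the JAXP default), so, like the JavaScript sample, it ignores namespaces, and it always produces the /html/body/... style rather than //*[@id="..."].

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathLister {
    // Builds an absolute path such as /html/body/p[2] for one element.
    static String pathOf(Element element) {
        String path = "";
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            int position = 1;          // 1-based position among same-named siblings
            boolean hasTwin = false;   // true if any sibling shares this name
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element && s.getNodeName().equals(n.getNodeName())) {
                    position++;
                    hasTwin = true;
                }
            }
            for (Node s = n.getNextSibling(); s != null && !hasTwin; s = s.getNextSibling()) {
                if (s instanceof Element && s.getNodeName().equals(n.getNodeName())) {
                    hasTwin = true;
                }
            }
            path = "/" + n.getNodeName() + (hasTwin ? "[" + position + "]" : "") + path;
        }
        return path;
    }

    // Returns the paths of all elements in document order.
    static List<String> allPaths(Document doc) {
        List<String> paths = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            paths.add(pathOf((Element) all.item(i)));
        }
        return paths;
    }

    // Hypothetical helper: parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>Untitled Document</title></head>"
                + "<body><h4>Test</h4><input type=\"text\" id=\"firstname\"/></body></html>");
        for (String path : allPaths(doc)) {
            System.out.println(path);
        }
    }
}
```

Real-world pages are often not well-formed XML, so an HTML parser would be needed in front of this for arbitrary URLs.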
