Merging HTML files in Java

I want to merge multiple HTML files into one. For example, if I have two HTML files that print WELCOME and XYZ respectively, can I merge them into a single file that shows WELCOME XYZ? I need to do this for many files, say 1500.
Appreciate any help.

You might use an HTML parsing/manipulation API such as jsoup.

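Create one HTML file and include the other files in it with server-side include (SSI) directives like the ones below. This only works when the page is served by a web server that has SSI enabled.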
<!--#include virtual="insertthisfile1.html" -->
<!--#include virtual="insertthisfile2.html" -->
<!--#include virtual="insertthisfile3.html" -->
<!--#include virtual="insertthisfile4.html" -->

You can do this by iterating over all your files and programmatically appending the contents of each <body>...</body> element together.
1. Collect all the HTML file names into an ArrayList<String>.
2. Create one StringBuilder.
3. Read each HTML file line by line until you find the line with the opening body tag.
4. Keep reading from that tag until you find the line with the closing body tag.
5. Append this content to the StringBuilder.
6. After all files are read, write the StringBuilder content to one file.
At the end you will have a single HTML file.
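The steps above could be sketched roughly as follows, using only the standard library. This is an illustration under some assumptions, not a definitive implementation: the class and method names (HtmlMerger, extractBody, mergeBodies, mergeFiles) are made up for this example, the files are assumed to be UTF-8, and each file is read whole and matched with a regular expression instead of being scanned line by line. A regex is fine for simple, well-formed pages, but a real HTML parser is safer for messy markup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class HtmlMerger {

    // Matches <body ...> ... </body>, case-insensitively and across line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*?)</body\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text between the body tags, or the whole input if there is no body element.
    static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : html.trim();
    }

    // Appends the body content of every page to one StringBuilder and wraps it in a new document.
    static String mergeBodies(List<String> pages) {
        StringBuilder sb = new StringBuilder("<html>\n<body>\n");
        for (String page : pages) {
            sb.append(extractBody(page)).append('\n');
        }
        return sb.append("</body>\n</html>\n").toString();
    }

    // Reads every input file (assumed UTF-8), merges the bodies, and writes one output file.
    static void mergeFiles(List<Path> inputs, Path output) throws IOException {
        List<String> pages = new ArrayList<>();
        for (Path input : inputs) {
            pages.add(new String(Files.readAllBytes(input), StandardCharsets.UTF_8));
        }
        Files.write(output, mergeBodies(pages).getBytes(StandardCharsets.UTF_8));
    }

    // Usage: first argument is the output file, the remaining arguments are the input files.
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java HtmlMerger.java <output.html> <input.html>...");
            return;
        }
        List<Path> inputs = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            inputs.add(Paths.get(args[i]));
        }
        mergeFiles(inputs, Paths.get(args[0]));
    }
}
```

Anything in the <head> of the input files (titles, styles, scripts) is dropped by this sketch, so pages that depend on their own CSS will need extra handling.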

Related

Localised Text Automatically gets changed when I rewrite html file

I have a project where I create an HTML file. After some calculation I read that HTML file and replace one column value, but I don't know why this disturbs my localised text.
e.g.
Step 1> I create an HTML file like the one below
Step 2> My project keeps running, and afterwards I replace this column of the HTML file
with this
Using code :
But I don't know why my language characters change like below:

How to write data to pdf file which contains html tags using itext lib in Java

I have a String that contains some HTML tags and comes from a database. I want to write it to a PDF file with the styling that the HTML tags in the String describe. I tried to use XMLWorkerHelper like this:
String html = "What is the equation of the line passing through the point (2,-3) and making an angle of -45<sup>2</sup> with the positive X-axis?";
XMLWorkerHelper.getInstance().parseXHtml(writer, document, new StringReader(html));
but it only reads the data that is inside the HTML tag (in this case only 2) and simply ignores the rest of the string. I want the entire String with its HTML formatting.
With HTMLWorker it works perfectly, but that class is deprecated, so please let me know how to achieve this.
I am using the iText 5 library.

Restoring the deleted tag elements from a HTML file using jsoup

I read an HTML file from a folder, delete some unwanted HTML tags from it, and save the modified HTML file.
I have done all of this with the jsoup parsing library. The problem is this: if in future I want to take some tags off the unwanted list, how should I do that? Once I have deleted the unwanted tags, the modified HTML no longer contains that content.
Keep the original file under a fixed name:
full_featured_template.html
Then parse it with jsoup and save the result as
template_version_1.html
In future, parse the original again and save the result as
template_version_2.html

Extract the first page content from docx file by XML parsing

I need to extract the first page content from a docx file and save it as a separate document. I need everything from the first page (images, tables, text) to be saved as it is in a new docx file.
What I tried:
I looked into the XML of the unzipped docx file. Since a Word document is reflowable, I couldn't find a page break where each page ends, so I couldn't locate the end of each page via document.xml.
Is there any way to get the XML content of only the first page of the document using a Java XML DOM parser?
Do not write a new parser; there are plenty of existing tools for that (e.g., what if your input changes from XML to binary Word files?).
Use Apache POI, for example, as @JFB suggested.

process HTML files using hadoop map reduce

I have an input folder in HDFS which contains thousands of HTML files:
/data/htmls/1/(HTML files)
/data/htmls/2/(HTML files)
.
.
/data/htmls/n/(HTML files)
I have a Java function that takes an HTML file as input and parses it. I want to read these HTML files in the mapper function and feed them as input to the parser function. Since the map function processes input files line by line, is there a way to work with a whole HTML file?
I'm not sure how well it would work, but the Mahout XmlInputFormat is a decent XML reader. You might be able to adapt it to work for HTML.
In the configuration, set the following before creating the Job object:
conf.set("xmlinput.start", "<tag>");
conf.set("xmlinput.end", "</tag>");
Then set the input format class as follows after creating the Job object:
job.setInputFormatClass(XmlInputFormat.class);
This will select everything inside the specified tag as a single input string.
For instance, if you select <html> and </html> (or <body> and </body>, or any other matched pair of tags) as the start and end tags, you should get everything between them as a single record, passed to the mapper.
Hope this is helpful.
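To make the "one record per matched tag pair" idea concrete, here is a small plain-Java sketch of that kind of splitting. It is not Mahout's implementation and does not use Hadoop; the class name TagRecordSplitter and its split method are hypothetical. It assumes the start and end tags are matched as literal strings (so <body> would not match <body class="x">) and that the tags themselves are kept in each record; check how your XmlInputFormat version behaves on both points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of start/end tag record splitting; not the Mahout code.
class TagRecordSplitter {

    // Returns each substring that begins with startTag and ends with the next endTag, tags included.
    static List<String> split(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) {
                break; // no further start tag
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record, ignored in this sketch
            }
            end += endTag.length();
            records.add(input.substring(start, end));
            from = end; // continue searching after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "<html><body>A</body></html>\n<html><body>B</body></html>";
        // Each element printed here corresponds to one record a mapper would receive.
        for (String record : split(input, "<body>", "</body>")) {
            System.out.println(record);
        }
    }
}
```

Each record can then be handed to the existing parser function inside the mapper, instead of the line-by-line input that the default text input format would deliver.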
