Process HTML files using Hadoop MapReduce - Java

I have an input folder in HDFS which contains thousands of HTML files:
/data/htmls/1/(HTML files)
/data/htmls/2/(HTML files)
.
.
/data/htmls/n/(HTML files)
I have a Java function which takes an HTML file as input and parses it. I want to read these HTML files in the mapper function and feed them as input to the parser function. Because input files are processed line by line by the map function, is there a way to work with a whole HTML file?

I'm not sure how well it would work, but the Mahout XmlInputFormat is a decent XML reader. You might be able to adapt it to work for HTML.
In the configuration set the following before creating the Job object:
conf.set("xmlinput.start", "<tag>");
conf.set("xmlinput.end", "</tag>");
Then, after creating the Job object, set the input format class:
job.setInputFormatClass(XmlInputFormat.class);
This will select everything inside of the specified tag as a single input string.
For instance, if you select <html> and </html> (or <body> and </body>, or any other matched pair of tags) as the start and end tags, you should get everything inside of that pair as a single record, passed to the mapper.
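To make the start/end tag idea concrete, here is a plain-Java sketch of delimiter-based record splitting. It is not Mahout's code and does not use Hadoop; the class name TagRecordSketch and the method extractRecords are made up for illustration, and keeping the delimiters inside each record is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class TagRecordSketch {

    // Returns every span that begins with startTag and ends with endTag.
    // The delimiters are kept in each record (an assumption of this sketch).
    public static List<String> extractRecords(String text, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) {
                break; // no further record
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: ignored
            }
            end += endTag.length();
            records.add(text.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "<html>\n<body>first</body>\n</html>\n"
                     + "<html>\n<body>second</body>\n</html>\n";
        for (String record : extractRecords(input, "<html>", "</html>")) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```

The sketch matches the tags as exact strings, so an opening tag with attributes such as <html lang="en"> would not match a start tag of "<html>". It is worth checking how your real files open their tags before choosing the delimiters.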
Hope this is helpful.

Related

Java - Write into HTML files

I am looking to write inside an HTML file using Java.
I have my index.html page ready, and I would like to use it as a template and add a list of names (with hyperlinks to their pages) at a certain place in this page.
Is it possible to use markers or tags to tell Java to write to this exact location in the HTML file?
I will use this kind of Java code to write; the array will actually be an array of names, but this is the idea:
String[] labelEquipment = { "thing1", "thing2", "thing3", "thing4",
        "thing5", "thing6", "thing7", "thing8", "thing9",
        "thing10" };
PrintWriter f0 = new PrintWriter(new FileWriter("filename.txt"));
for (String string : labelEquipment) {
    f0.println(string);
}
f0.close();
You can create an HTML file like this:
<body>
{{my_placeholder}}
</body>
Using Java, you can read this file as a string and then use .replace("{{my_placeholder}}", your_content) to replace the placeholder with the individual equipment labels. The variable your_content has to be the group of HTML tags that will be placed in your HTML in place of {{my_placeholder}}.
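A minimal sketch of the placeholder approach, using only the standard library. The class and method names (PlaceholderSketch, buildLinks, fill) and the "<name>.html" link target are assumptions made for illustration. It also assumes the names contain no characters that need HTML escaping.

```java
import java.util.Arrays;
import java.util.List;

public class PlaceholderSketch {

    // Builds an HTML list with one hyperlink per name.
    // The "<name>.html" target is a hypothetical naming convention.
    public static String buildLinks(List<String> names) {
        StringBuilder sb = new StringBuilder("<ul>\n");
        for (String name : names) {
            sb.append("  <li><a href=\"").append(name).append(".html\">")
              .append(name).append("</a></li>\n");
        }
        return sb.append("</ul>").toString();
    }

    // Replaces every occurrence of the placeholder with the given content.
    public static String fill(String template, String content) {
        return template.replace("{{my_placeholder}}", content);
    }

    public static void main(String[] args) {
        // In a real program the template would be read from index.html,
        // for example with java.nio.file.Files.readAllBytes.
        String template = "<body>\n{{my_placeholder}}\n</body>";
        List<String> names = Arrays.asList("thing1", "thing2", "thing3");
        System.out.println(fill(template, buildLinks(names)));
    }
}
```

String.replace treats the placeholder as a literal string, not a regular expression, so the curly braces need no escaping.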

Localised text automatically gets changed when I rewrite HTML file

I have a project where I create an HTML file. After some calculation I read that HTML file and replace one column value, but I don't know why it disturbs my localised text.
e.g.
Step 1> I create an HTML file which is like below
Step 2> Then my project keeps running, and after that I replace this column of the HTML file
with this
Using code:
But I don't know why my language characters change like below:

Extract the first page content from docx file by XML parsing

I need to extract the first page content from the docx file and save it as a separate document. I need everything from the first page (images, tables, text) to be saved as it is in a new docx file.
What I tried:
I looked into the XML of the unzipped docx file. Since a Word document is reflowable, I couldn't find a page break where each page ends, so I couldn't find the end of each page via document.xml.
Is there any way to get the XML content of the first page of the document alone using a Java XML DOM parser?
Do not write a new parser, there are tons of already existing tools for that (e.g., what if your input changes from XML to binary Word files?).
Use Apache POI for example, as @JFB suggested.

How to parse JSP pages into an XML file?

I am trying to convert a JSP page document into an XML file. I have been using jsoup, and it reads the whole content well except for the server tags, but I can't understand how the whole HTML can be converted to XML tags. I mean, how can I fetch the data line by line?
My Code:
File html = new File("genXML.jsp");
Document doc = Jsoup.parse(html, "UTF-8", "http://www.example.com");
System.out.println(doc.html());
Any assistance would be great
First of all, converting JSP to XML is not the same as converting HTML to XML. I suppose you want to translate the HTML generated from a JSP to XML. Second, you don't want to do this line by line: an HTML block usually does not begin and end on a single line.
Anyway, you could use a tool like TagSoup to convert the HTML code to XHTML, which is actually XML. I don't know if TagSoup has a useful API, but at least it could be called from your code as an external process using something like this:
Process tr = Runtime.getRuntime().exec(new String[]{ "..." } );
Then, if you want to transform it to a target XML schema, you could apply an XSLT transformation using one of the tools available online. You could also apply the XSLT transformation programmatically using JAXP.
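As a rough sketch of the JAXP route, the following applies a stylesheet to an XHTML string with the javax.xml.transform classes from the JDK. The class name XsltSketch, the sample document and the <page>/<para> target format are invented for illustration. The sample input has no namespace; real XHTML elements live in the http://www.w3.org/1999/xhtml namespace, so a stylesheet for real XHTML would need to bind a prefix to it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Applies the given XSLT stylesheet to the given XML and returns the result as a string.
    // Checked exceptions are rethrown unchecked to keep the sketch short.
    public static String transform(String xml, String xslt) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                    factory.newTransformer(new StreamSource(new StringReader(xslt)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input; in practice this would be the XHTML produced from the JSP output.
        String xhtml = "<html><body><h1>Title</h1><p>one</p><p>two</p></body></html>";
        // Hypothetical stylesheet mapping the page to an invented <page> format.
        String xslt =
              "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:template match=\"/\">"
            + "<page><title><xsl:value-of select=\"html/body/h1\"/></title>"
            + "<xsl:for-each select=\"html/body/p\">"
            + "<para><xsl:value-of select=\".\"/></para>"
            + "</xsl:for-each></page>"
            + "</xsl:template></xsl:stylesheet>";
        System.out.println(transform(xhtml, xslt));
    }
}
```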
Hope I helped!

Merging HTML files in Java

I want to merge multiple HTML files into one. For example, if I have two HTML files which print WELCOME and XYZ respectively, can I merge these two files into one which shows WELCOME XYZ together? I want to do this for many files, say 1500.
Appreciate any help.
You might use an HTML parsing/manipulation API such as JSoup.
Create one HTML file and include the other files in it using server-side include directives like the following:
<!--#include virtual="insertthisfile1.html" -->
<!--#include virtual="insertthisfile2.html" -->
<!--#include virtual="insertthisfile3.html" -->
<!--#include virtual="insertthisfile4.html" -->
You can get this done by iterating over all your files and appending the contents of their <body>...</body> tags together programmatically:
1. Get all the HTML file names into an ArrayList<String>.
2. Create one StringBuilder.
3. Read each HTML file line by line until you find a line with the opening body tag.
4. Read from that tag until you find a line with the closing body tag.
5. Append this content to the StringBuilder.
6. After all files are read, write the StringBuilder content to one file.
At the end you will have one single HTML file.
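The steps above can be sketched roughly as follows. This version reads each file into a string and uses a regular expression instead of scanning line by line, which is a substitution made for brevity. The names MergeBodiesSketch, extractBody and merge are made up, and a real parser such as jsoup would cope better with irregular markup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MergeBodiesSketch {

    // Matches the content between <body ...> and </body>, across lines, ignoring case.
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed body content, or an empty string if there is no body element.
    public static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Wraps the concatenated body contents of all pages in one minimal HTML document.
    public static String merge(List<String> pages) {
        StringBuilder sb = new StringBuilder("<html>\n<body>\n");
        for (String page : pages) {
            sb.append(extractBody(page)).append('\n');
        }
        return sb.append("</body>\n</html>\n").toString();
    }

    public static void main(String[] args) throws IOException {
        List<String> pages = new ArrayList<>();
        if (args.length == 0) {
            // Demo input when no file names are given on the command line.
            pages = Arrays.asList("<html><body>WELCOME</body></html>",
                                  "<html><body>XYZ</body></html>");
        } else {
            for (String fileName : args) {
                byte[] bytes = Files.readAllBytes(Paths.get(fileName));
                pages.add(new String(bytes, StandardCharsets.UTF_8)); // assumes UTF-8 files
            }
        }
        System.out.print(merge(pages));
    }
}
```

For 1500 files, holding every body in memory at once should be fine for ordinary pages; for very large files, writing each body to the output as soon as it is read would be the safer design.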
