Read the contents from HTML and write to an Excel sheet in Java

Please help me read an HTML file and write its contents to an Excel file using Java. I have searched the net, but could only find examples for copying tables; I need the data itself to be read and written to an Excel file.
Contents of the HTML file:
<title>Deprecated Method Found</title>
<h2>Summary</h2>
Deprecated API is error-prone and is a potential security threat and thus should not be used.
<h2>Description</h2>
Old API is sometimes marked deprecated because its implementation is designed in a way that can be error-prone. Deprecated API should be avoided where possible.
<h2>Security Implications</h2>
Blocks of code that use deprecated API are designed in a careless manner and thus are a potential security threat.
I need these separate headings in separate columns and their contents in rows.
Is it possible to parse this HTML into an Excel file?

Try reading the HTML using jsoup, a Java HTML parser. You can then save it in CSV (comma-separated values) format, which can be opened in Excel.
See also: jsoup to read text after element.
Get all h2 elements using:
Elements h2s = document.select("h2");
Then iterate over the elements, reading what each heading says. If a heading is important, use the following code to get the text that follows the tag:
String text = h2.nextSibling().toString();
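Putting the two snippets together, a minimal sketch of the heading-to-column idea might look like the following. To keep it runnable with only the JDK, a regex stands in for jsoup's `select("h2")` and `nextSibling()` calls; with jsoup on the classpath you would use those calls instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSectionToCsv {

    // Map each <h2> heading to the text that follows it, mimicking
    // document.select("h2") + h2.nextSibling() from the answer above.
    static Map<String, String> parseSections(String html) {
        Map<String, String> sections = new LinkedHashMap<>();
        Pattern p = Pattern.compile("<h2>(.*?)</h2>([^<]*)", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        while (m.find()) {
            sections.put(m.group(1).trim(), m.group(2).trim());
        }
        return sections;
    }

    // Headings become the first CSV row (columns); contents become the second row.
    static String toCsv(Map<String, String> sections) {
        String header = String.join(",", sections.keySet());
        StringBuilder row = new StringBuilder();
        for (String v : sections.values()) {
            if (row.length() > 0) row.append(',');
            row.append('"').append(v.replace("\"", "\"\"")).append('"');
        }
        return header + "\n" + row;
    }
}
```

Opening the resulting CSV in Excel gives one column per heading, with the section text in the row below.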
Sample CSV opened in Excel:

Related

converting .prn file in to html page using java

How can I convert a .prn file into an HTML page using Java?
I am treating it as a text file and reading it line by line, but that is quite cumbersome, as each line requires its own splitting logic. Since the .prn file is nicely formatted, can we directly extract the file and load it as HTML? Any suggestions?
Since a .prn file is a byte stream that is sent to a printer for printing, I think you are going to have to keep using your custom parser; the Java Print Service does not appear to have any options for parsing.
If the tags are consistent with other file formats, it may be worthwhile to check out other parsing libraries, such as simple.json, and modify them to your needs.

Why does extracting tables from a converted docx work better than from the original PDF?

I'm trying to perform table extraction from PDFs automatically. I know there are several libraries and methods in Java and Python, but to my surprise, the method that has worked best for me is to convert my PDF to a docx document and extract the tables from there (thanks to: How to get pictures and tables from .docx document using apache poi?).
My question is this: assuming that the format conversion may lose information, why are my results better this way? Tabula hasn't been able to do better automatically. To understand this, I have looked for information (e.g. Extracting table contents from a collection of PDF files), but I'm still very confused.
PS: So far I have used https://github.com/thoqbk/traprange (a method based on PDFBox), How to extract table as text from the PDF using Python? (PyPDF2), and Tabula. When I get home I will post code and cases; I'm writing from my smartphone.

Read a list of websites, get rid of HTML tags and write it all into a txt file

I am trying to get a list of websites to be read one at a time and printed to a single file. I would also like the HTML tags to be stripped, which I plan to use jsoup for. How would I do this before writing the content to the file?
The Exception is quite self-explanatory.
There is no next element because, quoting the API:
if no more tokens are available
Wrap your assignment in a while (myScanner.hasNext()) loop after initializing your Scanner.
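A sketch of both parts, assuming the site list is whitespace-separated: the hasNext() guard from the answer above, plus a crude regex as a stand-in for jsoup's `Jsoup.parse(html).text()` so the example runs with only the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class SiteListReader {

    // Read whitespace-separated URLs, guarding each read with hasNext()
    // so the Scanner never throws NoSuchElementException.
    static List<String> readUrls(Scanner scanner) {
        List<String> urls = new ArrayList<>();
        while (scanner.hasNext()) {
            urls.add(scanner.next());
        }
        return urls;
    }

    // Crude tag removal; with jsoup you would call Jsoup.parse(html).text()
    // instead, which also handles entities and malformed markup.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }
}
```

You would fetch each URL's HTML (e.g. with URLConnection), pass it through stripTags, and append the result to the output file.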

Reading the content of file section wise in java

I want to read the content of files such as doc, pdf, and ppt section by section or paragraph by paragraph in Java, because I want to retrieve a particular section of a file (if it has one) instead of retrieving the content of the whole file. Can anyone tell me how I can read the content of a file either section-wise or paragraph-wise?
Thanks
This depends entirely on the format of the file in question. For example, when you have a .docx file, you can employ some XML parser and then iterate through the result or use XPath to find all paragraphs, sections or whatever you wish to extract.
For other file formats you will have to find a different approach. There is no single way to extract a specific part of any file, as different file types have different ways of storing data. Most likely, you will have to collect a bunch of libraries, one for each file type.
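For the .docx case, the XPath approach can be sketched as follows. A .docx is a ZIP archive whose main content lives in word/document.xml; the ZIP-reading step (java.util.zip.ZipFile) is omitted here, and the method takes the XML payload directly. Paragraphs are the w:p elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DocxParagraphs {

    // Extract the text of every <w:p> (paragraph) element from the
    // word/document.xml payload of a .docx file.
    static List<String> paragraphs(String documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false); // lets XPath match on the raw name "w:p"
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[name()='w:p']", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }
}
```

The same pattern (find the container elements, then take their text content) applies to any XML-based format; binary formats like .doc or .pdf need a dedicated library instead.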

Convert Semicolon Text to Excel

I need to convert a semicolon-delimited file to an Excel file.
However, there are some fields that must be removed from the Excel file, and some extra fields to be added. These extra fields are drop-down fields.
Is there any way to do this? The preferred programming language is Java, but I also welcome the possibility of using an Excel macro.
I'm pretty sure you can do this with vanilla Excel. You can either do a global search-and-replace of semicolons with commas and just open it as CSV, or use the "Text to Columns" feature.
EDIT: I've not done this programmatically in Java, but in Perl it should be pretty straightforward with Text::xSV and Spreadsheet::WriteExcel
You could look at opencsv and HSSF.
I'm not sure how to do this with an Excel macro, but for Java:
Read the file with a FileReader
Use a StringTokenizer with a ";" delimiter to separate the fields
Make an array of custom objects, one per row; each object can store arrays of the data needed to populate the drop-down box
Use Apache POI to create the Excel spreadsheet (there are lots of POI examples on Stack Overflow)
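The reading-and-splitting steps can be sketched with the JDK alone; the POI step is omitted, and the drop-column indices are hypothetical. Note that split(";", -1) is used instead of StringTokenizer, because the tokenizer silently skips empty fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SemicolonRows {

    // Split one semicolon-delimited line, then drop the columns that
    // should not appear in the Excel output (indices are hypothetical).
    static List<String> parseLine(String line, int... dropColumns) {
        int[] drops = dropColumns.clone();
        Arrays.sort(drops); // binarySearch below requires sorted input
        String[] fields = line.split(";", -1); // -1 keeps trailing empty fields
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (Arrays.binarySearch(drops, i) < 0) {
                kept.add(fields[i].trim());
            }
        }
        return kept;
    }
}
```

Each resulting list would then become one row written via POI, with the extra drop-down cells appended using POI's data-validation support.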
You have two options:
Use Apache POI to write and customise the XLS
Create a sample spreadsheet in Excel, but save it as an HTML page. Take the saved HTML and use it as a template for your data. You can save the output (template + data) as a file with an .xls suffix. Even though its content is really HTML, it will open correctly.
If you use CSV you won't be able to get additional features such as drop-downs or styling.
