PDF to Lucene document conversion using PDFBox [closed] - java

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
PDFBox provides classes to convert a PDF to a Lucene document. Does it preserve the formatting of the document? By formatting I mean: does it store details about location, font type/size, and other options?

By default, it will remove all formatting and extract only textual content and make it searchable. This content can be searched, and the original PDF can be maintained external to the index and returned with search results when a hit has been found. Rebuilding a PDF from the Lucene index may not be the best approach, if that is your intent.
PDFBox is quite capable of extracting metadata, though, and it can certainly be used to index formatting / font / etc data, if you wish to be able to search on that sort of thing.

Related

Save math formula into database [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have a database of questions in SQLite that I read from my Java code. The questions are mathematics questions, and I store them as text in the database.
My problem is that in several questions I want to include mathematical symbols and operators that I cannot type with my keyboard and provide as plain text.
Is there a way to include my math symbols inside the database?
Or do I have to do it from my code?
I see two options.
1. Unicode
In SQLite, choose a text type that stores Unicode. Unicode has plenty of math symbols: https://www.wikiwand.com/en/Mathematical_operators_and_symbols_in_Unicode
If Unicode does not contain the symbol you need, you will need some other solution.
To store text as Unicode, see http://www.sqlite.org/datatype3.html and https://stackoverflow.com/a/19397231/1849837
2. Math language
Store the equations in a math language that is able to describe them. With this approach, for your equations to look nice you will need something that translates the equation description (pure text) into an equation image.
If you ask this question on https://mathematica.stackexchange.com/, you will probably be advised on a component and data format that fit your needs.
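A minimal sketch of the Unicode option in Java. It assumes a JDBC driver for SQLite is on the classpath and a hypothetical table `questions(body TEXT)`; neither appears in the question. Java strings are already Unicode, so the symbols can be written as escapes and handed to a `PreparedStatement` unchanged. The `roundTripUtf8` helper only illustrates that encoding to UTF-8 bytes and back loses nothing.

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class UnicodeMathSketch {
    // Hypothetical question text, roughly "sqrt(x^2 + 1) <= 5", written with
    // Unicode escapes so that the encoding of the source file does not matter.
    static final String QUESTION = "Solve: \u221A(x\u00B2 + 1) \u2264 5";

    // Encodes the text as UTF-8 bytes and decodes it again.
    static String roundTripUtf8(String text) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.UTF_8);
    }

    // Inserts the text into the hypothetical table questions(body TEXT).
    // The connection is assumed to come from an SQLite JDBC driver.
    static void storeQuestion(Connection conn, String text) throws SQLException {
        String sql = "INSERT INTO questions (body) VALUES (?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, text);
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // Checks that the symbols survive the UTF-8 round trip.
        System.out.println(QUESTION.equals(roundTripUtf8(QUESTION)));
    }
}
```

Using a bound parameter rather than string concatenation keeps the symbols intact and avoids quoting problems in the SQL statement.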

Generate PDF dynamically in Java [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Which of the following is the best way to generate a PDF in Java using iText?
1. Generate the PDF from scratch each time.
2. Have a predefined PDF, and each time push the data values into it and save the result as a new PDF.
3. Generate an XML document each time from the data to be pushed, and generate a new PDF from it each time.
I appreciate your response.
When your PDF-generating code is generic, that is, it appends data and applies styling and transformations to dynamic content, it is advisable to pass your data to that code and generate the PDF from scratch.
If you are adding images, styling and transformations to static content, it is better to create a predefined PDF with data-hotspot IDs so that you can replace those IDs with your dynamic content.

How can I parse in Java an xml block to string [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have an XML document which looks like this one:
<rootElement>
  <title> randmonString </title>
  <subElement1>
    <someInfo> info </someInfo>
    <subElementTrash> trash </subElementTrash>
    <someInfo1> info1 </someInfo1>
  </subElement1>
  <trash>
    <subtrash> trash </subtrash>
  </trash>
  <date> 19.03.15 </date>
</rootElement>
I need to extract only title, /subElement1/someInfo, /subElement1/someInfo1 and date. The rest should be stored somewhere automatically, but without the elements that were already extracted. I should also be able to marshal it back to the original XML.
It would be great if it can be done using annotation mapping.
Can someone give me the right direction to search?
You are asking about parsing, but then you want data extraction, data transformation and finally storage in some undefined form. This is a very broad question with many possible approaches.
You can parse XML in Java using DOM, SAX or StAX.
You can use XPath to extract the interesting information, but it will not divide your document into the interesting bit and 'the rest'.
You can define XSLT templates and run them through a Java Transformer in order to split your input document into the interesting part and 'the rest'.
You can use JAXB to map the XML onto a Java model (using your favourite annotation mapping), and then build further representations containing the interesting part and 'the rest'. Then you can save both representations to different XML files.
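The XPath option can be sketched with the JDK's built-in `javax.xml` classes. This is only an illustration under assumptions: the sample is the question's XML with the mismatched `subElemntTrash` tag corrected, the class and method names are hypothetical, and removing each extracted node through DOM is just one possible way to obtain 'the rest', since XPath itself only selects.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathExtractSketch {
    // The question's sample, with the closing-tag typo assumed fixed.
    static final String SAMPLE = "<rootElement>"
            + "<title> randmonString </title>"
            + "<subElement1>"
            + "<someInfo> info </someInfo>"
            + "<subElementTrash> trash </subElementTrash>"
            + "<someInfo1> info1 </someInfo1>"
            + "</subElement1>"
            + "<trash><subtrash> trash </subtrash></trash>"
            + "<date> 19.03.15 </date>"
            + "</rootElement>";

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Returns the trimmed text of the first node matching the expression and
    // removes that node, so the document ends up holding only "the rest".
    // Returns null when nothing matches.
    static String extractAndRemove(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
            if (node == null) {
                return null;
            }
            String value = node.getTextContent().trim();
            node.getParentNode().removeChild(node);
            return value;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        String[] paths = {
            "/rootElement/title",
            "/rootElement/subElement1/someInfo",
            "/rootElement/subElement1/someInfo1",
            "/rootElement/date"
        };
        for (String path : paths) {
            System.out.println(path + " = " + extractAndRemove(doc, path));
        }
    }
}
```

After the loop, `doc` still contains `subElementTrash` and `trash`, which could be serialized separately; restoring the original document would additionally require remembering where each removed node came from.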

Convert from XML in format A to XML in format B [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
The schema for the XML file is changing, and I need to create a utility that will take an XML file in format A and convert it to format B. How can I do it?
I am not able to figure out a starting point for it.
You will probably want to look into XSLT. You can write one stylesheet for each iteration of changes, provided that you, or whoever is changing the XML, version each change. If that is the case, you will easily be able to transform each version into the next.
If you do not have versions available for the XML, you will probably have to do very strict matching in your XSLT stylesheets.
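As a rough illustration of the XSLT approach, the sketch below applies a stylesheet with the JDK's `javax.xml.transform` API. Both formats are invented for the example: format A is assumed to be `<people><person><fullName>…</fullName></person></people>` and format B `<contacts><contact name="…"/></contacts>`. The class name and the stylesheet are hypothetical, not taken from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SchemaMigrationSketch {
    // Hypothetical stylesheet mapping the assumed format A to the assumed format B.
    static final String A_TO_B_XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:template match='/people'>"
            + "<contacts><xsl:apply-templates select='person'/></contacts>"
            + "</xsl:template>"
            + "<xsl:template match='person'>"
            + "<contact name='{fullName}'/>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the result as a string.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String formatA = "<people><person><fullName>Ada Lovelace</fullName></person></people>";
        System.out.println(transform(formatA, A_TO_B_XSLT));
    }
}
```

With one stylesheet per schema version, a document several versions behind could be upgraded by calling `transform` repeatedly, feeding each result into the next stylesheet.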

Extracting data from sites [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I want to extract data from sites. I have already obtained information from sites using the article extractor, but now I want to get information about the events of a particular place: when I give a location as input, I want to get the events in that place. For example, from the site "http://www.indianevents.org/events-Rajasthan-14.htm" I would like to be able to extract all events, festivals, etc. This is the code I use at the moment:
// str holds the page address; news receives the extracted article text
URL url = new URL(str);
InputSource is = HTMLFetcher.fetch(url).toInputSource();
BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
TextDocument doc = in.getTextDocument();
news = ArticleExtractor.INSTANCE.getText(doc);
Consider Apache Tika to download the text content.
You can use the Stanford POS tagger to parse the text into meaningful sentences, and NLP can help identify the event information.
Writing this down might make it sound simple, but trust me, it's difficult.
Good luck. :)
