Read .docx content from web server url - java


I have a WebDAV server where documents are stored. They are available by URL, e.g. https://my-url.net/document.docx. Now I'd like to fetch a document and read its content. What I have:
public void getDocumentContent() throws ExternalIntegrationException {
    var client = getHttpClient();
    var download = new HttpGet(doc);
    try {
        InputStream input = client.execute(download).getEntity().getContent();
        String str = IOUtils.toString(input, StandardCharsets.UTF_8);
        System.out.println(str);
    } catch (IOException e) {
        throw new ExternalIntegrationException("Failure download file from " + webDavPath + ". " +
                "Details:" + e.getMessage(), e);
    }
}

private HttpClient getHttpClient() {
    var credentialsProvider = new BasicCredentialsProvider();
    var credentials = new UsernamePasswordCredentials(userName, password);
    credentialsProvider.setCredentials(AuthScope.ANY, credentials);
    return HttpClientBuilder.create()
            .setDefaultCredentialsProvider(credentialsProvider)
            .build();
}
My System.out.println (for testing) prints this to the console:
X�K����nDUA*�)Y����ă�ښl 1i�J�/z,'��nV���K~ϲ��)a���m ����j0�Hu�T�9bx�<�9X�
�Q���
�Iʊ~���8��W�Z�"V0}����������>����uQwHo�� �� PK ! ��� N _rels/.rels �(� ���JA���a�}7�
ig�#��X6_�]7~
f��ˉ�ao�.b*lI�r�j)�,l0�%��b�
6�i���D�_���, � ���|u�Z^t٢yǯ;!Y,}{�C��/h> �� PK ! �d�Q� 1 word/_rels/document.xml.rels �(� ���j�0���{-;���ȹ�#��� �����$���~�
�U�>�0̀�"S�+a_݃(���vuݕ���c���T�/<�!s��Xd3�� �����?'g![�?��4���%�9���R�k6��$C�,�`&g�!/=� �� PK ! �^�� " word/document.xml�W]o�0}����y� ��"B���=T+�&�k�����wV���*�D�����s�mfW?
��k���0"�T3�6 yX��$p�*F�V��=8r5�n���Ns ��\\���{��K� �j
��[��S���|��,�)Ԧ�m�<5�*bhA �ܖנ�ע��mR�$���ٷ3m�1KwX)�w�2cu
�/����k�ga���Իۺ�⪽cgh���� 2_-�WA���`ô�x=�L�7��6�J�� ^ɶ�u:O'�cJ���2O�f:[Z���`�!�=��L,�!w��/�;��-���ٰK���<j�,��r>������/V<�B�~T�q�A����:������ZU��O7ܥx������Ͽ^h�b�^h��`���N�d�U�:��������s�r�Y��1��~��]㓿UϽ��]<��woO �F�ڟ
R�T����ߊ�9��q�Z
How can I get a .docx file from a URL without downloading it to a file, read the document content, and save it as a String (or maybe a List, if there are several documents)?

Why is it not working for you?
A .docx file is not plain text. It is a ZIP archive of XML parts (the "PK" and "word/document.xml" markers in your output are the ZIP signature and an entry name), so you can't simply print the raw bytes as a string.
Solution:
Parse the document with Apache POI. I recommend saving the file locally and opening it as a FileInputStream.
Just delete the file at the end.
If you can't save the file locally, you can pass the HTTP response InputStream directly, because the XWPFDocument constructor accepts any InputStream.
Once you have the variable "input", you can use the following code:
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public void readDocxFile(InputStream input) {
    // try-with-resources closes both the document and the stream
    try (InputStream in = input; XWPFDocument document = new XWPFDocument(in)) {
        List<XWPFParagraph> paragraphs = document.getParagraphs();
        for (XWPFParagraph para : paragraphs) {
            System.out.println(para.getText());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
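If Apache POI is not an option, the same idea can be sketched with only the JDK, since a .docx is a ZIP archive whose main text lives in the word/document.xml entry. This is not the POI approach from the answer above. The class and method names are made up for illustration, and the sketch only collects the w:t text of each w:p paragraph, ignoring headers, footnotes and other parts. It returns a List of strings, one per paragraph, read straight from any InputStream (for example the HTTP response stream).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of any library.
final class DocxParagraphReader {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads a .docx from any stream and returns one string per paragraph. */
    static List<String> readParagraphs(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // Buffer the entry so the XML parser works on its own copy.
                    return parseParagraphs(zip.readAllBytes());
                }
            }
        }
        throw new IOException("word/document.xml not found; not a .docx file?");
    }

    private static List<String> parseParagraphs(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations to avoid external-entity tricks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        List<String> paragraphs = new ArrayList<>();
        NodeList pNodes = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < pNodes.getLength(); i++) {
            // Concatenate all text runs (w:t) inside this paragraph.
            NodeList tNodes = ((Element) pNodes.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < tNodes.getLength(); j++) {
                text.append(tNodes.item(j).getTextContent());
            }
            paragraphs.add(text.toString());
        }
        return paragraphs;
    }
}
```

With the question's code, the call would look roughly like DocxParagraphReader.readParagraphs(client.execute(download).getEntity().getContent()). For real documents POI remains the more complete choice.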

Related

How to export utf-8 content in octet stream response for REST API endpoint?

I have a Quarkus-based REST API project in which one endpoint is supposed to serve exported data as .csv files. Since I do not want to create temporary files, I was writing to a ByteArrayInputStream to be used in an octet-stream response for my web service.
However, although this works fine for Latin-character content, we also have content that may be in Chinese. The downloaded .csv file does not display these characters properly, or rather they are not written properly (they only show up as question marks, even in a plain-text view, e.g. with Notepad).
We already ruled out the way the data is stored as the source of the problem: the encoding in the database is correct, and it works fine when we export it as .json (where we can set charset utf-8).
As far as I understand, a charset or encoding cannot be set for an octet stream.
So how can we export/stream this content as a file download without creating an actual file?
Some code examples below show how we do it currently. We use the Apache Commons CSV component CSVPrinter to create the CSV text in a custom CSV streamer class:
@ApplicationScoped
public class JobRunDataCsvStreamer implements DataFormatStreamer<JobData> {

    @Override
    public ByteArrayInputStream streamDataToFormat(List<JobData> dataList) {
        try {
            ByteArrayOutputStream out = getCsvOutputStreamFor(dataList);
            return new ByteArrayInputStream(out.toByteArray());
        } catch (IOException e) {
            throw new RuntimeException("Failed to convert job data: " + e.getMessage());
        }
    }

    private ByteArrayOutputStream getCsvOutputStreamFor(List<JobData> dataList) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        CSVPrinter csvPrinter = new CSVPrinter(new PrintWriter(out), getHeaderFormat());
        for (JobData jobData : dataList) {
            csvPrinter.printRecord(extractStringRowData(jobData));
        }
        csvPrinter.flush();
        csvPrinter.close();
        return out;
    }

    private CSVFormat getHeaderFormat() {
        return CSVFormat.EXCEL
                .builder()
                .setDelimiter(";")
                .setHeader("ID", "Source term", "Target term")
                .build();
    }

    private List<String> extractStringRowData(JobData jobData) {
        return Arrays.asList(
                String.valueOf(jobData.getId()),
                jobData.getSourceTerm(),
                jobData.getTargetTerm()
        );
    }
}
Here is the Quarkus API endpoint for the download:
@Path("/jobs/data")
public class JobDataResource {
    @Inject JobDataRepository jobDataRepository;
    @Inject JobDataCsvStreamer jobDataCsvStreamer;
    ...
    @GET
    @Path("/export/csv")
    @Produces(MediaType.APPLICATION_OCTET_STREAM)
    public Response getAllAsCsvExport() {
        List<JobData> jobData = jobDataRepository.getAll();
        ByteArrayInputStream stream = jobDataCsvStreamer.streamDataToFormat(jobData);
        return Response.ok(stream, MediaType.APPLICATION_OCTET_STREAM)
                .header("content-disposition", "attachment; filename = job-data.csv")
                .build();
    }
}
[Screenshot: result in the downloaded file for Chinese characters in the second column]
We tried setting headers etc. for the encoding, but none of it worked. Is there a way to stream content that requires a specific encoding as a file in Java web services? We tried using PrintWriter, which works, but requires creating a local file on the server.
Edit: We tried using PrintWriter(out, false, StandardCharsets.UTF_8) so that the PrintWriter writes to a byte array output stream for the response. This yields a different result, but the view is still broken in both Excel and plain text:
[Screenshot]
Code for endpoint:
@GET
@Path("/export/csv")
@Produces(MediaType.APPLICATION_OCTET_STREAM)
public Response getAllAsCsvExport() {
    List<JobData> jobData = jobRunDataRepository.getAll();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
        PrintWriter pw = new PrintWriter(out, false, StandardCharsets.UTF_8);
        pw.println(String.format("%s, %s, %s", "ID", "Source", "Target"));
        for (JobData item : jobData) {
            pw.println(String.format("%s, %s, %s",
                    String.valueOf(item.getId()),
                    String.valueOf(item.getSourceTerm()),
                    String.valueOf(item.getTargetTerm()))
            );
        }
        pw.flush();
        pw.close();
    } catch (Exception e) {
        throw new RuntimeException("Failed to convert job data: " + e.getMessage());
    }
    return Response.ok(out).build();
}
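One thing that may be worth checking here: new PrintWriter(out) in the first version encodes with the JVM default charset, and the edited version hands the ByteArrayOutputStream object itself to Response.ok rather than its bytes. The following is only a sketch of one possible direction, using plain JDK classes. It names UTF-8 explicitly and prepends a byte order mark, on the assumption that Excel needs the BOM to detect UTF-8 in a CSV file. The class and method names are hypothetical, and it does no quoting or escaping (CSVPrinter would handle that in the real code).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper for illustration only.
final class Utf8CsvExport {

    /** Renders rows as semicolon-separated CSV bytes, UTF-8 encoded with a leading BOM. */
    static byte[] toCsvBytes(List<List<String>> rows) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Name the charset explicitly instead of relying on the JVM default.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF'); // byte order mark, encoded as EF BB BF
            for (List<String> row : rows) {
                writer.write(String.join(";", row));
                writer.write("\r\n");
            }
        }
        // The writer is closed (and therefore flushed) before the bytes are taken.
        return out.toByteArray();
    }
}
```

The resulting byte[] could then be returned as the response entity, possibly with a Content-Type such as "text/csv; charset=UTF-8" instead of a bare octet stream. Whether that resolves the display problem in this particular setup is untested.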

How to download files(e.g png, jpeg,pdf,msg) using Amazon S3 presigned url in java?

I have generated a presigned URL that shows a preview of the file, but I want to download the file and I'm not able to. Is there any way to get a presigned download URL using Java?

Normally, when you sign a URL, S3 doesn't add any additional headers by default, which will cause most modern browsers to open a PDF file in the browser. If you want the browser to download the file instead, you need to signal the download with a "Content-Disposition" header.
There's a fairly easy way to add the Content-Disposition to the S3 response by only changing how the presigned link is generated: add a call to responseContentDisposition to the builder for the GetObjectRequest. For instance, this simple app will generate a link useful for "preview", and a link that will trigger a download for the same object:
package com.example.myapp;

import java.time.Duration;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;
import software.amazon.awssdk.services.s3.presigner.model.PresignedGetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;

public class App {
    public static void main(String[] args) {
        String bucketName = "example-bucket";
        String keyName = "test.pdf";
        Region region = Region.US_WEST_2;
        String downloadFilename = "the_filename_to_download_to.pdf";
        S3Presigner presigner = S3Presigner.builder().region(region).build();

        // Generate the presigned request, this will be the "preview" URL
        GetObjectRequest getObjectRequest = GetObjectRequest.builder()
                .bucket(bucketName).key(keyName).build();
        GetObjectPresignRequest getObjectPresignRequest = GetObjectPresignRequest.builder()
                .signatureDuration(Duration.ofHours(1))
                .getObjectRequest(getObjectRequest)
                .build();
        PresignedGetObjectRequest presignedGetObjectRequest = presigner
                .presignGetObject(getObjectPresignRequest);
        // Log the presigned URL
        System.out.println("Presigned URL for preview: " + presignedGetObjectRequest.url());

        // Generate the presigned request, this will be the "download" URL
        // Note the addition of the content-type and content-disposition response overrides
        // ("application/octet-stream" is a media type, so it belongs in Content-Type,
        // not Content-Encoding)
        getObjectRequest = GetObjectRequest.builder()
                .bucket(bucketName).key(keyName)
                .responseContentType("application/octet-stream")
                .responseContentDisposition("attachment; filename=\"" + downloadFilename + "\"")
                .build();
        getObjectPresignRequest = GetObjectPresignRequest.builder()
                .signatureDuration(Duration.ofHours(1))
                .getObjectRequest(getObjectRequest)
                .build();
        presignedGetObjectRequest = presigner
                .presignGetObject(getObjectPresignRequest);
        // Log the presigned URL
        System.out.println("Presigned URL for download: " + presignedGetObjectRequest.url());
    }
}

When you want to implement use cases with Amazon S3 and the Java SDK, always look at the code example repo on GitHub. This is AWS SDK for Java V2, which is much better practice to use than V1.
You will find many examples that have been tested such as this one that shows you how to get an object located in an Amazon S3 bucket by using the S3Presigner client object.
package com.example.s3;
// snippet-start:[presigned.java2.getobjectpresigned.import]
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.time.Duration;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Exception;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;
import software.amazon.awssdk.services.s3.presigner.model.PresignedGetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.utils.IoUtils;
// snippet-end:[presigned.java2.getobjectpresigned.import]
/**
* To run this AWS code example, ensure that you have setup your development environment, including your AWS credentials.
*
* For information, see this documentation topic:
*
* https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
*/
public class GetObjectPresignedUrl {
public static void main(String[] args) {
final String USAGE = "\n" +
"Usage:\n" +
" GetObjectPresignedUrl <bucketName> <keyName> \n\n" +
"Where:\n" +
" bucketName - the Amazon S3 bucket name. \n\n"+
" keyName - a key name that represents a text file. \n\n";
if (args.length != 2) {
System.out.println(USAGE);
System.exit(1);
}
String bucketName = args[0];
String keyName = args[1];
Region region = Region.US_WEST_2;
S3Presigner presigner = S3Presigner.builder()
.region(region)
.build();
getPresignedUrl(presigner, bucketName, keyName);
presigner.close();
}
// snippet-start:[presigned.java2.getobjectpresigned.main]
public static void getPresignedUrl(S3Presigner presigner, String bucketName, String keyName ) {
try {
GetObjectRequest getObjectRequest =
GetObjectRequest.builder()
.bucket(bucketName)
.key(keyName)
.build();
GetObjectPresignRequest getObjectPresignRequest = GetObjectPresignRequest.builder()
.signatureDuration(Duration.ofMinutes(10))
.getObjectRequest(getObjectRequest)
.build();
// Generate the presigned request
PresignedGetObjectRequest presignedGetObjectRequest =
presigner.presignGetObject(getObjectPresignRequest);
// Log the presigned URL
System.out.println("Presigned URL: " + presignedGetObjectRequest.url());
HttpURLConnection connection = (HttpURLConnection) presignedGetObjectRequest.url().openConnection();
presignedGetObjectRequest.httpRequest().headers().forEach((header, values) -> {
values.forEach(value -> {
connection.addRequestProperty(header, value);
});
});
// Send any request payload that the service needs (not needed when isBrowserExecutable is true)
if (presignedGetObjectRequest.signedPayload().isPresent()) {
connection.setDoOutput(true);
try (InputStream signedPayload = presignedGetObjectRequest.signedPayload().get().asInputStream();
OutputStream httpOutputStream = connection.getOutputStream()) {
IoUtils.copy(signedPayload, httpOutputStream);
}
}
// Download the result of executing the request
try (InputStream content = connection.getInputStream()) {
System.out.println("Service returned response: ");
IoUtils.copy(content, System.out);
}
} catch (S3Exception e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
// snippet-end:[presigned.java2.getobjectpresigned.main]
}
UPDATE
The above Java code will produce a pre-signed URL. Debug through it and get the pre-signed URL at line 86.
I also tested the above code with a file named people.png. In the GitHub repo, there is a Java Swing example where you can enter the pre-signed URL and the file is downloaded. Modify lines 50 and 53 in the Java Swing app.
This app downloaded the people PNG file to a local file where it can be opened.
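Since a presigned GET URL is an ordinary URL that can be fetched with a plain GET request, saving the object to a local file does not strictly need the Swing app. The following is a minimal sketch with JDK classes only; the class and method names are hypothetical, and it assumes the URL has not expired and needs no extra signed headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only.
final class UrlDownloader {

    /** Streams the resource behind the URL into target and returns the number of bytes written. */
    static long download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            // Overwrite the target if a previous download left a file behind.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

With the code above, the call would look roughly like UrlDownloader.download(presignedGetObjectRequest.url(), Path.of("people.png")).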

How to download/save a LinkedIn profile as a PDF file using Java?

I have a list of URLs to LinkedIn profiles and I would like to download/save all of them as PDF files using Java. So far, I have managed to download the html version of the profiles, which cannot even be opened using browsers for some reason. I have used the JSoup library and this is the code I got:
public static void main(String[] arg) {
    try {
        String url = "https://www.linkedin.com/uas/login?goback=&trk=hb_signin";
        Connection.Response response = Jsoup
                .connect(url)
                .method(Connection.Method.GET)
                .execute();
        Document responseDocument = response.parse();
        Element loginCsrfParam = responseDocument
                .select("input[name=loginCsrfParam]")
                .first();
        response = Jsoup.connect("https://www.linkedin.com/uas/login-submit")
                .cookies(response.cookies())
                .data("loginCsrfParam", loginCsrfParam.attr("value"))
                .data("session_key", "user@name.com")
                .data("session_password", "aPassWord")
                .method(Connection.Method.POST)
                .followRedirects(true)
                .execute();
        Connection.Response aResponse = Jsoup.connect("ProfileURL")
                .cookies(response.cookies())
                .method(Connection.Method.GET)
                .execute();
        Document aResponseDocument = aResponse.parse();
        try {
            FileWriter fileWriter = new FileWriter("C:/Users/userName/Desktop/DownLoadedProfile.html", false);
            BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
            bufferedWriter.write(aResponseDocument.getAllElements().toString());
            bufferedWriter.newLine();
            bufferedWriter.close();
        } catch (Exception e) {
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
If possible, how can I extend this code to invoke the (Save to PDF) option and download the profile?

You can use a free Java library that converts HTML to PDF, for example jPDFWriter. Here is an example:
import com.qoppa.pdfWriter.PDFDocument;
...
File f1 = new File("c:/htmlsamplepage.html");
PDFDocument pdfDoc = PDFDocument.loadHTML(f1.toURI().toURL(), new PageFormat(), false);
pdfDoc.saveDocument("c:\\output.pdf");

how to use jsoup on router address?

I have a question about the Jsoup library.
I have this little program, which downloads and parses an HTML page (google.com) and gets its title.
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HTMLParser {
    public static void main(String[] args) {
        // Jsoup example - reading an HTML page from a URL
        Document doc;
        String title = null; // stays null if the request fails
        try {
            doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Jsoup Can read HTML page from URL, title : " + title);
    }
}
The program works very well, but here is the problem:
when I try to parse a page from the IP address 192.168.1.1 (I changed google.com to 192.168.1.1, which is the address of the router):
doc = Jsoup.connect("http://192.168.1.1/").get();
it does not work and shows me the error below:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=401, URL=http://192.168.1.1/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
at HTMLParser.main(HTMLParser.java:43)
At first I thought the problem was related to the username and password, so I changed the address 192.168.1.1 to username:password@192.168.1.1:
doc = Jsoup.connect("http://username:password@192.168.1.1/").get();
but it does not work; the program reads the entire line as an address.
If anyone has an idea, please help me! Thanks, everybody.

As with saka1029, you can request the URL with authentication. Then you use Jsoup.parse(String) to get the Document object.
Or you simply use Jsoup methods to send the request and get the response:
Getting HTML Source using Jsoup of a password protected website
Jsoup connection with basic access authentication
(I usually use javax.xml.bind.DatatypeConverter.printBase64Binary for the Base64 conversion.)

Thank you very much, saka1029 and Griddoor. I read what you suggested, and it helped a lot.
I used this solution:
URL url = new URL("http://user:pass@domain.com/url");
URLConnection urlConnection = url.openConnection();
if (url.getUserInfo() != null) {
    String basicAuth = "Basic " + new String(new Base64().encode(url.getUserInfo().getBytes()));
    urlConnection.setRequestProperty("Authorization", basicAuth);
}
InputStream inputStream = urlConnection.getInputStream();
from: Connecting to remote URL which requires authentication using Java
and used this method to read the InputStream:
StringWriter writer = new StringWriter();
IOUtils.copy(inputStream, writer);
String theString = writer.toString();
from: Read/convert an InputStream to a String
Then I parse theString with Jsoup.
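For reference, the Authorization header for HTTP Basic authentication can also be built with the JDK's own java.util.Base64, without Commons Codec. The helper below is a hypothetical sketch; the class and method names are not from any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only.
final class BasicAuth {

    /** Returns the value for an "Authorization" header using HTTP Basic authentication. */
    static String header(String user, String password) {
        // Basic auth is the Base64 encoding of "user:password".
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```

With Jsoup, the value could be passed along the lines of Jsoup.connect("http://192.168.1.1/").header("Authorization", BasicAuth.header("username", "password")).get(), assuming the router really uses Basic authentication.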

File download returns corrupted file (I think) in Play framework 2.2.2

I'm struggling with getting file upload/download to work properly in Play framework 2.2.2. I have a Student class with a field called "cv". It's annotated with @Lob, like this:
@Lob
public byte[] cv;
Here are the upload and download methods:
public static Result upload() {
    MultipartFormData body = request().body().asMultipartFormData();
    FilePart cv = body.getFile("cv");
    if (cv != null) {
        filenameCV = cv.getFilename();
        String contentType = cv.getContentType();
        File file = cv.getFile();
        Http.Session session = Http.Context.current().session();
        String studentNr = session.get("user");
        Student student = Student.find.where().eq("studentNumber", studentNr).findUnique();
        InputStream is;
        try {
            is = new FileInputStream(file);
            student.cv = IOUtils.toByteArray(is);
        } catch (IOException e) {
            Logger.debug("Error converting file");
        }
        student.save();
        flash("ok", "Vellykket! Filen " + filenameCV + " ble lastet opp til din profil");
        return redirect(routes.Profile.profile());
    } else {
        flash("error", "Mangler fil");
        return redirect(routes.Profile.profile());
    }
}

public static Result download() {
    Http.Session session = Http.Context.current().session();
    Student student = Student.find.where().eq("studentNumber", session.get("user")).findUnique();
    File f = new File("/tmp/" + filenameCV);
    FileOutputStream fos;
    try {
        fos = new FileOutputStream(f);
        fos.write(student.cv);
        fos.flush();
        fos.close();
    } catch (IOException e) {
    }
    return ok(f);
}
The file seems to be correctly saved to the database (the cv field is populated with data, but it's obviously cryptic to me so I don't know for sure that the content is what it's supposed to be)
When I go to my website and click the "Download CV" link (which runs the download action), the file gets downloaded but can't be opened - saying the PDF viewer can't recognize the file etc. (Files uploaded have to be PDF)
Any ideas on what might be wrong?

Don't keep your files in the DB; the filesystem is much better for that! Save the uploaded file on disk under some unique name, then keep only the path to the file in your database as a String.
It's cheaper in the long run (as said many times).
It's easier to handle downloads, e.g. in Play all you need to serve a PDF is:
public static Result download() {
    File file = new File("/full/path/to/your.pdf");
    return ok(file);
}
It will set the proper headers, such as Content-Disposition, Content-Length and Content-Type, and not only for PDFs.
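The "save under a unique name, store only the path" step could be sketched roughly as follows with JDK classes. The class name, the method name and the fixed ".pdf" suffix are assumptions for illustration (the question says uploads have to be PDFs); the returned path is what would go into the database column instead of the byte[].

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper for illustration only.
final class UploadStore {

    /** Copies the upload into storageDir under a random unique name and returns the new path. */
    static Path save(InputStream upload, Path storageDir) throws IOException {
        Files.createDirectories(storageDir);
        // A random UUID avoids collisions between users uploading "cv.pdf".
        Path target = storageDir.resolve(UUID.randomUUID() + ".pdf");
        Files.copy(upload, target);
        return target;
    }
}
```

In the upload action this might be called as UploadStore.save(new FileInputStream(file), storageDir), with target.toString() saved on the Student record.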
