How to extract .gz file Dynamically in Java?

How to extract .gz file Dynamically in Java? - java

In http://www.newegg.com/Siteindex_USA.xml lots of urls of .gz-files are given, like this:
<loc>
http://www.newegg.com//Sitemap/USA/newegg_sitemap_product01.xml.gz
</loc>
I want to extract these dynamically. I don't want to store them locally, I just want to extract them and store the contained data in a database.
Modify:
I am getting exception
private void processGzip(URL url, byte[] response) throws MalformedURLException,
IOException, UnknownFormatException {
if (DEBUG) System.out.println("Processing gzip");
InputStream is = new ByteArrayInputStream(response);
// Remove .gz ending
String xmlUrl = url.toString().replaceFirst("\\.gz$", "");
if (DEBUG) System.out.println("XML url = " + xmlUrl);
InputStream decompressed = new GZIPInputStream(is);
InputSource in = new InputSource(decompressed);
in.setSystemId(xmlUrl);
processXml(url, in);
decompressed.close();
}

Simply wrap the input stream in GZIPInputStream, and it'll decompress the data as you're reading it.

Related

Convert InputStream from ISO-8859-1 to UTF-8

I have a file in ISO-8859-1 containing german umlauts and I need to unmarshall it using JAXB. But before I need the content in UTF-8.
#Override
public List<Usage> convert(InputStream input) {
try {
InputStream inputWithNamespace = addNamespaceIfMissing(input);
inputWithNamespace = convertFileToUtf(inputWithNamespace);
ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
...
I get the "file" as an InputStream. My idea was to read the file's content in UTF-8 and make another InputStream to use. This is what I've tried:
private InputStream convertFileToUtf(InputStream inputStream) throws IOException {
byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
String stringIso = new String(bytesInIso);
byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
String stringUtf = new String(bytesInUtf);
return new ByteArrayInputStream(bytesInUtf);
}
I have those 2 Strings to check the contents, but even just reading the ISO file, it gives question marks where umlauts are (?) and converting that to UTF_8 gives strange characters like 1/2 and so on.
UPDATE
byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
String contentInIso = new String(bytesInIso);
byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
String contentInUtf = new String(bytesInUtf);
Verifying contentInIso prints question marks instead of the umlauts and by checking contentInIso instead of umlauts, it has characters like "ï¿½".
#Override
public List<Usage> convert(InputStream input) {
try {
InputStream inputWithNamespace = addNamespaceIfMissing(input);
byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
String contentInIso = new String(bytesInIso);
byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
String contentInUtf = new String(bytesInUtf);
ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
This method convert it's called by another one called processUsageFile:
private void processUsageFile(File usageFile) {
try (FileInputStream fileInputStream = new FileInputStream(usageFile)) {
usageImporterService.importUsages(usageFile.getName(), fileInputStream, getUsageTypeValidated(usageFile.getName()));
log.info("Usage file {} imported successfully. Moving to archive directory", usageFile.getName());
If i take the code I have written under the UPDATE statement and put it immediately after the try, the first contentInIso has question marks but the contentInUtf has the umlauts. Then, by going into the convert, jabx throws an exception that the file has a premature end of line.

Regarding the behaviour you are getting,
String stringIso = new String(bytesInIso);
In this step, you construct a new String by decoding the specified array of bytes using the platform's default charset.
Since this is probably not ISO_8859_1, I think the String you are looking at becomes garbled here.

How to set file name while downloading as zip?

I have a rest api which allows me to pass multiple IDS to a resource to download records from specific table and zip it. MSSQL is the backend mastering messages.
So when a ID is passed as param, it calls the database table to return the message data. Below is the code:
#GetMapping("message/{ids}")
public void downloadmessage(#PathVariable Long[] ids, HttpServletResponse response) throws Exception {
List<MultiplemessageID> multiplemessageID = auditRepository.findbyId(ids);
String xml = new ObjectMapper().writeValueAsString(MultiplemessageID);
String fileName = "message.zip";
String xml_name = "message.xml";
byte[] data = xml.getBytes();
byte[] bytes;
try (ByteOutputStream bout = new ByteOutputStream(); ZipOutputStream zout = new ZipOutputStream(bout)) {
for (Long id : ids) {
zout.setLevel(1);
ZipEntry ze = new ZipEntry(xml_name);
ze.setSize(data.length);
ze.setTime(System.currentTimeMillis());
zout.putNextEntry(ze);
zout.write(data);
zout.closeEntry();
}
bytes = bout.getBytes();
}
response.setContentType("application/zip");
response.setContentLength(bytes.length);
response.setHeader("Content-Disposition", "attachment; " + String.format("filename=" + fileName));
ServletOutputStream outputStream = response.getOutputStream();
FileCopyUtils.copy(bytes, outputStream);
outputStream.close();
}
Message on the database has the following structure:
MSG_ID C_ID NAME INSERT_TIMESTAMP MSG CONF F_NAME POS ID INB HEADERS
0011d540 EDW,WSO2,AS400 invoicetoedw 2019-08-29 23:59:13 <invoice>100923084207</invoice> [iden1:SMTP, iden2:SAP, service:invoicetoedw, clients:EDW,WSO2,AS400, file.path:/c:/nfs/store/invoicetoedw/output, rqst.message.format:XML,] p3_pfi_1 Pre 101 MES_P3_IN [clients:EDW,WSO2,AS400, UniqueName:Domain]
My file name should be like: part of header name + _input parameterId[0]
i.e. Domain_1
File name for multiple paramter (1,2,3,4)will be like
Domain_1
Domain_2
Domain_3
Domain_4
Below code retrieves the part of file name as string from the header.
private static String serviceNameHeadersToMap(String headers) {
String sHeaders = headers.replace("[", "");
sHeaders = sHeaders.replace("]", "");
String res = Arrays.stream(sHeaders.split(", "))
.filter(s->s.contains("serviceNameIdentifier"))
.findFirst()
.map(name->name.split(":")[1])
.orElse("Not Present");
return res;
I need to create a file name with header and input parameter. Once the file name is set, I would like individual records downloaded with correct file name and zipped.
Zip file name is message.zip. When unzipped it should contain individual files like Domain_1.xml, Domain_2.xml, Domain_3.xml, Domain_4.xml etc...
How do I achieve this? Please advise. I need some guidance for the limited knowledge on java I have. Thank you.

Read PDVInputStream dicomObject information on onCStoreRQ association request

I am trying to read (and then store to 3rd party local db) certain DICOM object tags "during" an incoming association request.
For accepting association requests and storing locally my dicom files i have used a modified version of dcmrcv() tool. More specifically i have overriden onCStoreRQ method like:
#Override
protected void onCStoreRQ(Association association, int pcid, DicomObject dcmReqObj,
PDVInputStream dataStream, String transferSyntaxUID,
DicomObject dcmRspObj)
throws DicomServiceException, IOException {
final String classUID = dcmReqObj.getString(Tag.AffectedSOPClassUID);
final String instanceUID = dcmReqObj.getString(Tag.AffectedSOPInstanceUID);
config = new GlobalConfig();
final File associationDir = config.getAssocDirFile();
final String prefixedFileName = instanceUID;
final String dicomFileBaseName = prefixedFileName + DICOM_FILE_EXTENSION;
File dicomFile = new File(associationDir, dicomFileBaseName);
assert !dicomFile.exists();
final BasicDicomObject fileMetaDcmObj = new BasicDicomObject();
fileMetaDcmObj.initFileMetaInformation(classUID, instanceUID, transferSyntaxUID);
final DicomOutputStream outStream = new DicomOutputStream(new BufferedOutputStream(new FileOutputStream(dicomFile), 600000));
//i would like somewhere here to extract some TAGS from incoming dicom object. By trying to do it using dataStream my dicom files
//are getting corrupted!
//System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
try {
outStream.writeFileMetaInformation(fileMetaDcmObj);
dataStream.copyTo(outStream);
} finally {
outStream.close();
}
dicomFile.renameTo(new File(associationDir, dicomFileBaseName));
System.out.println("DICOM file name: " + dicomFile.getName());
}
#Override
public void associationAccepted(final AssociationAcceptEvent associationAcceptEvent) {
....
#Override
public void associationClosed(final AssociationCloseEvent associationCloseEvent) {
...
}
I would like somewhere between this code to intercept a method wich will read dataStream and will parse specific tags and store to a local database.
However wherever i try to put a piece of code that tries to manipulate (just read for start) dataStream then my dicom files get corrupted!
PDVInputStream is implementing java.io.InputStream ....
Even if i try to just put a:
System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
before copying datastream to outStream ... then my dicom files are getting corrupted (1KB of size) ...
How am i supposed to use datastream in a CStoreRQ association request to extract some information?
I hope my question is clear ...

The PDVInputStream is probably a PDUDecoder class. You'll have to reset the position when using the input stream multiple times.
Maybe a better solution would be to store the DICOM object in memory and use that for both purposes. Something akin to:
DicomObject dcmobj = dataStream.readDataset();
String whatYouWant = dcmobj.get( Tag.whatever );
dcmobj.initFileMetaInformation( transferSyntaxUID );
outStream.writeDicomFile( dcmobj );

Extract text from a large pdf with Tika

I try to extract text from a large pdf, but i only get the first pages, i need all text to will be passed to a string variable.
This is the code
public class ParsePDF {
public static void main(String args[]) throws Exception {
try {
File file = new File("C:/vlarge.pdf");
String content = new Tika().parseToString(file);
System.out.println("The Content: " + content);
}
catch (Exception e) {
e.printStackTrace();
}
}
}

From the Javadocs:
To avoid unpredictable excess memory use, the returned string contains
only up to getMaxStringLength() first characters extracted from the
input document. Use the setMaxStringLength(int) method to adjust this
limitation.
Calling setMaxStringLength(-1) will disable this limit.

Try the apache api TIKA. Its working for large PDF's also.
Sample :
InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);

Convert byte[] to Base64 string for data URI

I know this has probably been asked 10000 times, however, I can't seem to find a straight answer to the question.
I have a LOB stored in my db that represents an image; I am getting that image from the DB and I would like to show it on a web page via the HTML IMG tag. This isn't my preferred solution, but it's a stop-gap implementation until I can find a better solution.
I'm trying to convert the byte[] to Base64 using the Apache Commons Codec in the following way:
String base64String = Base64.encodeBase64String({my byte[]});
Then, I am trying to show my image on my page like this:
<img src="data:image/jpg;base64,{base64String from above}"/>
It's displaying the browser's default "I cannot find this image", image.
Does anyone have any ideas?
Thanks.

I used this and it worked fine (contrary to the accepted answer, which uses a format not recommended for this scenario):
StringBuilder sb = new StringBuilder();
sb.append("data:image/png;base64,");
sb.append(StringUtils.newStringUtf8(Base64.encodeBase64(imageByteArray, false)));
contourChart = sb.toString();

According to the official documentation Base64.encodeBase64URLSafeString(byte[] binaryData) should be what you're looking for.
Also mime type for JPG is image/jpeg.

That's the correct syntax. It might be that your web browser does not support the data URI scheme. See Which browsers support data URIs and since which version?
Also, the JPEG MIME type is image/jpeg.

You may also want to consider streaming the images out to the browser rather than encoding them on the page itself.
Here's an example of streaming an image contained in a file out to the browser via a servlet, which could easily be adopted to stream the contents of your BLOB, rather than a file:
public void doGet(HttpServletRequest req, HttpServletResponse resp)
throws ServletException, IOException
{
ServletOutputStream sos = resp.getOutputStream();
try {
final String someImageName = req.getParameter(someKey);
// encode the image path and write the resulting path to the response
File imgFile = new File(someImageName);
writeResponse(resp, sos, imgFile);
}
catch (URISyntaxException e) {
throw new ServletException(e);
}
finally {
sos.close();
}
}
private void writeResponse(HttpServletResponse resp, OutputStream out, File file)
throws URISyntaxException, FileNotFoundException, IOException
{
// Get the MIME type of the file
String mimeType = getServletContext().getMimeType(file.getAbsolutePath());
if (mimeType == null) {
log.warn("Could not get MIME type of file: " + file.getAbsolutePath());
resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
return;
}
resp.setContentType(mimeType);
resp.setContentLength((int)file.length());
writeToFile(out, file);
}
private void writeToFile(OutputStream out, File file)
throws FileNotFoundException, IOException
{
final int BUF_SIZE = 8192;
// write the contents of the file to the output stream
FileInputStream in = new FileInputStream(file);
try {
byte[] buf = new byte[BUF_SIZE];
for (int count = 0; (count = in.read(buf)) >= 0;) {
out.write(buf, 0, count);
}
}
finally {
in.close();
}
}

If you don't want to stream from a servlet, then save the file to a directory in the webroot and then create the src pointing to that location. That way the web server does the work of serving the file. If you are feeling particularly clever, you can check for an existing file by timestamp/inode/crc32 and only write it out if it has changed in the DB which can give you a performance boost. This file method also will automatically support ETag and if-modified-since headers so that the browser can cache the file properly.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to extract .gz file Dynamically in Java? - java

Simply wrap the input stream in GZIPInputStream, and it'll decompress the data as you're reading it.

Related

Convert InputStream from ISO-8859-1 to UTF-8

How to set file name while downloading as zip?

Read PDVInputStream dicomObject information on onCStoreRQ association request

Extract text from a large pdf with Tika

Convert byte[] to Base64 string for data URI

Categories

Resources