Difference between APPLICATION_STREAM_JSON_VALUE and APPLICATION_NDJSON_VALUE - java

While working with Spring 5 reactive APIs, I came across the deprecated MediaType APPLICATION_STREAM_JSON_VALUE, which, when used, displays values from a GET REST endpoint in a streaming fashion, i.e. values show up in the browser as they are produced. But as of today the documentation states that it has been replaced by APPLICATION_NDJSON_VALUE, per the following text from the documentation:
APPLICATION_STREAM_JSON_VALUE Deprecated. as of 5.3 since it
originates from the W3C Activity Streams specification which has a
more specific purpose and has been since replaced with a different
mime type. Use APPLICATION_NDJSON as a replacement or any other
line-delimited JSON format (e.g. JSON Lines, JSON Text Sequences).
When I checked the behaviour of the MediaType APPLICATION_NDJSON_VALUE, I observed that when a GET API is consumed in the browser, instead of streaming the results in real time, the browser downloads them as a file which you can view later. But does that in any way impact the streaming behaviour, or is it exactly the same? Does APPLICATION_NDJSON_VALUE bring in some other significance as well, or is it just a pure replacement for APPLICATION_STREAM_JSON_VALUE? And if it is just a replacement, why does the browser behaviour change from streaming to downloading the results of the Flux? Or let me know if I am making a mistake while trying to replicate the exact behaviour.

But does that in any way impact the streaming behaviour or is it exactly the same?
It's exactly the same. The content type header is only telling the client what type of content it's serving, nothing more. The browser will do its best to look at that header and work out whether to display something inline or download it, but it's just a "best guess", especially in the case of reasonably new standards like newline delimited JSON. In practice you're never going to be opening this in a browser anyway (instead consuming it as an API), so it's not really that big a deal.
If you really need it not to download in the browser, you can try adding a Content-Disposition: inline header - but personally I'd just ignore the browser's behaviour and consume it with a tool more suited to the job (like curl for instance) instead.
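For reference, a minimal sketch of a WebFlux endpoint producing newline-delimited JSON; the /ticks path and the Tick class are invented for illustration, and the Content-Disposition: inline header is only the optional nudge mentioned above, not something NDJSON requires:

    import java.time.Duration;
    import java.time.Instant;

    import org.springframework.http.MediaType;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    import reactor.core.publisher.Flux;

    @RestController
    public class TickController {

        // Emits one JSON object per line, once per second, as application/x-ndjson.
        @GetMapping(value = "/ticks", produces = MediaType.APPLICATION_NDJSON_VALUE)
        public ResponseEntity<Flux<Tick>> ticks() {
            Flux<Tick> body = Flux.interval(Duration.ofSeconds(1))
                    .map(i -> new Tick(i, Instant.now().toString()));
            return ResponseEntity.ok()
                    // Optional: hint to browsers to render inline rather than download.
                    .header("Content-Disposition", "inline")
                    .body(body);
        }

        static class Tick {
            private final long sequence;
            private final String timestamp;

            Tick(long sequence, String timestamp) {
                this.sequence = sequence;
                this.timestamp = timestamp;
            }

            public long getSequence() { return sequence; }
            public String getTimestamp() { return timestamp; }
        }
    }

Consuming it with something like curl -N http://localhost:8080/ticks shows each line arriving as it is emitted, regardless of what the browser decides to do with the download.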

Related

Possible to run two ContentHandlers for a single parse in Apache-Tika?

I'm using Apache Tika to parse documents and generate both a plaintext version and an HTML preview of the document. I'm able to generate both just fine if I call the parse function twice and pass in two separate ContentHandlers— this works great for text only documents. But when I get documents that require OCR with tesseract, it's a bit of a problem— it's extremely wasteful to call the parse function twice because it does the OCR (which can take a minute or so) twice as well.
I know I can write my own ContentHandler, but just wondering if anyone knows of an out-of-the-box solution for this? Much appreciated!
Good news - Apache Tika provides something out of the box for this!
TeeContentHandler - Content handler proxy that forwards the received SAX events to zero or more underlying content handlers.
Just create your two or more real ContentHandlers, pass those to the constructor of TeeContentHandler, then hand the TeeContentHandler to Tika when you do the parse.
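A minimal sketch, assuming the usual Tika classes (AutoDetectParser, BodyContentHandler, ToHTMLContentHandler) and a placeholder file name, of producing both the plain text and the HTML preview from a single parse (and therefore a single OCR pass):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.TeeContentHandler;
    import org.apache.tika.sax.ToHTMLContentHandler;

    public class TeeParseExample {
        public static void main(String[] args) throws Exception {
            // Two "real" handlers: plain text and HTML, both fed from one parse.
            BodyContentHandler textHandler = new BodyContentHandler(-1); // -1 = no write limit
            ToHTMLContentHandler htmlHandler = new ToHTMLContentHandler();
            TeeContentHandler tee = new TeeContentHandler(textHandler, htmlHandler);

            AutoDetectParser parser = new AutoDetectParser();
            try (InputStream in = Files.newInputStream(Paths.get("scanned.pdf"))) { // placeholder path
                parser.parse(in, tee, new Metadata(), new ParseContext());
            }

            String plainText = textHandler.toString();
            String htmlPreview = htmlHandler.toString();
            System.out.println(plainText.length() + " chars of text, " + htmlPreview.length() + " chars of HTML");
        }
    }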

Reject URLs after fetching based on a condition in Nutch

I want to know whether it's possible to filter the URLs that are fetched, based on a condition (for example published date or time). I know that we can filter the URLs with regex-urlfilter for fetching.
In my case I don't want to index old documents. So, if a document was published before 2017, it has to be rejected. Is a date filter plugin needed, or is one already available?
Any help will be appreciated. Thanks in advance.
If you only want to avoid indexing old documents you could write your own IndexingFilter that checks your condition and prevents those documents from being indexed (a rough sketch is included after this answer). You don't mention your Nutch version, but assuming that you're using v1, we have a new PR (it will be ready for the next release) that will offer this feature out of the box, using JEXL expressions to allow/prevent documents from being indexed.
If you can grab the PR, test it, and provide some feedback, that would be amazing!
You could write your own custom plugin if you want, and you can look at the mimetype-filter for something similar to what you want (in that case the filtering is applied based on the MIME type).
Also, a warning is in place: at the moment the fetchTime or modifiedTime that Nutch uses comes from the headers that the webserver sends when the resource is fetched. Keep in mind that these values should not be trusted (unless you are 100% sure), because in most cases you'll get wrong dates. NUTCH-1414 proposes a better approach of extracting the publication date from the content of the page, or you can implement your own parser.
Keep in mind that with this approach you still fetch/parse the old documents; you'll just skip the indexing step.
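A minimal sketch of such an IndexingFilter for Nutch 1.x, assuming the date you trust ends up in the CrawlDatum's modified time (the class name and cutoff are illustrative; returning null is what tells Nutch to skip indexing the document):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class PublicationDateIndexingFilter implements IndexingFilter {

        // Illustrative cutoff: 2017-01-01T00:00:00Z as epoch milliseconds.
        private static final long CUTOFF_MS = 1483228800000L;

        private Configuration conf;

        @Override
        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
            long modified = datum.getModifiedTime();
            // Drop documents older than the cutoff; as noted above, header-derived
            // dates are often unreliable, so only act when a value is present.
            if (modified > 0 && modified < CUTOFF_MS) {
                return null; // null = do not index this document
            }
            return doc;
        }

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;
        }

        @Override
        public Configuration getConf() {
            return conf;
        }
    }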

Reference implementation / lib for full URL encoding

I'm writing a Java application which parses links from HTML and uses them to request their content. URL encoding, when we have no idea of the "intent" of the URL's author, is a very thorny area. For example, when to use %20 or + is a complex issue (%20 vs +); a browser would perform this encoding for a URL containing an un-encoded space.
There are many other situations in which a browser would change the content of a parsed url before requesting a page, for example:
http://www.Example.com/þ
... when parsed & requested by a browser becomes ...
http://www.Example.com/%C3%BE
... and ...
http://www.Example.com/&amp;
... when parsed & requested by a browser becomes ...
http://www.Example.com/&
So my question is: instead of re-inventing the wheel again, is there perhaps a Java library I haven't found that does this job? Failing that, can anyone point me towards a reference implementation in a common browser's source, or perhaps pseudo code? Failing that, any recommendations on approach are welcome!
Thanks,
Jon
HtmlUnit can certainly pick URLs out of HTML and resolve them (and much more).
I don't know whether it handles your corner cases, though. I would imagine it will handle the second, since that is a normal, if slightly funny-looking, use of HTML and a URL. I don't know what it will do with the first, in which an invalid URL is encoded in HTML.
I also know that if you find that HtmlUnit does something differently from how real browsers do it, write a JUnit test case to prove it and file a bug report, and its maintainers will happily fix it with great alacrity.
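A minimal sketch, assuming a reasonably recent HtmlUnit 2.x, of pulling anchors out of a page and letting HtmlUnit resolve them against the base URL (the URL is a placeholder):

    import java.net.URL;

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            try (WebClient webClient = new WebClient()) {
                webClient.getOptions().setJavaScriptEnabled(false); // keep it simple for link extraction

                HtmlPage page = webClient.getPage("http://www.example.com/"); // placeholder URL
                for (HtmlAnchor anchor : page.getAnchors()) {
                    // HtmlUnit decodes the HTML (e.g. &amp;) and resolves relative references here.
                    URL resolved = page.getFullyQualifiedUrl(anchor.getHrefAttribute());
                    System.out.println(resolved);
                }
            }
        }
    }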
How about using java.net.URLEncoder.encode() and java.net.URLDecoder.decode()?
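Note that URLEncoder applies application/x-www-form-urlencoded rules (space becomes +), which is right for query parameters but not for path segments; a small sketch of the difference, with a commonly used workaround:

    import java.net.URLEncoder;

    public class EncodeDemo {
        public static void main(String[] args) throws Exception {
            String raw = "þ and space";

            // Form-style encoding: þ -> %C3%BE, but space -> "+"
            String formEncoded = URLEncoder.encode(raw, "UTF-8");
            System.out.println(formEncoded); // %C3%BE+and+space

            // Common workaround when the value belongs in a URL path rather than a query string
            String pathEncoded = formEncoded.replace("+", "%20");
            System.out.println(pathEncoded); // %C3%BE%20and%20space
        }
    }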

How do I send a query to a website and parse the results?

I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter your zip code and it will give you all of the nearest locations. The program will just have an empty box for the user to enter their zip code, and it will query the actual Chipotle server to retrieve the nearest locations. How do I do that, and also how is the data I receive stored?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!
First you need to know the parameters required to execute the query and the URL these parameters should be submitted to (the action attribute of the form). With that, your application will have to make an HTTP request to that URL with your own parameters (possibly only the zip code). Finally, parse the answer.
This can be done with standard Java API classes, but it won't be very robust. A better solution would be HttpClient. Here are some examples.
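As a rough sketch using Apache HttpClient 4.x: the URL and the zip parameter name below are invented for illustration; the real ones come from the form's action attribute and field names.

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class LocationQuery {
        public static void main(String[] args) throws Exception {
            String zip = "60601"; // the user's input
            String url = "https://www.example.com/locations/search?zip=" + zip; // hypothetical endpoint

            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                String body = EntityUtils.toString(response.getEntity());
                System.out.println(body); // HTML, JSON, or XML depending on the site
            }
        }
    }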
This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used.
If it returns dynamic HTML (i.e. HTML with embedded Javascript) you may need to use something that evaluates the Javascript as part of the content extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
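If it does turn out to be a JSON API, a library such as Jackson can read the response body directly; the field names below ("locations", "address") are invented purely for illustration.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ParseJsonResponse {
        public static void main(String[] args) throws Exception {
            // Stand-in for the body returned by the HTTP request
            String body = "{\"locations\":[{\"address\":\"123 Main St\"}]}";

            JsonNode root = new ObjectMapper().readTree(body);
            for (JsonNode location : root.path("locations")) {
                System.out.println(location.path("address").asText());
            }
        }
    }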
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they get income from ads shown on their site, what you are proposing could divert visitors away from their site and result in a loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters ... or worse. In addition, they may already be using technical means to make life difficult for people scraping their service.

HTTP Request (POST) field size limit & Request.BinaryRead in JSPs

First off my Java is beyond rusty and I've never done JSPs or servlets, but I'm trying to help someone else solve a problem.
A form rendered by JavaScript is posting back to a JSP.
Some of the fields in this form are over 100KB in size.
However, when the form field is retrieved on the JSP side, the value of the field is truncated to 100KB.
Now I know that there is a similar problem in ASP with Request.Form, which can be worked around by using Request.BinaryRead.
Is there an equivalent in Java?
Or alternatively is there a setting in Websphere/Apache/IBM HTTP Server that gets around the same problem?
Since the posted request must be kept in-memory by the servlet container to provide the functionality required by the ServletRequest API, most servlet containers have a configurable size limit to prevent DoS attacks, since otherwise a small number of bogus clients could provoke the server to run out of memory.
It's a little bit strange if WebSphere is silently truncating the request instead of failing properly, but if this is the cause of your problem, you may find the configuration options here in the WebSphere documentation.
We have resolved the issue.
Nothing to do with web server settings, as it turned out, and nothing was being truncated in the post.
The form field, prior to posting, was being split by JavaScript into chunks of 102399 bytes, and each chunk was added to the form field as a value, so it ended up with an array of values.
Request.Form() appears to automatically concatenate these values to reproduce the single giant string, but Java's getParameter() does not.
Using getParameterValues() and rebuilding the string from the returned values, however, did the trick.
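A minimal sketch of that fix in a servlet, where the parameter name "payload" is just a placeholder for whatever the JavaScript used:

    import java.io.IOException;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ChunkedFieldServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            // getParameter("payload") would return only the first chunk; collect them all instead.
            String[] chunks = request.getParameterValues("payload"); // "payload" is illustrative
            StringBuilder full = new StringBuilder();
            if (chunks != null) {
                for (String chunk : chunks) {
                    full.append(chunk);
                }
            }
            String fieldValue = full.toString();
            response.getWriter().println("Received " + fieldValue.length() + " characters");
        }
    }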
You can use getInputStream (raw bytes) or getReader (decoded character data) to read data from the request. Note how this interacts with reading the parameters. If you don't want to use a servlet, have a look at using a Filter to wrap the request.
I would expect WebSphere to reject the request rather than arbitrarily truncate data. I suspect a bug elsewhere.
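For completeness, a minimal sketch of reading the raw body with getReader(), as suggested above; note that once the body has been consumed this way, getParameter() may no longer see the form fields:

    import java.io.BufferedReader;
    import java.io.IOException;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class RawBodyServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            StringBuilder body = new StringBuilder();
            try (BufferedReader reader = request.getReader()) {
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            response.getWriter().println("Read " + body.length() + " characters");
        }
    }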
