I have a school project to parse web pages and use them like a database. When I tried to download data from https://www.marathonbet.com/en/betting/Football/, I didn't get all of it.
Here is my code:
Document doc = Jsoup.connect("https://www.marathonbet.com/en/betting/Football/").get();
Elements newsHeadlines = doc.select("div#container_EVENTS");
for (Element e : newsHeadlines.select("[id^=container_]")) {
    System.out.println(e.select("[class^=block-events-head]").first().text());
    System.out.println(e.select("[class^=foot-market]").select("[class^=event]").text());
}
As a result you get this (it is the last of the displayed leagues):
Football. Friendlies. Internationals All bets Main bets
1. USA 2. Mexico 16 Apr 01:30 +124 7/5 23/10 111/50 +124
Above it, all the other leagues are displayed.
Why don't I get the full data? Thank you for your time!
Jsoup has a default body response limit of 2MB. You can change it to whatever you need with maxBodySize(int):
Set the maximum bytes to read from the (uncompressed) connection into the body, before the connection is closed, and the input truncated. The default maximum is 2MB. A max size of zero is treated as an infinite amount (bounded only by your patience and the memory available on your machine).
E.g.:
Document doc = Jsoup.connect(url).userAgent(ua).maxBodySize(0).get();
You might like to look at the other options in Connection, on how to set request timeouts, the user-agent, etc.
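For instance, here is a minimal sketch combining maxBodySize(0) with the scraping code from the question (the user-agent string and timeout value are illustrative, not requirements):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeAll {
    public static void main(String[] args) throws Exception {
        // maxBodySize(0) removes the 2MB cap; user-agent and timeout are illustrative
        Document doc = Jsoup.connect("https://www.marathonbet.com/en/betting/Football/")
                .userAgent("Mozilla/5.0")
                .timeout(10_000)   // read timeout in milliseconds
                .maxBodySize(0)    // 0 = unlimited
                .get();
        Elements newsHeadlines = doc.select("div#container_EVENTS");
        for (Element e : newsHeadlines.select("[id^=container_]")) {
            System.out.println(e.select("[class^=block-events-head]").first().text());
            System.out.println(e.select("[class^=foot-market]").select("[class^=event]").text());
        }
    }
}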
Two things:
I am trying to set custom metadata on a GCS object signed URL.
I am trying to set a maximum file size on a GCS object signed URL.
Using the following code:
Map<String, String> headers = new HashMap<>();
headers.put("x-goog-meta-" + usernameKey, username);
if (StringUtils.hasText(purpose)) {
    headers.put("x-goog-meta-" + purposeKey, purpose);
}
if (maxFileSizeMb != null) {
    headers.put("x-goog-content-length-range", String.format("0,%d", maxFileSizeMb * 1048576));
}
List<Storage.SignUrlOption> options = new ArrayList<>();
options.add(Storage.SignUrlOption.httpMethod(HttpMethod.POST));
options.add(Storage.SignUrlOption.withExtHeaders(headers));
String documentId = documentIdGenerator.generateDocumentId().getFormatted();
StorageDocument storageDocument =
    StorageDocument.builder().id(documentId).path(getPathByDocumentId(documentId)).build();
storageDocument.setFormattedName(documentId);
SignedUrlData.SignedUrlDataBuilder builder =
    SignedUrlData.builder()
        .signedUrl(storageInterface.signUrl(gcpStorageBucket, storageDocument, options))
        .documentId(documentId)
        .additionalHeaders(headers);
First of all, the generated signed URL works and I can upload a document.
Now I expect to see the object metadata in the console view, but no metadata is set. Also, the content-length-range is not respected: I can upload a 1.3 MB file when the content-length-range is set to 0,1.
Something different happens when I upload a bigger file (~5 MB) that is still within the content-length-range: I receive the error message "Metadata part is too large."
As you can see here, content-length-range requires both a minimum and a maximum size, and the unit used for the range is bytes, as you can see in this example.
I also noticed that you used x-goog-content-length-range. I found this documentation for it; when using this header, take into account the following (a signing sketch follows this list):
Use a PUT request; otherwise the header will be silently ignored.
If the size of the request's content is outside the specified range, the request fails and a 400 Bad Request code is returned in the response.
You have to set the minimum and maximum size in bytes.
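Here is a minimal signing sketch using the google-cloud-storage client directly (the bucket name, object path, metadata values, and 15-minute expiry are all illustrative). Note that the uploader must send these same headers with the PUT request, or the signature check will reject the upload:

import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.HttpMethod;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class SignedUploadUrl {
    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("x-goog-meta-username", "alice");            // custom metadata (illustrative)
        headers.put("x-goog-content-length-range", "0,1048576"); // 0 to 1 MiB, in bytes

        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blobInfo = BlobInfo.newBuilder("my-bucket", "docs/my-object").build();

        URL signedUrl = storage.signUrl(
                blobInfo,
                15, TimeUnit.MINUTES,
                Storage.SignUrlOption.httpMethod(HttpMethod.PUT), // PUT, not POST
                Storage.SignUrlOption.withExtHeaders(headers));
        System.out.println(signedUrl);
    }
}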
I've implemented a WebDAV server using WebDAV-Servlet.
I open a document through WebDAV and make a change to it. When I save the document, Word alerts me that the document was changed by another user, although no one else has edited it.
I don't understand the problem. Who has edited this document?
Is there a problem with my Lock implementation?
After a while I found the solution.
The root cause of this problem is that the last-modified date changes between the LOCK and UNLOCK requests.
The last-modified date and the created date are combined into a single number, and that number is returned in the "ETag" header of the response to Word's HEAD request. In my case it looks like ETag: W/"1234--9223372036854775808".
Microsoft Word stores the ETag value and sends it back in the If-None-Match request header.
The ETag value and If-None-Match must be the same; otherwise Word assumes the document content has changed and shows the conflict alert.
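As a sketch of the fix (the helper and its date parameters are hypothetical; WebDAV-Servlet's actual types differ), the ETag should be built only from values that stay stable between LOCK and UNLOCK:

import java.util.Date;
import javax.servlet.http.HttpServletResponse;

// Hypothetical helper: derive the ETag only from dates that stay stable
// between the LOCK and UNLOCK requests, so the value returned by HEAD
// still matches the If-None-Match header Word sends back on save.
static void writeEtag(HttpServletResponse response, Date created, Date lastModified) {
    String etag = "W/\"" + created.getTime() + "-" + lastModified.getTime() + "\"";
    response.setHeader("ETag", etag);
}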
Another point is that you should add your website to the trusted sites; if you don't, an alert will be raised before Word opens the document.
I am using Apache Tika to extract content from PDF files.
When I run it I get the error below. I don't see this error documented anywhere, and it was an unpleasant surprise.
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
at org.apache.tika.parser.pdf.PDF2XHTML.writeWordSeparator(PDF2XHTML.java:318)
at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1741)
at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:141)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
I just want to know how to get rid of this error and be able to parse files again, or how to make this limit unlimited.
You can use the writeLimit constructor parameter to set the limit, or even disable it, using:
public BodyContentHandler(int writeLimit)
The docs say the following:
writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
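For example, a minimal sketch that disables the limit entirely (the file name is a placeholder):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaNoLimit {
    public static void main(String[] args) throws Exception {
        // -1 disables the default 100,000-character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream("document.pdf")) { // placeholder path
            parser.parse(stream, handler, metadata);
        }
        System.out.println(handler.toString());
    }
}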
Is there a limit to the number of entries returned in a DocumentListFeed? I'm getting 100 results, and some of the collections in my account are missing.
How can I make sure I get all of the collections in my account?
DocsService service = new DocsService(APP_NAME);
service.setHeader("Authorization", "Bearer " + accessToken);
URL feedUrl = new URL("https://docs.google.com/feeds/default/private/full/-/folder?v=3&showfolders=true&showroot=true");
DocumentListFeed feed = service.getFeed(feedUrl, DocumentListFeed.class);
List<DocumentListEntry> entries = feed.getEntries();
The size of entries is 100.
A single request to the Documents List feed returns 100 elements by default, but you can configure that value by setting the ?max-results query parameter.
Regardless, in order to retrieve all documents and files you should always be prepared to send multiple requests, one per page, as explained in the documentation:
https://developers.google.com/google-apps/documents-list/#getting_all_pages_of_documents_and_files
Please also note that it is now recommended to switch to the newer Google Drive API, which interacts with the same resources and has complete documentation and sample code in multiple languages, including Java:
https://developers.google.com/drive/
You can call
feed.getNextLink().getHref()
to get a URL that you can form another feed with. This can be done until the link is null, at which point all the entries have been fetched.
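Putting both answers together, here is a minimal pagination sketch (assuming `service` is the authenticated DocsService from the question; the max-results value is illustrative):

import com.google.gdata.client.docs.DocsService;
import com.google.gdata.data.docs.DocumentListEntry;
import com.google.gdata.data.docs.DocumentListFeed;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Page through the feed until getNextLink() returns null.
static List<DocumentListEntry> fetchAll(DocsService service) throws Exception {
    URL feedUrl = new URL("https://docs.google.com/feeds/default/private/full/-/folder"
            + "?v=3&showfolders=true&showroot=true&max-results=1000");
    List<DocumentListEntry> all = new ArrayList<>();
    DocumentListFeed feed = service.getFeed(feedUrl, DocumentListFeed.class);
    while (true) {
        all.addAll(feed.getEntries());
        if (feed.getNextLink() == null) {
            break; // no more pages
        }
        feed = service.getFeed(new URL(feed.getNextLink().getHref()), DocumentListFeed.class);
    }
    return all;
}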
I'm writing a Java app to look at the trackers listed in a torrent file.
I send the following:
http://pow7.com/announce?info_hash=%3f%99%79%31%73%27%9e%be%1d%d2%cd%5f%af%98%7c%17%5f%43%89%f3&peer_id=-jT1000-122843C6A4B0&port=6881&downloaded=0&left=0
But no matter what info_hash I send, I either get the same peer IP address back (74.253.253.31:6757) or an error.
Any ideas why this happens?
OK, I think I found the answer to my question:
One needs to generate a SHA-1 hash from the value of the info key. I take all the bytes from the d (inclusive; it is the next byte after the word "info") to the last e of the info map (inclusive).
Thus it will be the SHA-1 of the bold part of the snippet below:
...:info d5:filesld6:...[many bytes]...e 9:...
(Without the spaces before the d and after the e.)
Then I simply convert the byte array returned by MessageDigest, inserting a % before every two-digit hex pair. E.g.:
%70%47%8F...[snip]...%13%6F%6C
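For example, a minimal sketch of that conversion (assuming `infoBytes` holds exactly the bencoded info dictionary, from its opening d to its matching e):

import java.security.MessageDigest;

// Compute the SHA-1 digest of the raw bencoded info dictionary and
// percent-encode every byte of the digest as a two-digit hex pair.
static String infoHash(byte[] infoBytes) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-1").digest(infoBytes);
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) {
        sb.append('%').append(String.format("%02x", b));
    }
    return sb.toString(); // e.g. %3f%99%79...
}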