How does a browser open a saved HTML page? It has to load the HTML file and its other files from the hard disk, but how does it find those other files? Does the browser change the links in the HTML page from URLs to hard-disk locations?
How does it do that? I want to do the same thing in my application, but I could not figure out the process.
Most browsers store attached resources (Style sheets, images, scripts and the like) in a separate folder named after the saved page.
All references to resources are then converted to relative references, like so:
<img src="name_of_saved_folder/image.jpg">
The browser will then look in name_of_saved_folder relative to the saved HTML document's location.
If the HTML file is moved to a different location without that folder, the references will usually no longer work.
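For your own application, the rewriting step could be sketched roughly as follows. This is only an illustration, not how any particular browser does it: the class name LinkRewriter is made up, it only handles double-quoted src attributes with absolute http(s) URLs, and it assumes the resources have already been downloaded into the folder under their original file names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRewriter {
    // Matches src="http(s)://.../<file name>" and captures the file name.
    // Assumption: attributes are double-quoted; a real HTML parser is more robust.
    private static final Pattern ABSOLUTE_SRC =
        Pattern.compile("\\bsrc=\"https?://[^\"?#]*/([^\"/?#]+)[^\"]*\"");

    // Replaces each absolute src with "<folder>/<file name>".
    static String rewrite(String html, String folder) {
        Matcher m = ABSOLUTE_SRC.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = "src=\"" + folder + "/" + m.group(1) + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling LinkRewriter.rewrite(html, "name_of_saved_folder") on a page that contains an absolute image URL ending in image.jpg produces the relative reference shown above.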
In 1999, Internet Explorer introduced the very interesting concept of an archived HTML format (MHTML) that combines all resources in one file, but sadly, this hasn't yet caught on in terms of global, real-world support in all browsers.
Instead of coding this on your own, you may be able to interface with an existing tool like wget that can do all the grabbing for you. For most programming languages, there are probably related questions on Stack Overflow already on how to best store a HTML page and its resources locally.
You just have to use relative URLs, so that the browser loads the external files (images, etc.) relative to the location of the HTML page.
So if your HTML page is saved at file:///some/directory/page.html and contains <img src="image.png">, the browser will load this image from file:///some/directory/image.png.
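If you need the same resolution in your own code, java.net.URI can do it. A minimal sketch, where the class name RelativeResolve is just for illustration:

```java
import java.net.URI;

class RelativeResolve {
    // Resolves a reference found in a page against the page's own location.
    static URI resolve(String pageLocation, String reference) {
        return URI.create(pageLocation).resolve(reference);
    }

    public static void main(String[] args) {
        URI image = resolve("file:///some/directory/page.html", "image.png");
        System.out.println(image.getPath()); // /some/directory/image.png
    }
}
```

The same call also handles references such as ../css/site.css, which is why relative URLs keep working wherever the page and its files are stored together.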
Related
I have a URL that causes a web server to generate a PDF when opened. Is it possible to save this PDF document to disk (client side), using Java? I found lots of examples for doing this when the PDF already exists as a document on the web server, but the code for these examples does not seem to work in the case where the web server doesn't begin to create the content until the link is opened (at least that is my impression at this point).
There is a link that I can click to produce the PDF. The HREF for that link is:
<a href="javascript:open_window('ReportDisplay.cfmincidentID=223189&cs=377041BA2467C3CEA7FD989A12126E0E&services=2815&format=1&UniqueID=651F76E4E5691207B9B2AF1F51A780AA')">
<img src="../../Images/pdf-small.gif" alt="Report" border="0" height="15" width="15">
</a>
I construct a complete URL, including the protocol and such, and I can paste that complete URL into the location bar of the browser. This does in fact produce the PDF in the current window. So, this is what I'm trying to capture into a file on my local disk.
You are currently serving the PDF inline, whereas you want to serve it as an attachment. See the answer to the question Content-Disposition: What are the differences between "inline" and "attachment"? for the details.
If you use the Content-Disposition value "inline", the PDF will be shown in a browser window. If you use the Content-Disposition value "attachment", a dialog box will open, asking the end user where he wants to save the PDF.
You can't "automatically" save the PDF on the end user's machine because you don't have any idea about the operating system and the disk organization of the end user. If he's on Windows, the path C:/temp will probably exist, but if he's on a Mac or a Linux machine, that path won't exist. That's why you'll always need a "Save as" dialog on the client side.
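The above is about what a browser does. If the request is instead made from a Java program, the response body is just a stream of bytes, whether or not the server generates it on the fly, and it can be copied to a file. A minimal sketch, assuming the complete URL works without a login session or cookies (which may not be true for the report URL in the question); PdfSaver and its method names are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PdfSaver {
    // Copies a response body to a file and returns the number of bytes written.
    static long save(InputStream body, Path target) throws IOException {
        return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Opens the URL and stores whatever the server sends back.
    static long download(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return save(in, target);
        }
    }
}
```

Here the target path is chosen by the program that runs on the client, so no "Save as" dialog is involved.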
I am working on code that requires uploading any kind of document from the client's machine to the server and extracting the images from it. For almost all document types Tika is helpful, but in the case of an HTML page, the images are referenced by paths on the local machine. So how do I upload the HTML page along with the images it contains?
I'm using Java Servlets and JSP as platform.
This is impossible to solve server-side; you have to implement a client-side solution (JavaScript? Java applet? Flash (yuck!)?). The HTML document is just text: it does not contain the images, it only references them. So you have to parse the document, get the images, upload them independently, and then, server-side, process the document and adjust the image references (the values of the src attributes).
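The "parse the document, get the images" step might look like this in Java. It is a rough sketch based on a regular expression, which is an assumption made for brevity (a proper HTML parser copes better with unusual markup); ImageRefs is a hypothetical name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageRefs {
    // Captures the quoted value of the src attribute of each <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the src values in document order.
    static List<String> find(String html) {
        List<String> refs = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            refs.add(m.group(2));
        }
        return refs;
    }
}
```

Each returned value is a path on the client's machine, so the files still have to be read and uploaded by code running on the client.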
Pretty complex, isn't it?
I want to download an HTML page, extract some useful text out of this HTML, convert the HTML to PDF, and then store the useful text and the PDF in a NoSQL solution.
What is the most efficient way to pass the HTML to the module which extracts the useful text and the module which creates the PDF? I don't want to download the same HTML twice.
One way to store the HTML is to download it to a local disk under a uniquely named folder and pass the path to the other modules so that they can process the HTML.
This approach doesn't look that good to me, as there is implementation overhead.
I would love to see the entire HTML as a single variable so I can give it to other modules so they can traverse the HTML without loading it. One idea that crossed my mind is to download and zip the HTML and related code/pics then store the binary in a byte[].
I haven't used these before, but a quick type search in Eclipse for the text "html" gave me this:
Class HTMLDocument
From the docs :
A document that models HTML. The purpose of this model is to support both browsing and editing
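As an untested-in-production sketch of how that class could be used: the page is downloaded once into a String, parsed into an HTMLDocument with the Swing HTMLEditorKit, and the same String or document is then handed to both modules. HtmlInMemory is a made-up name, and the Swing parser is lenient and dated, so treat this as an assumption about what is good enough for text extraction.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class HtmlInMemory {
    // Parses HTML that is already held in memory; nothing is downloaded here.
    static HTMLDocument parse(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Keeps the parser from aborting when the page declares a charset in a <meta> tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc;
    }

    // Returns the document's text content without markup.
    static String text(HTMLDocument doc) throws BadLocationException {
        return doc.getText(0, doc.getLength());
    }
}
```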
How do I get the size of a web page in bytes, given its URL? This should include all images.
How can we do that? Any help is appreciated.
Thanks in advance.
To find the number of bytes used to represent a web page, you would need to:
fetch the HTML page and all images, scripts, CSS files, etc that it references, transitively,
evaluate any embedded scripts (as per the HTML spec) to see if they pull in further resources, and
sum the byte counts for all resources loaded to give the "web page size".
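The fetching and summing steps can be sketched as below. Everything here is an assumption for illustration: the fetch function is supplied by the caller (so it can be a real HTTP client or an in-memory map), references are found with simple regular expressions, only the root HTML and .css files are scanned for further references, and embedded scripts are not evaluated, so anything they load is missed.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageSize {
    // src/href values of <img>, <script> and <link> tags in the root HTML.
    private static final Pattern HTML_REF = Pattern.compile(
        "<(?:img|script|link)\\b[^>]*?\\b(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);
    // url(...) references inside style sheets.
    private static final Pattern CSS_REF = Pattern.compile(
        "url\\(\\s*[\"']?([^\"')]+)[\"']?\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Sums the sizes of the page and of every resource reachable from it.
    static long totalBytes(URI page, Function<URI, byte[]> fetch) {
        Deque<URI> queue = new ArrayDeque<>();
        Set<URI> seen = new HashSet<>();
        queue.add(page);
        seen.add(page);
        long total = 0;
        while (!queue.isEmpty()) {
            URI current = queue.remove();
            byte[] body = fetch.apply(current);
            if (body == null) continue; // resource could not be fetched
            total += body.length;
            String path = current.getPath();
            Pattern refs = current.equals(page) ? HTML_REF
                : (path != null && path.endsWith(".css")) ? CSS_REF : null;
            if (refs == null) continue; // counted, but not scanned for references
            Matcher m = refs.matcher(new String(body, StandardCharsets.UTF_8));
            while (m.find()) {
                URI ref = current.resolve(m.group(1).trim());
                if (seen.add(ref)) queue.add(ref); // each resource is counted once
            }
        }
        return total;
    }
}
```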
But I don't see what you would learn by doing this. For instance, the web page size (as above) is not a good predictor of network usage.
You say:
I am doing this to analyze the performance of a web page.
A better way would be to use something like the "yslow" plugin for Firefox.
I generate an HTML file using a log4j WriterAppender. I also take snapshots of my screen using WebDriver. Now I wish to append them together.
Any idea how to do that?
Thanks!
Apologies for not being clear. My situation is that I have an HTML file which is generated dynamically by my logger class, and there are also some .png files which are created dynamically. Now I want them to appear together in one file. Am I clear now? Please ask for more information if needed.
It's possible to embed graphics data in a couple of ways. Most modern browsers accept the data: url notation. An image can be embedded straight into a url.
I took an example from this site. Cut and paste the whole line into the url bar:

You should see a folder graphic. Some older browsers don't accept this, and some, such as IE8, restrict the content in various ways (for example, to static content) for security reasons.
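Since the PNG files in the question are produced from Java, building such a URL there is short. A minimal sketch using java.util.Base64, where DataUri is just an illustrative name:

```java
import java.util.Base64;

class DataUri {
    // Builds a data: URL that carries the PNG bytes inline, Base64 encoded.
    static String forPng(byte[] pngBytes) {
        return "data:image/png;base64," + Base64.getEncoder().encodeToString(pngBytes);
    }
}
```

The resulting string can be used anywhere a normal image URL would go, for example as the value of an img src attribute.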
The second way of doing the same is for the server to serve multi-part MIME. Basically a server would shove out a multi-part mime document consisting of the HTML body and then any inline images base64 encoded as separate parts. This is more suitable for email HTML although it might work through a web browser.
It's not quite clear what you're asking here, but let's assume that you want to manually add an image to the log output HTML file.
If you want to include an image in your HTML file, just save the snapshot PNG file in a place relative to where the HTML is generated, then include it using standard HTML syntax:
<img src="images/snapshot.png" alt="snapshot description">
Update: the requirement is to add dynamically generated PNG files to a dynamically created HTML log file.
If one process is creating both the PNG and the log output, you should be fine - just keep note of the appropriate PNG filename and include it in the logger output in an IMG tag (as described above).
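A small helper for producing that tag could look like the following. SnapshotTag is a hypothetical name, and whether the tag ends up as markup in the log file depends on the layout in use: a layout that escapes HTML in messages would show it as plain text instead.

```java
class SnapshotTag {
    // Escapes characters that would break out of an attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;")
                .replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds an <img> tag that points at the snapshot, relative to the log file.
    static String imgTag(String relativePath, String alt) {
        return "<img src=\"" + escape(relativePath) + "\" alt=\"" + escape(alt) + "\">";
    }
}
```

The string returned by imgTag("images/snapshot.png", "snapshot description") is what would be passed to the logger right after the screenshot is written.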
If they are generated by separate processes, this may be more difficult; you would need to either stick to a known naming convention, have the process generating the log query the filesystem to determine the appropriate PNG file to include, or build some sort of message-passing between the two processes.
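The "query the filesystem" option could be sketched like this, under the assumption that the most recently modified PNG in a known directory is the one that belongs to the current log entry. That assumption breaks if snapshots are written concurrently; LatestSnapshot is a made-up name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Locale;
import java.util.Optional;
import java.util.stream.Stream;

class LatestSnapshot {
    // Returns the most recently modified .png file in the directory, if any.
    static Optional<Path> newestPng(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".png"))
                .max(Comparator.comparingLong((Path p) -> p.toFile().lastModified()));
        }
    }
}
```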
Please stop posting the same comment to each and any of the different answers given to you, when all of those answers basically tell you that the notion of concatenating two different file formats into a single file is not meaningful.
Let me repeat that again for clarity: Copying a PNG file into a HTML document makes no sense.
You either save the PNG in a directory where it's accessible in the HTML document and add an img tag so it can be referenced (see the answer by stark), which would be the recommended way in terms of portability and usage of the files as they were intended to be used.
If you really, really want to end up with a single file for whatever reason, there are basically two options. The first is to follow the advice of locka: encode the PNG image with Base64 and insert an img tag with a data URI at a meaningful position. This probably involves parsing the HTML "a little" to come up with a good place to insert it.
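As a sketch of the "parsing a little" part, one simple choice of position is just before the last closing body tag. That choice and the name HtmlInsert are assumptions for illustration; the fragment would be an img tag whose src is the data URI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlInsert {
    private static final Pattern BODY_END =
        Pattern.compile("</body\\s*>", Pattern.CASE_INSENSITIVE);

    // Inserts the fragment before the last </body>, or appends it if there is none.
    static String beforeBodyEnd(String html, String fragment) {
        Matcher m = BODY_END.matcher(html);
        int at = -1;
        while (m.find()) {
            at = m.start(); // remember the last closing tag
        }
        if (at < 0) {
            return html + fragment;
        }
        return html.substring(0, at) + fragment + html.substring(at);
    }
}
```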
The other option is to not create HTML, but MHTML files. MHTML is a file format that allows saving HTML source code and resources like images into a single file. MHTML is supported by the most popular browsers nowadays, you may find info on the file format here: http://people.dsv.su.se/~jpalme/ietf/mhtml.html
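For orientation, an MHTML file is a MIME multipart/related message with one part per resource. The sketch below assembles such a message by hand from one HTML string and one PNG. It follows the general MIME layout only, and it is an assumption that a given browser will accept the result as written (in particular the relative Content-Location values); MhtmlSketch and the boundary string are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class MhtmlSketch {
    private static final String BOUNDARY = "snapshot-boundary"; // arbitrary separator string

    // One MIME part: headers, blank line, Base64 body wrapped at 76 characters.
    private static String part(String contentType, String location, byte[] body) {
        return "--" + BOUNDARY + "\r\n"
            + "Content-Type: " + contentType + "\r\n"
            + "Content-Transfer-Encoding: base64\r\n"
            + "Content-Location: " + location + "\r\n\r\n"
            + Base64.getMimeEncoder().encodeToString(body) + "\r\n";
    }

    // The img src in the HTML is expected to match pngLocation.
    static String build(String html, String htmlLocation, byte[] png, String pngLocation) {
        return "MIME-Version: 1.0\r\n"
            + "Content-Type: multipart/related; type=\"text/html\"; boundary=\"" + BOUNDARY + "\"\r\n\r\n"
            + part("text/html; charset=utf-8", htmlLocation, html.getBytes(StandardCharsets.UTF_8))
            + part("image/png", pngLocation, png)
            + "--" + BOUNDARY + "--\r\n";
    }
}
```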
In the code where you are generating the html you should just include the img using the img html tag
If you want the picture to appear in the HTML, add the tag
<img src="./img.png" /> to your HTML.
If you want the 2 files in one, you'll need to zip them into an archive or something?
It makes no sense to append a HTML file to a PNG file, or vice-versa. Neither file format allows this, so if you do this you will end up with a "corrupt" document that a typical web browser or image viewer won't understand.
"I want them to appear together in one file".
That's still pretty vague, I'm afraid.
Assuming that you want the image to appear embedded in the HTML document when you open the HTML document in a browser, the simple solution is to create separate HTML and PNG files, and have the HTML file link to the PNG file using an <img> element.
If you want, you can bundle up the files (and others) as a ZIP or TAR file, so that you can deliver everything as a single file. However, a ZIP/TAR file typically needs to be extracted before the document can be viewed. (A typical web browser won't "display" a ZIP file. Rather it will open it in some kind of archive extractor or directory browser, allowing the user to access the individual files.)
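Bundling the files as a ZIP needs only the JDK. A minimal sketch that zips in-memory files into a byte array, with Bundle as an illustrative name and entry names chosen so that the relative img reference still works after extraction:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class Bundle {
    // Writes each (entry name, content) pair into one ZIP archive held in memory.
    static byte[] zip(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }
}
```

For example, a map with the entries log.html and images/snapshot.png yields an archive that unpacks into a page and its image in the right relative positions.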
It might also be possible to embed an image file in an HTML file by base64 encoding the image, and using embedded JavaScript to decode the image and then insert it into the DOM ... But this is probably way too complicated.