I am writing a program that crawls through various web pages and runs some tests using Selenium.
Now I want to find out which CSS frameworks are used on these websites, to gather some statistics.
Right now I just use the Firefox WebDriver to check whether the page links any .css files whose name matches a specific framework:
Iterator<WebElement> divWebElementIteratorCSS = webDriver.findElements(By.xpath("//link[@rel='stylesheet']")).iterator();
and then I check whether the name of each found .css file contains the name of one of the CSS frameworks I want to detect:
if (src.contains(frameWorkName) && Boolean.FALSE.equals(cssFrameWorks.get(frameWorkName))) {
    result.addAttribute("Framework", "STRING", frameWorkName);
    result.setPercent(100);
    result.setSuccessful(true);
    cssFrameWorks.put(frameWorkName, true); // remember that this framework was already reported
}
The HashMap cssFrameWorks contains all the names of the frameworks I am interested in, each mapped to a flag recording whether it has already been found.
Now my problem: if the administrator of the site has renamed the framework's .css file, my test does not work. Is there a reliable way to check this that works even if the .css file has a different name?
I think @AaronDigulla's answer is pretty clear.
An alternative that I can think of, though, is to do a GET on each of those CSS files as you iterate through them, and then do a quick scan of the comment header at the beginning. For example, a CSS file might contain...
/* CSS Framework vX.X
* Author: Some Author
* License:
* Some ridiculously long license
*/
This would alleviate your file-renaming issue.
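For illustration, here is a rough sketch of that check, assuming the framework name appears somewhere in the header comment (the 30-line cutoff is an arbitrary choice):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CssHeaderSniffer {
    // Returns true if the top of the stylesheet mentions the framework name.
    static boolean headerMentions(String cssUrl, String frameworkName) throws Exception {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(cssUrl).openStream(), "UTF-8"))) {
            String line;
            int linesRead = 0;
            while ((line = in.readLine()) != null && linesRead++ < 30) {
                if (line.toLowerCase().contains(frameworkName.toLowerCase())) {
                    return true;
                }
            }
        }
        return false;
    }
}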
I know of no reliable way to determine which CSS (or JavaScript) frameworks are being used on a web site.
If you're lucky, then the admins will use a global URL (like the CDN links provided by jQuery).
When people start to rename files, you can try to download the CSS file and fingerprint it (create a checksum).
That will fail, of course, when people change those files. This can happen automatically: wro4j, for example, is a framework that automatically compiles all JavaScript and CSS resources into one big file each.
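A sketch of that fingerprinting idea, hashing the downloaded stylesheet with SHA-256 so it can be compared against a table of known framework-version checksums:
import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

public class CssFingerprint {
    // Downloads the stylesheet and returns its SHA-256 checksum as a hex string.
    static String sha256(String cssUrl) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new URL(cssUrl).openStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString(); // look this up in a map of known framework hashes
    }
}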
I'm also a bit worried about why you would need this information. Instead of trying to figure out which framework (and which version) is being used, look for the actual CSS styles being applied that might influence your tests.
I want the JS and CSS files that have been modified to be downloaded on the client side when a page is accessed. I have these approaches:
Manually add the modified timestamp to the URL in each page.
I was thinking of writing a scriptlet in all the JSP pages that reads each JS and CSS file's modified timestamp and appends it to the URL in the page.
Add the modified timestamp while building the WAR file using Ant.
I have the following questions:
Can anyone let me know which of the above approaches would be the better solution? I am open to other solutions as well.
I went through this answer on SO, and using it I can get the modified date, but how do I change the JSP file?
Is there anything similar to this in Java?
In this situation, the better or best solution takes shape according to your exact requirements. I would start from simple questions like: are your static resources on the same server, or bundled into your app on the same server, etc.? There may be other, better ways.
I don't have Ant experience, so I can't speak to that, but you can already go the Java way; I just want to share an idea or two. A filter (which watches for .css or .js requests, gets the resource, and looks at its last-modified date or checksum to return as a version on the response) or a custom JSP tag would meet your requirements. For example, write a custom JSP tag like <resource:static path="app.js"/>. It would look up the given file's last-modified date (assuming the file sits under the same document root) and produce output like <script type="text/javascript" src="app.js?version=8637"></script>, which busts the cache.
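To make the custom-tag idea concrete, here is a minimal sketch of such a tag handler; the tag name, attribute, and file layout are assumptions, and the tag would still need to be registered in a TLD:
// Hypothetical tag handler backing <resource:static path="app.js"/>.
// It appends the file's last-modified time as a cache-busting version parameter.
import java.io.File;
import java.io.IOException;
import javax.servlet.ServletContext;
import javax.servlet.jsp.JspException;
import javax.servlet.jsp.PageContext;
import javax.servlet.jsp.tagext.SimpleTagSupport;

public class StaticResourceTag extends SimpleTagSupport {
    private String path; // e.g. "app.js", relative to the document root

    public void setPath(String path) {
        this.path = path;
    }

    @Override
    public void doTag() throws JspException, IOException {
        PageContext ctx = (PageContext) getJspContext();
        ServletContext app = ctx.getServletContext();
        String realPath = app.getRealPath("/" + path); // assumes an exploded deployment
        long version = realPath != null ? new File(realPath).lastModified() : 0L;
        String url = path + "?version=" + version;
        if (path.endsWith(".css")) {
            ctx.getOut().write("<link rel=\"stylesheet\" href=\"" + url + "\"/>");
        } else {
            ctx.getOut().write("<script type=\"text/javascript\" src=\"" + url + "\"></script>");
        }
    }
}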
I am trying to build a very rudimentary crawler that moves through certain specific links and extracts the contents from them. I am using JSoup to traverse the links on a page and read the required content.
However, I have hit a roadblock on one of the sites. It is a kind of news portal on which users are allowed to post their own comments, and I need to extract these comments. If there are more than 5 comments, they are spread over several pages, and the links to the subsequent pages are created by JavaScript code in the href (instead of a real link). It looks something like this:
<a id="pager1_lnkPage2" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("pager1$lnkPage2", "", true, "", "", false, true))">2</a>
Now I have no idea how to traverse the links generated by this JavaScript. Is there any way to get the data on the pages these links refer to? (On the face of it they do not seem to create any new link, since the URL does not change while navigating through the other pages.)
For your reference here is a link to one such page. The links to navigate through multiple pages are at the lower right corner of the page.
This is embedded on the page with the main story in an iframe.
I have also come across an interface called ScriptEngine in the javax.script package, but I could not understand it well enough to use it here.
Thanks
I've never used JSoup, but judging by its description (it is an HTML parser) and the fact that you are trying to somehow incorporate JavaScript into it, it sounds like you chose the wrong tool for the job.
In your case I would rather go with Zombie.js (Node.js based) or Selenium. The latter may be the better choice if you want to stick with Java (Selenium has Java bindings).
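For what it's worth, a minimal Selenium sketch of that approach; the URL, iframe name, and comment selector are hypothetical, and only the pager-id pattern is taken from your markup:
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CommentPagerCrawler {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://example.com/news/some-story"); // hypothetical URL
            driver.switchTo().frame("commentsFrame");          // hypothetical iframe name
            int page = 2;
            while (true) {
                // Hypothetical selector for a single comment block
                for (WebElement comment : driver.findElements(By.cssSelector(".comment"))) {
                    System.out.println(comment.getText());
                }
                // Pager ids follow the pattern from the question: pager1_lnkPage2, pager1_lnkPage3, ...
                List<WebElement> next = driver.findElements(By.id("pager1_lnkPage" + page));
                if (next.isEmpty()) {
                    break; // no further pages
                }
                next.get(0).click(); // the browser executes the postback for us
                page++;              // production code should wait for the new page to load here
            }
        } finally {
            driver.quit();
        }
    }
}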
My work has tasked me with determining the feasibility of migrating our existing in-house-built change management services (web based) to a SharePoint solution. I've found everything to be easy except for one issue: for each change management issue (several thousand of them) there may be any number of attachment files, called through JavaScript, that need to be downloaded and put into a document library.
(ex. ... onClick="DownloadAttachment(XXXXX,'ProjectID=YYYY');return false">Attachment... ).
To keep myself from manually selecting them all, I've been looking over posts from people wanting to do something similar, and there seem to be many possible solutions, but they often seem more complicated than they need to be.
So I suppose, in a nutshell, I'm asking what would be the best way to approach this issue that yields some sort of desktop application or script that can interact with web pages and will let me select and organize all the attachments. (Making a purely web-based app (PHP, JavaScript, Rails, etc.) is not an option for me, so I'm throwing that out there now.)
Thanks in advance.
Given a document id and project id (XXXXX and YYYY respectively in your example), figure out the URL from which the file contents can be downloaded. You can observe a few URL links in the browser and detect the pattern your web application uses.
Use a tool like Selenium to get a list of the XXXXXs and YYYYs of the documents you need to download.
Write a bash script with wget to download the files locally and put them in the correct folders.
This is a "one off" migration, right?
Get access to your in-house application's database, and create an SQL query that pulls out rows showing the attachment names (XXXXX?) and the issue/project (YYYY?), e.g.:
|file_id|issue_id|file_name |
| 5| 123|Feasibility Test.xls|
Analyze the DownloadAttachment method and figure out how it generates the URL that it calls for each download.
Start a script (personally I'd go for Python) that will do the migration work.
Program the script to connect and run the SQL query, or to read a CSV file you create manually from step #1.
Program the script to use the details to determine the target-filename and the URL to download from.
Program the script to download the file from the given URL, and place it on the hard drive with the proper name. (In Python, you might use urllib.)
Hopefully that will get you as far as a bunch of files categorized by "issue" like:
issue123/Feasibility Test.xls
issue123/Billing Invoice.doc
issue456/Feasibility Test.xls
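To illustrate steps 5 and 6, here is a rough sketch of the download logic, written in Java to match the rest of this thread (the answer suggests Python); the URL pattern and field names are assumptions you would replace with whatever step 2 reveals:
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AttachmentMigrator {
    // Downloads one attachment and files it under its issue folder.
    static void download(int fileId, int issueId, String fileName) throws Exception {
        // Hypothetical URL pattern, to be reconstructed from the DownloadAttachment analysis
        String url = "http://intranet/app/DownloadAttachment.aspx?FileID=" + fileId
                   + "&ProjectID=" + issueId;
        Path target = Paths.get("issue" + issueId, fileName); // e.g. issue123/Feasibility Test.xls
        Files.createDirectories(target.getParent());
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}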
Thank you everyone. I was able to get what I needed using HtmlUnit and Java: I traversed a report I made of all change items with attachments, went to each one, copied the source code, searched it for instances of the download method, copied the unique IDs of each attachment, and built an .xls of all items and their attachments.
I would like to have a tree/folder structure for my content but would like all pages to be served at a flat URL. E.g.
the page located at /cat1/subcat2/tulips.html would be served at:
http://example.com/tulips.html
and the page located at /cat5/roses.html would be served at:
http://example.com/roses.html
I would need all links to be automatically calculated and ensure that there are no conflicts.
Is this possible with OpenCms?
Thanks,
Assaf
A rough outline of how I'd approach this:
You would first get the list of all the resources via <cms:contentload> (http://www.bng-galiza.org/opencms/opencms/alkacon-documentation/documentation_taglib/docu_tag_contentload.html), either through the taglib or the respective Java API, as you need some coding anyway. Then create new resources of type 'external link' in your OpenCms root folder, pointing to your targets, probably using something like
getCms().createResource(newFileName, templateFile.getTypeId());
or a similar method (as an external link isn't structured content).
You could wrap this logic up in a Java class and schedule it as a scheduled job. I guess that's sufficient as long as you don't need the links right away and some delay is acceptable; otherwise you'd need to hook it into the publishing flow.
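A very rough sketch of what that class might look like. I'm quoting the OpenCms core API from memory here ("pointer" being the external-link resource type), so treat every method name as an assumption to verify against the API docs of your OpenCms version:
import java.util.ArrayList;
import java.util.List;
import org.opencms.file.CmsObject;
import org.opencms.file.CmsProperty;
import org.opencms.file.CmsResource;
import org.opencms.file.CmsResourceFilter;
import org.opencms.main.OpenCms;

public class FlatUrlJob {
    public void createFlatLinks(CmsObject cms) throws Exception {
        // "pointer" is OpenCms' external-link resource type (assumption: verify the name)
        int pointerType = OpenCms.getResourceManager().getResourceType("pointer").getTypeId();
        // Read all files below the content tree (the folder path is an example)
        List<CmsResource> pages = cms.readResources("/cat1/", CmsResourceFilter.DEFAULT_FILES, true);
        for (CmsResource page : pages) {
            String flatName = "/" + CmsResource.getName(page.getRootPath()); // e.g. /tulips.html
            if (!cms.existsResource(flatName)) { // naive conflict check
                // A pointer resource stores its link target as the file content
                cms.createResource(flatName, pointerType,
                        cms.getSitePath(page).getBytes("UTF-8"),
                        new ArrayList<CmsProperty>());
            }
        }
    }
}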
I'm in charge of updating an existing Java app for an embedded device (a copier).
One of the things I want to do is create a servlet that allows the download of all the files in our sandboxed directory on the device (which will include the application log files, local caches, etc.). At the moment these files are all in a single directory with no subdirectories.
Basically what I'd like to do is as follows:
Log.log
Log.log.1
Log.log.2
SomeLocalCache.txt
AnotherLocalCache.txt
where each line is a clickable link allowing download of the file.
However, my HTML experience is basically nil, and my familiarity with the Java API is still fairly rudimentary, so I'm looking for some advice on the proper way to go about it.
I've read through all the samples provided, and here's what I'm thinking.
I can create a servlet at a specified URL on the device which will call into my code. Let's call this /MyApp.
I add another link below that, let's call it /MyApp/Download.
When this address is reached in a browser, it displays the list of files.
This list will have to be created on the fly. I can create an HTML template file and put it in the res folder (this seems to be the recommended method for the device in question), but the whole list of files/links will need to be substituted in at run time. Here's an example I found using <ol> + <li> tags for the list and <a> tags for the links. I can generate that on the fly pretty easily (see the sketch after the example below). Is that a reasonable way to go?
e.g.
<ol>
<li>
Log.log
</li>
<!--more <li> elements-->
</ol>
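For instance, the substitution I have in mind would be generated roughly like this (paths and names are placeholders):
import java.io.File;
import java.io.PrintWriter;

public class FileListWriter {
    // Writes the files in dir as an HTML list of download links.
    static void writeFileList(PrintWriter out, File dir) {
        out.println("<ol>");
        for (File file : dir.listFiles()) {
            if (file.isFile()) {
                // Real code should URL-encode the name (java.net.URLEncoder)
                // in case file names contain spaces or special characters.
                out.println("  <li><a href=\"/MyApp/Download/" + file.getName() + "\">"
                        + file.getName() + "</a></li>");
            }
        }
        out.println("</ol>");
    }
}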
Clicking on an individual file will link to /MyApp/Download/File.ext which will then trigger the file download via my servlet (I've found this code which looks promising for the actual download).
The device will require users to log in before they are allowed to access the /MyApp link or any sub-links, and I can additionally require that the logged-in user be an admin before allowing file download, which together seems like sufficient security in this case (heavy security is not required for these files).
So am I missing anything big or is this a reasonable plan of engagement?
EDIT
Judging by this link, when to use UL or OL in html?, many people are going to hammer the answer and comment below, because they say it is important to put semantic information into the HTML.
My point is simply this: the only difference is that browsers will display one with bullet points (as the OP seems to want) and one with numbers (as the OP does not want). I suggest he change the HTML to the way he wants it to render, or leave it as is and make some CSS changes.
Yes, there is a semantic difference between the two... they will both still render in order, as defined here: http://www.w3.org/TR/html401/struct/lists.html. Is the HTML the place to put semantic information? I think not; the code that generated the HTML is the correct place. Your cohesion may vary.
I won't change my original comment for the sake of history.
END EDIT
Seems fine to me -- however, <ol> is not really used any more; I'd go with <ul>. Don't worry, it is still ordered as you would expect.
The reason for this is that the only difference between the two was that browsers would automatically number (render with a number before) ordered lists. However, with CSS all the rendering control can be in the CSS (including numbering) and everyone is happy.
Hardly anyone uses the auto numbering any more. In fact, via CSS, lists can be and are used for all sorts of crazy things, including CSS menuing systems.
Here's a summary of what you need to do:
You can use File#listFiles() to get a File[].
You can use JSTL c:forEach to iterate over an array.
You can use HTML <ol>, <ul> or <dl> elements to display a list.
You can use HTML <a> element to display a link.
You can use a Servlet to write the InputStream of a local file to the OutputStream of the response. Remember to pass at least the Content-Type, Content-Length and Content-Disposition headers along (see the sketch after this list).
You can make use of the request path info to pass the file identifier safely. E.g. map the servlet on /files/* and let the link point to http://example.com/files/path/to/file.ext; in the servlet you can then get /path/to/file.ext by request.getPathInfo().
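A bare-bones sketch of such a servlet, with the files root as an assumption (the FileServlet linked below is a complete version):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Map this servlet on /files/* in web.xml.
public class DownloadServlet extends HttpServlet {
    private static final String BASE = "/path/to/all/files"; // assumption

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String path = request.getPathInfo(); // e.g. /path/to/file.ext
        if (path == null || path.contains("..")) { // reject missing or traversal paths
            response.sendError(HttpServletResponse.SC_BAD_REQUEST);
            return;
        }
        File file = new File(BASE, path);
        if (!file.isFile()) {
            response.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }
        response.setContentType(getServletContext().getMimeType(file.getName()));
        response.setContentLength((int) file.length());
        response.setHeader("Content-Disposition",
                "attachment; filename=\"" + file.getName() + "\"");
        try (InputStream in = new FileInputStream(file);
                OutputStream out = response.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int length;
            while ((length = in.read(buffer)) > 0) {
                out.write(buffer, 0, length);
            }
        }
    }
}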
A basic and solid servlet example can be found here: FileServlet. If you want to add resume and compressing capabilities, then you may find the improved FileServlet more useful.
That said, most appservers also support directory listing out of the box; Tomcat is one of them (note that in current versions directory listings are switched off by default; see the snippet below). You can just define another <Context> in Tomcat's server.xml with a docBase of C:/path/to/all/files and a context path of /files (so that it's accessible by http://example.com/files).
<Context docBase="/path/to/all/files" path="/files" />
That's basically all. No homegrown code/html/servlet needed.
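One caveat: the directory listings come from Tomcat's DefaultServlet, whose listings init-param defaults to false in current versions, so you may need to enable it in conf/web.xml:
<!-- conf/web.xml: enable directory listings on Tomcat's DefaultServlet -->
<servlet>
    <servlet-name>default</servlet-name>
    <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
    <init-param>
        <param-name>listings</param-name>
        <param-value>true</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
</servlet>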