I am writing a web spider to crawl a bunch of pages from the web. I have achieved part of my goal and now have hundreds of URLs on hand. But those links are not the final links: when you put one of these URLs into a web browser like Google Chrome, it is automatically redirected to another page, which is the page I want. That only works in a web browser, though. When I write code to crawl that URL, the redirection does not happen.
An example.
Given (URL_1):
http://weixin.sogou.com/websearch/art.jsp?sg=CBf80b2xkgZ8cxz1-SgG-dBH_4QL8uVunUQKxf0syVWvynE5nPZm2TPqNuEF6MO2xv0MclVANfsVYUGr5-1b3ls29YYxgU27ra8qaaU15iv7KVkBsZp5Td27Cb2A24cIwEuw__0ZHdPeivmW-kcfnw..&url=p0OVDH8R4SHyUySb8E88hkJm8GF_McJfBfynRTbN8wjVuWMLA31KxFCrZAW0lIGG1EpZGR0F1jdIzWnvINEMaGQ3JxMQ33742MRcPWmNX2CMTFYIzOo-v8LrDlfP2AnF54peD-GxvCNYy-5x5In7jJFmExjqCxhpkyjFvwP6PuGcQ64lGQ2ZDMuqxplQrsbk
Put this link in a browser and it is automatically redirected to (URL_2):
http://mp.weixin.qq.com/s?__biz=MzA4OTIxOTA4Nw==&mid=404672464&idx=1&sn=bdfff50b8e9ac28739cf8f8a51976b03&3rd=MzA3MDU4NTYzMw==&scene=6#rd
which is a different link.
But if I put it in Python code like this:
response=urllib2.urlopen(URL_1)
print response.read()
the automatic redirection doesn't happen!
In a word, my question is: given a URL, how do I get the URL it redirects to?
Somebody gave me some Java code, which works in some other situations but doesn't help in mine:
import java.net.HttpURLConnection;
import java.net.URL;
public class Main {
public void test()throws Exception {
String expectedURL ="http://www.zhihu.com/question/20583607/answer/16597802";
String url = "http://www.baidu.com/link?url=ByBJLpHsj5nXx6DESXbmMjIrU5W4Eh0yg5wCQpe3kCQMlJK_RJBmdEYGm0DDTCoTDGaz7rH80gxjvtvoqJuYxK";
String redirtURL = getRedirectURL(url);
if (redirtURL.equals(expectedURL)) {
System.out.println("Equal");
}else{
System.out.println(url);
System.out.println(redirtURL);
}
}
public String getRedirectURL(String path) throws Exception {
HttpURLConnection conn = (HttpURLConnection) new URL(path).openConnection();
conn.setInstanceFollowRedirects(false);
conn.setConnectTimeout(5000);
return conn.getHeaderField("Location");
}
public static void main(String[] args) throws Exception{
Main obj = new Main();
obj.test();
}
}
It prints Equal in this case, which means we can get expectedURL from url. But this does not work in the former case. (I don't know why, but comparing URL_1 above with the url in the Java code, I notice an interesting difference: the url in the Java code contains the snippet .../link?url=..., which probably indicates a redirection, whereas URL_1 contains .../art.jsp?sg=... instead.)
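Look for the option that controls whether redirects are followed. In Python you can do it with, e.g., requests, where the parameter is called allow_redirects (it is already enabled by default for GET requests):
import requests
response = requests.get('http://example.com', allow_redirects=True)
# final URL after all HTTP redirects have been followed
print(response.url)
# history contains the list of intermediate redirect responses
print(response.history)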
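The same idea can be sketched in Java, the language of the snippet in the question, by following each HTTP redirect by hand. This is only a sketch under assumptions: RedirectTracer, resolveLocation and finalUrl are hypothetical names, and the code only sees HTTP 3xx redirects. If a page redirects through JavaScript or a meta refresh tag (which may or may not be what URL_1 does), no HTTP client option will follow it and a real browser engine is needed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

public class RedirectTracer {

    // Resolve a (possibly relative) Location header against the URL that returned it.
    static String resolveLocation(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    // Follow HTTP 3xx redirects by hand, at most maxHops times, and return the last URL reached.
    static String finalUrl(String startUrl, int maxHops) throws IOException {
        String current = startUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false); // inspect every hop ourselves
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (status < 300 || status >= 400 || location == null) {
                return current; // not a redirect, so this is the final URL
            }
            current = resolveLocation(current, location);
        }
        return current; // gave up after maxHops redirects
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RedirectTracer <url>
        System.out.println(finalUrl(args[0], 10));
    }
}
```

Resolving the Location header against the current URL matters because servers are allowed to send relative redirects.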
I have code that uses Tor to get a new IP address each time and then opens a blog page, but the blog's view counter still does not increase. Why?
import java.io.InputStream;
import java.net.*;
public class test {
public static void main (String args [])throws Exception {
System.out.println (test.getData("http://checkip.amazonaws.com"));
System.out.println (test.getData("***BLOG URL***"));
}
public static String getData(String ur) throws Exception {
String TOR_IP="127.0.0.1", TOR_PORT="9050";
System.setProperty("java.net.preferIPv4Stack" , "true");
System.setProperty("socksProxyHost", TOR_IP);
System.setProperty("socksProxyPort", TOR_PORT);
URL url = new URL(ur);
String s = "";
URLConnection c = url.openConnection();
c.connect();
InputStream i = c.getInputStream();
int j ;
while ((j = i.read()) != -1) {
s+=(char)j;
}
return s;
}
}
I just wrote this to understand what checks such a small automated script would have to pass.
This is an evolving field; blog sites try to detect and thwart cheating. WordPress in particular excludes (https://en.support.wordpress.com/stats/):
visits from browsers that do not execute javascript or load images
In other words just hitting the page doesn't count. You need to fetch all the resources and possibly execute the JavaScript as well.
Using a Socket, I can send an HTTP request to a server and get the HTML response. My objective is to get every image, whether it is PNG, JPEG, GIF, or any other image type.
However, looking at the responses from different websites, I noticed that some images are not referenced through HTML's <img> tag and may instead be referenced from CSS.
How can I extract both <img> images and CSS images (e.g. background-image)?
Is it a good idea to use regex to get the image URLs from <img>?
Please do not refer me to http classes like Apache HttpClient.
My problem is not on http protocol.
To get all images, including images loaded by css and perhaps js, you need more than the html code.
You need code that understands html and css and js.
You need a full browser.
Fortunately, Java comes with a browser: the JavaFX WebEngine (bundled with Oracle JDK 8; since JDK 11 JavaFX is a separate OpenJFX dependency).
Give it a URL or HTML and it will load everything.
Being WebKit-based, it knows recent image loading techniques, for example CSS border-image.
We just need a way to get at its images.
It does not provide a media list, but since its requests go through Java's URL machinery, we can hijack Java's URL handler to intercept them:
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import javafx.application.Application;
import javafx.application.Platform;
import javafx.concurrent.Worker;
import javafx.scene.Scene;
import javafx.scene.web.WebView;
import javafx.stage.Stage;
public class NetworkMonitor extends Application {
private final String url = "http://www.google.com/";
public static void main( String[] args ) {
// Override default http/https handler. Must do once only.
URL.setURLStreamHandlerFactory( protocol ->
protocol.equals( "http" ) ? new HttpHandler() :
protocol.equals( "https" ) ? new HttpsHandler() : null );
// Launch as JavaFX app. Required for WebView / WebEngine.
launch( args );
}
@Override public void start(Stage primaryStage) throws Exception {
// Create webview and listen for ondone
WebView v = new WebView();
v.getEngine().getLoadWorker().stateProperty().addListener( ( prop, old, now ) -> {
if ( now == Worker.State.SUCCEEDED || now == Worker.State.FAILED )
Platform.exit(); } );
// Showing GUI is easiest way to make sure ondone will be fired.
primaryStage.setScene( new Scene( v ) );
primaryStage.show();
// Load the target url.
v.getEngine().load( url );
}
// Your IDE should warn you about the sun package.
private static class HttpHandler extends sun.net.www.protocol.http.Handler {
@Override protected URLConnection openConnection(URL url) throws IOException {
System.out.println( url ); // Capture url!
return super.openConnection( url );
}
}
// If there is no warning, you need to switch to a better IDE!
private static class HttpsHandler extends sun.net.www.protocol.https.Handler {
@Override protected URLConnection openConnection(URL url) throws IOException {
System.out.println( url ); // Capture url!
return super.openConnection( url );
}
}
}
Since you only asked how to get the URLs, that is all the code does.
The code can be expanded depending on your needs.
For example, two decorator objects for the URLConnection should allow you to intercept getInputStream call and query its header (to determine mime type) and fork the stream (to save a copy of the image).
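The "fork the stream" half of that idea could be sketched as a small decorator around InputStream. This is a minimal sketch, not the full URLConnection decorator: TeeInputStream is a hypothetical name, and the class simply copies every byte the caller reads into a second OutputStream (for example a file holding the image).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class TeeInputStream extends FilterInputStream {
    private final OutputStream copy;

    public TeeInputStream(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            copy.write(b); // mirror the single byte that was read
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n); // mirror exactly the bytes that were read
        }
        return n;
    }

    @Override
    public boolean markSupported() {
        return false; // rewinding would duplicate bytes in the copy
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream saved = new ByteArrayOutputStream();
        try (InputStream in = new TeeInputStream(
                new ByteArrayInputStream(new byte[] {1, 2, 3}), saved)) {
            while (in.read() != -1) {
                // the consumer reads as usual; the copy fills up on the side
            }
        }
        System.out.println(saved.size() + " bytes copied");
    }
}
```

Bytes that the consumer skips with skip() are not copied, so this only captures a complete image when the stream is read to the end.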
If this answer is useful, don't forget to vote up!
As other answers have already mentioned, ideally you would use a tool that understands how to parse, render and recurse HTTP resources (i.e. .html/css/js/png/gif/jpg/etc).
That being said, if you were feeling particularly masochistic (and I suspect you are), you could do this yourself...
It's not a perfect solution, but if I was going to attack this with a blunt instrument, I'd use regular expressions (I won't go into the specifics of regex, it's already widely documented on the interwebs). My process would be:
HTTP GET my base page.
Strip out all strings that match your definition of a "resource" (using regex).
Optionally recurse those resources, for more strings.
You've already mentioned that you can perform HTTP request/responses (using Sockets), so I won't cover that here.
Voila!
/**
* Regular expression to match file types - .js/.css/.png/.jpg/.gif
*/
public static final Pattern resources = Pattern.compile("([^\"'\n({}]+\\.(js|css|png|jpg|gif))",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
/**
* Pulls out "resources" from the provided text.
*/
public static Set<String> findResources(URL url, String text) {
Matcher matcher = resources.matcher(text);
Set<String> resources = new HashSet<>();
while (matcher.find()) {
String resource = matcher.group(1);
String urlStr = url.toString();
int endIndex = urlStr.lastIndexOf("/") + 1;
String parentPath = endIndex > 0 ? urlStr.substring(0, endIndex) : urlStr;
String fqResource = resource.startsWith("//") ? url.getProtocol() + ":" + resource :
resource.startsWith("http") ? resource
: resource.startsWith("/") ? getBaseUrl(url) + resource : parentPath + resource;
if (fqResource.contains("?")) {
fqResource = fqResource.substring(0, fqResource.indexOf("?"));
}
resources.add(fqResource);
}
return resources;
}
The regular expression: looks for well formed strings ending in css/js/png/gif/jpg
The method: retrieves all matching strings from the given text (aka 'http response'), tries to build a fully qualified URL, and returns a Set of the data.
I've uploaded a full example here (with sample output). Have fun!
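Since the question also asks about CSS images, the same blunt-instrument approach can be pointed at url(...) references in stylesheet text. This is a rough sketch under assumptions: CssUrlExtractor and extractUrls are hypothetical names, the pattern ignores CSS escapes and comments, and url(...) also matches fonts and other non-image resources, so the results still need filtering and resolving against the stylesheet's own URL.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUrlExtractor {

    // Matches url(...) with optional matching single or double quotes around the target.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns the url(...) targets in order of first appearance, skipping inline data: URIs.
    static Set<String> extractUrls(String css) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            String target = m.group(2).trim();
            if (!target.isEmpty() && !target.startsWith("data:")) {
                urls.add(target);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "body { background-image: url(\"img/bg.png\"); }"
                + " .logo { background: url( 'logo.gif' ) no-repeat; }";
        for (String u : extractUrls(css)) {
            System.out.println(u);
        }
    }
}
```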
You can use jsoup, an HTML and XML parser.
Here is an example of how to do it:
String responseData = ""; // HTML data
Document doc = Jsoup.parse(responseData);
Elements images = doc.select("img");
// Elements pngImages = doc.select("img[src$=.png]");
// To parse specific image format in this case png
for(Element image : images){
// Do whatever you want with each image, e.g. read image.attr("src")
}
Here is related official documentation.
I want to write code to log in to websites with Java.
Here is the code:
package login;
import java.net.*;
import java.io.*;
public class ConnectToURL {
// Variables to hold the URL object and its connection to that URL.
private static URL URLObj;
private static URLConnection connect;
public static void main(String[] args) {
try {
CookieManager cManager = new CookieManager();
CookieHandler.setDefault(cManager);
// Establish a URL and open a connection to it. Set it to output mode.
URLObj = new URL("https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/#identifier");
connect = URLObj.openConnection();
connect.setDoOutput(true);
}
catch (MalformedURLException ex) {
System.out.println("The URL specified was unable to be parsed or uses an invalid protocol. Please try again.");
System.exit(1);
}
catch (Exception ex) {
System.out.println("An exception occurred. " + ex.getMessage());
System.exit(1);
}
try {
// Create a buffered writer to the URLConnection's output stream and write our forms parameters.
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(connect.getOutputStream()));
writer.write("Email=myemail@gmail.Com&Passwd=123456&submit=Login");
writer.close();
// Now establish a buffered reader to read the URLConnection's input stream.
BufferedReader reader = new BufferedReader(new InputStreamReader(connect.getInputStream()));
String lineRead = "";
// Read all available lines of data from the URL and print them to screen.
while ((lineRead = reader.readLine()) != null) {
System.out.println(lineRead);
}
reader.close();
}
catch (Exception ex) {
System.out.println("There was an error reading or writing to the URL: " + ex.getMessage());
}
}
}
I have tried this code on Facebook and Gmail, but it didn't work.
It keeps telling me that cookies are not enabled. (I used the Chrome browser and they were enabled there.)
Is there any other way to achieve this?
If your goal is just to log in to some web site, a much better solution is to use Selenium WebDriver.
It has an API for creating driver instances for modern browsers and for operating on their web elements.
Code example:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
public class Example {
public static void main(String[] args) {
// Create a new instance of the html unit driver
// Notice that the remainder of the code relies on the interface,
// not the implementation.
WebDriver driver = new HtmlUnitDriver();
// And now use this to visit Google
driver.get("http://www.google.com");
// Find the text input element by its name
WebElement element = driver.findElement(By.name("q"));
// Enter something to search for
element.sendKeys("Cheese!");
// Now submit the form. WebDriver will find the form for us from the element
element.submit();
// Check the title of the page
System.out.println("Page title is: " + driver.getTitle());
driver.quit();
}
}
It also has a way to manage cookies: Cookies
Just look at the documentation on how to configure driver instances and manage web elements. The preferred way is to use the Page Object pattern.
Update:
Elements on a web page that don't have id or name attributes can be located using XPath expressions. Firefox extensions like these are very useful for that:
FirePath
XpathChecker.
Also prefer short, concise XPath functions.
For example:
<table>
<tr>
<td>
<p>some text here 1</p>
</td>
</tr>
<tr>
<td>
<p>some text here 2</p>
</td>
</tr>
<tr>
<td>
<p>some text here 3</p>
</td>
</tr>
</table>
To get the text "some text here 2" you can use the following XPath:
//tr[2]/td/p
If you know that the text is static, you can use contains():
//p[contains(text(), 'some text here 2')]
To check whether your XPath is unique on the page, the best option is the browser console.
How to do that is described here: How to verify an XPath expression
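The two expressions above can also be checked outside the browser with the XPath API that ships with the JDK. This is a small sketch under assumptions: XPathCheck and evaluate are hypothetical names, and it only works because the sample table is well-formed XML. Real-world HTML usually is not, which is why the browser console or WebDriver remains the practical choice.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathCheck {

    // The sample table from the answer above, as one well-formed XML string.
    static final String TABLE =
            "<table>"
            + "<tr><td><p>some text here 1</p></td></tr>"
            + "<tr><td><p>some text here 2</p></td></tr>"
            + "<tr><td><p>some text here 3</p></td></tr>"
            + "</table>";

    // Evaluates the expression and returns the string value of the result.
    static String evaluate(String expression, String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Text of the paragraph in the second row.
        System.out.println(evaluate("//tr[2]/td/p", TABLE));
        // Number of paragraphs matched by the contains() expression; 1 means it is unique.
        System.out.println(evaluate("count(//p[contains(text(), 'some text here 2')])", TABLE));
    }
}
```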
What exactly are you trying to do with this? You are almost certainly better off using something like Selenium WebDriver for browser automation tasks, as you piggyback on the work of an existing web browser to handle things like cookies.
In this case, you're talking about your web browser having cookies enabled, but you're not actually using a web browser: you're opening the connection from your Java application, which is why the site reports that cookies are not enabled.
I'm trying to write unit tests for my program and use mock data. I'm a little confused on how to intercept an HTTP Get request to a URL.
My program calls a URL on our API and gets back a simple XML file. Instead of fetching the XML file from the online API, I would like the test to receive a predetermined XML file from me, so that I can compare the output to the expected output and determine whether everything is working correctly.
I was pointed to Mockito and have been seeing many different examples such as this SO post, How to use mockito for testing a REST service? but it's not becoming clear to me how to set it all up and how to mock the data (i.e., return my own xml file whenever the call to the URL is made).
The only thing I can think of is writing another program that runs locally on Tomcat, and in my test passing a special URL that calls that locally running program, which then returns the XML file I want to test with. But that just seems like overkill and I don't think it would be acceptable. Could someone please point me in the right direction?
private static InputStream getContent(String uri) {
HttpURLConnection connection = null;
try {
URL url = new URL(uri);
connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "application/xml");
return connection.getInputStream();
} catch (MalformedURLException e) {
LOGGER.error("internal error", e);
} catch (IOException e) {
LOGGER.error("internal error", e);
} finally {
if (connection != null) {
connection.disconnect();
}
}
return null;
}
I am using Spring Boot and other parts of the Spring Framework if that helps.
Part of the problem is that you're not breaking things down into interfaces. You need to wrap getContent into an interface and provide a concrete class implementing the interface. This concrete class will then need to be passed into any class that uses the original getContent. (This is essentially dependency inversion.) Your code will end up looking something like this:
public interface IUrlStreamSource {
InputStream getContent(String uri);
}
public class SimpleUrlStreamSource implements IUrlStreamSource {
protected final Logger LOGGER;
public SimpleUrlStreamSource(Logger LOGGER) {
this.LOGGER = LOGGER;
}
// pulled out to allow test classes to provide
// a version that returns mock objects
protected URL stringToUrl(String uri) throws MalformedURLException {
return new URL(uri);
}
public InputStream getContent(String uri) {
HttpURLConnection connection = null;
try {
URL url = stringToUrl(uri);
connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "application/xml");
return connection.getInputStream();
} catch (MalformedURLException e) {
LOGGER.error("internal error", e);
} catch (IOException e) {
LOGGER.error("internal error", e);
} finally {
if (connection != null) {
connection.disconnect();
}
}
return null;
}
}
Now code that was using the static getContent should go through an IUrlStreamSource instance's getContent(). You then give the object that you want to test a mocked IUrlStreamSource rather than a SimpleUrlStreamSource.
If you want to test SimpleUrlStreamSource (but there's not much to test), then you can create a derived class that provides an implementation of stringToUrl that returns a mock (or throws an exception).
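To make the substitution concrete without depending on a mocking library, here is a minimal sketch with a hand-written fake. The interface has the same shape as the one above; FakeSourceDemo, FakeUrlStreamSource, ApiClient and fetchXml are hypothetical names standing in for your own test double and your own class under test, and the URL in main is a placeholder that is never contacted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class FakeSourceDemo {

    // Same shape as the IUrlStreamSource interface described above.
    interface IUrlStreamSource {
        InputStream getContent(String uri);
    }

    // Hand-written fake: ignores the URI and always serves the canned XML.
    static class FakeUrlStreamSource implements IUrlStreamSource {
        private final byte[] canned;

        FakeUrlStreamSource(String xml) {
            this.canned = xml.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public InputStream getContent(String uri) {
            return new ByteArrayInputStream(canned);
        }
    }

    // Hypothetical class under test; it only knows about the interface.
    static class ApiClient {
        private final IUrlStreamSource source;

        ApiClient(IUrlStreamSource source) {
            this.source = source;
        }

        String fetchXml(String uri) throws IOException {
            try (InputStream in = source.getContent(uri)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ApiClient client = new ApiClient(new FakeUrlStreamSource("<status>ok</status>"));
        System.out.println(client.fetchXml("http://api.example.invalid/status"));
    }
}
```

A Mockito mock of the interface would play the same role as FakeUrlStreamSource here; the fake just keeps the example free of external dependencies.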
The other answers in here advise you to refactor your code to using a sort of provider which you can replace during your tests - which is the better approach.
If that isn't a possibility for whatever reason you can install a custom URLStreamHandlerFactory that intercepts the URLs you want to "mock" and falls back to the standard implementation for URLs that shouldn't be intercepted.
Note that this is irreversible, so you can't remove the InterceptingUrlStreamHandlerFactory once it's installed - the only way to get rid of it is to restart the JVM. You could implement a flag in it to disable it and return null for all lookups - which would produce the same results.
URLInterceptionDemo.java:
public class URLInterceptionDemo {
private static final String INTERCEPT_HOST = "dummy-host.com";
public static void main(String[] args) throws IOException {
// Install our own stream handler factory
URL.setURLStreamHandlerFactory(new InterceptingUrlStreamHandlerFactory());
// Fetch an intercepted URL
printUrlContents(new URL("http://dummy-host.com/message.txt"));
// Fetch another URL that shouldn't be intercepted
printUrlContents(new URL("http://httpbin.org/user-agent"));
}
private static void printUrlContents(URL url) throws IOException {
try(InputStream stream = url.openStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(stream))) {
String line;
while((line = reader.readLine()) != null) {
System.out.println(line);
}
}
}
private static class InterceptingUrlStreamHandlerFactory implements URLStreamHandlerFactory {
@Override
public URLStreamHandler createURLStreamHandler(final String protocol) {
if("http".equalsIgnoreCase(protocol)) {
// Intercept HTTP requests
return new InterceptingHttpUrlStreamHandler();
}
return null;
}
}
private static class InterceptingHttpUrlStreamHandler extends URLStreamHandler {
@Override
protected URLConnection openConnection(final URL u) throws IOException {
if(INTERCEPT_HOST.equals(u.getHost())) {
// This URL should be intercepted, return the file from the classpath
return URLInterceptionDemo.class.getResource(u.getHost() + "/" + u.getPath()).openConnection();
}
// Fall back to the default handler, by passing the default handler here we won't end up
// in the factory again - which would trigger infinite recursion
return new URL(null, u.toString(), new sun.net.www.protocol.http.Handler()).openConnection();
}
}
}
dummy-host.com/message.txt:
Hello World!
When run, this app will output:
Hello World!
{
"user-agent": "Java/1.8.0_45"
}
It's pretty easy to change the criteria of how you decide which URLs to intercept and what you return instead.
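As one illustration of changing what is returned, the handler can serve in-memory strings instead of classpath files. The sketch below is an assumption-laden variant, not the code above: CannedUrlHandler, with and read are hypothetical names, and the handler is passed directly to the URL constructor, so nothing is installed globally and no JVM restart is needed to get rid of it. The trade-off is that only URL objects created this way are intercepted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CannedUrlHandler extends URLStreamHandler {

    // Maps the full URL string to the body that should be served for it.
    private final Map<String, String> bodies = new HashMap<>();

    public CannedUrlHandler with(String url, String body) {
        bodies.put(url, body);
        return this;
    }

    @Override
    protected URLConnection openConnection(URL u) throws IOException {
        final String body = bodies.get(u.toString());
        if (body == null) {
            throw new IOException("No canned response for " + u);
        }
        return new URLConnection(u) {
            @Override
            public void connect() {
                connected = true; // nothing to open, the data is already in memory
            }

            @Override
            public InputStream getInputStream() {
                return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
            }
        };
    }

    // Reads the whole response body of a URL as UTF-8 text.
    static String read(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String address = "http://dummy-host.com/message.txt";
        CannedUrlHandler handler = new CannedUrlHandler().with(address, "Hello World!");
        // The three-argument constructor binds this handler to this URL only.
        System.out.println(read(new URL(null, address, handler)));
    }
}
```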
The answer depends on what you are testing.
If you need to test the processing of the InputStream
If getContent() is called by some code that processes the data returned by the InputStream, and you want to test how the processing code handles specific sets of input, then you need to create a seam to enable testing. I would simply move getContent() into a new class, and inject that class into the class that does the processing:
public interface ContentSource {
InputStream getContent(String uri);
}
You could create an HttpContentSource that uses URL.openConnection() (or, better yet, the Apache HttpClient code).
Then you would inject the ContentSource into the processor:
public class Processor {
private final ContentSource contentSource;
@Inject
public Processor(ContentSource contentSource) {
this.contentSource = contentSource;
}
...
}
The code in Processor could be tested with a mock ContentSource.
If you need to test the fetching of the content
If you want to make sure that getContent() works, you could create a test that starts a lightweight in-memory HTTP server that serves the expected content, and have getContent() talk to that server. That does seem overkill.
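For a sense of how much code such a server takes, here is a sketch that uses the com.sun.net.httpserver.HttpServer class shipped with the JDK. LocalXmlServer, start and fetch are hypothetical names, the path and XML in main are made-up sample data, and fetch merely stands in for the real getContent().

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LocalXmlServer {

    // Starts a server on a free local port that answers `path` with the given (non-empty) XML.
    static HttpServer start(String path, String xml) throws IOException {
        final byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext(path, exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "application/xml");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Stand-in for the code under test: fetches a URL and returns the body as text.
    static String fetch(String url) throws IOException {
        try (InputStream in = new URL(url).openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start("/api/data.xml", "<data>42</data>");
        try {
            int port = server.getAddress().getPort();
            System.out.println(fetch("http://127.0.0.1:" + port + "/api/data.xml"));
        } finally {
            server.stop(0); // always release the port, even if the fetch fails
        }
    }
}
```

Binding to port 0 lets the operating system pick a free port, which avoids clashes when tests run in parallel.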
If you need to test a large subset of the system with fake data
If you want to make sure things work end to end, write an end-to-end system test. Since you indicated you use Spring, you can use Spring to wire together parts of the system (or to wire the entire system, but with different properties). You have two choices:
Have the system test start a local HTTP server, and when you have your test create your system, configure it to talk to that server. See the answers to this question for ways to start the HTTP server.
Configure spring to use a fake implementation of ContentSource. This gets you slightly less confidence that everything works end-to-end, but it will be faster and less flaky.
I'm working on upgrading our existing Wicket webapp to 1.5 and have hit a snag in our renderPage function that we use to render our HTML emails.
Previously we used the code referenced/listed in this StackOverflow question and this (currently broken but maybe fixed later) link but that code no longer works as a lot of those classes don't exist in 1.5.
I also found this email thread but it is light on the details and I don't know how to create the WebPage from my pageClass and parameters.
http://apache-wicket.1842946.n4.nabble.com/Render-WebPage-to-String-in-Wicket-1-5-td3622130.html
Here is my code:
// Renders a page under a temporary request cycle in order to get the rendered markup
public static String renderPage(Class<? extends Page> pageClass, PageParameters pageParameters)
{
//get the servlet context
WebApplication application = (WebApplication) WebApplication.get();
ServletContext context = application.getServletContext();
//fake a request/response cycle
MockHttpSession servletSession = new MockHttpSession(context);
servletSession.setTemporary(true);
MockHttpServletRequest servletRequest = new MockHttpServletRequest(application, servletSession, context);
MockHttpServletResponse servletResponse = new MockHttpServletResponse(servletRequest);
//initialize request and response
servletRequest.initialize();
servletResponse.initialize();
WebRequest webRequest = new ServletWebRequest(servletRequest);
BufferedWebResponse webResponse = new BufferedWebResponse(servletResponse);
webResponse.setAjax(true);
WebRequestCycle requestCycle = new WebRequestCycle(application, webRequest, webResponse);
requestCycle.setRequestTarget(new BookmarkablePageRequestTarget(pageClass, pageParameters));
try
{
requestCycle.getProcessor().respond(requestCycle);
if (requestCycle.wasHandled() == false)
{
requestCycle.setRequestTarget(new WebErrorCodeResponseTarget(HttpServletResponse.SC_NOT_FOUND));
}
}
finally
{
requestCycle.detach();
requestCycle.getResponse().close();
}
return webResponse.toString();
}
Specifically, the code breaks because the WebRequestCycle and BookmarkablePageRequestTarget classes no longer exist. I feel like I should be able to use the StringResponse class somehow, but I'm missing the link that would help me trigger a render on that response.
Any help would be appreciated. Thanks.
My Final Solution
Using the example that I was directed to by the answer below I ended up with the following code. I'm pasting it here as well so that if that link disappears or is changed with a future version of Wicket then people from the future will still be able to get the answer they need.
I ended up passing in a PageProvider because in some cases I needed to pass in an instantiated Page and in others a pageClass + parameters.
public static String renderPage(final PageProvider pageProvider)
{
final RenderPageRequestHandler handler = new RenderPageRequestHandler(pageProvider, RedirectPolicy.NEVER_REDIRECT);
final PageRenderer pageRenderer = Application.get().getPageRendererProvider().get(handler);
RequestCycle requestCycle = RequestCycle.get();
final Response oldResponse = requestCycle.getResponse();
BufferedWebResponse tempResponse = new BufferedWebResponse(null);
try
{
requestCycle.setResponse(tempResponse);
pageRenderer.respond(requestCycle);
}
finally
{
requestCycle.setResponse(oldResponse);
}
return tempResponse.getText().toString();
}
Check the source code of http://www.wicket-library.com/wicket-examples/mailtemplate/ example.