I am trying to scrape this web page. I need to find all names under "Rank name"
Website: https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1
But I am having a major issue. Only the first match is found (Lynx Titan). log(m.group(1).split("\"")[0]); matches with Lynx Titan. But if I replace [0] to 1 or 2, it does not work. It should display the next name. Any help is appreciated
public void getRsn() throws IOException {
URL url = new URL("https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1");
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0");
conn.connect();
BufferedReader serverResponse = new BufferedReader(
new InputStreamReader(conn.getInputStream()));
String s = serverResponse.lines().collect(Collectors.joining());
Matcher m = Pattern.compile("user1=(.*)\">").matcher(s);
while(m.find()) {
log(m.group(1).split("\"")[0]);
}
serverResponse.close();
}
Your problem is your regex: you're only matching one user. Also, each iteration of the while loop will find anotherr user, not each element of the split (which you actually don't need).
Use a regex that matches every user, and only the user's name:
Matcher m = Pattern.compile("(?<=user1=)[^\"]+").matcher(s);
while(m.find()) {
System.out.println(m.group()); // the entire match is the user name
}
Try this way,
URL url = new URL("https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1");
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0");
conn.connect();
BufferedReader serverResponse = new BufferedReader(
new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
String s = serverResponse.lines().collect(Collectors.joining());
serverResponse.close();
Matcher m = Pattern.compile("(user1=)([a-zA-Z0-9\\s\\�]+)(\">)").matcher(s);
while(m.find()) {
System.out.println((m.group(2).replaceAll("\\�", " ")));
}
My output was,
Lynx Titan
Hey Jase
ShawnBay
senZe
Tomdabom
Karma
Harmony
DedWilson
GodTormentor
Vinny
borsi
Brundeen
Aziz
Eeli
baile y
gaslighter73
Dan Gleesac
blind idiot
he box jonge
Gustav
Randalicious
Oskar
Killzone
moksi
Capt King
Update
The Charsets is actually ISO_8859_1 not UTF_8. So you can make this answer better.
URL url = new URL("https://secure.runescape.com/m=hiscore_oldschool/overall?table=0&page=1");
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0");
conn.connect();
BufferedReader serverResponse = new BufferedReader(
new InputStreamReader(conn.getInputStream(), StandardCharsets.ISO_8859_1));
String s = serverResponse.lines().collect(Collectors.joining());
serverResponse.close();
Matcher m = Pattern.compile("(user1=)([a-zA-Z0-9\\ \\ ]+)(\">)").matcher(s);
while(m.find()) {
System.out.println((m.group(2)));
}
Note: There are two different space characters are there in the document
Related
I got a discord bot that I've made in Java and one of its purposes is to send an embedded link (I don't own the site) everytime someone leaves the server. It worked the first 2-3 times and every time after that I get the following exception:
java.io.IOException: Server returned HTTP response code: 403 for URL: ...
Example link:
https://signature.hzgaming.net/sig.php?name=Juntao_Lubu&style=1
I tried numerous solutions I've found online (with User-Agents and all that fancy stuff), but none of them seem to work for me.
Is there any other workaround this?
Code:
String link = "https://signature.hzgaming.net/sig.php?name=" + allMembers.get(mEvent.getUser().getDiscriminatedName()).replace(" ", "_") + "&style=1";
URLConnection urlCon = new URL(link).openConnection();
urlCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.29 Safari/537.36");
InputStream is = urlCon.getInputStream();
StringBuilder textBuilder = new StringBuilder();
Reader reader = new BufferedReader(new InputStreamReader(is, Charset.forName(StandardCharsets.UTF_8.name())));
int c = 0;
while((c = reader.read()) != -1) {
textBuilder.append((char)c);
}
String result = textBuilder.toString().replaceAll("<[^>]*>", "");
if(!result.equalsIgnoreCase("Non-Existant Player") && !result.equalsIgnoreCase("Non-ExistantPlayer")) {
new MessageBuilder().append(link).send((TextChannel)server.getChannelById(973242211623895080L).get());
}
Thanks in advance.
I've this code in Java 8
String u = "https://tes.com/Prova.asmx?WSDL";
URL url = new URL(u);
System.out.println("Link: " + u);
SSLContext sc = SSLContext.getInstance("TLSv1");
sc.init(null, null, new java.security.SecureRandom());
HttpsURLConnection con = (HttpsURLConnection) url.openConnection();
con.setSSLSocketFactory(sc.getSocketFactory());
System.out.println("After url connection");
//con.setRequestProperty("Accept","*/*");
//con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
//con.connect();
BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream(), Charset.forName("UTF-8")));
String input;
while ((input = br.readLine()) != null) {
System.out.println(input);
}
br.close();
How can i see header that are sent on connection using Microsoft Windows?
Thanks
I am using the code below to read a page source from url (https://www.amazon.com) with "UTF-8" charset in NetBeans, but it returns unknown characters (the attached image). I don't have any idea that what is the problem and would be gratefull if help me to modify the code to work properly? Thanks.
public static String getURLSource(String url) throws IOException
{
URL urlObject = new URL(url);
URLConnection urlConnection = urlObject.openConnection();
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
return toString(urlConnection.getInputStream());
}
private static String toString(InputStream inputStream) throws IOException
{
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
{
String inputLine;
StringBuilder stringBuilder = new StringBuilder();
while ((inputLine = bufferedReader.readLine()) != null)
{
stringBuilder.append(inputLine);
}
return stringBuilder.toString();
}
}
Use HttpsUrlConnection instead of UrlConnection. See a similar question.
You just need to unzip your content. Here is the code that worked for me
HttpClient httpClient = new HttpClient();
try {
httpClient.setConnectionUrl("https://www.amazon.com");
ByteBuffer buff = httpClient.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11")
.sendHttpRequestForBinaryResponse(HttpClient.HttpMethod.GET);
try (
ByteArrayInputStream bais = new ByteArrayInputStream(buff.array());
GZIPInputStream gzis = new GZIPInputStream(bais);
InputStreamReader isr = new InputStreamReader(gzis);
BufferedReader br = new BufferedReader(isr)
) {
br.lines().forEach(line -> System.out.println(line));
}
} catch (Exception e) {
System.out.println(httpClient.getLastResponseCode() + " "
+ httpClient.getLastResponseMessage() + TextUtils.getStacktrace(e, false));
}
Just few clarifications: In this example I use a 3d party Http client class HttpClient (And also class TextUtils). They both come from Open source MgntUtils library writen and maintained by me. But you don't have to use it. The main part is - read the info from the InputStream as binary info (as byte array or ByteBuffer) and than and unzip it with GZIPInputStream like in my example.
If you do want to use MgntUtils library you can get it As maven artifact or from Github (including source code and Javadoc). and here is Javadoc online
So am trying to login this website using java but for some reason its not working as expected i got the host and all that stuff but its not going to the account page with the cookie it still shows the login page and yes my account info is correct any help is great
public static void main(String[] args) {
try {
String params = "loginEmail=private#hotmail.com&loginPassword=privatepassword&Submit=Sign+In";
String urls = "http://www.filefactory.com/member/signin.php";
URL url = new URL(urls);
HttpURLConnection connection = (HttpURLConnection)url.openConnection();
connection.setRequestMethod("POST");
connection.setRequestProperty("Content-Type",
"application/x-www-form-urlencoded");
connection.setRequestProperty("Host", "www.filefactory.com");
connection.setRequestProperty("Referer", "http://www.filefactory.com/member/signin.php");
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36 OPR/25.0.1614.50");
connection.setRequestProperty("Content-Length", "" +
Integer.toString(params.getBytes().length));
connection.setRequestProperty("Content-Language", "en-US");
connection.setDoInput(true);
connection.setDoOutput(true);
//Send request
DataOutputStream wr = new DataOutputStream (
connection.getOutputStream ());
wr.writeBytes (params);
wr.flush ();
wr.close ();
//Get Response
InputStream is = connection.getInputStream();
BufferedReader rd = new BufferedReader(new InputStreamReader(is));
String line;
StringBuilder response = new StringBuilder();
while((line = rd.readLine()) != null) {
response.append(line);
response.append('\r');
}
rd.close();
System.out.println(response.toString());
// get the cookie if need, for login
String cookies = connection.getHeaderField("Set-Cookie");
// open the new connnection again
connection = (HttpURLConnection) new URL("http://www.filefactory.com/account/").openConnection();
connection.setRequestProperty("Cookie", cookies);
connection.addRequestProperty("Accept-Language", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
connection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36 OPR/25.0.1614.50");
connection.addRequestProperty("Host", "www.filefactory.com");
System.out.println("Redirect to URL : " + "http://www.filefactory.com/account/");
BufferedReader in = new BufferedReader(
new InputStreamReader(connection.getInputStream()));
String inputLine;
StringBuilder html = new StringBuilder();
while ((inputLine = in.readLine()) != null) {
html.append(inputLine);
}
in.close();
System.out.println("URL Content... \n" + html.toString());
System.out.println("Done");
} catch (Exception e) {
e.printStackTrace();
}
}
}
You are using : String cookies = connection.getHeaderField("Set-Cookie");
Are you sure there is only one entry for that header? There could be more.
http://en.wikipedia.org/wiki/HTTP_cookie
Try using chrome or firefox and try logging in manually to capture the request and response. That may give you some hints regarding what could be wrong.
Additionally you could use a tool to view the communication between your client and the server (unless you are already doing so)
It's hard to tell, not knowing the exact way that website works, but you should note that it sends you the cookies first when it presents the login page to you. When you send in your credentials you have to already send them together with the cookies, so that it knows to associate those credentials with this cookie.
String urlString = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf";
URL url = new URL(urlString);
if(/* Url does not return 404 */) {
System.out.println("exists");
} else {
System.out.println("does not exists");
}
urlString = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_190.pdf";
url = new URL(urlString);
if(/* Url does not return 404 */) {
System.out.println("exists");
} else {
System.out.println("does not exists");
}
This should print
exists
does not exists
TEST
public static String URL = "http://www.nbc.com/Heroes/novels/downloads/";
public static int getResponseCode(String urlString) throws MalformedURLException, IOException {
URL u = new URL(urlString);
HttpURLConnection huc = (HttpURLConnection) u.openConnection();
huc.setRequestMethod("GET");
huc.connect();
return huc.getResponseCode();
}
System.out.println(getResponseCode(URL + "Heroes_novel_001.pdf"));
System.out.println(getResponseCode(URL + "Heroes_novel_190.pdf"));
System.out.println(getResponseCode("http://www.example.com"));
System.out.println(getResponseCode("http://www.example.com/junk"));
Output
200
200
200
404
SOLUTION
Add the next line before .connect() and the output would be 200, 404, 200, 404
huc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
You may want to add
HttpURLConnection.setFollowRedirects(false);
// note : or
// huc.setInstanceFollowRedirects(false)
if you don't want to follow redirection (3XX)
Instead of doing a "GET", a "HEAD" is all you need.
huc.setRequestMethod("HEAD");
return (huc.getResponseCode() == HttpURLConnection.HTTP_OK);
this worked for me:
URL u = new URL ( "http://www.example.com/");
HttpURLConnection huc = ( HttpURLConnection ) u.openConnection ();
huc.setRequestMethod ("GET"); //OR huc.setRequestMethod ("HEAD");
huc.connect () ;
int code = huc.getResponseCode() ;
System.out.println(code);
thanks for the suggestions above.
Use HttpUrlConnection by calling openConnection() on your URL object.
getResponseCode() will give you the HTTP response once you've read from the connection.
e.g.
URL u = new URL("http://www.example.com/");
HttpURLConnection huc = (HttpURLConnection)u.openConnection();
huc.setRequestMethod("GET");
huc.connect() ;
OutputStream os = huc.getOutputStream();
int code = huc.getResponseCode();
(not tested)
Based on the given answers and information in the question, this is the code you should use:
public static boolean doesURLExist(URL url) throws IOException
{
// We want to check the current URL
HttpURLConnection.setFollowRedirects(false);
HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
// We don't need to get data
httpURLConnection.setRequestMethod("HEAD");
// Some websites don't like programmatic access so pretend to be a browser
httpURLConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
int responseCode = httpURLConnection.getResponseCode();
// We only accept response code 200
return responseCode == HttpURLConnection.HTTP_OK;
}
Of course tested and working.
There is nothing wrong with your code. It's the NBC.com doing tricks on you. When NBC.com decides that your browser is not capable of displaying PDF, it simply sends back a webpage regardless what you are requesting, even if it doesn't exist.
You need to trick it back by telling it your browser is capable, something like,
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");