How to have Firefox not detected by bot with Selenium [duplicate] - java

Is there a way to make your Selenium script undetectable in Python using geckodriver?
I'm using Selenium for scraping. Are there any protections we need to use so websites can't detect Selenium?

There are different methods to avoid websites detecting the use of Selenium.
When a browser is driven by Selenium, navigator.webdriver is true by default. This property is present in Chrome as well as Firefox. It should read as undefined to avoid detection.
A proxy server can also be used to avoid detection.
Some websites are able to use the state of your browser to determine if you are using Selenium. You can set Selenium to use a custom browser profile to avoid this.
The code below uses all three of these approaches.
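from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# Selenium 3 style API: the firefox_profile and desired_capabilities keyword
# arguments are deprecated in Selenium 4 in favour of Options.
# 1. Custom profile: start from an existing Firefox profile directory.
profile = webdriver.FirefoxProfile('C:\\Users\\You\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\something.default-release')
# 2. Proxy: type 1 selects manual proxy configuration.
PROXY_HOST = "12.12.12.123"
PROXY_PORT = "1234"
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", PROXY_HOST)
profile.set_preference("network.proxy.http_port", int(PROXY_PORT))
# 3. Preferences intended to stop navigator.webdriver from being reported.
profile.set_preference("dom.webdriver.enabled", False)
profile.set_preference('useAutomationExtension', False)
profile.update_preferences()
desired = DesiredCapabilities.FIREFOX
driver = webdriver.Firefox(firefox_profile=profile, desired_capabilities=desired)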
Once the code is run, you will be able to manually check that the browser run by Selenium now has your Firefox history and extensions. You can also type "navigator.webdriver" into the devtools console to check that it is undefined.
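The manual console check can also be scripted. The following is a minimal sketch, assuming a Selenium driver object that exposes execute_script; the helper name webdriver_flag_visible is hypothetical and not part of Selenium.

```python
def webdriver_flag_visible(driver):
    """Return True if the current page still sees navigator.webdriver as true."""
    # execute_script returns the JavaScript value converted to Python;
    # a JavaScript undefined or null comes back as None.
    value = driver.execute_script("return navigator.webdriver;")
    return value is True
```

Calling webdriver_flag_visible(driver) after driver.get(...) gives a quick signal of whether the preferences above had any effect in the Firefox version being used.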

The fact that Selenium-driven Firefox / GeckoDriver gets detected doesn't depend on any specific GeckoDriver or Firefox version. Websites can inspect the network traffic and identify the browser client as WebDriver controlled.
As per the documentation of the WebDriver Interface in the latest editor's draft of WebDriver - W3C Living Document, the webdriver-active flag, which is initially set to false, is set to true when the user agent is under remote control, i.e. when controlled through Selenium.
Note that the NavigatorAutomationInformation interface should not be exposed on WorkerNavigator.
So,
webdriver
Returns true if webdriver-active flag is set, false otherwise.
whereas,
navigator.webdriver
Defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example so that alternate code paths can be triggered during automation.
So, the bottom line is:
Selenium identifies itself
However some generic approaches to avoid getting detected while web-scraping are as follows:
One of the first attributes a website can use to identify your script or program is your monitor size, so it is recommended not to use the conventional viewport.
If you need to send multiple requests to a website, keep changing the User Agent between requests. Here you can find a detailed discussion on Way to change Google Chrome user agent in Selenium?
To simulate human-like behavior you may need to slow down script execution even beyond WebDriverWait and expected_conditions by inducing time.sleep(secs). Here you can find a detailed discussion on How to sleep webdriver in python for milliseconds
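The time.sleep(secs) advice can be sketched as a small helper that waits a random amount of time between actions rather than a fixed one. This is only an illustration: the name human_pause and the default bounds are assumptions, and nothing here guarantees that a site will not detect automation.

```python
import random
import time


def human_pause(min_seconds=0.5, max_seconds=2.5, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    # A uniform random delay avoids the perfectly regular timing of a script.
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

A call such as human_pause(1, 3) can be placed between two Selenium operations, for example between filling in a field and clicking a button.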

As per the current WebDriver W3C Editor's Draft specification:
The webdriver-active flag is set to true when the user agent is under remote control. It is initially false.
Hence, the readonly boolean attribute webdriver returns true if webdriver-active flag is set, false otherwise.
The specification further clarifies:
navigator.webdriver Defines a standard way for co-operating user
agents to inform the document that it is controlled by WebDriver, for
example so that alternate code paths can be triggered during
automation.
There have been many discussions requesting Feature: option to disable navigator.webdriver == true ?, and @whimboo concluded in his comment that:
that is because the WebDriver spec defines that property on the
Navigator object, which has to be set to true when tests are running
with webdriver enabled:
https://w3c.github.io/webdriver/#interface
Implementations have to be conformant to this requirement. As such we
will not provide a way to circumvent that.
Generic Conclusion
From the above discussions it can be concluded that:
Selenium identifies itself
and there is no way to conceal the fact that the browser is WebDriver driven.
Recommendations
However, some users have suggested approaches that can conceal the fact that the Mozilla Firefox browser is WebDriver controlled through the usage of Firefox Profiles and Proxies, as follows:
selenium4 compatible python code
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
profile_path = r'C:\Users\Admin\AppData\Roaming\Mozilla\Firefox\Profiles\s8543x41.default-release'
options = Options()
# Select the profile directory with Firefox's -profile argument; a preference
# named 'profile' is not a recognised Firefox setting and does not select a profile.
options.add_argument('-profile')
options.add_argument(profile_path)
# Route traffic through a local SOCKS proxy (type 1 = manual proxy configuration).
options.set_preference('network.proxy.type', 1)
options.set_preference('network.proxy.socks', '127.0.0.1')
options.set_preference('network.proxy.socks_port', 9050)
options.set_preference('network.proxy.socks_remote_dns', False)
service = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = Firefox(service=service, options=options)
driver.get("https://www.google.com")
driver.quit()
Other Alternatives
It has been observed that on some specific OS variants a couple of different settings/configurations can bypass bot detection, as follows:
selenium4 compatible code block
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
# These are Chrome-only options, so Chrome's Options and Service classes are used.
# Both switches go in one list; a second "excludeSwitches" call would overwrite the first.
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
# Chrome needs chromedriver rather than geckodriver; adjust this placeholder path.
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
Potential Solution
A potential solution would be to use the tor browser as follows:
selenium4 compatible python code
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import os
# Start the bundled Tor executable so that its SOCKS proxy is listening on port 9050.
torexe = os.popen(r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Tor\tor.exe')
profile_path = r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default'
firefox_options = Options()
# Select the Tor Browser profile with Firefox's -profile argument; a preference
# named 'profile' is not a recognised Firefox setting and does not select a profile.
firefox_options.add_argument('-profile')
firefox_options.add_argument(profile_path)
firefox_options.set_preference('network.proxy.type', 1)
firefox_options.set_preference('network.proxy.socks', '127.0.0.1')
firefox_options.set_preference('network.proxy.socks_port', 9050)
firefox_options.set_preference("network.proxy.socks_remote_dns", False)
# Launch the Firefox binary that ships with Tor Browser.
firefox_options.binary_location = r'C:\Users\username\Desktop\Tor Browser\Browser\firefox.exe'
service = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=service, options=firefox_options)
driver.get("https://www.tiktok.com/")

It may sound simple, but one way websites detect Selenium (or bots) is by tracking movements. If you make your program behave a little more like a human browsing the website, you may get fewer captchas: add cursor or page scroll movements between your operations, and other actions that mimic browsing. So between two operations, try to add some other actions, add some delay, etc. This will make your bot slower, but it may go undetected.
Thanks

Related

How to Conceal WebDriver in Geckodriver from BotD in Java?

I followed this post on Stackoverflow to disable Firefox WebDriver detection.
Launch Geckodriver:
System.setProperty("webdriver.gecko.driver", geckdriverExecutableFilePath);
File firefoxProfileFile = new File(fullPathOfFirefoxInstallationFolder);
FirefoxProfile firefoxProfile = null;
try {
firefoxProfile = new FirefoxProfile(firefoxProfileFile);
} catch (Exception e) {
e.printStackTrace();
}
I disabled WebDriver:
FirefoxOptions firefoxOptions = new FirefoxOptions();
firefoxOptions.setProfile(firefoxProfile);
// Disables WebRTC
firefoxProfile.setPreference("media.peerconnection.enabled", false);
I disabled Automation Extensions:
// Disables Automation Extension
firefoxProfile.setPreference("useAutomationExtension", false);
I added Proxy:
DesiredCapabilities dc = DesiredCapabilities.firefox();
Proxy proxy = new Proxy();
proxy.setHttpProxy(ipAddress + ":" + port);
proxy.setFtpProxy(ipAddress + ":" + port);
proxy.setSslProxy(ipAddress + ":" + port);
dc.setCapability(CapabilityType.PROXY, proxy);
firefoxOptions.merge(dc);
driver = new FirefoxDriver(firefoxOptions);
Yet BotD still detects my browser as being controlled by automation tool.
How can I solve this?
When a Firefox browsing context is initiated by Selenium-driven GeckoDriver:
The webdriver-active flag is set to true when the user agent is under remote control. It is initially false.
where webdriver returns true if the webdriver-active flag is set, false otherwise.
As the specification puts it:
navigator.webdriver Defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for
example so that alternate code paths can be triggered during
automation.
Further, @whimboo confirmed in his comments:
Implementations have to be conformant to this requirement. As such
we will not provide a way to circumvent that.
Conclusion
So, the bottom line is:
Selenium identifies itself
and there is no way to conceal the fact that the browser is WebDriver driven.
Recommendations
However, some users have suggested different approaches that can conceal the fact that the Mozilla Firefox browser is WebDriver controlled through the usage of Firefox Profiles and Proxies, as follows:
selenium4 compatible python code
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
profile_path = r'C:\Users\Admin\AppData\Roaming\Mozilla\Firefox\Profiles\s8543x41.default-release'
options = Options()
# Select the profile directory with Firefox's -profile argument; a preference
# named 'profile' is not a recognised Firefox setting and does not select a profile.
options.add_argument('-profile')
options.add_argument(profile_path)
# Route traffic through a local SOCKS proxy (type 1 = manual proxy configuration).
options.set_preference('network.proxy.type', 1)
options.set_preference('network.proxy.socks', '127.0.0.1')
options.set_preference('network.proxy.socks_port', 9050)
options.set_preference('network.proxy.socks_remote_dns', False)
service = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = Firefox(service=service, options=options)
driver.get("https://www.google.com")
driver.quit()
Potential Solution
A potential solution would be to use the tor browser as follows:
selenium4 compatible python code
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import os
# Start the bundled Tor executable so that its SOCKS proxy is listening on port 9050.
torexe = os.popen(r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Tor\tor.exe')
profile_path = r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default'
firefox_options = Options()
# Select the Tor Browser profile with Firefox's -profile argument; a preference
# named 'profile' is not a recognised Firefox setting and does not select a profile.
firefox_options.add_argument('-profile')
firefox_options.add_argument(profile_path)
firefox_options.set_preference('network.proxy.type', 1)
firefox_options.set_preference('network.proxy.socks', '127.0.0.1')
firefox_options.set_preference('network.proxy.socks_port', 9050)
firefox_options.set_preference("network.proxy.socks_remote_dns", False)
# Launch the Firefox binary that ships with Tor Browser.
firefox_options.binary_location = r'C:\Users\username\Desktop\Tor Browser\Browser\firefox.exe'
service = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=service, options=firefox_options)
driver.get("https://www.tiktok.com/")
References
You can find a couple of relevant detailed discussions in
How to initiate a Tor Browser 9.5 which uses the default Firefox to 68.9.0esr using GeckoDriver and Selenium through Python
How to connect to Tor browser using Python
How to use Tor with Chrome browser through Selenium
BotD detects you because you do not override navigator.webdriver attribute.
I was able to override it with this code:
((JavascriptExecutor)driver).executeScript("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})");
Re-run your code with this line after driver.get("BotD url") and click on
'Start detect' on the BotD page.
It will no longer show that webdriver is detected.
I understand you are looking for a way to make it work before the initial page load.
But here are 2 things to consider:
Webdriver developers want their tool to be detected by browsers.
Gecko driver developers are not going to implement an option to disable navigator.webdriver attribute. (This is the official reply from gecko developer.)

How can we use pre-populate cookies in Selenium Web driver to run the tests faster

How do we pre-populate cookies in Selenium Web driver to make the selenium tests run faster ?
If we need to write test for multiple pages how can we have a single reusable method to pre-populate the cookies and how can we use inside tests ?
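// Requires: java.util.Calendar, java.util.GregorianCalendar, org.openqa.selenium.Cookie
public void addCookie(String name, String value) {
    Cookie cookie = new Cookie.Builder(name, value)
        .domain("somedomain.com.au")
        // new Date(2020, 12, 31) is deprecated and counts years from 1900, so build the date from a calendar.
        .expiresOn(new GregorianCalendar(2020, Calendar.DECEMBER, 31).getTime())
        .isSecure(true)
        .path("/")
        .build();
    driver.manage().addCookie(cookie);
}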
What does pre-populating cookies mean? Setting cookies before Selenium starts or before the test executes? Selenium/WebDriver always starts a new session, so it can't use old cookies, and the cookies seen during execution are not kept once execution is done.
Preferably, use the profile concept: create a browser profile and do the required navigation so that the cache, cookies and temporary data are stored in it, then start the browser with that profile.
Here is one of my old posts on starting Chrome with a profile:
How do you start selenium using Chrome driver and all existing browser cookies?
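As an alternative to the profile suggestion, a single reusable helper can save cookies from one session and replay them into the next. The sketch below uses the Python bindings' get_cookies, get and add_cookie driver methods; the function names, the JSON file and the base_url parameter are assumptions for illustration, and some browsers may reject individual cookie fields when they are replayed.

```python
import json


def save_cookies(driver, path):
    """Write the current session's cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(driver.get_cookies(), fh)


def load_cookies(driver, path, base_url):
    """Open base_url, then add every saved cookie to the session."""
    # WebDriver only accepts cookies for the domain of the current page,
    # so the browser has to be on the target site first.
    driver.get(base_url)
    with open(path, encoding="utf-8") as fh:
        cookies = json.load(fh)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

A test suite could call save_cookies once after a real login and then call load_cookies in the setup of each test, followed by a navigation to the page under test.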

Set capability on already running selenium webdriver

In a Selenium test step (like a button click) I want to prevent Selenium from waiting for the page to finish loading. I can't throw the load exception because then I can't work with the page anymore.
It's possible to do a similar thing like this:
DesiredCapabilities dr = DesiredCapabilities.chrome();
dr.setCapability("pageLoadStrategy", "none");
WebDriver driver = new RemoteWebDriver(new URL("...."), dr);
What I want is like dr.setCapability("pageLoadStrategy", "none"); but just for one specific step.
Does anyone know a way to do this?
Capabilities are no longer editable once the browser is launched.
One way to temporarily disable the waiting is to implement your own get with a script injection.
Something like this:
// Loads the page and, if it is still loading after the timeout (in milliseconds),
// stops the loading without throwing an exception.
public static void load(WebDriver driver, String url, int timeout) {
    ((JavascriptExecutor) driver).executeScript(
        "var url = arguments[0], timeout = arguments[1];" +
        "window.setTimeout(function(){window.location.href = url}, 1);" +
        "var timer = window.setTimeout(window.stop, timeout);" +
        "window.onload = function(){window.clearTimeout(timer)}; ",
        url, timeout);
}
// Usage: stop the load after 2 seconds if the page is still loading.
load(driver, "https://httpbin.org/delay/10", 2000);
As of the current implementation of Selenium, once we configure the WebDriver instance with our intended configuration through the DesiredCapabilities class and initialize the WebDriver session to open a browser, we cannot change the capabilities at runtime.
It is worth mentioning that even if you are somehow able to retrieve the runtime capabilities, you still won't be able to change them.
So, in order to change the pageLoadStrategy you have to initiate a new WebDriver session.
Here is @JimEvans' clear and concise answer (as of Oct 24 '13 at 13:02) related to the proxy settings capability:
When you set a proxy for any given driver, it is set only at the time WebDriver session is created; it cannot be changed at runtime. Even if you get the capabilities of the created session, you won't be able to change it. So the answer is, no, you must start a new session if you want to use different proxy settings.

Selenium RemoteWebDriver - Do something if element could not be found

I am trying to develop a suite of classes for testing my website's functionality every night, and I do this in Chrome, Firefox, Edge and IE. Because Selenium sometimes doesn't find an element, I need something that e.g. takes a screenshot of the browser before reporting an error. I don't need a function for taking a screenshot; I need something that triggers when Selenium can't continue.
Best regards,
MK
If I understand correctly, you need to set up a trigger for another system that can react to a Selenium test error.
In your test code you can use:
try {
// find element and test code
} catch (NoSuchElementException e) {
// set up the trigger code
}
To notify another system you can choose any system that provides a notification mechanism.
In your case you could use, for example, Redis with pub/sub.
Your reaction system would then be the subscriber and the test the publisher of the event.
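The try/catch idea above can be wrapped once and reused for every step. Here is a minimal sketch in Python using only the standard library; the name on_failure and its parameters are hypothetical. With Selenium, the callback could call driver.save_screenshot(...) or publish an event, and exc_types could be limited to NoSuchElementException from selenium.common.exceptions.

```python
from contextlib import contextmanager


@contextmanager
def on_failure(callback, exc_types=(Exception,)):
    """Run callback(exc) when the wrapped block raises, then re-raise."""
    try:
        yield
    except exc_types as exc:
        # The trigger runs first (screenshot, notification, ...),
        # then the original error continues to the test runner.
        callback(exc)
        raise
```

A test step is then written as a with on_failure(take_screenshot): block, where take_screenshot is whatever function should react when Selenium cannot continue.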

PhantomJS slower than ChromeDriver, using Selenium

I'm trying to use PhantomJS 2.0/GhostDriver instead of ChromeDriver, since I have read that it could speed up my UI tests.
This is the test code I'm running, as part of a JUnit test:
@Override
public void runTestCase() throws Exception {
long startTime = System.currentTimeMillis();
// log in as admin
Login.loginAs("admin", "password");
System.out.println(System.currentTimeMillis() - startTime);
}
The loginAs function fills in the text fields for the username and password, then clicks the submit button and finally moves to the home section of the newly returned page.
Now, I'm running this simple test once at a time using both PhantomJS and ChromeDriver as the driver for Selenium in Java (v2.45).
They are initialized as follows:
ChromeDriver
System.setProperty("webdriver.chrome.logfile", workingDirectory + "\\chromedriver.log");
service = new ChromeDriverService.Builder().usingDriverExecutable(new File(workingDirectory + "\\chromedriver.exe")).build();
capabilities = DesiredCapabilities.chrome();
options = new ChromeOptions();
options.addArguments("--allow-file-access-from-files");
options.addArguments("--verbose");
capabilities.setVersion("");
capabilities.setCapability(ChromeOptions.CAPABILITY, options);
driver = new ChromeDriver(service, capabilities);
PhantomJS
System.setProperty("phantomjs.binary.path", workingDirectory + "\\phantomjs.exe");
cliArgsCap = new ArrayList<String>();
capabilities = DesiredCapabilities.phantomjs();
cliArgsCap.add("--web-security=false");
cliArgsCap.add("--ssl-protocol=any");
cliArgsCap.add("--ignore-ssl-errors=true");
cliArgsCap.add("--webdriver-loglevel=INFO");
cliArgsCap.add("--load-images=false");
capabilities.setCapability(CapabilityType.SUPPORTS_FINDING_BY_CSS, true);
capabilities.setCapability(CapabilityType.TAKES_SCREENSHOT, true);
capabilities.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, cliArgsCap);
driver = new PhantomJSDriver(capabilities);
I'm running my test on a 64-bit Windows 7 machine. Looking at the time taken by the test, I always find that ChromeDriver is faster than PhantomJS. Always. For instance, if the test with ChromeDriver takes about 3-4 seconds, the same test with PhantomJS takes about 5-6 seconds.
Has anyone experienced this issue? Or could anyone give me a reason for it? Am I setting something wrong?
Furthermore, if you need more details, let me know.
I have found that this setting uses a lot of memory that seems to keep growing:
cliArgsCap.add("--load-images=false");
But when I use this setting the memory usage is stable:
cliArgsCap.add("--load-images=true");
"PhantomJS is a headless WebKit scriptable with a JavaScript API" as it's explained on the main page of the project.
Google split from WebKit to create Blink to use it in Chrome.
As for the main differences between them, unfortunately I'm not the expert here.
I ran one of my really long scenarios on both Chrome and PhantomJS and to my surprise the difference was pretty significant:
PhantomJS - 583.251 s
Chrome - 448.384 s
Using PhantomJS doesn't bring performance benefits in my case but running tests headless does. I can use machine without graphical desktop and save computing power for some additional threads.
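A single timed run, as in the comparisons above, can be skewed by browser start-up or network noise. The following is a small sketch for repeating a scenario and reporting the median; the name time_runs and the default of five repeats are assumptions, and action stands for whatever callable runs the scenario against one driver.

```python
import statistics
import time


def time_runs(action, repeats=5, clock=time.perf_counter):
    """Call action() repeats times and return (median, all durations) in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = clock()
        action()
        durations.append(clock() - start)
    return statistics.median(durations), durations
```

Running time_runs once per driver with the same action gives two medians that are easier to compare than two single measurements.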
The slowest aspect of a web page is downloading the HTML, JavaScript, CSS, images, etc. and making AJAX requests.
To anybody who says Headless is faster, how can headless address any of these?
