How to Speed Up Selenium in Python

Building a Concurrent Web Scraper with Python and Selenium

Posted by Caleb Pollman. Last updated December 22nd, 2021

This article looks at how to speed up a Python web scraping and crawling script with multithreading via the concurrent.futures module. We’ll also break down the script itself and show how to test the parsing functionality with pytest.

After completing this article, you will be able to:

  1. Scrape and crawl websites with Selenium and parse HTML with Beautiful Soup
  2. Set up pytest to test the scraping and parsing functionalities
  3. Execute a web scraper concurrently with the concurrent.futures module
  4. Configure headless mode for ChromeDriver with Selenium

Project Setup

Clone down the repo if you’d like to follow along. From the command line run the following commands:

The above commands may differ depending on your environment.

Install ChromeDriver globally. (We’re using version 96.0.4664.45).

Script Overview

The script makes 20 requests to Wikipedia:Random (https://en.wikipedia.org/wiki/Special:Random) and collects information about each article, using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML.

Let’s start with the main block. After determining whether Chrome should run in headless mode and defining a few variables, the browser is initialized via get_driver() from scrapers/scraper.py:

A while loop is then configured to control the flow of the overall scraper.

Within the loop, run_process() is called, which manages the WebDriver connection and scraping functions.

In run_process(), the browser instance is passed to connect_to_base().

This function attempts to connect to Wikipedia and then uses Selenium’s explicit wait functionality to ensure the element with id="content" has loaded before continuing.

Review the Selenium docs for more information on explicit wait.
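
As a rough illustration, a connection helper along these lines could look like the sketch below. The retry count, the 5-second timeout, and the function body are assumptions made for this example, not the code from the repo. Selenium is imported inside the function only so that the module can be loaded without it.

```python
def connect_to_base(browser, url="https://en.wikipedia.org/wiki/Special:Random",
                    timeout=5, max_attempts=3):
    """Load `url` and wait until the element with id="content" is present.

    Returns True on success and False once every attempt has failed.
    """
    # Imported lazily so code that only parses HTML need not load Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    for attempt in range(1, max_attempts + 1):
        try:
            browser.get(url)
            # Explicit wait: poll until the element exists or `timeout` elapses.
            WebDriverWait(browser, timeout).until(
                EC.presence_of_element_located((By.ID, "content"))
            )
            return True
        except Exception as exc:  # e.g. a timeout or a network error
            print(f"Attempt {attempt} to reach {url} failed: {exc}")
    return False
```

The wait returns as soon as the element is present, so the timeout is an upper bound rather than a fixed delay.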

To emulate a human user, sleep(2) is called after the browser has connected to Wikipedia.

Once the page has loaded and sleep(2) has executed, the browser grabs the HTML source, which is then passed to parse_html() .

parse_html() uses Beautiful Soup to parse the HTML, generating a list of dicts with the appropriate data.

This function also passes the article URL to get_load_time() , which loads the URL and records the subsequent load time.

The output is added to a CSV file.
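
A minimal sketch of appending rows to a CSV file with the standard library follows. The column names and the function signature are hypothetical; the actual script may organize its output differently.

```python
import csv
from pathlib import Path

# Hypothetical column names, chosen only for this illustration.
FIELDNAMES = ["url", "title", "load_time"]


def write_to_file(rows, filename):
    """Append a list of dicts to `filename`, writing the header only once."""
    path = Path(filename)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Opening the file in append mode lets each pass of the loop add its rows without rewriting what earlier passes saved.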

Finally, back in the while loop, the current_attempt is incremented and the process starts over again.

Want to test this out? Grab the full script here.

It took about 57 seconds to run:

Got it? Great! Let’s add some basic testing.

Testing

To test the parsing functionality without initiating the browser and, thus, making repeated GET requests to Wikipedia, you can download the page’s HTML (test/test.html) and parse it locally. This can help avoid scenarios where you may get your IP blocked for making too many requests too quickly while writing and testing your parsing functions, as well as saving you time by not needing to fire up a browser every time you run the script.

Ensure all is well:

Want to mock get_load_time() to bypass the GET request?

Configure Multithreading

Now comes the fun part! By making just a few changes to the script, we can speed things up:

With the concurrent.futures library, ThreadPoolExecutor is used to spawn a pool of threads for executing the run_process functions asynchronously. The submit method takes the function along with the parameters for that function and returns a future object. wait is then used to block execution until all tasks are complete.
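
The pattern described above can be sketched as follows. Here run_process is a stand-in that only sleeps, and its parameters are assumptions for this example; the real function drives a browser.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from time import sleep


def run_process(page_number, delay=0.1):
    """Stand-in for the real scraping task (hypothetical body)."""
    sleep(delay)  # simulates time spent waiting on the network
    return page_number


def run_concurrently(total_pages=20, max_workers=None):
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for number in range(1, total_pages + 1):
            # submit() schedules the call and returns a Future immediately.
            futures.append(executor.submit(run_process, number))
        # wait() blocks until every submitted task has finished.
        wait(futures)
    return [future.result() for future in futures]
```

Since the threads spend most of their time sleeping (or, in the real script, waiting on I/O), the 20 tasks overlap instead of running back to back.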

It’s worth noting that you can easily switch to multiprocessing via ProcessPoolExecutor since both ProcessPoolExecutor and ThreadPoolExecutor implement the same interface:
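
A small sketch of what the shared interface allows: the executor class becomes a parameter. The helper name run_with and the square task are made up for this example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, wait


def square(number):
    """Stand-in task; defined at module level so processes can pickle it."""
    return number * number


def run_with(executor_cls, func, items, max_workers=4):
    """Run func over items with whichever executor class is passed in."""
    with executor_cls(max_workers=max_workers) as executor:
        futures = [executor.submit(func, item) for item in items]
        wait(futures)
        return [future.result() for future in futures]


# Threads:   run_with(ThreadPoolExecutor, square, range(5))
# Processes: run_with(ProcessPoolExecutor, square, range(5))
# On platforms that spawn new processes, call the process variant from
# inside an `if __name__ == "__main__":` block.
```

Only the class passed in changes; the submit and wait calls stay the same.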

Why multithreading instead of multiprocessing?

Web scraping is I/O bound because retrieving the HTML (I/O) is slower than parsing it (CPU). For more on this, along with the difference between parallelism (multiprocessing) and concurrency (multithreading), review the Speeding Up Python with Concurrency, Parallelism, and asyncio article.

To speed things up even further we can run Chrome in headless mode by passing in the headless command line argument:
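
A minimal sketch of adding the flag, assuming an options object such as Selenium's webdriver.ChromeOptions() is passed in; the helper name apply_headless is hypothetical.

```python
def apply_headless(options, headless=True):
    """Add the headless flag to a browser options object when requested."""
    if headless:
        options.add_argument("--headless")
    return options


# Possible usage with Selenium installed:
#   from selenium import webdriver
#   options = apply_headless(webdriver.ChromeOptions())
#   browser = webdriver.Chrome(options=options)
```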

Conclusion

With a small amount of variation from the original code, we were able to execute the web scraper concurrently to take the script’s run time from around 57 seconds to just over 6 seconds. In this specific scenario that’s just about 90% faster, which is a huge improvement.

I hope this helps your scripts. You can find the code in the repo. Cheers!

Caleb Pollman

Caleb is a software developer with a background in fine art and design. He’s excited to learn new things and is most comfortable in challenging environments. In his free time he creates art and hangs out with random cats.

How to Implement Slow Page Scrolling with Python and Selenium

Usually, finding elements on a page with Python and Selenium is not too difficult: you just need to choose the right locator. But on some pages even the simplest search for an element can return an error. One cause of such problems is lazy loading.

Lazy loading

What is it and how does it affect the search for elements on the page? Usually, when you open a page, all of its elements are immediately loaded and searchable. But not in this case. On these pages, elements appear one by one as you scroll down.

I have come across two types of such pages. On some, it is enough to scroll quickly to the end of the page for all the elements to be displayed. On others this is not enough, because elements only appear in the area you are currently viewing. To display all the elements there, you need to scroll slowly through the entire page.

Scrolling to the bottom of the page

This solution is for the first type of page. You only need to perform one action: scroll the page to the bottom. To do this, we can use a JavaScript function, tell it to scroll the entire page, and call it through Selenium's execute_script function.
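
A minimal sketch of such a call, assuming `driver` is an already started Selenium WebDriver; the helper name scroll_to_bottom is made up for this example.

```python
def scroll_to_bottom(driver):
    """Jump to the end of the page in a single step."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
```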

As a result, you will find yourself at the very bottom of the page and all the elements will immediately begin to load.

Slowly scroll the entire page

To deal with the second type of page, let’s find a page that loads this way. The only public example I managed to find is the Vivino website. For example, we can take the page of a particular product and try to find the element on it that has data-testid="mentions".

[Figure: the element we are looking for]

If we open this page manually, do not scroll anywhere, and try to find this element, we will fail. We get the same result if we search for this element using Python and Selenium.

[Figure: the element is not found on the page]

But if we slowly scroll the whole page and repeat the search, the element is found.

[Figure: the element is found after scrolling]

Implement slow scrolling

Now let’s move on to trying to find this element with Python and Selenium.

First, we can try the same approach we used for the first page type: scroll straight to the bottom of the page and then search for the element. On this page the search fails with an error.

Thus, we see that this approach is not suitable for this page. We need to make the browser scroll the page not all at once but screen by screen.

To do this, we need to know the height of the page and how much of the page is displayed on the screen. The first parameter can be obtained using the following code

And the second parameter can be obtained like this

Also, while we are scrolling, we will need to understand where we are now on the page. You can find out with this code
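
The three values described above could be read with helpers like the following. This is a sketch that assumes `driver` is a running Selenium WebDriver; the helper names and the particular JavaScript properties are one reasonable choice, not necessarily the ones the original code used.

```python
def page_height(driver):
    """Total height of the document in pixels."""
    return driver.execute_script("return document.body.scrollHeight;")


def viewport_height(driver):
    """Height of the part of the page that is visible on screen."""
    return driver.execute_script("return window.innerHeight;")


def scroll_position(driver):
    """How far the page is currently scrolled from the top."""
    return driver.execute_script("return window.pageYOffset;")
```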

Now we can create a loop that will scroll screen after screen until the page ends.

Also, we need to make a short pause in each cycle so that the loading of elements in this section begins.

In the end, the whole code will be like this
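
A sketch of the complete loop is shown below, under a few assumptions: `driver` is a running Selenium WebDriver, the function name slow_scroll and its parameters are invented for this example, and max_steps is an added safety limit for pages that keep growing.

```python
import time


def slow_scroll(driver, pause=0.5, max_steps=200):
    """Scroll down one screen at a time and return the number of scrolls."""
    viewport = driver.execute_script("return window.innerHeight;")
    position = 0
    steps = 0
    while steps < max_steps:
        # Re-read the height each time: lazy loading can make the page grow.
        height = driver.execute_script("return document.body.scrollHeight;")
        if position + viewport >= height:
            break  # the bottom of the page is already on screen
        position += viewport
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        time.sleep(pause)  # give the newly visible section time to load
        steps += 1
    return steps


# Possible usage with Selenium installed:
#   from selenium.webdriver.common.by import By
#   slow_scroll(driver)
#   element = driver.find_element(By.CSS_SELECTOR, '[data-testid="mentions"]')
#   print(element.text)
```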

As a result of executing this code, we will get the content of the element that we were looking for.

How to Make Selenium Load Faster with Firefox in Python

This article was first published on my Tumblr blog in 2015. Developer blogging was not widespread in those days; I was looking for a decent solution, and Tumblr was a good choice. Now, however, there are many blogging platforms for developers, which is why I have migrated the article here.

There are some things to be aware of, though. This article was published in 2015. Things were quite different back then.

  • There was no concept of headless browsing, that is, a browser that runs as a service and does not provide a GUI. In those days there was a project called PhantomJS, which is pretty much abandoned today and no longer receives updates. Today, however, the major browsers provide a headless option. If you see some mumbling about PhantomJS, ignore it.
  • The part about QuickJava extension configuration is removed in this version of article. The extension was removed from Firefox extension registry and does not exist today. It was basically an extension to disable some things, like Flash, Silverlight, Javascript etc. Today, Firefox provides configurations to disable them, but.
  • Standard Firefox configurations are kept as is in this article. That's because (i) I am lazy, (ii) I want to persist my technical past and (iii) I've actually linked this in a Stackoverflow question that still gets reactions today. That's why disabling CSS is not included in this article (which was done by QuickJava). You should figure it out.

I've got to say, however, I will return to this article one day and edit it out properly. Until then, though, I present you "me" in 2015.

Article

It is good to run a browser, manipulate the DOM elements on a page, and scrape data. However, testing this on a personal computer can be a nightmare. There are a couple of headless-browser solutions for Python, but Selenium offers only one choice, and it seems to be buggy when manipulating DOM elements. So there is another option, one that ships with most Linux distributions: Firefox!

You might find it too slow. However, there are a couple of about:config tricks and an extension for increasing the rendering speed. First, create a Firefox profile instance:

And these are the built-in configurations of Firefox:

Those will make your browser load and render the page faster. Thanks for reading.

How To Speed Up Selenium Test Cases?

When we talk about automation, one of the first tools that comes to mind is Selenium. We all know that Selenium WebDriver is a remarkable tool for web automation. The primary reason for adopting Selenium automation testing is to speed up testing. In most cases, Selenium tests run far faster than manual ones.

Sometimes, though, automation scripts run slowly. Integration and unit tests are comparatively faster than Selenium tests. A single Selenium test can take minutes to run, and a large suite takes even longer, which makes it difficult to get accurate and fast feedback. However, you can always speed up Selenium tests by following the best approaches to Selenium test automation.

How can you execute your Selenium test cases faster?

There are various ways testers can speed up Selenium test cases. You can use explicit waits, choose faster web locators, try different browsers, optimize the Selenium infrastructure, and follow other best practices. Maintaining Selenium test cases becomes cumbersome as the end product keeps changing, so we cannot afford to ignore their performance; we should focus on accelerating them from the earliest stages. The key tasks of a Selenium test case in any scenario are:

  • Open the URL under test using Selenium WebDriver (local or remote)
  • Locate the web elements using relevant web locators
  • Perform assertions on the located web elements on the page under test
  • Release the resources used by WebDriver

Let us highlight a few of the methods to understand how to speed up selenium tests.

Parallel Testing in Selenium Automation

Parallel testing is one of the easiest ways to speed up Selenium test cases. It allows you to execute multiple tests simultaneously on different device-browser combinations and OS configurations, covering the entire test suite in far less time. If you have an in-house Selenium Grid infrastructure, you can check what Selenium Grid 4 has to offer in terms of accelerating Selenium test cases. Let us assume you have ten tests that each take ten seconds. If you run them in parallel on ten devices, all ten can complete in about ten seconds instead of 100 seconds. You can apply this method at the class and method levels. Grouping test scenarios, parameterizing them, and using cloud-based options further strengthen the process.

a. Grouping tests:

Multiple test methods and test files in a test suite make the implementation difficult to manage. If we group the test scenarios by the type of functionality under test, it becomes easier to handle any emerging complexity.

b. Replacing Selenium 3 with Selenium 4:

Selenium has seen significant improvements with the release of Selenium 4. It comes with an optimized Selenium Grid, a WebDriver standardized by the World Wide Web Consortium (W3C), an enhanced Selenium IDE, Chrome DevTools support, and relative locators. These improvements can significantly speed up Selenium tests. Selenium 3 uses the JSON Wire Protocol for communication between the test code and the browser, which adds the overhead of encoding and decoding API requests. Selenium 4 uses the W3C WebDriver protocol, which speeds up the interaction between the web browser and the test code. The relative locators introduced in Selenium 4 (‘above’, ‘below’, ‘to_left_of’, ‘to_right_of’, ‘near’) can speed up Selenium test cases and improve their overall stability. Also, upgrading from version 3 to 4 is straightforward.

c. Cloud-based Selenium Grid:

Whenever you want to test large-scale web applications where many parallel tests have to be run across multiple browser-OS-device combinations, you will need a cloud-based Selenium Grid to execute and expedite Selenium test cases.

[Figure: pictorial representation of Selenium Grid]

Choosing relevant Web locators

Web locators are an indispensable part of any Selenium test scenario. A web element must first be located with an appropriate web locator before the test can act on it. It is always advisable to use the faster web locators out of the many options available. Of all web locators, the ID locator is the fastest one in Selenium WebDriver. Let us briefly discuss some of the most used web locators:

a. ID Locator: It is the fastest because it relies on the document.getElementById() JavaScript call, which all browsers support. If several elements share the same ID, it returns the first match. It works reliably only if the HTML element has an ID attribute that is unique on the page. In terms of execution speed, ID is followed by Name, CSS Selector, and XPath, in that order.

b. Name Selector: The Name Selector web locator is used when the WebElement has no ID.

c. CSS Selector: If the WebElement has neither an ID nor a NAME attribute, the CSS Selector locator is the appropriate choice. CSS engines behave consistently across the most common web browsers, so CSS selectors perform well in Selenium. The advantages of this locator are faster element recognition, fewer browser incompatibilities, and shorter test execution time. CSS selectors are also preferred over XPath in legacy browsers such as Internet Explorer.

d. XPath: The XPath Selector is the most flexible web locator, but it is the slowest of these four because every level of the path has to be traversed to select a particular web element, and its behavior can vary from one browser to another. An XPath locator should not be the primary choice; use it only when no other option remains.

Tips that make Selenium test cases run faster

  • Using few Web locators: Keeping the number of web locators to a minimum improves the readability of the test script and reduces the time taken to execute it.
  • Explicit Waits: Explicit wait commands avoid unnecessary slowdown and let you wait on conditions such as Element is Visible, Element is Clickable, or Element is Selectable for web elements on the page, which is not possible with an Implicit Wait in Selenium. For example, the element-to-be-clickable condition (element_to_be_clickable in the Python bindings) yields a WebElement once the identified element is clickable. An explicit wait returns as soon as the condition is fulfilled: the element is returned as the result, and the wait does not run for the entire timeout. With a 6-second timeout, for instance, the wait exits and returns the required WebElement as soon as it is located, rather than after the full 6 seconds. Test scripts that use explicit waits show better performance.
  • Create Atomic Scripts: Creating independent test cases by breaking down complex scenarios makes Selenium tests efficient. Frameworks like TestNG support explicit dependencies between test methods, but atomic tests make failures easier to detect, which reduces testing time and maintenance effort, minimizes test dependency, and accelerates the Selenium tests.
  • Disable images on Web pages for faster page loads: After creating the Selenium instance, you can open the page under test using the driver.get() method. Many web pages are rich in content and contain many images, which slow down page loading. Page loading can be accelerated by disabling image loading through browser-related settings.

Image loading can be disabled from Selenium scripts in both Chrome and Firefox to speed up page loading and Selenium test cases. The original example applied this to the Amazon e-commerce website. In Firefox, image loading is controlled by setting the permissions.default.image preference to 2.
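
A sketch of both settings follows. It assumes the options objects come from Selenium (webdriver.ChromeOptions() and selenium.webdriver.firefox.options.Options()); the helper names are made up, and the Chrome preference key shown is the one commonly used for this purpose rather than something taken from the source.

```python
def disable_images_chrome(options):
    """Ask Chrome to block images via its content-settings preference."""
    prefs = {"profile.managed_default_content_settings.images": 2}
    options.add_experimental_option("prefs", prefs)
    return options


def disable_images_firefox(options):
    """Ask Firefox to block images; 2 means images are not loaded."""
    options.set_preference("permissions.default.image", 2)
    return options


# Possible usage with Selenium installed:
#   from selenium import webdriver
#   driver = webdriver.Chrome(options=disable_images_chrome(webdriver.ChromeOptions()))
```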

  • Data-Driven Testing for Parameterization: Parameterization is a great choice when you need to test against extensive data sets and run the same test on different inputs. It is well supported by most automation frameworks, such as TestNG (Selenium Java), JUnit, NUnit (C#), and PyTest (Selenium Python).
  • Using headless browsers/drivers: Headless browsers allow us to execute browser User Interface (UI) tests without a browser Graphical User Interface (GUI). They also help improve the efficiency of cross-browser tests that run in the background. This practice suits cases where you do not need to watch the UI interactions driven by the test scripts. Some common headless browsers are HtmlUnit, Splash, and PhantomJS. Check out the performance of Selenium browser tests with the PhantomJS driver.

Conclusion

The speed of Selenium test execution is of crucial importance to the business. Even if your tests are slow, there are many ways to speed them up. The best practices described above help reduce test times. Early detection of bugs in continuous testing leads to faster resolution, which improves test performance and enhances product quality.
