In this guide, I will show you how to use the BeautifulSoup library to build a simple program that notifies you when a product on an online store drops in price.
The program runs in the background, scraping static e-commerce pages of your choice and notifying you when a product’s price drops.
Prerequisites
This guide assumes that you have Python installed, pip added to your system’s PATH, and a basic understanding of Python and HTML.
Installing Required Components
First, let’s install BeautifulSoup and Requests. The Requests library retrieves our data, while the BeautifulSoup library actually parses it.
We can install those two required components by running the command below:
pip install beautifulsoup4 requests
Note that depending on your system’s setup, you might need to use pip3 instead of pip.
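If you want to confirm that both packages installed correctly, a quick check is to import them from Python:
# If this runs without an ImportError, both libraries are installed
import requests
import bs4
print("bs4", bs4.__version__, "| requests", requests.__version__)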
Grabbing Our Sample: Price
In this step, we will be telling BeautifulSoup what exactly to scrape. In this case, it’s the price. But we need to tell BeautifulSoup where the price is on the website.
To do this, navigate to the product you want to scrape. For this guide, I will be scraping an AV receiver I found on Amazon.
Then, open your browser’s DevTools and navigate to the price. Make sure that you select a very “unique” element: one that shows the product’s price and is also very specifically identified within the HTML document. Ideally, choose an element with an id attribute, as no two elements can share the same HTML ID. Try to get as much “uniqueness” as you can, because this will make the parsing easier.
The elements I have selected above are not the most “unique”, but they are the closest we can get: they have many classes, and I can safely assume that few other elements have all of them. We also want to ensure that our web scraper stays as resilient as possible to changes in the website.
If you don’t have an element that is completely “unique”, then I suggest using the Console tab and the JavaScript DOM to see how many other elements share those attributes. In this case, I am checking whether the element I selected is “unique” enough to be found by its class. There is only one other element I need to worry about, which I think is good enough.
Basic Scraping: Setup
This section will detail the fundamentals of web scraping only. We will add more features as this guide goes on, building upon the code we will write now.
First, we need to import the libraries we will be using.
import requests as rq
from bs4 import BeautifulSoup
Then, we need to retrieve the content of our product’s page. I will be using this AV receiver as an example.
request = rq.get("https://www.amazon.com/Denon-AVR-X1700H-Channel-Receiver-Built/dp/B09HFN8T64/")
If the content you want to scrape is locked behind a login screen, chances are you need to provide basic HTTP authentication to the site. Luckily, the Requests library has support for this. If you need authentication, add the auth parameter to the get method above and make it a tuple in the format ('username','password').
For example, if Amazon required us to use HTTP basic authentication, we would declare our request variable like the one below:
request = rq.get("https://www.amazon.com/Denon-AVR-X1700H-Channel-Receiver-Built/dp/B09HFN8T64/", auth=("replaceWithUsername","replaceWithPwd"))
If that authentication type does not work, then the site may be using HTTP Digest authentication.
To authenticate with Digest, you will need to import HTTPDigestAuth from the requests.auth submodule. Then it’s as simple as passing that object into the auth parameter.
from requests.auth import HTTPDigestAuth
request = rq.get("https://www.amazon.com/Denon-AVR-X1700H-Channel-Receiver-Built/dp/B09HFN8T64/", auth=HTTPDigestAuth("replaceWithUsername","replaceWithPwd"))
If the content you want to scrape requires a login other than basic HTTP authentication or Digest authentication, consult this guide for other types of authentication.
Amazon does not require any authentication, so our code will work without providing any.
Now, we need to create a BeautifulSoup object and pass our website’s response into it.
parser = BeautifulSoup(request.content, 'html.parser')
When you use the Requests library to print a response to the console, you generally want to use request.text. However, since we don’t need to worry about decoding the response into printable text, it is considered better practice to pass the raw bytes with request.content.
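Before parsing, it can also be worth confirming the request succeeded at the HTTP level; this is just an optional sanity check:
print(request.status_code)   # 200 means the page was returned successfully
request.raise_for_status()   # Raises an exception for 4xx/5xx responses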
Basic Scraping: Searching Elements
Now we can get to the fun part! We will find the price element using our sample we got earlier.
I will cover the two most common scenarios: finding the price based on its element’s ID (the simplest), and finding the price based on class names and sub-elements (a little more complicated, but not too difficult, assuming you have a “unique” enough element).
To refer to an element by its ID with BeautifulSoup, you use the find method. For example, to store the element with the ID pricevalue in a variable called priceElement, we would invoke find() with the id argument set to the value "pricevalue".
priceElement = parser.find(id="pricevalue")
We can even print our element to the console!
print(priceElement.prettify())
<div id="pricevalue">
 <p>
  $19.99
 </p>
</div>
The prettify function reformats (“pretty-prints”) the output. Use it when you want to visualize the data, as it produces better-looking output in the console.
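If you just want the price text itself rather than the surrounding markup, BeautifulSoup elements have a get_text method:
print(priceElement.get_text()) # Prints just the text inside the element: $19.99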
Now we get to the tougher part: referencing element(s) based on one or more class names. This is the method you will need to use for most major e-commerce sites like Amazon or eBay.
This time, we will be using the find_all function. It is used in situations where it is theoretically possible to get multiple matches, such as when we search by class names, because it returns its results as a list of elements rather than a single element. If you are not sure, know that you can use find_all even when your query only returns one result – you will just get a one-item list.
The code below will return any elements with the class priceToPay or big-text.
priceElements = parser.find_all(class_=["priceToPay","big-text"])
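Incidentally, this is also a handy way to double-check how “unique” your sample is once you have the page parsed:
print(len(priceElements)) # If this prints 1, the class combination uniquely identifies our price element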
The select function is just like the find function, except that instead of directly specifying attributes through function parameters, you simply pass in a CSS selector and get a list of matching element(s) back. For example:
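parser.select(".price-value.main-color") # Returns a list of elements that have both classes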
The code above selects all elements that have both the price-value and main-color classes. Although many use the find or find_all functions, I prefer select as I am already familiar with CSS selectors.
If we would like to filter by element type (which on its own is not much of a good idea when finding specific elements), we just call find_all with a single positional argument: the element’s type. So, parser.find_all("p") will return a list of every single paragraph (“p”) element.
An element type is one of the broadest filters you can pass into the find_all function, so it only becomes useful when you combine it with a narrower filter, such as an id or class.
parser.find_all("h1", id="title")
That would return all h1 elements with an ID of title. But since each element needs to have its own unique ID, we can just use the find function. Let’s do something more realistic.
parser.find_all("h1",class_="bigText")
This code would return all h1 elements that have a class of bigText.
Below is a review of what we know so far, along with some other, rarer methods of finding elements.
"""
Never recommended, but returns a list of ALL the elements that have type 'p'
"""
typeMatch = parser.find_all("p")
"""
Finds element with the ID of 'priceValue' using a CSS selector
"""
idSelMatch = parser.select("#priceValue")
"""
Finds element with the ID of 'priceValue', except with the BeautifulSoup-native find function and not with a CSS selector
"""
idMatch = parser.find(id="priceValue") # Same as above
"""
Extremely rare, but returns a list of elements containing an ID of 'priceValue' OR 'price'
"""
orIdMatch = parser.find_all(id=["priceValue","price"])
"""
Returns a list of elements that have the class 'price' OR 'dollarsToPay'. I do not know of a CSS selector that does the same
"""
orClassMatch = parser.find_all(class_=['price','dollarsToPay'])
"""
Returns a list of elements that have the class 'price' AND 'dollarsToPay'. I do not know of a
find_all argument that does the same
"""
andClassMatch = parser.select(".priceValue.dollarsToPay")
"""
Returns the element that has a class of 'v' INSIDE the element of class 't'. This can also be done with ID attributes, but this function only works when the first function is .find(...) or when you are grabbing an element by index after calling .find_all(...). Because .find(...) only returns one element, it will only be returning the first instance of that class name. The code below return the same thing, however 'inMatch3' returns a list
"""
inMatch = parser.find(class_="t").find(class_="v") # Most basic way to do it
inMatch2 = parser.find_all(class_="t")[0].find_all(class_="v")[0] # Because .find_all(...) works on the final element, the '[0]' is unnecessary, we just do it so we don't get a one-element list
inMatch3 = parser.find_all(class_="t")[0].find_all(class_="v") # Returns a one-element list
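As a side note, the “class inside a class” lookup above can also be written as a CSS descendant selector with select:
inSelMatch = parser.select(".t .v") # Returns a list of elements with class 'v' nested inside an element with class 't'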
Now that we know how to search elements, we can finally implement this in our price drop notifier!
Let’s see if our request is successful. We will be printing out the entire file to check.
print(parser.find("html").prettify())
And we are not: Amazon is serving us a CAPTCHA page instead of the product page.
Hmmm, so we have to bypass Amazon’s CAPTCHA somehow. Let’s try adding headers that mimic a normal browser!
I will be adding headers to rq.get(). Make sure to replace my AV receiver link with the product you want to scrape.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"'
}
request = rq.get("https://www.amazon.com/Denon-AVR-X1700H-Channel-Receiver-Built/dp/B09HFN8T64/",headers=headers)
Let’s try now…
Nope. Still nothing. Well, time for plan B: ditching requests completely and using selenium.
Basic Scraping: Implementation of Selenium
Firstly, it is important to know that Selenium has its own methods for finding elements in an HTML document, but for the sake of this guide, we will just pass the page source of the website to our parser.
Think of Selenium as a browser running in the background with some selection abilities. Instead of sending the requests to the website by crafting our own headers, we can use Selenium to spin up an invisible browser that crafts the headers for us. We should no longer get a CAPTCHA screen because Amazon shouldn’t be suspicious that a robot is browsing the page – we are technically using a legitimate browser, but with parsing capabilities.
Selenium can be installed with the commands below. We will also install win10toast so you get a proper toast notification whenever a price drop is detected.
pip install selenium
pip install win10toast
If you are looking to uninstall Requests because you don’t need it anymore, think twice, because Selenium depends on Requests anyway.
Now, clear your entire Python file because we are going to need to do a short and quick rewrite of our code to use Selenium.
As always, we will start by importing the required modules. Make sure you replace chrome with the name of a browser you have installed on your system, preferably the most resource-efficient one.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options # Imports the module we will use to change the settings for our browser
import time # This is what we will use to set delays so we don't use too many system resources
from win10toast import ToastNotifier # This is what we will use to notify if a price drop occurs.
notifier = ToastNotifier() # Assign our notifier class to a variable
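For example, if you use Firefox instead of Chrome, the equivalent options import would be:
from selenium.webdriver.firefox.options import Options # Firefox equivalent of the Chrome options import above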
Then, we will need to set some preferences for the browser we are about to start. Let’s start by creating an Options object and using it to run the browser invisibly, in “headless” mode. While the arguments below target specific browsers, I would just apply them all, as I have not tested each argument individually.
browserOptions = Options()
browserOptions.headless = True # Makes Firefox run headless
browserOptions.add_argument("--headless=new") # Makes newer versions of Chrome run headless
browserOptions.add_argument("--headless") # Makes older versions of Chrome run headless
browserOptions.add_argument("--log-level=3") # Only log fatal errors
Now, we will start the browser in the background. Again, make sure you replace Chrome with whichever browser you want to use for this project.
browser = webdriver.Chrome(options=browserOptions)
Now, we can navigate our browser to the page we want to scrape and get its source, which we can pass to BeautifulSoup.
browser.get("https://www.amazon.com/Denon-AVR-X1700H-Channel-Receiver-Built/dp/B09HFN8T64/")
parser = BeautifulSoup(browser.page_source, "html.parser")
Then, we can use what we already know about BeautifulSoup to grab the price of our element. Remember to replace the code below with one tailored to your sample.
price = parser.select(".a-price.aok-align-center.reinventPricePriceToPayMargin.priceToPay")[0].find_all(class_="a-offscreen")[0].text
Next, let’s strip the $ symbol from the price and convert it into a floating-point number.
price = float(price.strip("$"))
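One caveat: if the product costs $1,000 or more, the scraped price will contain a comma (for example, $1,299.00), and float() will raise a ValueError. A small sketch of handling that case:
price = float(price.strip("$").replace(",", "")) # Remove the dollar sign and any thousands separators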
Then, we can set a variable to compare with the current price.
previousPrice = price
Now, we loop infinitely to see whether the price changed.
while True:
Insert a new line and then indent the code we will write from this point forward.
Now, every two minutes (120 seconds), we refresh the page and compare the price we just got to our previous price.
    browser.refresh() # Refreshes the browser
    # Now that we may have a new price, we have to redefine our parser and price variables to pick up the new page code
    parser = BeautifulSoup(browser.page_source, "html.parser")
    price = parser.select(".a-price.aok-align-center.reinventPricePriceToPayMargin.priceToPay")[0].find_all(class_="a-offscreen")[0].text
    price = float(price.strip("$"))
    # Next, we compare the two prices. If we find a change, we alert the user and update our price threshold. We will also be looking for price increases.
    if (price < previousPrice):
        print(f"Price DECREASED from ${previousPrice} to ${price}!")
        notifier.show_toast("Price Drop!", f"The price decreased from ${previousPrice} to ${price}!")
    elif (price > previousPrice):
        print(f"Price INCREASED from ${previousPrice} to ${price}!")
        notifier.show_toast(":(", f"The price increased from ${previousPrice} to ${price} :(")
    # Now, we can tell the user we refreshed
    print(f"Refreshed! Previous price: ${previousPrice}, new price: ${price}")
    previousPrice = price
    # And then we wait for two minutes
    time.sleep(120)
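One more thing worth considering: if the selector fails to match on a refresh (for example, because Amazon served an error or CAPTCHA page), the [0] indexing inside the loop will raise an IndexError and crash the program. A minimal sketch of guarding against that, which would replace the two price lines inside the loop:
    try:
        price = parser.select(".a-price.aok-align-center.reinventPricePriceToPayMargin.priceToPay")[0].find_all(class_="a-offscreen")[0].text
        price = float(price.strip("$"))
    except (IndexError, ValueError):
        # Could not find or parse a price this cycle; wait and try again
        time.sleep(120)
        continue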
And just like that, you are finished! I hope this project was useful to you!