
Automatically finding great deals and coupons in time can save you both money and time. slickdeals.net is a community-based website where people share deal information. Slickdeals has a “Deal Alert” option that detects deal information automatically, but the options are limited: it doesn’t support customized filters such as price, discount amount, or shopping site. With a quick weekend project, we can build something that finds the best deals for us on time.
Three years ago, I wrote an article about setting up a local server and web scraping shopping deal information with a Raspberry Pi and Scrapy. With new frameworks and tools now available, I want to refresh the old version with the latest stack.
What Tools We Will Use
- R: The previous version was written in Python. I have come to love writing R for my personal data projects, and I think writing it in R will be fun this time.
- rvest: Scrapy is still one of Python’s most popular open-source scraping tools. Since I chose R this time, I will use rvest instead.
- Mage: Mage is an engineer-friendly data workflow orchestrator and scheduling tool. I want to show off a nice UI and built-in scheduling this time instead of crontab.
Web Scraping 101
The key part of web scraping that we need to do manually is finding the data we are interested in and getting its HTML identifier (id, class, or XPath); the web scraping framework handles the rest.
Let’s take slickdeals.net as an example and walk through, step by step, how to find the information you need to grab.
1. Go to slickdeals.net.
2. The landing page we will scrape is called Frontpage Slickdeals. Each item includes the following information: product image, product title, product store/website, current price, original price, and thumbs up/likes.

3. Use the developer tools in your browser to inspect an element on the page. After some observation, we notice that each deal is rendered as a card. We need to find an HTML element we can iterate over to grab all of the cards in a loop. In this particular case, a div tag with the class dealCard identifies each item: <div class="dealCard">

4. Within a given card, we can derive the rest of the data from its child elements. To get the class name of each field, repeat the steps above using the browser’s developer tools, collect all the fields we are interested in, and extract them. A quick way to sanity-check the selectors is shown in the sketch below.
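Before writing the full pipeline, it can help to verify the selectors interactively in an R console. The snippet below is a minimal sketch; the class names are the ones observed above and may break whenever Slickdeals changes its markup.
library(rvest)

page <- read_html("https://slickdeals.net/")

# How many deal cards does the selector match?
page %>% html_elements(".dealCard") %>% length()

# Peek at the first card's title to confirm the child selector.
page %>% html_element(".dealCard .dealCard__title") %>% html_text()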

Develop Web Scraping with R and Mage
Mage supports R development out of the box. Since development in Mage is more interactive, I will try writing the code directly in Mage instead of copying and pasting from RStudio.
To learn more about Mage and how its features compare with Airflow, see my earlier article: Is Apache Airflow Due for Replacement? The First Impression Of mage-ai.
To start Mage, we can either use Docker with its latest image or pip. Replace YOUR_PROJECT_NAME with the name you want.
pip install mage-ai
mage start YOUR_PROJECT_NAME
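If you prefer Docker over pip, the start command should look roughly like the sketch below; the mageai/mageai image name and the /app/run_app.sh entrypoint follow Mage's quick-start documentation at the time of writing, so double-check the current docs before running it.
# Mount the current directory so the generated project persists on the host.
docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai \
  /app/run_app.sh mage start YOUR_PROJECT_NAME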
By default, Mage will run at http://localhost:6789.
Once you open Mage’s UI, we can create a new data pipeline. Most data projects follow the ETL pattern: Extract data from a data source, Transform the data into insights, and Load the final data into another data source that is easy to consume.
We will also break our web scraping pipeline into 3 sections:
- Extract web data by web scraping using rvest
- Transform the scraped data into insights. In our case, we'd like to perform some filtering
- Load filtered data into a CSV file
Extract web data by web scraping using rvest
First, click on the Data loader tab in Mage’s pipeline UI and give the data extraction task a name. Let’s call it extract_rvest, our extraction task.
If you don’t provide a file name, Mage automatically assigns a random one for you. Why is naming critical in Mage? Mage uses the file name as a UUID to map all the tasks’ dependencies and form a DAG. If you open the pipeline’s metadata.yaml, you’d see something like the block below: the UUID is the lowercase version of the name you provided.
blocks:
- all_upstream_blocks_executed: true
  configuration: {}
  downstream_blocks: []
  executor_config: null
  executor_type: local_python
  language: r
  name: Extract_RVest
  status: updated
  type: data_loader
  upstream_blocks: []
  uuid: extract_rvest
data_integration: null
name: SlickDeal
type: python
uuid: slickdeal
widgets: []
Next, we can write the code that scrapes the target website and pulls the data.
library(rvest)

# Pull the HTML of the Slickdeals front page.
slickdeal_raw <- read_html("https://slickdeals.net/")

# Each deal is a card with the class "dealCard".
deals <- slickdeal_raw %>% html_elements(".dealCard")

df <- NULL
for (deal in deals) {
  # html_element() returns the first match, or NA when a field is missing,
  # so the columns stay aligned even for incomplete cards.
  title <- deal %>% html_element(".dealCard__title") %>% html_text()
  original_price <- deal %>% html_element(".dealCard__originalPrice") %>% html_text()
  price <- deal %>% html_element(".dealCard__price") %>% html_text()
  votes <- deal %>% html_element(".dealCardSocialControls__voteCount") %>% html_text()
  df <- rbind(df, data.frame(title, original_price, price, votes))
}
df %>% head()
The code above does two things: it pulls the HTML from the Slickdeals website, then iterates over every card with the class dealCard. Once rvest finds all the items, we loop through the list, extract each field, and persist the results into a dataframe.
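One note on block structure: Mage's R block templates wrap the logic in a function, the same pattern the transformer (transform) and exporter (export_data) blocks below follow. If your data loader template expects a load_data function, the same scraping logic can be wrapped like this sketch, returning the dataframe so downstream blocks receive it:
library(rvest)

load_data <- function() {
  slickdeal_raw <- read_html("https://slickdeals.net/")
  deals <- slickdeal_raw %>% html_elements(".dealCard")

  df <- NULL
  for (deal in deals) {
    title <- deal %>% html_element(".dealCard__title") %>% html_text()
    original_price <- deal %>% html_element(".dealCard__originalPrice") %>% html_text()
    price <- deal %>% html_element(".dealCard__price") %>% html_text()
    votes <- deal %>% html_element(".dealCardSocialControls__voteCount") %>% html_text()
    df <- rbind(df, data.frame(title, original_price, price, votes))
  }

  # The returned dataframe is what the transformer block receives as df.
  df
}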
A nice thing about Mage is that each task block is individually executable. Without triggering the entire pipeline, we can test each task on its own before we build the whole DAG.

Transform Data Into Insights
Now that we have the dataframe in R, we can gain more insights from the scraped data. For example, we can see which item has the most votes and which items are the most expensive.
If you write everything as one R script in RStudio, you usually keep reusing the same dataframe throughout the script. Since we are using Mage, we can separate the tasks to make them more modular, which makes debugging and unit testing much easier.
Let’s write a one-liner in R that sorts the dataframe by votes in descending order and returns it.
library(dplyr)

transform <- function(df, ...) {
  # votes is scraped as text, so strip any thousands separators and convert
  # to numeric before sorting; otherwise the sort would be alphabetical.
  df %>% arrange(desc(as.numeric(gsub(",", "", votes))))
}
Once we have created the Transformer block, it automatically connects to the previous block as its data source. And since we have already executed the previous task block, we can rerun just this step without rerunning everything from the top.
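The intro mentioned customized filters such as price or discount amount. As a sketch of where the transformer could go next (assuming the price columns look like "$19.99" and treating the $50 threshold purely as an example), a slightly richer version might look like this:
library(dplyr)
library(readr)  # parse_number() strips "$" and "," from the scraped strings

transform <- function(df, ...) {
  df %>%
    mutate(
      price_num = parse_number(price),
      original_price_num = parse_number(original_price),
      votes_num = parse_number(votes)
    ) %>%
    filter(price_num <= 50) %>%   # example filter: deals under $50
    arrange(desc(votes_num))
}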

Load Data into CSV
The final loading stage is similar to what we did before; we just need to write the result out as a CSV file. We can grab Mage’s runtime metadata and include the execution date in the file name so the CSV file is not overwritten on each run.
export_data <- function(df, ...) {
  # Mage exposes runtime variables (such as the execution date) to the block.
  execution_date <- global_vars['execution_date']
  # Include the execution date in the file name so each run writes a new file.
  write.csv(df, paste0("~/Downloads/my_slickdeal_", execution_date, ".csv"))
}
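Once a few hourly snapshots have accumulated, they can be stitched back together for the trend analysis mentioned later on. Here is a small sketch (run outside the pipeline) that assumes the file naming used in export_data above; the "AirPods" keyword is just a hypothetical example.
library(dplyr)
library(purrr)
library(readr)

# Collect every hourly snapshot written by export_data.
files <- list.files("~/Downloads", pattern = "^my_slickdeal_.*\\.csv$", full.names = TRUE)

# Read them into one dataframe, keeping the source file as a snapshot label.
snapshots <- files %>% set_names() %>% map_dfr(read_csv, .id = "snapshot")

# Track how votes for a hypothetical product evolve across snapshots.
snapshots %>%
  filter(grepl("AirPods", title)) %>%
  arrange(snapshot)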

Mage DAG - Pipeline
If you want to build a more complex DAG, you can also wire up the dependencies in the graph view instead of defining them between blocks by hand. Below is the DAG that was automatically created from the tasks we described earlier.

Schedule The Pipeline
Since the Slickdeals front page updates frequently with the latest deals, we can take a snapshot of the information every hour and build trend charts or run time series analysis. You can easily set up the pipeline to run hourly.
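The hourly trigger takes a couple of clicks in Mage's Triggers UI. If you prefer keeping it in code, Mage can also read triggers from a triggers.yaml file next to the pipeline; the keys below follow the triggers-in-code format in Mage's documentation at the time of writing and may differ across versions, so treat this as a sketch and verify against your docs.
triggers:
- name: hourly_slickdeals_snapshot
  schedule_type: time
  schedule_interval: '@hourly'
  start_time: 2023-01-01 00:00:00
  status: active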

Final Thoughts
Congratulations! We have built a web scraper written in R that runs 24/7. With data pulled every hour, there is a lot we can accomplish. You can take a snapshot every hour and monitor how a given deal trends in popularity. You can build a more robust application that sends notifications based on customized queries. You can even create a data science project by training a classification model on top of this data. I hope this story inspires you to use R and Mage to build weekend side projects. Cheers!
About Me
I hope my stories are helpful to you.
For more data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gives you full access to stories on Medium.