How to Find the Best Deals On Time with R and Mage

Image Created by Midjourney

Automatically detecting great deals and coupons promptly can save you money and time. slickdeals.net is a community-based website where people share deal information. Slickdeals has a “deal alert” option that detects deals automatically, but it is limited: it doesn’t support customized filters like price, discount amount, or shopping site. We can quickly build a weekend project that finds the best deals on time.

Three years ago, I wrote an article about setting up a local server and scraping shopping deal information with Raspberry Pi and Scrapy. With new frameworks and tools available, I want to refresh the old version with the latest tooling.

How To Detect The Best Deals With Web Scraping

What Tools We Will Use

  • R: The previous version was done in Python. I have come to love writing R for my personal data-related projects, and I think it will be fun to write this one in R.
  • rvest: Scrapy is still one of Python’s most popular open-source scraping tools, but since I chose R this time, I will use rvest instead.
  • Mage: Mage is an engineer-friendly data workflow orchestrator and scheduling tool. I want to show off a nice UI and built-in scheduling this time instead of crontab.

Web Scraping 101

The key aspect of web scraping that we’d need to do manually is to find the proper data we are interested in, get its HTML identifier (id, class, XPath), then let the web scraping framework handle the rest. 

Let’s take slickdeals.net as an example and walk through, step by step, how to find the information you need to grab.

  1. Go to slickdeals.net.
  2. The landing page we will scrape is called Frontpage Slickdeals. Each item includes the product image, product title, product store/website, current price, original price, and thumbs-up/like count.
Slickdeal.net Landing Page | Image By Author

3. Use the developer tools in your browser to inspect an element on the website. After some observation, we notice that each deal is a card on the page. We need to find an HTML element we can iterate over to grab all of the cards in a loop. In this particular case, each item is recognized by a div tag with the class dealCard: <div class="dealCard dealCard">

Inspect Website | Image By Author

4. Within a given card, we can derive the rest of the data from its child elements. To get the class name for each field, repeat the steps above with the browser’s developer tools, find all the fields we are interested in, and extract them (a quick way to sanity-check these selectors in an R console is sketched below).

Inspect Website Elements For Price Information | Image By Author
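
Before wiring these selectors into a pipeline, it is worth sanity-checking them in a plain R session. Here is a minimal sketch with rvest; the class names are the ones observed on slickdeals.net at the time of writing and may change if the site is redesigned.

library(rvest)

page <- read_html("https://slickdeals.net/")

# How many deal cards does the selector find?
page %>% html_elements(".dealCard") %>% length()

# Peek at the first few titles to confirm the child selector is right.
page %>% html_elements(".dealCard__title") %>% html_text2() %>% head(3)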

Develop Web Scraping with R and Mage

Mage supports R development by default. Given how interactive development in Mage is, I will write the code in Mage directly instead of copying and pasting from RStudio.

To learn more about Mage and how its features compare with Airflow, see my earlier article “Is Apache Airflow Due for Replacement? The First Impression Of mage-ai.”

To start Mage, we can use either Docker with its latest image or pip; replace YOUR_PROJECT_NAME with the name you want.

				
pip install mage-ai
mage start YOUR_PROJECT_NAME

By default, Mage will run at http://localhost:6789. 
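
If you prefer Docker over pip, the quickstart from Mage’s documentation at the time of writing looks roughly like the command below; the mageai/mageai image name and /app/run_app.sh entrypoint are taken from that documentation, so check the current README if it has changed.

# Map the UI port and mount the current directory as the project home.
docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai \
  /app/run_app.sh mage start YOUR_PROJECT_NAME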

Once you open Mage’s UI, we can create a new data pipeline. Most data projects follow the ETL pattern: Extract data from the data source, Transform the data into insights, and Load the final data into another data source that is easy to consume.

We will also break our web scraping pipeline into 3 sections: 

  • Extract web data by web scraping using rvest
  • Transform the scraped data into insights; in our case, we’d like to perform some filtering
  • Load filtered data into a CSV file

Extract web data by web scraping using rvest

First, we click on the Data loader tab in Mage’s pipeline UI and give the data extraction task a name. Let’s call it extract_rvest as our Extraction task.

If you don’t provide a file name, Mage automatically assigns a random one. Why is the naming critical in Mage? Mage uses the file name as a UUID to map all the tasks’ dependencies and form a DAG. If you open the pipeline’s metadata.yaml, you’ll see something like the snippet below: the UUID is the lowercased version of the name you provided.

				
blocks:
- all_upstream_blocks_executed: true
  configuration: {}
  downstream_blocks: []
  executor_config: null
  executor_type: local_python
  language: r
  name: Extract_RVest
  status: updated
  type: data_loader
  upstream_blocks: []
  uuid: extract_rvest
data_integration: null
name: SlickDeal
type: python
uuid: slickdeal
widgets: []

Next, we can write down the code for scraping on the target website and pull data.

				
library(rvest)

# Pull the raw HTML from slickdeals.net.
slickdeal_raw <- read_html("https://slickdeals.net/")

# Each deal on the front page is a card with the class "dealCard".
deals <- slickdeal_raw %>% html_elements(".dealCard")

df <- NULL
for (deal in deals) {
  # Extract the fields we care about from each card.
  title <- deal %>% html_elements(".dealCard__title") %>% html_text()
  original_price <- deal %>% html_elements(".dealCard__originalPrice") %>% html_text()
  price <- deal %>% html_elements(".dealCard__price") %>% html_text()
  votes <- deal %>% html_elements(".dealCardSocialControls__voteCount") %>% html_text()

  df <- rbind(df, data.frame(title, original_price, price, votes))
}
df %>% head()

The code above does two things: it pulls the HTML from the Slickdeals site, then iterates over every card with the class dealCard. Once rvest finds all the items, we loop over that list, extract the values we want, and persist them into a dataframe.

A nice thing about Mage is that each task is executable on its own. Without triggering the whole pipeline, we can test each task before we build the entire DAG.

Run Extract Code Block | Image By Author

Transform Data Into Insights

Now that we have the dataframe in R, we can gain more insights from the scraped data. We can transform the data to see which items have the most votes and which are the most expensive.

If you write everything in one R script in RStudio, you usually keep reusing the dataframe defined earlier. Since we are using Mage, we can separate the tasks to make them more modular, which makes them much easier to debug and unit test.

Let’s write a one-liner in R to sort the dataframe by votes in descending order and return it.

				
library(dplyr)

transform <- function(df, ...) {
    # Sort deals by vote count, most popular first.
    df %>% arrange(desc(votes))
}

Once we have created a Transformer block, it automatically connects with the previous block as its data source. And since we have already executed the previous block, we can rerun just this block without rerunning everything from the top.

Run Transformer Code Block | Image By Author
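
The transformer above only sorts. If you also want the filtering mentioned earlier, for example keeping well-voted deals under a price cap, a minimal sketch could look like the block below. It assumes the column names produced by the extract block, that the scraped price and votes come through as text such as "$19.99" and "45", and the thresholds are arbitrary examples you can adjust.

library(dplyr)

transform <- function(df, ...) {
    df %>%
        mutate(
            # Strip "$", commas, and other symbols before converting to numbers.
            price_num = as.numeric(gsub("[^0-9.]", "", price)),
            votes_num = as.numeric(gsub("[^0-9]", "", votes))
        ) %>%
        # Example thresholds: at least 20 thumbs up and under $100.
        filter(votes_num >= 20, price_num <= 100) %>%
        arrange(desc(votes_num))
}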

Load Data into CSV

The final loading stage is similar to what we did before; we need to write the data out as a CSV file. We can grab Mage’s execution metadata and include it in the file name so the CSV isn’t overwritten on each run.

				
export_data <- function(df, ...) {
    # Use the pipeline's execution date in the file name so hourly runs
    # don't overwrite each other.
    execution_date <- global_vars['execution_date']
    write.csv(df, paste0("~/Downloads/my_slickdeal_", execution_date, ".csv"))
}
Output CSV Slickdeal | Image By Author

Mage DAG - Pipeline

If you want to build a more complex DAG, you can also connect blocks in the graph view instead of defining the dependencies by hand. Below is the DAG automatically created from the tasks we described earlier.

Mage Pipeline | Image By Author

Schedule The Pipeline

Since the Slickdeals front page updates frequently with the latest deals, we can take a snapshot of the information every hour and build trend charts or a time-series analysis.

You can easily set up a pipeline to run hourly.

Mage Schedule a Run by Hour | Image By Author
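
Once a few hourly runs have finished, the snapshots can be stitched together for exactly that kind of trend analysis. Here is a minimal sketch, assuming the CSVs land in ~/Downloads with the my_slickdeal_ prefix used by the exporter above; the readr/dplyr choices and the placeholder deal title are my own.

library(dplyr)
library(readr)

# Collect every hourly snapshot written by the export block.
files <- list.files("~/Downloads", pattern = "^my_slickdeal_.*\\.csv$", full.names = TRUE)

snapshots <- files %>%
    lapply(function(f) {
        read_csv(f, show_col_types = FALSE) %>%
            mutate(snapshot = basename(f))  # remember which run each row came from
    }) %>%
    bind_rows()

# Follow one deal's votes and price across snapshots (placeholder title).
snapshots %>%
    filter(title == "SOME DEAL TITLE") %>%
    arrange(snapshot) %>%
    select(snapshot, votes, price)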

Final Thoughts

Congratulations! We have built a web scraping pipeline written in R that runs 24/7. With data pulled every hour, there is a lot we can accomplish: monitor how a given deal trends in popularity, build an application that sends notifications for customized queries, or start a data science project by training a classification model on top of this data. I hope this story inspires you to use R and Mage to build weekend side projects. Cheers!

About Me

I hope my stories are helpful to you. 

For data engineering posts, you can subscribe to my new articles or become a referred Medium member, which gives you full access to stories on Medium.

In case of questions or comments, do not hesitate to write in the comments of this story or reach me directly on LinkedIn or Twitter.

