How to Find the Best Deals On Time with R and Mage

Image Created by Midjourney

Automatically detecting great deals and coupons promptly can save you money and time. slickdeals.net is a community-based website where people share deal information. Slickdeals has a “deal alert” option that detects deals automatically, but it is limited: it doesn’t support customized filters like price, discount amount, or shopping site. We can quickly build a weekend project that finds the best deals on time.

Three years ago, I wrote an article about setting up a local server and scraping shopping deal information with Raspberry Pi and Scrapy. With new frameworks and tools available, I want to refresh the old version with the latest tooling.

How To Detect The Best Deals With Web Scraping

What Tools We Will Use

  • R: The previous version was done in Python. I have come to love writing R for my personal data-related projects, and I think it will be fun to write this one in R.
  • rvest: Scrapy is still one of Python’s most popular open-source scraping tools, but since I chose R this time, I will use rvest instead.
  • Mage: Mage is an engineer-friendly data workflow orchestrator and scheduling tool. I want to show off a nice UI and built-in scheduling this time instead of crontab.

Web Scraping 101

The key aspect of web scraping that we’d need to do manually is to find the proper data we are interested in, get its HTML identifier (id, class, XPath), then let the web scraping framework handle the rest. 

Let’s take slickdeals.net as an example and walk through, step by step, how to find the information you need to grab.

  1. Go to slickdeals.net.
  2. The landing page we will scrape is called Frontpage Slickdeals. Each item includes the product image, product title, product store/website, current price, original price, and thumbs-up/like count.
Slickdeal.net Landing Page | Image By Author

3. Use the developer tools in your browser to inspect an element on the website. After some observation, we notice that each deal is a card on the page. We need to find an HTML element we can iterate over to grab all of the cards in a loop. In this particular case, each item is recognized by a div tag with the class dealCard: <div class="dealCard dealCard">

Inspect Website | Image By Author

4. Within a given card, we can derive the rest of the data from its child elements. To get the class name for each field, repeat the steps above with the browser’s developer tools, find all the fields we are interested in, and extract them (a quick way to sanity-check these selectors in an R console is sketched below).

Inspect Website Elements For Price Information | Image By Author
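
Before wiring these selectors into a pipeline, it is worth sanity-checking them in a plain R session. Here is a minimal sketch with rvest; the class names are the ones observed on slickdeals.net at the time of writing and may change if the site is redesigned.

library(rvest)

page <- read_html("https://slickdeals.net/")

# How many deal cards does the selector find?
page %>% html_elements(".dealCard") %>% length()

# Peek at the first few titles to confirm the child selector is right.
page %>% html_elements(".dealCard__title") %>% html_text2() %>% head(3)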

Develop Web Scraping with R and Mage

Mage supports R development by default. Given how interactive development in Mage is, I will write the code in Mage directly instead of copying and pasting from RStudio.

To learn more about Mage and how its features compare with Airflow, see my earlier article “Is Apache Airflow Due for Replacement? The First Impression Of mage-ai.”

To start Mage, we can use either Docker with its latest image or pip; replace YOUR_PROJECT_NAME with the name you want.

				
pip install mage-ai
mage start YOUR_PROJECT_NAME

By default, Mage will run at http://localhost:6789. 
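
If you prefer Docker over pip, the quickstart from Mage’s documentation at the time of writing looks roughly like the command below; the mageai/mageai image name and /app/run_app.sh entrypoint are taken from that documentation, so check the current README if it has changed.

# Map the UI port and mount the current directory as the project home.
docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai \
  /app/run_app.sh mage start YOUR_PROJECT_NAME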

Once you open Mage’s UI, we can create a new data pipeline. Most data projects follow the ETL pattern: Extract data from the data source, Transform the data into insights, and Load the final data into another data source that is easy to consume.

We will also break our web scraping pipeline into 3 sections: 

  • Extract web data by web scraping using rvest
  • Transform the scraped data into insights; in our case, we’d like to perform some filtering
  • Load filtered data into a CSV file

Extract web data by web scraping using rvest

First, we click on the Data loader tab in Mage’s pipeline UI and give the data extraction task a name. Let’s call it extract_rvest as our Extraction task.

If you don’t provide a file name, Mage automatically assigns a random one. Why is the naming critical in Mage? Mage uses the file name as a UUID to map all the tasks’ dependencies and form a DAG. If you open the pipeline’s metadata.yaml, you’ll see something like the snippet below: the UUID is the lowercased version of the name you provided.

				
blocks:
- all_upstream_blocks_executed: true
  configuration: {}
  downstream_blocks: []
  executor_config: null
  executor_type: local_python
  language: r
  name: Extract_RVest
  status: updated
  type: data_loader
  upstream_blocks: []
  uuid: extract_rvest
data_integration: null
name: SlickDeal
type: python
uuid: slickdeal
widgets: []

Next, we can write down the code for scraping on the target website and pull data.

				
library(rvest)

# Pull the raw HTML from slickdeals.net.
slickdeal_raw <- read_html("https://slickdeals.net/")

# Each deal on the front page is a card with the class "dealCard".
deals <- slickdeal_raw %>% html_elements(".dealCard")

df <- NULL
for (deal in deals) {
  # Extract the fields we care about from each card.
  title <- deal %>% html_elements(".dealCard__title") %>% html_text()
  original_price <- deal %>% html_elements(".dealCard__originalPrice") %>% html_text()
  price <- deal %>% html_elements(".dealCard__price") %>% html_text()
  votes <- deal %>% html_elements(".dealCardSocialControls__voteCount") %>% html_text()

  df <- rbind(df, data.frame(title, original_price, price, votes))
}
df %>% head()

The code above does two things: it pulls the HTML from the Slickdeals site, then iterates over every card with the class dealCard. Once rvest finds all the items, we loop over that list, extract the values we want, and persist them into a dataframe.

A nice thing about Mage is that each task is executable on its own. Without triggering the whole pipeline, we can test each task before we build the entire DAG.

Run Extract Code Block | Image By Author

Transform Data Into Insights

Now that we have the dataframe in R, we can gain more insights from the scraped data. We can transform the data to see which items have the most votes and which are the most expensive.

If you write everything in one R script in RStudio, you usually keep reusing the dataframe defined earlier. Since we are using Mage, we can separate the tasks to make them more modular, which makes them much easier to debug and unit test.

Let’s write a one-liner in R to sort the dataframe by votes in descending order and return it.

				
library(dplyr)

transform <- function(df, ...) {
    # Sort deals by vote count, most popular first.
    df %>% arrange(desc(votes))
}

Once we have created a Transformer block, it automatically connects with the previous block as its data source. And since we have already executed the previous block, we can rerun just this block without rerunning everything from the top.

Run Transformer Code Block | Image By Author
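
The transformer above only sorts. If you also want the filtering mentioned earlier, for example keeping well-voted deals under a price cap, a minimal sketch could look like the block below. It assumes the column names produced by the extract block, that the scraped price and votes come through as text such as "$19.99" and "45", and the thresholds are arbitrary examples you can adjust.

library(dplyr)

transform <- function(df, ...) {
    df %>%
        mutate(
            # Strip "$", commas, and other symbols before converting to numbers.
            price_num = as.numeric(gsub("[^0-9.]", "", price)),
            votes_num = as.numeric(gsub("[^0-9]", "", votes))
        ) %>%
        # Example thresholds: at least 20 thumbs up and under $100.
        filter(votes_num >= 20, price_num <= 100) %>%
        arrange(desc(votes_num))
}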

Load Data into CSV

The final loading stage is similar to what we did before; we need to write the data out as a CSV file. We can grab Mage’s execution metadata and include it in the file name so the CSV isn’t overwritten on each run.

				
export_data <- function(df, ...) {
    # Use the pipeline's execution date in the file name so hourly runs
    # don't overwrite each other.
    execution_date <- global_vars['execution_date']
    write.csv(df, paste0("~/Downloads/my_slickdeal_", execution_date, ".csv"))
}
Output CSV Slickdeal | Image By Author

Mage DAG - Pipeline

If you want to build a more complex DAG, you can also connect blocks in the graph view instead of defining the dependencies by hand. Below is the DAG automatically created from the tasks we described earlier.

Mage Pipeline | Image By Author

Schedule The Pipeline

Since the Slickdeals front page updates frequently with the latest deals, we can take a snapshot of the information every hour and build trend charts or a time-series analysis.

You can easily set up a pipeline to run hourly.

Mage Schedule a Run by Hour | Image By Author
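
Once a few hourly runs have finished, the snapshots can be stitched together for exactly that kind of trend analysis. Here is a minimal sketch, assuming the CSVs land in ~/Downloads with the my_slickdeal_ prefix used by the exporter above; the readr/dplyr choices and the placeholder deal title are my own.

library(dplyr)
library(readr)

# Collect every hourly snapshot written by the export block.
files <- list.files("~/Downloads", pattern = "^my_slickdeal_.*\\.csv$", full.names = TRUE)

snapshots <- files %>%
    lapply(function(f) {
        read_csv(f, show_col_types = FALSE) %>%
            mutate(snapshot = basename(f))  # remember which run each row came from
    }) %>%
    bind_rows()

# Follow one deal's votes and price across snapshots (placeholder title).
snapshots %>%
    filter(title == "SOME DEAL TITLE") %>%
    arrange(snapshot) %>%
    select(snapshot, votes, price)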

Final Thoughts

Congratulations! We have built a web scraping pipeline written in R that runs 24/7. With data pulled every hour, there is a lot we can accomplish: monitor how a given deal trends in popularity, build an application that sends notifications for customized queries, or start a data science project by training a classification model on top of this data. I hope this story inspires you to use R and Mage to build weekend side projects. Cheers!

About Me

I hope my stories are helpful to you. 

For data engineering posts, you can subscribe to my new articles or become a referred Medium member, which gives you full access to stories on Medium.

In case of questions or comments, do not hesitate to write in the comments of this story or reach me directly on LinkedIn or Twitter.

