MySQL vs dplyr: A Syntax Comparison

Author

Otulaja Paul

Introduction

In the ever-evolving landscape of data analysis and manipulation, the choice of tools and techniques can greatly influence the efficiency and effectiveness of the analysis process. This report delves into a comparative exploration of two prominent data manipulation approaches: MySQL and the dplyr package in the R programming language. With a focus on their syntax, functionalities, and applications, this study aims to provide a comprehensive understanding of the strengths and nuances associated with each method.

dplyr is a widely used R package that provides a concise and intuitive framework for data manipulation and transformation. Developed by Hadley Wickham, dplyr aims to simplify the process of working with data by offering a consistent set of functions that can handle common data manipulation tasks efficiently and effectively. It enhances the capabilities of R for data analysis by providing a set of verbs that allow users to express complex data manipulations in a more readable and concise manner.

The purpose of this report is to offer a brief comparison between MySQL and dplyr syntax. As data analysts and researchers strive to extract meaningful insights from increasingly complex datasets, the selection of an appropriate data manipulation tool becomes pivotal. This paper is created to provide a brief example on an alternative to querying database using SQL commands.

The Setup

The following code is used to establish a connection to a MySQL database using the DBI (Database Interface) and odbc packages. It also sets up the connection details required to interact with the MySQL database. For this experiment, we shall use the sakila database found in MySQL server.

# The libraries needed
library(DBI)  # For database connection
library(tidyverse) # Contains dplyr for tidy-querying and ggplot for data Viz...
con <- dbConnect(odbc::odbc(), 
                 .connection_string = 
                   "Driver={MySQL ODBC 8.1 Unicode Driver};", 
                 server = "localhost", 
                 Database = "sakila", 
                 UID = "root", 
                 PWD = "test1234",
                 port = 3306)

First Query: List the top 5 most frequently rented films and their rental count

We use MySQL syntax to obtain the table requested.

# List the top 5 most frequently rented films and their rental count...
sql1 <-
"SELECT f.film_id, f.title, COUNT(*) AS rental_count
FROM film f
JOIN inventory i ON f.film_id = i.film_id
JOIN rental r ON i.inventory_id = r.inventory_id
GROUP BY f.film_id, f.title
ORDER BY rental_count DESC
LIMIT 5;"

dbGetQuery(con, sql1) |> 
  knitr::kable(caption = 'Frequently rented movies (top 5) SQL version')
Frequently rented movies (top 5) SQL version
film_id title rental_count
103 BUCKET BROTHERHOOD 34
738 ROCKETEER MOTHER 33
331 FORWARD TEMPLE 32
382 GRIT CLOCKWORK 32
489 JUGGLER HARDLY 32

We generate same table using dplyr syntax.

# Obtain the tables from the database
query0 <- "SELECT * FROM film"
film <- dbGetQuery(con, query0)

query1 <- "SELECT * FROM inventory"
inventory <- dbGetQuery(con, query1)

query2 <- "SELECT * FROM rental"
rental <- dbGetQuery(con, query2)
# List the top 5 most frequently rented films and their rental count...
film |> 
  inner_join(inventory, by = 'film_id') |> 
  inner_join(rental, by = 'inventory_id') |> 
  group_by(film_id, title) |> 
  summarise(rental_count = n()) |> 
  arrange(desc(rental_count)) |> 
  head(5)|> 
  knitr::kable(caption = 'Frequently rented movies (top 5) dplyr version')
Frequently rented movies (top 5) dplyr version
film_id title rental_count
103 BUCKET BROTHERHOOD 34
738 ROCKETEER MOTHER 33
331 FORWARD TEMPLE 32
382 GRIT CLOCKWORK 32
489 JUGGLER HARDLY 32

Second Query: Find the revenue generated by each store.

The table below shows the revenue generated by the stores using MySQL syntax.

# find the revenue generated by each store
sql2<- 
"SELECT s.store_id, SUM(p.amount) AS total_revenue
FROM store s
JOIN staff st ON s.store_id = st.store_id
JOIN payment p ON st.staff_id = p.staff_id
GROUP BY s.store_id;"

dbGetQuery(con, sql2)|> 
  knitr::kable(caption = 'Revenue generated by each store. SQL version')
Revenue generated by each store. SQL version
store_id total_revenue
1 33482.50
2 33924.06

Similarly, we display the revenue generated by the stores using dplyr syntax.

# Obtain the tables needed from the database
query3 <- "SELECT * FROM staff"
staff <- dbGetQuery(con, query3)

query4 <- "SELECT * FROM payment"
payment <- dbGetQuery(con, query4)

query5 <- "SELECT * FROM store"
store <- dbGetQuery(con, query5)
# find the revenue generated by each store:
store |> 
  inner_join(staff, by = 'store_id') |> 
  inner_join(payment, by = 'staff_id') |> 
  group_by(store_id) |> 
  summarize(total_revenue = sum(amount)) |> 
  head(5)|> 
  knitr::kable(caption = 'Revenue generated by each store. dplyr version')
Revenue generated by each store. dplyr version
store_id total_revenue
1 33482.50
2 33924.06

Third Query: Find the top 10 grossing films (by revenue)

In this third query, we shall obtain top 10 grossing films by revenue generated and proceed to plot it using bar chart. Below is the result using MySQL syntax:

# find the top 10 grossing films (by revenue):
sql3 <-
  "SELECT f.film_id, f.title, SUM(p.amount) AS total_revenue
FROM film f
JOIN inventory i ON f.film_id = i.film_id
JOIN rental r ON i.inventory_id = r.inventory_id
JOIN payment p ON r.rental_id = p.rental_id
GROUP BY f.film_id, f.title
ORDER BY total_revenue DESC
LIMIT 10;"

dbGetQuery(con, sql3)|> 
  # The plot added
  ggplot(aes(x = total_revenue, y = reorder(title,total_revenue)))+
  geom_bar(stat = 'identity', fill = 'darkblue')+
  theme_minimal()+
  theme(legend.position = 'none')+
  labs(y = 'Movies', x= 'Total revenue',
       caption= 'Top 10 grossing films (MySQL version)')

We plot the same chart using the dplyr syntax below:

# Find the top 10 grossing films (by revenue):
film |> 
  inner_join(inventory, by = 'film_id') |> 
  inner_join(rental, by = 'inventory_id') |> 
  inner_join(payment, by = 'rental_id') |> 
  group_by(film_id, title) |> 
  summarise(total_revenue = sum(amount)) |> 
  arrange(desc(total_revenue)) |> 
  head(10) |> 
  # The plot added
  ggplot(aes(x = total_revenue, y = reorder(title,total_revenue)))+
  geom_bar(stat = 'identity', fill = 'darkred')+
  theme_minimal()+
  theme(legend.position = 'none')+
  labs(y = 'Movies', x= 'Total revenue',
       caption = 'Top 10 grossing films (dplyr version)')

Endnote

We have been able to demonstrate how the dplyr operates compared to MySQL in terms of data manipulation. We also intended to make you see the possibility of querying databases using R, especially if its the language you are more comfortable using like me. It is worth noting that the dplyr syntax looks a bit longer because we first have to load the tables from the database.
If you read to the end, thank you.