# The libraries needed
library(DBI) # For database connection
library(tidyverse) # Contains dplyr for tidy-querying and ggplot for data Viz...
MySQL vs dplyr: A Syntax Comparison
Introduction
In the ever-evolving landscape of data analysis and manipulation, the choice of tools and techniques can greatly influence the efficiency and effectiveness of the analysis process. This report delves into a comparative exploration of two prominent data manipulation approaches: MySQL and the dplyr package in the R programming language. With a focus on their syntax, functionalities, and applications, this study aims to provide a comprehensive understanding of the strengths and nuances associated with each method.
dplyr is a widely used R package that provides a concise and intuitive framework for data manipulation and transformation. Developed by Hadley Wickham, dplyr aims to simplify the process of working with data by offering a consistent set of functions that can handle common data manipulation tasks efficiently and effectively. It enhances the capabilities of R for data analysis by providing a set of verbs that allow users to express complex data manipulations in a more readable and concise manner.
The purpose of this report is to offer a brief comparison between MySQL and dplyr syntax. As data analysts and researchers strive to extract meaningful insights from increasingly complex datasets, the selection of an appropriate data manipulation tool becomes pivotal. This paper is created to provide a brief example on an alternative to querying database using SQL commands.
The Setup
The following code is used to establish a connection to a MySQL database using the DBI (Database Interface) and odbc packages. It also sets up the connection details required to interact with the MySQL database. For this experiment, we shall use the sakila
database found in MySQL server.
<- dbConnect(odbc::odbc(),
con .connection_string =
"Driver={MySQL ODBC 8.1 Unicode Driver};",
server = "localhost",
Database = "sakila",
UID = "root",
PWD = "test1234",
port = 3306)
First Query: List the top 5 most frequently rented films and their rental count
We use MySQL syntax to obtain the table requested.
# List the top 5 most frequently rented films and their rental count...
<-
sql1 "SELECT f.film_id, f.title, COUNT(*) AS rental_count
FROM film f
JOIN inventory i ON f.film_id = i.film_id
JOIN rental r ON i.inventory_id = r.inventory_id
GROUP BY f.film_id, f.title
ORDER BY rental_count DESC
LIMIT 5;"
dbGetQuery(con, sql1) |>
::kable(caption = 'Frequently rented movies (top 5) SQL version') knitr
film_id | title | rental_count |
---|---|---|
103 | BUCKET BROTHERHOOD | 34 |
738 | ROCKETEER MOTHER | 33 |
331 | FORWARD TEMPLE | 32 |
382 | GRIT CLOCKWORK | 32 |
489 | JUGGLER HARDLY | 32 |
We generate same table using dplyr syntax.
# Obtain the tables from the database
<- "SELECT * FROM film"
query0 <- dbGetQuery(con, query0)
film
<- "SELECT * FROM inventory"
query1 <- dbGetQuery(con, query1)
inventory
<- "SELECT * FROM rental"
query2 <- dbGetQuery(con, query2)
rental # List the top 5 most frequently rented films and their rental count...
|>
film inner_join(inventory, by = 'film_id') |>
inner_join(rental, by = 'inventory_id') |>
group_by(film_id, title) |>
summarise(rental_count = n()) |>
arrange(desc(rental_count)) |>
head(5)|>
::kable(caption = 'Frequently rented movies (top 5) dplyr version') knitr
film_id | title | rental_count |
---|---|---|
103 | BUCKET BROTHERHOOD | 34 |
738 | ROCKETEER MOTHER | 33 |
331 | FORWARD TEMPLE | 32 |
382 | GRIT CLOCKWORK | 32 |
489 | JUGGLER HARDLY | 32 |
Second Query: Find the revenue generated by each store.
The table below shows the revenue generated by the stores using MySQL syntax.
# find the revenue generated by each store
<-
sql2"SELECT s.store_id, SUM(p.amount) AS total_revenue
FROM store s
JOIN staff st ON s.store_id = st.store_id
JOIN payment p ON st.staff_id = p.staff_id
GROUP BY s.store_id;"
dbGetQuery(con, sql2)|>
::kable(caption = 'Revenue generated by each store. SQL version') knitr
store_id | total_revenue |
---|---|
1 | 33482.50 |
2 | 33924.06 |
Similarly, we display the revenue generated by the stores using dplyr syntax.
# Obtain the tables needed from the database
<- "SELECT * FROM staff"
query3 <- dbGetQuery(con, query3)
staff
<- "SELECT * FROM payment"
query4 <- dbGetQuery(con, query4)
payment
<- "SELECT * FROM store"
query5 <- dbGetQuery(con, query5)
store # find the revenue generated by each store:
|>
store inner_join(staff, by = 'store_id') |>
inner_join(payment, by = 'staff_id') |>
group_by(store_id) |>
summarize(total_revenue = sum(amount)) |>
head(5)|>
::kable(caption = 'Revenue generated by each store. dplyr version') knitr
store_id | total_revenue |
---|---|
1 | 33482.50 |
2 | 33924.06 |
Third Query: Find the top 10 grossing films (by revenue)
In this third query, we shall obtain top 10 grossing films by revenue generated and proceed to plot it using bar chart. Below is the result using MySQL syntax:
# find the top 10 grossing films (by revenue):
<-
sql3 "SELECT f.film_id, f.title, SUM(p.amount) AS total_revenue
FROM film f
JOIN inventory i ON f.film_id = i.film_id
JOIN rental r ON i.inventory_id = r.inventory_id
JOIN payment p ON r.rental_id = p.rental_id
GROUP BY f.film_id, f.title
ORDER BY total_revenue DESC
LIMIT 10;"
dbGetQuery(con, sql3)|>
# The plot added
ggplot(aes(x = total_revenue, y = reorder(title,total_revenue)))+
geom_bar(stat = 'identity', fill = 'darkblue')+
theme_minimal()+
theme(legend.position = 'none')+
labs(y = 'Movies', x= 'Total revenue',
caption= 'Top 10 grossing films (MySQL version)')
We plot the same chart using the dplyr syntax below:
# Find the top 10 grossing films (by revenue):
|>
film inner_join(inventory, by = 'film_id') |>
inner_join(rental, by = 'inventory_id') |>
inner_join(payment, by = 'rental_id') |>
group_by(film_id, title) |>
summarise(total_revenue = sum(amount)) |>
arrange(desc(total_revenue)) |>
head(10) |>
# The plot added
ggplot(aes(x = total_revenue, y = reorder(title,total_revenue)))+
geom_bar(stat = 'identity', fill = 'darkred')+
theme_minimal()+
theme(legend.position = 'none')+
labs(y = 'Movies', x= 'Total revenue',
caption = 'Top 10 grossing films (dplyr version)')
Endnote
We have been able to demonstrate how the dplyr operates compared to MySQL in terms of data manipulation. We also intended to make you see the possibility of querying databases using R, especially if its the language you are more comfortable using like me. It is worth noting that the dplyr syntax looks a bit longer because we first have to load the tables from the database.
If you read to the end, thank you.