Mini Project #1: Protein and Fat Composition Analysis

Tom Baudry

08 October 2025

1. Introduction

The goal of this project is to explore how protein and fat composition vary across recipes with different protein levels. Using the all_recipes dataset, I analyze whether meals with higher protein content tend to have lower relative fat content. This question connects to nutritional balance and the concept of macronutrient substitution.
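To make "relative fat content" concrete, here is a minimal sketch of one possible definition: fat as a share of the combined fat, carbs, and protein values. This definition is an assumption for illustration and may differ from the measure used later in the analysis. The names sample_recipes, macro_total, fat_share, and protein_share are hypothetical. The values are copied from the first six rows of the dataset preview in this report, and the three nutrient columns are assumed to share the same unit.

```r
library(dplyr)

# First six rows of the all_recipes preview (nutrient columns only).
# Assumption: fat, carbs, and protein are measured in the same unit.
sample_recipes <- tibble(
  calories = c(222, 477, 354, 356, 366, 709),
  fat      = c(13, 31, 18, 9, 22, 47),
  carbs    = c(24, 43, 32, 53, 23, 31),
  protein  = c(6, 8, 20, 19, 19, 37)
)

# Hypothetical definition: each macronutrient as a share of the
# combined fat + carbs + protein total for that recipe.
sample_recipes <- sample_recipes %>%
  mutate(
    macro_total   = fat + carbs + protein,
    fat_share     = fat / macro_total,
    protein_share = protein / macro_total
  )

print(sample_recipes)
```

Using shares rather than raw amounts keeps large and small servings comparable, which matters when asking whether protein displaces fat rather than whether bigger meals simply contain more of everything.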

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --

## v dplyr 1.1.4 v readr 2.1.5

## v forcats 1.0.0 v stringr 1.5.1

## v ggplot2 3.5.2 v tibble 3.3.0

## v lubridate 1.9.4 v tidyr 1.3.1

## v purrr 1.1.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --

## x dplyr::filter() masks stats::filter()

## x dplyr::lag() masks stats::lag()

## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Rows: 15,593

## Columns: 22

## $ name <chr> "Chewy Whole Wheat Peanut Butter Brownies", "Pumpkin~

## $ url <chr> "https://www.allrecipes.com/recipe/140717/chewy-whol~

## $ author <chr> "DMOMMY", "Bobbie Susan", "Bren", "Sarah Brekke", "P~

## $ date_published <chr> "2020-06-18", "2022-09-26", "2018-06-08", "2025-03-0~

## $ calories <int> 222, 477, 354, 356, 366, 709, 466, 782, 355, 395, 37~

## $ fat <int> 13, 31, 18, 9, 22, 47, 27, 61, 15, 12, 14, 7, 8, 20,~

## $ carbs <int> 24, 43, 32, 53, 23, 31, 1, 19, 33, 33, 50, 16, 29, 1~

## $ protein <int> 6, 8, 20, 19, 19, 37, 52, 40, 23, 37, 17, 4, 4, 28, ~

## $ avg_rating <dbl> 4.4, 5.0, 4.8, 4.3, 4.7, 4.2, 4.4, 4.6, 4.7, 4.7, 4.~

## $ total_ratings <int> 47, 1, 4, 14, 84, 5, 648, 347, 129, 195, 33, 2, 84, ~

## $ reviews <int> 36, 1, 4, 13, 67, 3, 468, 259, 102, 153, 23, 1, 71, ~

## $ prep_time <int> 20, 10, 10, 20, 30, 45, 15, 10, 30, 20, 5, 5, 15, 15~

## $ cook_time <int> 35, 5, 75, 40, 95, 80, 45, 190, 480, 55, 0, 5, 60, 3~

## $ total_time <int> 55, 495, 85, 60, 125, 155, 60, 200, 510, 75, 485, 10~

## $ servings <int> 16, 8, 4, 8, 8, 12, 6, 8, 12, 6, 1, 4, 15, 6, 12, 4,~

## $ cuisine <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

## $ published_year <dbl> 2020, 2022, 2018, 2025, 2024, 2019, 2023, 2022, 2023~

## $ published_month <dbl> 6, 9, 6, 3, 12, 4, 1, 7, 1, 11, 2, 11, 11, 1, 11, 1,~

## $ published_day <int> 18, 26, 8, 3, 11, 3, 4, 14, 19, 14, 5, 26, 12, 18, 3~

## $ published_weekday <ord> Thursday, Monday, Friday, Monday, Wednesday, Wednesd~

## $ contains_eggs <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, ~

## $ uses_cans <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, ~

## # A tibble: 15,593 x 22

## name url author date_published calories fat carbs protein avg_rating

## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl>

## 1 Chewy Wh~ http~ DMOMMY 2020-06-18 222 13 24 6 4.4

## 2 Pumpkin ~ http~ Bobbi~ 2022-09-26 477 31 43 8 5

## 3 Eggs Poa~ http~ Bren 2018-06-08 354 18 32 20 4.8

## 4 Minestro~ http~ Sarah~ 2025-03-03 356 9 53 19 4.3

## 5 Yummy St~ http~ Procr~ 2024-12-11 366 22 23 19 4.7

## 6 Prime Ri~ http~ Cajun~ 2019-04-03 709 47 31 37 4.2

## 7 Parmesan~ http~ Anika 2023-01-04 466 27 1 52 4.4

## 8 Chicken ~ http~ Bob C~ 2022-07-14 782 61 19 40 4.6

## 9 Sweet Po~ http~ Dean 2023-01-19 355 15 33 23 4.7

## 10 Quick Ba~ http~ Chris~ 2024-11-14 395 12 33 37 4.7

## # i 15,583 more rows

## # i 13 more variables: total_ratings <int>, reviews <int>, prep_time <int>,

## # cook_time <int>, total_time <int>, servings <int>, cuisine <chr>,

## # published_year <dbl>, published_month <dbl>, published_day <int>,

## # published_weekday <ord>, contains_eggs <lgl>, uses_cans <lgl>

2. Data preparation

I first inspect the dataset to understand its structure. The column ingredients contains long text with special characters that can cause rendering issues, so it is excluded from the initial preview.

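# Print the dataset structure (excluding ingredients).
# glimpse() returns its input invisibly, so the outer print() also shows the first rows as a tibble.
print(glimpse(select(all_recipes, -ingredients)))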
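The column exclusion can be tried on a small stand-in table. The sketch below is illustrative only: toy_recipes, preview, and the ingredients text are hypothetical placeholders, not rows from all_recipes. It uses any_of() inside select(), a tidyselect helper that drops the column when it is present and does nothing when it is absent, so the preview code does not fail if the column has already been removed.

```r
library(dplyr)

# Hypothetical stand-in for all_recipes with a long free-text column.
toy_recipes <- tibble(
  name        = c("Recipe A", "Recipe B"),
  protein     = c(6L, 8L),
  ingredients = c("placeholder ingredient text", "more placeholder text")
)

# Drop the free-text column before previewing; any_of() tolerates
# the column being missing, unlike a bare -ingredients.
preview <- toy_recipes %>%
  select(-any_of("ingredients"))

# glimpse() shows one line per column with its type and first values.
glimpse(preview)
```

Running the same select() call on preview a second time returns it unchanged, whereas select(preview, -ingredients) would raise an error because the column no longer exists.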