Mini Project #1: Protein and Fat Composition Analysis
Tom Baudry
08 October 2025
1. Introduction
The goal of this project is to explore how protein and fat composition varies across recipes with different
protein levels. Using the all_recipes dataset, I analyze whether meals with higher protein content tend to
have lower relative fat content. This question connects to nutritional balance and the concept of macronutrient
substitution.
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr 1.1.4 v readr 2.1.5
## v forcats 1.0.0 v stringr 1.5.1
## v ggplot2 3.5.2 v tibble 3.3.0
## v lubridate 1.9.4 v tidyr 1.3.1
## v purrr 1.1.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
2. Data preparation
We first inspect the dataset to understand its structure. The column ingredients contains long text with
special characters that can cause rendering issues, so it is excluded from the initial preview.
# print the dataset structure (excluding ingredients)
print(glimpse(select(all_recipes, -ingredients)))
## Rows: 15,593
## Columns: 22
## $ name <chr> "Chewy Whole Wheat Peanut Butter Brownies", "Pumpkin~
## $ url <chr> "https://www.allrecipes.com/recipe/140717/chewy-whol~
## $ author <chr> "DMOMMY", "Bobbie Susan", "Bren", "Sarah Brekke", "P~
## $ date_published <chr> "2020-06-18", "2022-09-26", "2018-06-08", "2025-03-0~
## $ calories <int> 222, 477, 354, 356, 366, 709, 466, 782, 355, 395, 37~
## $ fat <int> 13, 31, 18, 9, 22, 47, 27, 61, 15, 12, 14, 7, 8, 20,~
## $ carbs <int> 24, 43, 32, 53, 23, 31, 1, 19, 33, 33, 50, 16, 29, 1~
## $ protein <int> 6, 8, 20, 19, 19, 37, 52, 40, 23, 37, 17, 4, 4, 28, ~
## $ avg_rating <dbl> 4.4, 5.0, 4.8, 4.3, 4.7, 4.2, 4.4, 4.6, 4.7, 4.7, 4.~
## $ total_ratings <int> 47, 1, 4, 14, 84, 5, 648, 347, 129, 195, 33, 2, 84, ~
## $ reviews <int> 36, 1, 4, 13, 67, 3, 468, 259, 102, 153, 23, 1, 71, ~
## $ prep_time <int> 20, 10, 10, 20, 30, 45, 15, 10, 30, 20, 5, 5, 15, 15~
## $ cook_time <int> 35, 5, 75, 40, 95, 80, 45, 190, 480, 55, 0, 5, 60, 3~
## $ total_time <int> 55, 495, 85, 60, 125, 155, 60, 200, 510, 75, 485, 10~
## $ servings <int> 16, 8, 4, 8, 8, 12, 6, 8, 12, 6, 1, 4, 15, 6, 12, 4,~
## $ cuisine <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ published_year <dbl> 2020, 2022, 2018, 2025, 2024, 2019, 2023, 2022, 2023~
## $ published_month <dbl> 6, 9, 6, 3, 12, 4, 1, 7, 1, 11, 2, 11, 11, 1, 11, 1,~
## $ published_day <int> 18, 26, 8, 3, 11, 3, 4, 14, 19, 14, 5, 26, 12, 18, 3~
## $ published_weekday <ord> Thursday, Monday, Friday, Monday, Wednesday, Wednesd~
## $ contains_eggs <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, ~
## $ uses_cans <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, ~
## # A tibble: 15,593 x 22
## name url author date_published calories fat carbs protein avg_rating
## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Chewy Wh~ http~ DMOMMY 2020-06-18 222 13 24 6 4.4
## 2 Pumpkin ~ http~ Bobbi~ 2022-09-26 477 31 43 8 5
## 3 Eggs Poa~ http~ Bren 2018-06-08 354 18 32 20 4.8
## 4 Minestro~ http~ Sarah~ 2025-03-03 356 9 53 19 4.3
## 5 Yummy St~ http~ Procr~ 2024-12-11 366 22 23 19 4.7
## 6 Prime Ri~ http~ Cajun~ 2019-04-03 709 47 31 37 4.2
## 7 Parmesan~ http~ Anika 2023-01-04 466 27 1 52 4.4
## 8 Chicken ~ http~ Bob C~ 2022-07-14 782 61 19 40 4.6
## 9 Sweet Po~ http~ Dean 2023-01-19 355 15 33 23 4.7
## 10 Quick Ba~ http~ Chris~ 2024-11-14 395 12 33 37 4.7
## # i 15,583 more rows
## # i 13 more variables: total_ratings <int>, reviews <int>, prep_time <int>,
## # cook_time <int>, total_time <int>, servings <int>, cuisine <chr>,
## # published_year <dbl>, published_month <dbl>, published_day <int>,
## # published_weekday <ord>, contains_eggs <lgl>, uses_cans <lgl>
When I looked at the protein values, I saw that most dishes contain between 0 and 100 grams of protein.
Only a few fall outside this range: 19 out of more than 15,000 dishes. These 19 dishes are outliers: they
stretch the range massively and either distort the results or make the data harder to read. Using the
filter() function from dplyr, we keep only the dishes that contain between 0 and 100 grams of protein.
The same bounds are applied to fat.
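Before filtering, it can help to count how many rows fall outside the range and how many are missing. The following is a minimal sketch on made-up values: `toy`, `out_counts`, and `kept` are hypothetical names, and the real analysis would use all_recipes instead.

```r
library(dplyr)

# Made-up data standing in for all_recipes (hypothetical values)
toy <- data.frame(
  name    = c("a", "b", "c", "d"),
  protein = c(12, 250, NA, 40),
  fat     = c(8, 30, 5, 120)
)

# Count values outside the 0-100 g range, and missing values, per nutrient
out_counts <- toy %>%
  summarise(
    protein_out = sum(protein < 0 | protein >= 100, na.rm = TRUE),
    fat_out     = sum(fat < 0 | fat >= 100, na.rm = TRUE),
    protein_na  = sum(is.na(protein)),
    fat_na      = sum(is.na(fat))
  )
out_counts

# filter() keeps only rows where every condition is TRUE,
# so rows with NA in protein or fat are dropped as well
kept <- toy %>%
  filter(protein >= 0, protein < 100, fat >= 0, fat < 100)

# Number of rows removed by the filter
nrow(toy) - nrow(kept)
```

Because the filter also drops missing values and out-of-range fat values, the number of rows removed can be larger than the number of protein outliers alone.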
# Filter and create protein categories
df_cat <- all_recipes %>%
  select(-ingredients) %>%
  filter(protein >= 0, protein < 100, fat >= 0, fat < 100) %>%
  mutate(protein_level = cut(protein,
    breaks = c(0, 5, 10, 15, 20, 30, 40, 60, 80, 100, Inf),
    labels = c("0-5", "5-10", "10-15", "15-20", "20-30", "30-40",
               "40-60", "60-80", "80-100", "100+"),
    # right = FALSE: intervals are closed on the left and open on the right, [a, b)
    right = FALSE
  ))
df_cat
## # A tibble: 15,126 x 23
## name url author date_published calories fat carbs protein avg_rating
## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Chewy Wh~ http~ DMOMMY 2020-06-18 222 13 24 6 4.4
## 2 Pumpkin ~ http~ Bobbi~ 2022-09-26 477 31 43 8 5
## 3 Eggs Poa~ http~ Bren 2018-06-08 354 18 32 20 4.8
## 4 Minestro~ http~ Sarah~ 2025-03-03 356 9 53 19 4.3
## 5 Yummy St~ http~ Procr~ 2024-12-11 366 22 23 19 4.7
## 6 Prime Ri~ http~ Cajun~ 2019-04-03 709 47 31 37 4.2
## 7 Parmesan~ http~ Anika 2023-01-04 466 27 1 52 4.4
## 8 Chicken ~ http~ Bob C~ 2022-07-14 782 61 19 40 4.6
## 9 Sweet Po~ http~ Dean 2023-01-19 355 15 33 23 4.7
## 10 Quick Ba~ http~ Chris~ 2024-11-14 395 12 33 37 4.7
## # i 15,116 more rows
## # i 14 more variables: total_ratings <int>, reviews <int>, prep_time <int>,
## # cook_time <int>, total_time <int>, servings <int>, cuisine <chr>,
## # published_year <dbl>, published_month <dbl>, published_day <int>,
## # published_weekday <ord>, contains_eggs <lgl>, uses_cans <lgl>,
## # protein_level <fct>
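To see where boundary values land with right = FALSE, here is a small sketch using the same breaks and labels on made-up protein values chosen to sit on or near bin edges. The vector `protein` and the result `bins` are illustrative only and are not taken from the dataset.

```r
# Hypothetical protein values placed on or near bin edges
protein <- c(0, 4, 5, 9, 10, 99)

bins <- cut(protein,
            breaks = c(0, 5, 10, 15, 20, 30, 40, 60, 80, 100, Inf),
            labels = c("0-5", "5-10", "10-15", "15-20", "20-30", "30-40",
                       "40-60", "60-80", "80-100", "100+"),
            right = FALSE)  # intervals are [a, b)

# A value equal to a break (5, 10) goes into the bin that starts at it
data.frame(protein, bins)
```

With this setting a recipe with exactly 5 g of protein is labelled "5-10", not "0-5", and a recipe with 0 g is kept in "0-5" instead of becoming NA.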
We also compute the average protein and average fat for each group and reshape the data into a tidy format
suitable for plotting. With these values computed, we are ready to move on to the visualization.
# Calculate means and prepare data for plotting
df_summary <- df_cat %>%
  group_by(protein_level) %>%
  # na.rm = TRUE ignores NA values so that the mean is not NA
  summarise(protein = mean(protein, na.rm = TRUE),
            fat = mean(fat, na.rm = TRUE)) %>%
  # pivot_longer() stacks the protein and fat columns into a single column named
  # nutrient, which takes the value "protein" or "fat"
  pivot_longer(c(protein, fat), names_to = "nutrient", values_to = "value") %>%
  group_by(protein_level) %>%
  # Add a column named share: the fraction of each group's protein + fat total
  # that each nutrient accounts for
  mutate(share = value / sum(value))
df_summary
## # A tibble: 18 x 4
## # Groups: protein_level [9]
## protein_level nutrient value share
## <fct> <chr> <dbl> <dbl>
## 1 0-5 protein 2.30 0.218
## 2 0-5 fat 8.23 0.782
## 3 5-10 protein 6.61 0.296
## 4 5-10 fat 15.7 0.704
## 5 10-15 protein 11.9 0.400
## 6 10-15 fat 17.8 0.600
## 7 15-20 protein 16.9 0.461
## 8 15-20 fat 19.8 0.539
## 9 20-30 protein 24.3 0.510
## 10 20-30 fat 23.3 0.490
## 11 30-40 protein 33.9 0.542
## 12 30-40 fat 28.7 0.458
## 13 40-60 protein 46.7 0.573
## 14 40-60 fat 34.9 0.427
## 15 60-80 protein 66.8 0.615
## 16 60-80 fat 41.9 0.385
## 17 80-100 protein 88.9 0.647
## 18 80-100 fat 48.6 0.353
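As a worked check of the share column, the sketch below recomputes the shares for the first and last groups from the mean values printed in the table above. The data frame `means` is a hypothetical helper built by hand from those printed values, so results match the table only up to rounding.

```r
# Mean grams copied from the summary table (first and last groups)
means <- data.frame(
  protein_level = c("0-5", "80-100"),
  protein       = c(2.30, 88.9),
  fat           = c(8.23, 48.6)
)

# share = value / sum(value) within each group, written out for two nutrients
total <- means$protein + means$fat
means$protein_share <- round(means$protein / total, 3)
means$fat_share     <- round(means$fat / total, 3)

means
```

For the 0-5 group this gives 2.30 / (2.30 + 8.23), which rounds to 0.218, matching the share reported in the table.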
3. Visualization
The following visualization shows how protein and fat composition varies across recipes with different
protein levels:
# Create plot
ggplot(df_summary, aes(x = protein_level, y = share, fill = nutrient)) +
  geom_col(color = "white", width = 0.8) +
  # Set the fill color for each nutrient (protein and fat)
  scale_fill_manual(values = c(protein = "#4F46E5", fat = "#F59E0B")) +
  # Format the y axis as percentages
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Protein vs Fat Proportions",
       subtitle = "How Protein and Fat Composition Varies with Protein Content Level",
       x = "Protein Content Categories (g)",
       y = "Average Proportion (%)",
       fill = "Nutrients :") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top",
        legend.justification = "left",
        legend.background = element_blank(),
        legend.title = element_text(size = 11, face = "bold"),
        legend.text = element_text(size = 11),
        legend.key = element_blank(),
        legend.margin = margin(t = 5, r = 0, b = 5, l = 10),
        legend.spacing.x = unit(0.5, "cm"),
        legend.box.margin = margin(0, 0, 0, 0),
        axis.text.x = element_text(angle = 20, hjust = 1),
        plot.title = element_text(face = "bold", size = 20,
                                  margin = margin(t = 25, r = 0, b = 7, l = 10)),
        plot.subtitle = element_text(face = "plain", size = 12,
                                     margin = margin(t = 0, r = 0, b = 8, l = 10)),
        axis.title.x = element_text(face = "plain", size = 11, margin = margin(t = 10)),
        axis.title.y = element_text(face = "plain", size = 11, margin = margin(r = 10)))
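As an alternative sketch, not the approach used in this report, ggplot2 can normalise each stack itself with position = "fill", so the share column does not have to be computed beforehand. The data frame `means_long` and the object `p` are hypothetical; the values are three groups copied by hand from the summary table.

```r
library(ggplot2)

# Mean grams per group, copied from three rows of the summary table
means_long <- data.frame(
  protein_level = factor(rep(c("0-5", "20-30", "80-100"), each = 2),
                         levels = c("0-5", "20-30", "80-100")),
  nutrient = rep(c("protein", "fat"), times = 3),
  value    = c(2.30, 8.23, 24.3, 23.3, 88.9, 48.6)
)

# position = "fill" rescales every bar so that its segments sum to 1
p <- ggplot(means_long, aes(x = protein_level, y = value, fill = nutrient)) +
  geom_col(position = "fill", color = "white", width = 0.8) +
  scale_fill_manual(values = c(protein = "#4F46E5", fat = "#F59E0B")) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Protein Content Categories (g)",
       y = "Average Proportion (%)")

p
```

Precomputing share, as done in the report, has the advantage that the exact proportions are available in a table; position = "fill" is shorter when only the picture is needed.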
4. Interpretation of results
The plot shows how the relative proportions of protein and fat change across categories. Several clear patterns
emerge:
For low-protein meals (0–10 g), fat makes up the majority of the combined protein and fat content (about 70–78% by weight).
As protein levels increase, the share of fat decreases, indicating a shift toward relatively leaner recipes, even
though the average fat in grams also rises (from 8.2 g to 48.6 g).
Beyond 30–40 g of protein, the change is more gradual: the protein share rises from about 54% to 65%, so these
dishes are protein-dominant (e.g., meats, legumes, protein-rich dishes).
Low protein dishes (0-15g): Show higher relative fat content compared to protein
Medium protein dishes (15-50g): Exhibit more balanced protein-fat ratios
High protein dishes (50g+): Demonstrate dominant protein proportions with reduced fat percentages
This suggests a macronutrient substitution effect, where protein replaces fat as the larger of the two components
by weight. Because the shares are computed from grams, they describe composition by mass, not by energy.
The trend can also help identify dietary profiles, for example recipes suitable for athletes or low-fat diets.
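The shares in this report are computed from grams. As a rough check of how the same comparison looks in energy terms, the sketch below weights the group means by calories per gram. It assumes the common approximation of 4 kcal/g for protein and 9 kcal/g for fat, which is not part of the original analysis, and `means` is a hypothetical data frame copied by hand from the summary table.

```r
# Mean grams per protein group, copied from the summary table
means <- data.frame(
  protein_level = c("0-5", "5-10", "10-15", "15-20", "20-30",
                    "30-40", "40-60", "60-80", "80-100"),
  protein = c(2.30, 6.61, 11.9, 16.9, 24.3, 33.9, 46.7, 66.8, 88.9),
  fat     = c(8.23, 15.7, 17.8, 19.8, 23.3, 28.7, 34.9, 41.9, 48.6)
)

# Assumed conversion factors (not from the report): 4 kcal/g protein, 9 kcal/g fat
kcal_protein <- 4 * means$protein
kcal_fat     <- 9 * means$fat

# Share of protein + fat energy contributed by each nutrient
means$protein_energy_share <- kcal_protein / (kcal_protein + kcal_fat)
means$fat_energy_share     <- 1 - means$protein_energy_share

means
```

Under these assumptions the fat share of energy still falls as protein rises, but fat stays above half of the combined protein and fat energy in every group, so the crossover seen in grams does not carry over to calories.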
5. Design explanation
This visualization was designed to be clear, minimal, and accessible:
Chart type: I chose a stacked bar chart because it directly shows the proportion of fat to protein. At a
glance, we can tell whether a category contains proportionally more or less protein.
Color palette: Blue for protein (associated with strength and health) and orange for fat (energy-dense) create
strong visual contrast and make the chart easy to read.
Minimal theme: theme_minimal() removes the grey panel background and border, leaving a plain white background
that keeps the focus on the data (high data-ink ratio).
Legibility: I slightly rotated the axis labels for readability, positioned the legend at the top for quick
identification, and added spacing around the titles and labels.
Structure: The plot follows the visual design principles of contrast, alignment, and proximity. The goal is a
chart that is easy and pleasant to read while still conveying precise values from a large dataset.
6. Key takeaway
Overall, the analysis reveals a consistent pattern:
Recipes with higher protein content tend to have a lower fat proportion.
This visualization shows the inverse relationship between protein content and fat proportion across
recipe categories, giving a clear view of nutritional composition trends. It also highlights nutritional
diversity across recipes and suggests that protein-rich recipes tend to be leaner in relative terms; whether
they are healthier cannot be concluded from these two nutrients alone.