Protein vs Fat Proportions

How Protein and Fat Composition Varies with Protein Content Level

library(tidyverse)

Part 1: The data

Data Cleaning and Filtering

To answer this question, we first look at the dataset and select the fields we need: protein, fat, carbs, and calories.

Once we have the data, we need to clean and understand it. Looking at the protein content, most dishes have between 0 and 100 grams of protein; only 19 of the 15,000+ dishes fall outside this range. These 19 dishes are outliers: they stretch the range enormously, which distorts the averages and makes the plot harder to read.
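One way to check how many dishes fall outside the range before filtering is to count them. The sketch below is only an illustration: `toy_recipes` is a hypothetical stand-in for `all_recipes`, and its values are invented.

```r
library(dplyr)

# Hypothetical stand-in for all_recipes; the values are invented for illustration
toy_recipes <- data.frame(
  protein = c(3, 12, 45, 250, 99, 1200),
  fat     = c(10, 8, 20, 30, 5, 400)
)

# Flag each dish as inside or outside the 0-100 g protein range, then count both groups
outlier_counts <- toy_recipes %>%
  mutate(in_range = protein >= 0 & protein < 100) %>%
  count(in_range)

outlier_counts
```

Running the same count on the real data would show how many dishes the filter is about to drop.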

Using the filter() function from dplyr, we keep only the dishes that contain at least 0 and less than 100 grams of protein, and we apply the same bounds to fat:

filter(protein >= 0, protein < 100, fat >= 0, fat < 100)

# Filter and create protein categories
df_cat <- all_recipes %>%
  filter(protein >= 0, protein < 100, fat >= 0, fat < 100) %>%
  mutate(protein_level = cut(protein,
    breaks = c(0, 5, 10, 15, 20, 30, 40, 60, 80, 100, Inf),
    labels = c("0-5", "5-10", "10-15", "15-20", "20-30", "30-40",
               "40-60", "60-80", "80-100", "100+"),
    # right = FALSE makes the intervals closed on the left and open on the right: [a, b)
    right = FALSE
  ))
df_cat
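With `right = FALSE`, `cut()` builds intervals that are closed on the left and open on the right, so a dish with exactly 5 g of protein lands in "5-10" rather than "0-5". The short sketch below shows this on a few boundary values; the vector `x` and the reduced set of breaks are made up for illustration.

```r
# Boundary values chosen to show which side of each interval is closed
x <- c(0, 4.9, 5, 10, 99.9)

bins <- cut(x,
  breaks = c(0, 5, 10, 100),
  labels = c("0-5", "5-10", "10-100"),
  right = FALSE   # intervals are [a, b): left-closed, right-open
)

data.frame(x, bins)
```

Here 0 and 4.9 fall in "0-5", 5 falls in "5-10", and 10 and 99.9 fall in "10-100". With the default `right = TRUE`, a value of exactly 0 would become NA unless `include.lowest = TRUE` is also set.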

Data Transformation

We calculate the average nutrient values for each protein category and reshape the data for visualization:

# Calculate means and prepare data for plotting
df_summary <- df_cat %>%
  group_by(protein_level) %>%
  # na.rm = TRUE drops NA values so that the mean is not NA
  summarise(protein = mean(protein, na.rm = TRUE),
            fat = mean(fat, na.rm = TRUE)) %>%
  # pivot_longer() stacks the fat and protein columns into a single column,
  # nutrient, whose value is "protein" or "fat"
  pivot_longer(c(protein, fat), names_to = "nutrient", values_to = "value") %>%
  group_by(protein_level) %>%
  # Add a column, share: each nutrient's fraction of the combined protein + fat mean
  mutate(share = value / sum(value))

df_summary
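To make the `share` column concrete, here is a minimal worked example on two categories. The means in `toy_means` are invented numbers, not results from the recipe data.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-category means in grams (invented for illustration)
toy_means <- data.frame(
  protein_level = c("0-5", "20-30"),
  protein = c(2, 24),
  fat = c(6, 16)
)

toy_share <- toy_means %>%
  # One row per (category, nutrient) pair
  pivot_longer(c(protein, fat), names_to = "nutrient", values_to = "value") %>%
  group_by(protein_level) %>%
  # Within each category the two shares sum to 1
  mutate(share = value / sum(value)) %>%
  ungroup()

toy_share
```

In the "0-5" row group, protein gets 2 / (2 + 6) = 0.25 and fat gets 0.75. Because the share is relative to protein plus fat only, it says nothing about carbs or about the absolute grams of fat.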

The visualization in Part 2 shows how protein and fat composition varies across the protein content categories.

Part 2: The plot

Stacked Bar Chart Interpretation

Once we have checked the recipes and reduced them to the relevant entries, we can move on to the graph. The goal is to show clearly how the fat proportion of a dish changes with the amount of protein it contains.

library(dplyr)
library(tidyr)    # provides pivot_longer()
library(ggplot2)

# Filter and create protein categories
df_cat <- all_recipes %>%
  filter(protein >= 0, protein < 100, fat >= 0, fat < 100) %>%
  mutate(protein_level = cut(protein,
    breaks = c(0, 5, 10, 15, 20, 30, 40, 60, 80, 100, Inf),
    labels = c("0-5", "5-10", "10-15", "15-20", "20-30", "30-40",
               "40-60", "60-80", "80-100", "100+"),
    # right = FALSE makes the intervals closed on the left and open on the right: [a, b)
    right = FALSE
  ))

# Calculate means and prepare data for plotting
df_summary <- df_cat %>%
  group_by(protein_level) %>%
  # na.rm = TRUE drops NA values so that the mean is not NA
  summarise(protein = mean(protein, na.rm = TRUE),
            fat = mean(fat, na.rm = TRUE)) %>%
  # pivot_longer() stacks the fat and protein columns into a single column,
  # nutrient, whose value is "protein" or "fat"
  pivot_longer(c(protein, fat), names_to = "nutrient", values_to = "value") %>%
  group_by(protein_level) %>%
  # Add a column, share: each nutrient's fraction of the combined protein + fat mean
  mutate(share = value / sum(value))

# Create plot
ggplot(df_summary, aes(x = protein_level, y = share, fill = nutrient)) +
  geom_col(color = "white", width = 0.8) +
  # Set the fill color for each nutrient (protein and fat)
  scale_fill_manual(values = c(protein = "#4F46E5", fat = "#F59E0B")) +
  # Format the y axis as percentages
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Protein vs Fat Proportions",
       subtitle = "How Protein and Fat Composition Varies with Protein Content Level",
       x = "Protein Content Categories (g)", 
       y = "Average Proportion (%)",
       fill = "Nutrients:") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top",
        legend.justification = "left",
        legend.background = element_blank(),
        legend.title = element_text(size = 11, face = "bold"),
        legend.text = element_text(size = 11),
        legend.key = element_blank(),
        legend.margin = margin(t = 5, r = 0, b = 5, l = 10),
        legend.spacing.x = unit(0.5, "cm"),
        legend.box.margin = margin(0, 0, 0, 0),
        axis.text.x = element_text(angle = 20, hjust = 1),
        plot.title = element_text(face = "bold", size = 18, margin = margin(t = 30, r = 0, b = 7, l = 10)),
        plot.subtitle = element_text(face = "plain", size = 12, margin = margin(t = 0, r = 0, b = 8, l = 10)),
        axis.title.x = element_text(face = "plain", size = 11, margin = margin(t = 10, b = 20)),
        axis.title.y = element_text(face = "plain", size = 11, margin = margin(r = 10)))

Key Findings

The stacked bar chart reveals several patterns:

  • Low protein dishes (0-15g): fat makes up a larger share than protein
  • Medium protein dishes (15-50g): protein and fat shares are more balanced
  • High protein dishes (50g+): the protein share is higher and the fat share is lower

The plot shows that, on average, the more protein a dish contains, the smaller its share of fat relative to protein. For people trying to reduce their fat consumption, this suggests that choosing dishes with more protein will usually mean a lower fat proportion. Even moving from the 0-5 gram category to the 10-15 gram category lowers the fat proportion by more than 15%. In the 20 to 30 gram category, the fat proportion falls to as low as 25%.