
## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Chewy Wh~ http~ DMOMMY 2020-06-18 222 13 24 6 4.4
## 2 Pumpkin ~ http~ Bobbi~ 2022-09-26 477 31 43 8 5
## 3 Eggs Poa~ http~ Bren 2018-06-08 354 18 32 20 4.8
## 4 Minestro~ http~ Sarah~ 2025-03-03 356 9 53 19 4.3
## 5 Yummy St~ http~ Procr~ 2024-12-11 366 22 23 19 4.7
## 6 Prime Ri~ http~ Cajun~ 2019-04-03 709 47 31 37 4.2
## 7 Parmesan~ http~ Anika 2023-01-04 466 27 1 52 4.4
## 8 Chicken ~ http~ Bob C~ 2022-07-14 782 61 19 40 4.6
## 9 Sweet Po~ http~ Dean 2023-01-19 355 15 33 23 4.7
## 10 Quick Ba~ http~ Chris~ 2024-11-14 395 12 33 37 4.7
## # i 15,583 more rows
## # i 13 more variables: total_ratings <int>, reviews <int>, prep_time <int>,
## # cook_time <int>, total_time <int>, servings <int>, cuisine <chr>,
## # published_year <dbl>, published_month <dbl>, published_day <int>,
## # published_weekday <ord>, contains_eggs <lgl>, uses_cans <lgl>
2. Data preparation
We first inspect the dataset to understand its structure. The column ingredients contains long text with
special characters that can cause rendering issues, so it is excluded from the initial preview.
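# Print the dataset structure (excluding ingredients).
# glimpse() returns its input invisibly, so the outer print() also
# prints the tibble itself after the glimpse output.
print(glimpse(select(all_recipes, -ingredients)))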
## Rows: 15,593
## Columns: 22
## $ name <chr> "Chewy Whole Wheat Peanut Butter Brownies", "Pumpkin~
## $ url <chr> "https://www.allrecipes.com/recipe/140717/chewy-whol~
## $ author <chr> "DMOMMY", "Bobbie Susan", "Bren", "Sarah Brekke", "P~
## $ date_published <chr> "2020-06-18", "2022-09-26", "2018-06-08", "2025-03-0~
## $ calories <int> 222, 477, 354, 356, 366, 709, 466, 782, 355, 395, 37~
## $ fat <int> 13, 31, 18, 9, 22, 47, 27, 61, 15, 12, 14, 7, 8, 20,~
## $ carbs <int> 24, 43, 32, 53, 23, 31, 1, 19, 33, 33, 50, 16, 29, 1~
## $ protein <int> 6, 8, 20, 19, 19, 37, 52, 40, 23, 37, 17, 4, 4, 28, ~
## $ avg_rating <dbl> 4.4, 5.0, 4.8, 4.3, 4.7, 4.2, 4.4, 4.6, 4.7, 4.7, 4.~
## $ total_ratings <int> 47, 1, 4, 14, 84, 5, 648, 347, 129, 195, 33, 2, 84, ~
## $ reviews <int> 36, 1, 4, 13, 67, 3, 468, 259, 102, 153, 23, 1, 71, ~
## $ prep_time <int> 20, 10, 10, 20, 30, 45, 15, 10, 30, 20, 5, 5, 15, 15~
## $ cook_time <int> 35, 5, 75, 40, 95, 80, 45, 190, 480, 55, 0, 5, 60, 3~
## $ total_time <int> 55, 495, 85, 60, 125, 155, 60, 200, 510, 75, 485, 10~
## $ servings <int> 16, 8, 4, 8, 8, 12, 6, 8, 12, 6, 1, 4, 15, 6, 12, 4,~
## $ cuisine <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ published_year <dbl> 2020, 2022, 2018, 2025, 2024, 2019, 2023, 2022, 2023~
## $ published_month <dbl> 6, 9, 6, 3, 12, 4, 1, 7, 1, 11, 2, 11, 11, 1, 11, 1,~
## $ published_day <int> 18, 26, 8, 3, 11, 3, 4, 14, 19, 14, 5, 26, 12, 18, 3~
## $ published_weekday <ord> Thursday, Monday, Friday, Monday, Wednesday, Wednesd~
## $ contains_eggs <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, ~
## $ uses_cans <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, ~
## # A tibble: 15,593 x 22
## name url author date_published calories fat carbs protein avg_rating
## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Chewy Wh~ http~ DMOMMY 2020-06-18 222 13 24 6 4.4
## 2 Pumpkin ~ http~ Bobbi~ 2022-09-26 477 31 43 8 5
## 3 Eggs Poa~ http~ Bren 2018-06-08 354 18 32 20 4.8