Recipe box

Online recipes typically consist of several components: a recipe title, a list of ingredients and measurements, instructions for preparation, and a picture of the resulting dish. I haven’t been able to find any open datasets that contain all of these elements, so I scraped ~125,000 recipes from various food websites¹.

A typical recipe looks something like this:

Title: Guacamole
Ingredients:
- 3 Haas avocados, halved, seeded and peeled
- 1 lime, juiced
- 1/2 teaspoon kosher salt
- 1/2 teaspoon ground cumin
- 1/2 teaspoon cayenne
- 1/2 medium onion, diced
- 1/2 jalapeño pepper, seeded and minced
- 2 Roma tomatoes, seeded and diced
- 1 tablespoon chopped cilantro
- 1 clove garlic, minced
Instructions: In a large bowl place the scooped avocado pulp and lime juice, toss to coat. Drain, and reserve the lime juice, after all of the avocados have been coated. Using a potato masher add the salt, cumin, and cayenne and mash. Then, fold in the onions, tomatoes, cilantro, and garlic. Add 1 tablespoon of the reserved lime juice. Let sit at room temperature for 1 hour and then serve.
Source²: http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe
Picture: [image not reproduced here]
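
For readers who want to work with recipes programmatically, the components listed above map naturally onto a small record type. The following is only a sketch: the class name `Recipe`, its field names, and the `has_picture` helper are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recipe:
    """One scraped recipe. Field names are illustrative assumptions,
    not the dataset's real schema."""
    title: str
    ingredients: List[str]
    instructions: str
    source: Optional[str] = None   # source URL (hosted copies store a hash instead)
    picture: Optional[str] = None  # image path or link; not every recipe has one

    def has_picture(self) -> bool:
        # True only when an image reference is attached to the recipe.
        return self.picture is not None


# A shortened version of the guacamole example above.
guacamole = Recipe(
    title="Guacamole",
    ingredients=[
        "3 Haas avocados, halved, seeded and peeled",
        "1 lime, juiced",
        "1/2 teaspoon kosher salt",
    ],
    instructions="Using a potato masher add the salt, cumin, and cayenne and mash.",
    source="http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe",
)
```

Keeping `picture` optional reflects the fact that only some of the recipes come with an image.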

This dataset is particularly interesting for machine learning because each recipe contains multiple elements, each of which provides additional information about the recipe. Current deep learning models excel at learning the relationship between one element and a single other element (e.g., image-to-text, text-to-image, text-to-summarized-text). This dataset has been used for several deep learning projects so far:

- Deep Learning for Emojis with VS Code Tools for AI [Microsoft Machine Learning Blog]: recipe prediction using word and emoji embeddings
- Recipe summarization: generate a title for a recipe given the corresponding ingredients and instructions
- Food GAN: generate novel food images using Generative Adversarial Networks
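
To make the summarization setup concrete, here is a minimal sketch of how one recipe could be turned into an (input, target) pair, with the ingredients and instructions as the input text and the title as the target. The function name `to_summarization_pair`, the dictionary keys, and the text layout are assumptions for illustration; the actual project may format its examples differently.

```python
from typing import Any, Dict, Tuple


def to_summarization_pair(recipe: Dict[str, Any]) -> Tuple[str, str]:
    """Return (model_input, target_title) for one recipe dictionary.

    The keys "title", "ingredients", and "instructions" are assumed here.
    """
    # Join the ingredient lines into a single string.
    body = "Ingredients: " + "; ".join(recipe["ingredients"])
    # Append the preparation text on its own line.
    body += "\nInstructions: " + recipe["instructions"]
    return body, recipe["title"]


example = {
    "title": "Guacamole",
    "ingredients": ["3 Haas avocados, halved, seeded and peeled", "1 lime, juiced"],
    "instructions": "Using a potato masher add the salt, cumin, and cayenne and mash.",
}
model_input, target = to_summarization_pair(example)
```

A sequence-to-sequence model would then be trained to produce `target` from `model_input`.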

However, these models don’t make full use of a recipe’s entire information set and structure. I’m particularly interested in using multiple recipe elements simultaneously to learn high-dimensional representations of recipes for use in recipe summarization, interpolation, and generation. I’d love to hear from you if you’ve worked on a similar problem.


Footnotes

¹ Roughly 70,000 of these recipes have images associated with them. Comments and ratings data are not included.
² I’ve hashed all source URLs in the hosted datasets I’ve made available. The original source URLs can be recovered by re-running the scrapers, as described in the project documentation.