Online recipes typically consist of several components: a recipe title, a list of ingredients and measurements, instructions for preparation, and a picture of the resulting dish. I haven't been able to find an open dataset containing all of these elements, so I scraped ~125,000 recipes from various food websites¹.

A typical recipe looks something like this:

  • Title: Guacamole
  • Ingredients:
    • 3 Haas avocados, halved, seeded and peeled
    • 1 lime, juiced
    • 1/2 teaspoon kosher salt
    • 1/2 teaspoon ground cumin
    • 1/2 teaspoon cayenne
    • 1/2 medium onion, diced
    • 1/2 jalapeño pepper, seeded and minced
    • 2 Roma tomatoes, seeded and diced
    • 1 tablespoon chopped cilantro
    • 1 clove garlic, minced
  • Instructions: In a large bowl place the scooped avocado pulp and lime juice, toss to coat. Drain, and reserve the lime juice, after all of the avocados have been coated. Using a potato masher add the salt, cumin, and cayenne and mash. Then, fold in the onions, tomatoes, cilantro, and garlic. Add 1 tablespoon of the reserved lime juice. Let sit at room temperature for 1 hour and then serve.
  • Source²:
  • Picture:
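A recipe like the one above can be stored as a single structured record. As a minimal sketch, here is one way it might look as JSON (the field names are illustrative assumptions, not the dataset's actual schema):

```python
import json

# A hypothetical JSON encoding of the guacamole recipe above.
# Field names ("title", "ingredients", ...) are illustrative only.
recipe = {
    "title": "Guacamole",
    "ingredients": [
        "3 Haas avocados, halved, seeded and peeled",
        "1 lime, juiced",
        "1/2 teaspoon kosher salt",
    ],
    "instructions": "In a large bowl place the scooped avocado pulp "
                    "and lime juice, toss to coat. ...",
    "picture": "guacamole.jpg",  # ~70,000 of the recipes include an image
}

print(json.dumps(recipe, indent=2))
```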

This dataset is particularly interesting for machine learning because each recipe contains multiple elements, each of which provides additional information about the recipe. Current deep learning models excel at learning the relationship between one element and a single other element (e.g., image-to-text, text-to-image, text-to-summarized-text). The dataset has already been used for several deep learning projects.

However, these models don't make full use of a recipe's entire information set and structure; I'm particularly interested in using multiple recipe elements simultaneously to learn high-dimensional representations of recipes for use in recipe summarization, interpolation, and generation. I'd love to hear from you if you've worked on a similar problem.
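To make the idea concrete, here is a minimal sketch of how several recipe elements could be combined into one training example for a multimodal or sequence-to-sequence model. The function and field names are my assumptions for illustration, not an API the dataset provides:

```python
# Sketch: fold multiple recipe elements into a single training example.
# Field names ("title", "ingredients", ...) are assumed, not the
# dataset's actual schema.
def to_training_example(recipe):
    """Combine title and ingredients into conditioning text, with the
    instructions as the generation target and the image as an optional
    paired input."""
    conditioning = recipe["title"] + "\n" + "\n".join(recipe["ingredients"])
    return {
        "text": conditioning,              # e.g. encoder input
        "target": recipe["instructions"],  # e.g. decoder target
        "image": recipe.get("picture"),    # optional paired image path
    }

example = to_training_example({
    "title": "Guacamole",
    "ingredients": ["3 Haas avocados", "1 lime, juiced"],
    "instructions": "Mash the avocados, then fold in the remaining ingredients.",
})
```

A model trained on such triples sees more of the recipe's structure than a single element-to-element pairing does.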



  1. Roughly 70,000 of these recipes have images associated with them. Comments and ratings data are not included. 

  2. I’ve hashed all source URLs in the hosted datasets I’ve made available. The original source URLs can be recovered by re-running the scrapers, as the project documentation describes.
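For illustration, URL hashing of this kind can be done with a standard cryptographic hash; the sketch below uses SHA-256, though the exact hash function used in the hosted dataset is an assumption on my part:

```python
import hashlib

# Sketch: anonymize a source URL with a one-way hash.
# The actual hash function used in the hosted dataset is an assumption.
def hash_url(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

hashed = hash_url("https://example.com/recipes/guacamole")
print(hashed)
```

Because the hash is deterministic, the same source URL always maps to the same identifier, so recipes can still be deduplicated without exposing the original links.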