Feeding the machines

The AI data licensing market: insights and deal trends

Jun 10, 2025


This post is written by Jessica Li Gebert, who works with Neudata on data monetization projects.

Can you imagine that ChatGPT was once useless? 

I was writing my master's thesis on the politics and regulations of civilian-military dual-use AI tools in November 2022. Naturally, I had to try this chat.openai.com website. Baby ChatGPT was novel but struggled to understand user prompts and hallucinated. A lot. 

Today, can you imagine not using ChatGPT or another LLM? So…what happened? An obvious explanation is data. (Alongside improved reasoning architecture and computing capabilities, but that is a story for another day.)

As the old GIGO (garbage in, garbage out) dictum suggests, training and fine-tuning AI models with quality data is fundamental. Since late 2022, AI model makers have been quietly and rapidly inking content licensing deals to secure training sets.

Yet little is publicly known about the market for data used to train AI models. In this AI data series, I will dissect this elusive and opaque market in bite-sized posts. If you are a data-rich company looking for a new revenue opportunity, follow this series to stay updated on the most important AI data market developments, and reach out to consulting@neudata.co to find out how you can feed the machines!

To kick off, I have compiled 52 AI data licensing deals known to date. Below are my top five observations. 

Please note this post excludes the Chinese market, where complex geopolitics and regulations warrant a separate and detailed analysis in future posts. 

Figure: known AI data licensing deals per year since the launch of ChatGPT in November 2022 - 5 deals in 2023, 42 in 2024, and 4 through May 2025.

Table: AI data licensing deals, showing licensee, licensor, content type, data modality, content language, deal year, data use case, and deal size.
*Table compiled by Jessica Li Gebert at Neudata for editorial and informational purposes only, using publicly available sources. All rights and trademarks belong to their respective owners.

Observation 1: deal value driven by data volume, domain expertise, and dynamism

The largest known deals, ranked by value, are:

  1. Google-Reddit, $203m total contract value
  2. Microsoft-Taylor & Francis, $10m reported upfront plus $65m non-recurring revenue recognized in 2024
  3. Multiple-Shutterstock, $25m-$50m per deal

These deals suggest AI data prices are largely driven by:

  • Volume: Reddit has an estimated 500m-1b monthly active users, Taylor & Francis owns 3,000 academic journals, and Shutterstock is the world’s largest stock photo provider.
  • Domain expertise: Taylor & Francis covers a wide range of academic research, while Shutterstock provides a variety of curated visual (and audio) data.
  • Dynamism: While Reddit may not offer the curated, fact-checked content that news media does, its USP (unique selling proposition) lies in the frequency and volume of new data posted on its platform.

Interestingly, the number of use cases does not seem to affect pricing - for instance, OpenAI offered merely $1m-$5m to use Le Monde and Prisa’s historical and current news data for both training and information retrieval purposes. But this could also be attributed to the relatively small volume of each corpus.

Observation 2: news media dominate deal-making by volume

Out of the 52 deals, 45 (87%) are textual data deals while 7 pertain to visual data. 

Specifically, 40 deals were signed with news and print media publishers (e.g. magazines), compared with merely 3 with academic publishers and 2 with user-generated content platforms (e.g. social media). All 7 visual data deals were signed with Shutterstock for its image, video, and metadata.

The primary drivers for this are:

  • News media offers vetted, curated, time-stamped, and grammatically correct content on a variety of topics - perfect for understanding natural language.
  • News media copyright structure is simple - the publisher owns the rights - thus making licensing negotiations easy. Imagine the copyright quagmire in academic publishing, where researchers, universities, and publishers have varying degrees of copyright ownership. Or social media platforms’ lack of copyright over user-generated content - a licensing nightmare!
  • News media content carries fewer compliance risks because publishers are generally responsible for ensuring that any personal data (PII) or confidential and proprietary data included in articles complies with privacy laws.
  • Traditional news sources are often fact-checked by humans, thus reducing the ethical risk of training AI models on deeply biased, false, or obscene content that can run rampant on online forums, scraped blogs, or social media.
  • Finally, unlike user-generated content, news media content has well-known and well-tested legal standing for cross-border transfer.

Observation 3: 77% (40) of the deals license data for real-time information retrieval (e.g. RAG systems)

Only 16 deals (31%) are for model training, and 4 deals cover both use cases.

This coincides with the maturing of RAG architecture* and the rise of AI search engines, where an AI tool fetches new or live information from an external source to generate a response. 

From a data point of view, AI model makers are licensing news media content to surface relevant information to user queries - try asking Google, Perplexity or ChatGPT** a current affairs question and you will most likely get a link to a news source!

*RAG refers to Retrieval-Augmented Generation, an AI framework that combines a large language model with an external information retrieval system. Instead of relying solely on its training data, the model retrieves relevant documents or data in real time to generate more accurate and grounded responses.

** Note: Perplexity is a native RAG system but ChatGPT merely has RAG-like behaviors - that’s another story for another day.
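
To make this concrete, below is a minimal Python sketch of the RAG pattern - not any vendor's actual implementation. It uses a made-up three-article corpus and naive keyword matching in place of the vector search and model call a production system would use; every source name and document is hypothetical.

    # A toy illustration of the RAG flow described above: retrieve the most
    # relevant licensed documents for a query, then hand them to the model
    # as grounded, attributable context.

    # Hypothetical licensed corpus, e.g. articles obtained under a news licensing deal
    LICENSED_CORPUS = [
        {"source": "Example Gazette", "text": "Central bank holds interest rates steady in June"},
        {"source": "Example Tribune", "text": "New trade agreement signed between the EU and Japan"},
        {"source": "Example Daily", "text": "Tech firms announce record spending on AI data centres"},
    ]

    def retrieve(query, corpus, k=2):
        """Rank documents by naive keyword overlap with the query - a stand-in
        for the vector search a production RAG system would actually use."""
        query_terms = set(query.lower().split())
        ranked = sorted(
            corpus,
            key=lambda doc: len(query_terms & set(doc["text"].lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def build_prompt(query):
        """Assemble a prompt that grounds the model in retrieved, attributed
        sources - the 'fetch live information' step discussed in this post."""
        docs = retrieve(query, LICENSED_CORPUS)
        context = "\n".join(f"[{doc['source']}] {doc['text']}" for doc in docs)
        return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

    # In a real system the prompt would be sent to an LLM; here we just print it.
    print(build_prompt("What did the central bank do with interest rates?"))

The point is the shape of the flow: licensed content sits in an external store, the most relevant pieces are retrieved per query, and the response can attribute its sources - which is exactly why fresh, reliable news data is so valuable for this use case.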

Observation 4: Perplexity and OpenAI are the top data buyers in the current market

Perplexity signed 37% of the deals, followed by OpenAI at 29% - making them the top data buyers out of the 12 model makers that have signed publicly known data licensing deals. 

A closer look shows their prolific deal-making is driven by the need for live information retrieval:

  • Perplexity has been aggressively onboarding news data providers through its Publishers Program since 2024. As a native RAG search engine, working with news media is an expected development in Perplexity’s business model.
  • OpenAI’s shopping spree could help explain the recent drastic improvements in GPT performance, as its models now fetch live information from the open web.
  • On a smaller scale, Microsoft has been working with news media too, to enhance its Copilot. Its purchasing behavior may have to do with its existing investment in and partnership with OpenAI.

Surprisingly, Google is not a prolific buyer - at least not based on publicly available information. Gemini’s data licenses may have been included in Google’s search engine arrangement with content owners such as publishers, news outlets, and website owners. (Hey Google, if you are reading this, please reach out! I would love to pick your brain.)

Observation 5: we have lots of work to do in terms of linguistic and cultural diversity in AI

In these 52 licensing deals, there are 38 unique data providers and 12 unique AI model makers. Yet only five languages are represented, out of the roughly 7,000 languages spoken in the world.

English, French, German, Japanese, and Spanish.

(And Chinese in the Chinese AI market. I will write about this separately in a future post.)

In AI, these are known as high-resource languages - they have extensive digital content and are the focus of most linguistic research.

It is estimated that half of the world’s population speaks high-resource languages. In other words, the other half - approximately 4 billion people - are digitally disenfranchised because our AI tools are simply not built to understand their languages and the nuances of their cultures. This is partially attributable to the fact that our current technology - the LLM - requires a large volume of training data that low-resource languages simply do not have.

What does this all mean?

The AI data market is emerging fast and I am here for it! The deals known today suggest the market is currently chasing large volumes of news data for two reasons:

  1. Our current dominant AI model type is the LLM, which is trained on large volumes of textual data
  2. We are using AI chatbots as search engines. With increasing (AI-generated) disinformation, the need to fetch and attribute reliable news sources is paramount

As underlying technologies progress and our collective understanding of the impact of AI grows, I expect to see data demand evolve in the following ways:

  • Multimodal models and native speech models will drive demand for studio recordings and natural speech data
  • The demand for expertise data - such as financial, scientific research, legal, medical, and consumer preference data - will continue to grow
  • The rise of small language models could bridge the digital linguistic divide and drive demand for low-resource language data
  • Finally, data is the next battleground in great power politics. Data will become more politicized than it already is, and additional cross-border data transfer regulations will likely be implemented worldwide

For data-rich companies, AI is not just a costly operating strategy to implement - it is a new, external revenue stream. To find out how you could monetize your enterprise data, please contact consulting@neudata.co.
