Decentralized data markets for training AI

Thursday 15th

15:20 – 16:00

Theatre 25


Keywords defining the session:

- AI

- TCRs

- Regulatory issues

Takeaway points of the session:

- Use cases where proprietary interests and social needs come into conflict for machine learning.

- How to leverage open source Ethereum-based TCRs to develop decentralized data markets.


This talk considers how *decentralized data markets* can resolve difficult problems that arise when training machine learning models, especially in use cases with large shared risks.

ML models have become ubiquitous, embedded in products and services used throughout our daily lives. Generally, those models are deployed by large commercial interests and trained on proprietary datasets.

However, failures of ethics, privacy, safety, and bias can have severe impacts on individuals. As the risk/reward trade-offs grow for products based on AI, and as the pressures of compliance and accountability grow, at what point is it no longer acceptable for any one commercial entity to hold responsibility for so much shared risk?

For example, Google develops large training datasets from crucial sensors in self-driving cars. Regulators on multiple continents, taking an almost adversarial stance, focus on the failure cases related to those sensors and their associated ML models. Edge cases in test datasets prove disproportionately valuable, and can potentially anchor economic incentives.

Instead of entrusting each manufacturer to build “near perfect” training datasets while bearing large risks, can we incentivize manufacturers to combine their data? Rewards for contributing parties could then derive from a combination of training data and testing edge cases, as identified by regulators and other watchdog parties.
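As a toy illustration of that incentive idea, a payout scheme could weight watchdog-flagged edge cases more heavily than ordinary training examples. The multiplier, function name, and party names below are illustrative assumptions, not anything specified by the talk:

```python
# Hypothetical sketch: split a reward pool among contributing manufacturers
# in proportion to ordinary training examples contributed, with a premium
# multiplier for edge cases flagged by regulators or watchdog parties.

EDGE_CASE_MULTIPLIER = 5  # assumed premium for flagged edge cases

def reward_split(contributions: dict, pool: float) -> dict:
    """contributions maps party -> (ordinary_examples, flagged_edge_cases)."""
    scores = {
        party: ordinary + EDGE_CASE_MULTIPLIER * edge
        for party, (ordinary, edge) in contributions.items()
    }
    total = sum(scores.values())
    # Each party's payout is its share of the weighted contribution score.
    return {party: pool * score / total for party, score in scores.items()}
```

With this weighting, a party contributing 500 ordinary examples plus 100 flagged edge cases earns the same payout as one contributing 1,000 ordinary examples.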

This talk examines *decentralized data markets* built from blockchain components such as smart contracts, token-curated registries (TCRs), DApps, and voting mechanisms, which allow multiple parties to curate ML training datasets in ways that are transparent, auditable, and secure, with equitable payouts that take social values into account.
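To make the TCR component concrete, here is a minimal Python sketch of the basic registry mechanics: candidates stake a deposit to list a dataset, doubters open challenges, and token-weighted votes decide the outcome. In practice these would be Ethereum smart contracts; every class, method, and parameter name here is an illustrative assumption:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Listing:
    owner: str
    deposit: int
    votes_keep: int = 0
    votes_remove: int = 0
    challenger: Optional[str] = None

class DataRegistry:
    """Toy token-curated registry for candidate training datasets."""

    MIN_DEPOSIT = 100  # assumed minimum stake to list a dataset

    def __init__(self) -> None:
        self.listings: dict[str, Listing] = {}

    def apply(self, dataset_id: str, owner: str, deposit: int) -> None:
        # A candidate dataset enters the registry by staking a deposit.
        if deposit < self.MIN_DEPOSIT:
            raise ValueError("deposit below minimum stake")
        self.listings[dataset_id] = Listing(owner, deposit)

    def challenge(self, dataset_id: str, challenger: str) -> None:
        # Any party that doubts a dataset's quality can open a challenge.
        self.listings[dataset_id].challenger = challenger

    def vote(self, dataset_id: str, keep: bool, weight: int) -> None:
        # Token holders cast stake-weighted ballots on the challenge.
        listing = self.listings[dataset_id]
        if keep:
            listing.votes_keep += weight
        else:
            listing.votes_remove += weight

    def resolve(self, dataset_id: str) -> bool:
        # Majority of token weight decides whether the listing stays;
        # in a real TCR the losing side would forfeit its stake.
        listing = self.listings.pop(dataset_id)
        kept = listing.votes_keep >= listing.votes_remove
        if kept:
            self.listings[dataset_id] = listing
        return kept
```

The interesting design knob is who holds voting weight: distributing tokens to regulators and watchdog parties, not only manufacturers, is one way the payouts and curation decisions could reflect social values.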

We’ll look at open source libraries based on Ethereum which are being used to develop data markets. One may adjust the trade-offs between decentralized and centralized characteristics as required by the business use cases and indicated by ethical concerns. The same approach addresses other areas of machine learning risk, such as genomics and medical research, financial credit scoring, etc., where proprietary interests and social needs often come into conflict.