Bringing transparency to the data used to train artificial intelligence

Summary

This article discusses the Data Provenance Initiative, an MIT-led effort to audit AI training datasets and provide a tool for tracing their lineage, licenses, and sources, with the aim of reducing legal and ethical risks.

Key quotes

Without transparency into the lineage of data used for artificial intelligence models, researchers, businesses, and other intended users may find themselves out of compliance with emerging regulations

Their goal: to improve transparency, documentation, and informed use of AI training data.

The article details the creation of the Data Provenance Explorer tool and a systematic audit of over 1,800 text datasets. It highlights critical gaps in license documentation, as well as geographic bias, in AI training data.