OpenAI CTO Comments on Sora’s Training Data Source

By Michael On Mar 17, 2024

OpenAI’s upcoming video-generating AI model, Sora, has raised concerns about the source of its training data. In a recent interview with The Wall Street Journal, the company’s Chief Technology Officer, Mira Murati, provided unclear answers regarding the data used to train Sora.

When questioned about Sora’s training data, Murati said the model used “publicly available data and licensed data” but could not say whether that included videos from social media platforms such as YouTube, Instagram, or Facebook. Although OpenAI has a partnership with stock image company Shutterstock, Murati declined during the interview to say whether Shutterstock’s content was used to train Sora. She later confirmed to the Journal that it was.

AI models rely on massive datasets to learn and develop capabilities. This training data helps a model recognize patterns and make predictions. In Sora’s case, the data it was trained on directly shapes the quality and accuracy of the video it generates from text instructions.

Murati, who has been at OpenAI since 2018, has spearheaded several prominent projects, including DALL-E 3 (image generation), Whisper (speech recognition), and GPT-4 (the large language model behind ChatGPT). Notably, she briefly served as interim CEO in November 2023 after the board removed Sam Altman.

OpenAI has faced legal challenges concerning the training data used for its AI models. In July 2023, authors sued the company, alleging that ChatGPT’s summaries of their books infringed on their copyrighted works. Similarly, The New York Times filed suit against Microsoft and OpenAI in December 2023, claiming the companies’ AI chatbots were trained on copyrighted newspaper content. Another lawsuit accused OpenAI of scraping private user information without consent to train ChatGPT.

Murati’s evasive responses regarding Sora’s training data raise transparency concerns, leaving open whether copyrighted material or private user information was used. OpenAI should strive for greater transparency in its data sourcing practices to ensure ethical AI development and mitigate potential legal risks.