We have an official open-source AI definition now, but the fight is far from over

RALEIGH, NC. — The Open Source Initiative (OSI) released Open Source AI Definition (OSAID) 1.0 on Oct. 28, 2024, at the All Things Open conference. Creating it wasn't easy.

It took the OSI almost two years to create the OSAID. With no changes from the last draft, it's finally done. Unfortunately, not everyone is happy with it, and even its creators admit it's a work in progress.

Why? Carlo Piana, the OSI's chairman and an attorney, explained in an interview that, "Our collective understanding of what AI does, what's required to modify language models is limited now. The more we use it, the more we'll understand. Right now our understanding is limited, and we don't know yet what the technology will look like in one year, two years, or three years."

Or, as Taylor Dolezal, head of ecosystem for the Cloud Native Computing Foundation (CNCF), put it, "Balancing open source principles with AI complexities can sometimes feel like trying to solve a Rubik's Cube blindfolded."

As to why people object to the new definition, broadly speaking, three groups have concerns about the OSAID: pragmatists, idealists, and faux-source business leaders.

To start, you need to understand what the conflicts are about. Mark Collier, the OpenInfra Foundation's COO, who helped draft the OSAID, recently put it well in an essay:

One of the biggest challenges in creating the Open Source AI Definition is deciding how to treat datasets used during the training phase. At first, requiring all raw datasets to be made public might seem logical.

However, this analogy between datasets and source code is imperfect and starts to fall apart the closer you look. Training data influences models through patterns, while source code provides explicit instructions. AI models produce learned parameters (weights), whereas software is directly compiled from source code. … many AI models are trained on proprietary or legally ambiguous data, such as web-scraped content or sensitive datasets like medical records.

[Therefore] any publicly available data used for training should be accessible, alongside full transparency about all datasets used and the procedures followed for cleaning and labeling them. Striking the right balance on this issue is one of the toughest parts of creating the definition, especially with the rapid changes in the market and legal landscape.

So it is that the pragmatists wanted, and got, an open-source AI definition where not all the data needs to be open and shared. For their purposes, there only needs to be "sufficiently detailed information about the data used to train the system" rather than the full dataset itself. This approach aims to balance transparency with practical and legal considerations such as copyright and private medical data.

Besides the OSI, organizations like the Mozilla Foundation, the OpenInfra Foundation, Bloomberg Engineering, and SUSE have endorsed the OSAID. For example, Alan Clark of SUSE's CTO office said, "SUSE applauds the progress made by the OSI and its OSAID. The efforts are culminating in a very thorough definition, which is important for the quickly evolving AI landscape and the role of open source within it. We commend the process OSI is utilizing to arrive at the definition and the adherence to the open source methodologies."

Academics have also approved of this first OSAID release. Percy Liang, director of the Center for Research on Foundation Models at Stanford University, said, in a statement, "Coming up with the proper open-source definition is challenging, given restrictions on data, but I'm glad to see that the OSI v1.0 definition requires at least that the complete code for data processing (the primary driver of model quality) be open-source. The devil is in the details, so I'm sure we'll have more to say once we have concrete examples of people trying to apply this Definition to their models."

Speaking of that devil, the idealists strongly object to non-open data being allowed inside an open-source AI model. While Piana stated, "The board is confident that the process has resulted in a definition that meets the standards of Open Source as defined in the Open Source Definition and the Four Essential Freedoms," the idealists don't see it that way at all.

Tom Callaway, Principal Open-Source Technical Strategist at Amazon Web Services (AWS), summarized their objections well: "The simple fact remains... it allows you to build an AI system binary from proprietary data sources and call the result 'open source,' and that's simply wrong. It damages every established understanding of what 'open source' is, all in the name of hoping to attach that brand to a 'bigger tent' of things."

The OSI is well aware of these arguments. At an OSI panel discussion at All Things Open, an OSI representative said, "Members of our communities are upset. They felt like their voices were not heard as a part of this process." The OSI felt that it had to come up with a definition because laws were being passed both in the US and the EU about open-source AI without defining it. The OSI and many other groups felt the issue had to be addressed before companies went ahead with their own bogus open-source AI definitions. Looking ahead, the OSI will adjust the definition to address upcoming changes in AI.

In the meantime, at least one group, Digital Public Goods (DPG), is updating its DPG Standard for AI to mandate open training data for AI systems. Its proposal will appear on GitHub in early November and will be open for public comment for a four-week community review period. There will be more such efforts.

The faux-source companies have a vested interest in their programs being considered open source. The laws and regulations for open-source AI are more lenient than those for proprietary AI systems. That means they can save a lot of money if their products are regulated under open-source rules.

For example, the license for Meta's Llama 3 doesn't make the open-source grade on several grounds. Nonetheless, Meta claimed, "There is no single open-source AI definition, and defining it is a challenge because previous open-source definitions do not encompass the complexities of today's rapidly advancing AI models." Meta and other major AI powers, such as OpenAI, will try to get governments to recognize their own definitions. I expect them to come up with a faux-source AI definition to cover their proprietary products and services.

What all this means, from where I sit, is that while the OSAID sets a standard that many groups will observe, the conflicts over what open-source AI really is have only just begun. I don't see any resolution to the conflict for years to come.

Now, most AI users won't care. They just want help with their homework, with writing Star Wars fanfic, or with making their jobs easier. It's an entirely different story for companies and government agencies. For them, open-source AI is vital for both business and development purposes.
