In February 2024, Reddit struck a $60 million deal with Google to let the search giant use data on the platform to train its artificial intelligence models. Notably absent from the discussions were the Reddit users whose data were being sold.
The deal reflected the reality of the modern internet: big tech companies own virtually all of our online data and decide what to do with it. Unsurprisingly, many platforms monetize their data, and the fastest-growing way to do that today is to sell it to AI companies, which are themselves massive tech companies using the data to train ever more powerful models.
The decentralized platform Vana, which began as a class project at MIT, is on a mission to give power back to users. The company has created a fully user-owned network that allows people to upload their data and govern how they are used. AI developers can pitch users on ideas for new models, and if users agree to contribute their data for training, they receive proportional ownership in the models.
The idea is to give everyone a stake in the AI systems that will increasingly shape our society, while also unlocking new pools of data to advance the technology.
“This data is needed to create better AI systems,” says Vana co-founder Anna Kazlauskas ’19. “We have created a decentralized system to get better data, which today sits inside big tech companies, while still allowing users to retain ultimate ownership.”
From economics to blockchain
Many high school students have photos of pop stars or athletes on their bedroom walls. Kazlauskas had a photo of former U.S. Treasury Secretary Janet Yellen.
Kazlauskas arrived at MIT certain she would become an economist, but she ended up being one of five students who joined the MIT Bitcoin club in 2015, and that experience led her into the world of blockchains and cryptocurrencies.
From her dorm room in MacGregor House, she began mining the cryptocurrency Ethereum. From time to time, she even scoured campus dumpsters in search of discarded computer hardware.
“It got me interested in everything around computer science and networking,” says Kazlauskas. “From a blockchain perspective, that involved distributed systems and how they can shift economic power to individuals, as well as artificial intelligence and econometrics.”
Kazlauskas met Art Abal, who was then attending Harvard University, in the former Media Lab class Emergent Ventures, and the pair decided to work on new ways to obtain data to train AI systems.
“Our question was: How could you have a large number of people contributing to these AI systems using a more distributed network?” Kazlauskas recalls.
Kazlauskas and Abal were trying to address the status quo, in which most models are trained by scraping public data from the internet. Big tech companies often also buy large datasets from other companies.
The founders’ approach evolved over the years and was informed by Kazlauskas’ experience working at the financial blockchain company Celo after graduation. But Kazlauskas credits her time at MIT with helping her think through these problems, and the Emergent Ventures instructor, Ramesh Raskar, still helps Vana think about AI research questions.
“It was great to have an open-ended opportunity to simply build, hack, and explore,” says Kazlauskas. “I think that ethos at MIT is really important. It’s just about building things, seeing what works, and continuing to iterate.”
Today, Vana takes advantage of a little-known law that allows users of most big tech platforms to export their data directly. Users can upload that information into encrypted digital wallets in Vana and contribute it to train models at their own discretion.
AI engineers can suggest ideas for new open-source models, and people can pool their data to help train the model. In the blockchain world, these data pools are called data DAOs, where DAO stands for decentralized autonomous organization. Data can also be used to create personalized AI models and agents.
In Vana, data is used in a way that preserves user privacy, because the system does not expose identifiable information. Once a model is created, users retain ownership, so that every time it is used, they are rewarded in proportion to how much their data helped train it.
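The proportional reward described above can be pictured with a minimal sketch. Everything in it is an assumption for illustration: the function name `split_reward`, the contribution scores, and the way a contribution is measured are hypothetical, and the article does not describe how Vana actually computes rewards.

```python
def split_reward(reward, contributions):
    """Split `reward` among contributors in proportion to their scores.

    `contributions` maps a user name to a non-negative score standing in
    for how much that user's data helped train the model (hypothetical).
    """
    if any(score < 0 for score in contributions.values()):
        raise ValueError("contribution scores must be non-negative")
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    # Each user's share is their fraction of the total contribution.
    return {user: reward * score / total for user, score in contributions.items()}


# Hypothetical example: a payout of 100 units split across three contributors.
shares = split_reward(100, {"alice": 50, "bob": 30, "carol": 20})
print(shares)  # {'alice': 50.0, 'bob': 30.0, 'carol': 20.0}
```

In a real system the hard part is the scoring itself, that is, deciding how much each person's data helped; this sketch takes those scores as given.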
“From a developer’s perspective, you can now build these hyper-personalized health applications that take into account exactly what you ate, how you slept, how you exercise,” says Kazlauskas. “These applications are not possible today because of the walled gardens of the big tech companies.”
Crowdsourced, user-owned AI
Last year, a machine learning engineer proposed using Vana user data to train an AI model that could generate Reddit posts. More than 140,000 Vana users contributed their Reddit data, which contained posts, comments, messages, and more. Users decided how the model could be used, and they retained ownership of the model after it was created.
Vana has enabled similar initiatives with user-contributed data from the social media platform X, sleep data from sources such as Oura rings, and more. There are also collaborations that combine data pools to create broader AI applications.
“Suppose users have Spotify data, Reddit data, and fashion data,” Kazlauskas explains. “Usually, Spotify is not going to collaborate with those types of companies, and there is actually regulation against that. But users can do it if they grant access, so these cross-platform datasets can be used to create really powerful models.”
Vana has more than 1 million users and more than 20 live data DAOs. Users on Vana’s system have proposed more than 300 additional data pools, and Kazlauskas says many will go into production this year.
“I think there is a lot of promise in generalized AI models, personalized medicine, and new consumer applications, because it is difficult to combine all that data or get access to it in the first place,” says Kazlauskas.
The data pools allow groups of users to accomplish something that even the most powerful tech companies struggle with today.
“Today, big tech companies have built these data moats, so the best datasets are not available to anyone,” says Kazlauskas. “It’s a collective action problem, where my data on its own is not that valuable, but a data pool with tens of thousands or millions of people is really valuable. Vana allows those pools to be built. It’s a win-win: Users get to benefit from the rise of AI because they own the models. Then you do not end up in a scenario where a single AI company controls everything.”