Invisible Workers: The Unsung Heroes Behind AI Technology
The Essential Role of Data Labellers
AI tools like ChatGPT, Bard, Claude, and LLaMA can simplify complex topics, explaining them in ways even a young child could grasp. Similarly, AI image generators such as Midjourney, DALL-E, and Stable Diffusion transform text prompts into stunning visuals with ease. These remarkable advancements hinge on extensive datasets, meticulously crafted with the assistance of data labellers.
Data labellers—often referred to as annotators or data specialists—are responsible for tagging and categorizing the training data that powers AI models. The process of training a generative AI resembles the tale of the Very Hungry Caterpillar: a neural network ingests vast amounts of diverse data, leading to the emergence of a model and, ultimately, a functional application.
For chatbots, this data is primarily sourced from the internet, encompassing materials from books, Wikipedia, arXiv, GitHub, and the Common Crawl web archive. Data labellers carry out essential manual annotations, enabling AI systems to grasp human interactions. A notable instance highlighting this necessity occurred when hackers managed to prompt ChatGPT to generate instructions for creating napalm. Although the relevant information was present in the training data, revealing it to users was deemed inappropriate. Data labellers provide the context that helps AI systems learn to avoid such disclosures.
The mechanism behind image generators is quite similar. DALL-E 2 employs CLIP (Contrastive Language-Image Pre-training) to establish connections between text and images. CLIP learns from publicly available text-image pairs, such as a Wikipedia image of a dog labeled as ‘poodle.’ Other systems utilize ImageNet, a comprehensive image database with 14 million pictures sorted into 22,000 object categories like ‘balloon’ or ‘strawberry.’ The creation of this database required the efforts of 25,000 data labellers.
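The core idea behind CLIP-style matching can be sketched in a few lines: text and images are mapped into a shared embedding space, and a caption is paired with the image whose embedding it most closely resembles. The four-dimensional vectors below are invented for illustration; a real model learns embeddings with hundreds of dimensions from millions of labelled text-image pairs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings produced by a shared text/image encoder
# (values are made up for this toy example).
image_embeddings = {
    "photo_of_poodle.jpg": [0.9, 0.1, 0.0, 0.2],
    "photo_of_balloon.jpg": [0.1, 0.8, 0.3, 0.0],
}
text_embedding = [0.85, 0.15, 0.05, 0.25]  # encoding of the caption "poodle"

# Pick the image whose embedding best matches the caption.
best = max(image_embeddings,
           key=lambda name: cosine_similarity(image_embeddings[name], text_embedding))
print(best)  # → photo_of_poodle.jpg
```

Training pushes matching text-image pairs closer together in this space and mismatched pairs apart — which is precisely why the human-supplied labels matter so much.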
As organizations embark on developing AIs for tailored applications, they must organize their datasets effectively. For instance, if you aim to create a customer service AI for an insurance firm, it needs to recognize keywords like ‘renew’ and ‘accident’ to connect customers with the appropriate department. Similarly, a car repair AI might need to identify parts in images for quoting purposes. In these scenarios, it’s common to delegate the labor-intensive task of preparing training data to specialized data labeling firms.
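For the insurance example above, even a minimal keyword router shows why labelled examples of customer messages are needed — someone has to decide which words map to which department. The department names and keyword sets here are invented for illustration:

```python
# Hypothetical keyword-to-department mapping; a production system
# would learn these associations from labelled customer messages.
ROUTES = {
    "renewals": {"renew", "renewal", "expiry"},
    "claims": {"accident", "claim", "damage"},
}

def route_message(message: str) -> str:
    """Return the department whose keywords appear in the message."""
    words = set(message.lower().split())
    for department, keywords in ROUTES.items():
        if words & keywords:  # any keyword present?
            return department
    return "general"  # fall back to a human-staffed queue

print(route_message("I want to renew my policy"))  # → renewals
```

Real systems replace the hand-written keyword sets with a classifier trained on thousands of annotated customer messages, but the labelling work is the same in spirit: humans tell the system what each message means.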
Most of these companies are based in the US or Europe, employing workers from the Global South, where wages are comparatively lower. While data annotation jobs can significantly improve living standards for families in countries like India and Kenya, the work can also be exploitative and emotionally taxing.
A recent TIME report recounts the experiences of annotators involved in a project aimed at detecting harmful content, such as napalm recipes, within ChatGPT. These labellers analyzed thousands of text snippets detailing graphic subjects like murder, suicide, torture, self-harm, incest, and child sex abuse. The company employing these workers, Sama, won the contract due to their expertise in content moderation for Facebook.
Sama asserts its commitment to worker welfare, claiming on their website to promote an ethical AI supply chain that meaningfully enhances employment and income for those facing significant barriers to work. They are also B Corp certified, offering fair wages and psychological support to their employees. However, some content can be too distressing; ultimately, Sama withdrew from the harmful-content detection project for ChatGPT and ceased its content moderation services earlier this year.
Another challenge lies in the opacity surrounding data annotation tasks. Investigative journalist Billy Perrigo, speaking at a Digital Futures Lab panel, shared instances where companies training AI for computer vision asked annotators to take photos of themselves under varying lighting conditions. More troubling were requests to photograph children of specific age groups, without any consent process in place.
Concerns about transparency also affect gig economy workers in Western countries. At the ACM Conference on Fairness, Accountability, and Transparency, Amazon Mechanical Turk workers described their discomfort with this opacity: labellers tagging photos of border crossings and satellite images often wondered how their work might be misused. These challenges can be addressed, however, if workers are given a way to voice their concerns.
Non-profit organization Karya actively engages its workforce to alleviate such uncertainties. Their mission is to enhance the incomes of rural Indians by providing flexible employment, fair compensation, and clarity regarding tasks. Furthermore, Karya aims to reduce inequality within the AI data sector. Labeled data holds substantial value, and Karya seeks to ensure that profits return to the workers.
Karya is experimenting with a model known as the Karya public license, whereby workers who generate data for general datasets retain ownership of that data. According to Karya’s Head of Research, Safiya Husain, “Every time we resell a dataset, we compensate the workers again for their initial contribution.”
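The resale-royalty model Karya describes can be sketched as a simple proportional split: each time a dataset is resold, a share of the revenue is divided among the workers according to how much they contributed. The 20% royalty rate and the contribution counts below are invented for illustration; Karya's actual terms are not public in this article.

```python
def resale_payouts(sale_price: float, contributions: dict,
                   royalty_rate: float = 0.20) -> dict:
    """Split a royalty pool in proportion to each worker's labelled items.

    contributions maps worker IDs to the number of items they labelled.
    """
    pool = sale_price * royalty_rate
    total_items = sum(contributions.values())
    return {worker: round(pool * items / total_items, 2)
            for worker, items in contributions.items()}

# A $10,000 resale with a hypothetical 60/40 contribution split:
payouts = resale_payouts(10_000.0, {"worker_a": 600, "worker_b": 400})
print(payouts)  # → {'worker_a': 1200.0, 'worker_b': 800.0}
```

The appeal of the scheme is that it turns a one-off wage into a recurring income stream tied to the ongoing value of the data.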
This innovative approach has the potential to function on a large scale, akin to royalties for musicians and artists, utilizing blockchain technology. The AI data industry is still in its infancy and brimming with innovation, with its workforce being both essential and largely overlooked. As we enthusiastically embark on our own projects, it is vital to remember these workers. When selecting a partner for AI development, prioritize asking questions, demanding transparency, and ensuring fair compensation.
This article is published on Generative AI.