August 21, 2024 – Meta has quietly deployed a new web crawler designed to scour the internet and collect vast amounts of data to train its artificial intelligence models. The crawler, dubbed Meta External Agent, was launched last month, according to three companies that track web scrapers.
Like OpenAI’s GPTBot, Meta’s new crawler harvests AI training data from the web, such as the text of news articles or conversations in online forums. Archival records show that Meta updated a developer-facing company website in late July, adding a tab that reveals the new crawler’s existence. Meta, however, has yet to publicly announce it.
Meta’s Llama stands as one of the largest large language models (LLMs). Although the company has not disclosed the training data used for its latest model, Llama 3, the initial version of the model relied on a massive dataset collected from various sources, including Common Crawl.
Earlier this year, Meta co-founder and CEO Mark Zuckerberg boasted during a financial earnings call that the company’s social platform has amassed a dataset for AI training that “exceeds Common Crawl.”
The emergence of the new crawler suggests that Meta’s existing data holdings may no longer suffice as the company works to update Llama and expand Meta AI, efforts that typically demand a continuous supply of fresh, high-quality training data.
According to data from Dark Visitors, nearly 25% of the world’s most popular websites have blocked GPTBot. In contrast, only about 2% have blocked Meta’s new crawler, indicating either wider acceptance of it or, more likely, a lack of awareness of Meta’s data collection efforts.
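The blocking figures above refer to `robots.txt` directives, the voluntary mechanism compliant crawlers consult before fetching pages. As a minimal sketch of how that check works, the snippet below uses only the Python standard library; the `robots.txt` content and the URLs are hypothetical, and the user-agent token `meta-externalagent` is an assumption based on the crawler's reported name rather than text from this article.

```python
# Sketch: how robots.txt blocking is evaluated, using Python's stdlib parser.
# The rules and URLs below are made up for illustration; "meta-externalagent"
# is an assumed user-agent token for Meta's new crawler.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler calls can_fetch() before requesting each page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))              # False
print(parser.can_fetch("meta-externalagent", "https://example.com/article"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))        # True
```

Note that `robots.txt` is purely advisory: a named bot is blocked only if the site operator knows its user-agent token and adds a rule for it, which is one plausible reason so few sites currently block the new crawler.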