{"id":356264,"date":"2024-07-23T08:57:32","date_gmt":"2024-07-23T07:57:32","guid":{"rendered":"https:\/\/readwrite.com\/?p=356264"},"modified":"2024-07-23T08:57:32","modified_gmt":"2024-07-23T07:57:32","slug":"ai-scrapers-running-out-of-space-as-restrictions-close-the-net","status":"publish","type":"post","link":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/","title":{"rendered":"AI scrapers running out of space as restrictions close the net"},"content":{"rendered":"

AI scrapers are increasingly facing hostile online environments as data sources dry up.<\/p>\n

Crawling for data, also known as scraping, previously meant that vast troves of text, images, and videos could be pulled from the internet without much trouble. AI models could be trained on this seemingly infinite source, but that is no longer the case.<\/p>\n

A study from the AI research think tank Data Provenance Initiative<\/a>, titled “Consent in Crisis,” has found that a hostile environment now awaits web scrapers, especially those used to develop generative AI.<\/p>\n

Researchers probed the domains used in three of the most important datasets for training AI models and found that the data is now more restricted than ever.<\/span><\/p>\n

Some 14,000 web domains were assessed, revealing an “emerging crisis in consent” as online publishers react to the presence of crawlers<\/a> and the harvesting of data. Across the three datasets, known as C4, RefinedWeb, and Dolma, the researchers found that around 5% of all data, and 25% of content from the highest-quality sources, was subject to restrictions.<\/p>\n

In particular, OpenAI’s<\/a> GPTBot and Google’s Google-Extended crawlers prompted websites to tighten their robots.txt restrictions. The study found that between 20% and 33% of the top web domains have introduced extensive restrictions on scrapers, compared to a much smaller share at the start of last year.<\/p>\n

Hard crawls resulting in full bans<\/h2>\n

Across the full set of domains, 5-7% now enforce restrictions, up from just 1% over the same period.<\/p>\n

The researchers noted that many websites had also changed their terms of service to completely prohibit crawling and lifting content for use in generative AI, though not to the same extent as the restrictions in robots.txt.<\/p>\n

AI companies may have wasted time and resources on crawling that was likely unnecessary. The researchers showed that while around 40% of the top sites used across the three datasets were news-related, over 30% of ChatGPT queries were for creative writing, compared to just 1% that involved news.<\/p>\n

Other notable requests included translation, coding help, and sexual roleplay.<\/span><\/p>\n

Image credit: Via Ideogram<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"

AI scrapers are increasingly facing hostile online environments as data sources dry up. Crawling for data, also known as scraping,… Continue reading AI scrapers running out of space as restrictions close the net<\/span><\/a><\/p>\n","protected":false},"author":26531,"featured_media":356266,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"_lmt_disableupdate":"no","_lmt_disable":"","footnotes":""},"categories":[9586],"tags":[],"table_tags":[],"acf":[],"yoast_head":"\nAI scrapers running out of space as restrictions close the net<\/title>\n<meta name=\"description\" content=\"AI scrapers increasingly face hostile online environments as data sources dry up a new study from has found.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"AI scrapers running out of space as restrictions close the net\" \/>\n<meta property=\"og:description\" content=\"AI scrapers increasingly face hostile online environments as data sources dry up a new study from has found.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/\" \/>\n<meta property=\"og:site_name\" content=\"ReadWrite\" \/>\n<meta property=\"article:published_time\" content=\"2024-07-23T07:57:32+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/07\/crawlers.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1280\" \/>\n\t<meta property=\"og:image:height\" content=\"720\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta 
name=\"author\" content=\"Graeme Hanna\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/graeme818\" \/>\n<meta name=\"twitter:site\" content=\"@rww\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Graeme Hanna\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/\"},\"author\":{\"name\":\"Graeme Hanna\",\"@id\":\"https:\/\/readwrite.com\/#\/schema\/person\/f756f3d13ed8828218e589054348d304\"},\"headline\":\"AI scrapers running out of space as restrictions close the net\",\"datePublished\":\"2024-07-23T07:57:32+00:00\",\"dateModified\":\"2024-07-23T07:57:32+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/\"},\"wordCount\":352,\"publisher\":{\"@id\":\"https:\/\/readwrite.com\/#organization\"},\"articleSection\":[\"AI\"],\"inLanguage\":\"en-US\",\"copyrightYear\":\"2024\",\"copyrightHolder\":{\"@id\":\"https:\/\/readwrite.com\/#organization\"}},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/\",\"url\":\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/\",\"name\":\"AI scrapers running out of space as restrictions close the 
net\",\"isPartOf\":{\"@id\":\"https:\/\/readwrite.com\/#website\"},\"datePublished\":\"2024-07-23T07:57:32+00:00\",\"dateModified\":\"2024-07-23T07:57:32+00:00\",\"description\":\"AI scrapers increasingly face hostile online environments as data sources dry up a new study from has found.\",\"breadcrumb\":{\"@id\":\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/readwrite.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AI scrapers running out of space as restrictions close the net\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/readwrite.com\/#website\",\"url\":\"https:\/\/readwrite.com\/\",\"name\":\"ReadWrite\",\"description\":\"Crypto, Gaming & Emerging Tech News\",\"publisher\":{\"@id\":\"https:\/\/readwrite.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/readwrite.com\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/readwrite.com\/#organization\",\"name\":\"ReadWrite\",\"url\":\"https:\/\/readwrite.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/readwrite.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/03\/Logo_OnLight.svg\",\"contentUrl\":\"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/03\/Logo_OnLight.svg\",\"width\":232,\"height\":41,\"caption\":\"ReadWrite\"},\"image\":{\"@id\":\"https:\/\/readwrite.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/twitter.com\/rww\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/readwrite.com\/#\/schema\/person\/f756f3d13ed8828218e589054348d304\",\"name\":\"Graeme Hanna\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/readwrite.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/01\/graeme-hanna_avatar-96x96.jpg\",\"contentUrl\":\"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/01\/graeme-hanna_avatar-96x96.jpg\",\"caption\":\"Graeme Hanna\"},\"description\":\"Graeme Hanna is a full-time, freelance writer with significant experience in online news as well as content writing. Since January 2021, he has contributed as a football and news writer for several mainstream UK titles including The Glasgow Times, Rangers Review, Manchester Evening News, MyLondon, Give Me Sport, and the Belfast News Letter. Graeme has worked across several briefs including news and feature writing in addition to other significant work experience in professional services. 
Now a contributing news writer at ReadWrite.com, he is involved with pitching relevant content for publication as well as writing engaging tech news stories.\",\"sameAs\":[\"https:\/\/muckrack.com\/graeme-hanna\",\"https:\/\/www.linkedin.com\/in\/graemehanna\/\",\"https:\/\/twitter.com\/https:\/\/twitter.com\/graeme818\"],\"url\":\"https:\/\/readwrite.com\/author\/graeme-hanna\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"AI scrapers running out of space as restrictions close the net","description":"AI scrapers increasingly face hostile online environments as data sources dry up a new study from has found.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/","og_locale":"en_US","og_type":"article","og_title":"AI scrapers running out of space as restrictions close the net","og_description":"AI scrapers increasingly face hostile online environments as data sources dry up a new study from has found.","og_url":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/","og_site_name":"ReadWrite","article_published_time":"2024-07-23T07:57:32+00:00","og_image":[{"width":1280,"height":720,"url":"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/07\/crawlers.webp","type":"image\/webp"}],"author":"Graeme Hanna","twitter_card":"summary_large_image","twitter_creator":"@https:\/\/twitter.com\/graeme818","twitter_site":"@rww","twitter_misc":{"Written by":"Graeme Hanna","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/#article","isPartOf":{"@id":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/"},"author":{"name":"Graeme Hanna","@id":"https:\/\/readwrite.com\/#\/schema\/person\/f756f3d13ed8828218e589054348d304"},"headline":"AI scrapers running out of space as restrictions close the net","datePublished":"2024-07-23T07:57:32+00:00","dateModified":"2024-07-23T07:57:32+00:00","mainEntityOfPage":{"@id":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/"},"wordCount":352,"publisher":{"@id":"https:\/\/readwrite.com\/#organization"},"articleSection":["AI"],"inLanguage":"en-US","copyrightYear":"2024","copyrightHolder":{"@id":"https:\/\/readwrite.com\/#organization"}},{"@type":"WebPage","@id":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/","url":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/","name":"AI scrapers running out of space as restrictions close the net","isPartOf":{"@id":"https:\/\/readwrite.com\/#website"},"datePublished":"2024-07-23T07:57:32+00:00","dateModified":"2024-07-23T07:57:32+00:00","description":"AI scrapers increasingly face hostile online environments as data sources dry up a new study from has 
found.","breadcrumb":{"@id":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/readwrite.com\/ai-scrapers-running-out-of-space-as-restrictions-close-the-net\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/readwrite.com\/"},{"@type":"ListItem","position":2,"name":"AI scrapers running out of space as restrictions close the net"}]},{"@type":"WebSite","@id":"https:\/\/readwrite.com\/#website","url":"https:\/\/readwrite.com\/","name":"ReadWrite","description":"Crypto, Gaming & Emerging Tech News","publisher":{"@id":"https:\/\/readwrite.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/readwrite.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/readwrite.com\/#organization","name":"ReadWrite","url":"https:\/\/readwrite.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/readwrite.com\/#\/schema\/logo\/image\/","url":"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/03\/Logo_OnLight.svg","contentUrl":"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/03\/Logo_OnLight.svg","width":232,"height":41,"caption":"ReadWrite"},"image":{"@id":"https:\/\/readwrite.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/twitter.com\/rww"]},{"@type":"Person","@id":"https:\/\/readwrite.com\/#\/schema\/person\/f756f3d13ed8828218e589054348d304","name":"Graeme 
Hanna","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/readwrite.com\/#\/schema\/person\/image\/","url":"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/01\/graeme-hanna_avatar-96x96.jpg","contentUrl":"https:\/\/readwrite.com\/wp-content\/uploads\/2024\/01\/graeme-hanna_avatar-96x96.jpg","caption":"Graeme Hanna"},"description":"Graeme Hanna is a full-time, freelance writer with significant experience in online news as well as content writing. Since January 2021, he has contributed as a football and news writer for several mainstream UK titles including The Glasgow Times, Rangers Review, Manchester Evening News, MyLondon, Give Me Sport, and the Belfast News Letter. Graeme has worked across several briefs including news and feature writing in addition to other significant work experience in professional services. Now a contributing news writer at ReadWrite.com, he is involved with pitching relevant content for publication as well as writing engaging tech news stories.","sameAs":["https:\/\/muckrack.com\/graeme-hanna","https:\/\/www.linkedin.com\/in\/graemehanna\/","https:\/\/twitter.com\/https:\/\/twitter.com\/graeme818"],"url":"https:\/\/readwrite.com\/author\/graeme-hanna\/"}]}},"modified_by":"Sam 
Shedden","_links":{"self":[{"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/posts\/356264"}],"collection":[{"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/users\/26531"}],"replies":[{"embeddable":true,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/comments?post=356264"}],"version-history":[{"count":3,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/posts\/356264\/revisions"}],"predecessor-version":[{"id":356406,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/posts\/356264\/revisions\/356406"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/media\/356266"}],"wp:attachment":[{"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/media?parent=356264"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/categories?post=356264"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/tags?post=356264"},{"taxonomy":"table_tags","embeddable":true,"href":"https:\/\/readwrite.com\/wp-json\/wp\/v2\/table_tags?post=356264"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}